Something tells me that I haven't trained on 6 TB of code, and yet I can meaningfully outperform any AI at programming. That tells me there's still something structurally missing in the training efficiency. I wonder whether this replicates in domains like chess/Go: for a computer trained on the same number of games as a human, can the computer still outperform the human?
>We inspected StarCoder-generated programs on these benchmarks and found that there were several cases where the model produces what are effectively empty solutions, e.g., `pass` or a comment "Insert code here". We also observed this kind of failure in every model we evaluated.
I'm not sure whether the AI learning that it can just write "#TODO" is a sign our jobs are safe or a sign our jobs are truly in danger.
Could be a sign the thing knows how to break work into multiple pieces. If it weren't just one-pass and you gave it a couple of turns to document, test, and deliver, it could definitely fill in the placeholders from the initial generative step when it does refinement. Language chains, not instant zero-shot perfection.
The biggest interest I have in this is the ability to ask questions about large code-bases. Being able to generate small functions or explain single code sections is nice, but being able to ask bigger architectural questions would be really helpful for all kinds of engineers (particularly in a large company).
I have seen approaches that merge context across multiple levels, but that can only do so much. Is it viable to fine-tune a model on a specific code-base so it has knowledge across all its files? Does anyone have more info on this kind of problem space?
Steve Yegge's recent blog posts claim that Sourcegraph is getting pretty good results by using embeddings created from their knowledge graph of the code structure. That's still the usual [create embeddings, search against the embedding of the query, retrieve results and use them as the prompt] schlep, so yeah, it isn't really understanding architecture well yet.
I too have a job where almost every question is about structural understanding and improvement of a large existing codebase. I'd love to have AI help, but I think it's going to take another iteration or three of model architecture to get there.
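To make the schlep concrete, here's a toy sketch of that retrieve-then-prompt loop. A bag-of-words counter stands in for a real embedding model, and the chunks and query are invented for illustration:

```python
import math
from collections import Counter

def embed(text):
    """Toy 'embedding': a bag-of-words Counter. A real system would use
    a learned embedding model; this just stands in for one."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    """The 'search against the embedding of the query' step."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query, chunks):
    """The 'retrieve results and use them as the prompt' step."""
    context = "\n---\n".join(retrieve(query, chunks))
    return f"Context:\n{context}\n\nQuestion: {query}"

# Invented stand-ins for indexed code summaries:
chunks = [
    "def connect_db(url): opens a connection to the orders database",
    "class OrderRouter: routes order events to downstream services",
    "def render_invoice(order): formats an order as a PDF invoice",
]
print(build_prompt("how are order events routed", chunks))
```

The architectural weakness is visible even in the toy: the model only ever sees whichever chunks happened to match the query, never the structure connecting them.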
Fine-tuning usually means updating the weights of what's called a foundation model with well-structured and plentiful data. It can be expensive, but more importantly it can disturb the usefulness of all the generalizations baked in from the original training data [1].
While LLMs can generate code based on a wide range of inputs, they're not designed to retrieve specific pieces of information the way a database or a search engine would; recall through the weights is just very lossy. So single-code-base fine-tuning is perhaps not the best use for them right now.
Can you please share more about the merging context across levels? This sounds interesting!
Right now the solution is vector databases; however, we could envision a different state representation in the transformer decoder, which is the main component of a GPT. For example, you could summarize your architecture and tests and implementation with compressed/smaller vectors for each piece and organize that stuff in a tree structure, then just concatenate the tree to the context and user query. It'd require you to rewrite the multi-head attention function or make a wrapper, and it'd add an ETL step to create the tree, but then you could have that whole compressed representation of your codebase available when you ask a question. It would necessarily be an abstraction and not a verbatim copy of the code, otherwise you'd run out of room. Funny how everything runs into Kolmogorov complexity eventually.
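A minimal sketch of the tree idea, with hand-written text summaries standing in for the compressed vectors that the hypothetical ETL step would produce (everything here is illustrative, not an existing API):

```python
# Each node holds a short compressed summary of one piece of the codebase.
# The whole tree is flattened into an outline and prepended to the query,
# giving the model the architecture in compact form.

def flatten(node, depth=0, lines=None):
    """Depth-first walk that serialises the summary tree with indentation."""
    if lines is None:
        lines = []
    lines.append("  " * depth + node["summary"])
    for child in node.get("children", []):
        flatten(child, depth + 1, lines)
    return lines

# Invented example codebase summaries:
codebase_tree = {
    "summary": "payments service: REST API over a Postgres ledger",
    "children": [
        {"summary": "api/: request handlers, auth middleware"},
        {"summary": "ledger/: double-entry bookkeeping core",
         "children": [
             {"summary": "ledger/tests/: property tests for balance invariants"},
         ]},
    ],
}

def build_context(tree, query):
    """Concatenate the flattened tree to the user query, as the comment above
    proposes (here as plain text rather than learned vectors)."""
    return "\n".join(flatten(tree)) + "\n\nQuestion: " + query

print(build_context(codebase_tree, "where are balances validated?"))
```

The lossiness the comment predicts is built in: the outline is an abstraction of the code, so answers are only as good as the summaries.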
As usual, concrete results are poor, with the best obtained for Python: 40.8% on HumanEval and 52.7% on MBPP, where the previous best was 33.5% and 45.9% respectively (both by the original Copilot model). Results on DS-1000 (simple data science programming tasks) are much more modest, at around 30%.
And all this despite the "pass@k" evaluation metric, which is very misleading: it's clearly selected to make a code generating model look its absolute best. For example, the "pass@1" metric is _estimated_ not by choosing a single solution generated by a model, for a given programming task, and checking whether the solution completes the programming task correctly, but by generating a single solution multiple times (200 or 20, depending on model) and then averaging over them. So while it's called "pass-at-one" the "one" is actually a bunch of randomly drawn samples and not a single solution. Like I say, very misleading. See Section 6.1.1 in the paper.
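For reference, the estimator in question (introduced in the Codex paper and reused here) takes n samples per task, counts the c that pass, and computes the probability that at least one of k random draws would have passed:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: with n generated samples per task and c
    of them correct, the chance that at least one of k randomly drawn
    samples passes is 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few failures left to fill k draws: some draw passes
    return 1.0 - comb(n - c, k) / comb(n, k)

# 200 samples, 50 correct: "pass@1" reduces to the average pass rate
# over all 200 samples, which is the averaging the comment describes.
print(pass_at_k(200, 50, 1))   # 0.25
print(pass_at_k(200, 50, 10))
```

So pass@1 is indeed an expectation over many samples rather than the score of one chosen solution, which is the comment's complaint.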
It’s perhaps in the training dataset but unless your code is extremely common and duplicated, it’s probably not in the final models. They aren’t that big.
(Possibly naive question) This is marketed as open source. Does that mean I can download the model and run it locally? If so, what kind of GPU would I need?
A 3090 (or any GPU with >=20GB VRAM) can run StarCoder with int8 quantization at about 12 tokens per second, 33 with assisted generation -- which will come out for StarCoder in the coming days.
When 4-bit quantization comes out, I would expect a GPU with 12GB VRAM to be able to run it.
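The rough arithmetic behind those figures: weight memory is parameter count times bits per weight, plus some working memory for activations and the KV cache. The 2 GB overhead allowance below is a guess for illustration, not a measurement:

```python
def vram_gb(n_params, bits_per_weight, overhead_gb=2.0):
    """Back-of-envelope VRAM estimate: weights plus a rough (assumed)
    allowance for activations/KV cache."""
    return n_params * bits_per_weight / 8 / 1e9 + overhead_gb

n = 15.5e9  # StarCoder parameter count
for label, bits in (("fp16", 16), ("int8", 8), ("int4", 4)):
    print(f"{label}: ~{vram_gb(n, bits):.1f} GB")
```

This lines up with the figures above: int8 lands just under a 20 GB card, and 4-bit comfortably under 12 GB.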
Good idea, but I'm pretty sure this is already widely done. For example, Alex Graveley (architect behind Copilot) mentioned on No Priors [1] that they would generate the implementations for tests in random GitHub projects and check whether they passed, as feedback.
>"The Stack, a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process."
>"The Stack is a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in The Stack must abide by the terms of the original licenses, including attribution clauses when relevant."
Does it have a view of which licenses can mix, or is it simply disallowed from crossing that boundary, only offering answers sourced entirely within the confines of one specific license? The latter poses some interesting scenarios and questions.
"Permissively licensed" would imply non-copyleft to me. That means only licenses like Apache or MIT would be allowed to be trained on, but not licenses like the GPL.
What speed should we expect from the model on consumer hardware? I tried an 8-bit quantized version on a 4090 and got it to generate 100 tokens in 13 seconds, which seems a bit slow to me.
All the code generation tools, StarCoder included, still have hallucinations. In this context, that means code that looks good but doesn't work or has a subtle bug. How do we address that?
> All the code generation tools, StarCoder included, still have hallucinations.
This also includes humans. We "hallucinate" in very similar ways. For example mistaking localhost:8080 for localhost:8008 in a large config file. Attempting to use methods that were deprecated and no longer exist, etc.
IMO there are two ways to prevent this. One is to make better-performing models (architecture/training data/training amount/etc.).
The other is the exact same as for humans: compile-time tools that let it know immediately if it hallucinated, plus types, linting, tests, etc.
You just do it as a loop, the exact same as a human: you write code, the compiler tells you that the method doesn't exist, you adjust your code or consult the docs (also doable with agents).
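That loop can be sketched in a few lines. Here Python's own compile() plays the compiler, and a canned list of strings stands in for successive model generations (a real agent would feed the error back into the next prompt; note compile() only catches syntax errors, not calls to nonexistent methods, which would need execution or a linter):

```python
def check(source):
    """Return None if the source compiles, else the compiler's complaint."""
    try:
        compile(source, "<generated>", "exec")
        return None
    except SyntaxError as e:
        return str(e)

# Stand-ins for successive LLM generations: first one hallucinates syntax.
attempts = [
    "def add(a, b) return a + b",
    "def add(a, b):\n    return a + b",
]

def repair_loop(attempts):
    """Accept the first attempt that passes the 'compiler' check."""
    for i, src in enumerate(attempts):
        err = check(src)
        if err is None:
            return i, src
        # in a real agent loop, err would go into the next prompt here
    raise RuntimeError("no attempt compiled")

rounds, good = repair_loop(attempts)
print(f"accepted after {rounds + 1} attempt(s):\n{good}")
```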
I've been playing with StarCoder for the last week. It performs great once fine-tuned. Highly recommend people use it as a base model for anything, not just coding.
Has anyone figured out a way to fine-tune this with 24GB of VRAM? I have tried with DeepSpeed etc. but no luck. It seems to be just out of reach, with fine-tuning requiring 26GB.
This didn't generate anything like actual Perl code, but the paper did say it was (relatively) bad at Perl, and in its defense the code it was completing was full of regexes. What I did enjoy was how it picked up on my style of extremely long variable and subroutine names without spaces. It even named them with swear words like I do.
I thought you didn't need an account to download from HF anymore. You can just do git lfs pull, at least for the stuff I've downloaded.
Personally I'm concerned about how model hosting has been concentrated in one company, and was previously very unhappy that they required accounts, but I think that's past. Let me know if it's still the case for some things.
simonw|2 years ago
Interesting to note that The Stack is 6TB - the whole of the RedPajama LLM training set (a lot more than just code) is only 2.6TB.
To get an idea what that training data looks like, I grabbed the first 300MB SQL file from https://huggingface.co/datasets/bigcode/the-stack/tree/main/... and then dumped the first 1,000 rows from that into JSON and loaded it into Datasette Lite:
https://lite.datasette.io/?json=https://gist.github.com/simo...
Here's a query that shows a random row - hit the blue "Run SQL" button to see another one: https://lite.datasette.io/?json=https://gist.github.com/simo...
[1] "Language Models are Few-Shot Learners", Brown et al. https://arxiv.org/pdf/2005.14165.pdf
cs702|2 years ago
A big THANK YOU to everyone who made it possible.
I'm looking forward to playing with it -- and also, eventually, inevitably, running a quantized, super-efficient version on my laptop.
pyrophane|2 years ago
https://huggingface.co/docs/transformers/perf_train_gpu_one
joaogante|2 years ago
Disclaimer: I work at Hugging Face
heliophobicdude|2 years ago
I don't think StarCoderBase is instruction-tuned off the bat but would serve as a good starting point for a new technique.
RLHF is fine for things that are hard to measure and evaluate, but code is runnable and testable.
I propose we try Reinforcement Learning Machine Feedback or RLMF.
Prompts and responses would be scored by whether the generated code actually runs and passes its checks. We can then train a reward model on those scores to help refine StarCoder.
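A sketch of what that machine-feedback signal could look like: run each candidate against unit tests and use the pass fraction as the scalar reward. The exec-based harness and all names here are illustrative; a real harness would sandbox execution:

```python
def reward(candidate_src, tests):
    """Execute the candidate, then each test against it; the reward is the
    fraction of tests that pass, in [0, 1]."""
    env = {}
    try:
        exec(candidate_src, env)
    except Exception:
        return 0.0  # code that doesn't even run gets zero reward
    passed = 0
    for test in tests:
        try:
            exec(test, dict(env))  # copy env so tests don't pollute it
            passed += 1
        except Exception:
            pass
    return passed / len(tests)

# Invented example task:
candidate = "def fib(n):\n    return n if n < 2 else fib(n-1) + fib(n-2)"
tests = ["assert fib(0) == 0", "assert fib(5) == 5", "assert fib(10) == 55"]
print(reward(candidate, tests))  # 1.0
```

The appeal over RLHF is exactly the point made above: this reward needs no human labeler, only a runnable test suite.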
grandmczeb|2 years ago
[1] https://open.spotify.com/episode/2a8Rtm4mhjzennOoAByFKx around 15:10
nl|2 years ago
There are 193 licenses in total. v1.0 of The Stack included MPL/EPL/LGPL whereas v1.1+ doesn't include them.
csdvrx|2 years ago
What hardware are you using? (CPU,RAM,GPU,VRAM)
Have you considered using llama.cpp for mixed CPU+GPU use (if you have enough RAM)?
mirekrusin|2 years ago
2x 24GB (dual GPU) ~ 28B model
1x 24GB ~ 14B model
etc.