StarCoder and StarCoderBase: 15.5B parameter models with 8K context length

[+] simonw|2 years ago|reply

This is trained on The Stack, which is available here: https://huggingface.co/datasets/bigcode/the-stack/

Interesting to note that The Stack is 6TB - the whole of the RedPajama LLM training set (a lot more than just code) is only 2.6TB.

To get an idea what that training data looks like, I grabbed the first 300MB SQL file from https://huggingface.co/datasets/bigcode/the-stack/tree/main/... and then dumped the first 1,000 rows from that into JSON and loaded it into Datasette Lite:

https://lite.datasette.io/?json=https://gist.github.com/simo...

Here's a query that shows a random row - hit the blue "Run SQL" button to see another one: https://lite.datasette.io/?json=https://gist.github.com/simo...

[+] vlovich123|2 years ago|reply

Something tells me that I haven't trained on 6 TB of code and can meaningfully outperform any AI. That tells me that there's something still structurally missing about the training efficiency. I wonder if this replicates to things like chess/go - for a computer trained on the same number of games that a human is, is the computer still able to outperform a human?

[+] ed|2 years ago|reply

Nit, but the 6TB version includes a lot of forks and duplicated code so I assume StarCoder was trained on the deduped version, which is 2.9TB.

[+] Imnimo|2 years ago|reply

>We inspected StarCoder-generated programs on these benchmarks and found that there were several cases where the model produces what are effectively empty solutions, e.g., pass or a comment Insert code here. We also observed this kind of failure in every model we evaluated.

I'm not sure whether the AI learning that it can just write "#TODO" is a sign our jobs are safe or a sign our jobs are truly in danger.

[+] bionhoward|2 years ago|reply

Could be a sign the thing knows how to break work into multiple pieces. If it wasn’t just 1-pass and you give it a couple turns to document / test / deliver, it definitely can fill in placeholders from the initial generative step when it does refinement. Language chains, not instant zero shot perfection

[+] fleischhauf|2 years ago|reply

sounds more like lazyness, I think we might be ok actually.

[+] bootloop|2 years ago|reply

The biggest interest I have in this, is that I would like to have the ability to ask questions about large code-bases. I think being able to generate small functions or explain single code sections is nice, but being able to ask bigger architectural questions would be really helpful for all kind of engineers (in particular in a large company).

I have seen approaches with merging context across multiple levels. But that can only do so much. Is it viable to fine-train a model to a specific code-base so it has knowledge across all files? Does anyone have more info on this kind of problem space?

[+] zellyn|2 years ago|reply

Steve Yegge's recent blog posts claim that SourceGraph are getting a pretty good result by using embeddings created from their knowledge graph of the code structure. That's still the usual [create embeddings, search against embedding of query, retrieve results and use them as prompt] schlep, so yeah, it isn't really understanding architecture well yet.

I too have a job where almost every question is about structural understanding and improvement of a large existing codebase. I'd love to have AI help, but I think it's going to take another iteration or three of model architecture to get there.

[+] heliophobicdude|2 years ago|reply

Refined training is usually updating the weights of usually what's called a foundational model with well structured and numerous data. It's can be expensive but most importantly disturb the usefulness of having all the generalizations baked in from training data [1]. While LLMs can generate code based on a wide range of inputs, they're not designed to retrieve specific pieces of information in the same way that a database or a search engine would. It's just very lossy. Perhaps it wouldn't be the best use for single code base fine tuning right now.

Can you please share more about the merging context across levels? This sounds interesting!

1: "Language Models are Few-Shot Learners" Brown et al. https://arxiv.org/pdf/2005.14165.pdf

[+] bionhoward|2 years ago|reply

Right now the solution is vector databases; however we could envision a different state representation in the transformer decoder which is the main component of a GPT; for example, you could summarize your architecture and tests and implementation with compressed / smaller vectors for each piece and organize that stuff in a tree structure. Then just concatenate the tree to the context and user query. It’d require you to rewrite the multi head attention function or make a wrapper, and it’d add an ETL step to create the tree, but then you could have that whole compressed representation of your codebase available when you ask a question. It would necessarily be an abstraction and not verbatim copy of the code, otherwise you’d run out of room. Funny how everything runs into Kolmogorov complexity eventually

[+] bradleyjg|2 years ago|reply

Exactly. I’d love to be able to ask where and how would I go about adding some new feature to a code base.

[+] YeGoblynQueenne|2 years ago|reply

As usual, concrete results are poor, with best results obtained for Python, reported on HumanEval at 40.8% and 52.7% on MBPP, where previous best was 33.5 and 45.9 respectively (both by original copilot model). Results on DS-100 (simle data science programming tasks) are much more modest, at around 30%.

And all this despite the "pass@k" evaluation metric, which is very misleading: it's clearly selected to make a code generating model look its absolute best. For example, the "pass@1" metric is _estimated_ not by choosing a single solution generated by a model, for a given programming task, and checking whether the solution completes the programming task correctly, but by generating a single solution multiple times (200 or 20, depending on model) and then averaging over them. So while it's called "pass-at-one" the "one" is actually a bunch of randomly drawn samples and not a single solution. Like I say, very misleading. See Section 6.1.1 in the paper.

[+] freeqaz|2 years ago|reply

Looks like the model is on HuggingFace here, for anybody that is curious to play with it. https://huggingface.co/bigcode/starcoder

[+] CodeCompost|2 years ago|reply

Sorry I'm a bit new to this. How does this work? Trying to read the site but have a hard time understanding this.

[+] nr2x|2 years ago|reply

Given some of my own open source code is no doubt in GPT and Bard, which feels wrong given the fees and limitations, I’m VERY VERY excited for this!

[+] speedgoose|2 years ago|reply

It’s perhaps in the training dataset but unless your code is extremely common and duplicated, it’s probably not in the final models. They aren’t that big.

[+] cs702|2 years ago|reply

It's great to see this!

A big THANK YOU to everyone who made it possible.

I'm looking forward to playing with it -- and also, eventually, inevitably, running a quantized, super-efficient version on my laptop.

[+] jimlongton|2 years ago|reply

(Possibly naive question) This is marketed as open source. Does that mean I can download the model and run it locally? If so, what kind of GPU would I need?

[+] pyrophane|2 years ago|reply

Here is a good reference:

https://huggingface.co/docs/transformers/perf_train_gpu_one

[+] joaogante|2 years ago|reply

A 3090 (or any GPU with >=20GB VRAM) can run StarCoder with int8 quantization at about 12 tokens per second, 33 with assisted generation -- which will come out for StarCoder in the coming days.

When 4-bit quantization comes out, I would expect a GPU with 12GB VRAM to be able to run it.

Disclaimer: I work at Hugging Face

[+] heliophobicdude|2 years ago|reply

I think we need a different strategy to instruction tuning for coding LLMs.

I don't think StarCoderBase is instruction-tuned off the bat but would serve as a good starting point for a new technique.

RLHF is fine for things that are hard to measure and evaluate, but code is runnable and testable.

I propose we try Reinforcement Learning Machine Feedback or RLMF.

Prompts and responses are evaluated by how accurate the response evaluated to. We can then train a reward model to help refine StarCoder.

[+] grandmczeb|2 years ago|reply

Good idea, but pretty sure this is already widely done. For example, Alex Gravely (architect behind copilot) mentioned on No Priors[1] they would generate the implementations for tests in random github projects and check if they passed as feedback.

[1] https://open.spotify.com/episode/2a8Rtm4mhjzennOoAByFKx around 15:10

[+] daneel_w|2 years ago|reply

>"The Stack, a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process."

>"The Stack is a collection of source code from repositories with various licenses. Any use of all or part of the code gathered in The Stack must abide by the terms of the original licenses, including attribution clauses when relevant."

Does it have a view of what licenses can mix, or is it simply disallowed from crossing that boundary and only offer answers sourced entirely within the confines of this or that specific license? The latter poses some interesting scenarios and questions.

[+] freeqaz|2 years ago|reply

Permissively licensed would imply non-copyleft to me. That means only licenses like Apache or MIT would be allowed to be train on, but not licenses like GPL.

[+] nl|2 years ago|reply

Details are here: https://huggingface.co/datasets/bigcode/the-stack

There are 193 licenses in total. v1.0 of The Stack included MPL/EPL/LGPL whereas v1.1+ doesn't include them.

[+] veselin|2 years ago|reply

What speed should we expect from the model on consumer hardware? I tried a 8 bit quantized version on 4090 and got it to generate 100 tokens for 13 second, which seems a bit slow to me.

[+] ofermend|2 years ago|reply

All the code generation tools, StarCoder included, still have hallucinations. In this context code that looks good, but doesn't work or has a subtle bug. How do we address that?

[+] theaiquestion|2 years ago|reply

> All the code generation tools, StarCoder included, still have hallucinations.

This also includes humans. We "hallucinate" in very similar ways. For example mistaking localhost:8080 for localhost:8008 in a large config file. Attempting to use methods that were deprecated and no longer exist, etc.

IMO there's two ways to prevent this is - one is to make better performing models (architecture/training data/training amount/etc)

The other is the exact same as humans. Compile time tools that let it know immediately if it hallucinated, types, linting, tests, etc.

You just do it as a loop the exact same as a human. You write code, the compiler tells you that method doesn't exist, you adjust your code/consult the documents (also doable with agents).

[+] arthurcolle|2 years ago|reply

Verification systems that then feed back into the models and correct hallucinations. It is slow but I think that's the only real way forward

[+] ipsum2|2 years ago|reply

I've been playing with StarCoder for the last week. It performs great once fine-tuned. Highly recommend people use it as a base model for anything, not just coding.

[+] enum|2 years ago|reply

I'm curious--what have you fine-tuned it on?

[+] fbodz|2 years ago|reply

Has anyone figured out a way to fine tune this with 24gb of vram? I have tried with deepspeed etc but no luck. Seems to be just out of reach for fine tuning requiring 26gb.

[+] csdvrx|2 years ago|reply

Have you tried quantization? It's often a cheap and simple way to reduce the VRAM requirements.

What hardware are you using? (CPU,RAM,GPU,VRAM)

Have you considered using llama.cpp for a mixed CPU+GPU use (if you have enough RAM)

[+] mirekrusin|2 years ago|reply

People should be training model sizes that fit-and-fill consumer GPUs, ie:

2x 24G - for dual GPU ~ 28B model

1x 24G ~ 14B model

etc.

[+] superkuh|2 years ago|reply

This didn't generate anything like actual perl code but the paper did say it wasn't good at perl (relatively) and in their defense my code it was completing was full of regex. What I did enjoy was how it picked up on my style of extremely long variable and subroutine names without spaces. It even named them with swear words like I do.

[+] DanielShir|2 years ago|reply

Now I really want to read your code... Anything public? :)

[+] pfd1986|2 years ago|reply

Code LLMs and they didn't call it CLLaMs? :_(

[+] ftxbro|2 years ago|reply

Do I need to make an account on huggingface to get the model? I would prefer not to do it, and just download a zip like you can on github.

[+] version_five|2 years ago|reply

I thought you didn't need an account to download from HF anymore. You can just do git lfs pull, at least for the stuff I've downloaded.

Personally I'm concerned about how model hosting has been concentrated in one company, and was previously very unhappy that they required accounts, but I think that's past. Let me know if it's still the case for some things.

[+] pabs3|2 years ago|reply

Permissive licenses usually have attribution requirements, does using this mean you have to attribute all the projects from The Stack?

[+] VadimPR|2 years ago|reply

This is great - we needed a model where we're sure it won't reproduce someone's code with an incompatible license.

[+] pplanel|2 years ago|reply

It sucks at Rust

  fn convert_ogg_to_w  av(input: Path) -> Result<

162 comments