A lot about this project was surprising. We knew it was going to be good, but didn't expect it to be this good -- especially surprising was the finetuned performance boost, and the fact that the model is decent at language tasks and reasoning (in some cases much better than much larger general-purpose models).
It feels like there is a lot more to do with this model, and I have a suspicion you can even make a half-decent chatbot (at least one focused on code) by finetuning it on conversation (and/or instruction) datasets.
Will follow up with a more comprehensive technical report and the UL2R version (fill-in-the-middle support).
First - thank you for open sourcing this! It's a real gift to the community to have a model intended for "commercial use" that's actually licensed as such.
I'd be very interested to hear about the choice/evaluation of the ALiBi approach for positional embedding (perhaps in the technical report).
My intuition suggests that while this allows for better generalizability for longer sequence lengths, it penalizes scenarios where an LLM might need to check for things like a function signature far away from where the next token is generated. My initial testing of this model tracks with this intuition but that's by no means a rigorous evaluation.
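For readers who haven't seen it: ALiBi adds no positional embeddings at all; instead it adds a linearly growing penalty to attention scores based on query-key distance, which is exactly why faraway context (like a distant function signature) gets progressively harder to attend to. A minimal sketch for one head (the slope value here is illustrative; real models use a fixed geometric series of slopes across heads):

```python
import numpy as np

def alibi_bias(seq_len, slope):
    # bias[i, j] = -slope * (i - j): the further key j sits behind query i,
    # the larger the penalty added to the attention score before softmax
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    bias = -slope * (i - j)
    bias[j > i] = -np.inf   # causal mask: queries cannot attend to future keys
    return bias

b = alibi_bias(4, slope=0.5)
# b[3, 0] == -1.5: attending 3 tokens back costs 1.5 in score
```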
Impressive model, thank you for releasing it under a business-friendly license!
Have you considered using Google's sparse "scaling transformer" architecture as the base? Even at 3B scale it can generate 3-4x more tokens per FLOP while being competitive at perplexity with a dense transformer. I think OpenAI uses a variant of it in their ChatGPT-3.5-Turbo product.
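For context on the Scaling Transformers idea: roughly, only one neuron per fixed-size block of the feedforward layer is kept active. A toy numpy sketch of that block sparsity (the paper uses a cheap learned controller to choose the active neurons up front rather than computing the full activation first; all names and sizes here are made up):

```python
import numpy as np

def sparse_ffn(x, W_in, W_out, block_size):
    # dense ReLU activations for the whole feedforward layer
    h = np.maximum(W_in.T @ x, 0.0)                     # shape (d_ff,)
    # keep only the strongest activation in each block, zero the rest
    blocks = h.reshape(-1, block_size)
    mask = np.zeros_like(blocks)
    mask[np.arange(blocks.shape[0]), blocks.argmax(axis=1)] = 1.0
    h_sparse = (blocks * mask).reshape(-1)
    return W_out.T @ h_sparse                           # shape (d_model,)

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16
W_in = rng.normal(size=(d_model, d_ff))
W_out = rng.normal(size=(d_ff, d_model))
x = rng.normal(size=d_model)
y = sparse_ffn(x, W_in, W_out, block_size=4)  # at most 4 of 16 neurons fire
```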
What does "fine tuning" mean in this context? Does it mean you fine-tuned it on a specific code repository, or collection of code repositories and then had it do work in those repositories?
The model is way too small; comparing it to Codex feels disingenuous. Sure, it's 77% smaller, but it's also 77% worse. It's a cool project nonetheless.
For instance, even this simple snippet generates wrong inline completions:
```
// Only return even numbers bigger than 10 from the array
const arrayFilter = (array) =>
```
Replit-code-v1:
```
// Only return even numbers bigger than 10 from the array
const arrayFilter = (array) => {
  return array.filter((item) => item > 10);
};
```
Gets it wrong, returns odd numbers.
Codeium:
```
// Only return even numbers bigger than 10 from the array
const arrayFilter = (array) => {
  return array.filter((num) => num > 10 && num % 2 === 0);
};
```
ChatGPT (GPT-3.5 Turbo) - code only, without the rest of the completion since it's instruction-tuned:
```
const arrayFilter = (array) => {
  return array.filter(num => num % 2 === 0 && num > 10);
}
```
Not comparable at all. For reference, if anyone wants to test: I ran this through the HuggingFace space using the default parameters, ChatGPT through chat.openai.com, and Codeium through the VSCodium extension on an empty JavaScript file.
It seems like every week someone comes out with some version of "we can get results similar to OpenAI's API with our model that you can run on a Commodore 64!"
And then you dig in, and it's always far behind in some important way.
Not hating here, I love the pace of iteration, just not the hyperbole.
Yeah I tried the demo, it wrote some wrong code with comments in Chinese. I think I'll pass.
It's a pretty well accepted fact now that bigger LLM = moar better without exceptions. I'm not sure why there's a race to the bottom of who'll make the most useless model that can run everywhere.
hi HN! back again with an exclusive deep dive with Replit’s head of AI. I attended their developer day last week (https://twitter.com/swyx/status/1650989632413401089) just expecting a regular fundraise announcement and was totally shocked when they announced their own LLM and also said they would open source it. so immediately asked them for a podcast interview and this is the result.
my favorite learning is how they are pushing the state of the art - openai’s HumanEval is the industry standard benchmark for code LLMs, but Reza kindly went above and beyond to show how they use “AmjadEval” - using coder intuition to capture human preference on what output is more helpful to coders (see screenshots https://twitter.com/swyx/status/1653791019421569024?s=20)
Replit: I tried to write HMM Forward function code:
```
# Write a function that performs the forward algorithm of an HMM. Ideally it uses numpy vectorization,
# But if you cannot do that it is fine too. Just perform forward inference, given HMM parameters and Observations.
#
def HMM_forward() :
# Write your code here
# Hint:
# - Remember that HMMs are stochastic, so you will need to multiply transition matrices by observation probabilities
# - Remember that HMMs are stochastic, so you will need to multiply emission matrices by observation probabilities
# - Remember that HMMs are stochastic, so you will need to multiply emission matrices by transition matrices
# - Remember that HMMs are stochastic, so you will need to multiply emission matrices by transition matrices
# - Remember that HMMs are stochastic, so
```
It gets stuck in an infinite loop.
ChatGPT gets it right, first shot:
```
import numpy as np
def forward_algorithm(A, B, pi, obs):
    """
    Performs the forward algorithm for an HMM.

    Parameters:
    A: numpy.ndarray, shape (N, N)
        Transition matrix of the HMM, where N is the number of states.
    B: numpy.ndarray, shape (N, M)
        Emission matrix of the HMM, where M is the number of possible observations.
    pi: numpy.ndarray, shape (N,)
        Initial probability distribution over states.
    obs: numpy.ndarray, shape (T,)
        Sequence of T observations.

    Returns:
    alpha: numpy.ndarray, shape (T, N)
        Forward probabilities for each state at each time step.
    """
    T = obs.shape[0]
    N = A.shape[0]
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = np.dot(alpha[t-1], A) * B[:, obs[t]]
    return alpha
```
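As a quick sanity check on the ChatGPT snippet above, here's a toy two-state run (the parameters are made up for illustration; the function is repeated compactly so the snippet stands alone):

```python
import numpy as np

def forward_algorithm(A, B, pi, obs):
    # forward recursion: alpha[t, i] = P(o_1..o_t, state_t = i)
    T, N = obs.shape[0], A.shape[0]
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = np.dot(alpha[t-1], A) * B[:, obs[t]]
    return alpha

# toy two-state HMM with two observation symbols (illustrative values)
A = np.array([[0.7, 0.3], [0.4, 0.6]])   # transitions
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # emissions
pi = np.array([0.5, 0.5])                # initial distribution
obs = np.array([0, 1, 0])                # observed symbol indices

alpha = forward_algorithm(A, B, pi, obs)
likelihood = alpha[-1].sum()             # P(obs | model)
```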
OpenAI managed to do the important but extremely hard thing: they moved out of the DL benchmark frame and made something that is general-purpose useful. Great effort and congrats to the Replit team though; hopefully they can keep iterating on this and reach ChatGPT capabilities someday.
This is amazing work, and bravo to the people working on RedPajama.
This is fantastic for the world, this means LLMs will not be controlled by a couple of companies with the associated rents.
Yes, private LLMs will likely be a couple of years ahead of "free" alternatives, but that's OK; we want to incentivize for-profit research so long as the services become low-priced in time (and in this case, in short order).
My first reaction was, "why is replit building LLMs," but I guess it fits their needs to have one optimized for their use. But I wonder, is this the beginning of another wave of "every company is an AI company?" Are we going to see a spike in tech hiring around AI/LLM, money starting to flow again, etc? And how many years until it all blows up and the layoffs start?
No Clojure. No Julia. No Haskell. No Racket. No Scheme. No Common Lisp. No OCaml. And, as much as I despise Microsoft, No C#. No F#. No Swift. No Objective-C. No Perl. No Datalog. A glaringly lacking choice of languages.
Despite the lack of examples, it still completes trivial Clojure like "(defn connect [" and other Lisp syntax like "(define (hello", which is promising for further refinement training on Lisp languages.
It's a bit hard to believe that the system is decent at producing code that captures complex ideas and higher-level structure when the tokens/param ratio is >30 (it's ~200 here?). The "good" models (meaning those with lots of "knowledge" or "memorization" of the dataset) typically tend to be around 2 tokens/param, and models with decent language generation but less knowledge/memorization are around 30 tokens/param.
Perhaps the domain allows for this, but since a linguistic interface on the input is still needed... it's hard to believe.
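For concreteness, using Replit's reported figures (~2.7B parameters, ~525B training tokens including repeated epochs; both numbers are approximate):

```python
params = 2.7e9            # reported parameter count
tokens = 525e9            # reported training tokens
ratio = tokens / params   # roughly 194 tokens per parameter
```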
I genuinely don't understand how anyone can use something like this and seriously think "oh yeah, this is revolutionary." It's almost complete garbage and can't do anything remotely interesting.
```
# a method that approximates the hyperbolic tangent (clamped tanh)
def rational_tanh(x):
    return (x + 1) / (x - 1)
```
Even gave it the BIG hint of a "clamped" and "rational" tanh, but that ain't it, chief. Forget GPT-4, I would be embarrassed to even show this as a tech demo.
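For reference, a "clamped rational tanh" is a well-known trick; one common variant (coefficients differ between sources) looks like this:

```python
def rational_tanh(x):
    # rational approximation of tanh, clamped to the valid range [-1, 1];
    # accurate to within a few percent on [-3, 3] and exact at the boundaries
    if x < -3.0:
        return -1.0
    if x > 3.0:
        return 1.0
    return x * (27.0 + x * x) / (27.0 + 9.0 * x * x)
```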
I keep thinking there should be a way to train a copilot against just one set of code libraries. I know LLMs require training against a lot of text to get their smarts, but is there a way to set this up so a model can be created for a specific library by anyone, so it could provide open source support via a transformer + model? Maybe this would be a better approach than a jack of all trades, master of none.
I can barely keep up with this stuff, but quick question. Is there a way to simply change the URL setting of copilot to point to this model? Obviously it needs an endpoint, I could hack something up, but asking if somebody has already done this? Would be nice to cancel my copilot.
Did you mess around with the settings? I'm getting a correct implementation and since it's deterministic (with default settings) it should be the same for you.
I think that 20 years from now, we'll all be sitting around wondering 1) where the fuck are my flying cars, and 2) what were they thinking using computers to write code?
And the reason I say this is because these tools are answering a question that we haven't asked yet: what common problems need to be solved in this programming language, and where do I get code to solve that problem?
These LLM modules are basically telling us how to duplicate code, and what we need is the opposite: how to stop reinventing the wheel for the 100th time.
Instead of writing code for me, tell me if I already have it. If I'm writing it, tell me there's a library for that. If I'm a library writer, give me suggestions for what libraries are missing from the toolkit.
All we've done so far is begin the process of automating the production of duplicate code, with no way to go back in time and correct bugs introduced in earlier iterations. We are likely, for instance, to see zero-day attacks that affect hundreds of applications, with no simple way to describe which applications are affected. That's going to be a first-rate trainwreck.
Well fwiw, working with GPT 4 it often suggests which libraries to use assuming the question allows for it, so it's not like everyone's writing everything from scratch.
But libraries and especially frameworks as they are these days are also a giant liability more often than not. APIs change for no reason, they can be removed from the package manager at any moment without warning, people may slip malicious code into them past LGTM reviews, have recursive dependencies upon dependencies that bloat and slow down your build process, etc.
Sometimes you don't need to install the entire damn car manufacturing plant and dealership it comes with just to get the one wheel you needed. And an LLM can just write you the code for a very nicely customized wheel in a few seconds anyway.
> how to stop reinventing the wheel for the 100th time.
The idea of libraries may not have been a good one. It saved human time, but no library is perfect because no abstraction is perfect, and this causes unnecessary bloat. It seems that Nature does not use libraries; it uses replication instead, and we can now have that too.
Unfortunately I'm someone who sometimes can't separate the art from the artist. Replit is the company where the founder sent these nasty pompous threats to their ex-employee for their innocent side project and then tried to double talk his way out of it with a bs non-apology when it got exposed in public. I won't support Replit or anything they make.
amasad | 2 years ago:
- Repo: https://github.com/replit/ReplitLM/tree/main/replit-code-v1-...
- HuggingFace: https://huggingface.co/replit/replit-code-v1-3b
- Demo: https://huggingface.co/spaces/replit/replit-code-v1-3b-demo
- Early benchmark results: https://twitter.com/amasad/status/1651019556423598081
kir-gadjello | 2 years ago:
Here is the paper https://arxiv.org/abs/2111.12763 and the implementation https://github.com/google/trax/blob/master/trax/models/resea... if you are interested.
Hope you get to look into this!
pera | 2 years ago:
1 - Why did you choose Markdown? It seems an odd choice for training a model like this.
2 - Have you tried training on a single PL and then benchmarking it against this more general version?
curiousgal | 2 years ago:
Reference: How Replit used legal threats to kill my open-source project https://intuitiveexplanations.com/tech/replit/
thewataccount | 2 years ago:
On first look this seems to blow the current llama-based models out of the water, including the 30B ones.
Pasting what you want + url + example json with no other context, and it "knows" what the url and the json are for, without even being told.
I'm not even saying it's as good as ChatGPT, but this is a tenth the size of the best llama models I've seen.
johnfn | 2 years ago:
Hehe, yeah, imagine saying you made a new programming language with 77% fewer lines of code than Python.
swyx | 2 years ago:
please AMA!
I tried a prompt, and the completion didn't add the needed import statement; I'm also unclear why it's "defining the size of the grid".
ubertaco | 2 years ago:
https://try.ocamlpro.com/#code/type'point'='$4'x:'int;'y':'i...
Imnimo | 2 years ago:
Prompt:
```
def nth_prime(n):
```
Completion:
```
    if n == 1:
        return 2
    if n == 2:
        return 3
    if n == 3:
        return 5
    if n == 4
```
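For contrast with the case-by-case enumeration above, a straightforward correct completion would look something like this (a simple trial-division sketch, not optimized):

```python
def nth_prime(n):
    # return the n-th prime (1-indexed) by trial division
    count, candidate = 0, 1
    while count < n:
        candidate += 1
        if all(candidate % d != 0 for d in range(2, int(candidate ** 0.5) + 1)):
            count += 1
    return candidate
```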
m3kw9 | 2 years ago:
Prompted with:
```
def sieve_eratosthenes(n):
```
it just repeats itself:
```
##a function to sort 10 numbers
##a function to sort 10 numbers
##a function to sort 10 numbers
```
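For reference, a correct completion of that prompt is short (the classic Sieve of Eratosthenes):

```python
def sieve_eratosthenes(n):
    # return all primes <= n by marking off multiples of each prime
    is_prime = [True] * (n + 1)
    is_prime[0:2] = [False, False]
    for p in range(2, int(n ** 0.5) + 1):
        if is_prime[p]:
            for multiple in range(p * p, n + 1, p):
                is_prime[multiple] = False
    return [p for p, prime in enumerate(is_prime) if prime]
```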