A lot about this project was surprising. We knew it was going to be good, but didn't expect it to be this good -- especially surprising was the finetuned performance boost, and the fact that the model is decent at language tasks and reasoning (in some cases much better than much larger general-purpose models).
It feels like there is a lot more to do with this model, and I have a suspicion you can even make a half-decent chatbot (at least one focused on code) by finetuning it on conversation (and/or instruction) datasets.
Will follow up with a more comprehensive technical report and the UL2R version (fill-in-the-middle support).
First - thank you for open sourcing this! It's a real gift to the community to have a model intended for "commercial use" that's actually licensed as such.
I'd be very interested to hear about the choice/evaluation of the ALiBi approach for positional embedding (perhaps in the technical report).
My intuition suggests that while this allows for better generalizability for longer sequence lengths, it penalizes scenarios where an LLM might need to check for things like a function signature far away from where the next token is generated. My initial testing of this model tracks with this intuition but that's by no means a rigorous evaluation.
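For readers who haven't seen it: ALiBi adds no positional embeddings at all; instead it adds a linearly growing penalty to attention scores based on query-key distance, which is exactly why faraway context (like a distant function signature) gets progressively harder to attend to. A minimal sketch for one head (the slope value here is illustrative; real models use a fixed geometric series of slopes across heads):

```python
import numpy as np

def alibi_bias(seq_len, slope):
    # bias[i, j] = -slope * (i - j): the further key j sits behind query i,
    # the larger the penalty added to the attention score before softmax
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    bias = -slope * (i - j)
    bias[j > i] = -np.inf   # causal mask: queries cannot attend to future keys
    return bias

b = alibi_bias(4, slope=0.5)
# b[3, 0] == -1.5: attending 3 tokens back costs 1.5 in score
```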
Impressive model, thank you for releasing it under a business-friendly license!
Have you considered using Google's sparse "scaling transformer" architecture as the base? Even at 3B scale it can generate 3-4x more tokens per FLOP while being competitive at perplexity with a dense transformer. I think OpenAI uses a variant of it in their ChatGPT-3.5-Turbo product.
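For context on the Scaling Transformers idea: roughly, only one neuron per fixed-size block of the feedforward layer is kept active. A toy numpy sketch of that block sparsity (the paper uses a cheap learned controller to choose the active neurons up front rather than computing the full activation first; all names and sizes here are made up):

```python
import numpy as np

def sparse_ffn(x, W_in, W_out, block_size):
    # dense ReLU activations for the whole feedforward layer
    h = np.maximum(W_in.T @ x, 0.0)                     # shape (d_ff,)
    # keep only the strongest activation in each block, zero the rest
    blocks = h.reshape(-1, block_size)
    mask = np.zeros_like(blocks)
    mask[np.arange(blocks.shape[0]), blocks.argmax(axis=1)] = 1.0
    h_sparse = (blocks * mask).reshape(-1)
    return W_out.T @ h_sparse                           # shape (d_model,)

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16
W_in = rng.normal(size=(d_model, d_ff))
W_out = rng.normal(size=(d_ff, d_model))
x = rng.normal(size=d_model)
y = sparse_ffn(x, W_in, W_out, block_size=4)  # at most 4 of 16 neurons fire
```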
What does "fine tuning" mean in this context? Does it mean you fine-tuned it on a specific code repository, or collection of code repositories and then had it do work in those repositories?
The model is way too small; comparing it to Codex feels disingenuous. Sure, it's 77% smaller, but it's also 77% worse. It's a cool project nonetheless.
For instance, even this simple snippet generates wrong inline completions:
```
// Only return even numbers bigger than 10 from the array
const arrayFilter = (array) =>
```
Replit-code-v1:
```
// Only return even numbers bigger than 10 from the array
const arrayFilter = (array) => {
  return array.filter((item) => item > 10);
};
```
Gets it wrong, returns odd numbers.
Codeium:
```
// Only return even numbers bigger than 10 from the array
const arrayFilter = (array) => {
  return array.filter((num) => num > 10 && num % 2 === 0);
};
```
ChatGPT (GPT-3.5 Turbo) - code only, without the rest of the completion since it's instruction-tuned:
```
const arrayFilter = (array) => {
  return array.filter(num => num % 2 === 0 && num > 10);
}
```
Not comparable at all. For reference, if anyone wants to test: I ran this through the HuggingFace space using the default parameters, ChatGPT through chat.openai.com, and Codeium through the VSCodium extension on an empty JavaScript file.
It seems like every week someone comes out with some version of "we can get results similar to OpenAI's API with our model that you can run on a Commodore 64!"
And then you dig in, and it's always far behind in some important way.
Not hating here, I love the pace of iteration, just not the hyperbole.
Yeah I tried the demo, it wrote some wrong code with comments in Chinese. I think I'll pass.
It's a pretty well accepted fact now that bigger LLM = moar better without exceptions. I'm not sure why there's a race to the bottom of who'll make the most useless model that can run everywhere.
hi HN! back again with an exclusive deep dive with Replit’s head of AI. I attended their developer day last week (https://twitter.com/swyx/status/1650989632413401089) just expecting a regular fundraise announcement and was totally shocked when they announced their own LLM and also said they would open source it. so immediately asked them for a podcast interview and this is the result.
my favorite learning is how they are pushing the state of the art - openai’s HumanEval is the industry standard benchmark for code LLMs, but Reza kindly went above and beyond to show how they use “AmjadEval” - using coder intuition to capture human preference on what output is more helpful to coders (see screenshots https://twitter.com/swyx/status/1653791019421569024?s=20)
Replit: I tried to write HMM Forward function code:
```
# Write a function that performs the forward algorithm of an HMM. Ideally it uses numpy vectorization,
# But if you cannot do that it is fine too. Just perform forward inference, given HMM parameters and Observations.
#
def HMM_forward() :
# Write your code here
# Hint:
# - Remember that HMMs are stochastic, so you will need to multiply transition matrices by observation probabilities
# - Remember that HMMs are stochastic, so you will need to multiply emission matrices by observation probabilities
# - Remember that HMMs are stochastic, so you will need to multiply emission matrices by transition matrices
# - Remember that HMMs are stochastic, so you will need to multiply emission matrices by transition matrices
# - Remember that HMMs are stochastic, so
```
It gets stuck in an infinite loop.
ChatGPT gets it right, first shot:
```
import numpy as np
def forward_algorithm(A, B, pi, obs):
    """
    Performs the forward algorithm for an HMM.

    Parameters:
    A: numpy.ndarray, shape (N, N)
        Transition matrix of the HMM, where N is the number of states.
    B: numpy.ndarray, shape (N, M)
        Emission matrix of the HMM, where M is the number of possible observations.
    pi: numpy.ndarray, shape (N,)
        Initial probability distribution over states.
    obs: numpy.ndarray, shape (T,)
        Sequence of T observations.

    Returns:
    alpha: numpy.ndarray, shape (T, N)
        Forward probabilities for each state at each time step.
    """
    T = obs.shape[0]
    N = A.shape[0]
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = np.dot(alpha[t-1], A) * B[:, obs[t]]
    return alpha
```
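As a quick sanity check on the ChatGPT snippet above, here's a toy two-state run (the parameters are made up for illustration; the function is repeated compactly so the snippet stands alone):

```python
import numpy as np

def forward_algorithm(A, B, pi, obs):
    # forward recursion: alpha[t, i] = P(o_1..o_t, state_t = i)
    T, N = obs.shape[0], A.shape[0]
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = np.dot(alpha[t-1], A) * B[:, obs[t]]
    return alpha

# toy two-state HMM with two observation symbols (illustrative values)
A = np.array([[0.7, 0.3], [0.4, 0.6]])   # transitions
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # emissions
pi = np.array([0.5, 0.5])                # initial distribution
obs = np.array([0, 1, 0])                # observed symbol indices

alpha = forward_algorithm(A, B, pi, obs)
likelihood = alpha[-1].sum()             # P(obs | model)
```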
OpenAI managed to do the important but extremely hard thing: they moved out of the DL benchmark frame and made something that is general-purpose useful. Great effort and congrats to the Replit team though; hopefully they can keep iterating on this and reach ChatGPT capabilities someday.
This is amazing work, and bravo to the people working on RedPajama.
This is fantastic for the world, this means LLMs will not be controlled by a couple of companies with the associated rents.
Yes, private LLMs will likely be a couple of years ahead of "free" alternatives, but that's OK; we want to incentivize for-profit research so long as the services become low-priced in time (and in this case, in short order).
My first reaction was, "why is replit building LLMs," but I guess it fits their needs to have one optimized for their use. But I wonder, is this the beginning of another wave of "every company is an AI company?" Are we going to see a spike in tech hiring around AI/LLM, money starting to flow again, etc? And how many years until it all blows up and the layoffs start?
No Clojure. No Julia. No Haskell. No Racket. No Scheme. No Common Lisp. No OCaml. And, as much as I despise Microsoft, No C#. No F#. No Swift. No Objective-C. No Perl. No Datalog. A glaringly lacking choice of languages.
Despite the lack of examples, it still completes trivial Clojure like "(defn connect [" and other Lisp syntax like "(define (hello", which is promising for further refinement training on Lisp languages.
It's a bit hard to believe that the system is decent at producing code that captures complex ideas and higher-level structure when the tokens/param ratio is >30 (it's ~200 here?). The "good" models (meaning those with lots of "knowledge" or "memorization" of the dataset) typically tend to be around 2 tokens/param, and models with decent language generation but less knowledge/memorization are around 30 tokens/param.
Perhaps the domain allows for this, but since a linguistic interface on the input is still needed... it's hard to believe.
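For concreteness, using Replit's reported figures (~2.7B parameters, ~525B training tokens including repeated epochs; both numbers are approximate):

```python
params = 2.7e9            # reported parameter count
tokens = 525e9            # reported training tokens
ratio = tokens / params   # roughly 194 tokens per parameter
```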
I genuinely don't understand how anyone can use something like this and seriously think "oh yeah, this is revolutionary." It's almost complete garbage and can't do anything remotely interesting.
```
# a method that approximates the hyperbolic tangent (clamped tanh)
def rational_tanh(x):
    return (x + 1) / (x - 1)
```
Even gave it the BIG hint of a "clamped" and "rational" tanh, but that ain't it, chief. Forget GPT-4, I would be embarrassed to even show this as a tech demo.
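For reference, a "clamped rational tanh" is a well-known trick; one common variant (coefficients differ between sources) looks like this:

```python
def rational_tanh(x):
    # rational approximation of tanh, clamped to the valid range [-1, 1];
    # accurate to within a few percent on [-3, 3] and exact at the boundaries
    if x < -3.0:
        return -1.0
    if x > 3.0:
        return 1.0
    return x * (27.0 + x * x) / (27.0 + 9.0 * x * x)
```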
I keep thinking there should be a way to train a copilot against just one set of code libraries. I know LLMs require training against a lot of text to get their smarts, but is there a way to set this up so a model can be created for a specific library by anyone, so it could provide open source support via a transformer + model? Maybe this would be a better approach than a jack of all trades, master of none.
I can barely keep up with this stuff, but quick question. Is there a way to simply change the URL setting of copilot to point to this model? Obviously it needs an endpoint, I could hack something up, but asking if somebody has already done this? Would be nice to cancel my copilot.
Did you mess around with the settings? I'm getting a correct implementation and since it's deterministic (with default settings) it should be the same for you.
I think that 20 years from now, we'll all be sitting around wondering 1) where the fuck are my flying cars, and 2) what were they thinking using computers to write code?
And the reason I say this is because these tools are answering a question that we haven't asked yet: what common problems need to be solved in this programming language, and where do I get code to solve that problem?
These LLM modules are basically telling us how to duplicate code, and what we need is the opposite: how to stop reinventing the wheel for the 100th time.
Instead of writing code for me, tell me if I already have it. If I'm writing it, tell me there's a library for that. If I'm a library writer, give me suggestions for what libraries are missing from the toolkit.
All we've done so far is begin the process of automating the production of duplicate code, with no way to go back in time and correct bugs introduced in earlier iterations. We are likely, for instance, to see zero-day attacks that affect hundreds of applications, with no simple way to describe which applications are affected. That's going to be a first-rate trainwreck.
Well fwiw, working with GPT 4 it often suggests which libraries to use assuming the question allows for it, so it's not like everyone's writing everything from scratch.
But libraries and especially frameworks as they are these days are also a giant liability more often than not. APIs change for no reason, they can be removed from the package manager at any moment without warning, people may slip malicious code into them past LGTM reviews, have recursive dependencies upon dependencies that bloat and slow down your build process, etc.
Sometimes you don't need to install the entire damn car manufacturing plant and dealership it comes with just to get the one wheel you needed. And an LLM can just write you the code for a very nicely customized wheel in a few seconds anyway.
> how to stop reinventing the wheel for the 100th time.
The idea of libraries may not have been a good one. It saved human time, but no library is perfect because no abstraction is perfect, and this causes unnecessary bloat. It seems that Nature does not use libraries; it uses replication instead, and we can now have that too.
Unfortunately I'm someone who sometimes can't separate the art from the artist. Replit is the company where the founder sent these nasty pompous threats to their ex-employee for their innocent side project and then tried to double talk his way out of it with a bs non-apology when it got exposed in public. I won't support Replit or anything they make.
amasad | 2 years ago:
- Repo: https://github.com/replit/ReplitLM/tree/main/replit-code-v1-...
- HuggingFace: https://huggingface.co/replit/replit-code-v1-3b
- Demo: https://huggingface.co/spaces/replit/replit-code-v1-3b-demo
- Early benchmark results: https://twitter.com/amasad/status/1651019556423598081
kir-gadjello | 2 years ago:
Here is the paper https://arxiv.org/abs/2111.12763 and the implementation https://github.com/google/trax/blob/master/trax/models/resea... if you are interested.
Hope you get to look into this!
pera | 2 years ago:
1 - Why did you choose Markdown? It seems an odd choice for training a model like this.
2 - Have you tried training on a single PL and then benchmarking it against this more general version?
curiousgal | 2 years ago:
Reference: How Replit used legal threats to kill my open-source project https://intuitiveexplanations.com/tech/replit/
thewataccount | 2 years ago:
On first look this seems to blow the current llama-based models out of the water, including the 30B ones.
Pasting what you want + url + example json with no other context, and it "knows" what the url and the json are for, without even being told.
I'm not even saying it's as good as ChatGPT, but this is a tenth the size of the best llama models I've seen.
johnfn | 2 years ago:
Hehe, yeah, imagine saying you made a new programming language with 77% fewer lines of code than Python.
swyx | 2 years ago:
please AMA!
I tried a prompt, and the completion didn't add the needed import statement; I'm also unclear why it's "defining the size of the grid".
ubertaco | 2 years ago:
https://try.ocamlpro.com/#code/type'point'='$4'x:'int;'y':'i...
Imnimo | 2 years ago:
Prompt:
```
def nth_prime(n):
```
Completion:
```
    if n == 1:
        return 2
    if n == 2:
        return 3
    if n == 3:
        return 5
    if n == 4
```
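For contrast with the case-by-case enumeration above, a straightforward correct completion would look something like this (a simple trial-division sketch, not optimized):

```python
def nth_prime(n):
    # return the n-th prime (1-indexed) by trial division
    count, candidate = 0, 1
    while count < n:
        candidate += 1
        if all(candidate % d != 0 for d in range(2, int(candidate ** 0.5) + 1)):
            count += 1
    return candidate
```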
m3kw9 | 2 years ago:
Prompted with:
```
def sieve_eratosthenes(n):
```
it just repeats itself:
```
##a function to sort 10 numbers
##a function to sort 10 numbers
##a function to sort 10 numbers
```
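For reference, a correct completion of that prompt is short (the classic Sieve of Eratosthenes):

```python
def sieve_eratosthenes(n):
    # return all primes <= n by marking off multiples of each prime
    is_prime = [True] * (n + 1)
    is_prime[0:2] = [False, False]
    for p in range(2, int(n ** 0.5) + 1):
        if is_prime[p]:
            for multiple in range(p * p, n + 1, p):
                is_prime[multiple] = False
    return [p for p, prime in enumerate(is_prime) if prime]
```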