top | item 38536681

Magicoder: Source Code Is All You Need

183 points | tosh | 2 years ago | arxiv.org

67 comments

[+] eyegor|2 years ago|reply
For anyone looking to try this out on consumer-grade hardware, here's a q4 version [0]. From initial testing, it's subjectively a bit behind deepseek-coder-instruct [1] at the same size and quantization. Deepseek's model is near magic when it behaves (1-2 tries), spitting out nicely formatted code fence blocks in markdown that I can naively render in real time to get a local ChatGPT-like experience. Magicoder can do this too, but it usually takes 3-4 tries, and it tends to "opt out" and ask for more info pretty frequently. Of course, if you have a ton of VRAM, use the 33b-instruct variant instead.

After more testing, I think it's a toss-up on most coding tasks, but Magicoder tends to give subjectively better responses to "bad prompts" — that is, prompts where you don't put effort into writing clear instructions. For example, one of my "bad prompt" tests is

> how to enable shared gpu memory in wsl2 docker container

A good response to this would discuss the nvidia container toolkit, maybe something about port forwarding, etc. But this isn't a prompt most models can give good responses to. Both of these models can handle it, even at 7b, but Magicoder gives more information.
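For reference, a good answer here would revolve around the NVIDIA Container Toolkit. A rough sketch of the usual setup (assuming the Windows-side NVIDIA driver is already installed; the image tag is just an example):

```shell
# Inside the WSL2 distro: install the toolkit and register it with Docker.
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Then expose the GPU to a container and verify it's visible.
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```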

[0] https://huggingface.co/LoneStriker/Magicoder-S-DS-6.7B-4.0bp...

[1] https://huggingface.co/bartowski/deepseek-coder-6.7b-instruc...

[+] azeirah|2 years ago|reply
From my experience with llama.cpp, using min-p sampling gives far better results than the defaults. For example (note this is using deepseek-coder 33b):

Prompt: "Write the fibonacci function in Python3"

min_p 0.05

    def fibonacci(n:int, memo={}) -> int:
        if n <=2 : return 1 # base case 1, 2 => 1 
    
        if n not in memo:  
            memo[n] = fibonacci(n-1, memo) + fibonacci(n-2,memo) # store the result
        
        return memo[n]
    # Test cases 
    print("Test case 1")
    print(f"The 5th Fibonacci is {fibonacci(5)} ")  
    print(f"The 9th Fibonacci is {fibonacci(9)} ")
Default sampling settings

    def fibonacci(n):
      if n == 0:
          return 0
      elif n ==1:
          return 1
      else :
          return (fibonacci(n - 1) + fibonacci(n -2))
    n = int(input())
    print(fibonacci(n))
Seed is the same.

So if you're using these models locally, make sure you're using min-p sampling. All other samplers are genuinely very suboptimal.

For more in-depth info about _why_: https://www.reddit.com/r/LocalLLaMA/comments/17vonjo/your_se...
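For context, min-p is a simple filter: keep only tokens whose probability is at least min_p times the top token's probability, renormalize, and sample from what's left. A rough sketch of the idea (not llama.cpp's actual implementation):

```python
import math
import random

def min_p_sample(logits, min_p=0.05, rng=random.random):
    # Softmax the logits into probabilities.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep tokens with prob >= min_p * max_prob; discard the long tail.
    cutoff = min_p * max(probs)
    kept = [(i, p) for i, p in enumerate(probs) if p >= cutoff]

    # Sample from the renormalized truncated distribution.
    r = rng() * sum(p for _, p in kept)
    for i, p in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][0]
```

With a confident top token and a high min_p, this collapses to greedy decoding; with a flat distribution, many candidates survive, so quality degrades more gracefully than a fixed top-k cutoff.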

[+] tarruda|2 years ago|reply
> spitting out nicely formatted code fence blocks in markdown that I can naively render in real time to get a local chatgpt like experience.

Not sure if I understood, but in my experience almost every 7B instruct model does this if you add something like "respond with markdown" to the system prompt.

Chatbot-UI (a ChatGPT UI clone) handles markdown nicely and does code rendering in real time.

[+] Hedepig|2 years ago|reply
> tends to give subjectively better responses to "bad prompts".

I wonder if a first pass with another model to expand these so-called bad prompts into better prompts would work.

[+] 3abiton|2 years ago|reply
Is it worth it, though? From my understanding it is trained on GPT-3.5 code snippet outputs; is the Q4 model any good?
[+] micimize|2 years ago|reply
Any time I see HumanEval comparisons I feel the need to point out that HumanEval is only 164 questions. 66.5% vs. 65.9% is the difference between 109 and 108 solved problems, a single question. Still interesting work, though.
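Assuming the reported pass@1 scores are just (problems solved) / 164, the arithmetic checks out:

```python
# HumanEval has 164 problems; scores are fractions of problems solved.
problems = 164
print(round(0.665 * problems))  # 109
print(round(0.659 * problems))  # 108
```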
[+] moyix|2 years ago|reply
At least they used HumanEval+, which adds a bunch more test cases and fixes some errors in the original benchmark!
[+] dontupvoteme|2 years ago|reply
Sentiment analysis of movie reviews is not what I would call an inspired or high-quality programming problem. This seems like overly common examples from the training data being made manifest.

There are so many more interesting things you could be doing. Even if you stick with "use tf-idf on review data", how about sentiment analysis of vacation destination reviews, segmented by the season they were posted? Things like that lead directly into other ideas and possible metrics.

Creativity is really a sore spot with these. I suspect more elaborate prompts can suppress the commonalities, but gpt-3.5 with a bog-standard prompt gives bog-standard ideas.

[+] gsuuon|2 years ago|reply
I think these smaller models really struggle with the reasoning aspect of writing decent code. I'm getting pretty nonsensical things out when asking it to fix fizzbuzz (though it made the right fix), like:

  - Replaced "i % 2 === 0" with "i % 3 === 0", because a number is divisible by 2 only if it's divisible by 3.
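The explanation is backwards (divisibility by 2 and by 3 are independent), even though the edit itself was right. For reference, the divisibility logic FizzBuzz actually needs, as a minimal sketch:

```python
def fizzbuzz(n):
    # Check divisibility by 15 first: multiples of both 3 and 5.
    if n % 15 == 0:
        return "FizzBuzz"
    if n % 3 == 0:
        return "Fizz"
    if n % 5 == 0:
        return "Buzz"
    return str(n)

for i in range(1, 16):
    print(fizzbuzz(i))
```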
[+] tarruda|2 years ago|reply
The model might have been trained with code that contained this exact comment
[+] UniverseFly|2 years ago|reply
An author of Magicoder here. I really appreciate every discussion (both positive and negative)! We release Magicoder with the hope that everyone can reproduce and improve it, by openly sharing all the data and every code detail, which none of the leading code models do. That's the main reason we claim it to be fully open-source. Magicoder isn't perfect yet, but we are committed to constantly overcoming these challenges. Through complete transparency, we believe an increasing number of advanced code models will come out soon.
[+] tarruda|2 years ago|reply
A couple of tests I've done in their live demo.

- Implement the quicksort algorithm in Python. I like this test because LLMs usually do it the first time using list comprehensions, then I ask them to avoid allocations.

- Write the game "snake" in Python, and then in JS/HTML.

It succeeded on the simple quicksort implementation with list comprehensions, but failed when I asked for the in-place implementation. With the snake game tests, it didn't get even close.
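The list-comprehension version that models usually produce first looks something like this (a sketch; allocation-heavy, since every recursion level builds new lists, which is why the in-place follow-up is a harder ask):

```python
def quicksort(xs):
    # Base case: lists of length 0 or 1 are already sorted.
    if len(xs) <= 1:
        return xs
    pivot = xs[0]
    # Partition into smaller and larger elements, allocating new
    # lists at every level of recursion.
    smaller = [x for x in xs[1:] if x < pivot]
    larger = [x for x in xs[1:] if x >= pivot]
    return quicksort(smaller) + [pivot] + quicksort(larger)
```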

In my experience these results are worse than non-code top 7B LLMs such as OpenChat/NeuralHermes, and significantly worse than plain Deepseek-coder-instruct 6.7B (which is the only small model I've tested that gets the "snake" game to work). When I ran these tests locally I used Q6 quantization (GGUF), and I assume their live demo is not quantized.

I find the benchmark results surprising, to the point I wonder if the benchmarks didn't leak into the training set.

[+] UniverseFly|2 years ago|reply
Thanks for testing Magicoder! I tried “Implement an in place quicksort” with temperature 0 using the demo and here is the output I get. I haven’t checked every detail but it looks correct:

Here is a Python implementation of an in-place quicksort:

  def partition(arr, low, high):
      i = (low-1)
      pivot = arr[high]
  
      for j in range(low, high):
          if arr[j] <= pivot:
              i = i+1
              arr[i], arr[j] = arr[j], arr[i]
  
      arr[i+1], arr[high] = arr[high], arr[i+1]
      return (i+1)

  def quickSort(arr, low, high):
      if len(arr) == 1:
          return arr
      if low < high:
          pi = partition(arr, low, high)
          quickSort(arr, low, pi-1)
          quickSort(arr, pi+1, high)
  
  arr = [10, 7, 8, 9, 1, 5]
  n = len(arr)
  quickSort(arr, 0, n-1)
  print("Sorted array is:", arr)
This code first defines a partition function that takes an array and two indices, low and high, and rearranges the elements of the array so that all elements less than the pivot (the last element in the array) come before it, and all elements greater than the pivot come after it.

Then it defines a quickSort function that takes an array and two indices, low and high, and sorts the array in-place using the partition function. If the array has more than one element, it recursively sorts the subarrays to the left and right of the pivot.

Finally, it tests the function with an array of integers.

[+] RedXenon|2 years ago|reply
I used the prompt "Write snake game." on the live demo with temperature = 0 and max tokens = 2048. I got the correct implementation of the snake game in Python. But even after a few tries, I could not generate a snake game in HTML/JS.
[+] abrookewood|2 years ago|reply
I don't think the open-source claim is accurate. From their repo: "Magicoder models are trained on the synthetic data generated by gpt-3.5-turbo-1106 developed by OpenAI. Please pay attention to OpenAI's terms of use when using the models and the datasets."
[+] PostOnce|2 years ago|reply
OpenAI behaves as though everyone else's outputs are fair game for training LLMs, including GPT, but doesn't want others using GPT's output?

So, when OpenAI does it, it's transformative, but when we do it, it's not?

That's not right, and I don't think the courts will rule in their favor.

[+] Zambyte|2 years ago|reply
> Please pay attention to OpenAI's terms

Just because people at a company tell you how to behave doesn't mean you need to comply

[+] ipsum2|2 years ago|reply
All model outputs (text, image, video) are considered public domain in the US, so they can be used for whatever.
[+] sagarpatil|2 years ago|reply
Asked it to write two Python programs (one of them tricky) and it got both right on the first pass. Looks promising.
[+] Reubend|2 years ago|reply
I'm guessing this model was made to be small simply to keep costs low and make sure that they could feasibly train the model with the amount of time/effort they had. But to some extent I'm left wondering whether this technique would continue to be fruitful when scaled up to a huge model and with a bigger initial training set.
[+] genidoi|2 years ago|reply
It was made to be small out of necessity. The US government put extensive export controls on many inter-GPU connectivity products last year and expanded those controls recently to include anything above an A100.

Page 9 of this recently published paper[1] is a strong indicator of how far non-US firms go to formally analyze and factor in these bandwidth constraints in building large models.

[1] https://arxiv.org/pdf/2311.15786v2.pdf

[+] yieldcrv|2 years ago|reply
could have been a blog post

this phase reminds me of when crypto projects all had pseudo academic “white papers” in order to be taken “seriously”

[+] colinsane|2 years ago|reply
nice username for that comment.

i was toying around quite a bit with the ecosystem a couple years ago. particularly, comparing all the different sidechains and "layer 2" networks: those white papers were a godsend, because i could actually understand the _specific_ guarantees and assumptions each one made (and it made spotting the BS ones trivial). it's like `man` for the internet, and i quite like that.

i don't see the parallel between this PDF and cryptocurrency whitepapers.

[+] staflow|2 years ago|reply
Somebody should make a coomer meme but with AI “all you need” papers
[+] Zambyte|2 years ago|reply
All you need is all you need
[+] triyambakam|2 years ago|reply
Seriously! It was clever in the beginning but it's wearing on me seeing it everywhere now.
[+] qwertox|2 years ago|reply
Just make one that's capable of creating the perfect React / Vue / Angular framework, including upgrade paths, and while at it, have it use that framework for us so we don't have to bother with reinventing the wheel every 3 years.
[+] colinsane|2 years ago|reply
"the solution to runaway complexity is more complexity"

okay, more nuanced than that. but from the perspective of someone who spends far more time reading code (and patching it or packaging it) than writing it, i worry about that mindset.


[+] yawnxyz|2 years ago|reply
Every time one of these comes out, I hope it DOES take all my dev jobs so I can focus on things that don't require writing code. Every time, they fall really short.
[+] marmakoide|2 years ago|reply
I don't see this creating new algorithms (as in, ones not in the training corpus), but maybe giving the kind of answer you would expect from Stack Overflow, without all the social fluff around it (comments, badges and so on).

The day one of these finds new algorithms that solve problems with better complexity or simpler code than the state of the art, I'll wake up. When I give an LLM a computational geometry problem, it's exactly like a student trying to bullshit his/her way through an exam without any actual deep understanding.

For example, I ask for an algorithm to compute Laguerre Voronoi diagrams (usually not available in books or code examples), and I get answers for plain Voronoi diagrams, because that's what you will find in many books and code samples. Still, generating boring but necessary code, in moderation, is a win.

[+] mnky9800n|2 years ago|reply
Dude you could be Harry Kim on the holodeck creating whatever you can think of. What's wrong with that?
[+] fizx|2 years ago|reply
This looks like a llama2 finetune, so the dataset (inclusive of llama2) isn't fully open as claimed, and I'd still have to accept the Facebook and possibly OpenAI licenses.

Never mind that the base model was clearly built on non-source-code, so their premise doesn't hold.

Disappointing.

[+] KRAKRISMOTT|2 years ago|reply
The number of entities that are plausibly constrained by the Llama 2 license can be counted on two hands, and all of them have the ability to train models matching Llama 2's performance.
[+] oneshtein|2 years ago|reply
But it's a transformative finetune. Why shouldn't the licenses of the sources apply to the original LLM, yet apply to the finetune, for which the LLM is just another source?