For anyone looking to try this out on consumer-grade hardware, here's a q4 version [0]. From initial testing it's subjectively a bit behind deepseek-coder-instruct [1] at the same size and quantization. DeepSeek's model is near magic when it behaves (1-2 tries), spitting out nicely formatted code fence blocks in markdown that I can naively render in real time to get a local ChatGPT-like experience. Magicoder can do this too, but it usually takes 3-4 tries, and it tends to "opt out" and ask for more info pretty frequently. Of course, if you have a ton of VRAM, use the 33b-instruct variant instead.
After more testing, I think it's a toss-up on most coding tasks, but Magicoder tends to give subjectively better responses to "bad prompts". That is, prompts where you don't put effort into writing clear instructions. For example, one of my "bad prompt" tests is
> how to enable shared gpu memory in wsl2 docker container
A good response to this would discuss the NVIDIA Container Toolkit, maybe something about port forwarding, etc. But this isn't a prompt most models can give good responses to. Both of these models can handle it, even at 7b, but Magicoder gives more information.
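For context, the kind of answer this prompt is fishing for boils down to a single command once the NVIDIA Container Toolkit is installed in the WSL2 distro. A minimal sketch (the CUDA image tag is just an example):

```shell
# Verify the GPU is visible from inside a container. The --gpus all flag
# requires the NVIDIA Container Toolkit to be configured for Docker.
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```

If this prints the usual `nvidia-smi` table, GPU passthrough to WSL2 containers is working.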
In my experience with llama.cpp, min-p sampling gives far better results than the defaults. For example (note this is using deepseek-coder 33b):
Prompt: "Write the fibonacci function in Python3"
min_p 0.05
def fibonacci(n: int, memo={}) -> int:
    if n <= 2:
        return 1  # base cases: 1, 2 => 1
    if n not in memo:
        memo[n] = fibonacci(n - 1, memo) + fibonacci(n - 2, memo)  # store the result
    return memo[n]

# Test cases
print("Test case 1")
print(f"The 5th Fibonacci is {fibonacci(5)}")
print(f"The 9th Fibonacci is {fibonacci(9)}")
Default sampling settings
def fibonacci(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fibonacci(n - 1) + fibonacci(n - 2)

n = int(input())
print(fibonacci(n))
Seed is the same.
So if you're using these models locally, make sure you're using min-p sampling. All other samplers are genuinely very suboptimal.
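For those unfamiliar, min-p discards tokens whose probability falls below a fixed fraction of the most likely token's probability, then renormalizes. A minimal sketch of that filtering step (illustrative; llama.cpp's actual implementation differs in detail):

```python
def min_p_filter(probs, min_p=0.05):
    """Keep only tokens whose probability is at least min_p times the
    top token's probability, then renormalize the survivors."""
    p_max = max(probs)
    kept = [p if p >= min_p * p_max else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]
```

The upshot: unlike top-k or top-p, the cutoff scales with the model's confidence, so a very peaked distribution prunes aggressively while a flat one keeps more candidates.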
> spitting out nicely formatted code fence blocks in markdown that I can naively render in real time to get a local chatgpt like experience.
Not sure if I understood, but in my experience almost every 7B instruct model does this if you add something like "respond with markdown" to the system prompt.
Chatbot-UI (A ChatGPT UI clone) handles markdown nicely and does code rendering in real time.
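The trick is just a system-prompt instruction. A hypothetical sketch in the common OpenAI-style message format (the exact wording of the instruction is illustrative):

```python
# Hypothetical chat request showing the "respond with markdown" trick.
messages = [
    {"role": "system",
     "content": "You are a coding assistant. Respond in markdown and wrap "
                "all code in fenced code blocks."},
    {"role": "user", "content": "Write the fibonacci function in Python3"},
]
```

Any UI that renders markdown incrementally can then display the fenced blocks as they stream in.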
Any time I see HumanEval comparisons I feel the need to point out that HumanEval is only 164 questions. 66.5% vs. 65.9% is the difference between 109 and 108 solutions, i.e. a single question. Still interesting work though.
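The arithmetic is easy to check:

```python
total = 164  # HumanEval contains 164 problems
solved_a = round(0.665 * total)  # 66.5% pass rate
solved_b = round(0.659 * total)  # 65.9% pass rate
delta = solved_a - solved_b
print(solved_a, solved_b, delta)  # one question separates the two scores
```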
Sentiment analysis of movie reviews is not what I would call an inspired or high-quality programming problem. This looks like overly common examples from the training data being made manifest.
There are so many more interesting things you could be doing. Even if you stick with the "use tf-idf on review data" idea, how about sentiment analysis of vacation-destination reviews segmented by the season they were posted? Things like that lead directly into other ideas and possible metrics.
Creativity is really a sore spot with these models. I suspect more elaborate prompts can suppress the commonalities, but gpt-3.5 with a bog-standard prompt gives bog-standard ideas.
I think these smaller models really struggle with the reasoning aspect of writing decent code. I'm getting pretty nonsensical explanations when asking it to fix fizzbuzz (though it made the right fix), like:
- Replaced "i % 2 === 0" with "i % 3 === 0", because a number is divisible by 2 only if it's divisible by 3.
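For reference, a correct fizzbuzz with the divisibility checks the model was (badly) explaining. This is a generic sketch in Python, not the commenter's actual test code:

```python
def fizzbuzz(n):
    """Return the fizzbuzz sequence for 1..n as a list of strings."""
    out = []
    for i in range(1, n + 1):
        if i % 15 == 0:        # divisible by both 3 and 5
            out.append("FizzBuzz")
        elif i % 3 == 0:
            out.append("Fizz")
        elif i % 5 == 0:
            out.append("Buzz")
        else:
            out.append(str(i))
    return out
```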
An author of Magicoder here. I really appreciate every discussion (both positive and negative)! We release Magicoder with the hope that everyone can reproduce and improve it, by openly sharing all the data and every code detail, which none of the leading code models do. That's the main reason we claim it to be fully open-source. Magicoder isn't perfect yet, but we are committed to constantly overcoming these challenges. Through complete transparency, we believe an increasing number of advanced code models will come out soon.
- Implement the quicksort algorithm in Python. I like this test because LLMs usually do it the first time using list comprehensions, and then I ask them to avoid allocations.
- Write the game "snake" in python, and then in JS/HTML.
It succeeded on the simple quicksort implementation with list comprehensions, but failed when I asked for the in-place implementation. With the snake game tests, it didn't even get close.
In my experience these results are worse than non-code top 7B LLMs such as OpenChat/NeuralHermes, and significantly worse than plain Deepseek-coder-instruct 6.7B (which is the only small model I've tested that gets the "snake" game to work). When I ran these tests locally I used Q6 quantization (GGUF), and I assume their live demo is not quantized.
I find the benchmark results surprising, to the point I wonder if the benchmarks didn't leak into the training set.
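For context, the "first try" list-comprehension quicksort that models typically produce looks something like this (illustrative, not the model's actual output). It's correct but allocates new lists at every level of recursion, which is why the follow-up asks for an in-place version:

```python
def quicksort(xs):
    """Simple quicksort that builds new lists rather than sorting in place."""
    if len(xs) <= 1:
        return xs
    pivot, rest = xs[0], xs[1:]
    return (quicksort([x for x in rest if x < pivot])
            + [pivot]
            + quicksort([x for x in rest if x >= pivot]))
```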
Thanks for testing Magicoder! I tried “Implement an in place quicksort” with temperature 0 using the demo and here is the output I get. I haven’t checked every detail but it looks correct:
Here is a Python implementation of an in-place quicksort:
def partition(arr, low, high):
    i = low - 1
    pivot = arr[high]
    for j in range(low, high):
        if arr[j] <= pivot:
            i = i + 1
            arr[i], arr[j] = arr[j], arr[i]
    arr[i + 1], arr[high] = arr[high], arr[i + 1]
    return i + 1

def quickSort(arr, low, high):
    if len(arr) == 1:
        return arr
    if low < high:
        pi = partition(arr, low, high)
        quickSort(arr, low, pi - 1)
        quickSort(arr, pi + 1, high)

arr = [10, 7, 8, 9, 1, 5]
n = len(arr)
quickSort(arr, 0, n - 1)
print("Sorted array is:", arr)
This code first defines a partition function that takes an array and two indices, low and high, and rearranges the elements of the array so that all elements less than the pivot (the last element in the array) come before it, and all elements greater than the pivot come after it.
Then it defines a quickSort function that takes an array and two indices, low and high, and sorts the array in-place using the partition function. If the array has more than one element, it recursively sorts the subarrays to the left and right of the pivot.
Finally, it tests the function with an array of integers.
I used the prompt "Write snake game." on the live demo with temperature = 0 and max tokens = 2048. I got the correct implementation of the snake game in Python. But even after a few tries, I could not generate a snake game in HTML/JS.
I don't think the Open Source claim is accurate. From their repo: "Magicoder models are trained on the synthetic data generated by gpt-3.5-turbo-1106 developed by OpenAI. Please pay attention to OpenAI's terms of use when using the models and the datasets."
I'm guessing this model was made to be small simply to keep costs low and make sure that they could feasibly train the model with the amount of time/effort they had. But to some extent I'm left wondering whether this technique would continue to be fruitful when scaled up to a huge model and with a bigger initial training set.
It was made to be small out of necessity. The US government put extensive export controls on many inter-GPU connectivity products last year and expanded those controls recently to include anything above an A100.
Page 9 of this recently published paper[1] is a strong indicator of how far non-US firms go to formally analyze and factor in these bandwidth constraints in building large models.
i was toying around quite a bit with the ecosystem a couple years ago. particularly, comparing all the different sidechains and "layer 2" networks: those white papers were a godsend, because i could actually understand the _specific_ guarantees and assumptions each one made (and it made spotting the BS ones trivial). it's like `man` for the internet, and i quite like that.
i don't see the parallel between this PDF and cryptocurrency whitepapers.
Just make one that's capable of creating the perfect React / Vue / Angular framework, including upgrade paths, and while at it, have it use that framework for us so that we don't have to bother with reinventing the wheel every 3 years.
"the solution to runaway complexity is more complexity"
okay, more nuanced than that. but from the perspective of someone who spends far more time reading code (and patching it or packaging it) than writing it, i worry about that mindset.
Every time one of these comes out I hope it DOES take all my dev jobs so I can focus on things that don't require writing code. Every time, they fall really short.
I don't see this creating new algorithms (as in, not in the training corpus), but maybe giving the kind of answer you would expect from Stack Overflow, without all the social fluff around it (comments, badges and so on).
The day one of these finds new algorithms that solve problems with better complexity or simpler code than the state of the art, I'll wake up. When I give an LLM a computational geometry problem, it's exactly like a student trying to bullshit his/her way through an exam without any actual deep understanding.
For example, I ask for an algorithm to compute Laguerre Voronoi diagrams (usually not available in books or code examples), and I get answers for plain Voronoi diagrams, because it's what you will find in many books and code samples. Generating boring but necessary code, in moderation, is a win.
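The confusion is understandable but telling: the key difference between a plain Voronoi diagram and a Laguerre (power) diagram is only the distance function. A minimal sketch of that difference (sites here are hypothetical `(x, y, weight)` triples; this is the cell-assignment rule, not a full diagram construction algorithm):

```python
def power_distance(point, site):
    """Power of a point w.r.t. a weighted site; with weight 0 this
    reduces to squared Euclidean distance (plain Voronoi)."""
    x, y, w = site
    return (point[0] - x) ** 2 + (point[1] - y) ** 2 - w

def nearest_site(point, sites):
    # A point belongs to the Laguerre cell of the site that minimizes
    # its power distance.
    return min(sites, key=lambda s: power_distance(point, s))
```

An LLM that only knows the textbook Voronoi construction will miss exactly this weight term.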
This looks like a llama2 finetune, so the dataset (inclusive of llama2) isn't fully open as claimed, and I'd still have to accept the Facebook and possibly OpenAI licenses.
Not to mention that the base model was clearly built on non-source-code data, so their premise doesn't hold. Disappointing.
The number of entities that are possibly constrained by the llama 2 license can be counted on two hands and all of them have the ability to train models that can match Llama 2's performance.
But it's a transformative finetune. Why shouldn't the licenses of the sources apply to the original LLM, yet apply to the finetune, for which the LLM is just another source?
[0] https://huggingface.co/LoneStriker/Magicoder-S-DS-6.7B-4.0bp...
[1] https://huggingface.co/bartowski/deepseek-coder-6.7b-instruc...
For more in-depth info about _why_: https://www.reddit.com/r/LocalLLaMA/comments/17vonjo/your_se...
I wonder if a first pass with another model to expand these so-called bad prompts into better prompts would work.
So, when OpenAI does it, it's transformative, but when we do it, it's not?
That's not right, and I don't think the courts will rule in their favor.
Just because people at a company tell you how to behave doesn't mean you need to comply.
[1] https://arxiv.org/pdf/2311.15786v2.pdf
this phase reminds me of when crypto projects all had pseudo-academic "white papers" in order to be taken "seriously"