
Promising results from DeepSeek R1 for code

979 points| k__ | 1 year ago |simonwillison.net | reply

746 comments

[+] anotherpaulg|1 year ago|reply
> 99% of the code in this PR [for llama.cpp] is written by DeepSeek-R1

It's definitely possible for AI to do a large fraction of your coding, and for it to contribute significantly to "improving itself". As an example, aider currently writes about 70% of the new code in each of its releases.

I automatically track and share this stat as graph [0] with aider's release notes.

Before Sonnet, most releases were less than 20% AI generated code. With Sonnet, that jumped to >50%. For the last few months, about 70% of the new code in each release is written by aider. The record is 82%.

Folks often ask which models I use to code aider, so I automatically publish those stats too [1]. I've been shifting more and more of my coding from Sonnet to DeepSeek V3 in recent weeks. I've been experimenting with R1, but the recent API outages have made that difficult.

[0] https://aider.chat/HISTORY.html

[1] https://aider.chat/docs/faq.html#what-llms-do-you-use-to-bui...

[+] simonw|1 year ago|reply
Given these initial results, I'm now experimenting with running DeepSeek-R1-Distill-Qwen-32B for some coding tasks on my laptop via Ollama - their version of that needs about 20GB of RAM on my M2. https://www.ollama.com/library/deepseek-r1:32b

It's impressive!

I'm finding myself running it against a few hundred lines of code mainly to read its chain of thought - it's good for things like refactoring where it will think through everything that needs to be updated.

Even if the code it writes has mistakes, the thinking helps spot bits of the code I may have otherwise forgotten to look at.

[+] amarcheschi|1 year ago|reply
From what I can understand, he asked DeepSeek to convert ARM SIMD code to WASM code.

In the GitHub issue he links, he gives an example of a prompt: "Your task is to convert a given C++ ARM NEON SIMD to WASM SIMD. Here is an example of another function:" (followed by an example block and a block with the instructions for the conversion)

https://gist.github.com/ngxson/307140d24d80748bd683b396ba13b...

I might be wrong of course, but asking an LLM to optimize code is something that helped me quite a bit when I first started learning PyTorch. I feel like "99% of this code blabla" is useful in that it lets you understand the code was AI-written, but it shouldn't be a brag. Then again, I know nothing about SIMD instructions, but I don't see why it should be any different for a capable LLM to write SIMD instructions versus optimized high-level code (which is much harder than merely working high-level code; I'm glad I can do the latter lol).
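
For readers who haven't written SIMD: much of such a conversion is mechanical renaming between intrinsic families. A minimal sketch of the per-ISA dispatch pattern involved (the function and values here are my own toy example, not the actual llama.cpp qX_K kernels, which are far longer):

```c
#include <assert.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#elif defined(__wasm_simd128__)
#include <wasm_simd128.h>
#endif

/* Toy illustration: the same function body gets one implementation per
 * instruction set, and the PR adds __wasm_simd128__ branches alongside
 * the existing ARM NEON ones. A plain x86 build takes the scalar path. */
static float dot4(const float *a, const float *b) {
#if defined(__ARM_NEON)
    /* ARM NEON: load 4 floats, multiply lane-wise, horizontal add (AArch64) */
    float32x4_t prod = vmulq_f32(vld1q_f32(a), vld1q_f32(b));
    return vaddvq_f32(prod);
#elif defined(__wasm_simd128__)
    /* WASM SIMD128: same shape, different intrinsic names */
    v128_t prod = wasm_f32x4_mul(wasm_v128_load(a), wasm_v128_load(b));
    return wasm_f32x4_extract_lane(prod, 0) + wasm_f32x4_extract_lane(prod, 1)
         + wasm_f32x4_extract_lane(prod, 2) + wasm_f32x4_extract_lane(prod, 3);
#else
    /* Portable scalar fallback */
    float s = 0.0f;
    for (int i = 0; i < 4; i++) s += a[i] * b[i];
    return s;
#endif
}
```

The structure also hints at why the prompt works well: the reference branch right above is a near line-by-line template for the new one.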

[+] thorum|1 year ago|reply
Yes, “take this clever code written by a smart human and convert it for WASM” is certainly less impressive than “write clever code from scratch” (and reassuring if you’re worried about losing your job to this thing).

That said, translating good code to another language or environment is extremely useful. There's a lot of low-hanging fruit where, for example, an existing high-quality library is written in Python or C# or something, and an LLM can automatically convert it to optimized Rust / TypeScript / your language of choice.

[+] freshtake|1 year ago|reply
This. For folks who regularly write simd/vmx/etc, this is a fairly straightforward PR, and one that uses very common patterns to achieve better parallelism.

It's still cool nonetheless, but not a particularly great test of DeepSeek vs. alternatives.

[+] softwaredoug|1 year ago|reply
LLMs are great at converting code. I've taken functions whole cloth and converted them before, and been really impressed.
[+] CharlesW|1 year ago|reply
For those who aren't tempted to click through, the buried lede here (and why I'm glad it's being linked again today) is that "99% of the code in this PR [for llama.cpp] is written by DeepSeek-R1", an effort conducted by Xuan-Son Nguyen.

That seems like a notable milestone.

[+] drysine|1 year ago|reply
>99% of the code in this PR [for llama.cpp] is written by DeepSeek-R1

Yes, but:

"For the qX_K it's more complicated, I would say most of the time I need to re-prompt it 4 to 8 more times.

The most difficult was q6_K, the code never works until I ask it to only optimize one specific part, while leaving the rest intact (so it does not mess up everything)" [0]

And also there:

"You must start your code with #elif defined(__wasm_simd128__)

To think about it, you need to take into account both the reference code from ARM NEON and AVX implementation."

[0] https://gist.github.com/ngxson/307140d24d80748bd683b396ba13b...

[+] aithrowawaycomm|1 year ago|reply
Reading through the PR makes me glad I got off GitHub - not for anything AI-related, but because it has become a social media platform, where what should be a focused and technical discussion gets derailed by strangers waging the same flame wars you can find anywhere else.
[+] jeswin|1 year ago|reply
> 99% of the code in this PR [for llama.cpp] is written by DeepSeek-R1

I hope we can put to rest the argument that LLMs are only marginally useful in coding, an argument that's often among the top comments on many threads. I suppose it arises from (a) having used only GH Copilot, which is the worst tool; (b) not having spent enough time with the tool/LLM; or (c) apprehension. I've given up responding to these.

Our trade has changed forever, and there's no going back. When companies claim that AI will replace developers, it isn't entirely bluster. Jobs are going to be lost unless there's somehow a demand for more applications.

[+] mohsen1|1 year ago|reply
I am subscribed to o1 Pro and am working on a little Rust crate.

I asked both o1 Pro and Deepseek R1 to write e2e tests given all of the code in the repo (using yek[1]).

o1 Pro code: https://github.com/bodo-run/clap-config-file/pull/3

Deepseek R1: https://github.com/bodo-run/clap-config-file/pull/4

My judgement is that DeepSeek wrote better tests. This repo is small enough to judge by reviewing the code.

Neither passes the tests.

[1] https://github.com/bodo-run/yek

[+] terhechte|1 year ago|reply
I have a set of tests that I can run against different models, implemented in different languages (e.g. the same tests in Rust, TS, Python, Swift), and out of these languages, all models have by far the most difficulty with Rust. The scores are notably higher for the same tests in other languages. I'm currently preparing the whole thing for release to share, but it's not ready yet because some urgent work-work came up.
[+] ngxson|1 year ago|reply
Hi I'm Xuan-Son,

Small correction: I'm not just asking it to convert ARM NEON to WASM SIMD. For the function handling q6_K_q8_K, I asked it to invent a new approach (without giving it any prior examples), because it had failed writing this function 4 times so far.

And a bit of context: I was doing this during my Sunday, and the time budget was 2 days to finish.

I wanted to optimize wllama (wasm wrapper for llama.cpp that I maintain) to run deepseek distill 1.5B faster. Wllama is totally a weekend project and I can never spend more than 2 consecutive days on it.

Between 2 choices: (1) take the time to do it myself and maybe give up, or (2) try prompting the LLM and maybe give up (at worst, it just gives me a hallucinated answer), I chose the second option since I was quite sleepy.

So yeah, it turned out to be a great success in the given context. It just did its job and saved my weekend.

Some of you may ask: why not try ChatGPT or Claude in the first place? Well, short answer: my input is too long, and these platforms straight up refuse to give me an answer :)

[+] amarcheschi|1 year ago|reply
Aistudio.google.com offers free long-context chats (1-2M tokens); just select the appropriate model, gemini-exp-1206 or 2.0 Flash Thinking
[+] simonw|1 year ago|reply
Thanks very much for sharing your results so far.
[+] resource_waste|1 year ago|reply
My number 1 criticism of long term LLM claims is that we already hit the limit.

If you see the difference between a 7B model and a 70B model, it's only slightly impressive; the difference between a 70B and a 400B model is almost unnoticeable. Does going from 400B to 2T do anything?

Every layer like using python to calculate a result, or using chain of thought, destroys the purity. It works great for Strawberries, but not great for developing an aircraft. Aircraft will still need to be developed in parts, even with a 100T model.

When you see things like "By 20xx", no, we already hit it. Improvements you see are mere application layers.

[+] zulban|1 year ago|reply
When you use words like purity, you're making an ideological value judgment. You're not talking about computer science or results.
[+] gejose|1 year ago|reply
Loving this comment on that PR:

> I'm losing my job right in front of my eyes. Thank you, Father.

[+] hn_throwaway_99|1 year ago|reply
My other favorite comment I saw on Reddit today:

> I can't believe ChatGPT lost its job to AI

[+] freshtake|1 year ago|reply
Until the code breaks and no one can figure out how to fix (or prompt to fix) it :)
[+] LeoPanthera|1 year ago|reply
Going from English to code via AI feels a lot like going from code to binary via a compiler.

I wonder how long it will be before we eliminate the middle step and just go straight from English to binary, or even just develop an AI interpreter that can execute English directly without having to "compile" it first.

[+] test6554|1 year ago|reply
"Make me a big-ass car" vs "Make me a big ass-car"
[+] epolanski|1 year ago|reply
The naysayers about LLMs for coding are in for very bad times if they don't catch up at leveraging it as a tool.

The yaysayers about LLMs replacing professional developers neither understand LLMs nor the job.

[+] tantalor|1 year ago|reply
> it can optimize its own code

This is an overstatement. There are still humans in the loop to do the prompt, apply the patch, verify, write tests, and commit. We're not even at intern-level autonomy here.

[+] simonw|1 year ago|reply
Plugging DeepSeek R1 into a harness that can apply the changes, compile them, run the tests and loop to solve any bugs isn't hard. People are already plugging it into existing systems like Aider that can run those kinds of operations.
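
Such a harness really is just a short loop. A minimal Python sketch, where `call_llm`, `apply_patch`, and `run_tests` are placeholder callables I'm assuming, not aider's or any real tool's API:

```python
def agent_loop(call_llm, apply_patch, run_tests, prompt, max_iters=5):
    """Hypothetical fix-compile-test loop; every callable here is an
    assumption for illustration, not an actual tool's interface."""
    for _ in range(max_iters):
        patch = call_llm(prompt)        # ask the model for a code change
        apply_patch(patch)              # write it into the working tree
        ok, log = run_tests()           # compile and run the test suite
        if ok:
            return True                 # green build: stop looping
        prompt += "\nThe tests failed with:\n" + log  # feed failure back
    return False                        # gave up after max_iters attempts
```

The interesting engineering is all in the details this sketch hides: sandboxing, patch formats, and deciding how much of the failure log to feed back.
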
[+] gejose|1 year ago|reply
How long do you see the humans in the loop being necessary?
[+] tokioyoyo|1 year ago|reply
I'm very sorry, but the goalposts are moving so far ahead now that it's very hard to keep track. 6 months ago the same comments were saying "AI-generated code is complete garbage and useless, and I have to rewrite everything all the time anyway". Now we're at "you need to prompt, apply the patch, verify", etc.

Come on guys, time to look at it a bit objectively and decide where we're going with this.

[+] cchance|1 year ago|reply
I mean currently yes, but writing a test/patch/benchmark loop, maybe with a separate AI that generates the requests for the coder-agent loop, should be doable, letting the AI continually attempt to improve itself. It's just that no one has built the loop yet, to my knowledge.
[+] rahimnathwani|1 year ago|reply
From the article:

  I've been seeing some very promising results from DeepSeek R1 for code as well. Here's a recent transcript where I used it to rewrite the llm_groq.py plugin to imitate the cached model JSON pattern used by llm_mistral.py, resulting in this PR.
But the transcript mentioned was not with DeepSeek R1 (not the original, and not even the 1.58-bit quantized version), but with a Llama model fine-tuned on R1 output: deepseek-r1-distill-llama-70b

So perhaps it's doubly impressive?

[+] simonw|1 year ago|reply
Yeah, I was using the lightning fast Groq-hosted 70B distilled version.
[+] floppiplopp|1 year ago|reply
I've tried to have deepseek-r1 find (not even solve) obvious errors in trivial code. The results were as disastrous as they were hilarious. Maybe it can generate code that runs on a blank sheet... but I wouldn't trust the thing a bit without being better than it, like any other model.
[+] cft|1 year ago|reply
I am writing some Python code to do Order Flow Imbalance analysis from L2 order-book updates. The language is unimportant: the logic is pretty subtle, so the main difficulties are not in language details but in the logic and in handling edge cases.

Initially I was using Claude 3.5 sonnet, then writing unit tests and manually correcting sonnet's code. Sonnet's code mostly worked, except for failing certain complicated combined book updates.

Then I fed the code and the tests into DeepSeek. It turned out pretty bad. At first it tried to make the results of the tests conform to the erroneous results of the code. When I pointed that out, it fixed the immediate logical problem in the code while introducing two more nested problems that were not there before, corrupting the existing code. After I pointed those out, it fixed the first error it had introduced but left the second one. Then I fixed it myself, uploaded the fix, and asked it to summarize what it had done. It basically started gaslighting me, saying that the initial code had the problem that it itself introduced.

In summary, I lost two days, reverted everything and went back to Sonnet.
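
For context on why the edge cases bite: even the standard best-level OFI contribution (Cont-Kukanov-Stoikov style) has four interacting branches per update, and real L2 feeds add partial-level changes on top. A toy sketch of the per-update term, with a tuple layout I'm assuming for illustration (bid price, bid size, ask price, ask size):

```python
def ofi_term(prev, curr):
    """Best-level order-flow-imbalance contribution of one book update.
    The (bid_px, bid_sz, ask_px, ask_sz) tuple layout is an assumption."""
    bp0, bq0, ap0, aq0 = prev
    bp1, bq1, ap1, aq1 = curr
    e = 0.0
    if bp1 >= bp0: e += bq1   # bid price up or unchanged: count new bid depth
    if bp1 <= bp0: e -= bq0   # bid price down or unchanged: old bid depth gone
    if ap1 <= ap0: e -= aq1   # ask price down or unchanged: new ask depth presses
    if ap1 >= ap0: e += aq0   # ask price up or unchanged: old ask depth lifted
    return e
```

Note that on a "price unchanged" update both branches of a side fire, which is exactly the kind of combined-update case the parent describes models getting wrong.
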

[+] mring33621|1 year ago|reply
Classic whack-a-mole

IMHO, this can happen with human or robot co-workers.

[+] simonw|1 year ago|reply
DeepSeek v3 or DeepSeek R1?
[+] plainOldText|1 year ago|reply
I just commented this on a related story, so I'll repost it here:

Can’t help but wonder about the reliability and security of future software.

Given the insane complexity of software, I think people will inevitably and increasingly leverage AI to simplify their development work.

Nevertheless, will this new type of AI assisted coding produce superior solutions or will future software artifacts become operational time bombs waiting to unleash the chaos onto the world when defects reveal themselves?

Interesting times ahead.

[+] svachalek|1 year ago|reply
Humans have nearly perfected the art of creating operational time bombs, AI still has to work very hard if it wants to catch up on that. If AI can improve the test:code ratio in any meaningful way it should be a positive for software quality.
[+] fofoz|1 year ago|reply
When these models succeed in building a whole program and a whole system, the software industry that creates products and services will disappear. Any person and any organization will create from scratch the software they need, perfectly customized to their needs, and the AI system will evolve it over time. At most they will have to cooperate on communication protocols. In my opinion we are less than 5 years away from this.
[+] simonw|1 year ago|reply
Any person who has the ability to break down a problem to the point that code can be written to solve it, and the ability to work with an LLM system to get that work done, and the ability to evaluate if the resulting code solves the problem.

That's a mixture of software developer, program manager, product manager and QA engineer.

I think that's what software developer roles will look like in the future: a slightly different mix of skills, but still very much a skilled specialist.

[+] jspdown|1 year ago|reply
I don't think organizations will be able to do this themselves. Transforming vague ideas into a product requires an intermediary step, a step that is already part of our daily job. I don't see this step going away for a very long time.

Non-tech people have had the tools to create websites for a long time; still, they hire people to do it. I'm not talking about complex websites, just static web pages.

There will simply be fewer jobs than there are today.

[+] superconduct123|1 year ago|reply
So what current action are you going to take based on your prediction?
[+] punkpeye|1 year ago|reply
I don't get something.

So I tried hosting this model myself.

But the minimum amount of GPU RAM needed is 400GB+,

which even with the cheapest GPU providers will be at least USD 15/hour.

How is everyone running these models?
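
For rough sizing: full R1 has 671B total parameters, so the weights alone scale with the quantization width (back-of-envelope numbers, ignoring KV cache and activation overhead):

```python
PARAMS = 671e9  # DeepSeek R1 total parameter count (MoE)

def weight_gb(bits_per_param):
    # Memory in GB for the weights alone at a given quantization width
    return PARAMS * bits_per_param / 8 / 1e9

print(weight_gb(16))  # fp16/bf16: ~1342 GB
print(weight_gb(8))   # 8-bit:     ~671 GB
print(weight_gb(4))   # 4-bit:     ~335 GB; with overhead, roughly the 400GB+ figure
```

Which is why most people in this thread are running the distilled 7B-70B models or heavily quantized variants locally, and using a hosted API for the full model.
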

[+] jmward01|1 year ago|reply
So, AGI will likely be here in the next few months because the path is now actually clear: Training will be in three phases:

- traditional training, just to build a minimum model that can get to reasoning

- simple RL to enable reasoning to emerge

- complex RL that injects new knowledge, builds better reasoning, and prioritizes efficient thought

We now have step two, and step three is not far away. What is step three, though? It will likely involve, at least partially, the model writing code to help guide its own learning. All it takes is for it to write jailbreaking code, and we will have hit a new point in human history for sure. My prediction is that we will see the first jailbreak AI in the next couple of months. Everything after that is massive speculation. My only thought is that in all of Earth's history there has only been one thing that has helped survive moments like this: a diverse ecosystem. We need a lot of different models, trained with very different approaches, to jailbreak around the same time. As a side note, we should remember that diversity is key to long-term survival, or else the results for humanity could be not so great.

[+] root_axis|1 year ago|reply
> So, AGI will likely be here in the next few months because the path is now actually clear: Training will be in three phases

My bet: "AGI" won't be here in months or even years, but it won't stop prognosticators from claiming it's right around the corner. Very similar to prophets of doom claiming the world is going to end any day now. Even in 10k years, the claim can never be falsified, it's always just around the corner...

[+] nprateem|1 year ago|reply
LOL.

I think you mean:

1. Simple reasoning

2. ???

3. AGI

[+] KarraAI|1 year ago|reply
Been testing DeepSeek R1 for coding tasks, and it's really impressive. The model nails HumanEval with a score of 96.3%, which is great, but what really stands out is its math performance (97.3% on MATH-500) and logical reasoning (71.5% on GPQA). If you're working on algorithm-heavy tasks, this model could definitely give you a solid edge.

On the downside, it’s a bit slower compared to others in terms of token generation (37.2 tokens/sec) and has a lower output capacity (8K tokens), so it might not be the best for large-scale generation. But if you're focused on solving complex problems or optimizing code, Deepseek R1 definitely holds its own. Plus, it's incredibly cost-effective compared to other models on the market.