I'm biased by my preferred style of programming languages, but I think pure, statically typed functional languages are incredibly well suited to LLMs. The purity gives you referential transparency and static-analysis powers that the LLM can leverage to stay on task.
The high level declarative nature and type driven development style of languages like Haskell also make it really easy for an experienced developer to review and validate the output of the LLM.
Early on in the GPT era I had really bad experiences generating Haskell code with LLMs but I think that the combination of improved models, increased context size, and agentic tooling has allowed LLMs to really take advantage of functional programming.
It's not just your bias, I too have found great success with a functional programming style, even from the earliest days of ChatGPT. (Not Haskell, but JS, which the models were always good at.)
I think the underlying reason is that functional programming is very conducive to keeping the context tight and focused. For instance, most logic relevant to a task tends to be concentrated in a few functions and data structures across a smallish set of files. That's all you need to feed into the context.
Contrast that with, say, Java, where the logic is often spread across a deep inheritance hierarchy located in a bunch of separate files. Add to that large frameworks that encapsulate a whole lot of boilerplate and bespoke logic, with magic injected from arbitrary places via e.g. annotations. You'd need to load all of those files (or more likely, simply the whole codebase) and the relevant documentation to get accurate results. And even then the additional context is not just extraneous and expensive, but also polluted with irrelevant data that actually reduces accuracy.
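To make that concrete, here's a contrived little sketch (my own toy example, not from any real codebase) of what "all the relevant logic in one place" looks like:

```python
from dataclasses import dataclass

# Toy example: the entire pricing rule lives in one pure function.
@dataclass(frozen=True)
class Order:
    subtotal: float
    is_member: bool

def total_price(order: Order, tax_rate: float = 0.08) -> float:
    # Everything a reviewer (or an LLM) needs to validate this is on screen.
    discount = 0.10 if order.is_member else 0.0
    return order.subtotal * (1 - discount) * (1 + tax_rate)

print(total_price(Order(subtotal=100.0, is_member=True)))  # ~97.2
```

The OO-framework version of that same rule might spread across a BaseOrder, a DiscountStrategy, a tax mixin, and annotation-driven config, each in its own file, and all of it would have to go into context.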
A common refrain of mine is that for the best results, you have to invest a lot of time experimenting AND adapt yourself to figure out what works best with AI. In my case, it was gradually shifting to a functional style after spending my whole career writing OO code.
Realistically, it's also a function of how many iterations it takes for an AI agent to correctly solve a problem in a given language. I'd imagine most AI agents would frequently have to redo J or F# code, as they are fairly uncommon languages with a much smaller training set than JavaScript or Python.
I can say that for F# this has been mostly true up until quite recently. We use F# at work and were mostly unable to use agents like Claude Code up until the release of Opus 4.5, which seems to know F# quite well.
I program mostly in Clojure and I expected it to be near the top, as it tends to be very concise and expressive (qualities I really admire). I am getting excellent results from Claude Code (Opus 4.5), and I think this might be one of the reasons. I'm using Claude with a large code base and the token-efficiency of Clojure might help with fitting more into the context window.
This is kind of just a measurement of how representative a language is in the distribution of the tokenizer training. You could have a single token equal to “public static void main”.
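You can check this yourself; a quick sketch, assuming the tiktoken package (OpenAI's open-source tokenizer library) is installed:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a GPT-4-era tokenizer

# Boilerplate the tokenizer saw constantly during training compresses into
# far fewer tokens than its character count would suggest.
for snippet in ["public static void main", "def main():", "(defn -main []"]:
    ids = enc.encode(snippet)
    print(f"{snippet!r}: {len(ids)} tokens {ids}")
```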
Well, yes, looking beyond token efficiency I find that the more constrained (stronger and richer static typing) the language the better/faster (fewer rounds of editing and debugging, ergo fewer tokens) the LLM deals with it. C is a nightmare.
I knew it without reading. But having each operation exist in two versions that aren't even closely related to each other (monadic/dyadic) makes learning hard for me. I really appreciate this language for its shortness, but this kind of shortness can be annoying.
Someone has made a programming language called Sui (https://github.com/TakatoHonda/sui-lang), which is said to be designed for LLMs.
However, using index-based variable names in order to "avoid typo bugs" makes it harder to read and write than general-purpose languages, and it also has poor token efficiency :(
It strikes me that more tokens likely give the LLM more time/space to "think". Also that more redundant tokens, like local type declarations instead of type inference from far away, likely often reduce the portion of the code LLMs (and humans) have to read.
So I'm not convinced this is either the right metric, or even if you got the right metric that it's a metric you want to minimize.
With Chain of Thought (text-based thinking), models can already use as much compute as they want in any language (as determined by reinforcement learning training).
This is interesting research; thank you for doing it.
I am not sure token efficiency is an interesting problem in the long term, though.
And in the short term I wonder if prompts could be pre-compiled to “compressed tokens”; the idea would be to use a smaller number of tokens to represent a frequently needed concept; kind of like LZ compression. Or maybe token compression becomes a feature of future models optimized for specific tasks.
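As a toy sketch of the idea (the phrase table is made up, and a real model would need the placeholder symbols in its vocabulary for this to actually save anything):

```python
# Map frequent multi-token idioms to single placeholder symbols, LZ-style.
PHRASES = {
    "public static void main(String[] args)": "§1",
    "if err != nil { return err }": "§2",
}

def compress(text: str) -> str:
    for phrase, symbol in PHRASES.items():
        text = text.replace(phrase, symbol)
    return text

def expand(text: str) -> str:
    for phrase, symbol in PHRASES.items():
        text = text.replace(symbol, phrase)
    return text

src = "public static void main(String[] args) { ... }"
assert expand(compress(src)) == src  # round-trips losslessly
```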
I was wondering last year if it would be worthwhile to create a language that was especially LLM-friendly, e.g. one that embedded more context in the language structure. The idea is to make more of the program, and the thinking behind it, explicit to the LLM, but in a programming-language style, to eliminate the ambiguity of natural language (otherwise one could just use comments).
Then it occurred to me that with current LLM training methodology there's a chicken-and-egg problem; it doesn't start to show rewards until there is a critical mass of good code in the language for LLMs to train on.
On https://danuker.go.ro/programming-languages.html you can find charts of popularity (TIOBE) vs code density for various programming languages together with which programming languages are Pareto-optimal regarding these two criteria.
I suspect DB queries will also benefit from token-efficient query languages as RAG queries grow exponentially. I've been working on one such language that is emitted in a token-efficient IR and compiles to SQL. https://memelang.net/
There is one class of languages missing from the comparison: code-golf languages, e.g. Japt [1], Pyth [2], or Jelly [3].
[1]: https://github.com/ETHproductions/japt
[2]: https://github.com/isaacg1/pyth
[3]: https://github.com/DennisMitchell/jellylanguage
Update: I noticed that the author mentions that "APL's famous terseness isn't a plus for LLMs." Isn't that just a design limitation of the LLM tokenizers?
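A quick way to test that hypothesis, assuming the tiktoken package is installed: compare the character count and token count of an APL one-liner against a spelled-out equivalent.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

apl = "+/⍳10"              # APL: sum of the integers 1 through 10
py = "sum(range(1, 11))"   # the same computation, spelled out

# APL's non-ASCII glyphs fall outside the byte sequences a BPE tokenizer
# learned from mostly-ASCII code, so each glyph can cost multiple tokens.
for code in (apl, py):
    print(f"{code!r}: {len(code)} chars, {len(enc.encode(code))} tokens")
```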
Concatenative languages like Factor and Forth are very token-efficient in theory, close to optimal raw lexical density: no parentheses, no commas, no argument delimiters, just whitespace-separated words. But stack shuffling can add overhead for complex data flow, unless you use "locals" in Factor, for example.
C is surprisingly efficient as well. Minimal keywords, terse syntax, single-character operators. Not much boilerplate, and the core logic is dense.
I think the worst languages are Java, C#, and Rust (lifetime annotations, verbose generics).
In my opinion, C or Go for imperative code, Factor / Forth if the model knows them well.
Is that statement about C based on anything in particular? C was 18th of all the languages in the article's chart (the worst!), which I'd guess is due to its minimal standard library.
I understand your logic, but I have found LLMs to be quite strong at C#. They make minor mistakes, and the mistakes seem related to the complexity of what I'm doing, not the language itself.
This confirms my personal experience with switching to Go from C#: despite the framework and language being MUCH simpler, the code usually ends up the same length.
C# often has a 'nice' way and a 'performant' way of doing things (for example, strings are nice, but they allocate and are UTF-16, while ReadOnlySpan<byte> is faster for UTF-8 and can reuse buffers). The performant syntax often ends up being very verbose, with the nice syntax being barely shorter than Go's. Go also does the right thing by default, and its strings are basically array slices into UTF-8 byte arrays.
Token efficiency is only one metric. Simplicity of syntax and semantics are another valuable one.
Re: tokens and session length, there are other ways to manage this than language choice. Summarization is one; something I do is to not put read_file content in the messages, but rather in the system prompt. This means that when the agent tries to reread a file after an edit, we don't end up with two copies of the file in context.
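A minimal sketch of what I mean (the names are mine, not from any particular framework):

```python
# Keep file contents in a dict rendered into the system prompt, so rereading
# a file after an edit replaces the old copy instead of appending a second one.
files_in_context: dict[str, str] = {}

def read_file(path: str) -> str:
    with open(path) as f:
        files_in_context[path] = f.read()  # overwrite, never duplicate
    return f"(contents of {path} are now in the system prompt)"

def build_system_prompt(base_prompt: str) -> str:
    sections = [f"=== {p} ===\n{body}" for p, body in files_in_context.items()]
    return "\n\n".join([base_prompt, *sections])
```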
Going to 10M-token sessions, keeping per-turn context under 100k, working on Golang... language choice for the sake of tokens does not seem like a good basis for a decision.
I don't think context size is really the limit for larger codebases - it's more about how you use that context.
Claude Code makes some efforts to reduce context size, but at the end of the day it loads entire source files into context (then keeps them there until told to remove them, or until context is compacted). One of the major wins is to run subagents for some tasks, which use their own context rather than loading more into CC's own context.
Cursor makes more efficient use of context by building a vector database of code chunks, then only loading matching chunks into context (I believe it does this for Composer/agentic use as well as for tab/autocomplete).
One of the more obvious ways to reduce context use in a larger multi-module codebase would be to take advantage of the split between small module definition (e.g. C++ .h files) and large module implementations (.cpp files). Generally you'd only need to load module interfaces/definitions into context if you are working on code that uses the module, and Cursor's chunked approach can reduce that further.
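A rough sketch of that policy (paths and layout are made up for illustration):

```python
from pathlib import Path

def context_files(module: str, editing_module: bool) -> list[str]:
    header = Path(f"src/{module}.h")   # small interface: usually all you need
    impl = Path(f"src/{module}.cpp")   # large implementation: load only to edit it
    files = [header.read_text()]
    if editing_module:
        files.append(impl.read_text())
    return files
```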
For a whole-codebase overview, a language server can help locate things, and one could have the AI itself generate short summaries/overviews of the source files and the codebase structure, similar to what a human developer might keep in their head, rather than repeatedly reading entire source files for code that isn't actually being modified.
It seems we're really in the early days of agentic coding tools, and they have a lot of room to get better and more efficient.
The approaches used by Claude Code and Cursor are inefficient. It's possible to calculate a covering set for a piece of code and provide that to an agent directly via a tool, and it turns out that this can reduce context usage in SWE-bench style tasks by >90% over RAG and grep/read.
If you're interested in learning more: https://github.com/sibyllinesoft/scribe
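Purely as a hypothetical sketch of what a covering set means here (this is not scribe's actual algorithm): start from the symbols a task touches, walk the reference graph, and keep only the definitions reachable from them, rather than whole files.

```python
def covering_set(targets: set[str], refs: dict[str, set[str]]) -> set[str]:
    """refs maps each symbol to the symbols its definition references."""
    seen: set[str] = set()
    stack = list(targets)
    while stack:
        sym = stack.pop()
        if sym not in seen:
            seen.add(sym)
            stack.extend(refs.get(sym, ()))
    return seen

# Feed the agent only the definitions in covering_set({"handle_request"}, refs)
# instead of the dozens of whole files that grep or embeddings would surface.
```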
An agent can make summaries via Markdown files while processing, then use those to break the problem into several issues and tackle them one by one, even automatically, but more usually interactively. The problem is the technique now, not the LLM. Yes, it costs a lot (lot) more. But it can do it, and people's work costs way more than tokens.
I would expect that we’ll end up compressing (or whatever term you would use) this at some point so many of those syntactical differences will not be as significant.
But I would love for more expressive and compact languages to do better, selfish as I am. But I think training data size is more of a factor, and we won't all be moving to Clojure any time soon.
I can't speak to Clojure, but I will say that LLMs are actually surprisingly good at writing and understanding Julia code compared to some languages that have a much larger training corpus to pull from.
I kinda (but not really because I don't much care about tokens and don't really know anything about models) wonder about Common Lisp. There's probably far fewer examples of CL code in any training sets than Clojure or Python or whatever, but it could still be somewhat interesting.
I guess it also depends on which dataset LLM was trained on. Rare or niche languages get fragmented into more tokens even if the code itself is short. So two languages with the same number of characters can produce very different token counts because one aligns with what the model has seen millions of times and the other does not.
Semantically, Julia is a fully dynamic language. But the trick is that it recognizes that being static is a constraint on a dynamic language, so it implements dynamic typing by stitching together islands of statically typed code.
Interesting project! Do you mind explaining what brought you to do this research?
I'm a little surprised that the simpler languages tend to use more tokens, but after thinking about it I realized that languages with more expressive syntax let you write with fewer "words". But I also think it's a bit like a race between watches: who really wants to know which watch runs faster?
I'm finding that I have to share more and more code to ensure that various standards are being kept.
For example, I shared some Model code with Claude and Gemini (both via web interfaces) and they both tried to put Controller code into the Model, despite me telling them multiple times that the code was neither wanted nor needed there.
I had to (eventually) share the entire project with the models (despite them having been working with the code all along) before they would comply with my request (whilst also congratulating me on my far superior architecture...)
That costs more tokens for each problem than just saying "here, look at this section and work toward this goal".
Seeing all the C-family languages and JavaScript at the bottom like this makes me wonder if it's not just that curly brackets take a lot of tokens.
But I had never considered that a programming language might be created that's less human-readable/auditable in order to enable LLMs.
Scares me a bit.
We're not building a language for LLMs just yet.
If you're going to write an article, at least do the base research yourself, man.
Plus, they will strongly "pull" the context when the LLM parses them back, to the point of overriding your instructions (true story).
Because that’s what happened in the real world when generating a bunch of untyped Python code.
E.g. when it comes to authoring code, C, which comes in last here, is by far one of the languages that LLMs excel most at.