item 43978357

Type-constrained code generation with language models

257 points | tough | 10 months ago | arxiv.org

127 comments

[+] homebrewer|10 months ago|reply
Hejlsberg mentioned the ability to quickly provide accurate type information to LLMs as one of the reasons for rewriting tsc into Go:

https://youtu.be/10qowKUW82U?t=3186

[+] tough|10 months ago|reply
But isn't TypeScript already a typed language to begin with?
[+] energy123|10 months ago|reply
This is what I'd consider doing if I was a small AI lab. Don't try to build a frontier LLM that beats all benchmarks. Try to make the world's best LLM at one programming language. Create your RL pipeline that puts all your resources into making the LLM the best at that language. Even better if there's a dearth of human-created training data on Github, since all your competitors will be bad at it.

Google somewhat did this with javascript in their latest Gemini-2.5 Pro release. But what about doing it for a smaller language? Google isn't going to do that, but there is still a lot of demand.

[+] eigenspace|10 months ago|reply
I'm not saying this is a bad idea, but it does sound like a rather risky prospect. You're basically proposing a bet against the ability of LLMs to generalize across programming languages, and to embed concepts at a deeper level than the syntax.

Many people do think this, but I'm not sure many of them are running AI labs.

[+] Drakim|10 months ago|reply
It makes sense to specialize an LLM on one programming language, dedicating all of its intellectual space to that one domain, but on the flip side I wonder how much the LLM's sharpness and reasoning capabilities are increased by having more data to train on, even if it's in the wrong programming language.

As a developer, I certainly think my programming skills in a specific language were improved by knowing other languages, so I can contrast and compare.

[+] nurettin|10 months ago|reply
Using the language itself isn't the challenge for LLMs; they do that with a very high success rate. I haven't seen an LLM make syntax errors for several months. Calling the right functions with correct parameters is the challenge your hypothetical AI lab will have to solve (or half-ass it and show great benchmark results).
[+] jiggawatts|10 months ago|reply
This was an obvious next step. Most current products can only restrict the token prediction to valid JSON or a specific JSON schema at best. There's no reason that this should be the only grammar available for constrained output mode.

The real challenge will be to make this detect and switch languages automatically. For example, a snippet of code could include a LaTeX formula in a comment and SQL in a string literal. There are many more examples, such as regex inside a shell script, and so on.

The obvious next step after that is backtracking. It's possible to emit a token that is valid but then allows no further valid completions. In other words, the model can paint itself into a corner. To my knowledge, no current online LLM service uses any kind of backtracking; they run in append ("forwards") mode only.
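The masking step described above can be sketched with a toy grammar. Balanced square brackets stand in for a real JSON or programming-language grammar, and all names here are illustrative, not any vendor's API:

```typescript
// Toy sketch of grammar-constrained decoding: at each step, keep only
// the candidate tokens whose concatenation with the output so far is
// still a valid prefix of the grammar.

// "Grammar": balanced square brackets. A string is a valid prefix iff
// only brackets appear and the nesting depth never goes negative.
function isValidPrefix(s: string): boolean {
  let depth = 0;
  for (const ch of s) {
    if (ch === "[") depth++;
    else if (ch === "]") depth--;
    else return false;
    if (depth < 0) return false;
  }
  return true;
}

// Mask: filter the vocabulary down to tokens that keep the prefix valid.
function allowedTokens(output: string, vocab: string[]): string[] {
  return vocab.filter((tok) => isValidPrefix(output + tok));
}

console.log(allowedTokens("[", ["[", "]", "x"])); // → [ '[', ']' ]
console.log(allowedTokens("[]", ["[", "]", "x"])); // → [ '[' ]
```

Note that this toy grammar happens to have the prefix property: any valid prefix can be completed by appending closing brackets, so masking alone never corners the model. For grammars without that property, per-token validity masking is not enough, and backtracking (or a construction that guarantees the prefix property, as in the paper) is needed.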

[+] foota|10 months ago|reply
I believe Microsoft introduced a framework that did this sort of backtracking that you're suggesting. I'm not sure how much traction it got.
[+] helltone|10 months ago|reply
The backtracking idea is interesting. Could diffusion maybe help? At some point it turns into SAT solving.
[+] nielstron|10 months ago|reply
Re detecting and switching languages: you could run several constraint systems in parallel and switch as soon as one of them rejects the input and another accepts it.

Re backtracking: a core part of this paper is ensuring a prefix property, i.e. there is always a legitimate completion, so the model cannot "corner" itself!

More research is needed into which languages and language features this prefix property can be ensured for.

[+] _jayhack_|10 months ago|reply
Also worth checking out MultiLSPy, effectively a python wrapper around multiple LSPs: https://github.com/microsoft/multilspy

Used in multiple similar publications, including "Guiding Language Models of Code with Global Context using Monitors" (https://arxiv.org/abs/2306.10763), which uses static analysis beyond the type system to filter out e.g. invalid variable names, invalid control flow etc.

[+] nielstron|10 months ago|reply
Yes, this work is super cool too! Note that LSPs cannot guarantee resolving the necessary types that we use to ensure the prefix property, which we leverage to avoid backtracking and generation loops.
[+] LostBenjamin|10 months ago|reply
As an author of this paper, I am very excited to see the great discussion here!

Several people mentioned the generation-compilation-fixing loop. Just want to remind you that our approach works not only for the generation step but also for the fixing step, because fixing is essentially asking LLMs to generate a new version of the code. The paper actually has a "repair" experiment to demonstrate this, and our approach achieves a significant gain there: a 37% relative improvement in functional correctness.

[+] yewW0tm8|10 months ago|reply
37% gain relative to what? What percent of generated functions were incorrect?
[+] tough|10 months ago|reply
Thank you for your research really impressive work!
[+] ArcaneMoose|10 months ago|reply
I think TypeScript is uniquely positioned to be the optimal language for LLMs. Tons of training data (benefiting from all the JS examples as well) plus the structure of types for LLMs to follow and tools to enforce.
[+] pram|10 months ago|reply
LLMs work well with any static analysis tool. I frequently instruct Claude to use stuff like “go vet” and “deadcode” when it goes on a tear and writes a bunch of broken trash and declares mission accomplished.
[+] miki123211|10 months ago|reply
And unlike many other languages, Typescript types are extremely expressive.

For example, you can write a function that takes an object received from an API that uses snake_cased keys, and returns that same object, but with camelCased keys instead. This is not some "special case" in the Typescript compiler, the ability to do this emerges naturally from Typescript's features. I don't know any other language that can do this.

Most people don't know enough TS to use these things effectively, but I think one could train an LLM to be very good at them. The combination of LLMs placing such advanced constraints on themselves, and then generating code based on those constraints, seems extremely powerful.
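The key-renaming trick described above can be sketched with template literal types and key remapping (TypeScript 4.1+). All names here (`CamelCase`, `CamelCased`, `ApiUser`, `camelCaseKey`) are illustrative:

```typescript
// Type-level snake_case → camelCase, built from ordinary TypeScript
// features: conditional types, template literal inference, and key
// remapping in mapped types.
type CamelCase<S extends string> =
  S extends `${infer Head}_${infer Tail}`
    ? `${Head}${Capitalize<CamelCase<Tail>>}`
    : S;

type CamelCased<T> = {
  [K in keyof T as CamelCase<K & string>]: T[K];
};

interface ApiUser {
  user_id: number;
  created_at: string;
}

// Resolves to { userId: number; createdAt: string }
type User = CamelCased<ApiUser>;

// Runtime counterpart for a single key, so values can be converted too:
function camelCaseKey(key: string): string {
  return key.replace(/_([a-z])/g, (_m, c: string) => c.toUpperCase());
}
```

The type-level transform emerges from composing general-purpose features, which is the point being made: nothing here is a special case in the compiler.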

[+] rfoo|10 months ago|reply
> Tons of training data (benefiting from all the JS examples as well)

More != better.

[+] AaronAPU|10 months ago|reply
I can’t be the only one who hopes this was a joke.
[+] OutOfHere|10 months ago|reply
There are languages that constrain types a lot more tightly than TypeScript, e.g. Kotlin, Rust, and Haskell. The more constrained the types, the more correct the program could be.
[+] babyent|10 months ago|reply
It’s better sure but as a power TS user it still sucks at generating better code, and consistently fucks up with generics (or doesn’t use them) or simple types sometimes.
[+] primitivesuave|10 months ago|reply
Completely agree. Even with the basic LLMs in the $20/month Cursor plan, I can work 10x faster on TypeScript codebases than I could otherwise, while for Python that multiple feels more like 2-3x. The autocompletions are especially impressive when there is a well-organized type system.

Also in response to adjacent commenters - many mission-critical TS codebases will disable the use of an explicit "any" with eslint - https://typescript-eslint.io/rules/no-explicit-any/.
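For reference, disabling explicit `any` in a flat-config setup might look like this. A minimal sketch, assuming the `typescript-eslint` package (v7+) is installed:

```typescript
// eslint.config.mjs — minimal sketch, not a complete project config.
import tseslint from "typescript-eslint";

export default tseslint.config(
  ...tseslint.configs.recommended,
  {
    rules: {
      // Forbid the explicit `any` escape hatch.
      "@typescript-eslint/no-explicit-any": "error",
    },
  },
);
```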

[+] muglug|10 months ago|reply
Really cool results!

That this research comes out of universities, and not large AI labs, makes me think those labs believe that larger models are still the way to go.

[+] aibrother|10 months ago|reply
+1 this seems like healthy development
[+] bmc7505|10 months ago|reply
The correct way to do this is with finite model theory but we're not there yet.
[+] slt2021|10 months ago|reply
We really need LLMs trained on ASTs instead of tokens. Is there any research on this?
[+] nielstron|10 months ago|reply
The downside is that you need to properly preprocess code, have less non-code training data, and cannot adapt easily to new programming languages.
[+] int19h|10 months ago|reply
Been using Devin for a few months now, for Typescript and Python.

I've never seen it check in uncompilable code, and watching the Devin console I can see it building and using the code to ensure commits are not complete garbage. When it has checked in compilable, almost-right-but-slightly-wrong code, lint and tests running automatically from CI (it doesn't always run them before checking in) trigger it to push a fix on its own.

Feedback loops are nice, but they can be expensive, and time consuming (oh look at me complain that it takes Devin a whopping 15 minutes to complete a task) so I can definitely see the value in type constraints.

[+] android521|10 months ago|reply
is Devin worth the money? Would it be a big jump in productivity migrating from cursor to Devin?
[+] notnullorvoid|10 months ago|reply
The general idea seems very promising, I had been hoping someone would do something like this since seeing JSON schema structured outputs for LLMs.

Need to dig in a bit more on the implementation, but I was surprised that the paper didn't mention hooking into an existing language service/server. There's more than types that an LLM could leverage from existing language tooling. Auto-imports are a good example: they help the human developer keep a linear writing flow, something an LLM needs even more.

[+] nielstron|10 months ago|reply
The problem with LSPs is that they don't guarantee generating a type annotation that we can use for constraints, i.e. we cannot ensure the prefix property using LSPs. So we had to roll our own :)

Pulling in more features to help the system is definitely worth looking into!

[+] kreetx|10 months ago|reply
They should extend this to Haskell and make use of the Curry-Howard isomorphism: define the program you want by a type signature and have the LLM find the implementation.
[+] koakuma-chan|10 months ago|reply
The vibe code society would benefit way more if libraries hosted their docs in a way that's easy to copy and paste into an LLM.
[+] thesz|10 months ago|reply

  > To address this challenge, we introduce a type-constrained decoding approach that leverages type systems to guide code generation.
This should not work with type inference, even at the level of C++ "auto x = " - "auto" does not constrain "x" at all, and what is right of the equals sign is not constrained either.

In Haskell, the gap is even wider. A long "where" clause may have dependencies constraining things in different directions.
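The same point can be seen in TypeScript terms (illustrative snippet): under inference, the declaration constrains nothing; the type flows from the expression to the variable, not the other way around.

```typescript
// With inference, the declared variable places no constraint on the
// right-hand side; the type flows right-to-left.
const parsed = JSON.parse('{"a": 1}'); // inferred as `any`: unconstrained
const sum = 1 + 2;                     // inferred as `number`, from the RHS

// By contrast, an explicit annotation does constrain the expression:
const n: number = 1 + 2; // the RHS must now produce a number
```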

But what I see as important here is the continuation of the reinvention of Cyc, from a different starting point. ;)

Definitely: "every big LLM has in its support code an ad hoc, bug-ridden, inefficient implementation of half of Cyc." Cyc was written in Lisp, and most LLM support code is C/C++, so it is just a corollary of Greenspun's Tenth Rule.

[+] _alternator_|10 months ago|reply
Until a couple of weeks ago, I considered this a promising approach. What changed? Agents, and Claude Code in particular.

My prior experience was that LLMs were not much better than reading the docs, and certainly you wouldn't get far vibe-coding in Rust. But Claude Code behaves like I would: writing code that doesn't compile (typical LLM behavior), then reading the errors, correcting the code, and iterating until it compiles.

Its first attempt at a graph-based scheduler in Rust took about $3 and 10 minutes to work correctly. It was ~500 LOC, so definitely faster than what I can write in Rust. (To be fair, I spent a decent amount of time drafting a description of what I wanted in a markdown file to get Claude started.)

[+] gdiamos|10 months ago|reply
If you have two methods that both improve accuracy, why not stack them?
[+] seeknotfind|10 months ago|reply
This is anticipated from work on constrained output from LLMs, and it's good to see it being developed. One nitpick though: the paper mentions the complexities of implementing type checking for program prefixes in languages that are not context-free. It's true this is extremely difficult for languages that are context-sensitive, especially because types may be defined after they are used. However, it does not mention that it is impossible to implement such a program for languages with Turing-complete type systems, such as C++. I would never miss such an opportunity to criticize C++ and highlight the need for better language design. I love you, C++.
[+] nielstron|10 months ago|reply
Noted. We'll make sure to criticize Turing-complete type systems more thoroughly next time :))
[+] compacct27|10 months ago|reply
Honestly it's already working great in Cursor. Even adapting one type structure to another is quickly handled.