The discussions around this point are taking it too seriously, even when they are 100% correct. LLMs are not deterministic, so they are not compilers. Sure, if you specify everything - every tiny detail - you can often get them to mostly match. But not 100%. Even if you do fix that, at that point you are coding in English, which is an inefficient language for that level of detail in a specification. And even if you accept that problem, you have still done a ton of work just to fight the fundamental non-deterministic nature of LLMs.
It all feels to me like the guys who make videos of using electric drills to hammer in a nail - Sure, you can do that, but it is the wrong tool for the job. Everyone knows the phrase: "When all you have is a hammer, everything looks like a nail." But we need to also keep in mind the other side of that coin: "When all you have is nails, all you need is a hammer." LLMs are not a replacement for everything that happens to be digital.
I think the point I wanted to make was that even if it were deterministic (which you can technically make it, I guess?) you still shouldn’t live in a world where you’re guided by the “guesses” the model makes when solidifying your intent into concrete code. Discounting hallucinations (I know this is a big concession; I’m trying to make the argument from a disadvantaged point again), I think you need a stronger argument than determinism against someone who claims they can write in English and there's no reason for code anymore, which is the argument I tried to make here. I get your point that I might be taking the discussion too seriously, though.
>LLMs are not deterministic, so they are not compilers.
"Deterministic" is not the right constraint to introduce here. Plenty of software is non-deterministic (such as LLMs! But also consensus protocols, request-routing architectures, GPU kernels, etc.), so why not compilers?
What a compiler needs is not determinism, but semantic closure. A system is semantically closed if the meanings of its outputs are fully defined within the system, correctness can be evaluated internally and errors are decidable. LLMs are semantically open. A semantically closed compiler will never output nonsense, even if its output is nondeterministic. But two runs of a (semantically closed) nondeterministic compiler may produce two correct programs, one being faster on one CPU and the other faster on another. Or such a compiler can be useful for enhancing security, e.g. programs behave identically, resist fingerprinting.
Nondeterminism simply means the compiler selects any element of an equivalence class. Semantic closure ensures the equivalence class is well‑defined.
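A toy sketch of that distinction, using "expressions that evaluate to n" as a deliberately trivial stand-in for the equivalence class of correct programs (hypothetical example, not anyone's real compiler):

```python
import random

def closed_compile(n):
    """Nondeterministically pick any member of a well-defined
    equivalence class: expressions that evaluate to n."""
    candidates = [f"{n - k} + {k}" for k in range(4)]  # all provably equal to n
    choice = random.choice(candidates)
    assert eval(choice) == n  # correctness is decidable inside the system
    return choice

# Two runs may differ syntactically, but both are guaranteed correct:
print(closed_compile(10))
print(closed_compile(10))
```

A semantically open generator has no such internal check: nothing inside the system rules out an output that falls outside the equivalence class.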
They are designed to be deterministic when temperature=0. Some hardware configurations are known to defy that assumption, but when running on well-behaved hardware they most definitely are.
What you call compilers are also nondeterministic on 'faulty' hardware, so...
LLMs are deterministic at minimal temperature. Talking about determinism completely misses the point. The human brain is also non-deterministic and I don't see anybody dismiss human written code based on that. If you remove randomness and choose tokens deterministically, that doesn't magically solve the problems of LLMs.
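For what it's worth, the collapse to determinism at minimal temperature is easy to see in a toy sampler (made-up logits, not a real model): at temperature 0 the softmax sampling step degenerates into a plain argmax.

```python
import math, random

def sample_token(logits, temperature):
    """Pick a token index from logits; temperature=0 collapses to argmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])  # deterministic
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return random.choices(range(len(logits)), weights=weights)[0]

logits = [2.0, 1.0, 0.5]
# At temperature 0 the same input always yields the same token:
assert all(sample_token(logits, 0) == 0 for _ in range(100))
```

Removing the randomness here changes nothing about whether the logits themselves encode a correct program, which is the actual problem.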
This is where the desire to NOT anthropomorphize LLMs actually gets in the way.
We have mechanisms for ensuring output from humans, and those are nothing like ensuring the output from a compiler. We have checks on people, we have whole industries of people whose whole careers are managing people, to manage other people, to manage other people.
With regard to predictability, LLMs essentially behave like people. The same kind of checks that we use for people are needed for them, not the same kind of checks we use for software.
> The same kind of checks that we use for people are needed for them
Those checks work for people because humans and most living beings respond well to reward/punishment mechanisms. It’s the whole basis of society.
> not the same kind of checks we use for software.
We do have systems that are non-deterministic (computer vision, various forecasting models…). We judge those by their accuracy and the likelihood of false positives or false negatives (when it’s a classifier). Why not use those metrics?
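For a classifier those metrics are simple to compute; a quick sketch with made-up labels:

```python
def precision_recall(y_true, y_pred):
    """Judge a non-deterministic classifier by its false positives/negatives."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives
    return tp / (tp + fp), tp / (tp + fn)

y_true = [1, 1, 0, 0, 1]  # ground truth
y_pred = [1, 0, 1, 0, 1]  # model output
precision, recall = precision_recall(y_true, y_pred)  # 2/3 and 2/3
```

The same framing would work for code generation if we could define ground truth, which is exactly the hard part.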
> The same kind of checks that we use for people are needed for them...
The whole benefit of computers is that they don't make stupid mistakes like humans do. If you give a computer the ability to make random mistakes all you have done is made the computer shitty. We don't need checks, we need to not deliberately make our computers worse.
Looking at LLMs as a less-than-completely-reliable compiler is a good idea, but it's misleading to think of them as natural-language-to-implementation compiler because they are actually an anything-to-anything compiler.
If you don't like the results or the process, you have to switch targets or add new intermediates. For example instead of doing description -> implementation, do description -> spec -> plan -> implementation
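A minimal sketch of that staging, with `llm` standing in for whatever model call you actually use (the stub below is purely illustrative):

```python
def pipeline(description, llm):
    """description -> spec -> plan -> implementation, leaving an
    inspectable intermediate artifact at every stage."""
    spec = llm(f"Write a precise spec for: {description}")
    plan = llm(f"Write an implementation plan for this spec:\n{spec}")
    code = llm(f"Implement this plan:\n{plan}")
    return spec, plan, code

# A stub "model" makes the staging visible without any real API:
stub = lambda prompt: f"[output for: {prompt[:28]}...]"
spec, plan, code = pipeline("a todo app", stub)
print(spec)
```

The point of the intermediates is that a human can review or edit the spec and the plan before any code is generated from them.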
The more I use LLMs, the more I find this true. Haskell made me think for minutes before writing one line of code. Result? I stopped using Haskell and went back to Python because with Py I can "think while I code". The separation of thinking|coding phases in Haskell is what my lazy mind didn't want to tolerate.
Same goes with LLMs. I want the model to "get" what I mean, but oftentimes (esp. with Codex) I must be very specific about the project scope and spec. Codex doesn't let me "think while I vibe", because every change is costly and you'd better have a good recovery plan (git?) for when Codex goes astray.
> My stance has been pretty rigid for some time: LLMs hallucinate, so they aren’t reliable building blocks. If you can’t rely on the translation step, you can’t treat it as a serious abstraction layer because it provides no stable guarantees about the underlying system.
This is technically true. But unimportant. When I write code in a higher level language and it gets compiled to machine code, ultimately I am testing statically generated code for correctness. I don’t care what type of weird tricks the compiler did for optimizations.
How is that any different than when someone is testing LLM generated C code? I’m still testing C code that isn’t going to magically be changed by the LLM without my intervention anymore than my C code is going to be changed without my recompiling it.
On this latest project I was on, the Python generated code by Codex was “correct” with the happy path. But there were subtle bugs in the distributed locking mechanics and some other concurrency controls I specified. Ironically, those were both caught by throwing the code in ChatGPT in thinking mode.
No one is using an LLM to compute whether a number is even or odd at runtime.
Because for all high-level languages, errors happen at the level of the language. You do not write programs in Go and then verify them in opcodes with a disassembler. Syntax and runtime errors reference the Go files and symbols, not CPU registers.
The same thing happens in JavaScript. I debug it using a Javascript debugger, not with gdb. Even when using bash script, you don’t debug it by going into the programs source code, you just consult the man pages.
When using an LLM, I would expect not to have to go and verify the code to see if it is actually semantically correct.
> I don’t care what type of weird tricks the compiler did for optimizations.
you might not, but plenty of others do. on the jvm for example, anyone building a performance sensitive application has to care about what the compiler emits + how the jit behaves. simple things like accidental boxing, megamorphic call preventing inlining, etc. have massive effects.
i've spent many hours benchmarking, inspecting in jitwatch, etc.
As a ham radio operator (KA9DGX), I tend to view all of this through the lens of impedance matching, it's my metaphor of choice.
You could use a badly designed antenna with a horrible VSWR at the end of a coax, and effectively communicate with some portion of the world, by using a tuner, which helps cover up the inefficiencies involved. However, doing so loses signal, in both directions. You can add amplification at the antenna for receive (a pre-amp) and transmit with more power, but eventually the coax will break down, possibly well before the legal limit.
It is far better to use a well designed antenna and matching system at the feed point. It maximizes signal transmission in both directions, by reducing losses as much as possible.
--
A compiler matches our cognitive impedance to that of the computer. We don't handle generating opcodes and instruction addresses manually very well. I don't see how an LLM is going to do that any better. Compilers, on the other hand, do it reliably, and very efficiently.
The best cognitive impedance matches happened a while ago, when Visual Basic 6 and Delphi for Windows first came out. You might think LLMs make it easier than that, but you'd be mistaken, for any problem of sufficient complexity.
This is an interesting problem, one I've thought a lot about myself. On one hand, LLMs have the capacity to greatly help people, and I think, especially in the realm of gradually learning how to program, on the other hand, the non-determinism is such a difficult problem to work around.
One current idea of mine is to iteratively make things more and more specific; this is the approach I take with pseudocode-expander ([0]) and it has proven generally useful. I think there's a lot of value in the LLM, instead of one-shot generating something linearly, building from the top down with human feedback, for instance. I give a lot more examples on the repo for this project, and encourage any feedback or thoughts on LLM-driven code generation in a more sustainable way than vibe-coding.
As for the argument that modern compilers also have nondeterminism in places: that is still comparing apples to oranges.
An LLM's nondeterminism means it can always choose from the entire token space, including junk.
A compiler's nondeterminism means choosing, possibly randomly, one of a set of VALID solutions. Whatever the nondeterministic outcome is, you are guaranteed a valid solution.
I agree LLMs shouldn't be "compilers" because that implies abstracting away all decisions embedded in the code. Code is structured decisions and we will always want access and control over those decisions. We might not care about many of those decisions, but some of those we absolutely do. Some might be architectural, some might be we want the button to always be red.
This is why I think the better goal is an abstraction layer that differentiates human decisions from default (LLM) decisions. A sweeping "compiler" locks humans out of the decision making process.
Have you ever led a project where you had to give the specs to other developers? Have you ever contracted out a complete implementation to a consulting company? Those are just really slow, Mechanical Turk-style human LLMs.
Here's an experiment that might be worth trying: temporarily delete a source file, ask your coding agent to regenerate it, and examine the diffs to see what it did differently.
This could be a good way to learn how robust your tests are, and also what accidental complexity could be removed by doing a rewrite. But I doubt that the results would be so good that you could ask a coding agent to regenerate the source code all the time, like we do for compilers and object code.
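The comparison step of that experiment is easy to script; a sketch using Python's difflib, where `regenerated` is whatever text your agent produced for the deleted file:

```python
import difflib

def regen_report(original: str, regenerated: str) -> list[str]:
    """Summarize what the agent did differently from the deleted original."""
    diff = difflib.unified_diff(
        original.splitlines(), regenerated.splitlines(),
        fromfile="original", tofile="regenerated", lineterm="",
    )
    return list(diff)

# Toy example: the agent renamed parameters and added a comment.
old = "def add(a, b):\n    return a + b\n"
new = "def add(x, y):\n    # sum two numbers\n    return x + y\n"
for line in regen_report(old, new):
    print(line)
```

Running the test suite against the regenerated version then tells you how much of the original's behavior the tests actually pin down.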
I just had Claude rewrite a utility that was simple as far as code goes, but complex if you didn’t know the gotchas of a particular AWS service. It was much better than my implementation, and it already knew how things work underneath.
For context, my initial implementation went through the official AWS open source process (no longer there) five years ago, and I’m still getting occasional emails and LinkedIn messages because it’s one of the best publicly available ways to solve the problem. The last couple of times, I basically gave the person the instructions I gave ChatGPT (since I couldn’t give them the code) and told them to have it regenerate the code in Python. It would do much better than what I wrote, since I didn’t know the service as well then as I do now, and the service has more features that you have to be concerned about.
In the comparison to compilers, it's relevant to point out that work on them began in the 1950s. That they were basically solid by the time most people here used them should be looked at with that time frame in mind. ChatGPT came out in 2022, 3-4 years ago. Compilers have had around three quarters of a century to get where they are today. I'll probably be dead in seventy years, never mind have any idea what AI (or society) is going to look like then!
But for reference, we don't (usually) care which register the compiler uses for which variable; we just care that it works, with no bugs. If the non-determinism of LLMs means the variable is called file, filename, fileName, or file_name, breaking with convention, why do we care? At the level Claude lets us work with code now, it's immaterial.
Compilation isn't stable. If you clear caches and recompile, you don't get a bit-for-bit exact copy, especially on today's multi-core processors, without doing extra work to get there.
But the reason we don't care which register the compiler uses is that compilers, even without strict stability, reliably enforce abstractions that free us from having to care. If your compiler decided on 5% of inputs that it just doesn't feel like using more than two data registers, you'd have to think about it on 100% of inputs.
Compilation is transforming one computing model to another. LLMs aren't great at everything, but seem particularly well suited for this purpose.
One of the first things I tried to have an llm do is transpile. These days that works really well. You find an interesting project in python, i'm a js guy, boom js version. Very helpful.
You see a business you like, boom competing business.
These are going to turn into business factories.
Anthropic has a business factory. They can make new businesses. Why do they need to sell that at all once it works?
We're focusing on a compiler implementation. Classic engineering mindset. We focus on the neat things that entertain us. But the real story is what these models will actually be doing to create value.
> The underspecification forces the model to guess the data model, edge cases, error behavior, security posture, performance tradeoffs in your program
It’s guessing using the entire sum total of its ingested knowledge, and, it’s reasonable to assume frontier labs are investing heavily into creating and purchasing synthetic & higher quality data.
Judgment is a byproduct of having seen a great many things, only some of which work, and being able to apply that in context.
For most purposes (granted not all) it won’t matter which of the many possible programs you get - as long as it’s usable and does the task, it’ll be fine.
There are people playing around with straight machine code generation, or integrating ML into the optimisation backend, finally compiling via a translation to an existing language is already a given in vibe coding with agents.
Speaking of which, using agentic runtimes is hardly any different from writing programs, there are some instructions which then get executed just like any other applications, and if it gets compiled before execution or plainly interpreted, becomes a runtime implementation detail.
Are we there yet without hallucinations?
Not yet, however the box is already open, and there are enough people trying to make it happen.
I’m not sure I entirely agree with this. The example that comes to mind is text rendering in browsers. You can define a website, and how it will look, pretty well but not perfectly. There’s going to be some minor differences, like in the text rendering pipeline.
I think it’s more productive to chart all of these systems, LLMs included, on a line of abstraction leakiness. Even disregarding their stochastic nature, I think they’re a much too leaky abstraction to find any use in compilers. There’s a giant mismatch that I think is too big to reconcile.
Anyone who knows 0.1% about LLMs should know that they are not deterministic systems and are totally unpredictable with their outputs meaning that they cannot become compilers at all.
Anyone that knows 0.1% about GC and JIT compilers also knows how hard it is to have deterministic behaviour, and how much their behaviour is driven by heuristics.
> It’s that the programming interface is functionally underspecified by default. Natural language leaves gaps; many distinct programs can satisfy the same prompt. The LLM must fill those gaps.
I think this is an interesting development, because we (linguists and logicians in particular) have spent a long time developing a highly specified language that leaves no room for ambiguity. One could say that natural language was considered deficient – and now we are moving in the exact opposite direction.
Well, they could be, if we had a way to restore error state: setting a trap, or catching signals by setting handlers and saving/restoring stack and registers. Then, just like some JIT compilation, we could progressively "fix" the assembly/machine instructions. Most "functions" are pretty short and the transformer architecture should be able to do it, but the trickier part, I think, will be referencing global memory constants.
Maybe God was so angry seeing His fellows embracing LLMs. So He asked vaguely one of those lame things, for the first time:
0. "Make something cool out of this insane amount of energy." (temp: 10^42 Kelvin)
1. He slept for a while.
2. Datacenter exploded His realm.
3. ~380 000 years passed and fiat lux.
4. ~13 billion years passed and here we are.
5. JMP 0.
One thing that's missing from this is that the specification itself only matters insofar as it meets its own meta-specification of "what people will use/pay for". LLMs may have an easier time understanding that than what a specific developer wants from them - a perfect implementation of an un-marketable product is mostly pointless.
That doesn't make a difference here. Even with a nonzero temperature, an LLM could still be deterministic as long as you have control of its random seed. As the article says:
"This gets to my core point. What changes with LLMs isn’t primarily nondeterminism, unpredictability, or hallucination. It’s that the programming interface is functionally underspecified by default."
Even if you turn the temperature down to 0, it's not deterministic. Floating points are messy. If there is even a tiny difference when it comes to the order of operations on the actual GPU that's running the billions of parallelized floating point operations over and over, it's very possible to end up with changing top probability logits.
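You can see the underlying non-associativity without any GPU at all:

```python
# Floating point addition is not associative, so reduction order matters.
left = (0.1 + 0.2) + 0.3   # 0.6000000000000001
right = 0.1 + (0.2 + 0.3)  # 0.6
assert left != right

# A more dramatic case: a small term is lost or kept depending on order.
assert (1.0 + 1e16) + -1e16 == 0.0  # the 1.0 is absorbed and lost
assert 1.0 + (1e16 + -1e16) == 1.0  # the 1.0 survives
```

A GPU summing billions of such terms in a nondeterministic reduction order can therefore produce slightly different logits run to run, which is enough to flip the argmax when two tokens are nearly tied.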
More to the point: is randomness of representation or implementation an inherent issue if the desired semantics of a program are still obeyed?
This is not really a point about whether LLMs can currently be used as English compilers, but more questioning whether determinism of the final machine code output is a critical property of a build system.
No, for the reasons given in the sibling comments: you won't want to be locked into a single model for the rest of time and, even if you did, floating point execution order will still cause non-determinism.
I'm actually building this and believe I've overcome the most difficult aspects mentioned here. It will be released as Open Source next week. https://intentcode.dev
My biggest AI win so far was using ChatGPT as a transpiler to convert from vanilla JS code to GLSL. It took 7 prompts and about 1.5 hours, but without the AI, I would have been thrilled to have completed the project in a week.
LLMS are great for writing compilers and compiler tests however. Also, if you have to write assembly code or machine code and can’t use a compiled high level language, LLMs are big help there also.
They're giant pattern regurgitators, impressive for sure, but they can only be as good as their training data, which is why they seem to be more effective for TypeScript, Python, etc. Nothing less, nothing more. No AGI, no "Job X is done". Hallucinations are a feature; otherwise they would just spit out training data. The thing is, the whole discussion around these tools is so miserable that I'm pondering the idea of canceling from every corner of the internet. The fatigue is real, and pushing back on the hype feels so exhausting, worse than crypto, NFTs, and web3. I'm a user of these tools; the reason I push back on the hype is that its ripple effects arrive inside my day job, and I'm exhausted by people handing you generated shit just to try to make a point, saying "see? like that".
>From one gut feeling I derive much consolation: I suspect that machines to be programmed in our native tongues —be it Dutch, English, American, French, German, or Swahili— are as damned difficult to make as they would be to use.
Yes, because no one conflates an engineer with a compiler. But there are people making the argument that we should treat natural language specs/prompts as the new source and computer language code as a transient artifact.
If you have decent unit and functional tests, why do you care how the code is written?
This feels like the same debate assembly programmers had about C in the 70s. "You don’t understand what the compiler is doing, therefore it’s dangerous". Eventually we realised the important thing isn’t how the code was authored but whether the behaviour is correct, testable, and maintainable.
If code generated by an LLM:
- passes a real test suite (not toy tests),
- meets performance/security constraints,
- goes through review like any other change,
then the acceptance criteria haven’t changed. The test suite is part of the spec. If the spec is enforced in CI, the authoring tool is secondary.
The real risk isn’t "LLMs as compilers", it’s letting changes bypass verification and ownership. We solved that with C, with large dependency trees, with codegen tools. Same playbook applies here.
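To make "the test suite is part of the spec" concrete, a hypothetical toy example in pytest style (`parse_amount` and its behavior are invented for illustration): the checks are the same regardless of who, or what, authored the implementation.

```python
def parse_amount(s: str) -> int:
    """Implementation under test -- could be hand-written or LLM-generated.
    Converts a dollar string to integer cents."""
    return round(float(s.replace("$", "").replace(",", "")) * 100)

# These tests ARE the enforced spec; CI runs them on every change.
def test_plain_dollars():
    assert parse_amount("$3.50") == 350

def test_thousands_separator():
    assert parse_amount("$1,234.00") == 123400

def test_no_symbol():
    assert parse_amount("12") == 1200

# Stand-in for CI invoking pytest:
for t in (test_plain_dollars, test_thousands_separator, test_no_symbol):
    t()
```

If the implementation is regenerated and the suite still passes, the spec, as far as it is written down, is still met; anything the suite doesn't cover was never really specified.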
If you give expected input and get expected output, why does it matter how the code was written?
Because testing at this level is likely impossible across all domains of programming. You can narrow the set of inputs and get relatively far, but the more complex the system, the broader the space of problems becomes. And even a simple CRUD app on an EC2 instance has a lot more failure modes than people are able to test for with current tools.
> passes a real test suite (not toy tests)
“not toy tests” is doing a lot of heavy lifting here. Like an immeasurable amount of lifting.
Are LLMs not already compilers? They translate human natural language to code pretty well now. But yeah, they probably don't fit the bill of compiling English-based code to machine code.
A lot of people are mentally modeling the idea that LLMs are either now or will eventually be infinitely capable. They are and will stubbornly persist in being finite, no matter how much capacity that "finite" entails. For the same reason that higher level languages allow humans to worry less about certain details and more about others, higher level languages will allow LLMs to use more of their finite resources on solving the hard problems as well.
Using LLMs to do something like what a compiler can already do is also modelling LLMs as infinite rather than finite. In fact in this particular situation not only are they finite, they're grotesquely finite, in particular, they are expensive. For example, there is no world where we just replace our entire infrastructure from top to bottom with LLMs. To see that, compare the computational effort of adding 10 8-digit numbers with an LLM versus a CPU. Or, if you prefer something a bit less slanted, the computational costs of serving a single simple HTTP request with modern systems versus an LLM. The numbers run something like LLMs being trillions of times more expensive, as an opening bid, and if the AIs continue to get more expensive it can get even worse than that.
For similar reasons, using LLMs as a compiler is very unlikely to ever produce anything even remotely resembling a payback versus the cost of doing so. Let the AI improve the compiler instead. (In another couple of years. I suspect today's AIs would find it virtually impossible to significantly improve an already-optimized compiler today.)
Moreover, remember, oh, maybe two years back when it was all the rage to have AIs be able to explain why they gave the answer they did? Yeah, I know, in the frenzied greed to be the one to grab the money on the table, this has sort of fallen by the wayside, but code is already the ultimate example of that. We ask the LLM to do things, it produces code we can examine, and the LLM session then dies away leaving only the code. This is a good thing. This means we can still examine what the resulting system is doing. In a lot of ways we hardly even care what the LLM was "thinking" or "intending", we end up with a fantastically auditable artifact. Even if you are not convinced of the utility of a human examining it, it is also an artifact that the next AI will spend less of its finite resources simply trying to understand and have more left over to actually do the work.
We may find that we want different programming languages for AIs. Personally I think we should always try to retain that ability for humans to follow it, even if we build something like that. We've already put the effort into building AIs that produce human-legible code and I think it's probably not that great a penalty in the long run to retain that. At the moment it is hard to even guess what such a thing would look like, though, as the AIs are advancing far faster than anyone (or any AI) could produce, test, prove out, and deploy such a language, against the advantage of other AIs simply getting better at working with the existing coding systems.
Stop this. This is such a stupid way of describing mistakes from AI. Please try to use the confusion matrix or any other framework. If you're going to make arguments, it's hard to take them seriously if you keep regurgitating that LLMs hallucinate. It's not a well-defined term, so if you continually make this your core argument, it becomes disingenuous.
That was a painful read for me. It reminds me of a specific annoyance I had at university with a professor who loved to make sweeping, abstract claims that sounded incredibly profound in the lecture hall but evaporated the moment you tried to apply them. It was always a hidden 'I-am-very-smart' attempt that fell apart if you actually deconstructed the meaning, the logic, or the claimed results. This article is the exact same breed of intellectualizing. It feels deep, but there is no actual logical hold if you break up the claims and deductive steps.
You can see it clearly if you just translate the article's expensive vocabulary into plain English. When the author writes, 'When you hand-build, the space of possibilities is explored through design decisions you’re forced to confront,' they are just saying, 'When you write code yourself, you have to choose how to write it.' When they claim, 'contextuality is dominated by functional correctness,' they just mean, 'Usually, we just care if the code works.' When they warn about 'inviting us to outsource functional precision itself,' they really mean, 'LLMs let you be lazy.' And finally, 'strengthening the will to specify' is just a dramatic way of saying, 'We need to write better requirements.' It is obscurantism plain and simple: using complexity to hide the fact that the insight is trivial.
But that is just an aesthetic problem to me. Worse: the argument collapses entirely when you look at the logical leap between the premises.
The author basically argues that because Natural Language is vague, engineers will inevitably stop caring about the details and just accept whatever reasonable output the AI gives. This is pure armchair psychology. It assumes that just because the tool allows for vagueness, professionals will suddenly abandon the concept of truth or functional requirements. That is a massive, unsubstantiated jump.
We use fuzzy matching to find contacts on our phones all the time. Just because the search algorithm is imprecise doesn't mean we stop caring if we call the right person. We don't say, 'Well, the fuzzy match gave me Bob instead of Bill, I guess I'll just talk to Bob now.' The hard constraint, the functional requirement of talking to the specific person you need, remains absolute. Similarly, in software, the code either compiles and passes the tests, or it doesn't. The medium of creation might be fuzzy, but the execution environment is binary. We aren't going to drift into accepting broken banking software just because the prompt was in English.
This entire essay feels like the work of those social psychology types who have now been thoroughly discredited by the replication crisis in psychology. The ones who were more concerned with dazzling people with verbal skills than with being right. It is unnecessarily complex, relying on projection of dreamt-up concepts and behavior rather than observation. It tries to sound profound by turning a technical discussion into a philosophical crisis, but underneath the word salad it is not just shallow, it is wrong.
codingdave|23 days ago
It all feels to me like the guys who make videos of using using electric drills to hammer in a nail - Sure, you can do that, but it is the wrong tool for the job. Everyone knows the phrase: "When all you have is a hammer, everything looks like a nail." But we need to also keep in mind the other side of that coin: "When all you have is nails, all you need is a hammer." LLMs are not a replacement for everything that happens to be digital.
alpaylan|23 days ago
CGMthrowaway|23 days ago
"Deterministic" is not the the right constraint to introduce here. Plenty of software is non-deterministic (such as LLMs! But also, consensus protocols, request routing architecture, GPU kernels, etc) so why not compilers?
What a compiler needs is not determinism, but semantic closure. A system is semantically closed if the meanings of its outputs are fully defined within the system, correctness can be evaluated internally and errors are decidable. LLMs are semantically open. A semantically closed compiler will never output nonsense, even if its output is nondeterministic. But two runs of a (semantically closed) nondeterministic compiler may produce two correct programs, one being faster on one CPU and the other faster on another. Or such a compiler can be useful for enhancing security, e.g. programs behave identically, resist fingerprinting.
Nondeterminism simply means the compiler selects any element of an equivalence class. Semantic closure ensures the equivalence class is well‑defined.
9rx|23 days ago
They are designed to be where temperature=0. Some hardware configurations are known defy that assumption, but when running on perfect hardware they most definitely are.
What you call compilers are also nondeterministic on 'faulty' hardware, so...
bee_rider|23 days ago
WithinReason|23 days ago
mickdarling|23 days ago
We have mechanisms for ensuring output from humans, and those are nothing like ensuring the output from a compiler. We have checks on people, we have whole industries of people whose whole careers are managing people, to manage other people, to manage other people.
with regards to predictability LLMs essentially behave like people in this manner. The same kind of checks that we use for people are needed for them, not the same kind of checks we use for software.
skydhash|23 days ago
Those checks works for people because humans and most living beings respond well to rewards/punishment mechanisms. It’s the whole basis of society.
> not the same kind of checks we use for software.
We do have systems that are non-deterministic (computer vision, various forecasting models…). We judge those by their accuracy and the likelihood of false positives or false negatives (when it's a classifier). Why not use those metrics?
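As a sketch of what judging by those metrics looks like, here is a minimal confusion-matrix computation. The predictions and labels are made up for illustration; the point is that we grade the system by its error rates, not by demanding one exact output:

```python
# Judge a non-deterministic system by its error rates over a labeled sample.
predictions = [1, 1, 0, 1, 0, 0, 1, 0]   # what the system said (hypothetical)
labels      = [1, 0, 0, 1, 0, 1, 1, 0]   # ground truth (hypothetical)

# The four cells of the confusion matrix.
tp = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
fp = sum(p == 1 and y == 0 for p, y in zip(predictions, labels))
fn = sum(p == 0 and y == 1 for p, y in zip(predictions, labels))
tn = sum(p == 0 and y == 0 for p, y in zip(predictions, labels))

accuracy  = (tp + tn) / len(labels)   # how often it was right overall
precision = tp / (tp + fp)            # how trustworthy its positives are
recall    = tp / (tp + fn)            # how many real positives it caught
```

The same framing transfers to LLM output: run the model against a held-out suite of tasks and report the rates, rather than expecting bitwise-identical answers.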
bigstrat2003|23 days ago
The whole benefit of computers is that they don't make stupid mistakes like humans do. If you give a computer the ability to make random mistakes all you have done is made the computer shitty. We don't need checks, we need to not deliberately make our computers worse.
mvr123456|23 days ago
If you don't like the results or the process, you have to switch targets or add new intermediates. For example instead of doing description -> implementation, do description -> spec -> plan -> implementation
behnamoh|23 days ago
The more I use LLMs, the more I find this true. Haskell made me think for minutes before writing one line of code. Result? I stopped using Haskell and went back to Python because with Py I can "think while I code". The separation of thinking|coding phases in Haskell is what my lazy mind didn't want to tolerate.
Same goes with LLMs. I want the model to "get" what I mean, but oftentimes (esp. with Codex) I must be very specific about the project scope and spec. Codex doesn't let me "think while I vibe", because every change is costly and you'd better have a good recovery plan (git?) when Codex goes astray.
raw_anon_1111|23 days ago
This is technically true. But unimportant. When I write code in a higher level language and it gets compiled to machine code, ultimately I am testing statically generated code for correctness. I don’t care what type of weird tricks the compiler did for optimizations.
How is that any different than when someone is testing LLM generated C code? I’m still testing C code that isn’t going to magically be changed by the LLM without my intervention anymore than my C code is going to be changed without my recompiling it.
On this latest project I was on, the Python code generated by Codex was "correct" on the happy path. But there were subtle bugs in the distributed locking mechanics and some other concurrency controls I specified. Ironically, those were both caught by throwing the code into ChatGPT in thinking mode.
No one is using an LLM to compute whether a number is even or odd at runtime.
skydhash|23 days ago
The same thing happens in JavaScript. I debug it using a JavaScript debugger, not with gdb. Even when using a bash script, you don't debug it by going into the programs' source code; you just consult the man pages.
When using an LLM, I would expect not to have to go and verify the code to see if it is actually semantically correct.
rileymichael|23 days ago
you might not, but plenty of others do. on the jvm for example, anyone building a performance sensitive application has to care about what the compiler emits + how the jit behaves. simple things like accidental boxing, megamorphic call preventing inlining, etc. have massive effects.
i've spent many hours benchmarking, inspecting in jitwatch, etc.
mikewarot|22 days ago
You could use a badly designed antenna with a horrible VSWR at the end of a coax, and effectively communicate with some portion of the world, by using a tuner, which helps cover up the inefficiencies involved. However, doing so loses signal, in both directions. You can add amplification at the antenna for receive (a pre-amp) and transmit with more power, but eventually the coax will break down, possibly well before the legal limit.
It is far better to use a well designed antenna and matching system at the feed point. It maximizes signal transmission in both directions, by reducing losses as much as possible.
--
A compiler matches our cognitive impedance to that of the computer. We don't handle generating opcodes and instruction addresses manually very well. I don't see how an LLM is going to do that any better. Compilers, on the other hand, do it reliably, and very efficiently.
The best cognitive impedance matches happened a while ago, when Visual Basic 6 and Delphi for Windows first came out. You might think LLMs make it easier than that, but you'd be mistaken, for any problem of sufficient complexity.
explosion-s|23 days ago
One current idea of mine is to iteratively make things more and more specific. This is the approach I take with psuedocode-expander ([0]) and it has proven generally useful. I think there's a lot of value in the LLM building from the top down with human feedback, for instance, instead of one-shot generating something linearly. I give a lot more examples on the repo for this project, and encourage any feedback or thoughts on LLM-driven code generation in a way more sustainable than vibe-coding.
[0]: https://github.com/explosion-Scratch/psuedocode-expander/
Tade0|23 days ago
Well, you can always set temperature to 0, but that doesn't remove hallucinations.
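To illustrate why temperature=0 gives determinism without fixing hallucination, here is a minimal sketch of softmax sampling with made-up logits. At temperature 0 the pick collapses to argmax, so it is always the same; but if the model's top-scoring token is wrong, you deterministically get the wrong token:

```python
import math
import random

def sample(logits, temperature):
    """Pick a token index. temperature=0 collapses to argmax (deterministic)."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # Otherwise, softmax with temperature: every token keeps nonzero probability.
    weights = [math.exp(score / temperature) for score in logits]
    return random.choices(range(len(logits)), weights=weights)[0]

logits = [2.0, 1.0, 0.5]  # hypothetical model scores; token 0 might still be wrong

# Determinism: a hundred draws at temperature 0 all agree.
assert all(sample(logits, 0) == 0 for _ in range(100))
```

Determinism here is a property of the selection rule, not of the scores: whether token 0 is a hallucination is decided by the model's weights, which temperature never touches.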
vrighter|13 days ago
The argument that modern compilers also have nondeterminism in places is still comparing apples to oranges.
An LLM's nondeterminism means it can always choose from the entire token space, including junk.
A compiler's nondeterminism is to, possibly randomly, choose from one of a set of VALID solutions. Whatever the nondeterministic outcome is, you are guaranteed a valid solution.
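The distinction can be sketched in a few lines. Everything here is made up for illustration (the "instruction" strings, the vocabulary, the weights); the point is which set each picker draws from:

```python
import random

# A nondeterministic "compiler": it picks from a closed set of outputs that are
# all known to be valid. Junk is unreachable by construction.
valid_programs = [
    "mov r1,1; mov r2,2",   # two orderings of independent instructions,
    "mov r2,2; mov r1,1",   # both semantically equivalent
]

def nondeterministic_compiler():
    return random.choice(valid_programs)

# An LLM-style sampler: it draws from the entire vocabulary, where every token,
# including "JUNK", has nonzero probability.
vocab = ["mov", "r1", "r2", "JUNK"]

def llm_sampler(weights=(10, 5, 5, 1)):
    return random.choices(vocab, weights=weights)[0]

# Every compiler output is valid by construction; the sampler can always emit JUNK.
assert all(nondeterministic_compiler() in valid_programs for _ in range(100))
```

The compiler's randomness ranges over an equivalence class of correct answers; the sampler's randomness ranges over everything expressible, which is exactly the semantic-openness problem raised upthread.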
lubujackson|23 days ago
This is why I think the better goal is an abstraction layer that differentiates human decisions from default (LLM) decisions. A sweeping "compiler" locks humans out of the decision making process.
raw_anon_1111|23 days ago
skybrian|23 days ago
This could be a good way to learn how robust your tests are, and also what accidental complexity could be removed by doing a rewrite. But I doubt that the results would be so good that you could ask a coding agent to regenerate the source code all the time, like we do for compilers and object code.
raw_anon_1111|23 days ago
For context, my initial implementation went through the official AWS open source process (no longer there) five years ago, and I'm still getting occasional emails and LinkedIn messages because it's one of the best publicly available ways to solve the problem. The last couple of times, I basically gave the person the instructions I gave ChatGPT (since I couldn't give them the code) and told them to have it regenerate the code in Python. It would do much better than what I wrote back when I didn't know the service as well as I do now, and the service has more features you have to be concerned about.
fragmede|23 days ago
But for reference, we don't (usually) care which register the compiler uses for which variable; we just care that it works, with no bugs. If the non-determinism of LLMs means the variable is called file, filename, fileName, or file_name, breaking with convention, why do we care? At the level Claude lets us work with code now, it's immaterial.
Compilation isn't stable. If you clear caches and recompile, you don't get a bit-for-bit exact copy, especially on today's multi-core processors, without doing extra work to get there.
SpicyLemonZest|23 days ago
dpweb|23 days ago
One of the first things I tried to have an LLM do is transpile. These days that works really well. You find an interesting project in Python, I'm a JS guy, boom, JS version. Very helpful.
echelon|23 days ago
You see a business you like, boom competing business.
These are going to turn into business factories.
Anthropic has a business factory. They can make new businesses. Why do they need to sell that at all once it works?
We're focusing on a compiler implementation. Classic engineering mindset. We focus on the neat things that entertain us. But the real story is what these models will actually be doing to create value.
cadamsdotcom|21 days ago
It’s guessing using the entire sum total of its ingested knowledge, and, it’s reasonable to assume frontier labs are investing heavily into creating and purchasing synthetic & higher quality data.
Judgment is a byproduct of having seen a great many things, only some of which work, and being able to apply that in context.
For most purposes (granted not all) it won’t matter which of the many possible programs you get - as long as it’s usable and does the task, it’ll be fine.
pjmlp|23 days ago
There are people playing around with straight machine code generation, or integrating ML into the optimisation backend, while compiling via translation to an existing language is already a given in vibe coding with agents.
Speaking of which, using agentic runtimes is hardly any different from writing programs: there are some instructions which then get executed just like any other application, and whether it gets compiled before execution or plainly interpreted becomes a runtime implementation detail.
Are we there yet without hallucinations?
Not yet, however the box is already open, and there are enough people trying to make it happen.
olivia-banks|23 days ago
I think it’s more productive to chart all of these systems, LLMs included, on a line of abstraction leakiness. Even disregarding their stochastic nature, I think they’re a much too leaky abstraction to find any use in compilers. There’s a giant mismatch that I think is too big to reconcile.
rvz|23 days ago
The obvious has been stated.
pjmlp|23 days ago
WithinReason|23 days ago
plastic-enjoyer|23 days ago
I think this is an interesting development, because we (linguists and logicians in particular) have spent a long time developing a highly specified language that leaves no room for ambiguity. One could say that natural language was considered deficient – and now we are moving in the exact opposite direction.
throwaway2027|23 days ago
Martin_Silenus|22 days ago
unknown|22 days ago
[deleted]
somesortofthing|23 days ago
aethrum|23 days ago
kibwen|23 days ago
"This gets to my core point. What changes with LLMs isn’t primarily nondeterminism, unpredictability, or hallucination. It’s that the programming interface is functionally underspecified by default."
helloplanets|23 days ago
abm53|23 days ago
This is not really a point about whether LLMs can currently be used as English compilers, but more questioning whether determinism of the final machine code output is a critical property of a build system.
nickm12|20 days ago
MyHonestOpinon|23 days ago
unknown|23 days ago
[deleted]
jasfi|22 days ago
calebm|23 days ago
seanmcdirmid|22 days ago
hollowturtle|23 days ago
lambda-lollipop|23 days ago
>From one gut feeling I derive much consolation: I suspect that machines to be programmed in our native tongues —be it Dutch, English, American, French, German, or Swahili— are as damned difficult to make as they would be to use.
slopusila|23 days ago
yet nobody complained about this
in fact engineers appreciate that: "we are not replaceable code-monkey cogs in the machine, as management would like"
nickm12|20 days ago
MyHonestOpinon|23 days ago
Daviey|23 days ago
This feels like the same debate assembly programmers had about C in the 60s. "You don’t understand what the compiler is doing, therefore it’s dangerous". Eventually we realised the important thing isn’t how the code was authored but whether the behaviour is correct, testable, and maintainable.
If code generated by an LLM:
- passes a real test suite (not toy tests)
then the acceptance criteria haven't changed. The test suite is part of the spec. If the spec is enforced in CI, the authoring tool is secondary. The real risk isn't "LLMs as compilers", it's letting changes bypass verification and ownership. We solved that with C, with large dependency trees, with codegen tools. Same playbook applies here.
If you give expected input and get expected output, why does it matter how the code was written?
shauhss|23 days ago
> passes a real test suite (not toy tests)
“not toy tests” is doing a lot of heavy lifting here. Like an immeasurable amount of lifting.
nickm12|20 days ago
smallnix|23 days ago
Why? Because new languages have an IR in their compilation path?
ryanschneider|23 days ago
lfsss|23 days ago
lunarboy|23 days ago
rvz|23 days ago
Can you formally verify prose?
> But yeah, they probably don't fit the bill of English based code to machine code
Which is why LLMs cannot be compilers that transform code to machine code.
jerf|23 days ago
Using LLMs to do something like what a compiler can already do is also modelling LLMs as infinite rather than finite. In fact in this particular situation not only are they finite, they're grotesquely finite, in particular, they are expensive. For example, there is no world where we just replace our entire infrastructure from top to bottom with LLMs. To see that, compare the computational effort of adding 10 8-digit numbers with an LLM versus a CPU. Or, if you prefer something a bit less slanted, the computational costs of serving a single simple HTTP request with modern systems versus an LLM. The numbers run something like LLMs being trillions of times more expensive, as an opening bid, and if the AIs continue to get more expensive it can get even worse than that.
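The "trillions of times" claim can be sanity-checked with a back-of-envelope calculation. Every figure below is a rough assumption (model size, tokens needed, the 2-FLOPs-per-parameter-per-token rule of thumb for transformer inference), not a measurement:

```python
# Back-of-envelope: CPU addition vs. LLM inference for summing ten 8-digit numbers.
cpu_ops = 10                     # ~10 integer additions on a CPU
params = 70e9                    # assume a 70B-parameter model
flops_per_token = 2 * params     # rule of thumb: ~2 FLOPs per parameter per token
tokens = 50                      # assume ~50 tokens to state and answer the sum

llm_ops = flops_per_token * tokens
ratio = llm_ops / cpu_ops        # roughly 10^11-10^12: in the "trillions" ballpark
```

Vary the assumptions by an order of magnitude in either direction and the conclusion survives: the gap is so many orders of magnitude wide that no plausible model shrinks it to parity.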
For similar reasons, using LLMs as a compiler is very unlikely to ever produce anything even remotely resembling a payback versus the cost of doing so. Let the AI improve the compiler instead. (In another couple of years. I suspect today's AIs would find it virtually impossible to significantly improve an already-optimized compiler.)
Moreover, remember, oh, maybe two years back when it was all the rage to have AIs be able to explain why they gave the answer they did? Yeah, I know, in the frenzied greed to be the one to grab the money on the table, this has sort of fallen by the wayside, but code is already the ultimate example of that. We ask the LLM to do things, it produces code we can examine, and the LLM session then dies away leaving only the code. This is a good thing. This means we can still examine what the resulting system is doing. In a lot of ways we hardly even care what the LLM was "thinking" or "intending", we end up with a fantastically auditable artifact. Even if you are not convinced of the utility of a human examining it, it is also an artifact that the next AI will spend less of its finite resources simply trying to understand and have more left over to actually do the work.
We may find that we want different programming languages for AIs. Personally I think we should always try to retain that ability for humans to follow it, even if we build something like that. We've already put the effort into building AIs that produce human-legible code and I think it's probably not that great a penalty in the long run to retain that. At the moment it is hard to even guess what such a thing would look like, though, as the AIs are advancing far faster than anyone (or any AI) could produce, test, prove out, and deploy such a language, against the advantage of other AIs simply getting better at working with the existing coding systems.
kittikitti|23 days ago
Stop this. This is such a stupid way of describing mistakes from AI. Please try to use the confusion matrix or any other framework. If you're going to make arguments, it's hard to take them seriously if you keep regurgitating that LLMs hallucinate. It's not a well-defined term, so if you continually make it your core argument, it becomes disingenuous.
dgxyz|23 days ago
jtrn|23 days ago
You can see it clearly if you just translate the article's expensive vocabulary into plain English. When the author writes, 'When you hand-build, the space of possibilities is explored through design decisions you're forced to confront,' they are just saying, 'When you write code yourself, you have to choose how to write it.' When they claim, 'contextuality is dominated by functional correctness,' they just mean, 'Usually, we just care if the code works.' When they warn about 'inviting us to outsource functional precision itself,' they really mean, 'LLMs let you be lazy.' And finally, 'strengthening the will to specify' is just a dramatic way of saying, 'We need to write better requirements.' It is obscurantism, plain and simple: using complexity to hide the fact that the insight is trivial.
But that is just an aesthetic problem to me. Worse, the argument collapses entirely when you look at the logical leap between the premises.
The author basically argues that because Natural Language is vague, engineers will inevitably stop caring about the details and just accept whatever reasonable output the AI gives. This is pure armchair psychology. It assumes that just because the tool allows for vagueness, professionals will suddenly abandon the concept of truth or functional requirements. That is a massive, unsubstantiated jump.
We use fuzzy matching to find contacts on our phones all the time. Just because the search algorithm is imprecise doesn't mean we stop caring if we call the right person. We don't say, 'Well, the fuzzy match gave me Bob instead of Bill, I guess I'll just talk to Bob now.' The hard constraint, the functional requirement of talking to the specific person you need, remains absolute. Similarly, in software, the code either compiles and passes the tests, or it doesn't. The medium of creation might be fuzzy, but the execution environment is binary. We aren't going to drift into accepting broken banking software just because the prompt was in English.
This entire essay feels like those social psychology types that have now been thoroughly discredited by the replication crisis in psychology. The ones who were more concerned with dazzling people with verbal skills than with being right. It is unnecessarily complex, relying on projection of dreamt-up concepts and behavior rather than observation. This tries to sound profound by turning a technical discussion into a philosophical crisis, but underneath the word salad, it is not just shallow, it is wrong.
MarginalGainz|23 days ago
[deleted]
genie3io|23 days ago
[deleted]