Something not unlike this happened to me when moving some batch processing code from C++ to Python 1.4 (this was 1997). The batch started finishing about 10x faster. We refused to believe it at first and started looking to make sure the work was actually being done. It was.
The port had been done in a weekend just to see if we could use Python in production. The C++ code had taken a few months to write. The port was pretty direct, function for function. It was even line for line where language and library differences didn't offer an easier way.
A couple of us worked together for a day to find the reason for the speedup. Just looking at the code didn't give us any clues, so we started profiling both versions. We found out that the port had accidentally fixed a previously unknown bug in some code that built and compared cache keys. After identifying the small misbehaving function, we had to study the C++ code pretty hard to even understand what the problem was. I don't remember the exact nature of the bug, but I do remember thinking that particular type of bug would be hard to express in Python, and that's exactly why it was accidentally fixed.
We immediately started moving the rest of our back end to Python. Most things were slower, but not by much because most of our back end was i/o bound. We soon found out that we could make algorithmic improvements so much more quickly, so a lot of the slowest things got a lot faster than they had ever been. And, most importantly, we (the software developers) got quite a bit faster.
This was particularly true for one of the projects I've worked on in the past, where Python was chosen as the main language for a monitoring service.
In short, it proved itself to be a disaster: just the Python process collecting and parsing the metrics of all programs consumed 30-40% of the processing power of the lower end boxes.
In the end, the project went ahead for a while more, and we had to do all sorts of mitigations to get the performance impact to be less of an issue.
We did consider replacing it all with a few open source tools written in C and some glue code; the initial prototype used a few MB of memory instead of dozens (or even hundreds), while barely registering any CPU load, but in the end it was deemed a waste of time when the whole project was terminated.
> After identifying the small misbehaving function, we had to study the C++ code pretty hard to even understand what the problem was. I don't remember the exact nature of the bug, but I do remember thinking that particular type of bug would be hard to express in Python, and that's exactly why it was accidentally fixed.
Pure speculation, but I would guess this has something to do with a copy constructor getting invoked in a place you wouldn't guess, that ends up in a critical path.
One advantage of Python is that it is so slow that if you choose the wrong algorithm or data structure, that soon becomes obvious. And complicated stuff like this is exactly where I find LLMs struggle. So I make a first version in Python, and only when I am happy with the results, and the speed feels reasonable compared to the problem complexity, do I ask Claude Code to port the critical parts to Rust.
> We soon found out that we could make algorithmic improvements so much more quickly
It's true that writing code in C doesn't automatically make it faster.
For example, string manipulation. 0-terminated strings (the default in C) are, frankly, an abomination. String processing code becomes a tangle of strlen, strcpy, strncpy, and strcat, all of which require repeated passes over the string looking for the 0. (Worse, reloading the string into the cache just to find its length slows things down further.)
Worse is the problem that, in order to slice a string, you have to malloc some memory and copy the string. And then carefully manage the lifetime of that slice.
The fix is simple - use length-delimited strings. D relies on them to great effect. You can do them in C, but you get no succor from the language. I've proposed a simple enhancement for C to make them work https://www.digitalmars.com/articles/C-biggest-mistake.html but nobody in the C world has any interest in it (which baffles me, it is so simple!).
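For contrast, JavaScript's typed arrays already work the length-delimited way: a `Uint8Array` is essentially a (pointer, length) pair over an `ArrayBuffer`, so `subarray()` hands out a slice with no malloc and no copy. A small sketch (not from the thread, just an illustration of the fat-pointer idea):

```javascript
// A Uint8Array carries its length with it, so slicing is just
// (pointer + offset, new length) - no allocation, no copy, no strlen.
const buf = new Uint8Array([104, 101, 108, 108, 111]); // bytes of "hello"
const slice = buf.subarray(1, 4);                      // view of "ell"

console.log(slice.length); // 3 - the length travels with the slice
slice[0] = 69;             // write through the view...
console.log(buf[1]);       // 69 - ...and the original sees it: nothing was copied
```

This is the same `(length, pointer)` layout D uses for its arrays; the slice's lifetime is handled by the GC rather than by hand, which is the part C can't give you for free.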
Another source of slowdown in C, as I've discovered over the years, is that C is not a plastic language; it is a brittle one. The first algorithm you select for a C project gets so welded into it that it cannot be changed without great difficulty. (And we all know that algorithms are the key to speed, not coding details.) Why isn't C plastic?
It's because one cannot switch back and forth between a reference type and a value type without extensively rewriting every use of it. For example:
struct S { int a; };
int foo(struct S s) { return s.a; }
int bar(struct S *s) { return s->a; }
If you want to switch between reference and value, you've got to go through all your code swapping . and ->. It's just too tedious and never happens. In D:
struct S { int a; }
int foo(S s) { return s.a; }
int bar(S *s) { return s.a; }
I discovered while working on D that there is no reason for the C and C++ -> operator to even exist, the . operator covers both bases!
This is the difference between scripting and programming. If you use C++ as a scripting language you're gonna have a bad time. Of course a scripting language is faster for scripting! That doesn't mean you go full Graham and throw away real programming languages; it just means you aren't writing systems software.
The usual strategy is to write a script, then, if it's slow, see how you could design a program that would do the job faster.
The usual strategy in the real world is to copy paste thousands of lines of C++ code until someone comes along and writes a proper direct solution to the problem.
Of course there are ideas on how to fix this: writing your own scripting libraries (stb), packages (go/rust/ts), or metaprogramming (lisp/jai). As for bugs, those are a function of how you choose to write code: the standard way of writing shell is bug-prone, the standard way of writing Python is less so, and not using overloading & going wider in C++ generally helps.
I suspect that you used highly optimized algorithms written for Python, like the vector algorithms in numpy?
You will struggle to write better code, at least I would.
> We immediately started moving the rest of our back end to Python. Most things were slower, but not by much because most of our back end was i/o bound.
Would be kind of cool if e.g. Python or Ruby could be as fast as C or C++.
I wonder if this could be possible, assuming we could modify both to achieve that as an outcome, but without ending up with a language that would be like C or C++. Right now there is a strange divide between "scripting" languages and compiled ones.
The real win here isn't TS over Rust, it's the O(N²) -> O(N) streaming fix via statement-level caching. That's a 3.3x improvement on its own, independent of language choice. The WASM boundary elimination is 2-4x, but the algorithmic fix is what actually matters for user-perceived latency during streaming. Title undersells the more interesting engineering imo.
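The statement-level caching fix can be sketched in a few lines (hypothetical names, with plain newline-terminated "statements" standing in for the article's actual DSL): the naive streaming parser re-parses the whole accumulated buffer on every chunk, which is O(N²) over the stream, while the cached one re-parses only the trailing partial statement.

```javascript
// Naive: O(N^2) - every chunk triggers a full re-parse of the buffer.
function makeNaiveParser() {
  let buffer = '';
  return (chunk) => {
    buffer += chunk;
    return buffer.split('\n').filter(Boolean); // re-parse everything
  };
}

// Cached: O(N) - completed statements are parsed once and kept;
// only the trailing partial statement is re-parsed on the next chunk.
function makeCachedParser() {
  const done = []; // statements parsed exactly once
  let tail = '';   // trailing, possibly incomplete statement
  return (chunk) => {
    tail += chunk;
    const parts = tail.split('\n');
    tail = parts.pop(); // last piece may still be incomplete
    done.push(...parts.filter(Boolean));
    return tail ? [...done, tail] : [...done];
  };
}

const naive = makeNaiveParser();
const cached = makeCachedParser();
for (const chunk of ['{a}\n{b', '}\n', '{c}']) {
  console.log(naive(chunk), cached(chunk)); // identical output, far less re-work
}
```

Both parsers emit the same statement list per chunk; the cached one just stops paying for work it already did.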
They even directly conclude at the end of the article that improvements in algorithm are more important than the choice of language:
> Algorithmic complexity improvements dominate language-level optimisations. Going from O(N²) to O(N) in the streaming case had a larger practical impact than switching from WASM to TypeScript.
Yet they still chose to put the “Rust rewrite” part in the title. I almost think it's clickbait.
Yeah the algorithmic fix is doing most of the work here. But call that parser hundreds of times on tiny streaming chunks and the WASM boundary cost per call adds up fast. Same thing would happen with C++ compiled to WASM.
O(N²) -> O(N) was 3.3x faster, but before that, eliminating the boundary (replacing wasm with JS) led to speedups of 2.2x, 4.6x, 3.0x (see one table back).
It looks like neither is the "real win": both the language and the algorithm made a big difference, as you can see in the first column of the last table - moving off WASM was a big speedup, and improving the algorithm on top of that was another big speedup.
Kinda is. We came up with abstractions to help reason about what really matters. The more you need to deal with auxiliary stuff (allocations, lifetimes), the more likely you are to miss the big issue.
Yeah if you're serializing and deserializing data across the JS-WASM boundary (or actually between web workers in general whether they're WASM or not) the data marshaling costs can add up. There is a way of sharing memory across the boundary though without any marshaling: TypedArrays and SharedArrayBuffers. TypedArrays let you transfer ownership of the underlying memory from one worker (or the main thread) to another without any copying. SharedArrayBuffers allow multiple workers to read and write to the same contiguous chunk of memory. The downside is that you lose all the niceties of any JavaScript types and you're basically stuck working with raw bytes.
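The transfer-of-ownership part can be demonstrated without a worker at all: in Node 17+ (or any modern browser), `structuredClone` accepts the same transfer list that `postMessage` does, so it shows the same move semantics — the receiver gets the bytes, the sender's view is detached, and nothing is copied:

```javascript
const bytes = new Uint8Array([1, 2, 3, 4]);

// Transfer the underlying ArrayBuffer instead of copying it, exactly as
// postMessage(bytes, [bytes.buffer]) would do between workers.
const moved = structuredClone(bytes, { transfer: [bytes.buffer] });

console.log(Array.from(moved)); // [1, 2, 3, 4] - receiver sees the data
console.log(bytes.byteLength);  // 0 - sender's view is detached, not copied
```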
You still do get some latency from the event loop, because postMessage gets queued as a MacroTask, which is probably on the order of 10μs. But this is the price you have to pay if you want to run some code in a non-blocking way.
Strongly agree from an Emscripten C++ wasm pov: it's key to minimise emscripten::val roundtrips. Caches must be designed for rectilinear data geometry, and SharedArrayBuffers are the way for bulk data. But only JS allows us to express asynchrony, so we need an on_completion callback design at the lang boundary.
So the actual processing is faster in Rust/C/C++, but the marshaling costs are so big that TS is faster in this case? No clue how something like swc does this, but there it's way faster than Babel.
"We rewrote this code from language L to language M, and the result is better!" No wonder: it was a chance to rectify everything that was tangled or crooked, avoid every known bad decision, and apply newly-invented better approaches.
So this holds even for L = M. The speedup is not in the language, but in the rewriting and rethinking.
By the way, I did a deeper dive on the problem of serializing objects across the Rust/JS boundary, noticed the approach used by serde wasn’t great for performance, and explored improving it here: https://neugierig.org/software/blog/2024/04/rust-wasm-to-js....
This article is obviously AI generated, and besides being jarring to read, it makes me really doubt its validity. You can get substantially faster parsing versus `JSON.parse()` by parsing structured binary data, and it's also faster to pass a byte array compared to a JSON string from WASM to the browser. My guess is that not only was this article AI generated, but also their benchmarks, and perhaps the implementation as well.
That final summary benchmark means nothing. It mentions a 'baseline' value for the 'Full-stream total' of the Rust implementation, and then says `serde-wasm-bindgen` is '+9-29% slower', but it never gives us the baseline value, because clearly the only benchmark run against the Rust codebase was the per-call one.
Then it mentions:
"End result: 2.2-4.6x faster per call and 2.6-3.3x lower total streaming cost."
But the "2.6-3.3x" is by their own definition a comparison against the naive TS implementation.
I really think the guy just prompted claude to "get this shit fast and then publish a blog post".
This. It’s so annoying to read these types of blogs now, where the writer clearly didn’t put in the effort to understand things fully or at least review the blog their LLM wrote. Who is this useful for?
The article as a whole makes no sense. They are generating UI with an LLM. How fast the UI appears to the user is going to be completely dictated by the speed of the LLM, not the speed of the serialisation.
as an author of the blog - ouch
did a little bit more than prompt claude but a lot of claude prompting was definitely involved
I understand your frustration with AI writing though. We are a small team and given our roadmap it was either use LLMs to help collate all the internal benchmark results file into a blog or never write it so we chose the former. This was a genuinely surprising and counterintuitive result for us, which is why we wanted to share it. Happy to clarify any of the numbers if helpful.
Had the opposite experience. Our JS FEM solver (~550ms per load case) was rewritten in Rust and dropped to ~270ms. But we compile to native.exe, not WASM — we call it via stdin/stdout with JSON from a Node.js compute engine. Tried the WASM route first but the serialization overhead for large stiffness matrices ate the gains, exactly like this article describes. Native binary + stdin/stdout turned out to be the sweet spot: no boundary tax, no FFI, and you get full native SIMD. The sparse solver variant (sprs crate, COO/CSC assembly) scales even better for larger models.
This is why, when a programming language already has compiler tooling, be it ahead-of-time or dynamic, it pays off to first validate algorithms and data structures before a full rewrite.
Additionally, even after those options are exhausted, only a few key parts might need a rewrite, not the whole thing.
However, I wonder how many care about actually learning about algorithms, data structures and mechanical sympathy in the age of Electron apps.
It feels quite often that a rewrite is chosen, because knowing how to actually apply those skills is the CS stuff many think isn't worthwhile learning about.
> Attempted Fix: Skip the JSON Round-Trip
> We integrated serde-wasm-bindgen
So you're reinventing JSON, but binary? V8's JSON is highly optimized nowadays [1] and can process gigabytes per second [2]; I doubt it is a bottleneck here.
No, serde-wasm-bindgen (https://docs.rs/serde-wasm-bindgen/) implements the serde Serializer interface by calling into JS to directly construct the JS objects on the JS heap, without an intermediate serialization/deserialization. You pay the cost of one or more FFI calls for every object, though.
Not directly related to the post but what does OpenUI do? I'm finding it interesting but hard to understand. Is it an intermediate layer that makes LLMs generate better UI?
It's the library that bridges the gap between LLMs and live UI. The best example would be to imagine you want to build interactive charts within your AI agent (like Claude).
The most obvious approach would be to let LLMs generate code and render it, but that introduces problems like safety, UI consistency, and speed. OpenUI solves those problems and provides a safe, consistent, and token-optimized runtime for the LLMs to render live UI.
> The openui-lang parser converts a custom DSL emitted by an LLM into a React component tree.
> converts internal AST into the public OutputNode format consumed by the React renderer
Why not just have the LLM emit the JSON for OutputNode? Why is a custom "language" and parser needed at all? And yes, there is a cost to marshaling data, so you should avoid doing it where possible, and do it in large chunks when it's not possible to avoid. This is not an unknown phenomenon.
The WASM story is interesting from a security angle too. WASM modules inheriting the host's memory model means any parsing bugs that trigger buffer overreads in the Rust code could surface in ways that are harder to audit at the JS boundary. Moving to native TS at least keeps the attack surface in one runtime, even if the theoretical memory safety guarantees go down.
It's also worth underlining that it's not just that "the parsing computation is fast enough that V8's JIT eliminates any Rust advantage", but specifically that this kind of straightforward, well-defined data-structure code and mutation, without any strange eval paths or global access, is going to be JITed to near-native speed relatively easily.
I’m more of a dabbler dev/script guy than a dev, but Every. single. thing I ever write in JavaScript ends up being incredibly fast. It forces me to think in callbacks and events and promises.
Python and C (or async!) seem easy and sorta lazy in comparison.
asa400 | 7 days ago:
Crazy how many stories like this I've heard, where doing performance work helped people uncover bugs and/or hidden assumptions about their systems.
sroussey | 8 days ago:
One thing I noticed was that they time each call and then use a median. Sigh. In a browser. :/ With timing-attack defenses built into the JS engine.
spankalee | 8 days ago:
This new company chose a very confusing name that has been used by the Open UI W3C Community Group for over 5 years.
https://open-ui.org/
Open UI is the standards group responsible for HTML having popovers, customizable select, invoker commands, and accordions. They're doing great work.
moomin | 7 days ago:
Looks inside
“The old implementation had some really inappropriate choices.”
Every time.
coldtea | 7 days ago:
Never mind the age of Electron apps, even fewer care about those in the age of agents.
gavinray | 7 days ago:
AFAIK, you can create a shared memory block between WASM <-> JS:
https://developer.mozilla.org/en-US/docs/WebAssembly/Referen...
Then you'd only need to parse the SharedArrayBuffer at the end on the JS side
[1] https://v8.dev/blog/json-stringify [2] https://github.com/simdjson/simdjson