dust42 | 10 days ago
Tech summary:
- 15k tok/sec on 8B dense 3bit quant (llama 3.1)
- limited KV cache
- 880mm^2 die, TSMC 6nm, 53B transistors
- presumably 200W per chip
- 20x cheaper to produce
- 10x less energy per token for inference
- max context size: flexible
- mid-sized thinking model upcoming this spring on same hardware
- next hardware supposed to be FP4
- a frontier LLM planned within twelve months
This is all from their website; I am not affiliated. The founders have 25 years of career across AMD, Nvidia and others, and $200M of VC funding so far. Certainly interesting for very low latency applications which need < 10k tokens of context. If they deliver in spring, they will likely be flooded with VC money.
Not exactly a competitor for Nvidia but probably for 5-10% of the market.
Back of the napkin: 1mm² of 6nm wafer costs ~$0.20, so 1B parameters need about $20 of die. The larger the die size, the lower the yield. Supposedly the inference speed stays almost the same with larger models.
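Using only the figures quoted above, the arithmetic works out like this (a sketch; the $0.20/mm² wafer cost is the comment's own assumption, not TSMC pricing, and yield loss and packaging are ignored):

```python
# Back-of-napkin die cost from the comment's numbers (assumed, not vendor data).
def die_cost_usd(area_mm2, usd_per_mm2=0.20):
    """Silicon cost of one die, ignoring yield loss and packaging."""
    return area_mm2 * usd_per_mm2

total = die_cost_usd(880)   # 880 mm^2 die -> $176 of wafer area
per_billion = total / 8     # etched 8B model -> ~$22 per 1B parameters
print(total, per_billion)
```

That lands right around the "~$20 of die per 1B parameters" figure in the comment.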
Interview with the founders: https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...
vessenes|10 days ago
1) 16k tokens / second is really stunningly fast. There’s an old saying about any factor of 10 being a new science / new product category, etc. This is a new product category in my mind, or it could be. It would be incredibly useful for voice agent applications, realtime loops, realtime video generation, .. etc.
2) https://nvidia.github.io/TensorRT-LLM/blogs/H200launch.html has the H200 doing ~12k tokens/second on Llama 2 13B FP8. Knowing these architectures, that's likely a 100+-way batched run, meaning time to first token is almost certainly slower than Taalas's. Probably much slower, since Taalas is in the milliseconds.
3) Jensen has these Pareto-curve graphs: for a given energy budget and chip architecture, choose your point on the curve to trade off throughput vs latency. My quick math is that these chips probably do not shift the curve. A 6nm part is likely 30-40% bigger than its 4nm equivalent and draws that much more power; if we take the numbers they give and extrapolate to an FP8 model (slower) on a smaller geometry (30% faster and lower power), and compare 16k tokens/second for Taalas to 12k tokens/s for an H200, these chips land in the same ballpark on the curve.
However, I don’t think the H200 can reach into this part of the curve, and that does make these somewhat interesting. In fact even if you had a full datacenter of H200s already running your model, you’d probably buy a bunch of these to do speculative decoding - it’s an amazing use case for them; speculative decoding relies on smaller distillations or quants to get the first N tokens sorted, only when the big model and small model diverge do you infer on the big model.
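The speculative-decoding loop described above can be sketched in a few lines. This is a toy greedy variant, not any vendor's implementation; `draft_next` and `target_next` are stand-ins for the small and large models and are assumptions of the example:

```python
def speculative_decode(draft_next, target_next, prompt, n_draft=4, max_tokens=16):
    """Toy greedy speculative decoding: a cheap draft model proposes a run of
    tokens; the expensive target model verifies them, keeps the longest
    agreeing prefix, and emits its own token at the first divergence."""
    out = list(prompt)
    while len(out) - len(prompt) < max_tokens:
        # 1. Draft model proposes n_draft tokens autoregressively (cheap).
        ctx, draft = out[:], []
        for _ in range(n_draft):
            tok = draft_next(ctx)
            draft.append(tok)
            ctx.append(tok)
        # 2. Target model checks every proposed position. In a real system
        #    this is one batched forward pass; simulated here per position.
        accepted = []
        for tok in draft:
            want = target_next(out + accepted)
            if want == tok:
                accepted.append(tok)   # draft agreed: this token came "free"
            else:
                accepted.append(want)  # divergence: target's token wins, stop
                break
        out.extend(accepted)
    return out[len(prompt):len(prompt) + max_tokens]

# Deterministic toy models: the draft is right except after a 5.
target_next = lambda ctx: (ctx[-1] + 1) % 10
draft_next  = lambda ctx: 0 if ctx[-1] == 5 else (ctx[-1] + 1) % 10
```

The output is identical to running the target model alone; the win is that agreeing runs cost one target pass instead of one per token, which is why a dirt-cheap hardwired draft chip is attractive here.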
Upshot - I think these will sell, even on 6nm process, and the first thing I’d sell them to do is speculative decoding for bread and butter frontier models. The thing that I’m really very skeptical of is the 2 month turnaround. To get leading edge geometry turned around on arbitrary 2 month schedules is .. ambitious. Hopeful. We could use other words as well.
I hope these guys make it! I bet the v3 of these chips will be serving some bread and butter API requests, which will be awesome.
rbanffy|10 days ago
I often remind people that two orders of magnitude of quantitative change is a qualitative change.
> The thing that I’m really very skeptical of is the 2 month turnaround. To get leading edge geometry turned around on arbitrary 2 month schedules is .. ambitious. Hopeful. We could use other words as well.
The real product they have is automation. They figured out a way to compile a large model into a circuit. That's, in itself, pretty impressive. If they can do this, they can also compile models to an HDL and deploy them to large FPGA simulators for quick validation. If we see models maturing at a "good enough" state, even a longer turnaround between model release and silicon makes sense.
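As a toy illustration of what "compiling a model into a circuit" could mean (this is an assumption of the example, not Taalas's actual flow): a generator that bakes fixed weights into a SystemVerilog dot product, so synthesis can fold the constant multipliers into wiring:

```python
# Hypothetical sketch: emit a SystemVerilog module whose multipliers are
# compile-time constants. Module name, widths, and weights are made up.
def emit_dot_product(name, weights):
    """Return SystemVerilog for y = sum(w[i] * x[i]) with hard-coded weights."""
    lines = [f"module {name}(input signed [7:0] x[{len(weights)}], "
             f"output signed [31:0] y);"]
    terms = " + ".join(f"({w} * x[{i}])" for i, w in enumerate(weights))
    lines.append(f"  assign y = {terms};")
    lines.append("endmodule")
    return "\n".join(lines)

print(emit_dot_product("neuron0", [3, -1, 4]))
```

The point is that once weights are literals in the netlist, there is no weight memory to read at all, which is where the latency and energy wins would come from.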
While I also see lots of these systems running standalone, I think they'll really shine combined with more flexible inference engines, running the unchanging parts of the model while the coupled inference engine deals with whatever is too new to have been baked into silicon.
I'm concerned about the environmental impact. Chip manufacture is not very clean, and these chips will need to be swapped out and replaced at a higher cadence than we currently do with GPUs.
ssivark|9 days ago
Can we use older (previous generation, smaller) models as a speculative decoder for the current model? I don't know whether the randomness in training (weight init, data ordering, etc) will affect this kind of use. To the extent that these models are learning the "true underlying token distribution" this should be possible, in principle. If that's the case, speculative decoding is an elegant vector to introduce this kind of tech, and the turnaround time is even less of a problem.
jasonwatkinspdx|8 days ago
They may be using Rapidus, which is a Japanese government backed foundry built around all single wafer processing vs traditional batching. They advertise ~2 month turnaround time as standard, and as short as 2 weeks for priority.
joha4270|10 days ago
> to get the first N tokens sorted, only when the big model and small model diverge do you infer on the big model
suggests there is something I'm unaware of. If you compare the small and big model, don't you have to wait for the big model anyway and then what's the point? I assume I'm missing some detail here, but what?
alexjplant|9 days ago
I enjoy envisioning futures more whimsical than "the bargain-basement LLM provider that my insurance company uses denied my claim because I chose badly-vectored words".
jameslk|9 days ago
I’m really curious if context will really matter if using methods like Recursive Language Models[0]. That method is suited to break down a huge amount of context into smaller subagents recursively, each working on a symbolic subset of the prompt.
The challenge with RLM seemed to be that it burned through a ton of tokens in exchange for more accuracy. If tokens are cheap, RLM could provide much better accuracy over large contexts than what the underlying model can handle natively.
0. https://arxiv.org/abs/2512.24601
aurareturn|10 days ago
And it's a 3-bit quant, so about a 3GB RAM requirement.
If they ran the 8B at native 16-bit, it would take 60 H100-sized chips.
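The weight-memory figures follow directly from the bit width; a quick sanity check (decimal GB, weights only, ignoring KV cache and activations):

```python
# Weight memory at a given quantization width (decimal GB, weights only).
def weight_gb(params_billions, bits):
    """GB needed just for the parameters at `bits` per weight."""
    return params_billions * 1e9 * bits / 8 / 1e9

print(weight_gb(8, 3))   # 8B params at 3-bit
print(weight_gb(8, 16))  # 8B params at native 16-bit
```

So 3 GB at 3-bit and 16 GB at 16-bit, i.e. the jump to full precision costs over 5x the etched weight storage.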
dust42|10 days ago
Are you sure about that? If true it would definitely make it look a lot less interesting.
Aissen|10 days ago
That's a lot of surface, isn't it? As big as an M1 Ultra (2× M1 Max at 432mm² on TSMC N5P), a bit bigger than an A100 (820mm² on TSMC N7) or H100 (814mm² on TSMC N5).
> The larger the die size, the lower the yield.
I wonder if that applies here? What's the big deal if a few parameters have a few bit flips?
rbanffy|10 days ago
We get into the sci-fi territory where a machine achieves sentience because it has all the right manufacturing defects.
Reminds me of this https://en.wikipedia.org/wiki/A_Logic_Named_Joe
pankajdoharey|9 days ago
https://arxiv.org/abs/2511.06174
https://arxiv.org/abs/2401.03868
For a real-world use case, you would need an FPGA with terabytes of RAM. Perhaps it'll be off-chip HBM. But for large models, even that won't be enough. Then you would need to figure out an NVLink-like interconnect for these FPGAs, and we're back to square one.