Alifatisk | 9 days ago
The focus here should be on the custom hardware they're producing and its performance; that is what's impressive. Imagine putting GLM-5 on this, that'd be insane.
This reminds me a lot of when I tried the Mercury coder model by Inceptionlabs. They are building what they call a dLLM, a diffusion-based LLM. The speed is still impressive when playing around with it. But this, this is something else; it's almost unbelievable. As soon as I hit the enter key, the response appears. It feels instant.
I am also curious about Taalas pricing.
> Taalas’ silicon Llama achieves 17K tokens/sec per user, nearly 10X faster than the current state of the art, while costing 20X less to build, and consuming 10X less power.
Do we have an idea of how much a unit / inference / api will cost?
Also, considering how fast people switch models to keep up with the pace, is there really a potential market for hardware designed for one model only? What will they do when they want to upgrade to a better version? Throw out the current hardware and buy another one? Shouldn't there be a more flexible way, maybe only swapping the chip on top, like how people upgrade CPUs? I don't know, just thinking out loud.
mike_hearn|9 days ago
https://www.nextplatform.com/wp-content/uploads/2026/02/taal...
Probably they don't know what the market will bear and want to do some exploratory pricing, hence the "contact us" API access form. That's fair enough. But they're claiming orders of magnitude cost reduction.
> Is there really a potential market for hardware designed for one model only?
I'm sure there is. Models are largely interchangeable, especially at the low end. There are lots of use cases where you don't need super smart models but cheapness and speed matter a lot.
Think about a simple use case: a company has a list of one million customer names but no information about gender or age. They'd like to get a rough understanding of this. Mapping name -> guessed gender plus a rough age guess is a simple problem for even dumb LLMs. I just tried it on ChatJimmy and it worked fine. For this kind of exploratory data problem you really benefit from mass parallelism, low cost and low latency.
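The fan-out described above could look something like this sketch. The `guess_demographics` function and its prompt are assumptions (the real endpoint and request format aren't specified in the thread); the call body is a placeholder so the sketch is self-contained.

```python
import concurrent.futures

# Hypothetical stand-in for a call to a fast, cheap inference endpoint.
# The prompt wording and the function itself are assumptions, not a real API.
def guess_demographics(name: str) -> dict:
    prompt = (
        "Given only the first name below, guess gender (M/F/unknown) "
        "and a rough age range. Reply as JSON.\n"
        f"Name: {name}"
    )
    # A real run would POST `prompt` to the inference API; here we
    # return a fixed placeholder so the sketch runs offline.
    return {"name": name, "gender": "unknown", "age_range": "unknown"}

def classify_all(names: list[str], workers: int = 64) -> list[dict]:
    # Low per-request latency is what makes wide fan-out practical:
    # one million names at 64-way parallelism is ~15K sequential
    # round-trips per worker.
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(guess_demographics, names))

results = classify_all(["Alice", "Bob", "Priya"])
```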
> Shouldn't there be a more flexible way?
The whole point of their design is to sacrifice flexibility for speed, although they claim to support fine-tunes via LoRA. LLMs are already supremely flexible, so it probably doesn't matter.
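A minimal sketch of why LoRA fits a hardwired design: the frozen base weight W never changes (so it can be etched into silicon), and each fine-tune only adds a small low-rank correction B @ A computed on the side. The shapes here are illustrative, not Taalas's actual dimensions.

```python
import numpy as np

d, r = 8, 2                      # model dim and LoRA rank (r << d)
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))  # frozen base weight (fixed in hardware)
A = rng.standard_normal((r, d))  # trainable down-projection
B = np.zeros((d, r))             # trainable up-projection (starts at zero)

x = rng.standard_normal(d)
# Effective forward pass: y = W x + B (A x). The W x product can come
# from the fixed datapath while the rank-r correction is computed
# separately and added in.
y = W @ x + B @ (A @ x)
```

With B initialized to zero the adapter is a no-op, so the fine-tuned model starts out identical to the base model, which is the standard LoRA initialization.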
ahofmann|8 days ago
Speed and cost win over quality, and this will also be true for LLMs.
alfalfasprout|8 days ago
This uses hardwired weights, with on-die SRAM for things like the KV cache. It's WAY more power efficient and faster; the tradeoff is that it's hardwired.
Still, most frontier models are "good enough" where an obscenely fast version would be a major seller.
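To get a feel for why holding the KV cache in on-die SRAM is the hard part, here is the standard back-of-the-envelope sizing for a Llama-style decoder. The dimensions below assume a Llama-3-8B-like config (32 layers, 8 KV heads via grouped-query attention, head dim 128, fp16); they are assumptions for illustration, not Taalas's published specs.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    # Per user: one key and one value vector per layer per KV head per
    # token, hence the factor of 2; fp16 is 2 bytes per element.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)
print(size / 2**30, "GiB per user at 8K context")  # 1.0 GiB
```

That's on the order of a gigabyte per concurrent user at 8K context, which is roughly what the tradeoff between context length, concurrency, and on-die SRAM capacity has to balance.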