tomohelix|1 year ago
In summary, they forced the model to process data in a ternary number system and then built a custom FPGA accelerator to process the data more efficiently. Tested to be "comparable" to small models (3B), theoretically scales to 70B, unknown for SOTA models (>100B params).
We have always known custom chips are more efficient, especially for tasks like these where it is basically approximating an analog process (i.e. the brain). What is impressive is how fast it is progressing. These 3B-param models would demolish GPT-2, which is, what, 4-5 years old? And they would have been pure sci-fi tech 10 years ago.
Now they can run on your phone.
A machine, running locally on your phone, that can listen and respond to anything a human may say. Who could have confidently claimed this 10 years ago?
roenxi|1 year ago
I was confused by the claim in the headline, but it seems this is really the meat of the paper: they're looking for an architecture that is more efficient to implement and run in hardware. It is interesting. By analogy with the human brain, we know computers must be wasting huge amounts of compute on something, and researchers will figure out what sooner or later.
bee_rider|1 year ago
With a ternary system, would we expect 1/3 of the elements to be zero? I kind of wonder about using a sparse MM; then they wouldn't have to represent 0 and could just use one bit to represent 1 or -1. 66% density is not really very sparse at all, though.
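A back-of-envelope sketch of that idea (a hypothetical layout, not anything from the paper): store only the positions of the nonzero entries plus one sign bit each, and the mat-vec collapses to gather-and-add.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dense ternary weight matrix: entries drawn from {-1, 0, +1}.
W = rng.integers(-1, 2, size=(4, 8))
x = rng.standard_normal(8)

# Sparse encoding: per row, the column indices of the nonzeros plus one
# sign bit each -- zero never needs to be stored at all.
rows = [
    (np.flatnonzero(W[i]), W[i][np.flatnonzero(W[i])] > 0)
    for i in range(W.shape[0])
]

# Mat-vec without a single multiplication: add where the sign bit is set,
# subtract where it is not.
y = np.array([x[idx][pos].sum() - x[idx][~pos].sum() for idx, pos in rows])

assert np.allclose(y, W @ x)  # matches the ordinary dense matmul
```

At ~66% density the index overhead likely eats the savings, which is the commenter's own caveat; the sign-bit trick is the part that survives regardless.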
anon291|1 year ago
Note that the architecture does use matmuls. They just defined ternary matmuls to not be 'real' matrix multiplication. I mean... it is certainly a good thing for power consumption to be wrangling fewer bits, but from a semantic standpoint, it is matrix multiplication.
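To illustrate the semantic point (a toy sketch, not the paper's actual kernel): a product with a {-1, 0, 1} matrix is still an ordinary matrix multiplication, it is just one in which every scalar multiply degenerates to an add, a subtract, or a skip.

```python
import numpy as np

def ternary_matvec(W, x):
    """Matrix-vector product for W with entries in {-1, 0, 1},
    written without a single scalar multiplication."""
    y = np.zeros(W.shape[0])
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            if W[i, j] == 1:
                y[i] += x[j]      # weight +1: accumulate
            elif W[i, j] == -1:
                y[i] -= x[j]      # weight -1: subtract
            # weight 0: no work at all
    return y

rng = np.random.default_rng(1)
W = rng.integers(-1, 2, size=(3, 5))
x = rng.standard_normal(5)
assert np.allclose(ternary_matvec(W, x), W @ x)  # same semantics as matmul
```

Hence both sides of the argument: the result is identical to `W @ x`, but the hardware never needs a multiplier.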
JKCalhoun|1 year ago
Combined with the earlier paper this year that claimed LLMs work fine (and faster) with ternary numbers (rather than floats? or long ints?), the idea of running a small LLM locally is looking better and better.
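For reference, the weight quantization in that earlier 1.58-bit paper is, as I understand it, roughly absmean rounding: scale by the mean absolute weight, round, clip to {-1, 0, 1}. A rough sketch of that scheme (my reading of it, not the authors' code):

```python
import numpy as np

def absmean_ternarize(W, eps=1e-8):
    """Roughly the 1.58-bit scheme: scale by the mean |weight|,
    round, clip. Returns the ternary matrix and the scale needed
    to dequantize (W is approximated by gamma * W_t)."""
    gamma = np.abs(W).mean() + eps
    W_t = np.clip(np.round(W / gamma), -1, 1)
    return W_t, gamma

rng = np.random.default_rng(2)
W = rng.standard_normal((4, 4))
W_t, gamma = absmean_ternarize(W)
assert set(np.unique(W_t)) <= {-1.0, 0.0, 1.0}
```

Each weight then needs only log2(3) ≈ 1.58 bits, which is where the paper's name comes from.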
cgearhart|1 year ago
ChrisArchitect|1 year ago
Some more discussion a few weeks ago: https://news.ycombinator.com/item?id=40620955
bee_rider|1 year ago
The whole point of AI was to sell premium GEMMs and come up with funky low precision accelerators.
mysteria|1 year ago
https://news.ycombinator.com/item?id=40787349
MiguelX413|1 year ago
aixpert|1 year ago
skeledrew|1 year ago
unknown|1 year ago
[deleted]
unknown|1 year ago
[deleted]