(no title)
joshhart | 6 months ago
Some of the other main tricks - compress the model to 8 bit floating point formats or even lower. This reduces the amount of data that has to stream to the compute unit, also newer GPUs can do math in 8-bit or 4-bit floating point. Mixture of expert models are another trick where for a given token, a router in the model decides which subset of the parameters are used so not all weights have to be streamed. Another one is speculative decoding, which uses a smaller model to generate many possible tokens in the future and, in parallel, checks whether some of those matched what the full model would have produced.
Add all of these up and you get efficiency! Source - was director of the inference team at Databricks
bawana|6 months ago
joshhart|6 months ago