top | item 39883449

(no title)

mobicham | 1 year ago

Hey Daniel! The VRAM is still the same as a pure n-bit model would take. Because we only need meta-data for a single nn.Linear at a time, you only need an additional (3GB-1.7GB)/224 = 5.8MB. If we compress the meta-data as well that would become much lower.

discuss

danielhanchen|1 year ago

Hey :)) Oh I love the idea of async movement from CPU to GPU ahead of time - ingenious! Prefetching a small amount of metadata seems reasonable and very smart!