ioedward | 2 years ago

You also need the optimizer's state (e.g. Adam's), which is usually double the size of the parameters. So with fp16, one parameter takes up 6 bytes in memory: 2 for the weight plus 4 for Adam's two moment estimates.

mirekrusin|2 years ago

Yes, if you use Adam - but it doesn't add up to 80, does it?

Even for fp64, Adam's state adds only 16 bytes per parameter.

RMSProp and Adagrad have half of this overhead.

Plain SGD (without momentum) has no optimizer overhead, of course.
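
For anyone who wants to check the arithmetic, here is a rough sketch. It assumes the optimizer state is stored in the same dtype as the weights, that gradients and activations are not counted, and that SGD and RMSProp run without momentum. The table names and `bytes_per_param` are made up for illustration, not taken from any library.

```python
# Parameter-sized state tensors kept per optimizer (assumption: no momentum
# for SGD/RMSProp; Adam keeps first and second moment estimates).
STATE_TENSORS = {"adam": 2, "rmsprop": 1, "adagrad": 1, "sgd": 0}

# Width of one value in bytes for each floating-point format.
DTYPE_BYTES = {"fp16": 2, "fp32": 4, "fp64": 8}


def bytes_per_param(optimizer: str, dtype: str) -> tuple[int, int]:
    """Return (total bytes, optimizer-state bytes) for a single parameter.

    Assumes the state uses the same dtype as the weight and ignores gradients.
    """
    width = DTYPE_BYTES[dtype]
    state = STATE_TENSORS[optimizer] * width
    return width + state, state


if __name__ == "__main__":
    for opt in STATE_TENSORS:
        for dtype in DTYPE_BYTES:
            total, state = bytes_per_param(opt, dtype)
            print(f"{opt:8s} {dtype}: {total:2d} bytes total, {state:2d} from optimizer state")
```

Under those assumptions Adam with fp16 comes out at 6 bytes per parameter, and Adam with fp64 at 16 bytes of state on top of the 8-byte weight. Mixed-precision setups that keep fp32 state or master weights would give different numbers.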