ioedward | 2 years ago
You also need the optimizer's state (e.g. Adam's), which is usually double the size of the parameters. So with fp16, one parameter takes up 6 bytes in memory (2 for the weight plus 4 for the optimizer state).
mirekrusin | 2 years ago
Yes, if you use Adam, but it doesn't add up to 80, does it? Even for fp64 it adds only 16 bytes. RMSProp and Adagrad have half of this overhead. SGD (without momentum) has no optimizer overhead, of course.
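To make the arithmetic in this exchange concrete, here is a rough sketch of the per-parameter byte counts. It assumes the optimizer state is stored in the same precision as the weights (as both comments do; mixed-precision setups often keep fp32 state instead) and counts only weights plus optimizer state, not gradients or activations. The table and function names are made up for illustration.

```python
# Bytes per element for each floating-point format.
DTYPE_BYTES = {"fp16": 2, "fp32": 4, "fp64": 8}

# Assumed number of per-parameter state buffers each optimizer keeps:
# Adam tracks first and second moments, RMSProp and Adagrad track one
# running statistic, and plain SGD (no momentum) tracks nothing.
STATE_BUFFERS = {"adam": 2, "rmsprop": 1, "adagrad": 1, "sgd": 0}


def optimizer_state_bytes(optimizer: str, dtype: str) -> int:
    """Bytes of optimizer state per parameter (state in the same dtype)."""
    return STATE_BUFFERS[optimizer] * DTYPE_BYTES[dtype]


def bytes_per_parameter(optimizer: str, dtype: str) -> int:
    """Bytes for one weight plus its optimizer state."""
    return DTYPE_BYTES[dtype] + optimizer_state_bytes(optimizer, dtype)


if __name__ == "__main__":
    for optimizer in STATE_BUFFERS:
        for dtype in DTYPE_BYTES:
            state = optimizer_state_bytes(optimizer, dtype)
            total = bytes_per_parameter(optimizer, dtype)
            print(f"{optimizer:8s} {dtype}: state={state:2d} B, weight+state={total:2d} B")
```

Under these assumptions, Adam with fp16 comes to 6 bytes per parameter and Adam with fp64 adds 16 bytes of state, matching the figures quoted above.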