deepsquirrelnet | 1 day ago

4-bit quantization is being supported in training as well on newer Nvidia hardware these days. I believe the gpt-oss models were trained natively in MXFP4, which is a 4-bit floating point format, E2M1 (2 exponent bits, 1 mantissa bit, 1 sign bit).

It doesn't seem terribly common yet though. I think it is challenging to keep it stable.

[1] https://www.opencompute.org/blog/amd-arm-intel-meta-microsof...

[2] https://www.opencompute.org/documents/ocp-microscaling-forma...
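To make the E2M1 layout concrete, here is a small sketch (not from the thread; bit order follows my reading of the OCP MX spec) that decodes all 16 possible 4-bit codes:

```python
# Hypothetical sketch: decoding 4-bit E2M1 codes (the MXFP4 element format).
# Assumed bit layout: 1 sign bit, 2 exponent bits (bias 1), 1 mantissa bit.

def decode_e2m1(code: int) -> float:
    """Decode a 4-bit E2M1 code into its real value."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exp = (code >> 1) & 0b11   # 2 exponent bits, bias = 1
    man = code & 1             # 1 mantissa bit
    if exp == 0:
        # Subnormal: no implicit leading 1
        magnitude = man * 0.5
    else:
        magnitude = (1.0 + man * 0.5) * 2.0 ** (exp - 1)
    return sign * magnitude

# The eight magnitudes E2M1 can represent:
print(sorted({abs(decode_e2m1(c)) for c in range(16)}))
# [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

Only eight distinct magnitudes exist, which is why the per-block scaling described below matters so much for dynamic range.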

zozbot234 | 13 hours ago

mxfp4 is a block-based floating point format. The E2M1 format applies to individual values, but each 32-value block also carries a shared 8-bit exponent (an E8M0 power-of-two scale) that provides scaling information for the whole block.
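A rough sketch of that block scheme (my own illustration, not spec-conformant code: it picks a power-of-two scale so the block's largest magnitude lands near E2M1's maximum of 6.0, then snaps each element to the nearest E2M1 value):

```python
import math

# Hypothetical sketch of MX-style block quantization: 32 values share one
# power-of-two scale (the E8M0 shared exponent); each element is stored
# as an E2M1 value. Grid = the eight magnitudes E2M1 can represent.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_MAX = 6.0

def quantize_block(block):
    """Return (shared_exponent, element_values) for a 32-float block."""
    assert len(block) == 32
    amax = max(abs(x) for x in block)
    # Power-of-two scale mapping the largest element near E2M1's max.
    shared_exp = math.floor(math.log2(amax / E2M1_MAX)) if amax > 0 else 0
    scale = 2.0 ** shared_exp
    out = []
    for x in block:
        mag = min(abs(x) / scale, E2M1_MAX)
        nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))
        out.append(math.copysign(nearest, x))
    return shared_exp, out

def dequantize_block(shared_exp, elements):
    scale = 2.0 ** shared_exp
    return [q * scale for q in elements]
```

The shared exponent is what recovers dynamic range: within a block everything is coarse 4-bit, but different blocks can sit at wildly different scales.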