joshAg | 3 years ago
The first issue is that we're taking 64 bits of data and trying to squeeze them into 16 bits. Sure, it's not quite that bad, because some of those bits go to the sign, NaNs, and infinities, but even if you toss away the exponent entirely, that's still 52 bits of mantissa (53 significant bits, counting the implicit leading one) to squeeze into a 16-bit int.
The second issue is all the values that aren't directly expressible as an integer, either because they're infinity, NaN, too big, too small, or fractional.
The only way we can overcome these issues is to decide what exactly we mean by "converts", because while we might not _like_ it, casting to an int64 and then masking off our 16 favorite bits of the 64 available is a stable conversion. That might be silly, but it raises a valid question: what is our conversion algorithm?
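That silly-but-deterministic reading can be sketched in a few lines of Python; the function name and the choice of which 16 bits to keep are mine, purely for illustration:

```python
import struct

INT16_MASK = 0xFFFF  # our arbitrary "16 favorite bits": the lowest 16

def mask_convert(x: float) -> int:
    # Reinterpret the float64's raw bit pattern as a 64-bit integer,
    # then keep only the low 16 bits. Deterministic and stable, but
    # numerically meaningless.
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return bits & INT16_MASK

print(mask_convert(1.5))  # -> 0: 1.5 is 0x3FF8000000000000, whose low 16 bits are zero
```

Stable, yes; useful, no — which is exactly the point: "converts" alone doesn't pin down which mapping we meant.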
Maybe by "convert" we meant map the smallest float to the smallest int, the next-smallest float to the next-smallest int, and so on, either wrapping around or just pegging everything left over at int16.max.
Or maybe we meant to start from the biggest float and the biggest int and work downward, the inverse of the previous mapping. Those two give very different results.
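The smallest-to-smallest ordinal mapping can be sketched with the standard bit trick for totally ordering float64 bit patterns; the function names and the wrap-vs-peg flag here are hypothetical choices, not anyone's established API:

```python
import math
import struct

INT16_MIN, INT16_MAX = -2**15, 2**15 - 1

def float_ordinal(x: float) -> int:
    # Map a float64 bit pattern to an integer key whose numeric order
    # matches the floats' numeric order (flip the low 63 bits of
    # negative patterns -- the classic total-ordering trick).
    bits = struct.unpack("<q", struct.pack("<d", x))[0]
    return bits ^ 0x7FFFFFFFFFFFFFFF if bits < 0 else bits

def ordinal_to_int16(x: float, wrap: bool = False) -> int:
    # n = how many representable floats lie between -inf and x.
    n = float_ordinal(x) - float_ordinal(float("-inf"))
    if wrap:
        return n % 2**16 + INT16_MIN     # wrap around on overflow
    return min(INT16_MIN + n, INT16_MAX)  # peg the rest at int16.max

print(ordinal_to_int16(float("-inf")))                       # -32768
print(ordinal_to_int16(math.nextafter(float("-inf"), 0.0)))  # -32767
print(ordinal_to_int16(0.0))                                 # 32767 (pegged)
```

Note that -0.0 (and even NaN bit patterns) still get *some* ordinal under this key, which may or may not be what we want — exactly the kind of decision the comment is pointing at.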
And we haven't even considered whether to throw on NaN or infinity, or what to do with -0, in either of those schemes.
Or maybe we meant translate each float value to the nearest representable integer? We'd have a lot of values mapping to int16.max and int16.min, and we'd still have to decide how to handle infinity, NaN, and -0, but it's doable.
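That nearest-representable reading, with one arbitrary set of answers for the special cases, might look like this — the throw-on-NaN and saturate-on-infinity policies are my picks for illustration, not the only defensible ones:

```python
import math

INT16_MIN, INT16_MAX = -2**15, 2**15 - 1

def to_int16_nearest(x: float) -> int:
    if math.isnan(x):
        raise ValueError("NaN has no nearest integer")  # policy: throw
    if math.isinf(x):
        return INT16_MAX if x > 0 else INT16_MIN        # policy: saturate
    n = round(x)  # note: Python rounds halves to even
    return max(INT16_MIN, min(INT16_MAX, n))            # clamp; -0.0 -> 0

print(to_int16_nearest(-0.0))   # 0
print(to_int16_nearest(1e300))  # 32767 (everything too big clamps here)
```

Every branch in there is a decision the word "converts" quietly glossed over.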
Basically, until we know the rough conversion function, we can't even know whether NaN, infinity, and -0 are special cases, and we can't know whether clipping will be an edge case or not. There are lots of conversions where we can happily wrap around on ourselves and there are no edge cases; lots where we have edge cases but can clip or wrap; and lots where we have edge cases and must choose between clipping and wrapping.
vaidhy | 3 years ago