> NaN/infinity values can propagate and cause chaos
NaN is the most misunderstood feature of IEEE floating point. Most people react to a NaN like they'd react to the dentist telling them they need a root canal. But NaN is actually a very valuable and useful tool!
NaN is just a value that represents an invalid floating point value. The result of any operation on a NaN is a NaN. This means that NaNs propagate from the source of the original NaN to the final printed result.
"This sounds terrible" you might think.
But let's study it a bit. Suppose you are searching an array for a value, and the value is not in the array. What do you return for an index into the array? People often use -1 as the "not found" value. But then what happens when the -1 value is not noticed? It winds up corrupting further attempts to use it. The problem is that integers do not have a NaN value to use for this.
What's the result of sqrt(-1.0)? It's not a number, so it's a NaN. If a NaN appears in your results, you know you've got a mistake in your algorithm or initial values. Yes, I know, it can be clumsy to trace it back to its source, but I submit it is better than having a bad result go unrecognized.
NaN has value beyond that. Suppose you have an array of sensors. One of those sensors goes bad (like they always do). What value do you use for the bad sensor? NaN. Then, when the data is crunched, if the result is NaN, you know that your result comes from bad data. Compare with setting the bad input to 0.0. You never know how that affects your results.
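The propagation behaviour is easy to see in any language with IEEE doubles; here is a quick Python sketch of the sensor scenario (the reading values are made up, of course):

```python
import math

good = [4.2, 3.9, 4.0]
bad = float("nan")          # failed sensor marked with NaN

avg = (sum(good) + bad) / 4
assert math.isnan(avg)      # the NaN propagated all the way into the result
assert avg != avg           # NaN is the only value that compares unequal to itself

# compare: marking the bad sensor as 0.0 just silently skews the average
assert not math.isnan((sum(good) + 0.0) / 4)
```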
This is why D (in one of its more controversial choices) sets uninitialized floating point values to NaN rather than the more conventional choice of 0.0.
> But let's study it a bit. Suppose you are searching an array for a value, and the value is not in the array. What do you return for an index into the array? People often use -1 as the "not found" value. But then what happens when the -1 value is not noticed? It winds up corrupting further attempts to use it. The problem is that integers do not have a NaN value to use for this.
You return (value, found), (value, error), or Result&lt;T&gt;.
>NaN has value beyond that. Suppose you have an array of sensors. One of those sensors goes bad (like they always do). What value to you use for the bad sensor? NaN. Then, when the data is crunched, if the result is NaN, you know that your result comes from bad data. Compare with setting the bad input to 0.0. You
never know how that affects your results.
You return an error and handle it. You want to know whether the sensor is wonky or is returning bad data.
Also, you technically should use a signalling NaN for "this is an error" and a quiet NaN for "this is just an impossible math result", which makes it even more error-prone. Just return a fucking error if it is a function.
Sure, NaN is useful inside expressions, but the handling should happen then and there, and if a function can fail it should return the error explicitly; otherwise you have different error handling for different kinds of functions.
> This is why D (in one of its more controversial choices) sets uninitialized floating point values to NaN rather than the more conventional choice of 0.0.
I'd like to see how much of the code actually uses that as a feature, rather than just setting it to 0.0 (or initializing it right away).
I think the concept of NaNs is sound, but I think relying on them is fraught with peril, made so by the unobvious test for NaN-ness in many languages (i.e., "if (x != x)"), and the lure of people who want to turn on "fast math" optimizations, which do things like assume NaNs aren't possible and then dead-code-eliminate everything that's guarded by an "x != x" test.
Really though, I'm a fan, I just think that we need better means for checking them in legacy languages and we need to entirely do away with "fast math" optimizations.
> This means that NaNs propagate from the source of the original NaN to the final printed result.
An exception would be better. Then you immediately get at the first problem instead of having to track down the lifetime of the observed problem to find the first problem.
> This is why D (in one of its more controversial choices) sets uninitialized floating point values to NaN rather than the more conventional choice of 0.0.
This is definitely something I like about D, but I'd much prefer a compiler error. double x; double y = x+1.5; is less than optimal.
Only if you’re working with pure enough functions, I think? Otherwise I might be polluting state with NaN along the way. And therefore need to error handle the whole way. So I may rather prefer to catch NaN up front and throw an error.
Loud and clear on the other point. I work with ROS a lot and the message definitions do not support NaN or null. So we have these magic numbers all over the place for things like “the robot does not yet know where it is.” (therefore: on the moon.)
If you have only a couple of minutes to develop a mental model of floating-point numbers (and you have none currently), the most valuable thing IMO would be to spend them staring at a diagram like this one: https://upload.wikimedia.org/wikipedia/commons/b/b6/Floating... (uploaded to Wikipedia by user Joeleoj123 in 2020, made using Microsoft Paint) — it already covers the main things you need to know about floating-point, namely there are only finitely many discrete representable values (the green lines), and the gaps between them are narrower near 0 and wider further away.
With just that understanding, you can understand the reason for most of the examples in this post. You avoid both the extreme of thinking that floating-point numbers are mathematical (exact) real numbers, and the extreme of "superstition" like believing that floating-point numbers are some kind of fuzzy blurry values and that any operation always has some error / is "random", etc. You won't find it surprising why 0.1 + 0.2 ≠ 0.3, but 1.0 + 2.0 will always give 3.0, but 100000000000000000000000.0 + 200000000000000000000000.0 ≠ 300000000000000000000000.0. :-) (Sure this confidence may turn out to be dangerous, but it's better than "superstition".) The second-most valuable thing, if you have 5–10 minutes, may be to go to https://float.exposed/ and play with it for a while.
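Those three claims are cheap to verify in any language with IEEE doubles, e.g. in Python:

```python
# near 0.1 the representable values don't land where decimal fractions do
assert 0.1 + 0.2 != 0.3
# small integers fall exactly on representable values, so this is always exact
assert 1.0 + 2.0 == 3.0
# out at 1e23 the gap between adjacent representable values is enormous
assert 1e23 + 2e23 != 3e23
```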
Anyway, great post as always from Julia Evans. Apart from the technical content, her attitude is really inspiring to me as well, e.g. the contents of the “that’s all for now” section at the end.
The page layout example ("example 7") illustrates the kind of issue that led Knuth to avoid floating-point arithmetic in TeX (except where it doesn't matter) and do everything with scaled integers (fixed-point arithmetic). (It was even worse back then, before IEEE 754.)
I think things like fixed-point arithmetic, decimal arithmetic, and maybe even exact real arithmetic / interval arithmetic are actually more feasible these days, and it's no longer obvious to me that floating-point should be the default that programming languages guide programmers towards.
If you have even less time, just think of them as representing physical measurements made with practical instruments and the math done with analog equipment.
The common cause of floating point problems is usually treating them as a mathematical ideal. The quirks appear at the extremes, when you try to do un-physical things with them. You can't measure exactly 0 V with a voltmeter, or use an instrument for measuring the distance to stars and then add a length obtained from a micrometer without entirely losing the latter's contribution.
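A one-liner makes the analogy concrete: adding a micrometre to a distance on the scale of light-years changes nothing, because the gap between adjacent doubles at that magnitude is about two metres:

```python
light_year_m = 9.46e15   # roughly one light-year, in metres
# the micrometre contribution is far below half an ulp and is lost entirely
assert light_year_m + 1e-6 == light_year_m
```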
Example 4 mentions that the result might be different with the same code. Here is an example that is particularly counter-intuitive.
Some CPUs have the instruction FMA(a,b,c) = ab + c, and it is guaranteed to be rounded to the nearest float. You might think that using FMA will lead to more accurate results, which is true most of the time.
However, assume that you want to compute a dot product between 2 orthogonal vectors, say (u,v) and (w,u) where w = -v. You will write:
p = uv + wu
Without FMA, that amounts to two products and an addition between two opposite numbers. This results in p = 0, which is the expected result.
With FMA, the compiler might optimize this code to:
p = FMA(u, v, wu)
That is one FMA and one product. Now the issue is that wu is rounded to the nearest float, say x, which is not exactly -vu. So the result will be the nearest float to uv + x, which is not zero!
So even for a simple formula like this, testing whether two vectors are orthogonal would not necessarily work by testing if the result is exactly zero. One recommended workaround in this case is to test if the dot product has an absolute value smaller than a small threshold.
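You can simulate the effect without FMA hardware by computing the fused intermediate exactly with rationals (a sketch; `u` and `v` are arbitrary example values):

```python
from fractions import Fraction

u, v = 0.1, 0.3
w = -v

# without FMA: fl(u*v) and fl(w*u) are the same float with opposite signs
assert u * v + w * u == 0.0

# an FMA computes u*v exactly, adds the already-rounded w*u, then rounds once;
# reproduce that exact intermediate sum with rationals
fma_intermediate = Fraction(u) * Fraction(v) + Fraction(w * u)
assert fma_intermediate != 0           # so the fused result cannot be zero
assert float(fma_intermediate) != 0.0  # and it rounds to a nonzero float
```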
Note that with gcc/clang you can control the auto-use of fma with compile flags (-ffp-contract=off). It is pretty crazy imho that gcc defaults to using fma
Yes, and this is not just a theoretical concern: There was an article here [1] in 2021 claiming that Apple M1's FMA implementation had "flaws". There was actually no such flaw. Instead, the author was caught off guard by the very phenomenon you are describing.
Back in university I was taking part in a programming competition. I don't remember the exact details of the problem, but it was expected to be solved with dynamic programming, with dp[n][n] as the answer, n < 1000. But, wrangling some numbers around, one could show that dp[n][n] = dp[n-1][n-1] + 1/n, and the answer was just the sum of the first n elements of the harmonic series. Unluckily for us, the intended solution had worse precision and our solution failed.
They didn't take into account that floats come with an estimated uncertainty, and that values that are the same within the limits of experimental error are identical? That's a really badly set problem!
It can also come up in simple control systems. A simple low-pass filter can fail to converge to a steady state value if the time constant is long and the sample rate is high.
Y += (X-Y) * alpha * dt
When dt is small and alpha is too, the right hand side can be too small to affect the 24bit mantissa of the left.
I prefer a 16/32bit fixed point version that guarantees convergence to any 16bit steady state. This happened in a power conversion system where dt=1/40000 and I needed a filter in the 10's of Hz.
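You can reproduce the stall in a few lines by rounding every intermediate to single precision (the alpha value here is an arbitrary illustration):

```python
import struct

def f32(x):
    """Round a Python float to the nearest IEEE single."""
    return struct.unpack("f", struct.pack("f", x))[0]

X, Y = 0.0, 1.0              # setpoint and current filter state
alpha, dt = 0.001, 1 / 40000

update = f32((X - Y) * alpha * dt)   # about -2.5e-8
assert update != 0.0                 # the step itself is representable...
assert f32(Y + update) == Y          # ...but too small to move Y's 24-bit mantissa
```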
> It's usually only needed when adding billions of values together and the accumulated truncation errors would be at an unacceptable level.
OTOH, it’s easy to implement, so I have a couple of functions to do it easily, and I got quite a lot of use out of them. It’s probably overkill sometimes, but sometimes it’s useful.
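For reference, here's the usual compensated (Kahan) summation sketch; `math.fsum` provides a correctly rounded result to compare against:

```python
import math

def kahan_sum(xs):
    total = 0.0
    comp = 0.0                    # running compensation for lost low-order bits
    for x in xs:
        y = x - comp
        t = total + y
        comp = (t - total) - y    # recovers what the addition just dropped
        total = t
    return total

vals = [0.01] * 10_000
# the compensated sum lands essentially on the correctly rounded total
assert abs(kahan_sum(vals) - math.fsum(vals)) < 1e-10
```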
I had one issue where pdftotext would produce different output on different machines (Linux vs Mac). It broke some of our tests.
I tracked down where it was happening (involving an ==), but it magically stopped when I added print statements or looked at it in the debugger.
It turns out the x86 was running the math at a higher precision and truncating when it moved values out of registers - as soon as it hit memory, things were equal. MacOS was defaulting to -ffloat-store to get consistency (their UI library is float based).
There were too many instances of == in that code base (which IMO is a bad idea with floats), so I just added -ffloat-store to the Linux build and called it a day.
x86 (x87) FP is notoriously inconsistent because of the 80 bit extended precision that may or may not be used. In a JITed language like Java/C# it’s even less fun, as it can theoretically be inconsistent even for the same compiled program on different machines.
Thankfully the solution to that problem came when x86 (32 bit) mostly disappeared.
macOS doesn't use -ffloat-store, it uses SSE for floats instead of x87 because it only supports machines with SSSE3. You wanted -mfpmath=sse or -march=native.
One thing that pains me about this kind of zoo of problems is that people often have the takeaway, "floating point is full of unknowable, random errors, never use floating point, you will never understand it."
Floating point is amazingly useful! There's a reason why it's implemented in hardware in all modern computers and why every programming language has a built-in type for floats. You should use it! And you should understand that most of its limitations are an inherent mathematical and fundamental limitation, it is logically impossible to do better on most of its limitations:
1. Numerical error is a fact of life, you can only delay it or move it to another part of your computation, but you cannot get rid of it.
2. You cannot avoid working with very small or very large things because your users are going to try, and floating point or not, you'd better have a plan ready.
3. You might not like that floats are in binary, which makes decimal arithmetic look weird. But doing decimal arithmetic does not get rid of numerical error, see point 1 (and binary arithmetic thinks your decimal arithmetic looks weird too).
But sure, don't use floats for ID numbers, that's always a problem. In fact, don't use bigints either, nor any other arithmetic type for something you won't be doing arithmetic on.
> Floating point is amazingly useful! There's a reason why it's implemented in hardware in all modern computers and why every programming language has a built-in type for floats.
I completely agree with you even though I go out of my way to avoid FP, and even though, due to what I usually work on, I can often get away with avoiding FP (often fixed point works -- for me).
IEEE-754 is a marvelous standard. It's a short, easy to understand standard attached to an absolutely mind-boggling number of special cases and explanations of why certain decisions in the simple standard were actually incredibly important (and often really smart and non-obvious). It's the product of some very smart people who had, through their careers, made FP implementations and discovered why various decisions turned out to have been bad ones.
I'm glad it's in hardware, and not just because FP used to be quite slow and different on every machine. I'm glad it's in hardware because chip designers (unlike most software developers) are anal about getting things right, and implementing FP properly is hard -- harder than using it!
> And you should understand that most of its limitations are an inherent mathematical and fundamental limitation, it is logically impossible to do better on most of its limitations
> 3. You might not like that floats are in binary, which makes decimal arithmetic look weird. But doing decimal arithmetic does not get rid of numerical error, see point 1 (and binary arithmetic thinks your decimal arithmetic looks weird too).
One thing that I suspect trips people up a lot is decimal string/literal &lt;-&gt; (binary) float conversions, rather than the floating point math itself. This includes the classic 0.1+0.2 thing, and many of the problems in the article.
I think these days using floating point hex strings/literals more would help a lot. There are also decimal floating point numbers that people largely ignore despite being standard for over 15 years
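Python exposes both directions of that conversion, which makes the point easy to demonstrate:

```python
# the decimal literal 0.1 already isn't 0.1 once converted to binary
assert (0.1).hex() == "0x1.999999999999ap-4"

# a hex float literal names a double exactly, with no conversion surprise
assert float.fromhex("0x1.8p0") == 1.5

# and hex round-trips losslessly, unlike short decimal strings
x = 0.1 + 0.2
assert float.fromhex(x.hex()) == x
```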
> One thing that pains me about this kind of zoo of problems is that people often have the takeaway, "floating point is full of unknowable, random errors, never use floating point, you will never understand it."
> Floating point is amazingly useful!
Another thing about floats is that they are for the most part actually very predictable. In particular, all basic operations should produce bit-exact results to the last ulp. Also, because they are a language-independent standard, you can generally get the same behavior in different languages and on different platforms. This makes learning floats properly worthwhile, because the knowledge is so widely applicable.
> There's a reason why it's implemented in hardware in all modern computers
Yah, legacy.
The reason we used it originally is that computers were small and slow. Now that they're big and fast we could do without it, except that there is already so much hardware and software out there that it will never happen.
Related: In numerical analysis, I found the distinction between forwards and backwards numerical error to be an interesting concept. The forwards error initially seems like the only right kind, but is often impossible to keep small in numerical linear algebra. In particular, Singular Value Decomposition cannot be computed with small forwards error. But the SVD can be computed with small backwards error.
Also: The JSON example is nasty. Should IDs then always be strings?
Specs, vs. their implementations, vs. backwards compat. JSON just defines a number type, and neither the grammar nor the spec places limits on it (though the spec does call out exactly this problem). So the JSON is to-spec valid. But implementations have limits as to what they'll decode: JS's is that it decodes to number (a double) by default, and thus, loses precision.
(I feel like this issue is pretty well known, but I suppose it probably bites everyone once.)
JS does have the BigInt type, nowadays. Unfortunately, while the JSON.parse API includes a "reviver" parameter, the way it ends up working means that it can't actually take advantage of BigInt.
> Should IDs then always be strings?
That's a decent-ish solution; as it side-steps the interop issues. String, to me, is not unreasonable for an ID, as you're not going to be doing math on it.
My 'favourite' is that the quadratic formula (-b±sqrt(b²-4ac))/(2a) falls apart when you solve for the positive solution using floating point for cases where ε=b²/4ac is small, the workaround being to use the binomial expansion -b/2a*(0.5ε-0.125ε²+O(ε³))
This one is fun! However, that particular formula is not my favorite, as it doesn't behave nicely when a is small. The art here is deciding what to special case.
The way I approach this[1] is to choose the sign of the ± to be the same as -b to compute the first root x1. Then the second root is c/(a * x1). There are a few other checks in the code for things overflowing, but that basically gives you very accurate answers across the range.
This is explained a bit in a good Math Exchange post[2]. Jim Blinn's "How to solve a quadratic equation" also makes good reading.
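A sketch of that approach in Python (the test coefficients are arbitrary; the naive formula loses most of the small root's digits on inputs like these):

```python
import math

def solve_quadratic(a, b, c):
    """Roots of ax^2 + bx + c, avoiding cancellation in -b ± sqrt(...)."""
    sq = math.sqrt(b * b - 4 * a * c)     # assumes real roots and a != 0
    q = -(b + math.copysign(sq, b)) / 2   # choose the sign that adds, not cancels
    return q / a, c / q                   # second root via Vieta: x1 * x2 = c/a

x1, x2 = solve_quadratic(1.0, 1e8, 1.0)
# true roots are approximately -1e8 and -1e-8
assert math.isclose(x1, -1e8, rel_tol=1e-12)
assert math.isclose(x2, -1e-8, rel_tol=1e-12)
```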
> example 4: different languages sometimes do the same floating point calculation differently
It's worse than that: You can use the same language, compiler, library, machine, and still get different results if your OS is different.
I forget all the details, but it boils down to how intermediate results are handled. When you compute certain functions, there are several intermediate calculations before it spits out the result. You get more accuracy if you allow those intermediate calculations to happen in a higher precision format (e.g. you're computing in 32 bits, so it will compute the intermediate values in 64 bits). But that is also slower.
OS's make a "default" choice. I think Linux defaults to slower, but more accurate, and BSD defaults to faster, but less accurate.
There may be flags you can set to force one configuration regardless of the OS, but you shouldn't assume your libraries do that.
> In principle you might think that different implementations should work the same way because of the IEEE 754 standard for floating point, but here are a couple of caveats that were mentioned:
> math operations in libc (like sin/log) behave differently in different implementations. So code using glibc could give you different results than code using musl
IEEE 754 doesn't mandate a certain level of accuracy for transcendental functions like sin/log. You shouldn't expect different libraries to give you the same value. If you're doing 64 bit calculations, I would imagine most math libraries will give results accurate enough for 99.99% of math applications, even if only the first 45 bits are correct (and this would be considered "very inaccurate" by FP standards).
Regarding denormal/subnormal numbers mentioned as "weird": the main issue with them is that their hardware implementation is awfully slow, to the point of being unusable for most computation cases with even moderate FLOPs
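For reference, the subnormal range for doubles sits below `sys.float_info.min`, and its bottom end behaves oddly:

```python
import sys

tiny = 5e-324                             # smallest positive subnormal double
assert 0.0 < tiny < sys.float_info.min    # below the smallest *normal* double
assert tiny / 2 == 0.0                    # halfway between 0 and tiny: ties-to-even gives 0
```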
Love it. I actually use Excel, which even power users take for granted, to highlight that people really need to understand the underlying system, or the system needs to have guard rails to prevent people from stubbing their toes. Microsoft even had to write a page explaining what might happen [1] with floating point weirdness.
If you're very careful, a double can be an integer type. (A 53-bit one, I think?) (I don't love this line of thinking. It has a lot of sharp edges. But JS programmers effectively do this all the time, often without thinking too hard about it.)
(And even before BigInt, there's an odd u32-esque "type" in JS; it's not a real type — it doesn't appear in the JS type system, but rather an internal one that certain operations will be converted to internally. That's why (0x100000000 | 0) == 0 ; even though 0x100000000 (and every other number in that expression, and the right answer) is precisely representable as a f64. This doesn't matter for JSON decoding, though, … and most other things.)
I believe this is technically inaccurate; while Javascript groups most of the number values under, well, "number", modern underlying implementations may perform integer operations when they recognize it is possible. There are also a couple of hacks you can do with bit operations to "work" with integers, although I don't remember them off the top of my head - typically used for truncating and whatnot, and mainly a performance thing.
Also there are typed arrays and bigints if we can throw those in, too.
What do you mean? Floating-point arithmetic is, by design, exact for small integers. The result of adding 2.0 to 3.0 is exactly 5.0. This is one of the few cases where it is perfectly legitimate to compare floats for equality.
In fact, using 64-bit doubles to represent ints you get way more ints than using plain 32-bit ints. Thus, choosing doubles to represent integers makes perfect sense (unless you worry about wasting a bit of memory and performance).
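Concretely, integer arithmetic in doubles is exact up to 2^53, and the first casualty is 2^53 + 1:

```python
assert 2.0 + 3.0 == 5.0                    # small-integer arithmetic is exact
assert float(2**53) == 2**53               # 9007199254740992 is representable
assert float(2**53 + 1) == float(2**53)    # ...but 2^53 + 1 rounds back down
```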
Example 7 really got me, can anyone explain that? I’m not sure how “modulo” operation would be implemented in hardware, if it is a native instruction or not, but one would hope it would give a result consistent with the matching divide operation.
Edit: x87 has FPREM1 which can calculate a remainder (accurately one hopes), but I can’t find an equivalent in modern SSE or AVX. So I guess you are at the mercy of your language’s library and/or compiler? Is this a library/language bug rather than a Floating Point gotcha?
The division of these numbers is 2.9999998957049091386350361962468173875300478102103478639802753918, but the nearest float to that is 3. (Exactly 3.) [2]
The modulo operation can (presumably) determine that 3X > Y, so the modulo is Y - 2X, as normal.
This gives inconsistent results, if you don't know that every float is actually a range, and "3" as a float includes some numbers that are smaller than 3.
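Python shows the same inconsistency with what appears to be the article's pair of values:

```python
import math

y, x = 13.716, 4.572           # in decimal, y is exactly 3x

assert y / x == 3.0            # the division rounds to exactly 3...
r = math.fmod(y, x)
assert r != 0.0                # ...but the remainder is nowhere near 0
assert r > 4.5                 # it's y - 2x, i.e. almost a full x
```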
This has nothing to do with the definition or implementation of the remainder or modulo function.
It is a problem that appears whenever you compose an inexact function, like the conversion from decimal to binary, with a function that is not continuous, like the remainder a.k.a. modulo function.
In decimal, 13.716 is exactly 3 times 4.572, so any kind of remainder must be zero, but after conversion from decimal to binary that relationship is no longer true, and because the remainder is not a continuous function its value may be wildly different from the correct value.
When you compute with approximate numbers, like the floating-point numbers, as long as you compose only continuous functions, the error in the final result remains bounded, and smaller errors in the inputs lead to smaller errors in the output.
However, it is enough to insert one discontinuous function in the computation chain for losing any guarantee about the magnitude of the error in the final result.
The conclusion is that whenever computing with approximate numbers (which may also use other representations, not only floating-point) you have to be exceedingly cautious when using any function that is not continuous.
Here is a Go version. It works exactly as expected, no surprises. People just need to grow up and use a modern language, not a 50-year-old out-of-date language:
package main

import "fmt"

func main() {
    var i, iterations, meters float64
    for iterations = 100_000_000; i < iterations; i++ {
        meters += 0.01
    }
    // Expected: 1000.000000 km
    fmt.Printf("Expected: %f km\n", 0.01*iterations/1000)
    // Got: 1000.000001 km
    fmt.Printf("Got: %f km\n", meters/1000)
}
AI already has led to a rethinking of computer architectures, in which the conventional von Neumann structure is replaced by near-compute and at-memory floorplans. But novel layouts aren’t enough to achieve the power reductions and speed increases required for deep learning networks. The industry also is updating the standards for floating-point (FP) arithmetic. https://semiengineering.com/will-floating-point-8-solve-ai-m...
I'm not on Mastodon, so I'll share here: I inherited some numerical software that was used primarily to prototype new algorithms and check errors for a hardware product that solved the same problem. It was known that different versions of the software produced slightly different answers, for seemingly no reason. The hardware engineer who handed it off to me didn't seem to be bothered by it. He wasn't using version control, so I couldn't dig into it immediately, but I couldn't stop thinking about it.
Soon enough I had two consecutive releases in hand, which produced different results, and which had identical numerical code. The only code I had changed that ran during the numerical calculations was code that ran between iterations of the numerical parts of the code. IIRC, it printed out some status information like how long it had been running, how many calculations it had done, the percent completed, and the predicted time remaining.
How could that be affecting the numerical calculations??? My first thought was a memory bug (the code was in C-flavored C++, with manual memory management) but I got nowhere looking for one. Unfortunately, I don't remember the process by which I figured out the answer, but at some point I wondered what instructions were used to do the floating-point calculations. The Makefile didn't specify any architecture at all, and for that compiler, on that architecture, that meant using x87 floating-point instructions.
The x87 instruction set was originally created for floating point coprocessors that were designed to work in tandem with Intel CPUs. The 8087 coprocessor worked with the 8086, the 287 with the 286, the 387 with the 386. Starting with the 486 generation, the implementation was moved into the CPU.
Crucially, the x87 instruction set includes a stack of eight 80-bit registers. Your C code may specify 64-bit floating point numbers, but since the compiled code has to copy those value into the x87 registers to execute floating-point instructions, the calculations are done with 80-bit precision. Then the values are copied back into 64-bit registers. If you are doing multiple calculations, a smart compiler will keep intermediate values in the 80-bit registers, saving cycles and gaining a little bit of precision as a bonus.
Of course, the number of registers is limited, so intermediate values may need to be copied to a 64-bit register temporarily to make room for another calculation to happen, rounding them in the process. And that's how code interleaved with numerical calculations can affect the results even if it semantically doesn't change any of the values. Calculating percent completed, printing a progress bar -- the compiler may need to move values out of the 80-bit registers to make room for these calculations, and when the code changes (like you decide to also print out an estimated time remaining) the compiler might change which intermediate values are bumped out of the 80-bit registers and rounded to 64 bits.
It was silly that we were executing these ancient instructions in 2004 on Opteron workstations, which supported SSE2, so I added a compiler flag to enable SSE2 instructions, and voila, the numerical results matched exactly from build to build. We also got a considerable speedup. I later found out that there's a bit you can flip to force x87 arithmetic to always round results to 64 bits, probably to solve exactly the problem I encountered, but I never circled back to try it.
Oh man, those 80-bit registers on 32-bit machines were weird. I was very confused as an undergrad when I ran the basic program to find machine epsilon, and was getting a much smaller epsilon than I expected on a 64-bit float. Turns out, the compiler had optimised all of my code to run on registers and I was getting the machine epsilon of the registers instead.
On the hardware, it's fundamentally integer arithmetic under the hood.
Floating point itself is an implementation on top of that.
With a slide rule it's basically truncated integers all the way, you're very limited on significant figures, you float the point in your head or some other way, and apply it afterward. You're constantly aware of any step where your significant figures drop below 3 because that's such a drastic drop from 3 down to 2. Or down to 1 which can really slap you in the face.
The number of decimal places is also naturally an integer, which is related to scientific notation. Whether a result is positive or negative is a simple flag.
On a proper computer, integer addition, subtraction, and multiplication are exact. It's division that is not exact, since quotients are usually truncated as they are written to memory at the bitness in use at the time, storing only the whole-number portion of the quotient.
Integer arithmetic can usually be made as exact as you want and it helps to consciously minimize divisions, and if necessary offset/scale the operands beforehand such that the full range of quotients is never close enough to zero for the truncation to affect your immediate result within the number of significant figures necessary for your ultimate result to be completely reliable.
The number of significant figures available on even an 8-bit computer just blows away the slide rule but not if you don't take full advantage of it.
What sometimes happens is that zero-crossing functions, when the quotients are not offset, will fluctuate between naturally taking advantage of the enhanced bitness of the hardware for large values (blowing away the slide rule a huge amount of the time), while "periodically" dropping below the accuracy of a slide rule when some quotient, especially an intermediate one, is too near zero. Floating point or integer.
If there's nobody keeping track of not just the magnitude of the possible values, but also the significant figures being carried at each point, your equations might not come out as good as a slide rule sometimes.
Edit: IOW when the final reportable result is actually a floating point number where you need to have an accurate idea how many figures to the right of the decimal place are truly valid, it might be possible to use all integer calculations to your advantage from the beginning, and confidently place the decimal point as a final act.
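The scaled-integer idea is the same trick that fixes the classic "add 0.01 repeatedly" drift; a minimal sketch (step count chosen arbitrarily):

```python
STEPS = 10_000

# floating point: each += 0.01 rounds, and the error accumulates
meters = 0.0
for _ in range(STEPS):
    meters += 0.01
assert meters != 100.0        # drifted away from the true total

# scaled integers: work in whole centimetres, place the point at the end
cm = 0
for _ in range(STEPS):
    cm += 1                   # 0.01 m == 1 cm, exactly
assert cm == STEPS
assert cm / 100 == 100.0      # one exact division as the final act
```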
Got bitten by a big number being added to a small number when converting code from double precision to single precision; subtle errors introduced this way are hard to track down.
In computer graphics, some people try to accumulate the transformations in one matrix. So A_{n+1} = T_{n} * A_{n}
where T_{n} is a small transformation like a rotation around an axis
They learn by experience that they also slowly accumulate errors and end up with a transformation matrix A that's no longer orthogonal and will skew the image.
Or people try to solve large linear systems of floating point numbers with a naive Gaussian elimination approach and end up with noise. Same with naive iterative eigenvector calculations.
Used to run into these problems all the time when I was doing work in numerical analysis.
WalterBright|3 years ago
NaN is the most misunderstood feature of IEEE floating point. Most people react to a NaN like they'd react to the dentist telling them they need a root canal. But NaN is actually a very valuable and useful tool!
NaN is just a value that represents an invalid floating point value. The result of any operation on a NaN is a NaN. This means that NaNs propagate from the source of the original NaN to the final printed result.
"This sounds terrible" you might think.
But let's study it a bit. Suppose you are searching an array for a value, and the value is not in the array. What do you return for an index into the array? People often use -1 as the "not found" value. But then what happens when the -1 value is not noticed? It winds up corrupting further attempts to use it. The problem is that integers do not have a NaN value to use for this.
What's the result of sqrt(-1.0)? It's not a number, so it's a NaN. If a NaN appears in your results, you know you've got a mistake in your algorithm or initial values. Yes, I know, it can be clumsy to trace it back to its source, but I submit it is better than having a bad result go unrecognized.
NaN has value beyond that. Suppose you have an array of sensors. One of those sensors goes bad (like they always do). What value do you use for the bad sensor? NaN. Then, when the data is crunched, if the result is NaN, you know that your result comes from bad data. Compare with setting the bad input to 0.0. You never know how that affects your results.
This is why D (in one of its more controversial choices) sets uninitialized floating point values to NaN rather than the more conventional choice of 0.0.
NaN is your friend!
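A quick Python illustration of the sensor scenario (the readings are invented for the example):

```python
import math

readings = [12.1, 11.9, float("nan"), 12.3]   # third sensor has failed
mean = sum(readings) / len(readings)
assert math.isnan(mean)          # the bad input is visible in the result

readings_zeroed = [12.1, 11.9, 0.0, 12.3]     # the 0.0 convention instead
mean_zeroed = sum(readings_zeroed) / len(readings_zeroed)
assert mean_zeroed < 10.0        # looks like a plausible reading, but is wrong
```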
ilyt|3 years ago
You return (value, found), or (value,error) or Result<T>.
>NaN has value beyond that. Suppose you have an array of sensors. One of those sensors goes bad (like they always do). What value to you use for the bad sensor? NaN. Then, when the data is crunched, if the result is NaN, you know that your result comes from bad data. Compare with setting the bad input to 0.0. You never know how that affects your results.
You return an error and handle it. You want to know the sensor is wonky or is returning bad data.
Also, you technically should use signalling NaN for "this is an error" and quiet NaN for "this is just an impossible math result", which makes it even more error prone. Just return a fucking error if it is a function.
Sure, NaN is useful in expressions, but the handling should happen there and then, and if a function can fail it should return the error explicitly as an error; otherwise you have different error handling for different types of functions.
> This is why D (in one of its more controversial choices) sets uninitialized floating point values to NaN rather than the more conventional choice of 0.0.
I'd like to see how much of the code actually uses that as a feature and not just sets it to 0.0 (or initializes it right away)
dragonwriter|3 years ago
None in an Optional/Maybe, an Error in a Result, null in a nullable type, an exception...
maximilianburke|3 years ago
Really though, I'm a fan, I just think that we need better means for checking them in legacy languages and we need to entirely do away with "fast math" optimizations.
jordigh|3 years ago
It's not often that I get to correct Mr D himself, but 1.0/0.0 is...
inetknght|3 years ago
An exception would be better. Then you immediately get at the first problem instead of having to track down the lifetime of the observed problem to find the first problem.
bachmeier|3 years ago
This is definitely something I like about D, but I'd much prefer a compiler error. double x; double y = x+1.5; is less than optimal.
pwpwp|3 years ago
> What do you return for an index into the array?
An option/maybe type would solve this much better.
> Yes, I know, it can be clumsy to trace it back to its source
An exception would be much better, alerting you to the exact spot where the problem occurred.
Waterluvian|3 years ago
Loud and clear on the other point. I work with ROS a lot and the message definitions do not support NaN or null. So we have these magic numbers all over the place for things like “the robot does not yet know where it is.” (therefore: on the moon.)
UncleMeat|3 years ago
That's not true! maxNum(nan, x) = x.
unknown|3 years ago
[deleted]
svat|3 years ago
With just that understanding, you can understand the reason for most of the examples in this post. You avoid both the extreme of thinking that floating-point numbers are mathematical (exact) real numbers, and the extreme of "superstition" like believing that floating-point numbers are some kind of fuzzy blurry values and that any operation always has some error / is "random", etc. You won't find it surprising that 0.1 + 0.2 ≠ 0.3, but 1.0 + 2.0 will always give 3.0, and yet 100000000000000000000000.0 + 200000000000000000000000.0 ≠ 300000000000000000000000.0. :-) (Sure, this confidence may turn out to be dangerous, but it's better than "superstition".) The second-most valuable thing, if you have 5–10 minutes, may be to go to https://float.exposed/ and play with it for a while.
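All three of those facts are checkable directly; in Python, whose float is an IEEE 754 double:

```python
# Every operation rounds its true result to the nearest representable double.
assert 1.0 + 2.0 == 3.0      # small integers are exact, so the sum is exact
assert 0.1 + 0.2 != 0.3      # none of these three decimals is representable
assert 1e23 + 2e23 != 3e23   # the rounded operands sum to a different double
                             # than the one the literal 3e23 rounds to
```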
Anyway, great post as always from Julia Evans. Apart from the technical content, her attitude is really inspiring to me as well, e.g. the contents of the “that’s all for now” section at the end.
The page layout example ("example 7") illustrates the kind of issue because of which Knuth avoided floating-point arithmetic in TeX (except where it doesn't matter) and does everything with scaled integers (fixed-point arithmetic). (It was even worse then before IEEE 754.)
I think things like fixed-point arithmetic, decimal arithmetic, and maybe even exact real arithmetic / interval arithmetic are actually more feasible these days, and it's no longer obvious to me that floating-point should be the default that programming languages guide programmers towards.
sacrosancty|3 years ago
The common cause of floating point problems is usually treating floats as a mathematical ideal. The quirks appear at the extremes, when you try to do un-physical things with them. You can't measure exactly 0 V with a voltmeter, or use an instrument for measuring the distance to stars and then add a length obtained from a micrometer without entirely losing the latter's contribution.
guyomes|3 years ago
Some CPUs have the instruction FMA(a,b,c) = ab + c, and it is guaranteed to be rounded to the nearest float. You might think that using FMA will lead to more accurate results, which is true most of the time.
However, assume that you want to compute a dot product between 2 orthogonal vectors, say (u,v) and (w,u) where w = -v. You will write:
p = uv + wu
Without FMA, that amounts to two products and an addition between two opposite numbers. This results in p = 0, which is the expected result.
With FMA, the compiler might optimize this code to:
p = FMA(u, v, wu)
That is one FMA and one product. Now the issue is that wu is rounded to the nearest float, say x, which is not exactly -vu. So the result will be the nearest float to uv + x, which is not zero!
So even for a simple formula like this, testing if two vectors are orthogonal would not necessarily work by testing if the result is exactly zero. One recommended workaround in this case is to test if the dot product has an absolute value smaller than a small threshold.
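The effect can be reproduced in Python without compiler FMA support by doing the product that FMA keeps exact internally in rational arithmetic (the `Fraction`s only emulate the FMA's internal exactness; the values are arbitrary):

```python
from fractions import Fraction

u, v = 0.1, 0.3
w = -v                      # exact: negation never rounds

# Plain evaluation: both products round identically, so they cancel to 0.
plain = u * v + w * u
assert plain == 0.0

# FMA-style evaluation: fma(u, v, wu) keeps u*v exact internally and
# rounds only once at the end. Emulated with exact rationals:
wu = w * u                                   # rounded once, like the compiled wu
exact = Fraction(u) * Fraction(v) + Fraction(wu)
fma_style = float(exact)                     # the single final rounding
assert fma_style != 0.0                      # the perfect cancellation is gone
```

This is exactly why the comparison-against-a-small-threshold workaround is recommended over testing for exact zero.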
zokier|3 years ago
lanstin|3 years ago
thxg|3 years ago
[1] https://news.ycombinator.com/item?id=27880461
unknown|3 years ago
[deleted]
kilotaras|3 years ago
Back in university I was taking part in a programming competition. I don't remember the exact details of the problem, but it was expected to be solved with dynamic programming, with dp[n][n] as the answer, n < 1000. But, wrangling some numbers around, one could show that dp[n][n] = dp[n-1][n-1] + 1/n, so the answer was just the sum of the first n elements of the harmonic series. Unluckily for us, the intended solution had worse precision, and our solution failed.
HarryHirsch|3 years ago
kloch|3 years ago
There is a simple workaround for this:
https://en.wikipedia.org/wiki/Kahan_summation
It's usually only needed when adding billions of values together and the accumulated truncation errors would be at an unacceptable level.
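A minimal Python sketch of Kahan summation, checked against `math.fsum` as the correctly rounded reference (the test values are illustrative):

```python
import math

def kahan_sum(xs):
    """Compensated summation: carries each step's rounding error forward."""
    total = 0.0
    c = 0.0                      # running compensation (lost low-order bits)
    for x in xs:
        y = x - c                # feed the previous rounding error back in
        t = total + y
        c = (t - total) - y      # (t - total) recovers what was really added
        total = t
    return total

vals = [0.1] * 1_000_000
naive = sum(vals)
compensated = kahan_sum(vals)
exact = math.fsum(vals)          # correctly rounded reference sum
assert abs(compensated - exact) < abs(naive - exact)
```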
phkahler|3 years ago
Y += (X-Y) * alpha * dt
When dt is small and alpha is too, the right hand side can be too small to affect the 24-bit mantissa of the left.
I prefer a 16/32-bit fixed-point version that guarantees convergence to any 16-bit steady state. This happened in a power conversion system where dt=1/40000 and I needed a filter in the 10's of Hz.
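The stall is reproducible in Python by rounding each step to single precision with a struct round-trip (the `f32` helper and the constants are mine, chosen to trigger the effect):

```python
import struct

def f32(x):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

Y = f32(1.0)                         # filter state
X = f32(2.0)                         # input the filter should converge toward
alpha_dt = 1e-8                      # alpha * dt, well below ulp(1.0) ~ 1.2e-7

step = f32(f32(X - Y) * alpha_dt)    # the (X - Y) * alpha * dt update term
Y_next = f32(Y + step)
assert Y_next == Y                   # the update is absorbed: Y never moves
```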
kergonath|3 years ago
OTOH, it’s easy to implement, so I have a couple of functions to do it easily, and I got quite a lot of use out of them. It’s probably overkill sometimes, but sometimes it’s useful.
dunham|3 years ago
I tracked down where it was happening (involving an ==), but it magically stopped when I added print statements or looked at it in the debugger.
It turns out the x86 was running the math at a higher precision and truncating when it moved values out of registers - as soon as it hit memory, things were equal. MacOS was defaulting to -ffloat-store to get consistency (their UI library is float based).
There were too many instances of == in that code base (which IMO is a bad idea with floats), so I just added -ffloat-store to the Linux build and called it a day.
alkonaut|3 years ago
Thankfully the solution to that problem came when x86 (32 bit) mostly disappeared.
astrange|3 years ago
jordigh|3 years ago
Floating point is amazingly useful! There's a reason why it's implemented in hardware in all modern computers and why every programming language has a built-in type for floats. You should use it! And you should understand that most of its limitations are an inherent mathematical and fundamental limitation, it is logically impossible to do better on most of its limitations:
1. Numerical error is a fact of life, you can only delay it or move it to another part of your computation, but you cannot get rid of it.
2. You cannot avoid working with very small or very large things because your users are going to try, and floating point or not, you'd better have a plan ready.
3. You might not like that floats are in binary, which makes decimal arithmetic look weird. But doing decimal arithmetic does not get rid of numerical error, see point 1 (and binary arithmetic thinks your decimal arithmetic looks weird too).
But sure, don't use floats for ID numbers, that's always a problem. In fact, don't use bigints either, nor any other arithmetic type for something you won't be doing arithmetic on.
gumby|3 years ago
I completely agree with you even though I go out of my way to avoid FP, and even though, due to what I usually work on, I can often get away with avoiding FP (often fixed point works -- for me).
IEEE-754 is a marvelous standard. It's a short, easy-to-understand standard attached to an absolutely mind-boggling number of special cases and explanations of why certain decisions in the simple standard were actually incredibly important (and often really smart and non-obvious). It's the product of some very smart people who had, through their careers, made FP implementations and discovered why various decisions turned out to have been bad ones.
I'm glad it's in hardware, and not just because FP used to be quite slow and different on every machine. I'm glad it's in hardware because chip designers (unlike most software developers) are anal about getting things right, and implementing FP properly is hard -- harder than using it!
ogogmad|3 years ago
You can do exact real arithmetic. But this is only done by people who prove theorems with computers - or by the Android calculator! https://en.wikipedia.org/wiki/Computable_analysis
Other alternatives (also niche) are exact rational arithmetic, computer algebra, arbitrary precision arithmetic.
Fixed point sometimes gets used instead of floats because some operations lose no precision over them, but most operations still do.
zokier|3 years ago
One thing that I suspect trips people up a lot is decimal string/literal <-> (binary) float conversions, rather than the floating point math itself. This includes the classic 0.1+0.2 thing, and many of the problems in the article.
I think using floating point hex strings/literals more would help a lot these days. There are also decimal floating point numbers, which people largely ignore despite their being standard for over 15 years.
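Python exposes both directions of the hex representation, which sidesteps the decimal conversion step entirely:

```python
# The decimal literal 0.1 hides the rounding; the hex form shows the
# exact bits that were actually stored:
x = 0.1
assert x.hex() == '0x1.999999999999ap-4'
# Hex float literals round-trip exactly, with no decimal rounding involved:
assert float.fromhex('0x1.999999999999ap-4') == 0.1
```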
zokier|3 years ago
> Floating point is amazingly useful!
Another thing about floats is that they are for the most part actually very predictable. In particular, the basic operations are required to produce correctly rounded, bit-exact results. Also, because they follow a language-independent standard, you can generally get the same behavior in different languages and platforms. This makes learning floats properly worthwhile, because the knowledge is so widely applicable.
unknown|3 years ago
[deleted]
carapace|3 years ago
> There's a reason why it's implemented in hardware in all modern computers
Yah, legacy.
The reason we used it originally is that computers were small and slow. Now that they're big and fast we could do without it, except that there is already so much hardware and software out there that it will never happen.
ogogmad|3 years ago
Also: The JSON example is nasty. Should IDs then always be strings?
deathanatos|3 years ago
Specs, vs. their implementations, vs. backwards compat. JSON just defines a number type, and neither the grammar nor the spec places limits on it (though the spec does call out exactly this problem). So the JSON is to-spec valid. But implementations have limits as to what they'll decode: JS's is that it decodes to number (a double) by default, and thus, loses precision.
(I feel like this issue is pretty well known, but I suppose it probably bites everyone once.)
JS does have the BigInt type, nowadays. Unfortunately, while the JSON.parse API includes a "reviver" parameter, the way it ends up working means that it can't actually take advantage of BigInt.
> Should IDs then always be strings?
That's a decent-ish solution; as it side-steps the interop issues. String, to me, is not unreasonable for an ID, as you're not going to be doing math on it.
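A sketch of the boundary in Python (Python's own `json` decodes integers exactly, so the double behavior has to be shown with an explicit float conversion; the ID value is illustrative):

```python
import json

big_id = 9007199254740993    # 2**53 + 1: a plausible 64-bit ID
# Python's json decodes integers with arbitrary precision, so this round-trips:
assert json.loads(json.dumps({"id": big_id}))["id"] == big_id
# A decoder that funnels numbers through a double (as JS's JSON.parse does)
# cannot: 2**53 + 1 and 2**53 round to the same double.
assert float(big_id) == float(big_id - 1)
```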
gugagore|3 years ago
Backward error: the error between the given question, and the question whose right answer is the given answer.
Easier to parse like this: a small forward error means that you give an answer close to the right one.
A small backward error means that the answer you give is the right answer for a nearby question.
owisd|3 years ago
raphlinus|3 years ago
The way I approach this[1] is to choose the sign of the ± to be the same as -b to compute the first root x1. Then the second root is c/(a * x1). There are a few other checks in the code for things overflowing, but that basically gives you very accurate answers across the range.
This is explained a bit in a good Math Exchange post[2]. Jim Blinn's "How to solve a quadratic equation" also makes good reading.
Wait til you get to cubics and quartics.
[1]: https://docs.rs/kurbo/latest/src/kurbo/common.rs.html#116-16...
[2]: https://math.stackexchange.com/questions/866331/numerically-...
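A Python sketch of the sign-matching trick described above (the function name and structure are mine, not taken from kurbo, and degenerate cases like a = 0 are ignored):

```python
import math

def solve_quadratic(a, b, c):
    """Real roots of a*x^2 + b*x + c = 0 (assumes a != 0)."""
    d = b * b - 4.0 * a * c
    if d < 0.0:
        return ()
    # Match the sign of the +/- to -b so the two terms never cancel.
    q = -0.5 * (b + math.copysign(math.sqrt(d), b))
    x1 = q / a          # the well-conditioned root
    x2 = c / q          # the other root via Vieta: x1 * x2 == c / a
    return (x1, x2)

# The naive (-b - sqrt(d)) / (2a) loses the small root to cancellation;
# the Vieta form recovers it to full precision:
x_big, x_small = solve_quadratic(1.0, -1e8, 1.0)
assert abs(x_small - 1e-8) < 1e-20
```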
BeetleB|3 years ago
It's worse than that: You can use the same language, compiler, library, machine, and still get different results if your OS is different.
I forget all the details, but it boils down to how intermediate results are handled. When you compute certain functions, there are several intermediate calculations before it spits out the result. You get more accuracy if you allow those intermediate calculations to happen in a higher precision format (e.g. you're computing in 32 bits, so it will compute the intermediate values in 64 bits). But that is also slower.
OS's make a "default" choice. I think Linux defaults to slower, but more accurate, and BSD defaults to faster, but less accurate.
There may be flags you can set to force one configuration regardless of the OS, but you shouldn't assume your libraries do that.
> In principle you might think that different implementations should work the same way because of the IEEE 754 standard for floating point, but here are a couple of caveats that were mentioned:
> math operations in libc (like sin/log) behave differently in different implementations. So code using glibc could give you different results than code using musl
IEEE 754 doesn't mandate a certain level of accuracy for transcendental functions like sin/log. You shouldn't expect different libraries to give you the same value. If you're doing 64 bit calculations, I would imagine most math libraries will give results accurate enough for 99.99% of math applications, even if only the first 45 bits are correct (and this would be considered "very inaccurate" by FP standards).
asicsp|3 years ago
mochomocha|3 years ago
mikehollinger|3 years ago
[1] https://docs.microsoft.com/en-us/office/troubleshoot/excel/f...
BeetleB|3 years ago
A thousand thanks for not saying "addition is not commutative".
(Addition is commutative in floating point. It merely is not associative).
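In Python terms:

```python
assert 0.1 + 0.2 == 0.2 + 0.1                    # commutativity always holds
assert (0.1 + 0.2) + 0.3 != 0.1 + (0.2 + 0.3)    # associativity does not:
                                                 # the two groupings round
                                                 # at different points
```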
dahfizz|3 years ago
Can anyone justify this? Do JS developers prefer not having exact integers, or is this something that everyone just kinda deals with?
deathanatos|3 years ago
If you're very careful, a double can be an integer type. (A 53-bit one, I think?) (I don't love this line of thinking. It has a lot of sharp edges. But JS programmers effectively do this all the time, often without thinking too hard about it.)
(And even before BigInt, there's an odd u32-esque "type" in JS; it's not a real type — it doesn't appear in the JS type system, but rather an internal one that certain operations will be converted to internally. That's why (0x100000000 | 0) == 0 ; even though 0x100000000 (and every other number in that expression, and the right answer) is precisely representable as a f64. This doesn't matter for JSON decoding, though, … and most other things.)
thdc|3 years ago
Also there are typed arrays and bigints if we can throw those in, too.
enriquto|3 years ago
What do you mean? Floating-point arithmetic is, by design, exact for small integers. The result of adding 2.0 to 3.0 is exactly 5.0. This is one of the few cases where it is perfectly legitimate to compare floats for equality.
In fact, using 64-bit doubles to represent ints you get way more ints than using plain 32-bit ints. Thus, choosing doubles to represent integers makes perfect sense (unless you worry about wasting a bit of memory and performance).
josefx|3 years ago
evancox100|3 years ago
Edit: x87 has FPREM1 which can calculate a remainder (accurately one hopes), but I can’t find an equivalent in modern SSE or AVX. So I guess you are at the mercy of your language’s library and/or compiler? Is this a library/language bug rather than a Floating Point gotcha?
timerol|3 years ago
The division of these numbers is 2.9999998957049091386350361962468173875300478102103478639802753918, but the nearest float to that is 3. (Exactly 3.) [2]
The modulo operation can (presumably) determine that 3X > Y, so the modulo is Y - 2X, as normal.
This gives inconsistent-looking results if you don't know that every float actually stands for a range of real numbers, and that "3" as a float covers some numbers smaller than 3.
[1] https://www.wolframalpha.com/input?i=13.715999603271484375+%... [2] https://www.wolframalpha.com/input?i=2.999999895704909138635..., then https://float.exposed/0x40400000
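A self-contained variant of the same trap in Python, using values where the stored divisor sits slightly above its decimal meaning:

```python
import math

# 0.1 is stored slightly ABOVE one tenth, so 1.0 / 0.1 rounds to exactly
# 10.0 even though the true quotient is just under 10 -- and fmod, which
# works with the exact stored values, exposes the difference:
assert 1.0 / 0.1 == 10.0
r = math.fmod(1.0, 0.1)
assert r != 0.0                 # the stored 0.1 fits only 9 times into 1.0
assert abs(r - 0.1) < 1e-15     # so the remainder lands just under 0.1
```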
adrian_b|3 years ago
It is a problem that appears whenever you compose an inexact function, like the conversion from decimal to binary, with a function that is not continuous, like the remainder a.k.a. modulo function.
In decimal, 13.716 is exactly 3 times 4.572, so any kind of remainder must be zero, but after conversion from decimal to binary that relationship is no longer true, and because the remainder is not a continuous function its value may be wildly different from the correct value.
When you compute with approximate numbers, like the floating-point numbers, as long as you compose only continuous functions, the error in the final result remains bounded, and smaller errors in the inputs lead to smaller errors in the output.
However, it is enough to insert one discontinuous function in the computation chain for losing any guarantee about the magnitude of the error in the final result.
The conclusion is that whenever computing with approximate numbers (which may also use other representations, not only floating-point) you have to be exceedingly cautious when using any function that is not continuous.
svnpenn|3 years ago
Lind5|3 years ago
cratermoon|3 years ago
marshallward|3 years ago
https://www.marshallward.org/fortrancon2021/#/title-slide
https://m.youtube.com/watch?v=wQQuFXm6ZqU&t=27s
dkarl|3 years ago
Soon enough I had two consecutive releases in hand, which produced different results, and which had identical numerical code. The only code I had changed that ran during the numerical calculations was code that ran between iterations of the numerical parts of the code. IIRC, it printed out some status information like how long it had been running, how many calculations it had done, the percent completed, and the predicted time remaining.
How could that be affecting the numerical calculations??? My first thought was a memory bug (the code was in C-flavored C++, with manual memory management) but I got nowhere looking for one. Unfortunately, I don't remember the process by which I figured out the answer, but at some point I wondered what instructions were used to do the floating-point calculations. The Makefile didn't specify any architecture at all, and for that compiler, on that architecture, that meant using x87 floating-point instructions.
The x87 instruction set was originally created for floating point coprocessors that were designed to work in tandem with Intel CPUs. The 8087 coprocessor worked with the 8086, the 287 with the 286, the 387 with the 386. Starting with the 486 generation, the implementation was moved into the CPU.
Crucially, the x87 instruction set includes a stack of eight 80-bit registers. Your C code may specify 64-bit floating point numbers, but since the compiled code has to copy those values into the x87 registers to execute floating-point instructions, the calculations are done with 80-bit precision. Then the values are copied back into 64-bit registers. If you are doing multiple calculations, a smart compiler will keep intermediate values in the 80-bit registers, saving cycles and gaining a little bit of precision as a bonus.
Of course, the number of registers is limited, so intermediate values may need to be copied to a 64-bit register temporarily to make room for another calculation to happen, rounding them in the process. And that's how code interleaved with numerical calculations can affect the results even if it semantically doesn't change any of the values. Calculating percent completed, printing a progress bar -- the compiler may need to move values out of the 80-bit registers to make room for these calculations, and when the code changes (like you decide to also print out an estimated time remaining) the compiler might change which intermediate values are bumped out of the 80-bit registers and rounded to 64 bits.
It was silly that we were executing these ancient instructions in 2004 on Opteron workstations, which supported SSE2, so I added a compiler flag to enable SSE2 instructions, and voila, the numerical results matched exactly from build to build. We also got a considerable speedup. I later found out that there's a bit you can flip to force x87 arithmetic to always round results to 64 bits, probably to solve exactly the problem I encountered, but I never circled back to try it.
jordigh|3 years ago
fuzzfactor|3 years ago
Floating point itself is an implementation on top of that.
With a slide rule it's basically truncated integers all the way, you're very limited on significant figures, you float the point in your head or some other way, and apply it afterward. You're constantly aware of any step where your significant figures drop below 3 because that's such a drastic drop from 3 down to 2. Or down to 1 which can really slap you in the face.
The number of decimal places is also naturally an integer, which is related to scientific notation. Whether a result is positive or negative is a simple flag.
On a proper computer, integer addition, subtraction, and multiplication are exact. It's division that produces results which are not exact, since quotients are usually truncated as they are written to memory at the bitness in use at the time, storing only the whole-number portion of the quotient.
Integer arithmetic can usually be made as exact as you want and it helps to consciously minimize divisions, and if necessary offset/scale the operands beforehand such that the full range of quotients is never close enough to zero for the truncation to affect your immediate result within the number of significant figures necessary for your ultimate result to be completely reliable.
The number of significant figures available on even an 8-bit computer just blows away the slide rule but not if you don't take full advantage of it.
What sometimes happens is that zero-crossing functions, when the quotients are not offset, will fluctuate between naturally taking advantage of the enhanced bitness of the hardware for large values (blowing away the slide rule a huge amount of the time), while "periodically" dropping below the accuracy of a slide rule when some quotient, especially an intermediate one, is too near zero. Floating point or integer.
If there's nobody keeping track of not just the magnitude of the possible values, but also the significant figures being carried at each point, your equations might not come out as good as a slide rule sometimes.
Edit: IOW when the final reportable result is actually a floating point number where you need to have an accurate idea how many figures to the right of the decimal place are truly valid, it might be possible to use all integer calculations to your advantage from the beginning, and confidently place the decimal point as a final act.
Waterluvian|3 years ago
I thought the JSON spec was that a number is of any precision and that it’s up to the parser to say, “best I can do is 754.”
That is, I think you can deserialize very long decimal numbers to higher accuracy in some languages.
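Some parsers do let you opt out of the float path; Python's `json`, for instance, accepts a `parse_float` hook (the key name and value here are invented):

```python
import json
from decimal import Decimal

s = '{"price": 0.12345678901234567890123456789}'
lossy = json.loads(s)["price"]                        # goes through a double
exact = json.loads(s, parse_float=Decimal)["price"]   # keeps every digit
assert exact == Decimal("0.12345678901234567890123456789")
assert Decimal(lossy) != exact                        # digits were lost
```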
tails4e|3 years ago
toolslive|3 years ago
toolslive|3 years ago
ape4|3 years ago
I wonder if people sometimes make a one-element integer array this way so they can have an integer to work with.
lifefeed|3 years ago
jrockway|3 years ago
weakfortress|3 years ago
The PATRIOT missile error (it wasn't a disaster) was more due to the handling of timestamps than just floating point deviation. There were several concurrent failures that allowed the SCUD to hit its target. IIRC the clock drift was significant and was magnified by being converted to floating point and, importantly, truncated into a 24-bit register. Moreover, the results weren't "slightly off": the clock drift alone put the missile considerably off target.
While I don't claim that floating point didn't have a hand in this error, it's likely that correct handling of timestamps would not have introduced the problem in the first place. Unlike the other examples given, this one is a better example of knowing your system and problem domain rather than simply forgetting to calculate a delta or being unaware of the limitations of IEEE 754. "Good enough for government work" struck again here.
unknown|3 years ago
[deleted]
Aardwolf|3 years ago
> 1. it has a funny name
Reasoning accepted!
unknown|3 years ago
[deleted]
effnorwood|3 years ago
[deleted]