It actually isn't "surprising" if you understand how the format works. It's essentially scientific notation in binary, with a fixed number of bits for both the mantissa and the exponent, plus a few tweaks for better behavior at the limits (like denormalized numbers). This means it can't directly express numbers that are very easy to write in decimal form, like 0.1, just like we can't express 1/3 as a finite decimal. It's designed to manage this as well as possible with the small number of bits at its disposal, but we still inevitably run into these issues.
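For instance, in Python you can see exactly what the literal 0.1 becomes once it is rounded to the nearest double:

```python
from decimal import Decimal

print((0.1).hex())       # 0x1.999999999999ap-4 -- a repeating binary pattern, truncated
print(Decimal(0.1))      # the exact decimal value of the stored double (slightly above 1/10)
print(0.1 + 0.2 == 0.3)  # False: each term was already rounded before the comparison
```

The `hex` form makes the "1/3 in decimal" analogy concrete: 1/10 in binary is the repeating pattern `0011` forever, which has to be cut off at 52 fraction bits.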
Of course, most programmers only have a vague idea of how floating point numbers work. (I'm certainly among them!) It's very easy to run into problems. And even with a better understanding of the format, it's still very difficult to predict exactly what will happen in more complex expressions.
A really cool aside is that there are some relatively new toys we can use to model floating point numbers in interesting ways. In particular, several SMT solvers including Z3[1] now support a "theory of floating point" which lets us exhaustively verify and analyze programs that use floating point numbers. I haven't seen any applications taking advantage of this directly, but I personally find it very exciting and will probably try using it for debugging the next time I have to work with numeric code.
A little while back, there was an article about how you can test floating point functions by enumerating every single 32-bit float. This is a great way of thinking! However, people were right to point out that this does not really scale when you have more than one float input or if you want to talk about doubles. This is why SMT solver support for floating point numbers is so exciting: it makes this sort of approach practical even for programs that use lots of doubles. So you can test every single double, or every single pair of doubles, or more, just by being clever about how you do it.
I haven't tried using the floating point theories, so I have no idea how they scale. However, I suspect they are not significantly worse than normal bitvectors (ie signed/unsigned integers). And those scale really well to larger sizes or multiple variables. Assuming the FP support scales even a fraction as well, this should be enough to practically verify pretty non-trivial functions!
As the answer on stackoverflow mentions, there is the -ffast-math option to gcc that tells the compiler to treat FP operations as associative. I've done some testing:
> Of course, most programmers only have a vague idea of how floating point numbers work.
From my experience, most programmers have only a vague idea of how basic integers work, so that's not so surprising. (I'm in the group who only knows a little more than average about floating-point, but for what I do, I don't use FP math much either.)
I've always thought it would be helpful to have floating point values that tracked the upper and lower bounds of the possible error. That would make analysis of the results much easier.
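That idea exists as interval arithmetic. Here is a toy sketch of the addition case (the class is my own illustration, not a real library), widening each endpoint by one ulp via `math.nextafter` (Python 3.9+) to cover the rounding of the FP add:

```python
import math
from dataclasses import dataclass

@dataclass
class Interval:
    lo: float
    hi: float
    def __add__(self, other):
        # round each endpoint outward by one ulp so the true sum stays inside
        return Interval(math.nextafter(self.lo + other.lo, -math.inf),
                        math.nextafter(self.hi + other.hi, math.inf))

x = Interval(0.1, 0.1)  # the stored 0.1 already differs from the true 1/10
y = x + x + x
print(y.lo, y.hi)       # the true 0.3 is guaranteed to lie inside [y.lo, y.hi]
```

The catch, as noted further down the thread, is that naive intervals tend to grow pessimistically wide over long computations.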
A common and simple way to treat FP mathematically is by bounding the result with the machine epsilon u [1]. For example, given FP numbers x and y:
x <fp_mul> y = (1+e)xy ; abs(e) <= u
The u ideally depends only on the FP format and rounding mode. I say ideally because library functions like sqrt may be less precise than the ideal (which is the result being equal to the exact value rounded according to the rounding mode). Note that the abstraction holds only as long as you don't go outside the range of normal exponents (it breaks when you reach infinity or subnormals).
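A quick empirical spot-check of this model, as a sketch using exact rational arithmetic (u = 2^-53 is the unit roundoff for doubles under round-to-nearest):

```python
import random
from fractions import Fraction

u = Fraction(1, 2**53)  # unit roundoff for IEEE doubles, round-to-nearest
random.seed(0)
for _ in range(1000):
    x, y = random.uniform(1e-3, 1e3), random.uniform(1e-3, 1e3)
    exact = Fraction(x) * Fraction(y)  # xy computed with no rounding at all
    e = Fraction(x * y) / exact - 1    # fl(x*y) = (1+e)xy, solve for e
    assert abs(e) <= u
print("the (1+e) model held for 1000 random products")
```

`Fraction(x)` converts a double to the exact rational it represents, which is what makes the comparison legitimate.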
EXAMPLE. Suppose we have a FP value L representing a length of something, and want to split it into segments no longer than the FP value M. Assume any integers involved are small enough to be exactly representable in FP. The answer in exact arithmetic is:
N = ceil(L/M).
However, if we evaluate this in FP, we may get a result N' < N for certain edge values, which means one of the segment lengths we output will be greater than M. So the result of the algorithm is incorrect!
We can fix it by multiplying (L/M) with a number sufficiently greater than one before the ceil, at the cost of sometimes deciding to split into one more segment than mathematically necessary. The fixed FP algorithm may then be:
N' = ceil(1.00005 <fp_mul> (L <fp_div> M))
After we apply the machine epsilon formulas we come to the conclusion that:
N' = ceil( (1+e1)(1+e2)1.00005 (L/M) ) ; abs(e1)<=u, abs(e2)<=u
We then show that:
(1+e1)(1+e2)*1.00005 >= 1.
This proves that what the algorithm gives to the ceil function is not less than the exact value L/M, and that N' >= N. And ceil() can't make errors by definition, assuming the integer value is representable. QED
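The argument above can be checked mechanically; the sketch below compares the guarded formula against exact rational arithmetic (the constant 1.00005 comfortably exceeds (1+u)^2 for doubles):

```python
import math
import random
from fractions import Fraction

random.seed(1)
for _ in range(10_000):
    L = random.uniform(1.0, 1e6)
    M = random.uniform(1e-3, 1.0)
    exact_N = math.ceil(Fraction(L) / Fraction(M))  # the true answer, no rounding
    fixed_N = math.ceil(1.00005 * (L / M))          # the guarded FP version
    assert fixed_N >= exact_N                        # never under-counts segments
print("the fixed formula never under-counted")
```

As the text says, the price is occasional over-counting: e.g. `0.3 / 0.1` evaluates to `2.999...96` in FP, and the safety factor pushes it past 3, giving 4 segments where 3 would do.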
Though, as nice as the examples are, FP rounding problems in practice are rarely due to binary/decimal conversions, which are only relevant when reading and writing decimal encoded numbers. These are normally absent when doing tasks such as moving motors based on sensor readings.
And soon, fused-multiply-add will be common enough that compilers will start generating the instruction by default for code that contains additions and multiplications, causing a regression with respect to the current situation.
Slightly off-topic here, but floating point is even harder inside the kernel, mainly because of the trapping mechanism. In fact most OSes either do not support floating point operations inside the kernel or turn that option off by default. The good thing is you do not really need floating point operations inside the kernel; but when you do, and need to modify your kernel code to allow it, it is a huge amount of implementation work and hard to get right (and the runtime cost is expensive as well).
One thing I always wondered about is why we are using floating point arithmetic at all, instead of fixed point math with explicitly specified ranges (say, "here I need 20 bits for the integer part and 44 for the fractional part"). What is the practical value of floating point that would justify dealing with all the complexity and conceptual problems it introduces?
Have you used fixed-point arithmetic much? You need to specify the width and the position of the dot for each intermediate variable (for instance, multiplying two numbers that have 20 bits for the integer part may require 40 bits for the integer part of the result). Since the operands in 20.44 format had only 64 bits of precision, it only makes sense to represent the result to 64 bits, so the format of this result should probably be 40.24.
If you don't do this, your program may still work but you are wasting space. For instance, if the multiplication of your two 20.44 numbers does not overflow the 20.44 format, good for you, but it means that between the two operands, you were wasting 20 bits in total for leading zeroes that carried no information.
Consider floating-point as a way not to have to make all these decisions at compile-time and still use space extremely efficiently to represent as much precision as possible.
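To make the bookkeeping concrete, here is a toy Q20.44 multiply (the helper names are mine, not a real library); note how the raw product carries 88 fractional bits before the shift decides which 44 to keep:

```python
FRAC_BITS = 44  # Q20.44: 20 integer bits, 44 fractional bits, stored in an int

def to_fix(x: float) -> int:
    return round(x * (1 << FRAC_BITS))

def fix_mul(a: int, b: int) -> int:
    # the raw product has 88 fractional bits; shift down, discarding the low 44
    return (a * b) >> FRAC_BITS

def to_float(a: int) -> float:
    return a / (1 << FRAC_BITS)

p = fix_mul(to_fix(3.5), to_fix(2.0))
print(to_float(p))  # 7.0
```

Python's unbounded ints hide exactly the decision the comment is about: in hardware or C you would have to pick the width and position of the point for that intermediate product yourself.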
Because of hardware, and the fact that fixed point doesn't solve the core problems with floating point as it stands today. It's hard enough to get good, stable, fast floating point hardware, much less adding the complexity of variable-sized words with variable-sized mantissas/exponents.
Really, what you're probably asking for is decimal floating point, which has its own set of limitations when implemented in hardware.
And if you happen to have an application where what you really need is more precision there are a number of bignum libraries like GMP (https://gmplib.org/) which provide that. But, throwing more precision at a problem doesn't alleviate the need to do error estimation.
Dynamic range. A double-precision variable has the same precision at 1e+10 as at 1e-10, your proposal does not. This is not a big problem if you have a good idea about the numbers in your application (e.g., digital signal processing), but it is a big problem if you write a matrix library (or anything else which deals with real numbers) for arbitrary inputs.
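This is easy to see with `math.ulp` (Python 3.9+): the absolute spacing between adjacent doubles varies enormously, while the relative spacing stays essentially constant:

```python
import math

# absolute spacing between adjacent doubles: ~20 orders of magnitude apart
print(math.ulp(1e10))   # ~1.9e-06
print(math.ulp(1e-10))  # ~1.3e-26

# ...but the relative precision is essentially the same at both scales
print(math.ulp(1e10) / 1e10)    # ~1.9e-16
print(math.ulp(1e-10) / 1e-10)  # ~1.3e-16
```

A fixed 20.44 format would instead have an absolute resolution of 2^-44 everywhere: far too coarse at 1e-10, and incapable of reaching 1e+10 at all.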
On one hand, FP lets you program without thinking too hard about the numerical properties of your program - until you hit a gotcha. On the other hand, FP is pretty powerful for many domains if you're an FP wizard. Like physics simulation.
There have been other approaches besides fixed that people have tried out, such as interval arithmetic and exact (rational) arithmetic, but they all have their own problems.
It's a pretty impressive system. If you're interested in their architecture, here's a YouTube video worth watching: "Marco Cecconi The Architecture of Stack Overflow - Developer Conference 2013" [0]
The non-associativity of addition is obvious, but for multiplication: I understand why it does not always give the same answer, but I do not see why a change of the order could change the accuracy.
It doesn't. If we forget about FPU flags, floating-point multiplication is pure. That doesn't change the fact that a * (b * c) is different from (a * b) * c.
The problem is not side-effects in multiplication. The problem is that * is not associative.
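A minimal illustration: same three operands, same operation, different grouping, different result (here the disagreement comes from overflow; pure rounding counterexamples exist too, they are just harder to exhibit by hand):

```python
import math

a, b, c = 1e308, 1e308, 1e-308

left = (a * b) * c   # a*b overflows to inf first, and inf * c stays inf
right = a * (b * c)  # b*c is ~1.0 first, so the result is a finite ~1e308
print(left, right)
```

Each individual multiply is deterministic and correctly rounded; it's the order in which the roundings (and here, the overflow) happen that differs.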
Khaine | 12 years ago
mahmud | 12 years ago
tikhonj | 12 years ago
[1]: http://z3.codeplex.com/
alco | 12 years ago
nly | 12 years ago
userbinator | 12 years ago
mark-r | 12 years ago
ambrop7 | 12 years ago
[1] http://en.wikipedia.org/wiki/Machine_epsilon
ambrop7 | 12 years ago
ronaldx | 12 years ago
I gather (at least in your examples) that dealing with bigger numbers first typically affords more 'real' accuracy?
octo_t | 12 years ago
floating point is hard. Programmers get it wrong, compilers get it right.
(normally).
pascal_cuoq | 12 years ago
There has been some progress since the time this report was an accurate description of the situation, but the situation is still mostly terrible: http://stackoverflow.com/questions/17663780/is-there-a-docum...
jw2013 | 12 years ago
sheetjs | 12 years ago
PDF rendering: http://www.cse.msu.edu/~cse320/Documents/FloatingPoint.pdf
HTML rendering: http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.ht...
TeMPOraL | 12 years ago
pascal_cuoq | 12 years ago
StillBored | 12 years ago
adwn | 12 years ago
zurn | 12 years ago
pkulak | 12 years ago
Why would so many people upvote such a condescending comment?
kawliga | 12 years ago
d23 | 12 years ago
bashcoder | 12 years ago
[0] https://www.youtube.com/watch?v=OGi8FT2j8hE
willvarfar | 12 years ago
Previously on HN https://news.ycombinator.com/item?id=2684254
octo_t | 12 years ago
termain | 12 years ago
username42 | 12 years ago
Ono-Sendai | 12 years ago
ww2 | 12 years ago
pascal_cuoq | 12 years ago
skrebbel | 12 years ago
pygy_ | 12 years ago
Good to know.
rbonvall | 12 years ago
taeric | 12 years ago
jevinskie | 12 years ago
woodchuck64 | 12 years ago
1. http://en.wikipedia.org/wiki/Scientific_notation
2. base 2, biased by 127, 8 bits (IEEE float)
3. base 2, implicit leading 1, 23 bits (IEEE float)
http://en.wikipedia.org/wiki/Single-precision_floating-point...
picomancer | 12 years ago