
Approximating sin(x) to 5 ULP with Chebyshev polynomials

151 points | wallacoloo | 9 years ago | mooooo.ooo

59 comments

[+] gp7|9 years ago|reply
This page crashes my Chrome tab!!

Anyway, anyone interested enough in this to be reading this comment thread should go look up Approximation Theory and Approximation Practice by Trefethen [0], which is a very, very good book that covers Chebyshev polynomials in a very clear way. The real party trick you get out of it, though, is that by taking samples of a function at the roots of the Chebyshev basis polynomials and applying the discrete cosine transform, you can get a set of Chebyshev coefficients approximating the function in O(n log n) time.

[0] https://people.maths.ox.ac.uk/trefethen/ATAP/
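
A minimal sketch of that trick, for anyone who wants to try it (the code below is illustrative, not from the book; it uses a naive O(n^2) DCT for clarity, where a real implementation would substitute an FFT-based DCT to reach O(n log n)):

    // Chebyshev coefficients of f on [-1, 1]: sample f at the roots of T_n,
    // then apply a DCT-II to the samples. Naive O(n^2) transform for clarity.
    fn chebyshev_coeffs(f: impl Fn(f64) -> f64, n: usize) -> Vec<f64> {
        use std::f64::consts::PI;
        // Roots of T_n: x_k = cos(pi * (k + 1/2) / n).
        let samples: Vec<f64> = (0..n)
            .map(|k| f((PI * (k as f64 + 0.5) / n as f64).cos()))
            .collect();
        // c_j = (2/n) * sum_k f(x_k) * cos(j * pi * (k + 1/2) / n).
        (0..n)
            .map(|j| {
                let s: f64 = samples
                    .iter()
                    .enumerate()
                    .map(|(k, &fk)| fk * (PI * j as f64 * (k as f64 + 0.5) / n as f64).cos())
                    .sum();
                2.0 * s / n as f64
            })
            .collect()
    }

    fn main() {
        // Expand sin(x) on [-1, 1]; even-order coefficients come out ~0.
        // Convention: f(x) ~ c[0]/2 + sum_{j>=1} c[j] * T_j(x).
        for (j, cj) in chebyshev_coeffs(|x| x.sin(), 8).iter().enumerate() {
            println!("c[{}] = {:+.3e}", j, cj);
        }
    }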

[+] tanderson92|9 years ago|reply
I've always found DCTs and particularly Clenshaw-Curtis quadrature (arguably better than Gauss quadrature!) attractive: you can do an incredible amount of heavy lifting with this corner of approximation theory.
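
As a concrete illustration of the Clenshaw-Curtis idea (a sketch only, with illustrative names and a naive O(n^2) transform): expand f in Chebyshev polynomials on the extrema grid, then integrate the series term by term, using the fact that the integral of T_j over [-1, 1] is 2/(1 - j^2) for even j and 0 for odd j.

    // Clenshaw-Curtis quadrature for the integral of f over [-1, 1].
    fn clenshaw_curtis(f: impl Fn(f64) -> f64, n: usize) -> f64 {
        use std::f64::consts::PI;
        // Sample f at the Chebyshev extrema x_k = cos(k*pi/n).
        let fk: Vec<f64> = (0..=n)
            .map(|k| f((PI * k as f64 / n as f64).cos()))
            .collect();
        // DCT-I with halved endpoints: c_j = (2/n)[f_0/2 + sum + (-1)^j f_n/2].
        let coeff = |j: usize| -> f64 {
            let end = if j % 2 == 0 { fk[n] } else { -fk[n] };
            let mut s = 0.5 * (fk[0] + end);
            for k in 1..n {
                s += fk[k] * (PI * (j * k) as f64 / n as f64).cos();
            }
            2.0 * s / n as f64
        };
        let mut integral = coeff(0); // j = 0 term: (c_0/2) * integral(T_0) = c_0
        let mut j = 2;
        while j <= n {
            let w = if j == n { 0.5 } else { 1.0 }; // last coefficient is halved
            integral += w * coeff(j) * 2.0 / (1.0 - (j * j) as f64);
            j += 2;
        }
        integral
    }

    fn main() {
        // Integral of e^x over [-1, 1] is e - 1/e = 2.3504023873...
        println!("{:.10}", clenshaw_curtis(|x| x.exp(), 16));
    }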
[+] qooiii2|9 years ago|reply
You can do a little better with a minimax polynomial that you can derive with the Remez algorithm. If you have Matlab, you can use the excellent Chebfun package's remez command to do this almost instantly for any well-behaved function.

The best approximation I could find calculated sin(x) and cos(x) for [-pi/4,pi/4] and then combined these to cover all other inputs. These functions need only 4 or 5 terms to get +/-1 ULP accuracy for 32-bit floats.

I thought the harder problem was the argument reduction needed to calculate something like sin(10^9). It turns out that you need a lot of bits of pi to do this accurately. While testing my implementation, I was surprised to learn that the x87 fsin instruction uses only 66 bits of pi, so it gives extremely wrong answers for large inputs.
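
For reference, the standard fast-path fix is Cody-Waite reduction: split pi/2 into pieces whose products with the quadrant count are (nearly) exact, so the cancellation is benign. A rough Rust sketch under that scheme (constants follow the usual fdlibm-style split; this is not libm-grade, and truly huge arguments like 10^9 need a Payne-Hanek reduction carrying many more bits of pi):

    // Cody-Waite argument reduction: r = x - k*(pi/2), with pi/2 stored as a
    // high part (trailing zero bits, so k*PIO2_HI is exact for moderate k)
    // plus a low part holding the next bits.
    fn reduce_pio2(x: f64) -> (i64, f64) {
        const PIO2_HI: f64 = 1.5707963267341256e0;  // leading bits of pi/2
        const PIO2_LO: f64 = 6.077100506506192e-11; // next bits of pi/2
        let k = (x / std::f64::consts::FRAC_PI_2).round();
        // Subtract in two steps so the rounding error stays tiny.
        let r = (x - k * PIO2_HI) - k * PIO2_LO;
        (k as i64, r)
    }

    fn main() {
        // sin(x) is then +/- sin(r) or +/- cos(r), chosen by k mod 4.
        let (k, r) = reduce_pio2(10.5);
        println!("k = {}, r = {}", k, r); // r lies in roughly [-pi/4, pi/4]
    }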

[+] wallacoloo|9 years ago|reply
> You can do a little better with a minimax polynomial that you can derive with the Remez algorithm.

Yup! This was something I meant to mention in a footnote. If I could get equivalent [infinite precision] error bounds using a minimax polynomial of a lower degree, I think this would make a difference. Otherwise, even if the theoretical error bounds are lower for a minimax polynomial of the same degree, it wouldn't do anything to combat the noise introduced by rounding errors, assuming you keep everything else the same - and that noise seems to be by far the limiting factor. But that's just conjecture - no way to know for certain without trying it.

I do still like the Chebyshev polynomial approach because it's the only non-iterative approach I know of. It's something that I could implement without the aid of a CAS program pretty trivially, if I wanted to.

> The best approximation I could find calculated sin(x) and cos(x) for [-pi/4,pi/4] and then combined these to cover all other inputs.

Yes, that also seems to be the way to go (at least for precision). I didn't mention it in the article, but even though the code I showed was Rust, the actual implementation is in the form of an AST for a plugin system that only supports f32 addition, multiplication, division, and modulo as operations (no branching, no f32<->int conversions, etc). My entire motivation was to create a sine function for this system. It's known that the inputs to the function will be in (-pi, pi), so I don't have to perform reduction over any larger range. I don't see a practical way to reduce down to (-pi/4, pi/4) using the operations available to me, but I could be overlooking something.

[+] pkhuong|9 years ago|reply
For nice enough smooth functions, the minimax approximation can only improve on Chebyshev by at most ~2 bits; this paper http://www.uta.edu/faculty/rcli/papers/li2004.pdf shows that the improvement is only a fraction of a bit for elementary functions. I'm really not sure it's worth the effort, and incurring more error around every pole, to tamp things down around Chebyshev's worst case. For single floats, I did find it interesting to restrict the search to coefficients that can be represented exactly as single floats. For doubles, rounding of coefficients is probably noise unless you're writing a libm.
[+] dnautics|9 years ago|reply
This is really beautiful. If you used a fused dot product [0], you could probably very easily get to within 0.5 ULP.

[0] http://ieeexplore.ieee.org/document/6545890/?arnumber=654589...

[+] jacobolus|9 years ago|reply
(Disclaimer: I’m not a numerical analyst or hardware expert.) I’m not sure if you can use a dot product unit like the one in your link to make as much of a difference when handling recurrences like “Clenshaw’s algorithm” for evaluating Chebyshev polynomials (the method used in the OP, an analog of Horner’s algorithm), as you can get for taking a big dot product or computing an FFT or whatever. What I expect you really want is to keep your temporary variables in higher precision all the way through the recurrence.

You can probably get some improvement vs. separate floating point addition and multiplication with FMA instructions on existing standard hardware (and also a slight speed boost). I haven’t ever carefully tested the accuracy improvement in practice though.

If you are willing to take a hit on performance, you can even save the lower order bits of higher-precision accumulated floating point numbers by using two floats (doubles) per variable, and get almost double the precision with the same hardware (keywords: “compensated arithmetic”, “Kahan summation”).
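
A tiny sketch of the primitive behind that trick: Fast2Sum recovers the rounding error of an addition exactly, which is what lets a second float carry the lost low-order bits (illustrative Rust; TwoSum drops the magnitude precondition at the cost of a few more operations).

    // Fast2Sum: split a + b into the rounded sum plus its exact error term.
    // Requires |a| >= |b|.
    fn fast_two_sum(a: f64, b: f64) -> (f64, f64) {
        debug_assert!(a.abs() >= b.abs());
        let s = a + b;         // rounded sum
        let err = b - (s - a); // exact rounding error, recovered in fp math
        (s, err)
    }

    fn main() {
        // 1e-17 is too small to survive in s alone, but the pair (s, err)
        // represents the sum exactly.
        let (s, err) = fast_two_sum(1.0, 1e-17);
        println!("s = {}, err = {}", s, err); // prints s = 1 and the recovered 1e-17
    }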

Or see http://www.chebfun.org/examples/cheb/Turbo.html https://arxiv.org/abs/1404.2463 for a kind of mind-blowing, much mathier trick.

[+] shuber_|9 years ago|reply
Note that Chebyshev polynomials have an intimate relationship to trigonometric functions:

cos(n x) can be expressed as a polynomial of cos(x), namely T_n(cos(x)). In other words, T_n(x) = cos(n acos(x)).
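
A quick numerical check of that identity against the three-term recurrence T_{n+1}(x) = 2x T_n(x) - T_{n-1}(x) (a throwaway Rust sketch):

    fn main() {
        let (x, n) = (0.37f64, 10u32);
        let (mut t0, mut t1) = (1.0, x); // T_0 = 1, T_1 = x
        for _ in 1..n {
            let t2 = 2.0 * x * t1 - t0; // T_{k+1} = 2x*T_k - T_{k-1}
            t0 = t1;
            t1 = t2;
        }
        println!("recurrence: {}", t1); // T_10(0.37)
        println!("cos form:   {}", (f64::from(n) * x.acos()).cos()); // same value
    }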

[+] kazinator|9 years ago|reply
Here is an approximation of Fibonacci with a polynomial. Evidently I wrote this in June 2014:

  #include <math.h>
  #include <stdio.h>

  /* Degree-7 polynomial interpolating fib(0)..fib(7), evaluated by
     Horner's rule; the constant term is 0, so fib(0) = 0 exactly. */
  int fib(int n)
  {
    double out = 0.002579370;
    out *= n; out -= 0.065277776;
    out *= n; out += 0.659722220;
    out *= n; out -= 3.381944442;
    out *= n; out += 9.230555548;
    out *= n; out -= 12.552777774;
    out *= n; out += 7.107142857;
    out *= n;
    return floor(out + 0.5);  /* round to nearest integer */
  }

  int main(void)
  {
    int i;
    for (i = 0; i < 9; i++) {
      printf("fib(%d) = %d\n", i, fib(i));
      switch (i) {
      case 7:
        puts("^ So, is this calculation fib or not?");
        break;
      case 8:
        puts("^ Answer to life, universe & everything responds negatively!");
      }
    }
    return 0;
  }
Run:

  $ ./fib2
  fib(0) = 0
  fib(1) = 1
  fib(2) = 1
  fib(3) = 2
  fib(4) = 3
  fib(5) = 5
  fib(6) = 8
  fib(7) = 13
  ^ So, is this calculation fib or not?
  fib(8) = 42
  ^ Answer to life, universe & everything responds negatively!
[+] grondilu|9 years ago|reply
It seems to me that it makes more sense to approximate sinPi(x) = sin(pi x)

    scaling[x_] := 8/3  (x  - x^3)
    indices = {0, 2, 4, 6, 8, 10}
    inner[g_, h_] := NIntegrate[g h/Sqrt[1 - x^2], {x, -1, 1}, WorkingPrecision-> 40, PrecisionGoal-> 20]
    cheb[i_] := inner[ChebyshevT[i, x], Sin[Pi x]/scaling[x]]/inner[ChebyshevT[i,x], ChebyshevT[i, x]]
    coeff = cheb /@ indices
    apxi[i_] := coeff[[1 + i/2]]ChebyshevT[i, x]
    sinPi[x_] := Total[apxi /@ indices]scaling[x]
    Plot[sinPi[x], {x, -1, 1}]
    N[Simplify[sinPi[x]/(1-x^2)], 9]
This gives the following polynomial expression for sinPi(x)/(1-x²):

    3.14159264x - 2.02611946x^3 + 0.524036151x^5 - 0.0751872634x^7 + 0.00686018743x^9 - 0.000385937753x^11
So the function becomes:

    sub sinPi($x) {
        my $x2 = $x²;
        $x * (1 - $x2) *
        reduce * * $x2 + *,
        -0.000385937753,
         0.00686018743,
        -0.0751872634,
         0.524036151,
        -2.02611946,
         3.14159264
    }
That being said, cool stuff, definitely.
[+] grondilu|9 years ago|reply
Notice that your program could use a fold [1]. Not sure it would be much more efficient, but that would give it a more functional-programming style, which is arguably more elegant.

I wrote it in Perl 6 for instance:

    sub wallace-sin($x) {
        constant pi_major =  3.1415927e0;
        constant pi_minor = -0.00000008742278e0;
        my $x2 = $x²;
        ($x - pi_major - pi_minor) *
        ($x + pi_major + pi_minor) * $x *
        reduce * * $x2 + *,
            0.00000000013291342e0,
           -0.000000023317787e0,
            0.0000025222919e0,
           -0.00017350505e0,
            0.0066208798e0,
           -0.10132118e0;
    }

1. https://en.wikipedia.org/wiki/Fold_%28higher-order_function%...
[+] jacobolus|9 years ago|reply
This is silly. You should (almost) never optimize low-level numerical building-block routines for code style at the expense of performance.

The proper thing to do is make a generic library function for evaluating Chebyshev polynomials, and then pass in the polynomial coefficients and the values to evaluate in arrays. That’s a much more “functional” approach than hard-coding the coefficients could ever be.

Then if you want it to go fast, you can vectorize the code and use FMA instructions, and run it on multiple CPU cores, as in https://github.com/alkashani/numcrunch/blob/master/src/clens... (this is code my wife wrote for me; I think there’s still a factor of 2 or 4 being left on the table getting data flowing efficiently through the CPU caches, but we didn’t really deeply profile it to figure out what the issue was). Of course you can do even better by running this type of highly parallelizable code on a GPU.
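
For reference, the core of such a generic routine is just the Clenshaw recurrence; a scalar Rust sketch (illustrative names, not the interface of the linked repo):

    // Clenshaw evaluation of f(x) = sum_{k=0}^{n} c[k] * T_k(x):
    // b_k = c_k + 2x*b_{k+1} - b_{k+2}, then f = c_0 + x*b_1 - b_2.
    fn clenshaw(coeffs: &[f64], x: f64) -> f64 {
        let (mut b1, mut b2) = (0.0f64, 0.0f64);
        for &c in coeffs.iter().skip(1).rev() {
            let b = c + 2.0 * x * b1 - b2;
            b2 = b1;
            b1 = b;
        }
        coeffs[0] + x * b1 - b2 // final step uses x, not 2x
    }

    fn main() {
        // Sanity check: T_3(x) = 4x^3 - 3x is the series c = [0, 0, 0, 1].
        let x = 0.3f64;
        println!("{} vs {}", clenshaw(&[0.0, 0.0, 0.0, 1.0], x),
                 4.0 * x.powi(3) - 3.0 * x);
    }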

[+] fsloth|9 years ago|reply
Your comment is beyond silly. The whole point of numerical code is to perform predictably and within the performance constraints of the specific problem. Style purism is the worst kind of elitism - it's braggadocio for its own sake, without real-world merit. Every solution to a problem should be evaluated in relation to the problem domain and constraints given. Sometimes the more understandable version is better. Sometimes the functional version is better. But there is no stencil-based development methodology that provides the best answer every time. And, yes, I too like code that has no side effects and reads more like math. That's not always the best code to solve a problem, though (unlike in problem specification, where striving for a TLA+-like presentation usually has its merits).
[+] jlg23|9 years ago|reply
> Not sure it would be much more efficient

It won't - and I don't have to measure to know.

* OP uses standard optimization techniques. Their use has been justified over decades of actual research.

* OP uses a compiled language. OP can look at disassembly and see what's happening on the CPU.

* Your use of +reduce+ does not make sense at all - it's just syntactic sugar (you of course can use HOFs to handle constants, but that's completely missing the idea of function composition, which arguably is the only reason +reduce+ exists).

[+] brianon99|9 years ago|reply
Loop unrolling as shown by the author is quite common in math functions. It is still quite readable and ensures good performance. OpenBLAS used similar tricks in the old days, when there were no intrinsics. Not sure about now.
[+] ruricolist|9 years ago|reply
Doesn't Perl 6 have macros? If you have macros, you can do it both elegantly and efficiently by expanding Horner's rule at compile time. You can still use reduce; just have it return the form unevaluated.

In fact, because the coefficients are constant at compile time, you could precondition the polynomial and do even better than Horner's rule, with only three multiplications.

[+] kazinator|9 years ago|reply
The elegant solution in Perl 6 is to just call the library function: sin($x).

No multiple lines of code, no reduce, nothing.

The point of this is to get something faster.

[+] gravypod|9 years ago|reply
Are there other people working on math tricks like this to make complex functions cheap?
[+] stephencanon|9 years ago|reply
sin( ) is typically implemented with (more sophisticated versions of) exactly the same "tricks"; most library implementations trade off a little speed for more accuracy (and obviously support inputs from the whole number line, not just (-pi,pi)), but honestly they're pretty fast already.

There are also some libraries that provide vectorized implementations, often with a few options for different accuracy/performance tradeoffs.

[+] contravariant|9 years ago|reply
Well, how do you think your CPU calculates the sine function?
[+] aeleos|9 years ago|reply
This is pretty cool. I always find it interesting when cool mathematics ideas are used to help computer science.

I remember a while back I tried implementing something like this in a hobby operating system I was making, with constants similar to the ones in the post. I found them in some random paper that included constants for a bunch of different trig functions. I don't think it ended up working, and I ended up using a table of values and interpolating between entries.

[+] tanderson92|9 years ago|reply
This is solidly numerical analysis, not really computer science.
[+] pmalynin|9 years ago|reply
The site crashes the Chrome Tab on 57.0.2987.133 (64-bit) Windows when I scroll down :/
[+] ezequiel-garzon|9 years ago|reply
Very basic question: what is the programming language of the code snippets?
[+] simias|9 years ago|reply
It's Rust
[+] hpcjoe|9 years ago|reply
One of the major concerns I have with this code, and with code derived from it, is the loss of significance that can be seen happening before the series is summed.

1st:

  let pi_major = 3.1415927f32;
  let pi_minor = -0.00000008742278f32;
[...]

  (x - pi_major - pi_minor) *
  (x + pi_major + pi_minor) * p1 * x
Focus on the x, pi_minor and pi_major.

This is fine, pi_major is basically pi. pi_minor is a small constant.

One knows from basic numerical analysis as applied to floating-point computation [0] never to do additions/subtractions along the lines of x +/- c*epsilon and expect anything close to an accurate result.

What you are really doing is adding/subtracting numbers of widely different scales. This is a bad thing in general, and the only real way to handle it is to increase precision. Otherwise you completely swamp the small number with the larger one, and you lose all the precision of the smaller number.

Simple example in Julia (0.5.1 or higher):

  julia> deltax = Float16(0.1)
  Float16(0.1)

  julia> sum = Float16(0)
  Float16(0.0)

  julia> for i=1:1000000 sum+=deltax end

  julia> sum
  Float16(256.0)

Yes, that is right: according to this, 0.1 as a Float16 summed one million times comes out to 256. Which is wildly ... ridiculously ... wrong.

Then redo this as Float32:

  julia> deltax2 = Float32(0.1)
  0.1f0

  julia> sum2 = Float32(0)
  0.0f0

  julia> for i=1:1000000
           sum2+=deltax2
         end

  julia> sum2
  100958.34f0

Much closer, but you can still see the impact of the catastrophic loss of precision [1]. This is a very well known phenomenon that people seem to forget about every now and then, until it comes back to bite them.
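
For what it's worth, compensated summation fixes exactly this case; a small Rust sketch of Kahan's method redoing the 0.1 example in f32 (note the exact sum of the f32 value nearest 0.1, repeated a million times, is about 100000.0015):

    // Kahan (compensated) summation vs. naive accumulation in f32.
    fn main() {
        let dx = 0.1f32;
        let mut naive = 0.0f32;
        let (mut sum, mut c) = (0.0f32, 0.0f32);
        for _ in 0..1_000_000 {
            naive += dx;
            // Kahan step: apply the running correction, then recover the
            // rounding error of this addition into c.
            let y = dx - c;
            let t = sum + y;
            c = (t - sum) - y;
            sum = t;
        }
        println!("naive: {}", naive); // ~100958.34 -- way off
        println!("kahan: {}", sum);   // ~100000    -- near the true sum
    }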

I illustrated this for an HPC/parallel programming class I taught a decade ago [2] using some simple C code. The math isn't the problem. It's the order in which the error accumulates.

This should be fairly sobering to those who are happy running f16 codes ... you need to be very careful about the order of your operations (far more so than with f32 or f64), lest you suffer a catastrophic loss of precision and wind up with code that "computes faster", or, as we old HPC types like to say about codes that make these sorts of mistakes, "generates garbage at a tremendous rate".

During my class a decade ago, I had an astronomer who coded in Fortran complain to me bitterly that he had always done sums this way, and there was no way this concept was correct. The C code was written to demonstrate the effect to him.

It is sobering how much garbage numerical code there is out there. It gets even harder to distinguish good numerical code from bad when people get lost in almost irrelevant things, like coding style or language of implementation.

If you really want to freak someone out, and you have a calculator app, see if you can take 1/3 + 1/3 + 1/3 - 1 (or 1/7 + ... + 1/7 - 1). If it uses less than f32, the 1/3 trick will get you a few ULP from zero. If it uses something below f64, the 1/7 version and a number of others will show this.

Basically, do a forward and reverse function application on a non-zero, non-pathological argument, and subtract the original value from the result. A compiler may optimize this away. An interpreter won't. And you'll see the issue of precision right there in the non-zero result.

Note as well, some apps will try to be "smart" and handle that case (sort of like optimizing benchmark codes with specialized handling). This is more troubling than doing the right thing, but I've always been able to find ways to expose the precision limits with simple tests like this.

The limited precision isn't a problem; it's how you make use of it, and how you use, compare, and add/subtract numbers with it, that matters.

[0] https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.h...

[1] https://en.wikipedia.org/wiki/Loss_of_significance

[2] https://scalability.org/2007/11/why-am-i-surprised-that-peop...

[edited for code clarity]

[+] jwmerrill|9 years ago|reply
These are good points in general, but I don't think they're relevant to the author's partitioning of PI to get extra precision near the zeros. The article does a good job of explaining exactly what's going on here, and has plots to prove that nothing is going catastrophically wrong.