top | item 35381968

Cosine Implementation in C

322 points | opisthenar84 | 3 years ago | github.com | reply

158 comments

[+] okaleniuk|3 years ago|reply
I love these things! Approximation is art.

I have a similar example in my book Geometry for Programmers (https://www.manning.com/books/geometry-for-programmers). There I make a polynomial approximation for sine that: 1) is odd (meaning antisymmetric); 2) is precise at 0 and Pi/2 (and -Pi/2 automatically); 3) has precise derivatives at 0 and Pi/2 (and -Pi/2); 4) has better overall precision than the equivalent Taylor-series approximation; and it consists of only 4 terms! Adding more terms makes it impractical in the modern world, since we have hardware implementations for sine on CPUs and GPUs.

Spoiler alert, the secret sauce is the integral equation ;-)

[+] boyka|3 years ago|reply
The realization that a 6th order approximation is good enough given our definition of float is beautiful.
[+] stkdump|3 years ago|reply
CPUs have had a 'hardware' (I guess microcode) implementation since the 80387-era FPU. But it isn't used: one reason is that the software implementation is faster in modern floating point code. I don't know about GPUs, but I would guess compilers transparently take care of that.
[+] lifthrasiir|3 years ago|reply
Which is just a support function for cos (which doesn't do argument reduction). See [1] for the actual function, and [2] & [3] for the argument reduction code.

[1] https://github.com/ifduyue/musl/blob/master/src/math/cos.c

[2] https://github.com/ifduyue/musl/blob/master/src/math/__rem_p...

[3] https://github.com/ifduyue/musl/blob/master/src/math/__rem_p...

[+] FabHK|3 years ago|reply
To give a bit more background on this:

You really only need to compute one eighth of a circle segment (so, zero to pi/4), everything else you can do by symmetry by switching signs and flipping x, y. (And if you supported a larger range, you’d waste precision).

The function supports double-double precision (which is a trick, different from long double or quadruple precision, in which you express a number as an unevaluated sum of two doubles, the head (large) and the tail (much smaller)).

So, the “real” cosine function first takes its argument, and then reduces it modulo pi/2 to the desired range -pi/4, +pi/4 (with minimal loss of precision - you utilise the sign bit, so that’s better than using the range 0, +pi/2) in double-double precision (rem_pio2 returns an int denoting which quadrant the input was in, and two doubles), and those two doubles are then passed to this function (or its cousin __sin, with the sign flipped as appropriate, depending on the quadrant).

Many people really have put in quite some care and thought to come up with this neat and precise solution.

ETA: Double double (TIL that it’s also something in basketball?): https://en.m.wikipedia.org/wiki/Quadruple-precision_floating...

Range reduction: https://redirect.cs.umbc.edu/~phatak/645/supl/Ng-ArgReductio...

[+] toxik|3 years ago|reply
Interesting idea to compare the absolute value by simply not doing a float comparison.

I see they do a bit more than that now. Wonder how much time that saves.

[+] schappim|3 years ago|reply
This is a faster implementation of cosine that utilises six static const double variables (C1, C2, C3, C4, C5, C6) to speed things up. These variables represent the coefficients of a polynomial approximation for the cosine function.

For the genuine article, check out: https://github.com/ifduyue/musl/blob/master/src/math/cos.c

[+] FabHK|3 years ago|reply
It’s the core implementation for a specific and very narrow range of arguments. The “genuine article” moves the argument into that range (via the rem_pio2 function performing range reduction, which is worth an article in itself), and then calls this core implementation.
[+] kolbe|3 years ago|reply
Why do you return x-x for inf or nan? I can't test now, but I imagine they both return nan?

I run a high performance math library in C#, and I'm always looking to improve.

https://github.com/matthewkolbe/LitMath

[+] EVa5I7bHFq9mnYK|3 years ago|reply
I remember an approximation function used in a vacuum-tube-based surface-to-air system of the 60s: cos(x) = 0.7. As explained to us rookies, that precision was enough to hit the target.
[+] marshray|3 years ago|reply
I can see it now.

  float cos(float x)
  {
      return 0.7f; // Chosen by fair dice roll
  }
[+] ogogmad|3 years ago|reply
That can't be good for all x. You mean at x = π/4, or accurate to 1 decimal place.
[+] dboreham|3 years ago|reply
My first experience with "Use the source Luke" was wondering as a young lad how computers calculated trig functions. This led me to the Unix V7 math lib source code[1], and thence to a book by Hart & Cheney which by luck my university library had on the shelf. Good times.

[1] https://github.com/v7unix/v7unix/blob/master/v7/usr/src/libm...

[+] stagger87|3 years ago|reply
Implementing low precision transcendental functions is much more interesting IMO, as it allows for a wider variety of algorithms and approaches. When you are limited to the IEEE definitions, there really isn't much room for creativity and you have to handle all the obscure edge cases and chase least significant bits, bleh. That's not something you want to do if you need to calculate a trillion atan's. If one is curious how these algorithms work, check out the book "Elementary Functions".
[+] lifthrasiir|3 years ago|reply
In reality though, a typical stock libm is not that accurate; their error generally ranges from 1 to 5 ulps with no actual guarantees. I'm more interested in a hybrid libm with configurable errors, from the most accurate possible (0.5 ulps) to something like 2^20 ulps, and all in between.
[+] l33t233372|3 years ago|reply
This function is completely constant in running time, so I don’t think these considerations are important here.
[+] mananaysiempre|3 years ago|reply
The IEEE standard doesn’t really require any particular precision for transcendental functions IIUC? The original standard doesn’t even define them. The 2008 version includes optional correctly-rounded transcendentals, but there seems to be a total of one implementation, CRlibm, which is an academic project of more than a decade (and a dead home page because of the INRIA Forge shutdown). Your run-of-the-mill libm has little to do with this.
[+] vector_spaces|3 years ago|reply
Does anyone know of a good reference on signals and systems for people with a pure math background rather than an engineering or physics background?

So, the implementation linked above uses the polynomial of best approximation to cosine on an interval. That polynomial was presumably obtained using the Remez algorithm, which, given a function f and a set of n+2 points in your desired interval, yields a polynomial of degree n best approximating f in the interval (in the uniform norm sense)[1] by performing a sequence of linear solves.

I have a math and specifically approximation theory background, and the wiki article on the Remez algorithm makes a lot of sense to me. Like, this is the language I speak: https://en.wikipedia.org/wiki/Remez_algorithm

I'm less comfortable though when I try to understand the documentation for the remez implementation in SciPy because it is using the language of signals and systems, which is how engineers think about this stuff: https://docs.scipy.org/doc/scipy/reference/generated/scipy.s...

Signals seems to use a lot of analogies to sound waves, and it does map in some way or another to the mathematical language I'm comfortable with. I've spent a little bit of time reading papers and introductory texts on signals, but the definitions feel a little loosey goosey to me.

Anyway, just opportunistically using this thread to ask if anyone knows of some book or papers or blog posts that can help me translate this stuff into math language :)

[+] ur-whale|3 years ago|reply
I feel your pain.

I've spent most of my career working at the intersection of multiple highly technical domains, both theoretical and applied, and a) the "jargon switch", b) the methodology differences, and worst of all c) the switch of conceptual framework required to work on a problem that spans multiple domains are initially very irritating.

I'm pointing that out because if you are looking for a Rosetta stone to tackle only a), even if you find one, it'll leave you wanting, because you'll still need maps to tackle b) and c).

Example: analog electronics. From a mathematician's POV, an analog circuit is just a system of differential equations.

But very few analog electronics specialists ever think about a circuit that way.

The words they use are the least important part of the "translation" problem. The way they approach and think about the system will be utterly alien to a pure or even applied math person.

Eventually, after learning the trade, you'll start to see familiar patterns emerge that can be mapped back to stuff you're familiar with, but a simple "this EE jargon means that in math language" won't be enough: you're going to have to get your hands dirty and learn how they think of the problem.

[+] lntue|3 years ago|reply
I don't know much about the scipy.signal package, but there are other Python packages that have good implementations for generating polynomial approximations. Among them are pythonsollya (a Python wrapper for Sollya) and SageMath.

Other than that, some of the overview references about Remez algorithm (should be a bit closer to your taste) that I like are:

- J.-M. Muller et al., "Handbook of Floating-Point Arithmetic", section 10.3.2 (and follow the references in that section).

- N. Brisebarre and S. Chevillard, "Efficient polynomial L-approximations," 18th IEEE Symposium on Computer Arithmetic (ARITH '07), which is implemented in Sollya tool: https://www.sollya.org/sollya-2.0/help.php?name=fpminimax

[+] stagger87|3 years ago|reply
The scipy documentation you linked is specifically for generating a FIR filter using the Remez algorithm, which is why it is written in that style. It is a very specific use case of the algorithm and probably not the one I would recommend learning from. Another top level comment has a link to a decent overview of the Remez algorithm written without the "Signals and Systems" language.
[+] asdf_snar|3 years ago|reply
Just noting that your link to docs.scipy is broken. I'm also in a similar boat to you and would be interested in an answer to this question.
[+] mianos|3 years ago|reply
This takes me back. I did the transcendental functions for a CP/M C compiler in the early 80s using Chebyshev polynomials.

My tutor failed my implementation because my code formatting was utterly crap. :) I am still laughing as there was no sin, cos in the maths library at all at the time.

[+] IIAOPSW|3 years ago|reply
Personally I go with the recursive definition. Use the trig identity 2cos(x)^2 - 1 = cos(2x).

    2cos(x)^2 - 1 = cos(2x).
    2(2cos(x)^2 - 1)^2 - 1 = 2cos(2x)^2 - 1 = cos(4x)
    2(2(2cos(x)^2 - 1)^2 - 1)^2 - 1 = 2cos(4x)^2 - 1 = cos(8x)
Keep squaring, doubling and subtracting one on the left. Keep raising by powers of 2 in the argument on the right. Eventually, if you're working with rational numbers, multiplication by 2^k moves all k fractional bits over to the integer side of the binary point, and then integer times pi is the base case of the recursion. If you rescale the units so that pi = 2^(n-1) for some n, you don't even have to think about the non-fractional part. The register value circling back to 0 is supposed to happen anyway.

You could just as easily recursively define cos to "spiral downwards in x rather than outwards", each recursive layer reducing the argument x by a factor of 2 until it reduces down to the precomputed value cos(machine precision).

This was all just circular reasoning
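
The "spiral downwards" variant can be sketched like this (cos_by_doubling is an invented name, and the 1e-4 cutoff and short Taylor base case are arbitrary choices): halve the argument until it is tiny, then climb back up with the double-angle identity. Note that each doubling step can amplify rounding error by up to ~4x, so this is a toy rather than a libm-grade kernel:

```c
#include <math.h>

/* Toy cosine via the double-angle identity cos(2t) = 2*cos(t)^2 - 1. */
static double cos_by_doubling(double x)
{
    int k = 0;
    while (fabs(x) > 1e-4) {       /* halve the argument k times */
        x *= 0.5;
        k++;
    }
    double c = 1.0 - x * x / 2.0 + x * x * x * x / 24.0; /* Taylor base */
    while (k--)
        c = 2.0 * c * c - 1.0;     /* double the angle back up */
    return c;
}
```

For moderate arguments the random-walk rounding error stays far below the worst-case 4^k bound, so the result is still good to several digits.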

[+] spenczar5|3 years ago|reply
Here is a naive question. A lot of effort goes into dealing with floating point imprecision and instabilities, right. So why aren’t these algorithms implemented with integers? A 64 bit integer could represent the region from 0 to 1 with 2^64 evenly spaced slices, for example.

Would that make things more accurate across operations? Is this something that is done?

[+] moonchild|3 years ago|reply
That is called 'fixed point', and it is useful for some applications. What makes floating point interesting is that it can represent values with very different magnitudes; your 64-bit fixed point number can't represent any number greater than 1, for instance. Whereas a 32-bit float can represent numbers with magnitudes between ~2^-127 and ~2^127.
[+] ChuckNorris89|3 years ago|reply
There should be plenty of C trigonometric algorithms made for fixed point. They usually use lookup tables and interpolation. Used a lot in embedded microcontrollers that don't have hardware floating point or hardware trig functions, and in automotive where floats were (still are?) a no-go due to the dangerous float math pitfalls and the safety critical nature of ECUs.

I remember writing a fast, small footprint, but very accurate fixed point trig function for the microcontroller used in my batcher project, many lives ago. I'll try to find it and put it on github.

The neat thing is that you can use the native wrap-around-on-overflow property of C unsigned integers and processor registers to rotate back and forth around the trigonometric circle without needing to account for the edge cases of the 0/360 degree point. You do need to suppress a MISRA warning here though, as it will go nuts seeing that.

[+] jcranmer|3 years ago|reply
What you're describing is fixed point. Fixed point absolutely exists in practice, but it has an important limitation: its accuracy is scale-dependent, unlike floating-point.

Suppose for example you were doing a simulation of the solar system. What units do you use for coordinates--AUs, light-seconds, kilometers, meters, even miles? With floating-point, it doesn't matter: you get the same relative accuracy at the end. With fixed point, you get vastly different relative accuracy if your numbers are ~10⁻⁶ versus ~10⁶.

[+] dzdt|3 years ago|reply
Re: "imprecision and instabilities" on standard hardware there really are neither. Floating point numbers have an exact meaning (a particular fraction with an integer numerator and a denominator that is a power of 2). There are exactly specified rules for basic operations so that adding, subtracting, multiplying, or dividing these fractions will give you the nearest representable fraction to the exact result (assuming the most common round-to-nearest mode).

The bit that appears as "imprecision and instabilities" is a mismatch between three things: (1) what the hardware provides (2) what controls or compiler optimisation settings the programming language exposes (3) what the user expects.

There are several sticky points with this. One mentioned in the parent article is higher precision. Starting from the original 8087 floating point coprocessor, x86 has supported both 80-bit and 64-bit floating point in hardware, but programming language support has mostly ignored this and only shows the user "double" precision with 64 bits of storage allocated. But with some compiler settings, parts of the computation would be done at 80-bit precision with no user visibility or control over which parts had that extra precision.

Newer hardware modes stick to 64 bit computation so you're unlikely to run into that particular stickiness on a modern programming stack. There is another that comes up more now: fused multiply and add. Modern hardware can calculate

   round(a*x+b)
in one step where a,x,b are floating point numbers (i.e. fractions) and the "round" operation produces the nearest representable fraction to the correct result. This is faster and more accurate than doing the multiplication and addition separately, which gives

    round(round(a*x)+b)
But again programming language control is weak over whether and when fused-multiply-and-add is used, and this is a detail that almost all programmers almost always just want to ignore anyway.
[+] syockit|3 years ago|reply
Speaking from no experience, but it should be doable. But since in most applications you're going to use floating point anyway (or, if you're using fixed point, maybe right bit shift or maybe even do integer division until it fits the range of your value), the extra precision would be discarded in the end.
[+] metalrain|3 years ago|reply
So why is that extra term needed here?

w = 1.0-hz;

return w + (((1.0-w)-hz) + (zr-xy));

------------------^ here

[+] jcranmer|3 years ago|reply
The idea is to be able to compute 1.0 - h * z + (z * r - x * y) in high precision.

If h * z is slightly less than 1.0, then w = 1.0 - h * z can lose a lot of bits of precision. To recover the bits that were dropped in subtracting h * z from 1.0, we compute which bits of h * z were used (the high bits) and then subtract those bits from h * z to get only the low bits. We find the bits that were used by simply undoing the operation: 1.0 - w is mathematically the same as h * z, but is not the same as implemented due to limited precision.

So (1.0 - w) - h * z is the extra bits that were dropped from the computation of w. Add that to the final term z * r - x * y before that result is added to w, and we have a more precise computation of w + z * r - x * y.
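
The recovery step described above is easy to see in isolation (the function name and the example value of hz below are mine). The key fact is that 1.0 - w is computed exactly (by the Sterbenz lemma, subtraction of two nearby floats is exact), so (1.0 - w) - hz is exactly the rounding error of w = 1.0 - hz:

```c
#include <math.h>

/* Return the bits that w = 1.0 - hz dropped to rounding. */
static double dropped_bits(double hz)
{
    double w = 1.0 - hz;     /* may round */
    return (1.0 - w) - hz;   /* exact: the rounding error of w */
}
```

For instance, with hz = 2^-30 + 2^-55 the subtraction 1.0 - hz rounds to 1 - 2^-30 (the 2^-55 bit falls below the 2^-53 ulp), and dropped_bits returns exactly -2^-55 - the correction term the kernel folds back in.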

[+] moonchild|3 years ago|reply
Mathematically, we have (1 - (1 - hz)) - hz = hz - hz = 0, so the computation seems redundant. But due to floating-point rounding, it is not, and still creates information. As the comment says:

> cos(x+y) = 1 - (x*x/2 - (r - x*y))

> For better accuracy, rearrange to

> cos(x+y) ~ w + (tmp + (r-x*y))

> where w = 1 - x*x/2 and tmp is a tiny correction term

> (1 - x*x/2 == w + tmp exactly in infinite precision)

(hz is x*x/2)

See double-double arithmetic - https://en.wikipedia.org/wiki/Quadruple-precision_floating-p...

[+] claytongulick|3 years ago|reply
This is the kind of thing that triggers my imposter syndrome big time.

I've been coding professionally for almost 30 years, and I see stuff like this and wonder if I'll ever be a real programmer.

[+] jcranmer|3 years ago|reply
You're probably more of a "real programmer" than whoever wrote this code--it's numerics code, after all.

There are no real tricks or anything to this code: it's just a polynomial evaluation of a Taylor series (which you probably learned, and subsequently forgot, from calculus), with some steps done using techniques for getting higher precision out of a floating-point value, and (in some other functions in libm, not here) occasional use of the bit representation of floating-point values. It's standard techniques, but in a field that is incredibly niche and where there is very little value in delving into it if you don't have a need for it.

[+] okaleniuk|3 years ago|reply
This is a fascinating topic indeed. And the math behind polynomial approximation is not that deep or unapproachable. You just haven't found an explanation that works for you. Some time ago I tried to make one with a sine as an example: https://wordsandbuttons.online/sympy_makes_math_fun_again.ht...

It doesn't mention Taylor series anywhere but if you only make your system out of N differential equations, you'll get N members of the power series.

[+] squeaky-clean|3 years ago|reply
Here's someone else's fast-sine code snippet I found and used about a decade ago when I was working on an additive synthesizer I never finished. It basically uses the same method, Taylor series approximation.

Driving it as an oscillator relies on just a single integer addition operation per sample to increment the phase, and relies on integer overflow to reset the phase.

I like the comment from someone asking if it could really be so fast since it has 7 whole multiplications in it. Ah I love the realtime dsp world, so different from my python/JS day job.

https://www.musicdsp.org/en/latest/Synthesis/261-fast-sin-ap...

[+] nhatcher|3 years ago|reply
I recently had to implement some Bessel functions numerically. It's fantastic to see that someone at Sun Microsystems implemented all these procedures in the early 90s (I'm sure even long before that). You will find these implementations in R, Julia, musl, you name it!

I especially like the use of "GET_HIGH_WORD" and friends:

https://git.musl-libc.org/cgit/musl/tree/src/math/j0.c

It's a lot of fun to figure out what they meant by things like:

> if (ix >= 0x7ff00000) return 1/(x*x);

[+] adgjlsfhk1|3 years ago|reply
you should check out Bessels.jl, which has really high performance implementations.