top | item 23084819


piadodjanho | 5 years ago

In practice, subnormals are very rarely used. Most compilers disable subnormals when compiling with anything other than -O0. A single subnormal operation can take over a hundred cycles to complete.

Demo:

    #include <stdio.h>

    int
    main ()
    {
        volatile float v;
        float acc = 0;
        float den = 1.40129846432e-45;

        for (size_t i; i < (1ul<<33); i++) {
            acc += den;
        }

        v = acc;

        return 0;
    }
With -O1:

    $ gcc float.c -o float -O1 && time ./float
    ./float  8.93s user 0.00s system 99% cpu 8.933 total

With -O0:

    $ gcc float.c -o float -O0 && time ./float
    ./float  20.60s user 0.00s system 99% cpu 20.610 total


GuB-42|5 years ago

I fixed your example a bit and here is what I get

  /tmp>cat t.c
  #include <stdio.h>
  int main() {
      float acc = 0;
      float den = 1.40129846432e-45;
      for (size_t i = 0; i < (1ul<<33); i++) acc += den;
      printf("%g\n", acc);
      return 0;
  }
  /tmp>gcc -O3 t.c && time ./a.out
  2.35099e-38
  ./a.out  5.94s user 0.00s system 99% cpu 5.944 total
  /tmp>gcc -O3 -ffast-math t.c && time ./a.out
  0
  ./a.out  1.50s user 0.00s system 99% cpu 1.502 total
So subnormal numbers are supported in -O3 unless you specify -ffast-math. And it definitely makes a difference (gcc 9.3, Debian 11, Ryzen 7 3700X).

EDIT That one is interesting too

  /tmp>clang -O3 t.c && time ./a.out
  2.35099e-38
  ./a.out  6.04s user 0.00s system 99% cpu 6.044 total
  /tmp>clang -O3 -ffast-math t.c && time ./a.out
  0
  ./a.out  0.10s user 0.00s system 99% cpu 0.101 total
clang (9.0.1) performs about the same without -ffast-math; but with it, it managed to optimize the loop away.

piadodjanho|5 years ago

I looked at the asm generated from my original example: the two builds produce very different code, since gcc applies additional optimizations when compiling with -O1.

I've been fighting the compiler to produce a minimal working example of the subnormal slowdown, but haven't had any success.

Some things that need to be taken into account (off the top of my head):

- Rounding: you don't want to get stuck at the same number.
- The FPU has accumulator registers that are wider than the floating-point registers.
- Using more registers than the architecture has is not trivial because of register renaming and code reordering; the CPU might optimize in a way that the data never leaves those registers.

While trying to make an MWE, I found this code:

    #include <stdio.h>

    int
    main ()
    {
        double x = 5e-324;
        double acc = x;

        for (size_t i; i < (1ul<<46); i++) {
            acc += x;
        }

        printf ("%e\n", acc);

        return 0;
    }

Runs in a fraction of a second with -O0:

    gcc double.c -o double -O0
But takes forever (killed after 5 minutes) with -O1:

    gcc double.c -o double -O1
I'm using gcc (Arch Linux 9.3.0-1) 9.3.0 on an i7-8700.

I also managed to create code that sometimes runs in 1s but other times takes 30s. It didn't matter if I recompiled.

Floating point is hard.

sigjuice|5 years ago

Shouldn't i be initialized?
