psanchez | 1 year ago
It looked like an interesting problem, so I spent some time this morning exploring whether there would be any performance improvement from pregenerating an array of X items (where X is around 1M to 16M) and then randomly returning one of them at a time. I explored the project and copied the functions to stay as faithful to the original implementation as possible.
Generating 10M unit sphere vectors (best of 3 runs, g++ 13, Linux, Intel i7-8565U, one core for the tests):
- naive/rejection: ~574ms
- analytical: ~1122ms
- pregen 1M elements: ~96ms
That's almost 6x faster than the rejection method. Setup of the 1M elements is done once and is not counted in the metrics. These numbers use the double type; using float yields around a 4x improvement.

After looking at those results I decided to try it on the project itself, so I downloaded it, compiled it, and applied similar optimizations, only updating the circle and sphere random generators (with 16M unit vectors that are created only once in the app's lifetime), but got almost no noticeable benefit (marginal at most). It's hard to tell because of the random nature of the raytracing implementation. On the bright side, the image quality was on par. Honestly, I was afraid this method would generate poor visuals.
Just for the record, I'm talking about something as simple as:
    #include <cstdint>
    #include <vector>

    std::vector<Vec3> g_unitSpherePregen;
    uint32_t g_unitSpherePregenIndex = 0;

    void setupUnitSpherePregen(uint32_t nElements) {
        g_unitSpherePregen.resize(nElements);
        for (uint32_t i = 0; i < nElements; i++) {
            g_unitSpherePregen[i] = unitSphereNaive(); // call the original naive or analytical method
        }
    }

    Vec3 unitSpherePregen() {
        g_unitSpherePregenIndex = (g_unitSpherePregenIndex + 1) % g_unitSpherePregen.size();
        return g_unitSpherePregen[g_unitSpherePregenIndex];
    }
I also tried using a PRNG (std::mt19937 and xorshf96) in unitSpherePregen instead of the incremented variable, but the increment was faster and still yielded good visual results.

The next step would be profiling, but I don't think I will invest more time in this.
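For anyone curious, the xorshift variant could look like the sketch below. This is a self-contained guess at that experiment, not the actual test code: `xorshf96` is Marsaglia's well-known three-state xorshift, and `unitSphereNaive` here is a hypothetical stand-in rejection sampler so the snippet compiles on its own.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <vector>

struct Vec3 { double x, y, z; };

// Marsaglia's xorshf96: a tiny, fast xorshift generator (period ~2^96 - 1).
static uint64_t s_x = 123456789, s_y = 362436069, s_z = 521288629;
uint64_t xorshf96() {
    s_x ^= s_x << 16;
    s_x ^= s_x >> 5;
    s_x ^= s_x << 1;
    uint64_t t = s_x;
    s_x = s_y;
    s_y = s_z;
    s_z = t ^ s_x ^ s_y;
    return s_z;
}

// Hypothetical stand-in for the project's rejection sampler: draw points in
// [-1,1]^3 until one lands inside the unit sphere.
Vec3 unitSphereNaive() {
    for (;;) {
        Vec3 v{2.0 * rand() / RAND_MAX - 1.0,
               2.0 * rand() / RAND_MAX - 1.0,
               2.0 * rand() / RAND_MAX - 1.0};
        if (v.x * v.x + v.y * v.y + v.z * v.z < 1.0) return v;
    }
}

std::vector<Vec3> g_table;

void setupTable(uint32_t n) {
    g_table.resize(n);
    for (uint32_t i = 0; i < n; i++) g_table[i] = unitSphereNaive();
}

// Random lookup into the pregenerated table instead of a sequential index.
Vec3 unitSphereRandomLookup() {
    return g_table[xorshf96() % g_table.size()];
}
```

The random lookup trades the predictable sequential access pattern of the incremented index for scattered reads across the table, which is presumably part of why the plain increment measured faster.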
Edit: fix formatting
camel-cdr | 1 year ago
The idea is based on the Ziggurat Method. You overlap the circle with n boxes, each encapsulating the same amount of the underlying circle's area, select a random box, and then do rejection sampling.
With 128 boxes, this reduces the average number of additional iterations from 27% to 0.7%, which should massively reduce the number of branch miss predictions.
It ended up about 2x faster than the simple rejection method.
I haven't applied this to spheres yet, but that should also be possible.
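One way to read this description as code (this is my own sketch, not camel-cdr's implementation): cover the quarter disk with n vertical boxes, where box i spans x in [i/n, (i+1)/n] and reaches up to the arc's height at the strip's left edge, so it encloses the curve over that strip. Pick a box, draw a point uniformly inside it, and reject the rare points above the arc. For simplicity this sketch picks boxes in proportion to their area via `std::discrete_distribution`, which keeps the result uniform even with equal-width boxes; the real ziggurat trick uses equal-area boxes so a plain random integer suffices.

```cpp
#include <cassert>
#include <cmath>
#include <random>
#include <vector>

struct Vec2 { double x, y; };

// Ziggurat-style sampler for the unit disk (illustrative sketch).
class ZigguratDisk {
public:
    explicit ZigguratDisk(int n) : n_(n), rng_(std::random_device{}()) {
        std::vector<double> areas(n);
        heights_.resize(n);
        for (int i = 0; i < n; i++) {
            double x0 = double(i) / n;
            heights_[i] = std::sqrt(1.0 - x0 * x0); // tallest point of strip i
            areas[i] = heights_[i] / n;             // width * height
        }
        // Pick a box with probability proportional to its area.
        pickBox_ = std::discrete_distribution<int>(areas.begin(), areas.end());
    }

    Vec2 sample() {
        std::uniform_real_distribution<double> u01(0.0, 1.0);
        for (;;) {
            int i = pickBox_(rng_);
            double x = (i + u01(rng_)) / n_;        // uniform across strip i
            double y = u01(rng_) * heights_[i];     // uniform up the box
            if (x * x + y * y <= 1.0) {             // under the arc?
                // Mirror the quarter disk into all four quadrants.
                if (rng_() & 1) x = -x;
                if (rng_() & 1) y = -y;
                return {x, y};
            }
        }
    }

private:
    int n_;
    std::vector<double> heights_;
    std::discrete_distribution<int> pickBox_;
    std::mt19937 rng_;
};
```

With n = 128 equal-width strips, the total box area only slightly exceeds the quarter-disk area, so retries become rare (on the order of 1%), in line with the 0.7% figure quoted above for equal-area boxes.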
psanchez | 1 year ago
I'm wondering what would happen if a 2D or 3D array was used instead, so that instead of working with the unit circle / unit sphere, you worked with a 256x circle/sphere.
Assuming the center of the circle/sphere is at position (127, 127) or (127, 127, 127), you could precompute which elements of the array are part of that 256-radius circle/sphere, and only the elements on the boundary of the circle/sphere would need to be marked as special. You would only need 3 values (2 bits per item).
Then you would only need to randomly pick a point, and a single lookup into the 2D/3D array would evaluate where it falls. Most of the time only simple math would be involved, and a simple accept/reject would return a value. I guess it would also reduce the number of additional retries to 0.7% on a circle (one circle intersection for every 128 items = 1/128 = 0.78%).

From my limited understanding, what I'm describing looks like a simpler implementation, but it would require more memory and in the end would have the same runtime performance as yours (assuming memory and processor caches were the same, which they probably are not). Uhm... I guess the implementation you present is actually doing something similar but with a quarter of the circle, so you need less memory.
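A minimal 2D sketch of this grid idea (hypothetical layout, not from any project): classify each cell of a 256x256 grid over [-1,1]^2 as fully inside, fully outside, or on the boundary of the unit circle, so that only points landing in boundary cells need the exact x^2 + y^2 test. Note that this doesn't change how often a candidate point is rejected, it only replaces the exact test with a table lookup for the vast majority of points.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

constexpr int GRID = 256;
enum CellKind : uint8_t { OUTSIDE = 0, INSIDE = 1, BOUNDARY = 2 }; // 3 values, 2 bits

// Classify every cell once, up front.
std::vector<uint8_t> buildGrid() {
    std::vector<uint8_t> grid(GRID * GRID);
    for (int j = 0; j < GRID; j++) {
        for (int i = 0; i < GRID; i++) {
            // Cell corners in [-1,1] coordinates.
            double x0 = 2.0 * i / GRID - 1.0, x1 = 2.0 * (i + 1) / GRID - 1.0;
            double y0 = 2.0 * j / GRID - 1.0, y1 = 2.0 * (j + 1) / GRID - 1.0;
            // Nearest point of the cell to the origin (0 if it straddles an axis).
            double nx = std::min(std::abs(x0), std::abs(x1));
            double ny = std::min(std::abs(y0), std::abs(y1));
            if (x0 < 0 && x1 > 0) nx = 0;
            if (y0 < 0 && y1 > 0) ny = 0;
            // Farthest corner of the cell from the origin.
            double fx = std::max(std::abs(x0), std::abs(x1));
            double fy = std::max(std::abs(y0), std::abs(y1));
            if (fx * fx + fy * fy <= 1.0)      grid[j * GRID + i] = INSIDE;
            else if (nx * nx + ny * ny >= 1.0) grid[j * GRID + i] = OUTSIDE;
            else                               grid[j * GRID + i] = BOUNDARY;
        }
    }
    return grid;
}

struct Vec2 { double x, y; };

Vec2 sampleDisk(const std::vector<uint8_t>& grid, std::mt19937& rng) {
    std::uniform_real_distribution<double> u(-1.0, 1.0);
    for (;;) {
        double x = u(rng), y = u(rng);
        int i = std::min(int((x + 1.0) * 0.5 * GRID), GRID - 1);
        int j = std::min(int((y + 1.0) * 0.5 * GRID), GRID - 1);
        uint8_t k = grid[j * GRID + i];
        if (k == INSIDE) return {x, y};                         // lookup is enough
        if (k == BOUNDARY && x * x + y * y <= 1.0) return {x, y}; // rare exact test
        // OUTSIDE (or a boundary miss): retry.
    }
}
```

With 256 cells per axis, only a thin ring of cells is marked BOUNDARY, so the fraction of samples needing the exact test is small, roughly matching the 1/128 intuition above.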
Interesting, thanks for sharing.