It's almost criminal that Kaze isn't welcome at most tech conferences because he plays with fire with Nintendo's IP.
A huge audience is missing out on learning to do these things themselves, so my plug is that we're running an indie conference [0] with Kaze as the featured speaker.
I find him a bit abrasive even though I enjoy his videos (and actually submitted this a couple of days ago).
It's a lot of cool work, but the way he presents it to laymen is kind of annoying. Even if he had never heard of data-oriented design in the original video series (where he claims not to know what it's called when you organise data to improve throughput), he should have by now, because he clearly does huge amounts of research. (Case in point: at 9:30 [0] he talks about localised memory patterns as though he's the first to come up with them.)
Again, I love his work, watch every video, and even submit them to HN because I think others would enjoy them too. But that doesn't mean he's not very annoying.
Now that SM64 has been fully reverse-engineered and can be compiled back to a bit-perfect ROM, I wonder if there would be any merit in replacing the default lookup table with Kaze's new function(s):
- 200 microseconds saved
- 33333 microseconds per frame -> 0.6% of time saved
- 1KB of RAM saved
- Looked really cool doing it (Kaze's words & I agree)
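A quick sanity check on the arithmetic above (assuming a 30 fps frame budget, so ~33333 microseconds per frame):

```python
# Sanity-checking the numbers above: a 30 fps frame budget vs. 200 us saved.
frame_us = 1_000_000 / 30          # ~33333 microseconds per frame at 30 fps
saved_us = 200

percent_saved = 100 * saved_us / frame_us
print(f"{frame_us:.0f} us/frame -> {percent_saved:.1f}% of frame time saved")
```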
Kaze has already done so, and picking the low-hanging fruit results in much greater improvements than the optimizations shown here (which free up less than one millisecond per frame). Here's the previous video with explanations and a demonstration of their improvements: https://www.youtube.com/watch?v=t_rzYnXEQlE
Kaze has essentially been doing that for his own ROM hack, not just this improvement but a bunch of others that have enabled the game to run at a consistent 60fps.
I believe he's mainly working on the ROM hack at the moment but has talked about releasing a patch to backport all the 100% SM64 compatible fixes to the main game.
The question here is whether the new, optimized SM64 implementation is CPU-bound in the first place. Games on the N64 are often limited by memory bandwidth, which is taken up by rasterization, and SM64 was something of a special case—compiled with optimizations turned off due to GCC bugs. After recompiling with optimizations enabled, and maybe after swapping in newer versions of the RSP microcode, my guess is that further improvements to CPU efficiency would have very limited returns.
I like Kaze's content. There is something fun about how hacky and ridiculous it is to put that much effort into Super Mario 64 optimizations and brand new levels.
This video is more technical than the others, but I suggest checking out his other videos if the idea of Super Mario 64 development piques your interest.
A lot of calculations and informed guesses, and words like "could", "would", and "should", but no benchmarks. Disappointing. With things like these, especially once you start to involve not-really-deterministic things like caches, you need to do integration testing of performance through benchmarks, not back-of-the-napkin calculations of how many instructions you are executing.
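For illustration, here's the kind of measurement I mean — a toy Python microbenchmark with hypothetical stand-ins for the two implementations (real numbers would have to come from benchmarking the actual routines on N64 hardware or an accurate emulator):

```python
import math
import timeit

# Hypothetical stand-ins for the two approaches; a real comparison would
# benchmark the actual routines in place rather than estimate cycle counts.
TABLE = [math.sin(2 * math.pi * i / 4096) for i in range(4096)]

def sin_table(x):
    return TABLE[int(x / (2 * math.pi) * 4096) % 4096]

def sin_poly(x):
    return math.sin(x)  # placeholder for a polynomial approximation

xs = [i * 0.001 for i in range(10_000)]
for name, fn in [("table", sin_table), ("poly", sin_poly)]:
    t = timeit.timeit(lambda: [fn(x) for x in xs], number=20)
    print(f"{name}: {t:.3f}s for 200k calls")
```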
> Mario "could" walk in a different direction than he is pointing in
But did he? Was that a real, noticeable problem or just theoretical?
The logical next step is creating a virtual machine and brute-force generating functions with fewer than 27 instructions to see if any of them produces something close to sine. Let it run for a few months and surely it will find something.
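As a toy sketch of that idea — searching a hypothetical pool of expression shapes rather than real MIPS instruction sequences, and far smaller than a real superoptimizer:

```python
import itertools, math

# Enumerate tiny candidate formulas of the shape x*(a - b*x*x) over a small
# pool of constants, and keep whichever best approximates sin on [0, pi/2].
# A real search would enumerate actual MIPS instruction sequences instead.
xs = [i * math.pi / 2 / 32 for i in range(33)]
pool = [0.5, 1.0, math.pi, 1 / 6, 4 / math.pi ** 2]

def candidates():
    for a, b in itertools.product(pool, repeat=2):
        yield (f"x*({a:.4g} - {b:.4g}*x*x)",
               lambda x, a=a, b=b: x * (a - b * x * x))

best_name, best_fn = min(
    candidates(),
    key=lambda c: max(abs(c[1](x) - math.sin(x)) for x in xs))
print("best candidate:", best_name)
```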
If you're using a square root anyway, it's also possible to only fit the part between 0 and pi/4, and use the square root to calculate the sine or cosine (whichever is greater).
However, you need to be a bit careful to return exactly sqrt(1/2) at pi/4, otherwise you get a discontinuity.
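A sketch of that scheme in Python, with math.sin standing in for the fitted [0, pi/4] approximation and the value at pi/4 pinned to exactly sqrt(1/2) as described:

```python
import math

def sin_first_octant(x):
    # Stand-in for the fitted approximation on [0, pi/4]; the value at pi/4
    # is pinned to exactly sqrt(1/2) so the two halves meet without a jump.
    if x == math.pi / 4:
        return math.sqrt(0.5)
    return math.sin(x)

def sin_first_quadrant(x):
    # For x in [0, pi/2]: evaluate directly where the argument falls in
    # [0, pi/4], otherwise derive the value from sin^2 + cos^2 = 1.
    if x <= math.pi / 4:
        return sin_first_octant(x)
    return math.sqrt(1.0 - sin_first_octant(math.pi / 2 - x) ** 2)
```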
The video is very well produced and engaging, however (from skimming it), it seems that all the implementation methods it presents are suboptimal.
The N64 CPU had an FPU, so it seems obvious that the best method, from both a speed and accuracy (or whichever is deemed more important) standpoint would be the same method that's used in all the libms in libcs like Glibc, musl or LLVM-libc, or in the standard libraries of languages like Go or Java: approximation using a polynomial. This polynomial is not derived analytically, rather it's found using an iterative algorithm, traditionally using the Remez algorithm, but also see my WIP project here: https://gitlab.com/nsajko/FindMinimaxPolynomial.jl.
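For illustration, here's the shape of the method with an ordinary least-squares fit standing in for a true Remez/minimax fit (real libm coefficients are derived much more carefully):

```python
import numpy as np

# Least-squares degree-7 fit of sin on [-pi/2, pi/2], standing in for a true
# minimax (Remez) fit; real libm coefficients are derived more carefully.
xs = np.linspace(-np.pi / 2, np.pi / 2, 1000)
coeffs = np.polynomial.polynomial.polyfit(xs, np.sin(xs), deg=7)

approx = np.polynomial.polynomial.polyval(xs, coeffs)
max_err = float(np.max(np.abs(approx - np.sin(xs))))
print(f"degree-7 fit, max abs error ~ {max_err:.1e}")
```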
> This polynomial is not derived analytically, rather it's found using an iterative algorithm, traditionally using the Remez algorithm, but also see my WIP project here: https://gitlab.com/nsajko/FindMinimaxPolynomial.jl.
Is this your independent attempt? Because I think RLIBM did essentially the same thing recently [1].
I really don't think you can claim that the solution is suboptimal. Kaze has clearly spent a lot of time on this and has real code to show for it. You're just theorizing that it's bad without providing any real evidence.
Polynomial approximation was used throughout the video, and minimax was one of the final methods. Timestamp is sometime around 18:00. There seemed to be numerical issues with the approximation producing sin values above 1.0.
By itself, Remez is not usually helpful for these kinds of low-accuracy polynomial approximations, because getting the minimax error beneath the quantization threshold is too strong a constraint. Usually one ends up doing some brute-force polishing at the end to try to reduce the polynomial degree and/or the bits necessary to represent the coefficients.
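To illustrate why quantization bites, here's a sketch that takes a least-squares cubic fit of sine and rounds its coefficients to a hypothetical 8-fractional-bit grid; the polishing described above would then search nearby grid points for a better rounding:

```python
import numpy as np

# Take a least-squares cubic fit of sin on [0, pi/2] (a stand-in for a real
# minimax fit), then round its coefficients to a fixed-point grid, as an
# implementation with only a few bits per coefficient would have to.
xs = np.linspace(0, np.pi / 2, 500)
coeffs = np.polynomial.polynomial.polyfit(xs, np.sin(xs), deg=3)

scale = 2 ** 8  # hypothetical 8 fractional bits per coefficient
quantized = np.round(coeffs * scale) / scale

errs = {}
for name, c in [("float", coeffs), ("quantized", quantized)]:
    errs[name] = float(np.max(np.abs(
        np.polynomial.polynomial.polyval(xs, c) - np.sin(xs))))
    print(f"{name} coefficients: max error {errs[name]:.2e}")
```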
abnercoimbre | 2 years ago
We should follow in his footsteps.
[0] https://handmadecities.com/boston
psychphysic | 2 years ago
[0] https://youtu.be/xFKFoGiGlXQ?t=570
accrual | 2 years ago
kibwen | 2 years ago
22c | 2 years ago
dietrichepp | 2 years ago
nstbayless | 2 years ago
DigiDigiorno | 2 years ago
bruce343434 | 2 years ago
CyborgCabbage | 2 years ago
dvh | 2 years ago
contravariant | 2 years ago
levodelellis | 2 years ago
nsajko | 2 years ago
lifthrasiir | 2 years ago
[1] https://arxiv.org/pdf/2104.04043.pdf
sosodev | 2 years ago
tyg13 | 2 years ago
thebooktocome | 2 years ago