giovannibajo1's comments
giovannibajo1 | 6 years ago | on: How to speed up the Rust compiler in 2020
Not everything you write inside unsafe is necessarily unsafe: it only means that the compiler can't prove it safe. Another way of thinking about it: if unsafe let you write only truly unsafe code, it would be useless, because that code would eventually segfault.
giovannibajo1 | 6 years ago | on: Synthesizing Optimal 8051 Code
This is, for instance, Go's implementation, which is readable and well documented:
https://github.com/golang/go/blob/master/src/cmd/compile/int...
giovannibajo1 | 6 years ago | on: Achieving full-motion video on the Nintendo 64 (2000) [pdf]
So it's quite fast, but you need to remember that the main RDRAM is shared between the main CPU and the whole RCP (e.g., it's also used as video memory for textures and frame buffers by the RDP), so contention is really high.
giovannibajo1 | 6 years ago | on: ‘War Dialing’ tool exposes Zoom’s password problems
giovannibajo1 | 6 years ago | on: Achieving full-motion video on the Nintendo 64 (2000) [pdf]
Vector registers are 8 lanes, signed 16-bit, so they map quite well to per-pixel calculations on each plane (YUV), which is what video codecs do, as you can process 8 pixels at a time, and you have 16-bit precision to handle intermediate results.
The most complex hurdle is that the RSP only has 4K of RAM, so you need to DMA macroblocks in and out a lot. I can't possibly rewrite a full H264 decoder in RSP assembly, not in this lifetime: I only rewrite specific performance-sensitive algorithms, while the bulk of the decoder stays in C. This means the same data ends up going in and out of the RSP a lot, especially since the H264 decoder I'm using is not aware of this problem.
That said, RSP DMA is even rectangle-based, so it's another perfect fit: I can DMA a macroblock by specifying the pointer in RAM, the width and height (usually 16x16, but some algorithms work on sub-partitions of 8x8 or 4x4) and the stride (screen width), so that a single DMA call will transfer the block from the middle of a frame, skipping the rest of the data.
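A minimal sketch of what such a rectangle-based transfer computes (the function name and signature are illustrative, not the real RSP DMA interface): copy a w x h block out of a larger frame, one row at a time, stepping by the frame stride.

```rust
// Hypothetical sketch of a rectangle-based block copy: extract a
// `w` x `h` block at (x, y) from a frame laid out with `stride`
// bytes per row, skipping the bytes outside the block on each row.
fn dma_block(frame: &[u8], stride: usize, x: usize, y: usize,
             w: usize, h: usize) -> Vec<u8> {
    let mut block = Vec::with_capacity(w * h);
    for row in 0..h {
        let start = (y + row) * stride + x; // jump to this row of the block
        block.extend_from_slice(&frame[start..start + w]);
    }
    block
}
```

For a 16x16 macroblock in a 320-byte-wide plane, a single such call pulls the block out of the middle of the frame while skipping the other 304 bytes of every row.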
Vector multiplications in RSP were designed to write DSP-like filters, so they map quite well to the pixel filters required by H264. There are several different multiplication instructions for different fixed point precisions, and there's even one that automatically adds 0.5 (in the correct fixed point precision) which is also a common pattern in FIR filters, and also used in H264.
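The "multiply with built-in rounding" pattern can be sketched per lane in plain scalar Rust (this is an illustrative model of the fixed-point behavior, not the actual instruction semantics): a Q15 x Q15 product gets 0.5 ulp added in the intermediate precision before truncating back to Q15.

```rust
// Sketch of a rounding fixed-point multiply: Q15 * Q15 -> Q30
// intermediate, add 0.5 in that precision, truncate back to Q15.
// This is the FIR-filter-friendly pattern described above.
fn qmul_round(a: i16, b: i16) -> i16 {
    let prod = (a as i32) * (b as i32); // Q30 intermediate
    ((prod + (1 << 14)) >> 15) as i16   // +0.5 ulp, then back to Q15
}
```

The rounding constant is what turns a result that would truncate to 0 into 1 when the true value is at least halfway there.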
Saturation (VCH/VGE/VLT opcodes) is also supported; this is useful as most algorithms eventually need to saturate the calculated value into the 0-255 range, so that's another thing that usually requires 1 clock cycle for 8 pixels.
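Lane-wise, the saturation step amounts to this (a scalar sketch of what one clamp sequence does across all 8 lanes at once):

```rust
// Sketch of lane-wise saturation: clamp each signed 16-bit
// intermediate into the 0..=255 pixel range, for all 8 lanes.
fn saturate8(lanes: [i16; 8]) -> [u8; 8] {
    lanes.map(|v| v.clamp(0, 255) as u8)
}
```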
When working with 4x4 partitions, half of the vector lanes are ignored; when writing back to memory, you need a read/combine/write sequence (you may want to write 4 pixels while keeping the existing 4, but vector writes always write 8 pixels). In this case the VMRG instruction is used, which basically allows combining two vector registers into one, with a bitmask specifying which register each lane is taken from.
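A VMRG-style merge can be modeled like this (scalar sketch; the mask representation is illustrative):

```rust
// Sketch of a VMRG-style merge: per lane, take from `a` where the
// mask is set, otherwise keep the existing value from `b` — the
// read/combine/write pattern used when only 4 of 8 lanes are new.
fn vmrg(a: [i16; 8], b: [i16; 8], mask: [bool; 8]) -> [i16; 8] {
    let mut out = [0i16; 8];
    for i in 0..8 {
        out[i] = if mask[i] { a[i] } else { b[i] };
    }
    out
}
```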
For IDCT, it comes in very handy that most RSP opcodes allow partial broadcasts of the lanes of one of the input registers; this makes it possible to keep a 4x4 matrix in 2 consecutive registers and then play some tricks with broadcasts to multiply by rows and by columns (which is required by IDCT, where you need to compute A' x B x A, with A and B being 4x4 matrices; if you expand that, you will see that you need to rotate vectors a lot).
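The A' x B x A step looks like this in plain scalar form (no broadcasts; an illustrative reference, not the vectorized RSP code): the inner products walk the block once by rows and once by columns, which is why the vectorized version has to rotate lanes so much.

```rust
// 4x4 integer matrix multiply: C = A * B.
fn mul4(a: [[i32; 4]; 4], b: [[i32; 4]; 4]) -> [[i32; 4]; 4] {
    let mut c = [[0i32; 4]; 4];
    for i in 0..4 {
        for j in 0..4 {
            for k in 0..4 {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    c
}

fn transpose4(m: [[i32; 4]; 4]) -> [[i32; 4]; 4] {
    let mut t = [[0i32; 4]; 4];
    for i in 0..4 {
        for j in 0..4 {
            t[i][j] = m[j][i];
        }
    }
    t
}

// The IDCT-style similarity transform: A' x B x A.
fn similarity(a: [[i32; 4]; 4], b: [[i32; 4]; 4]) -> [[i32; 4]; 4] {
    mul4(mul4(transpose4(a), b), a)
}
```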
So well, it's actually a pretty good fit.
PS: the Gamasutra article shows the RSP code used to do colorspace conversion (YUV->RGB). The article says that it gives a big boost (and I can believe it: especially in MPEG1, CSC is like 30% of decoding time), but I brought it to basically 0% by letting the RDP do it (the RDP is the GPU in the N64). In fact, the RDP supports YUV textures: so in my H264 player, the RSP just does the interleaving (that is, merges the 3 separate Y, U, V planes into one) and then asks the RDP to blit a textured rectangle in the correct format. The RDP even runs in parallel with both the RSP and the CPU. It might be that, back in 2000, this wasn't fully documented by Nintendo, though I found several references in old Nintendo docs; I can't otherwise see why it wasn't used. Once you reverse engineer how to pass the correct constants, it works really well and brings the CSC cost to basically zero.
giovannibajo1 | 6 years ago | on: Achieving full-motion video on the Nintendo 64 (2000) [pdf]
I've actually since moved on to porting an H264 implementation to the N64. It's been a long journey and I'm now at around 18 FPS, after nights of manual RSP assembly optimization, vectorizing most of the intra-prediction and inter-prediction algorithms. I want to reach 30 FPS, so there's still some work to do.
giovannibajo1 | 6 years ago | on: How the Zoom macOS installer does its job without you clicking ‘install’
https://chrome.google.com/webstore/detail/google-meet-grid-v...
Which is even more infuriating, because it shows that the missing tiling in Meet is just a frontend issue.
I’m completely baffled that this is not implemented.
giovannibajo1 | 6 years ago | on: New System76 Laptop: Lemur Pro
giovannibajo1 | 6 years ago | on: Two Years with Rust
Compilation time is terrible: I need something between 20 and 40 seconds per edit-compile cycle (MacBook Pro 2016, i7, 16 GB RAM), which kills productivity for me, especially since, as a hobbyist, I tend to have small time windows. I should probably dedicate time to investigating how I can speed it up (split the project into more crates than naturally required, refactor how tests are written, etc.), but that is a turn-off by itself.
Productivity is still low. Even after writing so many lines I still don't feel productive; I still face problems writing code that compiles, and am forced to work around issues with partial borrowing of structures and the like. This happens any time I need to do something "new" from an architectural point of view (writing a new "component"), or if I attempt something "smart" (e.g., trying to refactor to reduce duplication). This matches what the article said: Go is far more productive for me. If the architecture is fixed and I just "write the code in the right place", then I can get decent speed (modulo the compile-time issue).
giovannibajo1 | 6 years ago | on: EOF is not a character
giovannibajo1 | 6 years ago | on: Italy is extending its coronavirus quarantine measures to the entire country
giovannibajo1 | 6 years ago | on: Rust Ownership Rules
I've googled everywhere but I can't find a single source of information that completely explains this, possibly showing realistic source code and elaborating on the ways it could generate memory errors or be miscompiled in the case of multiple mutable references.
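A small illustrative example of the kind of bug the exclusivity rule prevents (function and variable names are hypothetical): if a `&mut Vec` and a shared slice into the same Vec could coexist, a push could reallocate the buffer out from under the slice.

```rust
// Append every element of `src` onto `dst`.
fn push_all(dst: &mut Vec<i32>, src: &[i32]) {
    for &x in src {
        dst.push(x); // push() may reallocate dst's backing buffer
    }
}

fn demo() -> Vec<i32> {
    let mut v = vec![1, 2, 3];
    let extra = vec![4, 5];
    push_all(&mut v, &extra); // fine: `v` and `extra` don't alias
    // push_all(&mut v, &v);  // rejected: `&mut v` and `&v` alias.
    // If that call were allowed, push() could reallocate v's buffer
    // while `src` still pointed into the old, freed allocation —
    // a use-after-free; the compiler also couldn't assume `src` is
    // stable across the loop, blocking optimizations.
    v
}
```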
giovannibajo1 | 6 years ago | on: 2020 Leap Day Bugs
giovannibajo1 | 6 years ago | on: Microsoft Teams outage due to expired certificate
Even with letsencrypt, you still MUST have a monitoring system, notifications going to the right person, and in general an organization that can act on them. The problem is often not technical: your organization must be structured so that those notifications are acted upon. If anything, letsencrypt lessens the frequency of those notifications, so I posit that it's even worse from the point of view of validating your organization: with regular certs, you get notified once every year or two.
Anyway, letsencrypt is a good thing: it’s just not a solution to this problem.
giovannibajo1 | 6 years ago | on: The Throw Keyword Was a Mistake
This forces you to think: what should I do with this Result error code? Should I propagate it or handle it? This thinking is good and produces good design and good error handling.
For beginners it is very useful that, for instance, any function doing I/O returns a Result: it forces you to think about how your system should behave in case of an I/O error.
In C++ or Python, exception control flow is implicit. All functions could theoretically participate in an error chain, and you have no idea whether, where or why a function will raise an exception. Functions doing I/O are completely indistinguishable from the purest functions, and only experience tells you where and whether errors should be handled. The consequence is that it's very hard to find mature codebases with sensible error handling.
I should note that Go is basically the same here: the main difference is that it's more verbose in error propagation, and requires a configured environment (the errcheck linter) to detect when functions returning only an err are called without error handling/propagation. Rust is superior here, but Go has the same basic architecture, just less evolved.
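The propagate-or-handle decision described above can be sketched with Rust's standard Result and the `?` operator (the function names here are illustrative):

```rust
use std::fs;
use std::io;

// Any function doing I/O returns a Result, so every caller must
// decide: propagate the error with `?`, or handle it on the spot.
fn first_line(path: &str) -> Result<String, io::Error> {
    let text = fs::read_to_string(path)?; // `?` propagates the I/O error
    Ok(text.lines().next().unwrap_or("").to_string())
}

// Here we handle the error instead of propagating it further up:
// a deliberate design decision, forced by the type signature.
fn first_line_or_default(path: &str) -> String {
    match first_line(path) {
        Ok(line) => line,
        Err(_) => String::new(), // decide: a missing file means "empty"
    }
}
```

The signature alone tells you `first_line` can fail, which is exactly the property the comment contrasts with implicit exception flow.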