giovannibajo1's comments

giovannibajo1 | 6 years ago | on: How to speed up the Rust compiler in 2020

unsafe lets you exit the small subset of code patterns whose safety the compiler can prove.

Not everything you write inside an unsafe block is actually unsafe: it only means the compiler can't prove it safe. Another way to think about it: if unsafe only let you write genuinely unsafe code, it would be useless, because that unsafe code would eventually segfault.
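A minimal sketch of that point: the bounds check below is done by hand, so the unsafe block never actually exhibits undefined behavior, but the compiler can't prove it, so we must take responsibility with `unsafe`:

```rust
// The index is verified manually, so this unsafe code is safe in
// practice -- the compiler just can't prove it for us.
fn first_or_zero(v: &[i32]) -> i32 {
    if v.is_empty() {
        return 0;
    }
    // SAFETY: we just checked that v has at least one element.
    unsafe { *v.get_unchecked(0) }
}

fn main() {
    println!("{}", first_or_zero(&[7, 8, 9])); // prints 7
}
```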

giovannibajo1 | 6 years ago | on: Achieving full-motion video on the Nintendo 64 (2000) [pdf]

The DMA engine transfers one 64-bit word per bus clock cycle between the main shared memory (RDRAM) and the RSP's internal 4K DMEM (or IMEM, when transferring code).

So it's quite fast, but you need to remember that the main RDRAM is shared between the main CPU and the whole RCP (e.g. it's also used as video memory for textures and frame buffers by the RDP), so contention is really high.

giovannibajo1 | 6 years ago | on: ‘War Dialing’ tool exposes Zoom’s password problems

Meet has a 10-letter ID for meetings over HTTP and a 9-digit ID (like Zoom) for phoning in. It sounds complicated, but in practice every Meet invitation has a single-tap phone link that dials the correct number and inputs the conference ID after a pause, all encoded in the link. It works flawlessly, so it doesn't matter that the number is different from the conference URL you click on a computer.

giovannibajo1 | 6 years ago | on: Achieving full-motion video on the Nintendo 64 (2000) [pdf]

For H264? Basically everything.

Vector registers have 8 lanes of signed 16-bit values, so they map quite well to per-pixel calculations on each plane (YUV), which is what video codecs do: you can process 8 pixels at a time, and you have 16-bit precision to handle intermediate results.

The most complex hurdle is that the RSP only has 4K of data RAM, so you need to DMA macroblocks in and out a lot (especially since I can't possibly rewrite a FULL H264 decoder in RSP assembly, not in this lifetime: I only rewrite specific performance-sensitive algorithms, while the bulk of the decoder stays in C; this means the same data ends up going in and out of the RSP a lot, especially since the H264 decoder I'm using is not aware of this problem).

This said, RSP DMA is even rectangle-based, so it's another perfect fit: I can DMA a macroblock by specifying the pointer in RAM, the width and height (usually 16x16, but some algorithms work on sub-partitions of 8x8 or 4x4) and the stride (screen width), so a single DMA call will transfer the block from the middle of a frame, skipping the rest of the data.
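A hypothetical scalar sketch of what that rectangle-based DMA does: copy a macroblock out of a larger frame by walking `stride` bytes per row. Function name and parameters are illustrative, not actual RSP registers:

```rust
// Copy a w*h rectangle at (x, y) out of a frame whose rows are
// `stride` bytes apart, appending it to a small local buffer --
// the software equivalent of one rectangle-based DMA transfer.
fn dma_rect(frame: &[u8], stride: usize, x: usize, y: usize,
            w: usize, h: usize, dmem: &mut Vec<u8>) {
    for row in 0..h {
        let start = (y + row) * stride + x;
        dmem.extend_from_slice(&frame[start..start + w]);
    }
}
```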

Vector multiplications in the RSP were designed for writing DSP-like filters, so they map quite well to the pixel filters required by H264. There are several different multiplication instructions for different fixed-point precisions, and there's even one that automatically adds 0.5 (in the correct fixed-point precision), which is also a common pattern in FIR filters, and also used in H264.
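The rounding trick being described, sketched per lane in scalar Rust (the Q15 precision here is an illustrative choice, not a specific RSP opcode): adding half an LSB before the shift rounds the fixed-point product instead of truncating it.

```rust
// Rounding fixed-point multiply: Q15 * Q15 -> Q15, with +0.5 (i.e.
// 1 << 14) added before the down-shift. The RSP does this for all
// 8 lanes in a single instruction.
fn mul_round_q15(a: i16, b: i16) -> i16 {
    (((a as i32 * b as i32) + (1 << 14)) >> 15) as i16
}
```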

Saturation (VCH/VGE/VLT opcodes) is also supported; this is useful because most algorithms eventually need to saturate the calculated value to the 0-255 range, so that's another operation that usually takes one clock cycle for 8 pixels.
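What that saturation buys you, sketched in scalar Rust: clamping eight 16-bit intermediate results into the 0-255 pixel range (the RSP does all eight lanes at once):

```rust
// Saturate 8 signed 16-bit intermediate values to the 0-255 pixel
// range -- one instruction on the RSP, a per-lane clamp here.
fn saturate8(lanes: [i16; 8]) -> [u8; 8] {
    lanes.map(|v| v.clamp(0, 255) as u8)
}
```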

When working with 4x4 partitions, half of the vector lanes are ignored; when writing back to memory, you need to do a read / combine / write sequence (as you may want to write 4 pixels and keep the existing 4 pixels, but vector writes always write 8 pixels); in this case, the VMRG instruction is used, which basically allows combining two vector registers into one, with a bitmask specifying which register each lane is taken from.
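A scalar sketch of what VMRG does (names and types are illustrative): pick each lane from one of two vectors according to a mask, so a 4-pixel result can be combined with the 4 pixels already in memory before the full 8-lane write.

```rust
// Merge two 8-lane vectors lane by lane: take from `a` where the
// mask is set, from `b` otherwise -- the read/combine step before
// writing back a partial block.
fn vmrg(a: [i16; 8], b: [i16; 8], mask: [bool; 8]) -> [i16; 8] {
    let mut out = [0i16; 8];
    for i in 0..8 {
        out[i] = if mask[i] { a[i] } else { b[i] };
    }
    out
}
```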

For IDCT, it comes in very handy that most RSP opcodes allow partial broadcasts of the lanes of one of the input registers; this allows keeping a 4x4 matrix in two consecutive registers and then playing tricks with broadcasts to multiply by rows and by columns (which is required by IDCT, where you need to compute A' x B x A, with A and B being 4x4 matrices; if you expand that, you will see that you need to rotate vectors a lot).
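The core of that A' x B x A computation is just two 4x4 matrix products; a plain scalar sketch (on the RSP the row/column access patterns are done with lane broadcasts instead of explicit index loops):

```rust
// Plain 4x4 matrix multiply, C = A * B. Applying it twice gives
// the A' * B * A form used by the 4x4 IDCT.
fn mat4_mul(a: [[i32; 4]; 4], b: [[i32; 4]; 4]) -> [[i32; 4]; 4] {
    let mut c = [[0i32; 4]; 4];
    for i in 0..4 {
        for j in 0..4 {
            for k in 0..4 {
                c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    c
}
```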

So well, it's actually a pretty good fit.

PS: the Gamasutra article shows the RSP code used to do colorspace conversion (YUV->RGB). The article says that it gives a big boost (and I can believe it: especially in MPEG1, CSC is like 30% of decoding time), but I brought it down to basically 0% by letting the RDP do it (the RDP is the N64's GPU). In fact, the RDP supports YUV textures: in my H264 player, the RSP just does the interleaving (that is, merges the 3 separate Y, U, V planes into one) and then asks the RDP to blit a textured rectangle in the correct format. The RDP even runs in parallel with both the RSP and the CPU. It might be that, back in 2000, this wasn't fully documented by Nintendo, though I found several references in old Nintendo docs; otherwise I can't see why it wasn't used. Once you reverse engineer how to pass the correct constants, it works really well and brings the CSC cost to basically zero.
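The interleaving step can be sketched like this; the UYVY-style 4:2:2 packing below (one U/V pair shared by two Y samples) is purely illustrative, not the exact RDP texel layout:

```rust
// Merge planar Y, U, V into one packed buffer, with U and V shared
// per pixel pair (4:2:2). Layout is illustrative only.
fn interleave_uyvy(y: &[u8], u: &[u8], v: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(y.len() * 2);
    for i in 0..y.len() / 2 {
        out.push(u[i]);
        out.push(y[2 * i]);
        out.push(v[i]);
        out.push(y[2 * i + 1]);
    }
    out
}
```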

giovannibajo1 | 6 years ago | on: Achieving full-motion video on the Nintendo 64 (2000) [pdf]

Hi, I'm the guy working on that.

I've actually since moved on to porting an H264 implementation to the N64. It's been a long journey and I'm now at around 18 FPS, after nights of manual RSP assembly optimization, vectorizing most of the intra-prediction and inter-prediction algorithms. I want to reach 30 FPS, so there's still some work to do.

giovannibajo1 | 6 years ago | on: Two Years with Rust

I’m writing a largish (~15k lines) hobby program in Rust. I’ll focus on the negative points; there are many positive ones.

Compilation time is terrible: I need something between 20 and 40 seconds for an edit-compile cycle (MacBook Pro 2016, i7, 16 GB RAM), which kills productivity for me, especially since as a hobbyist I tend to have small windows of time. I should probably dedicate time to investigating how to speed it up (split the project into more crates than naturally required, refactor how tests are written, etc.), but that is a turn-off by itself.

Productivity is still low. Even after writing so many lines I still can’t feel productive; I still face problems writing code that compiles, and am forced to work around issues with partial borrowing of structures, or similar issues. This happens any time I need to do something “new” from an architectural point of view (writing a new “component”), or if I attempt something “smart” (e.g., refactoring to reduce duplication). This matches what this article said: Go is far more productive for me. If the architecture is fixed and I just “write the code in the right place”, then I can get decent speed (modulo the compile-time issue).
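For readers who haven't hit it, the partial-borrow friction mentioned above looks roughly like this (struct and field names are made up): borrowing two fields of the same struct directly is accepted, because the compiler sees the borrows are disjoint, while routing the same accesses through helper methods borrows the whole struct and gets rejected.

```rust
struct State {
    buf: Vec<u8>,
    count: usize,
}

impl State {
    fn tick(&mut self) {
        // Direct field access compiles: two disjoint mutable borrows.
        // Calling `self.get_buf()` and `self.get_count()` helper
        // methods here instead would borrow all of `self` twice
        // and be rejected.
        let b = &mut self.buf;
        let c = &mut self.count;
        b.push(*c as u8);
        *c += 1;
    }
}

fn main() {
    let mut s = State { buf: vec![], count: 0 };
    s.tick();
    s.tick();
    println!("{:?} {}", s.buf, s.count); // prints [0, 1] 2
}
```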

giovannibajo1 | 6 years ago | on: EOF is not a character

I think TYPE would also treat ^Z as a terminator of the file. It was common in DOS to have binary files with a textual header followed by ^Z, which would hide the binary part.

giovannibajo1 | 6 years ago | on: Rust Ownership Rules

Is there any explanation of this? Specifically, I would like to understand why a single mutable reference is required for correctness in a single-threaded scenario.

I’ve googled everywhere but I can’t find a single source of information that completely explains this, ideally with realistic source code elaborating on how multiple mutable references could generate memory errors or be miscompiled.
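One standard single-threaded answer, sketched below as a hedged example rather than a complete explanation: a `Vec` push can reallocate its buffer, so a second mutable path to the vector while a reference into it is alive would leave that reference dangling, exactly the use-after-free pattern C++ allows.

```rust
fn main() {
    let mut v = vec![1, 2, 3];
    let first = &mut v[0]; // mutable borrow into the Vec's buffer
    // v.push(4);          // rejected if uncommented: push may
                           // reallocate, leaving `first` dangling
                           // (a use-after-free in C++ terms)
    *first = 10;           // last use: the borrow ends here
    v.push(4);             // now the second mutation is allowed
    println!("{:?}", v);   // prints [10, 2, 3, 4]
}
```

The exclusivity rule also lets the compiler assume two mutable references never alias, which enables optimizations that would miscompile aliasing code.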

giovannibajo1 | 6 years ago | on: 2020 Leap Day Bugs

I think .replace() is a mistake and shouldn't exist in the first place. The way dates work, replacing a single component is almost always going to create problems in edge cases.

giovannibajo1 | 6 years ago | on: Microsoft Teams outage due to expired certificate

Automated renewal doesn’t mean that you don’t need monitoring. I’ve seen many letsencrypt-backed servers generating certificate errors because the automated renewal system broke (misconfigurations, borked updates, corporate-level firewalls/proxies blocking requests, etc.).

Even with letsencrypt you still MUST have a monitoring system, notifications going to the right person, and in general an organization that can act on this. The problem is often not technical: your organization must be structured in a way that those notifications are acted upon. If anything, letsencrypt lessens the frequency of those notifications so I posit that it’s even worse from the point of view of validating your organization: with regular certs, you get notified once every year or two.

Anyway, letsencrypt is a good thing: it’s just not a solution to this problem.

giovannibajo1 | 6 years ago | on: The Throw Keyword Was a Mistake

The big difference is that in Rust control flow is still explicit. All functions returning errors must declare so with Result, which in turn makes it impossible to forget to handle it at the call site.

This forces you to think: what should I do with this Result error code? Should I propagate it or handle it? This thinking is good and produces good design and good error handling.

For beginners it is very useful that, for instance, any function doing I/O returns a Result: it forces you to think about how your system should behave in case of an I/O error.
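The explicitness being described can be sketched in a few lines: the Result in the signature forces the caller to either handle the I/O error or propagate it with `?` (the function and path below are made up for illustration):

```rust
use std::fs;
use std::io;

// The Result in the signature is visible at every call site;
// ignoring it is a compiler warning, not a silent exception.
fn read_config(path: &str) -> Result<String, io::Error> {
    // `?` explicitly propagates the error to our caller.
    let text = fs::read_to_string(path)?;
    Ok(text)
}

fn main() {
    match read_config("/nonexistent/config.toml") {
        Ok(text) => println!("loaded {} bytes", text.len()),
        Err(e) => println!("decide here what an I/O error means: {}", e),
    }
}
```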

In C++ or Python, exception control flow is implicit. Any function could theoretically participate in an error chain, and you have no idea whether, why, or with what a function will raise an exception. Functions doing I/O are completely indistinguishable from the purest functions, and only experience tells you where and whether errors should be handled. The consequence is that it’s very hard to find mature codebases with sensible error handling.

I should note that Go is basically the same here: the main difference is that it’s more verbose in error propagation, and requires a configured environment (the errcheck linter) to detect when functions returning only an err are called without error handling/propagation. Rust is superior here, but Go has the same basic architecture, just less evolved.
