psykotic | 2 years ago | on: One thing that has changed in the professional game industry is that
psykotic's comments
psykotic | 2 years ago | on: Mold 2.0
psykotic | 3 years ago | on: Rust Atomics and Locks: Low-Level Concurrency in Practice
The author is right. ARMv8 supports relaxed memory ordering, but its original memory model does not support acquire-release ordering without sequential consistency: the ldar/stlr instructions give you acquire-release with sequential consistency included. (Update: Support for weaker acquire-release ordering was added in a later revision.)
I'll quote a relevant excerpt from B2.3.11:
Where a Load-Acquire appears in program order after a Store-Release, the memory access generated by the Store-Release instruction is Observed-by each PE to the extent that PE is required to observe the access coherently, before the memory access generated by the Load-Acquire instruction is Observed-by that PE, to the extent that the PE is required to observe the access coherently.
Sequential consistency is needed to avoid store/load reordering in cases like this:

// Thread 1
flag1.store(1, SeqCst)
if flag2.load(SeqCst) == 0 {
// Guarded action
}
// Thread 2
flag2.store(1, SeqCst)
if flag1.load(SeqCst) == 0 {
// Guarded action
}
If these were instead implemented with acquire/release ordering as defined by the C++ or Rust memory model, the resulting happens-before constraints would not prevent both threads from executing their guarded actions.

The excerpt from the Architecture Reference Manual says that if you use their load-acquire (ldar) and store-release (stlr) instructions, it is not possible for the store-release to be reordered after the load-acquire, as observed by PEs (processing elements, their abstraction of hardware threads).
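For concreteness, here is a runnable Rust version of the example above; the thread spawning, Arc wrappers, and the counting of guarded actions are my additions for illustration:

```rust
use std::sync::atomic::{AtomicU32, Ordering::SeqCst};
use std::sync::Arc;
use std::thread;

// Runs the two-thread store/load pattern once and returns how many
// threads executed their guarded action. Under SeqCst, at least one
// thread must observe the other's store, so the result is never 2.
fn run_once() -> usize {
    let flag1 = Arc::new(AtomicU32::new(0));
    let flag2 = Arc::new(AtomicU32::new(0));
    let (f1, f2) = (flag1.clone(), flag2.clone());

    let t1 = thread::spawn(move || {
        f1.store(1, SeqCst);
        // 1 iff this thread's guarded action ran
        (f2.load(SeqCst) == 0) as usize
    });
    let t2 = thread::spawn(move || {
        flag2.store(1, SeqCst);
        (flag1.load(SeqCst) == 0) as usize
    });

    t1.join().unwrap() + t2.join().unwrap()
}
```

With Acquire/Release in place of SeqCst, the C++/Rust memory models permit a run where both closures return 1.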
Let's look at how C++ compilers implement acquire-release vs sequential consistency on x86 and ARMv8:
https://godbolt.org/z/3fd5jse18
The machine code on ARMv8 is identical for thread_acq_rel and thread_seq_cst, whereas on x86 the thread_seq_cst code has to use xchg (an alternative to store + mfence) to achieve sequential consistency.
Update: shachaf pointed out that ARMv8 more recently added support for weaker acquire-release semantics in the ARMv8.3 revision. It looks like the first processor to ship with ARMv8.3 support was the A12X from Apple in 2018, which is 5 years after Herb's talk. If we take the code from before and compile for ARMv8 with all architectural features enabled, you will see different machine code for thread_acq_rel which uses the newer ldaprb instruction:
https://godbolt.org/z/dnP9sebcz
This illustrates a difficulty with talking about "ARMv8" as a fixed thing. It's much more of a rapidly moving target than x86. That said, the ARMv8.3 addendum should have been mentioned, at least parenthetically; I emailed the author suggesting an info box.
psykotic | 3 years ago | on: Why Infer Types?
The fact that

f(g(x))

and

{ let y = g(x); f(y) }

are equally robust to changes in g(x)'s type is an advantage of type inference. It doesn't seem reasonable to place an additional burden on the person who chose to assign a name to the subexpression value.

> Personally I really like the spot readability of explicit types.
You can use an IDE that displays inferred types as type inlays (for variables) and as sidebar info (for expressions). But I've often wanted a way to export such sideband data (including subexpression types and cross-reference links) so you could display it for code reviews, etc., without teaching those tools about your language specifically, so I'm sympathetic to the general goal.
Regarding explicitness and readability, I like Rust's idea of letting you partially specify types by using _ as a placeholder to say "infer this part". For example, you can specify that something is a Vec without specifying its item type:
let v: Vec<_> = iter.collect();

psykotic | 3 years ago | on: How to compare floats in Python
I'm curious what obstacle you have in mind for "multiple numbers".
The problem is equivalent to a distance query on a data structure, i.e. enumerate everything within a distance tol of a query point x in an appropriate distance metric. There are many data structures for this problem (it's a basic building block in computational geometry). But I assume the hashing solution you have in mind is what we usually call a hash grid in gamedev and graphics, which works well when tol is fixed for the lifetime of the data structure. [1]

Namely, you define a grid with cell size tol and map a point x to the index cell(x) of its containing cell, which is just a rounding operation. Then you can use a hash table keyed by cell(x) to store the contents of an unbounded, sparse grid. To perform a distance query from x you need to check not just x's cell but some of the immediately neighboring cells and then filter all of those candidates based on the pairwise distance. [2] [3]
This approach works with any distance metric (including the usual suspects L^2, L^1 and L^inf) and in any dimension d, although the worst-case number of cells to check in the hash grows as 3^d with cell size tol (you can get this down to 2^d by doubling the cell size), so the curse of dimensionality is present and in high-dimensional spaces another data structure not based on rectilinear partitioning would be preferable. But hashing does work with "multiple numbers" if by "multiple numbers" you mean a vector (with not too many components) where tolerance is defined by a distance metric.
[1] Actually, hash grids work well any time the query radius is on roughly the same scale as the cell size. But if the query radius is a lot smaller than the cell size then the grid is too coarse to perform adequate pre-filtering. And if the query radius is larger than the cell size you have to check a bigger neighborhood of cells (i.e. more hash lookups) to enumerate all the candidates.
[2] Depending on the read vs write ratio, you can actually flip this around by storing each point in multiple cells so that queries only need one hash table lookup at the expense of slightly less precise pre-filtering.
[3] Instead of doing multiple hash table lookups, you can also have each occupied cell store pointers to neighboring cells (null for unoccupied neighbors), which replaces hash table lookups for the neighbors with plain memory loads. A variation on this trick is to instead store bit flags to avoid doing hash table lookups for neighboring cells which are known to be empty; this takes up far less memory.
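In code, the fixed-tol hash grid described above looks roughly like this. Everything here (the names, 2D, the L^2 metric) is an illustrative choice, not from any library:

```rust
use std::collections::HashMap;

// Sketch of a hash grid with cell size equal to the query tolerance.
struct HashGrid {
    tol: f64,
    cells: HashMap<(i64, i64), Vec<(f64, f64)>>,
}

impl HashGrid {
    fn new(tol: f64) -> Self {
        HashGrid { tol, cells: HashMap::new() }
    }

    // cell(x) is just a rounding operation.
    fn cell(&self, p: (f64, f64)) -> (i64, i64) {
        ((p.0 / self.tol).floor() as i64, (p.1 / self.tol).floor() as i64)
    }

    fn insert(&mut self, p: (f64, f64)) {
        self.cells.entry(self.cell(p)).or_default().push(p);
    }

    // Distance query: use the 3x3 neighborhood of q's cell as the
    // pre-filter, then filter the candidates by exact pairwise distance.
    fn query(&self, q: (f64, f64)) -> Vec<(f64, f64)> {
        let (cx, cy) = self.cell(q);
        let mut out = Vec::new();
        for dx in -1i64..=1 {
            for dy in -1i64..=1 {
                if let Some(pts) = self.cells.get(&(cx + dx, cy + dy)) {
                    for &(px, py) in pts {
                        if ((px - q.0).powi(2) + (py - q.1).powi(2)).sqrt() <= self.tol {
                            out.push((px, py));
                        }
                    }
                }
            }
        }
        out
    }
}
```

The 3x3 neighborhood is the 2D instance of the 3^d worst case; the variants in the footnotes trade this scan for extra storage or per-cell metadata.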
psykotic | 4 years ago | on: Hyper-realistic digital humans in Unity
psykotic | 4 years ago | on: Fleet, a Lightweight IDE from JetBrains
That said, your thesis about the relative difficulty of doing this for C++ vs Java seems hard to argue against, and the rapid, continuous growth in C++'s surface area and complexity since C++11 surely hasn't helped.
psykotic | 4 years ago | on: Git ls-files is Faster Than Fd and Find
psykotic | 4 years ago | on: The Pyret Programming Language
psykotic | 4 years ago | on: B-Trees: More Than I Thought I'd Want to Know
Yes, and it took a surprisingly long time before anyone bothered to do the theoretical analysis. Some good papers to check out if anyone is interested are Deletion Without Rebalancing in Multiway Search Trees (http://sidsen.azurewebsites.net/papers/relaxed-b-tree-tods.p...) and Deletion Without Rebalancing in Binary Search Trees (http://sidsen.azurewebsites.net//papers/ravl-trees-journal.p...). Jeff Erickson also has some good notes on using local and global rebuilds for search trees: https://jeffe.cs.illinois.edu/teaching/algorithms/notes/10-s...
It's intuitively clear that worst-case performance [1] can degrade to a function of the number of insertions since the last rebuild (as opposed to the number of currently live items) when you use relaxed deletion. If you can afford to globally rebuild the tree you get to play the usual amortization tricks, but global rebuilds can be problematic in a large-scale live database; one solution is to rebuild from a snapshot in the background (on another thread or machine) and then replay late-arriving operations before switching over to the rebuilt tree.

But if someone is okay with the amortization spikes from rehashing a hash table, they should be okay with non-incremental global rebuilds for search trees; a hash table where deletions leave behind tombstones and the table is rehashed when the ratio of tombstones gets too high is basically the same idea.
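The tombstone analogy can be made concrete with a toy open-addressing set. The linear probing, the 25% rebuild threshold, and all names below are illustrative choices, not taken from the papers above:

```rust
#[derive(Clone, PartialEq)]
enum Slot { Empty, Tomb, Key(u64) }

struct TombTable {
    slots: Vec<Slot>,
    tombs: usize,
}

impl TombTable {
    fn new(cap: usize) -> Self {
        TombTable { slots: vec![Slot::Empty; cap], tombs: 0 }
    }

    fn insert(&mut self, k: u64) {
        if self.contains(k) { return; }
        let cap = self.slots.len();
        for i in 0..cap {
            let j = (k as usize + i) % cap; // linear probing
            match self.slots[j] {
                Slot::Empty | Slot::Tomb => {
                    if self.slots[j] == Slot::Tomb { self.tombs -= 1; }
                    self.slots[j] = Slot::Key(k);
                    return;
                }
                _ => {}
            }
        }
        panic!("table full");
    }

    fn contains(&self, k: u64) -> bool {
        let cap = self.slots.len();
        for i in 0..cap {
            match self.slots[(k as usize + i) % cap] {
                Slot::Empty => return false,
                Slot::Key(x) if x == k => return true,
                _ => {}
            }
        }
        false
    }

    fn remove(&mut self, k: u64) {
        let cap = self.slots.len();
        for i in 0..cap {
            let j = (k as usize + i) % cap;
            match self.slots[j] {
                Slot::Empty => return,
                Slot::Key(x) if x == k => {
                    self.slots[j] = Slot::Tomb; // "relaxed" deletion
                    self.tombs += 1;
                    break;
                }
                _ => {}
            }
        }
        // Non-incremental global rebuild once tombstones pass the
        // threshold; amortized the same way as globally rebuilding a
        // relaxed-deletion search tree.
        if self.tombs * 4 > cap {
            let live: Vec<u64> = self.slots.iter()
                .filter_map(|s| if let Slot::Key(x) = s { Some(*x) } else { None })
                .collect();
            self.slots = vec![Slot::Empty; cap];
            self.tombs = 0;
            for k in live { self.insert(k); }
        }
    }
}
```

Until a rebuild fires, probe lengths depend on insertions-since-rebuild rather than live items, which is exactly the degradation described above.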
psykotic | 4 years ago | on: AVX512/VBMI2: A Programmer’s Perspective
psykotic | 4 years ago | on: Slitter: A slab allocator that trusts, but verifies
[1] Ideally before most shared libraries have been loaded. I should do a simulation to get some quantitative estimates, but my hunch is that ASLR hurts badly even if each loaded library only takes up one 4 kB page. E.g. if you have an empty address space, allocate a 4 kB page at a random address and split the space in two around it, then try to densely pack as many 8 GB ranges in there as possible, you lose one potential 8 GB range (even if the ranges only require 4 kB alignment for their base address). Whenever such splits don't land in the same 8 GB clump, you lose a potential 8 GB range per split.
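The packing argument above can be checked with a few lines of arithmetic rather than a full simulation. The 256 GiB region size below is an illustrative assumption, and greedy per-gap counting is optimal here because an 8 GiB range can't span an allocated page:

```rust
// Max number of disjoint, 4 KiB-aligned 8 GiB ranges that fit in a
// `total`-byte region after placing one 4 KiB page at each address in
// `pages`. Each gap between consecutive pages independently holds
// floor(gap / 8 GiB) ranges.
fn max_ranges(total: u64, mut pages: Vec<u64>) -> u64 {
    const PAGE: u64 = 4096;
    const RANGE: u64 = 8 << 30; // 8 GiB
    pages.sort();
    let mut n = 0;
    let mut start = 0u64;
    for p in pages {
        n += p.saturating_sub(start) / RANGE;
        start = start.max(p + PAGE);
    }
    n + (total - start) / RANGE
}
```

A single page always costs exactly one 8 GiB range; with several pages, ranges are lost per split that lands in a distinct 8 GiB clump, matching the claim above.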
psykotic | 4 years ago | on: B-tree Path Hints
With 8-byte keys and the usual 64-byte cache line size, you end up with 8 keys per interior node's key array. That's a pretty bad range for the "simple linear search" most people would write, which is going to be dominated by the expected branch mispredict cost. The idea of using a hint to guess a starting position for either linear or binary search has the same issue. You need branches to capitalize on the best case vs worst case spread, and the mispredict cost will dominate for smaller arrays like this.
If you care about speed for this case, use a fast branchless binary search or a vectorized linear search (the latter will have an edge). An additional minor consideration for B-tree traversal is the latency of vector vs scalar loads since you don't know the load address (except for the root node) until you've finished searching the parent node's key array. That is not necessarily true for vectorized searches of small arrays in general, where the vector load addresses may be key independent (like the B-tree root node is here) and so in that case the few extra cycles of L1 load-to-use latency for vector loads (and higher load latencies in general) can often be hidden.
[1] By B-tree I am referring to in-memory B+ trees. A disk-based B+ tree would usually be built around page-granularity nodes. In that regime a branchless binary search for integer/float keys (with software prefetch to populate the cache ahead of the data dependency chains) that finishes off with a vectorized linear search is likely going to be the winner. The B-tree slack interval (M/2, M] helps since it guarantees a fully unrolled binary search with lg(M) steps won't waste any steps. You can still support "underflowing" B-tree nodes with the same branchless code path.
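As a sketch of what "branchless binary search" means here (generic over the array length, assuming a non-empty sorted slice; not code from the article):

```rust
// Branchless lower bound: index of the first key >= needle, or a.len()
// if there is none. The loop runs a fixed number of halving steps and
// the position update compiles to arithmetic / a conditional move, so
// there are no data-dependent branches to mispredict.
fn lower_bound(a: &[u64], needle: u64) -> usize {
    debug_assert!(!a.is_empty());
    let mut base = 0usize;
    let mut len = a.len();
    while len > 1 {
        let half = len / 2;
        // Branch-free: advance past the lower half iff its last key
        // is still less than the needle.
        base += ((a[base + half - 1] < needle) as usize) * half;
        len -= half;
    }
    base + ((a[base] < needle) as usize)
}
```

With M a power of two, the (M/2, M] slack interval mentioned above means a fully unrolled version of this loop never wastes a step.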
psykotic | 4 years ago | on: Beating the L1 cache with value speculation
Anyway, invariant RDTSC's tick rate is completely separate from the core clock. So the main issue you have to worry about with invariant RDTSC is having your process unscheduled or having ticks "stolen" by interrupts (which includes firmware invisible to the kernel or hypervisor).
psykotic | 4 years ago | on: Your CPU may have slowed down on Wednesday
psykotic | 4 years ago | on: RISC-V Evolving to Address Supercomputers and AI
If you start with a word in a GPR and want to compute its popcount into another GPR, you need to move the GPR into a vector register, do the CNT, do the horizontal sum, and move the result back into a GPR.
It's been a while since I did the test, but I think the latency I measured on the Apple M1 perf cores (Firestorm) for that instruction sequence was around 15 cycles. [1] Compare that to POPCNT being 3 cycles on Intel and it really hurts for latency-sensitive uses of word popcount. However, CNT is great for doing large reductions since you can do 31 rounds of CNT + vector add before the bytewise counters can overflow (each round adds at most 8 to a byte counter that holds up to 255), so you can amortize away the cost of the horizontal reduction. (People end up doing similar things on x86 for large streaming popcount calculations.)
A lot of my micro-optimized code relies on low-latency word popcount, so I selfishly want every processor to have an Intel-style popcount. :)
[1] You can actually beat the latency of that very easily with an L1-resident byte popcount lookup table + sum reduction. This table is only 256B, so it's very practical. You can shave off 1-2 cycles of latency by going to a halfword popcount lookup table, which is 64KB. That kind of cache hogging will perform better in microbenchmarks than in real applications but with the M1's large L1 cache of 128KB per core it might be a net win if you have a tight loop that's heavily affected by popcount latency.
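A sketch of the 256-byte lookup-table variant from footnote [1]; the compile-time table build and the names are mine, not from the comment:

```rust
// 256-entry popcount-per-byte table, built at compile time.
const BYTE_POP: [u8; 256] = {
    let mut t = [0u8; 256];
    let mut i = 0;
    while i < 256 {
        t[i] = (i as u32).count_ones() as u8;
        i += 1;
    }
    t
};

// Word popcount as 8 L1-resident table loads plus a sum reduction.
// The loads are independent, so they pipeline; the latency is roughly
// one load-to-use latency plus the small add chain.
fn popcount_lut(mut x: u64) -> u32 {
    let mut sum = 0u32;
    for _ in 0..8 {
        sum += BYTE_POP[(x & 0xFF) as usize] as u32;
        x >>= 8;
    }
    sum
}
```

The halfword variant mentioned above is the same idea with a [u8; 65536] table and 4 lookups per word.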
psykotic | 4 years ago | on: Helix: a post-modern modal text editor
psykotic | 4 years ago | on: A Dive into Ray Tracing Performance on the Apple M1
psykotic | 5 years ago | on: Handles are the better pointers (2018)
For something like B-trees you want to make a distinction between internal and external references. Internal references ("child pointers") can just be 32-bit indices into an array of fixed-size items (each the size of a cache line or a multiple thereof), or even naked pointers if you don't mind the inefficiency on 64-bit platforms. Depending on the requirements, external references are often best represented as some kind of handle with explicit registration. This is especially helpful if you need to manage lifetimes for persistent B-trees (where different trees can share subtrees): the external reference handles serve as GC roots, so you require them to be explicitly registered, while the internal references are owned by the B-tree heap and can therefore be direct while still being traceable or refcountable. This also works well for dealing with reclamation in a concurrent system.
So, you almost always want to distinguish internal vs external references, and handles are primarily a solution candidate for external references.
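A minimal shape for this split might look like the following; the node layout, the u32::MAX sentinel, and the handle scheme are all illustrative:

```rust
// Internal references: u32 indices into a pool of fixed-size nodes.
// External references: explicitly registered handles that double as
// the GC root set for the pool.
struct Node {
    keys: [u64; 7],
    children: [u32; 8], // internal child "pointers"; u32::MAX = none
}

struct Pool {
    nodes: Vec<Node>,
    roots: Vec<u32>, // registered external references (GC roots)
}

#[derive(Clone, Copy)]
struct Handle(usize); // index into `roots`

impl Pool {
    fn new() -> Self {
        Pool { nodes: Vec::new(), roots: Vec::new() }
    }

    fn alloc(&mut self) -> u32 {
        self.nodes.push(Node { keys: [0; 7], children: [u32::MAX; 8] });
        (self.nodes.len() - 1) as u32
    }

    // External references must be registered so a tracing or
    // refcounting pass over the pool can start from `roots`.
    fn register(&mut self, node: u32) -> Handle {
        self.roots.push(node);
        Handle(self.roots.len() - 1)
    }

    fn resolve(&self, h: Handle) -> u32 {
        self.roots[h.0]
    }
}
```

Traversal inside the pool uses raw indices with no indirection; only crossings of the pool boundary pay for the handle lookup.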
psykotic | 5 years ago | on: Cranelift, Part 3: Correctness in Register Allocation
As a self-taught programmer who later went to university for pure mathematics (which I got interested in through game programming), that was not my experience with the late-90s/early-2000s generation of self-taught game programmers. I think you have to go back 5 to 10 years earlier for that stereotype to have merit; the transition from 2D to 3D filtered out some people.
In the post-Quake generation, the sexy thing for budding game programmers was 3D graphics programming. For the most part you had no choice [1] but to read, understand and implement SIGGRAPH papers and doctoral dissertations to learn the ropes, which in turn forced us to learn a lot of the formal prerequisites, albeit unevenly. And there were a lot of advanced data structures involved in visibility determination (and later level of detail), which was the focal point of graphics in the Quake 1 and 2 era before it fell by the wayside once the dominance of GPUs made more brute-force approaches the right choice. So whether you liked it or not, you had to master a decent swath of university-level CS and math.
The main things self-taught programmers generally lacked when joining the industry were the same things all new employees lack: they don't yet know how to work in teams, they don't know how to manage time, scope and complexity (though it helps if you've worked on larger personal projects of your own), they cannot make long-term engineering trade-offs (reliability, cost of making changes early vs late, flexibility vs performance, etc.) because they haven't gone through full project cycles, and so on.
[1] There were a few useful resources like Abrash's Dr. Dobbs articles on Quake which were partially pre-digested for easier consumption by programmers, but it was still daunting if you didn't understand the math. And the rate of progress was so high in this period that the few reliable sources of how-to information were so quickly out of date that you _had_ to keep up with the academic research and experiment by implementing and improving it yourself.