Storing data in pointers

[+] marcus_holmes|2 years ago|reply

I looked at the Go implementation of this in "tagged pointers" [0]

The amount of data that can be used for the tag is architecture-dependant, and the routine discards any tag bits that don't fit into the tagged pointer without telling the caller.

To me, this seems ridiculous - why not just use a struct with a tag and a pointer, and not run the risk of your tag being destroyed without you knowing because the architecture can't fit that many bits?

But the Go folks are smart, and must be doing this for a reason. Can anyone explain the thinking here?

[0] https://github.com/golang/go/blob/master/src/runtime/tagptr_...

[+] wheels|2 years ago|reply

There are a lot of data structures where doubling the size of the struct would be a big deal. I've worked on big graphs, where I used similar tricks, because most of the storage of a graph is pointers, and limits of what can be quickly computed are bound by how much of the graph you can get in memory.

[+] Splizard|2 years ago|reply

Alignment, if you add a uint8 tag for example, you'll end up effectively doubling your pointer size.

[+] saagarjha|2 years ago|reply

Auxiliary data is not free; by stuffing your data into a pointer you don’t need to pay the cost of extra storage.

[+] none_to_remain|2 years ago|reply

One good use case is if you are doing atomic operations on pointers, you can atomically modify the tag as well.

[+] yencabulator|2 years ago|reply

It's up to the low-level code to stay within taggedPointerBits, and the value is arch-dependent.

For netpoll, it looks to be just a cyclic epoch counter, so the size doesn't really matter: https://github.com/golang/go/blob/4956c3437bd2f4448bcec51321...

For use in malloc, it's assumed to be >=10 and verified at init: https://github.com/golang/go/blob/4956c3437bd2f4448bcec51321...

[+] intelVISA|2 years ago|reply

It is not a Go thing, tagged pointers have been around forever!!

[+] unknown|2 years ago|reply

[deleted]

[+] JonChesterfield|2 years ago|reply

There's a trick here I hadn't noticed. Good times.

If the plan is 8 byte aligned data and use the three low bits for other stuff, you mask them off before loads/stores. An alternative is to use the high three bits, store the pointer/8, and multiply by 8 to retrieve.

That's appealing on x64 because memory operations can include a *8 in the encoding.

Specifically, I've long wondered whether the pointer masking tricks mess up prefetch/speculation. It seems plausible that making the dereference look more like a normal memory operation is helpful for that.

(It also means the low-alignment-bits and high-unused-bits can be considered contiguous modulo the *8 on decode which is probably less annoying when using both ranges)

[+] dkersten|2 years ago|reply

Note that Intels 5-level paging uses 57 bit addresses, so on these processors, you can only safely assume the top 7 bits are unused. https://en.m.wikipedia.org/wiki/Intel_5-level_paging

I also believe that when masking the bits off before loads/stores, you need to set them to the same value of the highest used bit (bit 47 or bit 56), while this often is 0, it’s not necessarily the case. Something to be aware of.

Finally, when using C++, you gotta be careful of strict aliasing. In C++20 you can use std::bit_cast https://en.cppreference.com/w/cpp/numeric/bit_cast

[+] KMag|2 years ago|reply

It's really unfortunate that all of the mainstream OSes run userspace in the lower portions of the address space.

Setting the most-significant 13 bits (really, setting the second to 12th bits and at least one of the following bits) of an IEEE-754 float will result in a NaN bit pattern. That means that any pointer to the top 2 petabytes of a 64-bit address space, if cast to an IEEE-754 double will be NaN. This means the NaN-boxing used in Safari and Firefox's JavaScript engines, LuaJIT, etc. would be no-ops. (Safari and Firefox use different mechanisms, but they'd become the same if moved to the top of the address space.)

It's not enough of a performance difference to re-jigger everything in mainstream OSes, but I imagine if someone were to come up with a unikernel/exokernel OS specifically for JITing some dynamic language, there's some performance to be had by having all of the dynamic language objects in the upper 2 petabytes of the address space.

[+] jandrese|2 years ago|reply

Very old Macs used this trick to squeeze their ROM routines down a bit, operating with 24 bit addressing and using the top bits for flags and whatnot. Of course they ran into trouble when machines with 16MB of memory started appearing. If you do this you might be making more work for yourself in the future when you buy a new machine with 256EB of main memory.

[+] drfuchs|2 years ago|reply

What’s old is new again: IBM 360 architecture also only used 24 bits for addresses back in the 1960s, so you could stuff data in the upper byte of pointers. Eventually, programs had to declare themselves to be in “Extended Mode” if they wanted to get access to 31-bit addressing instead.

[+] vidarh|2 years ago|reply

I'm guessing the problem occurred specifically with CPUs newer than the 68000 irrespective of amount of RAM?

(Microsoft's) Amiga Basic also did this and stopped working on newer CPUs, as the 68000 only has 24 address lines and ignores the top 8 bits of a 32 bit address, but 68020 and up uses all 32 (I don't remember about the 68010, but that was pin compatible with the 68000 so I'm guessing not)

[+] Dylan16807|2 years ago|reply

x86_64 (among others) specifically avoided that incompatibility, by the CPU forcing programs to mask those bits out instead of ignoring them. So programs are very compatible into the future, they just need to limit the range their memory allocator uses.

Then it later added explicit automatic masking, which also avoids the problem. As long as your program can make do with smaller amounts of memory, there are no downsides.

[+] KMag|2 years ago|reply

And in later versions of Classic Mac OS, for each executable, metadata was stored on whether to set the hardware for 32-bit or 24-bit addressing. As a kid, I definitely monkeyed with the checkbox there not understanding what it did. Having seen no change in behavior, I guess all the applications I used were 32-bit clean.

[+] eschneider|2 years ago|reply

Yeah, this was definitely a thing on old macs. Us olds also remember the pain and agony of cleaning this up for 32 bit addressing. Cool as these hacks are, do yourself a favor and avoid the (eventual) agony.

[+] utopcell|2 years ago|reply

Slight correction: 256 TiB, not EiB.

[+] jandrewrogers|2 years ago|reply

This is a nice summary of the practical aspects and considerations. It isn’t something anyone should be doing explicitly on a regular basis but there are occasions, particularly in libraries, where it is the perfect tool for the job.

There is also the inverse use case: smuggling pointers inside status codes, enums, and similar. For example, optionally encoding a pointer to additional error metadata for non-zero result codes. In C++ it isn’t that uncommon to also see Result types implemented similarly when the type allows it.

[+] alpaca128|2 years ago|reply

There's also the Rust library smartstring which allows creating short strings without heap allocations: a Rust string consists of three values (pointer, capacity, length), so 24B on a 64bit architecture. With one byte used as tag this leaves 23 bytes for storing the string in the same space.

[+] skitter|2 years ago|reply

Because Rust strings are utf-8, not all byte sequences are valid strings, so you can actually store a 24-byte string inline (which is what compact_str does).

[+] OskarS|2 years ago|reply

C++'s std::string does some version of this as well (at least on all major standard libraries), it's known as the "small string optimization".

[+] powera|2 years ago|reply

I don't want to say you should never do this.

But if you aren't writing a compiler or an embedded system, you should never do this.

[+] spacechild1|2 years ago|reply

It is also very useful for some lockfree algorithms, e.g. to fit a pointer and a sequence number within a 64-bit value.

[+] JonChesterfield|2 years ago|reply

Dynamic language runtimes really like this. It gives you type information without dereferencing the pointer.

[+] KolmogorovComp|2 years ago|reply

Any graph-heavy application constrained by memory potentially gain a lot from those kind of optimizations.

[+] UncleEntity|2 years ago|reply

> But if you aren't writing a compiler or an embedded system, you should never do this.

I have this hex trie library that uses this to differentiate between a leaf and a node that will soon be optimizing single-value nodes as tagged pointers. __edit__ umm... single-value nodes are leaves!?! Yeah, not enough coffee apparently.

Not to mention the tinyscheme interpreter I occasionally poke at that TFA gave me a bunch of ideas to try out.

[+] unknown|2 years ago|reply

[deleted]

[+] dmt314159|2 years ago|reply

This will be a problem when trying this on Capability Hardware Enhanced RISC Instructions (CHERI) https://www.cl.cam.ac.uk/research/security/ctsrd/cheri/ systems. Currently the only CPU with this implemented is Arm's Morello prototype.

Pointers are replaced by 128-bit capabilities containing what operations are valid, a 64-bit base address and a size. These capabilities are unforgeable, so trying to play with "unused" parts of 64-bit pointers simply won't work.

FreeBSD & Gnome are running on hardware after minor source-code changes, along with a significant proportion of FreeBSD ports collection.

As well as ARM, Microsoft is interested: https://www.microsoft.com/en-us/research/project/portmeirion...

[+] rr808|2 years ago|reply

My favorite hack back in MFC days was a combo box which I stored the pointer address in the text (to the right after a lot of spaces so was hidden). When a user chooses an item, parse the pointer and de-reference it back to an object.

[+] CoastalCoder|2 years ago|reply

I predict that some day, you'll be beaten to death by an angry mob of screen-reader users :)

[+] userbinator|2 years ago|reply

I guess you didn't come across CB_SETITEMDATA / CB_GETITEMDATA ?

[+] jonasmerlin|2 years ago|reply

"I think it's quite well known that on a 64-bit system, the maximum bit-width of a virtual address is somewhat lower (commonly 48-bits)." might actually be a perfect example of the Average Framiliarity xkcd[0]. It's perfectly fine to write it that way and the article obviously knows its intended audience. But I'm wondering what percentage of readers here actually knew this beforehand. (I learned it only pretty recently myself. And yes, I should have known this sooner, but ... didn't.)

[0] https://xkcd.com/2501/

[+] asb|2 years ago|reply

Article author here. Thanks for calling that out - the intent wasn't to make readers who might not have been aware feel bad. It really came from my worry about being seen to write about something that is trivial and everyone knows already!

Perhaps I'd be better just dropping "I think it's quite well known that"?

[+] ForOldHack|2 years ago|reply

My first thought was to use the Address calculation logic for an additional ALU, then... my second thought was trying to justify this during a code review... and lastly, why Microsoft used the LDT upper byte to make Xenix 286 incompatible... and the headache that changing architectures made for poor programming.

[+] _a_a_a_|2 years ago|reply

The alpha only accessed 8-byte aligned mem addresses (originally anyway). The bottom 3 bits were ignored (masked out) on lookups, per the spec, to allow users to stuff juicy extra info into these.

[+] asb|2 years ago|reply

Do you have a reference on that? This summary of the Alpha AXP <https://danluu.com/dick-sites-alpha-axp-architecture.pdf> states "Normal load or store instructions that specify an unaligned address take a precise data alignment trap to PALcode (which may do the access using two aligned accesses or report a fatal error, depending on the operating system design)"

[+] unknown|2 years ago|reply

[deleted]

66 comments