ygoldfeld | 1 year ago
Using manual mapping (same address values on both sides, as you mentioned) was one idea that a couple people preferred, but I was the one who was against it, and ultimately this was heeded. So that meant:
Raw pointer T* becomes Allocator<T>::pointer. So if user happens to enjoy using raw pointers directly in their structures, they do need to make that change. But, beats rewriting the whole thing… by a lot.
container<T> becomes container<T, Allocator<T>>, where `container` was your standard or standard-compliant (uses allocator properly) container of choice. So if user prefers sanity and thus uses containers (including custom ones they developed or third-party STL-compliant ones), they do need to use an allocator template argument in the declaration of the container-typed member.
But, that’s it - no other changes in data structure (which can be nested and combined and …) to make it SHM-sharable.
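To make that concrete, here's a minimal sketch of those two changes. `Ipc_allocator`/`Ipc_pointer` are hypothetical stand-in names (aliased to std::allocator so the snippet compiles on its own); the real library supplies the actual SHM-friendly allocator and its fancy-pointer:

```cpp
#include <memory>
#include <vector>

// Stand-in for the SHM-friendly allocator (here just std::allocator, so
// this compiles standalone; the real one allocates from a SHM arena).
template<typename T>
using Ipc_allocator = std::allocator<T>;

// The pointer alias comes from the allocator via allocator_traits:
// raw T* for std::allocator, a fancy-pointer for a SHM-friendly one.
template<typename T>
using Ipc_pointer = typename std::allocator_traits<Ipc_allocator<T>>::pointer;

struct Widget
{
  Ipc_pointer<int> m_cached = nullptr;            // change 1: was `int*`
  std::vector<int, Ipc_allocator<int>> m_values;  // change 2: was `vector<int>`
};
```

No other changes to `Widget` are needed; nesting such structs or containers works the same way.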
We in the library “just” have to provide the SHM-friendly Allocator<T> for the user to use. And, since stateful allocators are essentially unusable by mere humans in my subjective opinion (the boost.interprocess authors apparently disagree), we use a particular trick to work with an individual SHM arena: the “Activator” API.
So that leaves the mere topic of this SHM-friendly fancy-pointer type, which we provide.
For SHM-classic mode (if you’re cool with one SHM arena = one SHM segment, both sides being able to write to SHM, and the boost.interprocess allocation algorithm) -- enabled with a template-arg switch when setting up your session object -- that’s just good ol’ offset_ptr.
For SHM-jemalloc (which leverages jemalloc, and hence is multi-segment and cool like that, plus has better segregation/safety between the sides), there are internally multiple SHM segments, so offset_ptr is insufficient. Hence we wrote a fancy-pointer for the allocator, which encodes the SHM segment ID and the offset within it, all in 64 bits. That sounds haxory and hardcore, but it’s not so bad really. BUT! It also needs to be able to point outside SHM (e.g., into the stack, which is often used when locally building up a structure), so it needs to be able to encode an actually-raw vaddr as well. And still use 64 bits, not more. Soooo I used pointer tagging, as not all 64 bits of a vaddr carry information.
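To illustrate the idea, here's a toy encoding of my own devising (not the library's actual bit layout): on x86-64, canonical user-space addresses use only the low 48 bits, so the top 16 bits are free to carry a tag -- say, segment ID plus one -- while an all-zero tag still means "raw vaddr":

```cpp
#include <cassert>
#include <cstdint>

// Toy tagged-pointer encoding (assumption: user-space vaddrs fit in the
// low 48 bits with bits 48..63 zero, as on typical x86-64 setups).
constexpr std::uint64_t SEG_SHIFT = 48;
constexpr std::uint64_t OFF_MASK  = (std::uint64_t(1) << SEG_SHIFT) - 1;

// Encode a SHM location: top 16 bits hold (segment ID + 1), so a nonzero
// tag can never be confused with a raw canonical vaddr.
std::uint64_t encode_shm(std::uint16_t seg_id, std::uint64_t off)
{
  return ((std::uint64_t(seg_id) + 1) << SEG_SHIFT) | (off & OFF_MASK);
}

// A zero tag means the 64 bits are just a raw virtual address.
bool is_raw(std::uint64_t bits) { return (bits >> SEG_SHIFT) == 0; }

std::uint16_t segment(std::uint64_t bits)
{
  return std::uint16_t((bits >> SEG_SHIFT) - 1);
}

std::uint64_t offset(std::uint64_t bits) { return bits & OFF_MASK; }
```

The real fancy-pointer would dereference by looking up the segment's local base vaddr and adding the offset; this sketch only shows the encode/decode arithmetic.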
So that’s how it all works internally. But hopefully to the user none of these details is necessary to understand. Use our allocator when declaring container members. Use allocator’s fancy-pointer type alias (or similar alias, we give ya the aliases conveniently hopefully) when declaring a direct pointer member. And specify which SHM-backing technique you want us to internally use - depending on your safety and allocation perf desires (currently available choices are SHM-classic and SHM-jemalloc).
elBoberido | 1 year ago
We started with mapping the shm to the same address but soon noticed that it was not a good idea. It works until some application has already mapped something to that address. It's good that you did not go that route.
I hoped you had an epiphany and found a nice solution for the raw-pointer problem without the need to change them, so we could borrow that idea :) Replacing the raw pointers with fancy-pointers is indeed much simpler than replacing the whole logic.
Since the raw pointers need to be replaced by fancy-pointers, how do you handle STL containers? Is there a way to replace the pointer type, or some other magic?
Hehe, we have something called 'relative_ptr' which also tracks the segment ID + offset. It is a struct of two uint64_t, though. Later on, we needed to condense it to 64 bits to prevent torn writes in our lock-free queue exchange. We went the same route and encoded the segment ID in the upper 16 bits, since only 48 bits are used for addressing. It's kind of funny that other devs also converge on similar solutions. We also have something called 'relocatable_ptr'. This one tracks only the offset to itself and is nice for building relocatable structures which can be memcopied, as long as the offset points to a place within the copied memory. It's essentially boost::offset_ptr.
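For readers following along, a minimal self-relative pointer in the spirit of the 'relocatable_ptr'/offset_ptr described here might look like this. This is a sketch of the technique, not iceoryx's (or boost's) actual implementation; note that real ones such as boost::offset_ptr use an offset-of-1 trick so that the null encoding doesn't collide with a pointer to itself, which this toy skips:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Toy self-relative pointer: stores the distance from its own address to
// the pointee. A struct containing it can be memcpy'd, and the copy's
// pointer stays valid, as long as the pointee moved along with it.
template<typename T>
class Relocatable_ptr
{
public:
  Relocatable_ptr() : m_off(0) {}

  Relocatable_ptr& operator=(T* p)
  {
    // 0 encodes null in this toy (so don't point it at itself).
    m_off = p ? reinterpret_cast<std::intptr_t>(p)
                  - reinterpret_cast<std::intptr_t>(this)
              : 0;
    return *this;
  }

  T* get() const
  {
    return m_off ? reinterpret_cast<T*>(
                     reinterpret_cast<std::intptr_t>(this) + m_off)
                 : nullptr;
  }

private:
  std::intptr_t m_off; // distance from this object to the pointee
};
```

A quick way to see the relocatability: make a struct whose pointer member targets another member of the same struct, memcpy the whole struct, and observe that the copy's pointer now targets the copy's member.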
Btw, when you use jemalloc, do you free the memory from a different process than the one which allocated it? We did the same for iceoryx1 but moved to a submission-queue/completion-queue architecture to reduce complexity in the allocator and free the memory in the same process that did the allocation. With iceoryx2 we also plan to be more dynamic and have ideas to implement multiple allocators with different characteristics. Funnily, jemalloc is also on the table for use-cases where fragmentation is not a big problem. Maybe we can create a common library for shm allocation strategies which could be used by both projects.
ygoldfeld | 1 year ago
> I hoped you had an epiphany and found a nice solution for the raw-pointer problem without the need to change them and we could borrow that idea :)
Well, almost. But alas, I am unable to perform magic in which a vaddr in process 1 means the same thing in process 2, without forcing it to happen by using that mmap() option. And indeed, I am glad we didn't go down that road -- it would have worked within Akamai due to our kernel team being able to do such custom things for us, avoiding any conflict and so on; but this would be brittle and not effectively open-sourceable.
> Since the raw-pointer need to be replaced by fancy-pointer, how do you handle STL container? Is there a way to replace the pointer type or some other magic?
Yes, through the allocator. An allocator is, at its core, three things: 1, what to execute when asked to allocate; 2, what to execute when asked to deallocate; 3, and this is the relevant part here, what is the pointer type. This used to be an alias `pointer` directly in the allocator type; in modern C++ it's determined through `std::allocator_traits`. Point being: an allocator type can have the pointer type just be T*; or it can alias it to a fancy-pointer type. Furthermore, to be STL-compliant, a container type must religiously follow this convention and never rely on T* being the pointer type. Now, in practice, some GNU libstdc++ containers are bad-boys and don't follow this; they will break; but happily:
- clang's libc++ are fine;
- boost.container's are fine (and, of course, implement exactly the required API semantics in general... so you can just use 'em);
- any custom-written containers should be written to be fine; for example see our flow::util::Basic_blob which we use as a nailed-down vector<uint8_t> (with various goodies like predictable allocation size behavior and such) for various purposes. That shows how to write such a container that properly follows STL-compliant allocator behavior. (But again, this is not usually something you have to do: the aforementioned containers are delightful and work. I haven't looked into abseil's.)
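A tiny demonstration of that traits mechanism, with toy types of my own (`Fancy_ptr` does nothing useful; it exists only to show how one alias in the allocator swaps the pointer type a conforming container will use):

```cpp
#include <cstddef>
#include <memory>
#include <type_traits>

// A do-nothing fancy-pointer, just for the traits demonstration.
template<typename T>
struct Fancy_ptr { T* raw; };

// Allocator with no `pointer` alias: allocator_traits defaults to T*.
template<typename T>
struct Plain_alloc
{
  using value_type = T;
  T* allocate(std::size_t n) { return std::allocator<T>{}.allocate(n); }
  void deallocate(T* p, std::size_t n) { std::allocator<T>{}.deallocate(p, n); }
};

// Allocator WITH a `pointer` alias: one line swaps the pointer type that
// any STL-compliant container using this allocator must store internally.
template<typename T>
struct Fancy_alloc
{
  using value_type = T;
  using pointer = Fancy_ptr<T>; // <-- the relevant line
  // (allocate()/deallocate() omitted; this type exists only for the demo.)
};

static_assert(std::is_same_v<std::allocator_traits<Plain_alloc<int>>::pointer,
                             int*>);
static_assert(std::is_same_v<std::allocator_traits<Fancy_alloc<int>>::pointer,
                             Fancy_ptr<int>>);
```

A real SHM-friendly allocator would of course also make `pointer` dereferenceable and implement allocate()/deallocate() against the arena; the static_asserts just show where the pointer type plugs in.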
So that's how. Granted, subtleties don't stop there. After all, there isn't just "one" SHM arena, the way there is just one general heap. So how to specify which SHM-arena to be allocating-in? One, can use a stateful allocator. But that's pain. Two, can use the activator trick we used. It's quite convenient in the end.
> Btw, when you use jemalloc, do you free the memory from a different process than from which you allocate?
No; this was counter to the safety requirements we wanted to keep to with SHM-jemalloc. By default we don't even turn on writability into a SHM-arena by any process except the one that creates/manages the arena -- and you can't deallocate without writing. Hence there is some internal, async IPC that occurs for borrower-processes: once a shared_ptr<T> group pointing into SHM reaches ref-count 0, behind the scenes (and asynchronously, since deallocating need not happen at any particular time and shouldn't block user threads), it will indicate this fact to the lending-process. Then once all such borrower-processes have done this, and the same has occurred with the original shared_ptr<T> in the lender-process (which allocated in the first place), the deallocation occurs back in the lender-process.
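Here's a toy single-process model of that lifetime protocol (the names and mechanism are my simplification, not Flow-IPC's internals): each process's handle is an ordinary shared_ptr whose custom deleter fires when that process's group dies; the last group to die triggers the actual deallocation:

```cpp
#include <cassert>
#include <memory>

// One counter per shared object: how many shared_ptr "groups" (one per
// process, in the real cross-process version) still hold it.
static int  g_live_groups = 0;
static bool g_deallocated = false;

// Simulate a process opening a handle to a SHM-resident object. The custom
// deleter runs when THIS group's ref-count hits 0; the last group to die
// performs the (here, simulated) deallocation.
std::shared_ptr<int> open_group(int* shm_obj)
{
  ++g_live_groups;
  return std::shared_ptr<int>(shm_obj, [](int*)
  {
    if (--g_live_groups == 0)
    {
      g_deallocated = true; // real impl: free the SHM allocation, in the lender
    }
  });
}
```

In the real thing the "last group died" signal from a borrower reaches the lender via async IPC rather than a shared counter, but the observable semantics are the same: nothing is freed until every process's shared_ptr group has hit ref-count 0.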
If one chooses to use SHM-classic (which -- I feel compelled to keep restating for some reason, not sure why -- is a compile-time switch for the session or structure, not some sort of global decision), then it's all simplicity itself (and very quick -- atomic-int-quick): offset_ptr, plus an internally-stored ref-count of owner-processes; once it reaches 0, whichever process/piece of code caused that will itself deallocate.
The idea of its design is that one could plug-in still more SHM-providers instead of SHM-jemalloc or SHM-classic. It should all keep working through the magic of concepts (not formal C++20 ones... it's C++17).
---
Somewhere above you mentioned collaboration. I claim/hope that Flow-IPC is designed in a pragmatic/no-frills way (tried to vaguely imitate boost.interprocess that way) that exposes whichever layer you want to use, publicly. So, to give an example relating to what we are discussing here:
Suppose someone wants to use iceoryx's badass lock-free mega-fast one-microsecond transmission. But, they'd like to use our SHM-jemalloc dealio to transmit a map<string, vector<Crazy_ass_struct_with_more_pointers_why_not>>. I completely assure you I can do the following tomorrow if I wanted:
- Install iceoryx and get it to essentially work, in that I can transmit little constant-size blobs with it at least. Got my mega-fast transmission going.
- Install Flow-IPC and get it working. Got my SHM-magic going.
- In no more than 1 hour I will write a program that uses just the SHM-magic part of Flow-IPC -- none of its actual IPC-transmission itself per se (which I claim itself is pretty good -- but it ain't lock-free custom awesomeness suitable for real-time automobile parts or what-not) -- but uses iceoryx's blob-transmission.
It would just need to ->construct<T>() with Flow-IPC (this gets a shared_ptr<T>); then ->lend_object<T>() (this gets a tiny blob containing an opaque SHM-handle); then use iceoryx to transmit the tiny blob (I would imagine this is the easiest possible thing to do using iceoryx); on the receiver call Flow-IPC ->borrow_object<T>(). This gets the shared_ptr<T> -- just like the original. And that's it. It'll get deallocated once both shared_ptr<T> groups in both processes have reached ref-count 0. A cross-process shared_ptr<T> if you will. (And it is by the way just a shared_ptr<T>: not some custom type monstrosity. It does have a custom deleter, naturally, but as we know that's not a compile-time decision.)
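In pseudocode, that integration flow (the method names are the ones given above; the session/arena object names and the iceoryx calls are placeholders, not real signatures):

```cpp
// Process A (lender):
//   auto obj  = shm_arena->construct<T>(/* ctor args */); // shared_ptr<T>, lives in SHM
//   auto blob = shm_session->lend_object<T>(obj);         // tiny blob: opaque SHM-handle
//   iceoryx_publish(blob);                                // ship the blob via iceoryx
//
// Process B (borrower):
//   auto blob = iceoryx_take();
//   auto obj  = shm_session->borrow_object<T>(blob);      // shared_ptr<T>, like the original
//
// Deallocation happens back in process A once both groups hit ref-count 0.
```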
So yes, believe it or not, I was not trying to out-compete you all here. There is zero doubt you're very good at what you do. The most natural use cases for the two overlap but are hardly the same. Live and let live, I say.