In case anyone wonders why they are called German strings: the article mentions the "research predecessor" of Cedar, Umbra. Umbra is a project of TU (technical university) Munich, Germany.
I knew the C++ strings were optimized so. I do not like calling them "German". First time I see them called so (I know it, you can search yourself, as SSO -short string opt.), and looks as some kind of nationalist pride thing to me. Is certainly not a unique or new idea, many scheme/lisps implementations do that for strings AND numbers.
Downvotes coming from other connationals :) love you! I know… never say anything bad about Vaterland
Interesting to see a deepdive about string formats. I hadn't thought very deeply about it before.
I do agree with the string imutable argument. Mutable and imutable strings have different usecases and design tradeoffs. They perhaps shouldn't be the same type at all.
The transient string is particularly brilliant. Ive worked with some low level networking code in c, and being able to create a string containing the "payload" by pointing directly to an offset in the raw circular packet buffer is very clean. (the alternative is juggling offsets, or doing excessive memcpy)
So beyond the database usecase it's a clever string format.
It would be nice to have an ISO or equivalent specification on it though.
> The transient string is particularly brilliant. Ive worked with some low level networking code in c, and being able to create a string containing the "payload" by pointing directly to an offset in the raw circular packet buffer is very clean. (the alternative is juggling offsets, or doing excessive memcpy)
It's not anything special? That's just `string_view` (C++17). Java also used to do that as an optimisation (but because it was implicit and not trivial to notice it caused difficult do diagnose memory leaks, IIRC it was introduced in Java 1.4 and removed in 1.7).
I never really put much thought into it either, until I started playing with Rust, which pretty much supports every common way to use strings out there. Mostly for compatibility sake, but still, it's wild all the same.
"Optimized" string types are everywhere and I bet that multiple people have already created string types almost identical to German strings. But the memory savings are small and they are not more efficient than ordinary strings. For string comparison you compare the pointers, which is cheaper than comparing two pairs of registers. If the pointers mismatch you compare the (cached) hashes and only if they match do you need to compare characters. For the prefix query, starts_with(content, 'http'), just store a string of the four-character prefix. With immutable strings the overhead is just one pointer.
Do you have a pointer to real world data about the effectiveness of these optimizations? I learned about it (SSO, in std lib which is basically the same) in an article which really made it look as that would make anything in C++ blazing fast. In the codebases I worked, a couple of times, I did measure (what you shoud do before optimizing) and the results where between absolutely negligible to worst when active. But that were 3 data points. Mind you one in a real time database.
> We would like to have a string that is very cheap to construct and points to a region of memory that is currently valid, but may become invalid later without the string having control over it.
> This is where transient strings come in. They point to data that is currently valid, but may become invalid later, e.g., when we swap out the page on which the payload is stored to disk after we’ve released the lock on the page.
> Creating them has virtually no overhead: They simply point to an externally managed memory location. No memory allocation or data copying is required during construction! When you access a transient string, the string itself won’t know whether the data it points to is still valid, so you as a programmer need to ensure that every transient string you use is actually still valid. So if you need to access it later, you need to copy it to memory that you control.
Hm. What if I don't bother with that and I just read from the transient string? It's probably still good.
> In C, strings are just a sequence of bytes with the vague promise that a \0 byte will terminate the string at some point.
> This is a very simple model conceptually, but very cumbersome in practice:
> What if your string is not terminated? If you’re not careful, you can read beyond the intended end of the string, a huge security problem!
This sounds like a problem that transient strings were designed to exemplify. How do they improve on the C model?
-----
I was interested that the short strings use a full 32-bit length field. That's a lot of potential length for a string of at most 12 characters.
If we shaved that down to the four bits necessary to represent a number from 0-12, we'd save 28 bits, which is 3.5 characters. Adding three characters to the content would bring the potential length of a short string up to 15, requiring 0 additional length bits. And we'd have four bits left over.
I assume we aren't worried about this because strings of length 13-15 are already rare and it adds a huge amount of complexity to parsing the string, but it was fun to think about.
If you want to improve equality matching for longer strings, you could even store a 4B hash of the entire string instead of the prefix. I guess that should work well if you equality match on URLs since their prefix is always "http".
This is actually really similar to how SQL Server has long encoded it's varchar(max) format as I understand it. Short text is stored on the row page, but longer text is bumped to a different page.
Postgres does the same thing, however AFAIK postgres does not use a fixed-size string which happens to have inline string data: text is always variable, and stored inline up to 127 bytes (after compression).
These are different because the inline segment is fixed-size, and always exposes a 4 bytes prefix inline even when the buffer is stored out of line.
Joel had a very nice quote - the whole history of C/C++ is them trying to deal with strings. In a way it is both worrying and encouraging that 50 years in there is still development in the area.
The main difference is that you don't know how many code points you have in the prefix as they use variable encoding so it can be up to four but as little as one. I imagine the choice of four bytes for the prefix was actually done specifically for this reason. That's the maximum length of a UTF-8 code point.
The length is not the number of characters anymore but just the size of the string.
> To solve these problems, Umbra, the research predecessor of CedarDB, invented what Andy Pavlo now affectionately (we assume ;)) calls “German-style strings”.
This is how Borland Turbo Pascal stored strings as far back as the first version in mid-80s.
I think is about the kind of union they use, to store it differently depending on the string length, not the fact of length+data. Anyway is/was also nothing remotely new (the idea) as many lisp and scheme implementations have done so for strings and numbers basically for ages.
German-style strings is a way to store array of strings for columnar dbs. The idea is to have an array of metadata. Metadata has a fixed size (16 bytes) The metadata includes the string length and either a pair of pointer + string prefix or the full string for short strings. For some operations the string prefix is enough in many cases avoiding the indirection.
ahartmetz|1 month ago
weinzierl|1 month ago
f1shy|1 month ago
Downvotes coming from other connationals :) love you! I know… never say anything bad about Vaterland
aDyslecticCrow|1 month ago
I do agree with the string imutable argument. Mutable and imutable strings have different usecases and design tradeoffs. They perhaps shouldn't be the same type at all.
The transient string is particularly brilliant. Ive worked with some low level networking code in c, and being able to create a string containing the "payload" by pointing directly to an offset in the raw circular packet buffer is very clean. (the alternative is juggling offsets, or doing excessive memcpy)
So beyond the database usecase it's a clever string format.
It would be nice to have an ISO or equivalent specification on it though.
masklinn|1 month ago
It's not anything special? That's just `string_view` (C++17). Java also used to do that as an optimisation (but because it was implicit and not trivial to notice it caused difficult do diagnose memory leaks, IIRC it was introduced in Java 1.4 and removed in 1.7).
tracker1|1 month ago
Rygian|1 month ago
https://news.ycombinator.com/item?id=41176051
bjourne|1 month ago
f1shy|1 month ago
unknown|1 month ago
[deleted]
thaumasiotes|1 month ago
> This is where transient strings come in. They point to data that is currently valid, but may become invalid later, e.g., when we swap out the page on which the payload is stored to disk after we’ve released the lock on the page.
> Creating them has virtually no overhead: They simply point to an externally managed memory location. No memory allocation or data copying is required during construction! When you access a transient string, the string itself won’t know whether the data it points to is still valid, so you as a programmer need to ensure that every transient string you use is actually still valid. So if you need to access it later, you need to copy it to memory that you control.
Hm. What if I don't bother with that and I just read from the transient string? It's probably still good.
> In C, strings are just a sequence of bytes with the vague promise that a \0 byte will terminate the string at some point.
> This is a very simple model conceptually, but very cumbersome in practice:
> What if your string is not terminated? If you’re not careful, you can read beyond the intended end of the string, a huge security problem!
This sounds like a problem that transient strings were designed to exemplify. How do they improve on the C model?
-----
I was interested that the short strings use a full 32-bit length field. That's a lot of potential length for a string of at most 12 characters.
If we shaved that down to the four bits necessary to represent a number from 0-12, we'd save 28 bits, which is 3.5 characters. Adding three characters to the content would bring the potential length of a short string up to 15, requiring 0 additional length bits. And we'd have four bits left over.
I assume we aren't worried about this because strings of length 13-15 are already rare and it adds a huge amount of complexity to parsing the string, but it was fun to think about.
xnorswap|1 month ago
I wonder if they also have the concept of a reverse string which stores the (reversed) suffix instead and stores the short strings backward.
Niche, but would be fast for heavy ends-with filters.
msichert|1 month ago
kardianos|1 month ago
masklinn|1 month ago
These are different because the inline segment is fixed-size, and always exposes a 4 bytes prefix inline even when the buffer is stored out of line.
orphea|1 month ago
tracker1|1 month ago
ReptileMan|1 month ago
cubefox|1 month ago
nkrisc|1 month ago
unknown|1 month ago
[deleted]
jmclnx|1 month ago
StopDisinfo910|1 month ago
The main difference is that you don't know how many code points you have in the prefix as they use variable encoding so it can be up to four but as little as one. I imagine the choice of four bytes for the prefix was actually done specifically for this reason. That's the maximum length of a UTF-8 code point.
The length is not the number of characters anymore but just the size of the string.
Apart from that, it should work exactly the same.
masklinn|1 month ago
f1shy|1 month ago
WaitWaitWha|1 month ago
This is how Borland Turbo Pascal stored strings as far back as the first version in mid-80s.
Length followed by the string.
f1shy|1 month ago
mau|1 month ago
This is different from Pascal strings.
afandian|1 month ago
xnorswap|1 month ago
Pascal strings are: { length, pointer }
In these strings:
For short strings it's storing:
for longer strings, it's storing