A good day to trie-hard: saving compute 1% at a time

[+] amluto|1 year ago|reply

If you had asked me to make a wild guess as to how Cloudflare stores internal headers and then removes them, I would have come up with some options:

- An entire separate dictionary or other data structure.

- One single header containing all internal metadata.

- All headers have a prefix, and the internal ones start with I and the external ones start with E.

- All internal headers start with “CFInt”.

I would not have come up with a scheme in which headers in a particular list are internal. (What if someone else uses this name? What if something forgets to sanitize? What if different simultaneously running programs disagree in the list? What if the Connection header names a Cloudflare-internal header? What if the set-difference algorithm is annoyingly slow?)

The web is already full of obnoxiously ambiguous in-band signaling and header naming, and I find it bizarre that a company with Cloudflare’s scale uses such a tedious and error-prone mechanism internally.

[+] e63f67dd-065b|1 year ago|reply

Former employee here; the interesting (horrifying?) fact is that you can set some of these internal headers (I remember a notable bug with cf-cache-status) in workers (the serverless offering) and cause all kinds of bad things to happen :)

[+] efitz|1 year ago|reply

Ive worked at several huge corporations in IT security, where we care about headers a lot, and they all use headers in a manner similar to CloudFlare.

Including using proxies at the edge to strip out internal headers bidirectionally- yes, inbound too.

[+] hn_throwaway_99|1 year ago|reply

I feel like I can point out lots of similar problems to the other solutions you suggest (heck, I thing some of those problems you list even apply to those other solutions).

The list approach has some downsides, but it also has a bunch of upsides. I feel like when people like to point out potential flaws of these approaches, they're ignoring the history and difficulties that comes with Cloudflare's scope. An enumerated list is the simplest and most flexible of the approaches, and it also doesn't require any a priori agreement on the structure of a header key - this is probably important when you think about the sheer number of teams at Cloudflare, potential technology acquisitions, etc.

[+] jgrahamc|1 year ago|reply

Yeah, me too, but systems grew over time and grew and grew and we were using HTTP headers for all sorts of stuff. This optimization makes the situation better, but the solution (which is underway) is to use a different mechanism for IPC and get rid of the use of HTTP headers completely.

[+] taeric|1 year ago|reply

Other ideas:

  - Ensure all headers added internally that are not for export at the front of the header list
  - Preseed the hashtable of all requests with internal headers you plan to remove.

In fact, if you preseed, you are basically combining these ideas but fixing how many internal headers are on each request. At that point, you can use a linked hash table that preserves creation order and just remove the first N from the final list that you send back to clients.

[+] anonymoushn|1 year ago|reply

X years ago the manager of the cache team raised the lack of data plane and control plane separation as an issue and we came up with various plans to fix things, but I guess nothing has happened since

[+] danbruc|1 year ago|reply

This might of course introduce quite a bit of overhead, but the clean solution would arguably be to not mix requests to begin with, the external requests you are processing are the payload of the requests you are sending around internally. This would also allow you to cook up an optimized encapsulation protocol that might be more efficient to process than parsing and modifying HTTP. This admittedly comes from someone with essentially zero knowledge of what exactly Cloudflare is doing to the requests they receive.

[+] giancarlostoro|1 year ago|reply

> What if someone else uses this name?

Couldn't you bypass this by pre-fixing the ones that aren't yours? Or prefixing your internal ones with something unique enough?

[+] unknown|1 year ago|reply

[deleted]

[+] 620gelato|1 year ago|reply

Or perhaps even, insert yet another header with just the list of internal headers being added to the request, assuming this happens at a single place, otherwise a recipe for disaster.

I have a slightly different example of this, where a rpc framework used in my company disallows the service owner from modifying certain headers (say request identifier), and will instead create a duplicate header with a suffix. In that scenario at least, I can see this as a fairly reasonable tradeoff, as the goal is to control certain headers from being modified, not because they are platform internal but because there are certain assumptions associated with those headers that are followed company-wide.

I'll go check what mechanism is used for this matching.

[+] pragma_x|1 year ago|reply

It took me a moment to ponder the effectiveness of mapping utf8 characters into a bitmask. At first, it seemed unwise. Then I realized that 32 bits would get you `a-z` plus six special characters, and 64 bits would get you 'A-Z' (uppercase) plus six more. That's plenty of room for HTTP headers, and for a blazing fast matching algorithm since it just masks and compares a couple integers. Even figuring out which bit corresponds to which character is a simple lookup into a 256-word table.

One thing the author leaves out is this technique is technically a Bloom Filter. I find these kinds of things fascinating since they came about in an era where computing was far more resource bound than today (1970 in this case). But here we are, still finding real-world corners to use the same old optimizations.

https://en.wikipedia.org/wiki/Bloom_filter

[+] samatman|1 year ago|reply

There are significant differences between trie-hard and a Bloom filter. Bloom filters are probabilistic, and use hashing. They're great for when rare false positives are acceptable in exchange for no false negatives, which isn't uncommon, but isn't what's needed here: this is exact, and hashing is the step to beat.

Rather, this is a refinement/variation on Liang's algorithm, used in TeX for storing a hyphenation dictionary. The thesis mentions Bloom's algorithm, which it calls "superimposed coding", and very much hearkens back to a time when memory was often the most precious commodity. I think you'll like it. ^_^

https://tug.org/docs/liang/liang-thesis.pdf

[+] gcbirzan|1 year ago|reply

The problem is that bloom filters are good when your dataset is large and calculating several hashes per item is not as expensive as looking it up in the original dataset. Here, they say that even one hash is expensive, so bloom filters would be much slower than the original hash map.

[+] gregsexton|1 year ago|reply

The presented data for demonstrating a win doesn't have enough power to actually show this - not enough samples were taken.

A very simple analysis in R:

    > prop.test(c(9, 4), c(1103,1171))

    2-sample test for equality of proportions with continuity correction

    data:  c(9, 4) out of c(1103, 1171)
    X-squared = 1.4915, df = 1, p-value = 0.222
    alternative hypothesis: two.sided
    95 percent confidence interval:
    -0.002409831  0.011897193
    sample estimates:
        prop 1      prop 2 
    0.008159565 0.003415884

A p-value of 0.22 isn't below the magic 0.05 and the 95% confidence interval suggests that the trie might actually be slightly worse.

I imagine the trie is better, given the prior analysis, and there is weak evidence for this. But take (a lot) more samples and know with confidence how much better.

[+] Validark|1 year ago|reply

It's silly to use Big-O to describe the number of comparisons when the goal is to analyze performance. Comparisons cost 1 cycle, and you can do many of them per cycle with instruction level parallelism and SIMD. The main bottleneck, the actual source of slowness, is memory. It costs thousands of cycles to go to memory, sometimes tens of thousands or hundreds of thousands (if you need a TLB walk or OS interrupt). If you want to use Big-O, use it to estimate the number of cache misses.

I'd probably go with a custom perfect hash function. And Phil Bagwell's popcount trick. That would be faster than everyone else's solution that involves multiple lookups in memory.

The CPU is fast, memory is slow.

[+] mightyham|1 year ago|reply

I'm not very well versed in data structure optimization, but I was surprised they dismissed hash tables so quickly, especially when the table they are searching through is static. I find it somewhat hard to believe that a specially optimized hash table would not be faster than their trie implementation.

[+] evmar|1 year ago|reply

This post uses a fancy data structure to construct a to_remove set, then iterates through it to remove these from the underlying header map.

It appears the "remove_header" call is this one: https://docs.rs/pingora-http/0.3.0/src/pingora_http/lib.rs.h... which calls .remove() on two other data structures, both of which bottom out in this mountain of code: https://docs.rs/http/latest/src/http/header/map.rs.html#1550

[+] miso2024|1 year ago|reply

Finally a blog post with a trie. Those trie leetcodes didn't go in vain ;)

[+] aidenn0|1 year ago|reply

Given that the set if items to match is static, I wonder if they tried a perfect hash table; that would reduce it to a few arithmetic operations followed by a single string compare. It would be interesting to see how it compares to the trie.

[+] sjf|1 year ago|reply

That is what I immediately thought as well. The problem though is that a perfect hash is still going to be O(key length) for the hash function, which they are trying to avoid.

In theory, if they used a regex it would be using a state machine to do the matching, which should have similar performance to the trie- only O(k) in the worst case. But from what I understand regex libraries don't actually build the state machine, they use backtracking so the performance is no longer O(k).

I'm surprised they couldn't find a performant regex library that already existed that uses state machines. It should have similar performance to the trie. But in reality, the performance is affected more deeply by things like memory access patterns and the performance of specific arithmetic operations, so it's impossible really to speculate.

[+] Scaevolus|1 year ago|reply

Did you try a (tiny) bloom filter? Doing a quick convolution of the header key and a test against a bloom filter (SIMD? Using builtin CRC ops? 256-bit bloom filter size?) might avoid walking the trie altogether in most cases at the cost of a few cycles.

[+] jgrahamc|1 year ago|reply

I do not know if the team did that but Bloom filters aren't a panacea: https://blog.cloudflare.com/when-bloom-filters-dont-bloom/

[+] pragma_x|1 year ago|reply

Considering that each node's filter in their trie implementation is bloom-like already, the solution is awful close. I agree that testing wider swaths of the string at once might be a dramatic accelerator.

A more naive approach might also be warranted. Some frequency analysis of hit/miss for trie nodes might allow them to find specific character positions with a higher miss rate than the first one. Testing such special positions first would speed things up. That assumes, of course, that the header data is fairly regular in nature.

[+] unknown|1 year ago|reply

[deleted]

[+] jkaptur|1 year ago|reply

I think that would require calculating the hash, which was ruled out as too slow.

[+] unusualmonkey|1 year ago|reply

My 2c:

1) Is this worthwhile? Looks like ~500 CPU cores were saved (are these real cores, or does this include hyperthread cores?). I don't know cloudflare's costs, but this seems like single digit servers and probably savings only in the $XX,XXX range. Not nothing, but do you expect a positive ROI on engineering?

2) If you do want to go to this detail, did you consider in putting the filter at the deserialization step, preventing the headers from being created in the first place?

[+] aetherspawn|1 year ago|reply

It might only be $XX,XXX but it is forever.

Power savings forever.

Lower carbon emissions.

etc

Nice to see a company that cares about making things 1% faster rather than trying to analyse headers with AI to sell me sandals or something stupid like that.

[+] cryptonym|1 year ago|reply

That's such a tiny corner case but yet something you can easily publish. Fuelling marketing content to ends with hundred geeks discussing if what CF is doing makes sense or how much they care about perf.

That's million dollars.

[+] _zoltan_|1 year ago|reply

I don't understand your point about the ROI. Let's say it's 40k a year for 5 years that's 200k. That's a multi year salary Hungary/Poland (I have no idea of CF has offices there.)

[+] saagarjha|1 year ago|reply

If an engineer spends a couple of days on it then the savings seem worthwhile?

[+] hyperpape|1 year ago|reply

I'm interested to know why the regex crate didn't do better. I believe it should compile a search for multiple literal strings into an Aho-Corasick automaton, which is structured very much like a trie.

[+] burntsushi|1 year ago|reply

regex crate author here. From a quick look at trie-hard, my guess is that trie-hard can do lookups on short strings much more quickly than the regex crate can do matches on short strings. The regex crate has a fair bit of overhead for selecting the right matching engine and applying other optimizations appropriate to general purpose matching.

But the aho-corasick crate, and especially the specific automatons, would be interesting to try. There's much less overhead there. But, there are some differences:

* The contiguous NFA in aho-corasick has a lot more going on inside its "next transition" routine: https://github.com/BurntSushi/aho-corasick/blob/cd400ad792d6... --- This is primarily because an Aho-Corasick automaton isn't just a trie. It also has failure transitions to support unanchored search. You can search without respecting failure transitions (and thus treat it as a trie), but the code is still there to deal with failure transitions. So that may have deleterious effects.

* The contiguous NFA also specializes its transition function depending on the size of the node. It would be interesting to see whether these techniques would be useful for trie-hard to play with. But there is definitely a benefit to keeping one simple and fast code path for all node types. (Less branching.)

* The aho-corasick crate does also expose a DFA directly: https://docs.rs/aho-corasick/latest/aho_corasick/dfa/struct.... --- In theory this should do less work per transition than trie-hard, and so it would be very interesting to measure its perf when compared to trie-hard. The main downside is that it's a DFA and can use up a lot more memory and take longer to build. It seems like Cloudflare cares less about that in favor of optimizing read perf though, so it's not clear why they didn't go this route.

* The aho-corasick crate doesn't let you store arbitrary values. You only get back a PatternID when a match occurs. So to get equivalent functionality as trie-hard, you'd need a separate map of PatternID->YourValue. Since that isn't stored with the data in the trie, this alone might penalize you quite a bit with poorer CPU cache performance.

* trie-hard seems to provide some customization options for its representation. i.e., Let's you use a `u8` to store its transitions versus a `u128` or `u256`. This might also help with cache usage for a small enough trie.

[+] cwilby|1 year ago|reply

> Internally, we call the request’s destination server its “origin”

Origin: The point at which something comes into existence or from which it derives or is derived.

How can the request's destination server be the origin, if it is the destination server?

[+] fowl2|1 year ago|reply

Depends on your perspective :)

"Origin" is a term from web browsers, from their point of view it refers to where a web page came from.

see https://developer.mozilla.org/en-US/docs/Web/Security/Same-o...

[+] tills13|1 year ago|reply

On the internet, the origin is the server sending the response to the user. I suppose you can look at it from the perspective of the owner of the server -- from their frame of reference, their journey _starts_ when they receive, process, and respond to the request.

Granted, computer scientists are infamously known for being terrible at naming things.

[+] toast0|1 year ago|reply

The origin is the origin of the content.

[+] sophacles|1 year ago|reply

So this trie uses a custom u256 type that is a wrapper around an array of 4 u64s. Is rust smart enough to vecorize it into AVX instructions for the bit twiddling?

What about for the other integer sizes that the trie supports?

[+] FridgeSeal|1 year ago|reply

A lot of Rusts' optimisations are from LLVM, so if that's smart enough to, then I assume it also benefits.

[+] Validark|1 year ago|reply

If the bit twiddling is independent in each u64, then probably. If it's treated as one giant bitstring, probably not.

[+] maciek12|1 year ago|reply

Am i missing something or is the big O notation in this article wrong? For example in "Sorted sets like BTreeSet use comparisons for searching, and that makes them logarithmic over key length O(log(L)), but they are also logarithmic in size too" how is the search done in logarithmic time? You could have a header that differs from an internal one only on the last character and then you have to read the whole thing. Also space-wise B-trees are linear, not O(nlogn). Additionally, at the end when talking about the trie, how do you achieve O(log(L)) for misses? Tries are not balanced, they do not halve the possible set on comparison (as I think the author states) and even if they did I don't see how that achieves the logarithmic time.

[+] vlovich123|1 year ago|reply

> You could have a header that differs from an internal one only on the last character and then you have to read the whole thing

When people talk about BigO they usually talk about average Big-o and worst/best case Big-o is explicitly called out if mentioned, so you can't just think in terms of the worst possible case. Regardless I think they made several mistakes here as you note correctly.

BTreeSet is log(N) in number of nodes compared. But for strings that comparison is O(L) since you at least have to compare it to the string you find. So I believe it's at best O(L log(N)) where L is the average string length and N is the number of nodes which is worse than the hash table which is just O(L). It's tricky though if most strings don't match your string. In that case I believe you end up degrading to an O(L) + O(log(N)).

Similarly, you are correct that they can't be logarithmic in size and must be linear since you have to store the data. Tries can be sublinear depending on the input since it's a form of compression as well.

Similarly you are right about trie complexity not being O(log(L)) for trie misses. I think what they're observing is a massive speedup because mismatches error out on the first character usually. But it wouldn't be logarithmic as there's unlikely to be a logarithmic relation between headers that mismatch and the universe of matching words.

[+] y04nn|1 year ago|reply

Why adding and then removing HTTP headers of a existing request for routing instead of adding a dedicated (custom) routing protocol above the HTTP stack? This would allow to just drop the added protocol on the outbound and be much more space efficient.

[+] intelVISA|1 year ago|reply

The fabled tech talent shortage, probably.

[+] MehdiHK|1 year ago|reply

Slightly off topic: I was a bit surprised nobody ported space-efficient marisa trie to Rust, or any other languages.

Anybody knows why?

https://github.com/s-yata/marisa-trie

[+] IncreasePosts|1 year ago|reply

Off topic: how does 60M requests a second stack up with other big players like google, Facebook, TikTok?

[+] alexchamberlain|1 year ago|reply

Jumping on the bandwagon of other ideas... I wonder if it would be quicker to filter pit the internal headers when you write the request to the network (buffer)? ie something like `request_headers.filter(not_is_internal).for_each(...)`; that way you don't have to remove anything from the data structure at all, but does require you to break down a layer of abstraction for performance benefits.

[+] vlovich123|1 year ago|reply

No, you'd just be smearing the work at best. The expensive part is determining set containment not updating the data structure. It doesn't matter if you do it in a loop and update the data structure or do that check before writing.

[+] wolf550e|1 year ago|reply

I don't know Rust. Can someone explain how this data structure stores whether a node is itself a valid word or whether it only leads to more nodes? For example the node "do" in their (“and”, “ant”, “dad”, “do”, & “dot”) example. I guess it's a discriminated union provided by Rust algebraic data types or something like that, but why not show the full bit pattern in memory?

[+] bhawks|1 year ago|reply

Alternatively at request ingress time you could take all original header names and save them, and at egress time you could use that list to drive which keys to emit?

Then you wouldn't need to maintain a list of what is internal and you wouldn't have to do any trie lookups.

173 comments