
Compression efficiency with shared dictionaries in Chrome

155 points| chamoda | 2 years ago |developer.chrome.com

74 comments


jgrahamc|2 years ago

The very first project I worked on at Cloudflare, back in 2012, was a delta compression-based service called Railgun. We installed software both on the customer's web server and on our end, and thus were able to automatically manage shared dictionaries (in this case, versions of pages sent over Railgun were automatically used as dictionaries). You definitely get incredible compression results.

https://blog.cloudflare.com/cacheing-the-uncacheable-cloudfl...

I am glad to see that things have moved on from SDCH. It'll be interesting to see how this measures up in the real world.
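
The delta effect is easy to reproduce with any compressor that accepts a preset dictionary. Here's a minimal sketch using Python's stdlib zlib `zdict` (the real feature uses Brotli/Zstandard, and the content here is invented): version 1 of a resource serves as the dictionary for delivering version 2.

```python
import hashlib, zlib

# Build "version 1" of a resource: ~11 KB of mostly unique content
# (unique content makes the benefit of the dictionary easy to see).
v1 = b"\n".join(b"line %d: %s" % (i, hashlib.sha256(b"%d" % i).hexdigest().encode())
                for i in range(150))
# "Version 2" is v1 with a small edit, like a minor release of a JS bundle.
v2 = v1.replace(b"line 42", b"line 42 (edited)")

plain = zlib.compress(v2, 9)                  # no dictionary
comp = zlib.compressobj(9, zdict=v1)          # previous version as preset dictionary
delta = comp.compress(v2) + comp.flush()

decomp = zlib.decompressobj(zdict=v1)         # the receiver already holds v1
assert decomp.decompress(delta) == v2
# delta is far smaller than plain: it mostly encodes back-references into v1.
```

The mechanics are the same as Railgun's: whoever already holds the previous version only needs the page-specific bits.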

Scaevolus|2 years ago

Delta compression is a huge win for many applications, but it takes a careful hand to make it work well, and inevitably it gets deprecated as the engineers move on and bandwidth stops being a focus-- just like Railgun has been deprecated! https://blog.cloudflare.com/deprecating-railgun

Maybe the basic problem is with how hard it is to find engineers passionate about performance AND compression?

lynguist|2 years ago

I might be naive but isn’t that what rsync is doing?

saagarjha|2 years ago

Even putting aside CORS because I don’t even want to think about how this plays well with requests to another (tracking?) domain, this still doesn’t seem worth it. The explicit use case seems to be that it basically tells the server when you last visited the site based on which dictionary you have and then it gives you the moral equivalent of a delta update. Except, most browsers are working hard to expire data of this kind for privacy reasons. What’s the lifetime of these dictionaries going to be? I can see it being ok if it’s like 1 day but if this outlives how long cookies are stored it’s a significant privacy problem. The user visits the site again and essentially a cookie gets sent to the server? The page says “don’t put user-specific data in the request” but like nobody is stopping a website from doing this.

twotwotwo|2 years ago

I think fingerprinting using this is mostly like the more direct ways to fingerprint with the cache, and the defenses against one are the defenses against the other.

For the cross-site thing, cache partitioning is the defense. If the cache of facebook.com/file is independent for a.com and b.com, Facebook can't link the visits.

An attacker using the hash of a cached resource as a pseudo-cookie could previously use the content of the resource as the pseudo-cookie. The Use-As-Dictionary wildcard allows cleverer implementations, but it seems like you can fingerprint for the same time period/in the same circumstances as before. In both cases you might do your tracking by ignoring how you're supposed to be using the feature; as you note, no one's stopping you.

Before and after the compression feature, it's true that anti-tracking laws, etc. should address tracking via persistent storage in general, not only cookies, much as they need to handle localStorage and other hiding places for data. It's also true that for a browser to robustly defend against linking two visits to the same domain (or limit the possibility of tracking to a certain time period, session, origin, etc.), caching is one of the things it has to limit.

I think if they get the expiry, partitioning, etc. right (or wrong) for stopping cache fingerprinting, they also get it right (or wrong) for this.

I was admittedly a fan of the original SDCH that didn't take off, figuring that inter-resource redundancy is a thing. It's a neat spin on it to use the compression algos' history windows instead of purpose-built diff tools, and to use the existing cache instead of a dictionary store off to the side. Seems easier to implement on both ends compared to the previous try. I could see this being helpful for quickly kicking off page load, maybe especially for non-SPAs and imperfectly optimized sites that repeat a not-tiny header across loads.

hinkley|2 years ago

I think I’d feel better with a fixed set of dictionaries based on a corpus that gets updated every year to match new patterns of traffic and specifications. Even if it’s less efficient.

patrickmeenan|2 years ago

The dictionaries are partitioned by document and origin so a "tracking" domain will only be able to correlate requests within a given document origin and not across sites.

They are also cleared any time cookies are cleared and don't outlive what you can do today with cookies or ETags (and are using the most restrictive partitioning for that reason).

jauntywundrkind|2 years ago

The Request For Position on Mozilla Zstd Support (2018) has a ton of interesting discussion on dictionaries. https://github.com/mozilla/standards-positions/issues/105

The original proposal for Zstd was to use a predefined, statically generated dictionary. Mozilla rejected the proposal for that.

But there's a lot of great discussion on what Zstd can do, which is astoundingly flexible & powerful. There's discussion on dynamic adjustment of compression ratios, and discussion around shared dictionaries and their privacy implications. That Mozilla turned around & started supporting Zstd, & has since marked shared dictionaries as worth prototyping, is a good initial stamp of approval to see! https://github.com/mozilla/standards-positions/issues/771

One of my main questions after reading this promising update is: how do you pick what to include when generating custom dictionaries? Another comment mentions that brotli has a standard dictionary it uses, and that's some kind of possible starting place. But it feels like tools to build one's own custom dictionary would be ideal.

eyelidlessness|2 years ago

I agree with other comments concerned with fingerprinting, and it was my second thought reading through the article. But my first thought was how beneficial this could be for return visitors of a web app, and how it could similarly benefit related concerns, such as managing local caches for offline service workers.

True, for documents (as is another comment’s focus) this is perhaps overkill. Although even there, a benefit could be imagined for a large body of documents—it’s unclear whether this case is addressed, but it certainly could be with appropriate support across say preload links[0]. But if “the web is for documents, not apps” isn’t the proverbial hill you’re prepared to die on, this is a very compelling story for web apps.

I don’t know if it’s so compelling that it outweighs privacy implications, but I expect the other browser engines will have some good insights on that.

0: https://developer.mozilla.org/en-US/docs/Web/HTML/Attributes...

patrickmeenan|2 years ago

Even in the "documents" case of the web there can be pretty significant savings if users tend to visit more than one page and they share some amount of structure.

On the first entry to the site you trigger the load of an external dictionary that contains the common parts of the HTML across the site and then future document loads can be delta-compressed against the dictionary, effectively delivering just the page-specific bits.

You need to amortize the cost of loading the dictionary across the other page loads but it's usually pretty compelling once users visit more than 2-3 pages.
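
This amortization pattern can be sketched with stdlib zlib and made-up page content (the real feature would use Brotli or Zstandard with a purpose-built dictionary): the shared template is fetched once as the dictionary, and each subsequent page then compresses down to roughly its unique part.

```python
import zlib

# Hypothetical external dictionary: boilerplate shared by every page on a site.
template = (b"<html><head><meta charset=utf-8>"
            b"<meta name=viewport content='width=device-width,initial-scale=1'>"
            b"<link rel=stylesheet href=/assets/site.css>"
            b"<script src=/assets/app.js defer></script>"
            b"<title>Example Site</title></head><body>"
            b"<nav><a href=/>Home</a><a href=/docs>Docs</a>"
            b"<a href=/blog>Blog</a><a href=/about>About</a></nav><main>")

for i in range(3):
    page = template + b"<h1>Article %d</h1><p>Unique body text.</p></main></body></html>" % i

    plain = zlib.compress(page, 9)                 # no dictionary
    comp = zlib.compressobj(9, zdict=template)     # template as preset dictionary
    with_dict = comp.compress(page) + comp.flush()

    check = zlib.decompressobj(zdict=template)     # client holds the dictionary
    assert check.decompress(page and with_dict) == page
    # with_dict carries little more than the page-specific bits.
```

The one-time dictionary download is the cost; every later page load pays only for its unique content, which is why the break-even arrives after a handful of page views.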

lukevp|2 years ago

This seems so ludicrous to me when all we really need is a way to share a resource reference across sites. Like “I need react 18.1 on this page, and the SHA should be abcdefghi “. If you don’t have it, I can give it to you from my server, or you can follow this link to a CDN, but the resource itself can be deduplicated based on the hashed contents instead of the URI. Why isn’t this a thing when basically everything uses frameworks nowadays? This shared dictionary seems like a more obtuse and roundabout way to solve these. If there was caching by hashes, browsers could even preload the latest versions of new libraries before any sites even referenced them.

EE84M3i|2 years ago

Privacy issues.

You can use the presence of an item in the cache to correlate visits between sites.

matsemann|2 years ago

How would dictionaries that are pre-made with JS in mind and shipped in the browser fare? I.e., instead of making a custom dictionary per resource I send to the user, I could say that "my scripts.js file uses the browser's built-in js-es2023-abc dictionary". So browsers would have some dictionaries others could reuse.

What's the savings of that approach vs a gzipped file without any dictionary?
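
As a rough illustration with stdlib zlib and an invented token dictionary (Brotli's real built-in dictionary is much larger and trained on actual web traffic): the benefit shows up most on small files, where there's little internal redundancy for a general-purpose compressor to exploit.

```python
import zlib

# Invented mini-dictionary of common JavaScript idioms (purely illustrative;
# a real browser-shipped dictionary would be curated from a large corpus).
js_dict = (b"document.getElementById('"
           b"').addEventListener('click',function(e){e.preventDefault();"
           b"window.location.href='"
           b"querySelectorAll console.log(JSON.stringify(")

script = (b"document.getElementById('btn').addEventListener('click',"
          b"function(e){e.preventDefault();window.location.href='/next';});")

plain = zlib.compress(script, 9)               # gzip-style, no dictionary
comp = zlib.compressobj(9, zdict=js_dict)
with_dict = comp.compress(script) + comp.flush()

check = zlib.decompressobj(zdict=js_dict)
assert check.decompress(with_dict) == script
# For a short script, the dictionary version is noticeably smaller: most of
# the text becomes back-references into js_dict instead of literal bytes.
```

For large bundles the gap narrows, because the file's own history window starts doing the work a generic dictionary would.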

saagarjha|2 years ago

So Brotli already contains a dictionary that is trained on web traffic. I think the thing here is that Google wants to make sending YouTube 1.1 more efficient if you already have YouTube 1.0, but they can’t put YouTube 1.0 into the browser.

ComputerGuru|2 years ago

This seems like a possibly huge user/browser fingerprint. Yes, CORS has been taken into account, but for massive touch surface origins (Google, Facebook, doubleclick, etc) this certainly has concerning ramifications.

It’s also insanely complicated. All this effort, so many possible tuples of (shared dictionary, requested resource), none of which make sense to compress on-the-fly per-request, mean it’s specifically for the benefit of a select few sites.

When I saw the headline I thought that Chrome would ship with specific dictionaries (say one for js, one for css, etc) and advertise them and you could use the same server-side. But this is really convoluted.

wongarsu|2 years ago

Don't want to set session cookies? Just provide user-specific compression dictionaries and use them as your session id! After all, how is the user supposed to notice they got a different dictionary than everyone else?

dspillett|2 years ago

> I thought that Chrome would ship with specific dictionaries (say one for js, one for css, etc) and advertise them and you could use the same server-side. But this is really convoluted.

More convoluted, but I expect using an old version as the source for the dictionary will yield significantly better results than a generic dictionary for that type of file.

Of course it doesn't help the first load, which might be more noticeable than subsequent loads when not every object has been modified. Perhaps having a standard dictionary for each type for the first request, and using a specific one when the old version is available, would give noticeable extra benefit for those first requests for minimal extra implementation effort.

strongpigeon|2 years ago

> [...] mean it’s specifically for the benefit of a select few sites.

It does seem like the ones who benefit from this are large web applications that often ship incremental changes. Which, to be fair, are the ones that can use the most help.

This has the potential of moving the needle from "the app takes 10 seconds to load" to "it loads instantly" for these scenarios. Say what you want about the fact that maybe they should optimize their stuff better; this does give them an easy out.

That being said, yeah this is really convoluted and does seem like a big fingerprinting surface.

TacticalCoder|2 years ago

> Available-Dictionary: :pZGm1Av0IEBKARczz7exkNYsZb8LzaMrV7J32a2fFG4=:

The savings are nice in the best case (like in TFA: switching from version 1.3.4 to 1.3.6 of a lib or whatever) but that Base64 encoded hash is not compressible and so this line basically adds 60+ bytes to the request.

Kinda ouch for when it's going to be a miss?
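
The arithmetic checks out. A quick sketch of the header's on-the-wire size (the value is the base64-encoded SHA-256 of the dictionary, wrapped in colons as a structured-field byte sequence; the dictionary contents here are made up):

```python
import base64, hashlib

# Available-Dictionary carries the SHA-256 hash of the stored dictionary.
digest = hashlib.sha256(b"contents of the cached dictionary").digest()
value = b":" + base64.b64encode(digest) + b":"
header_line = b"Available-Dictionary: " + value + b"\r\n"

print(len(value))        # 46 bytes: 44 base64 chars plus the two colons
print(len(header_line))  # 70 bytes before any header compression
```

HPACK/QPACK can Huffman-code the field name and value somewhat, but the hash itself is high-entropy, so most of those bytes survive on every request that advertises a dictionary.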

dspillett|2 years ago

Maybe.

Though from the client side 60 bytes is likely not really noticeable¹ as a delay in the request send. Perhaps the server side, which sees many, many client requests, will see an uptick in incoming bandwidth used, but in most cases servers responding to HTTP(S) requests see a lot more outgoing traffic (response sizes are much larger than request sizes, on average), so they have enough incoming bandwidth "spare" that it is not going to be saturated to the point where this has a significant effect.

--

[1] if the link is slow enough that several lots of 60 bytes is going to have much effect² it likely also has such high latency that the difference is dwarfed by the existing delays.

[2] a spotty GPRS connection? is anything slower than that in common use anywhere?

sethev|2 years ago

If 60 bytes per request is a material overhead, then your workload is unlikely to benefit from general purpose compression of any kind.

nevir|2 years ago

What are the chances that the ~60 bytes are going to push the request over the frame size and end up splitting into another packet?

adrianmonk|2 years ago

Aren't misses pretty preventable?

The only reason the client is even asking is that the server sent them a header saying it might be beneficial to do so.

And the client definitely has the dictionary data. The only thing it needs is for the server to accommodate the request after leading it down that path in the first place.

I can picture how it could happen, though. If you didn't realize the cost, you might not try to prevent misses. Or you could have a configuration error like sending the header but forgetting to generate pre-compressed data in your build.

If this is a significant issue, a server could collect stats and generate warnings about situations where it's not pulling its weight. Or even automatically disable it if hit rates are terrible.

tarasglek|2 years ago

The Chrome team usually trials changes like this with extensive A/B testing via telemetry. Got to be a large overall win even with this.

sillysaurusx|2 years ago

Clearly we’ll need to use a shared dictionary to compress this.

lozenge|2 years ago

It might be compressible. HTTP/3 includes compression of request headers. Base64 doesn't use the top two bits in a byte so it's compressible.
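
That's right as far as it goes: base64 uses only 64 of the 256 possible byte values, so a generic compressor can pack each character into roughly six bits, about a 25% saving. A quick check with stdlib zlib (illustrative; HTTP/3 actually applies QPACK's static Huffman coding to header values):

```python
import base64, random, zlib

random.seed(0)
raw = bytes(random.randrange(256) for _ in range(3072))  # incompressible bytes
b64 = base64.b64encode(raw)                              # 4096 base64 chars

compressed = zlib.compress(b64, 9)
# compressed is roughly three-quarters the size of b64: the payload's entropy
# is unchanged, but the 6-bits-per-character encoding overhead is recovered.
```

A single 44-character hash value is too short for a deflate stream to help much on its own, though; in practice the win comes from the per-field Huffman coding that HPACK/QPACK already apply.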

ramses0|2 years ago

This plus native web-components is an incredible advance for "the web".

Fingerprinting concerns aside (compression == timing attacks in the general case), the fact that it's nearly network-transparent and framework/webserver compatible is incredible!

raggi|2 years ago

What I really want: dictionaries derived from the standards and standard libraries (perhaps once a year or somesuch), which I'd use independently of build system gunk, and while it wouldn't be the tightest squeeze you can get, it would make my non-built assets get very close to built asset size for small to medium sized deployments.

IshKebab|2 years ago

Ah damn I thought this was going to be available to JavaScript. Would be amazing for one use case I have (an HTML page containing inline logs from a load of commands, many of which are substantially similar).

patrickmeenan|2 years ago

Maybe eventually (as a different spec). We've talked about wanting to support it in the DecompressionStream API or something similar at some point.

If you need it to be able to do compression though then it might be a harder sell since the browser doesn't ship with the compression code for zstd or brotli and would have to justify adding it.

jauntywundrkind|2 years ago

That would be an excellent web standard!!

There's wasm modules that do similar, but having it baked into the browser could allow for further optimization than what's possible with wasm. https://github.com/bokuweb/zstd-wasm

I have no idea if it's possible but I wonder if a webgpu port could be made? Alternatively, for your use case, maybe you could try applying something like Basis Universal, a fast compression system for textures, that it seems there are some webgpu loaders for... Maybe that could be bent to encoding/decoding text?

netol|2 years ago

The part I'm missing is how these dictionaries are created. Can I use the homepage to create my dictionary, so all other pages that share HTML are compressed more efficiently? How?

patrickmeenan|2 years ago

For a delta update of one version of a resource to the next, the resource itself is the dictionary (i.e. JS file).

For stand-alone dictionaries, the brotli code on github has a dictionary_generator that you can use to generate a dictionary. You give it a dictionary size and a bunch of input files and it will generate one. I have a version of it hosted on https://use-as-dictionary.com/ that you can pass up to 100 URLs to and it will generate a dictionary for you (using the brotli tool).
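
The idea behind such a generator can be sketched in a toy form with stdlib tools (illustrative only; brotli's dictionary_generator does real suffix-based training, and all the page content below is invented): collect substrings that recur across sample pages and concatenate them into a dictionary.

```python
import collections, zlib

# Invented sample pages sharing a common frame around a unique product name.
frame_a = b"<html><head><title>Shop</title></head><body><div class=product-card><h2>"
frame_b = (b"</h2><p class=description>In stock, ships tomorrow, free returns.</p>"
           b"<button class=buy>Add to cart</button></div></body></html>")
samples = [frame_a + name + frame_b
           for name in (b"Alpha widget", b"Beta gadget", b"Gamma gizmo")]

counts = collections.Counter()
for s in samples:
    for n in (8, 16):                          # substring lengths to sample
        for i in range(len(s) - n + 1):
            counts[s[i:i + n]] += 1

# Keep substrings seen across the samples; cap at zlib's 32 KiB window.
common = [sub for sub, c in counts.most_common() if c >= len(samples)]
dictionary = b"".join(common)[:32768]

# A new page that shares the boilerplate compresses far better with it.
target = frame_a + b"Delta doodad" + frame_b
comp = zlib.compressobj(9, zdict=dictionary)
out = comp.compress(target) + comp.flush()
assert len(out) < len(zlib.compress(target, 9))
```

Real tools weight substrings by how much compressed output they save rather than raw frequency, but the principle is the same: the dictionary is a distillation of what your pages have in common.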

Sigliotio|2 years ago

That should be used together with ML models.

Image compression for example, or voice and video compression like what Nvidia does.

But I do like this implementation focusing on libs, why not?

jwally|2 years ago

Dumb question, but with respect to fingerprinting - how is this any worse than cookies, service workers, or localstorage?

skybrian|2 years ago

I wonder if this would be a good alternative to minimizing JavaScript and having separate sourcemaps?

madeofpalk|2 years ago

Not really.

Compressing JavaScript already gives you tonnes of benefits, but syntax-aware compression (modifying the JS) gives you more.

Besides, this is a form of more efficient caching, in that it only benefits subsequent visits.

kevingadd|2 years ago

JS minification will probably never die, because it makes parsing meaningfully faster.

tsss|2 years ago

This _screams_ sidechannel attack.

patrickmeenan|2 years ago

How so? SDCH had sidechannel issues, which is part of why it was unshipped. I don't know that someone won't find a way to attack it, but CORS already requires that the dictionary and compressed resource be readable, and the dictionary has to be same-origin with the resources that it compresses.

Combined they mitigate the known dictionary-specific attack vectors.

kazinator|2 years ago

With shared dictionaries you can compress everything down to under a byte.

Just put the to-be-compressed item into the shared dictionary, somehow distribute that to everyone, and then the compressed artifact consists of a reference to that item.

If the shared dictionary contains nothing else, it can just be a one-bit message whose meaning is "extract the one and only item out of the dictionary".
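
This degenerate case is easy to demonstrate with zlib's preset dictionaries: if the dictionary already contains the exact payload, the "compressed" message is just a few back-references plus fixed framing.

```python
import hashlib, zlib

# 4 KB of effectively incompressible payload...
payload = b"".join(hashlib.sha256(b"%d" % i).digest() for i in range(128))

# ...that both sides somehow already share as the dictionary.
comp = zlib.compressobj(9, zdict=payload)
msg = comp.compress(payload) + comp.flush()

decomp = zlib.decompressobj(zdict=payload)
assert decomp.decompress(msg) == payload
# msg is a few dozen bytes: back-references into the dictionary plus framing.
```

Deflate caps matches at 258 bytes and the window at 32 KiB, so it can't literally reach one bit, but the point stands: all the information has simply moved into distributing the dictionary.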

cuckatoo|2 years ago

What stands out to me is that this creates another 'key' that the browser sends on every request which can be fingerprinted or tracked by the server.

I do not want my browser sending anything that looks like it could be used to uniquely identify me. Ever.

I want every request my browser makes to look like any other request made by another user's browser. I understand that this is what Google doesn't want but why can't they just be honest about it? Why come up with these elaborate lies?

Now to limit tracking exposure, in addition to running the AutoCookieDelete extension I'll have to go find some AutoDictionaryDelete extension to go with it. Boy am I glad the internet is getting better every day.

jsnell|2 years ago

The obvious answer is that they are not lying.

You're making three assertions, none backed by any evidence. That this is a tracking vector, that it's primarily intended to be a tracking vector, and that they're lying about their motivations.

But your reasoning fails already at the first step, since you just assumed malice rather than do any research. This is not a useful tracking vector. The storage is partitioned by the top window, and it is cleared when cookies are cleared. It's also not really a new tracking vector, it's pretty much the same as ETags.