The very first project I worked on at Cloudflare, back in 2012, was a delta-compression-based service called Railgun. We installed software both on the customer's web server and on our end, and thus were able to manage shared dictionaries automatically (in this case, versions of pages sent over Railgun were used as the dictionaries). You can get incredible compression results this way.
Delta compression is a huge win for many applications, but it takes a careful hand to make it work well, and inevitably it gets deprecated as the engineers move on and bandwidth stops being a focus-- just like Railgun has been deprecated! https://blog.cloudflare.com/deprecating-railgun
Maybe the basic problem is with how hard it is to find engineers passionate about performance AND compression?
Even putting aside CORS because I don’t even want to think about how this plays well with requests to another (tracking?) domain, this still doesn’t seem worth it. The explicit use case seems to be that it basically tells the server when you last visited the site based on which dictionary you have and then it gives you the moral equivalent of a delta update. Except, most browsers are working hard to expire data of this kind for privacy reasons. What’s the lifetime of these dictionaries going to be? I can see it being ok if it’s like 1 day but if this outlives how long cookies are stored it’s a significant privacy problem. The user visits the site again and essentially a cookie gets sent to the server? The page says “don’t put user-specific data in the request” but like nobody is stopping a website from doing this.
I think fingerprinting using this is mostly like the more direct ways to fingerprint with the cache, and the defenses against one are the defenses against the other.
For the cross-site thing, cache partitioning is the defense. If the cache of facebook.com/file is independent for a.com and b.com, Facebook can't link the visits.
An attacker using the hash of a cached resource as a pseudo-cookie could previously use the content of the resource as the pseudo-cookie. The Use-As-Dictionary wildcard allows cleverer implementations, but it seems like you can fingerprint for the same time period/in the same circumstances as before. In both cases you might do your tracking by ignoring how you're supposed to be using the feature; as you note, no one's stopping you.
Before and after the compression feature, it is true anti-tracking laws, etc. should address tracking with persistent storage in general not only cookies, much as they need to handle localStorage or other hiding places for data. Also true that for a browser to robustly defend against linking two visits to the same domain (or limit the possibility of tracking to a certain time period, session, origin, etc.), caching is one of the things it has to limit.
I think if they get the expiry, partitioning, etc. right (or wrong) for stopping cache fingerprinting, they also get it right (or wrong) for this.
I was admittedly a fan of the original SDCH that didn't take off, figuring that inter-resource redundancy is a thing. It's a neat spin on it to use the compression algorithms' history windows instead of purpose-built diff tools, and to use the existing cache instead of a dictionary store off to the side. Seems easier to implement on both ends than the previous try. I could see this being helpful for quickly kicking off page load, maybe especially for non-SPAs and imperfectly optimized sites that repeat a not-tiny header across loads.
I think I’d feel better with a fixed set of dictionaries based on a corpus that gets updated every year to match new patterns of traffic and specifications. Even if it’s less efficient.
The dictionaries are partitioned by document and origin so a "tracking" domain will only be able to correlate requests within a given document origin and not across sites.
They are also cleared any time cookies are cleared and don't outlive what you can do today with cookies or Etags (and are using the most restrictive partitioning for that reason).
The original proposal for Zstd was to use a predefined, statistically generated dictionary. Mozilla rejected the proposal for that.
But there's a lot of great discussion of what Zstd can do, which is astoundingly flexible & powerful. There's discussion of dynamic adjustment of compression ratios, and discussion around shared dictionaries and their privacy implications. That Mozilla turned around, started supporting Zstd, and stamped a positive "worth prototyping" indicator on shared dictionaries is a good initial sign of approval to see! https://github.com/mozilla/standards-positions/issues/771
One of my main questions after reading this promising update is: how do you pick what to include when generating custom dictionaries? Another comment mentions that brotli has a standard dictionary it uses, and that's one possible starting place. But it feels like tools to build one's own custom dictionary would be ideal.
I agree with other comments concerned with fingerprinting, and it was my second thought reading through the article. But my first thought was how beneficial this could be for return visitors of a web app, and how it could similarly benefit related concerns, such as managing local caches for offline service workers.
True, for documents (as is another comment’s focus) this is perhaps overkill. Although even there, a benefit could be imagined for a large body of documents—it’s unclear whether this case is addressed, but it certainly could be with appropriate support across say preload links[0]. But if “the web is for documents, not apps” isn’t the proverbial hill you’re prepared to die on, this is a very compelling story for web apps.
I don’t know if it’s so compelling that it outweighs privacy implications, but I expect the other browser engines will have some good insights on that.
Even in the "documents" case of the web there can be pretty significant savings if users tend to visit more than one page and they share some amount of structure.
On the first entry to the site you trigger the load of an external dictionary that contains the common parts of the HTML across the site and then future document loads can be delta-compressed against the dictionary, effectively delivering just the page-specific bits.
You need to amortize the cost of loading the dictionary across the other page loads but it's usually pretty compelling once users visit more than 2-3 pages.
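The delta-compression idea above can be sketched with stdlib zlib (its `zdict` parameter plays the role of the shared dictionary; the "site" HTML here is entirely made up for illustration):

```python
import zlib

# Hypothetical "site-common" dictionary: boilerplate shared by every page.
site_dictionary = (
    b"<!doctype html><html><head><meta charset='utf-8'>"
    b"<link rel='stylesheet' href='/main.css'><script src='/app.js'></script>"
    b"</head><body><nav>Home | About | Contact</nav>"
)

# A page is the shared boilerplate plus page-specific content.
page = site_dictionary + b"<h1>Article 42</h1><p>Only this part is new.</p></body></html>"

# Baseline: compress the whole page with no dictionary.
plain = zlib.compress(page, 9)

# Delta: compress the same page against the shared dictionary.
co = zlib.compressobj(level=9, zdict=site_dictionary)
delta = co.compress(page) + co.flush()

# The receiver needs the same dictionary to reconstruct the page.
do = zlib.decompressobj(zdict=site_dictionary)
restored = do.decompress(delta) + do.flush()
```

The delta ends up far smaller than the plain compression because the boilerplate is served as back-references into the dictionary rather than being re-sent. Brotli and Zstd do the same thing with bigger windows and better entropy coding.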
This seems so ludicrous to me when all we really need is a way to share a resource reference across sites. Like “I need react 18.1 on this page, and the SHA should be abcdefghi”. If you don’t have it, I can give it to you from my server, or you can follow this link to a CDN, but the resource itself can be deduplicated based on the hashed contents instead of the URI. Why isn’t this a thing when basically everything uses frameworks nowadays? This shared dictionary seems like a more obtuse and roundabout way to solve the same problems. If there was caching by hashes, browsers could even preload the latest versions of new libraries before any sites even referenced them.
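Half of this already exists as Subresource Integrity, which pins a resource to its hash (though browsers don't use it as a cross-site cache key). A quick sketch of how the integrity token is computed — the bundle bytes and CDN URL are stand-ins, but the `sha384-` + base64 format is the real SRI convention:

```python
import base64
import hashlib

# Stand-in for the actual bundle bytes.
resource = b"/* imagine the react 18.1 bundle here */"

# SRI integrity value: "<alg>-" + base64 of the hash of the exact bytes.
integrity = "sha384-" + base64.b64encode(hashlib.sha384(resource).digest()).decode("ascii")

# Used as: <script src="https://cdn.example/react.js" integrity="..." crossorigin="anonymous">
```

The catch, as the next comment notes, is that a shared hash-keyed cache reintroduces cross-site tracking via timing, which is exactly why caches were partitioned per site.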
One potential issue is tracking. By sharing caches across websites it becomes possible to use timing attacks to track different users. This is why browsers are working to isolate caches per site: https://developer.chrome.com/blog/http-cache-partitioning
How would dictionaries that ship in the browser, pre-made with JS in mind, fare? I.e., instead of making a custom dictionary per resource I send to the user, I could say that "my scripts.js file uses the browser's built-in js-es2023-abc dictionary". So browsers would carry some dictionaries others could reuse.
What's the savings of that approach vs a gzipped file without any dictionary?
So Brotli already contains a dictionary that is trained on web traffic. I think the thing here is that Google wants to make sending YouTube 1.1 more efficient if you already have YouTube 1.0, but they can’t put YouTube 1.0 into the browser.
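One way to get a feel for the preset-dictionary savings on a small file, using zlib's `zdict` as a stand-in for a brotli-style built-in dictionary (the dictionary contents and script are made up):

```python
import zlib

# A made-up preset dictionary of common JS idioms, standing in for a
# hypothetical browser built-in like "js-es2023-abc" above.
js_dictionary = (
    b"document.getElementById('')"
    b".addEventListener('click',function(e){e.preventDefault();});"
    b"Object.defineProperty(exports,'__esModule',{value:true});"
)

script = b"document.getElementById('btn').addEventListener('click',function(e){e.preventDefault();});"

# Compress without any dictionary.
no_dict = zlib.compress(script, 9)

# Compress against the preset dictionary.
co = zlib.compressobj(level=9, zdict=js_dictionary)
with_dict = co.compress(script) + co.flush()

# Round-trip to confirm the receiver can reconstruct the script.
do = zlib.decompressobj(zdict=js_dictionary)
roundtrip = do.decompress(with_dict) + do.flush()
```

Small files benefit most: the dictionary supplies match history the file itself is too short to provide. For large files the built-in dictionary matters less, which is part of why delta-against-the-previous-version is the more compelling use.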
This seems like a possibly huge user/browser fingerprint. Yes, CORS has been taken into account, but for massive touch surface origins (Google, Facebook, doubleclick, etc) this certainly has concerning ramifications.
It’s also insanely complicated. All this effort, so many possible tuples of (shared dictionary, requested resource), none of which make sense to compress on-the-fly per-request, mean it’s specifically for the benefit of a select few sites.
When I saw the headline I thought that Chrome would ship with specific dictionaries (say one for js, one for css, etc) and advertise them and you could use the same server-side. But this is really convoluted.
Don't want to set session cookies? Just provide user-specific compression dictionaries and use them as your session id! After all, how is the user supposed to notice they got a different dictionary than everyone else?
> I thought that Chrome would ship with specific dictionaries (say one for js, one for css, etc) and advertise them and you could use the same server-side. But this is really convoluted.
More convoluted, but I expect using an old version as the source for the dictionary will yield significantly better results than a generic dictionary for that type of file.
Of course it doesn't help the first load, which might be more noticeable than subsequent loads when not every object has been modified. Perhaps having a standard dictionary for each type for the first request, and using a specific one when the old version is available, would give noticeable extra benefit for those first requests for minimal extra implementation effort.
> [...] mean it’s specifically for the benefit of a select few sites.
It does seem like the ones who benefit from this are large web applications that often ship incremental changes. Which, to be fair, are the ones that can use the most help.
This has the potential of moving the needle from "the app takes 10 seconds to load" to "it loads instantly" for these scenarios. Say what you want about the fact that maybe they should optimize their stuff better, this does give them an easy out.
That being said, yeah this is really convoluted and does seem like a big fingerprinting surface.
Doesn't the fact that resources send different data mean that SRI (Subresource Integrity) checks cannot be performed? As for fingerprinting, it would not be a problem since it is the same as with ETags.
The savings are nice in the best case (like in TFA: switching from version 1.3.4 to 1.3.6 of a lib or whatever) but that Base64 encoded hash is not compressible and so this line basically adds 60+ bytes to the request.
Though from the client side 60 bytes is likely not really noticeable¹ as a delay in the request send. Perhaps the server side, which sees many, many client requests, will see an uptick in incoming bandwidth used, but in most cases servers responding to HTTP(S) requests see a lot more outgoing traffic (response sizes are much larger than request sizes, on average), so they have enough incoming bandwidth "spare" that it is not going to be saturated to the point where this has a significant effect.
--
[1] if the link is slow enough that several lots of 60 bytes is going to have much effect² it likely also has such high latency that the difference is dwarfed by the existing delays.
[2] a spotty GPRS connection? is anything slower than that in common use anywhere?
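For scale: the 60+ bytes comes from the base64-encoded SHA-256 of the dictionary (always 44 characters) plus the header name and separators. A quick check — the `Available-Dictionary` header name is from the draft spec as I understand it, and the dictionary bytes here are arbitrary:

```python
import base64
import hashlib

# The hash advertised for a cached dictionary: base64(sha256(bytes)).
dictionary = b"whatever the dictionary contents happen to be"
digest_b64 = base64.b64encode(hashlib.sha256(dictionary).digest()).decode("ascii")

# Roughly what each request carries once a dictionary is available:
header = f"Available-Dictionary: :{digest_b64}:"
```

A SHA-256 digest is 32 bytes, base64 of 32 bytes is 44 characters, and with the header name the line lands just shy of 70 bytes per request.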
The only reason the client is even asking is that the server sent them a header saying it might be beneficial to do so.
And the client definitely has the dictionary data. The only thing it needs is for the server to accommodate the request after leading it down that path in the first place.
I can picture how it could happen, though. If you didn't realize the cost, you might not try to prevent misses. Or you could have a configuration error like sending the header but forgetting to generate pre-compressed data in your build.
If this is a significant issue, a server could collect stats and generate warnings about situations where it's not pulling its weight. Or even automatically disable it if hit rates are terrible.
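A sketch of that kind of accounting — all the names and thresholds here are invented, not from any server implementation:

```python
# Track how often advertised dictionaries actually result in a delta being
# served, and flag when the feature isn't pulling its weight.
class DictionaryStats:
    def __init__(self, min_requests: int = 1000, min_hit_rate: float = 0.05):
        self.hits = 0      # dictionary advertised AND a pre-compressed delta existed
        self.misses = 0    # dictionary advertised but nothing to serve against it
        self.min_requests = min_requests
        self.min_hit_rate = min_hit_rate

    def record(self, hit: bool) -> None:
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    def should_disable(self) -> bool:
        total = self.hits + self.misses
        if total < self.min_requests:
            return False  # not enough data yet
        return self.hits / total < self.min_hit_rate
```

The same counters could also drive the build-time warning case: a persistently zero hit rate usually means the header is being sent but the pre-compressed artifacts were never generated.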
This plus native web-components is an incredible advance for "the web".
Fingerprinting concerns aside (compression == timing attacks in the general case), the fact that it's nearly network-transparent and framework/webserver compatible is incredible!
What I really want: dictionaries derived from the standards and standard libraries (perhaps once a year or somesuch), which I'd use independently of build system gunk, and while it wouldn't be the tightest squeeze you can get, it would make my non-built assets get very close to built asset size for small to medium sized deployments.
Ah damn I thought this was going to be available to JavaScript. Would be amazing for one use case I have (an HTML page containing inline logs from a load of commands, many of which are substantially similar).
Maybe eventually (as a different spec). We've talked about wanting to support it in the DecompressionStream API or something similar at some point.
If you need it to be able to do compression though then it might be a harder sell since the browser doesn't ship with the compression code for zstd or brotli and would have to justify adding it.
There are wasm modules that do something similar, but having it baked into the browser could allow for further optimization than what's possible with wasm. https://github.com/bokuweb/zstd-wasm
I have no idea if it's possible but I wonder if a webgpu port could be made? Alternatively, for your use case, maybe you could try applying something like Basis Universal, a fast compression system for textures that seems to have some webgpu loaders... Maybe that could be bent to encoding/decoding text?
The part I'm missing is how these dictionaries are created. Can I use the homepage to create my dictionary, so all other pages that share HTML are compressed more efficiently? How?
For a delta update of one version of a resource to the next, the resource itself is the dictionary (i.e. JS file).
For stand-alone dictionaries, the brotli code on github has a dictionary_generator that you can use to generate a dictionary. You give it a dictionary size and a bunch of input files and it will generate one. I have a version of it hosted on https://use-as-dictionary.com/ that you can pass up to 100 URLs to and it will generate a dictionary for you (using the brotli tool).
How so? SDCH had sidechannel issues which is part of why it was unshipped. I don't know that someone won't find a way to attack it but the CORS requirement already requires that the dictionary and compressed-resource be readable and the dictionary has to be same-origin as the resources that it compresses.
Combined they mitigate the known dictionary-specific attack vectors.
With shared dictionaries you can compress everything down to under a byte.
Just put the to-be-compressed item into the shared dictionary, somehow distribute that to everyone, and then the compressed artifact consists of a reference to that item.
If the shared dictionary contains nothing else, it can just be a one-bit message whose meaning is "extract the one and only item out of the dictionary".
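Deflate-family codecs can't literally emit a one-bit message, but the degenerate effect is easy to demonstrate with zlib: make the dictionary the message itself and the output collapses to framing plus back-references, regardless of how incompressible the payload is on its own.

```python
import hashlib
import zlib

# ~8 KB of hash-chain bytes: looks random, so it barely compresses on its own.
chunks, seed = [], b"seed"
for _ in range(256):
    seed = hashlib.sha256(seed).digest()
    chunks.append(seed)
message = b"".join(chunks)  # 8192 bytes, within zlib's 32 KB window

plain = zlib.compress(message, 9)  # stays close to 8 KB

# Degenerate case: the dictionary *is* the message.
co = zlib.compressobj(level=9, zdict=message)
tiny = co.compress(message) + co.flush()  # collapses to a sliver
```

Which is the whole point of the joke: a shared dictionary just moves the information transfer somewhere else, so any honest accounting has to amortize the cost of shipping the dictionary.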
What stands out to me is that this creates another 'key' that the browser sends on every request which can be fingerprinted or tracked by the server.
I do not want my browser sending anything that looks like it could be used to uniquely identify me. Ever.
I want every request my browser makes to look like any other request made by another user's browser. I understand that this is what Google doesn't want but why can't they just be honest about it? Why come up with these elaborate lies?
Now to limit tracking exposure, in addition to running the AutoCookieDelete extension I'll have to go find some AutoDictionaryDelete extension to go with it. Boy am I glad the internet is getting better every day.
You're making three assertions, none backed by any evidence. That this is a tracking vector, that it's primarily intended to be a tracking vector, and that they're lying about their motivations.
But your reasoning fails already at the first step, since you just assumed malice rather than do any research. This is not a useful tracking vector. The storage is partitioned by the top window, and it is cleared when cookies are cleared. It's also not really a new tracking vector, it's pretty much the same as ETags.
jgrahamc|2 years ago
https://blog.cloudflare.com/cacheing-the-uncacheable-cloudfl...
I am glad to see that things have moved on from SDCH. Be interesting to see how this measures up in the real world.
Scaevolus|2 years ago
lynguist|2 years ago
saagarjha|2 years ago
twotwotwo|2 years ago
hinkley|2 years ago
charcircuit|2 years ago
https://source.chromium.org/chromium/chromium/src/+/main:ser...
frankjr|2 years ago
So it seems it should not get you anything you cannot already do with cookies.
https://github.com/WICG/compression-dictionary-transport/blo...
patrickmeenan|2 years ago
jauntywundrkind|2 years ago
patrickmeenan|2 years ago
I have a hosted version of it on https://use-as-dictionary.com/ to make it easier to experiment with.
eyelidlessness|2 years ago
0: https://developer.mozilla.org/en-US/docs/Web/HTML/Attributes...
patrickmeenan|2 years ago
lukevp|2 years ago
ColonelPhantom|2 years ago
EE84M3i|2 years ago
You can use the presence of an item in the cache to correlate visits between sites.
matsemann|2 years ago
saagarjha|2 years ago
ComputerGuru|2 years ago
wongarsu|2 years ago
dspillett|2 years ago
strongpigeon|2 years ago
unknown|2 years ago
[deleted]
falsandtru|2 years ago
https://developer.mozilla.org/en-US/docs/Web/Security/Subres...
charcircuit|2 years ago
TacticalCoder|2 years ago
Kinda ouch for when it's going to be a miss?
dspillett|2 years ago
sethev|2 years ago
nevir|2 years ago
adrianmonk|2 years ago
tarasglek|2 years ago
sillysaurusx|2 years ago
lozenge|2 years ago
ramses0|2 years ago
raggi|2 years ago
IshKebab|2 years ago
patrickmeenan|2 years ago
jauntywundrkind|2 years ago
netol|2 years ago
patrickmeenan|2 years ago
Sigliotio|2 years ago
Image compression, for example, or voice and video compression like what Nvidia does.
But I do like this implementation focusing on libs, why not?
jwally|2 years ago
skybrian|2 years ago
madeofpalk|2 years ago
Compressing JavaScript already gives you tonnes of benefits, but syntax-aware compression (modify js) gives you more.
Besides, this is a form of more efficient caching in that it only benefits subsequent visits.
kevingadd|2 years ago
tsss|2 years ago
patrickmeenan|2 years ago
judith48|2 years ago
[deleted]
hwbunny|2 years ago
[deleted]
callalex|2 years ago
[deleted]
kazinator|2 years ago
cuckatoo|2 years ago
jsnell|2 years ago
unknown|2 years ago
[deleted]