From YouTube: WebPerfWG call - 2023 03 30 - Compression dictionaries
All right, there we go. Okay, so: compression dictionaries. As you all have mentioned, this has been a while in the making, and a lot of it is thanks to Yoav. I think he's been through every iteration of this as it's been in the making, including doing the original explainer for the new version of it.
Yeah, I feel a lot like Don Quixote chasing the windmills, for all of the different versions of delta compression. The original version in Chrome, SDCH ("sandwich"), was mostly used for dynamic resources, and it was in the critical fetch path. You created a dictionary, which was basically a VCDIFF; the server compressed a resource and said, hey, this resource uses this dictionary; and the browser, if it didn't have it, had to go fetch the dictionary out of band. And there were deployment and code issues with it beyond that.
There were also challenges around privacy, and that's sort of what killed those first two iterations. (It actually worked out well: generative AI does about as well with windmills as it does with fingers, so those slides are AI-generated windmills, which is kind of how it feels like the original attempts at compression dictionaries went.) Then there was the HTTP version.
We tried, in HTTP/2, cross-stream shared dictionaries, to get at the transport level some of the benefit of bundles of JS for a bunch of separate resources. Both of those ran into all sorts of privacy concerns when CRIME and BREACH and all of the side-channel timing attacks became a thing, and so they were shelved for a while. In the last year or so, Yoav came up with an explainer that we're running with now, which we think fits the web model much better and doesn't have the security baggage that comes with it. At its most fundamental, there are sort of two mental models you can think of for the shared dictionaries.
There are static resources that don't change a whole lot and can be pre-compressed; for that model, we're looking at using one version of a resource as the dictionary for the next version of the resource when you do an upgrade. And then there are dynamic resources, for which you use a side-loaded, custom-built dictionary, kind of like SDCH used for dynamic resources. But they both have effectively the same serving model, and it's very much: first, the browser (or whatever it is) fetches a resource.
In this case, it's the YouTube JavaScript player, which is 10 megabytes of JavaScript shipped multiple times per week as it gets upgraded; Brotli compression gets it down to around 1.8 megabytes.
The response for the resource that you want to use as a dictionary at some point in the future carries a Use-As-Dictionary header that specifies the path, with some level of wildcard support to allow version numbers in the URL path (for builds of a JS resource, for example). In this case, the player JS has a wildcard after it, so the version-number hash is immaterial, and this player.js can be used as a dictionary for any future fetches of any player.js files on the same origin.
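A minimal sketch of how that path matching might work, treating the Use-As-Dictionary match value as a simple `*` wildcard pattern. The exact matching rules and header syntax were still placeholders at this point, so this is illustrative only:

```python
import re

def compile_match(pattern: str) -> re.Pattern:
    """Turn a '*' wildcard pattern (as in the explainer's match value)
    into a regex; '*' matches any run of characters, including '/'."""
    return re.compile("^" + re.escape(pattern).replace(r"\*", ".*") + "$")

# A dictionary whose match pattern covers any versioned build of the player:
player_match = compile_match("/js/player.*.js")
assert player_match.match("/js/player.v123.js")
assert player_match.match("/js/player.1a2b3c.js")
assert not player_match.match("/css/site.css")
```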
At some point later, the user comes back, the player has been updated, and the browser does a fetch for player.js version 2 and says: hey, by the way, I have this dictionary, and it sends the SHA-256 hash of the previous version of the player that it already has in its cache. The server recognizes the previous version and has a delta-compressed version of the new player that is much, much smaller than the plain Brotli-compressed one.
It's only 180 KB, and it sends down the delta-compressed version; the browser can then use the old version in its cache as a dictionary to decompress the new version. Effectively, what this gives you is delta compression, where the server only sends down the bits of the player that changed versus what the browser has in its cache.
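The hash advertisement is straightforward to sketch. Assuming the browser advertises the SHA-256 of the raw dictionary bytes it holds (header names and details were placeholders at the time), the server-side check might look like:

```python
import hashlib

def dictionary_hash(dictionary_bytes: bytes) -> str:
    # Hash of the raw dictionary bytes; the browser would advertise this
    # for the cached copy it holds, and the server compares it against the
    # dictionaries it compressed new releases with.
    return hashlib.sha256(dictionary_bytes).hexdigest()

old_player = b"console.log('player v1');"  # stand-in for the cached v1 player
advertised = dictionary_hash(old_player)

# Server side: map "hash of old version" -> "pre-built delta of new version".
known_deltas = {dictionary_hash(old_player): "player.v2.js.sbr"}
assert known_deltas.get(advertised) == "player.v2.js.sbr"
```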
It's a little more overhead than pure delta compression, but it effectively comes out to that, and so the size of the delivery is effectively how much has changed since the user last visited. It's worth noting that, as far as the browser and everything else above the networking stack is concerned, it's still a 10-megabyte JavaScript file that you're parsing and running, so it doesn't make the player any lighter, but it does significantly reduce the delivery size.
In the dynamic case, it looks very similar, though with one difference. Here we have a search use case, but it's any HTML page, or any resource for that matter, that's not doing version-to-version upgrades. A fresh user comes and visits the page and gets the HTML like they normally would. In that HTML, or in a response header, there's a link tag that says rel=dictionary and points to a dictionary URL to use; it's basically a way of side-loading a dictionary instead of having one as a static resource. At some point in the future, when the browser is idle, it downloads that dictionary, and that dictionary will have the same response headers that the static resource has: the Use-As-Dictionary header, with a path for the same-origin requests that this dictionary is appropriate to be used for.
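As a rough illustration of the two responses involved; the URL, the header value syntax, and the cache lifetime here are made-up placeholders for illustration, not the final spec:

```python
# Response to the initial HTML navigation: tell the browser it can
# side-load a dictionary for this class of pages when it is idle.
html_response_headers = {
    "Content-Type": "text/html",
    "Link": '</dictionaries/search.dat>; rel="dictionary"',
}

# Response when the browser later fetches /dictionaries/search.dat:
# the dictionary declares which same-origin paths it applies to.
dictionary_response_headers = {
    "Content-Type": "application/octet-stream",
    "Use-As-Dictionary": 'match="/search*"',
    "Cache-Control": "max-age=2592000",  # e.g. 30 days
}

assert 'rel="dictionary"' in html_response_headers["Link"]
```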
So in this case, the dictionary response says: hey, I apply to any search query pages. We can have a custom dictionary that was built for the search page templates, or you can have a mobile one and a desktop one. If you're an e-commerce site, you can have your landing page, product listing page, and search results page templates all have separate dictionaries, for example, that can apply, with effectively the custom boilerplate (headers, SEO tags, footers, that kind of stuff) in the custom dictionary. At some future point, the browser makes a request for a URL that matches the path of a dictionary that it has; in this case, it's the search path that has the dict.dat dictionary as a side-channel dictionary for it, and it advertises: hey, available dictionary, here's the hash of the dictionary. The server can then delta-compress the dynamic HTML at serve time with the static dictionary that was downloaded out of band. So the actual dictionary matching rules, serving rules, compression logic, all of that is the same. The main difference between dynamic and static is that, in the dynamic use case, it's important when you generate these dictionaries that they not include private data. I think I've got it on a slide, but I'll talk about the privacy protections in a minute.
As far as the results that we've been seeing: this is all mostly lab testing, and in the examples page in the explainer there's also a link to a tool.
There are a bunch of reasons why this works better than previous attempts. Probably the main one is that it's no longer hidden at the transport level and done automatically; it's collaboration, working with the pages and the serving architecture of the web.
It's effectively cache-enhanced compression: things that you have in your cache, partitioned as your cache is normally partitioned, with all the security protections and everything else, can be used as effective delta compression for future fetches. We're not targeting to do anything about the first load; this is all about future engagement, so version upgrades of scripts or bundles or Wasm (huge Wasm app updates), or multi-page visits.
That is, when you're browsing an e-commerce site, where you can amortize the dictionary cost over a bunch of different page loads. The dictionaries themselves are managed similarly to cached resources and cookies, because they can be abused as identifiers, and so they're stored, they expire, and they're cleared whenever caches or cookies are cleared. Because of the side-channel attacks, the dictionaries can only be used for CORS-readable responses: the dictionaries themselves need to be CORS-readable by the page, and the resources they compress need to be CORS-readable by the page.
It's a little bit of a dance around the privacy problem. It's basically saying: yes, you could attack this with timing attacks if you wanted to and get at the contents, so we're going to scope it to only apply to things that you could normally read anyway, and that wipes away the privacy concerns for the most part. That solves just about every use case that we can think of and should be the 90% case, and it's content-independent in theory.
We expect to see a lot of the gains there. As far as deployment, the static build version (the static file version) is, I think, probably going to be the easiest path for initial ramp-up, because you can do it as a build step.
As part of your build, or as a post-build step, you store the artifacts of the build (the JS, the CSS, whatever) into a version store somewhere that has all of, or some rolling number of, the previous builds. Then, after the build is done, you can compress each of the current artifacts against the previous versions of that file, so you have pre-compressed dictionary versions of the assets ready for serve time.
At serve time, you basically just say: hey, use these resources as a dictionary for future versions of this resource on the same path. When a request comes in that has a sec-available-dictionary request header, you can literally just look on disk and see if you already have an artifact that was delta-compressed with that available dictionary. If not, you just respond with the normal Brotli-encoded full resource, as you normally would.
But if you do, then it's literally just a pick-a-file operation, and you send down the tiny, delta-compressed version of the file.
The dynamic case is still hugely valuable, and I do expect we're going to see a lot of uptake, especially where there's a lot of value in that use case; but the static case is kind of a no-brainer for almost anyone to add to the build process. As far as CDN requirements: CDNs can actively participate, but there's also a risk of CDNs breaking support for it just by virtue of how they normally work, and the biggest risk there is the content encoding.
CDNs tend to like to be the place where the content encoding is done (Brotli or gzip or whatever), and they operate on a lot of the content that flows through. For this to work, they need to either recognize shared-Brotli content, or at least pass the Accept-Encoding sbr and Content-Encoding sbr responses through and cache resources separately, using Vary on sec-available-dictionary, so that delta-compressed versions of each resource are stored separately in the cache and responded to appropriately.
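That Vary behavior amounts to including the advertised dictionary in the cache key. A toy sketch, assuming the sec-available-dictionary header name from the talk:

```python
def cache_key(url: str, request_headers: dict, vary_headers: list) -> tuple:
    """Toy CDN cache key honoring Vary: responses that vary on
    sec-available-dictionary get stored once per advertised dictionary,
    so delta-encoded and full responses never collide in the cache."""
    varied_values = tuple(request_headers.get(h.lower(), "") for h in vary_headers)
    return (url, varied_values)

vary = ["sec-available-dictionary"]
key_plain = cache_key("/app.js", {}, vary)
key_delta = cache_key("/app.js", {"sec-available-dictionary": "abc123"}, vary)
assert key_plain != key_delta
```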
Instead of you, the site owner, doing the builds statically at build time, in theory you could just add the appropriate Use-As-Dictionary headers to the resources on the server side, and then the middle boxes can observe those Use-As-Dictionary headers and store those resources, keyed by their SHA hashes, in a key-value store. When the browser requests a resource and says sec-available-dictionary, the middle box could do a batch process.
So it fits fairly well with the offline compression model of CDNs today, just with a little extra step of watching the headers as they flow back and forth and managing dictionaries in a key-value store. And there are sort of future ideas where CDNs, being more active, could also provide UI mechanisms, or some mechanism, for you to provide regexes of paths for your app resources, and they'd do all the header management themselves as well; they could handle the dynamic dictionary use case as well.
There's a lot of flexibility built into the current proposal. The actual compression, the shared Brotli, which is the use case we expect to go out the door first, is completely independent of the negotiation of the dictionaries: the Use-As-Dictionary header and available-dictionary negotiation can happen independent of Accept-Encoding and Content-Encoding. So if there's a new super-duper Wasm-aware, offsets-aware compression mechanism, or Zstandard, or something like that, you can keep the existing dictionary mechanism and just add the new scheme to the content negotiation.
There are opportunities on the build side to produce built assets that are more consistent from build to build by being aware of delta compression, so static variable maps and function-name maps and things like that, to reduce the size of the delta-compressed assets. And there are all sorts of research opportunities for generic out-of-band dictionary creation. The Brotli project itself has a dictionary generator where you can give it a whole bunch of files or resources and have it generate dictionaries using different algorithms.
But that area, the question of what the optimal dictionary to generate off of these resources is, is probably still ripe for a lot of research and work. There's a lot of stuff; this is still very early. We're planning to prototype it in Chrome and start experimenting with it. We do have a WICG explainer where we're starting the discussions and bikeshedding on everything, but there's a bunch of stuff going on.
For example, if you have the dictionary set to live for a year and you're doing multiple releases per day, you could get clients coming back at all sorts of times and have your middle-box caches just sort of explode. So when you're setting the expirations for dictionaries, you want to keep that in mind and ask: okay, what's a reasonable time window to do delta compression against? There's all sorts of bikeshedding on the naming. We came up with headers, values, compression tags, and dictionary names that we think make sense.
But everyone's got opinions on names, so those may change; those are what we're using as placeholders for now, but who knows. On the path matching: right now, we started out with just doing a prefix match, and we learned pretty quickly that won't work. Lots of people have build numbers as, effectively, a directory in the path for their static resources, so you need at least path matching that allows some sort of wildcard in the middle of the string.
For now, we're using the same wildcard support that the extensions manifest V2 and V3 support for URL matching, which is basically: you can put wildcards (stars) in the URLs, and they'll just expand. But is there a need for something more complex than that or not? I don't know. Then there's the hash algorithm and how it's represented.
SHA-256 is what we're using right now. It should be perfectly good for unique file matching, especially within an origin.
The protocol does allow for future expansion of which algorithms to use, but the question is whether the hash should be encoded as base64 in the request headers, to keep it smaller, or as base16, which is slightly larger but is all ASCII that can be encoded in a file path, whereas base64 includes slashes among its available characters. Also, with base16, a lot of command-line tools output base16, so you could literally just take the header value, look for that value as a file extension on disk, and do a quick mapping. So we're leaning towards base16, even though it's slightly longer, just for ease of deployment.
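The size trade-off is easy to verify: a SHA-256 digest is 32 bytes, so base16 is 64 characters while standard base64 is 44, but the base64 alphabet includes `/` and `+`, which are awkward in file names:

```python
import base64, hashlib

digest = hashlib.sha256(b"example dictionary contents").digest()
assert len(digest) == 32  # SHA-256 is 32 bytes

b16 = digest.hex()                       # 64 ASCII chars, safe in file names
b64 = base64.b64encode(digest).decode()  # 44 chars, alphabet includes '/' and '+'

assert len(b16) == 64
assert len(b64) == 44
# base16 uses only [0-9a-f], so the header value can double as a file extension:
assert all(c in "0123456789abcdef" for c in b16)
```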
On Brotli support: shared dictionary support was added to Brotli, based on all of the attempts that we've been trying (big thanks to the Brotli team for this), in August 2021, but the last stable official release of Brotli was in August of 2020.
So if you do an apt or brew install of Brotli today, you don't get the version that has the -D command-line option for external dictionary support; you have to build it from source. There will be a new release of the Brotli code in the next month or two that includes all of the optimizations they've done to the code over the last few years as well, so hopefully the tooling will soon be easier to install for experimenting with this.
As the code currently stands, external dictionaries are only used in the Brotli compressor if you set quality level five or higher. That's arbitrary; there's no real restriction behind it, and so we're going to work with the Brotli team to see: hey, what does it look like when you use external dictionaries at q1?
Can you get the benefit at low CPU overhead? A lot of dynamic compression with Brotli today on the edge is done at Brotli level one, mostly to be able to say, yes, we're doing Brotli, and get the bulk of the savings with minimal CPU overhead. If we can cut the size in half while using dictionaries and still not increase CPU significantly, that will probably be a win. As for the standards path: it's coming to Chrome in an origin trial.
Soon, you will all be informed. We'll be figuring out what the rough edges are, how the deployment goes, how it works in the wild, and sorting through any CDN issues and things like that.
There's the IETF HTTP working group; we've already talked to them about getting started on this, last Saturday (or last Sunday) at the IETF meeting in Japan. The feedback so far is very positive; I don't remember hearing any negative feedback, which is rare for this kind of thing.
I expect the bulk of the standards work is going to be in the HTTP working group, because that's where the bulk of the actual compression and negotiation happens, but there are pieces that touch the Fetch and HTML specs around the browser-specific CORS and privacy requirements and the link tag for fetching external resources.
We're optimistic that this is the time; this feels very much like it's shippable and deployable without a lot of work on the developers' side, and it should solve the bulk of the use cases. So both Yoav and I are extremely excited about the possibilities that this can bring. Yeah, that excitement is shared. I'm stopping the recording, and then we can discuss.