From YouTube: IETF99-NETVC-20170717-1550
Description
NETVC meeting session at IETF99
2017/07/17 1550
https://datatracker.ietf.org/meeting/99/proceedings/
A
Hello, hello, hello. That works, right? Okay, good. So welcome everyone to Internet Video Codec, NETVC. This is our only session this week. We have a relatively packed agenda, but let's go through our introduction slides. We need a Jabber scribe; our Jabber room is netvc@jabber.ietf.org. Do we have a Jabber scribe? It's an easy job. Thank you very much, sir; thank you, Jonathan, for nominating yourself to be Jabber scribe. We also need a note-taker. I'll play the Jeopardy music in a minute; don't make me sing. Yay, thank you so much. Sorry, what's your name? Tessa. Thank you very much; so Tessa will be taking notes, however you wish, Tessa, it's up to you. Blue sheets are going round as per normal. We do have one remote presentation today, so please say your name clearly and loudly at the mic, so that Tessa can get it down in the notes properly and so that the remote presenters will know who is speaking. I will try to speak slower as well. Okay, next slide. The Note Well; people should be quite familiar with the Note Well.
B
How's everyone enjoying this on the remote side, the periodic booms? Okay. Now that everybody's muted their headphones: the requirements document is basically ready for progressing. We're at version -06 right now, and there were a few changes in that version, mostly around section 3.1.1, calling out that the objectives for compression efficiency really apply to all of the use cases that are defined earlier in section 2. Before, it tried to call out some specific use cases, like natural content as well as screen-sharing content, and rather than enumerate them it's better to just reference section 2, which has all of the use cases. So the compression efficiency targets apply to all of those use cases, and there were no other substantive changes. Where we are with the document right now: we completed the working group last call after this update, the -06 update, back at the end of May, and the current status...
D
Alright, is that better? Yes, I can hear that now. All right, so I've looked at these slides for about 30 seconds and didn't make them, so this should be great.
D
D
So there's a subjective testing procedure defined, basically using the same codec and command-line configuration as all the objective tests, but we only select one quantizer each for high and low latency to test visually. We added a subjective test set, which is basically just a small subset of the full objective test set, because we have to actually look at these manually, so smaller is better. And we have implemented a tool in arewecompressedyet and the analyzer which supports subjective testing.
D
So basically it gives you a split view, or lets you flip back and forth between the videos; it randomizes the presentation order, and you vote for which one you think looks best, or whether it's a tie, and there are instructions that it shows the voters for all of that. Right, next slide.
D
So, on the statistical analysis side: generally you need about twelve viewers to get results that are significant. That's not a guarantee that if you have 12 you'll have significant results, but you need probably at least that many. All the voting is "prefer A" or "prefer B", so there's no indication of how strong that preference is, and then we test for significance using a sign test. Anyone who votes "tie" basically counts as half a vote for and half a vote against, or half a vote for A and half a vote for B, however you want to think about it, with the main effect that as you get lots of ties it becomes much harder to get a significant result. All p-values under 0.05 are considered significant.
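(A minimal sketch of the sign test as described above, with ties counted as half a vote each way and the 0.05 threshold; the function name and the exact tail calculation are illustrative, not taken from the actual tooling.)

    import math

    def sign_test_p_value(votes_a, votes_b, ties):
        # Ties count as half a vote for each side, as described above.
        a = votes_a + ties / 2.0
        n = votes_a + votes_b + ties
        # Two-sided binomial test against p = 0.5: take the more extreme tail,
        # double it, and cap at 1.
        k = max(a, n - a)
        tail = sum(math.comb(n, i) for i in range(math.ceil(k), n + 1)) / 2.0 ** n
        return min(1.0, 2.0 * tail)

    # Example: 12 viewers, 9 prefer A, 1 prefers B, 2 ties -> p is about 0.039,
    # which would count as significant under the 0.05 rule.
    p = sign_test_p_value(9, 1, 2)
    print(p, "significant" if p < 0.05 else "not significant")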
D
All right, so subjective-1 is the test set; it's a subset of objective-2-slow, and these are the five videos in there.
D
It still looks like a blocky mess, so that's something that will probably change in the future. Next slide. So we have a few examples of some of the tests that we've done. One of them was a test of the current Constrained Directional Enhancement Filter, CDEF, versus the old constrained low-pass filter, CLPF, that was in Thor. Steinar is going to talk in a bit more detail about those during his presentation.
D
But those are a couple of links you can click on, or even type in, they're short enough, so you can get an example of the kind of tests that we've been running. These have all completed at this point, so while you're welcome to go ahead and vote on things, we've already tallied the results there. If you're curious, CDEF wound up being significantly better than CLPF, at least for several of the videos that we tested, at a statistically significant level.
B
I think that's it. If you want, I'll jump to an example; if people are interested, I have one loaded.
B
So that's what this subjective test looks like in the interface. There's a little tutorial to guide you through it, and once you've seen it and have some experience with the subjective testing, we'll be happy to start forwarding all of these subjective testing requests to the list, and people can start evaluating the tools that we're looking at.
B
Yeah, the resolution on here is killing it; this is not designed for 4K.
G
There have been no changes in the GitHub repository since Chicago, but there's still some work that has been done. I think the consensus in Chicago was that we should aim to have both Thor and Daala converge, and that would include merging the loop filters, the Daala deringing and the Thor CLPF. Jean-Marc presented how we did that for AV1, and so I began doing that for Thor as well, but I haven't quite finished yet. Also, Thor is lacking proper entropy coding, so that's also on the list.
G
G
So the original CDEF design had a directional filter, which corresponds to the first Daala deringing filter, and then a cross filter corresponding to the Thor CLPF, and the second filter is applied on top of the first filter. That gave some hardware concerns over line buffer requirements, because both filters can do vertical filtering, so when you apply one on top of the other, the line buffer requirement increases. That was originally addressed by restricting the second-stage filter in certain cases, but I think that was really a quick fix.
G
So these are the taps. The first eight matrices are the primary taps, which will be weighted with the primary strength, and the lower matrices are the arrangement of the secondary taps, which will have a separate strength, in the single-pass filter. I tried both this set of taps and also a few more taps, extending the upper eight matrices to 7x7 so that there were two extra taps, but that didn't change the objective results. So this is what I am currently implementing for Thor. Next slide.
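(A rough sketch of the single-pass combination described here: primary and secondary taps applied in one pass, each weighted by its own strength. The tap values, the constraint function, and the normalisation are illustrative, not the actual CDEF tables.)

    def constrain(diff, strength):
        # Illustrative clipping of a neighbour difference by the filter strength.
        return max(-strength, min(strength, diff))

    def single_pass_filter(center, primary_neighbours, secondary_neighbours,
                           primary_strength, secondary_strength,
                           primary_taps, secondary_taps):
        # One pass: sum primary and secondary contributions, each constrained
        # and weighted by its own tap and strength, then add to the center pixel.
        total = 0
        for px, tap in zip(primary_neighbours, primary_taps):
            total += tap * constrain(px - center, primary_strength)
        for px, tap in zip(secondary_neighbours, secondary_taps):
            total += tap * constrain(px - center, secondary_strength)
        # Normalise by the (illustrative) fixed-point scale of the taps.
        return center + (total + 8) // 16

    # Example: a center pixel with four primary and four secondary neighbours.
    out = single_pass_filter(100, [104, 97, 110, 90], [102, 99, 101, 98],
                             4, 2, [3, 3, 2, 2], [2, 2, 1, 1])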
G
So these are the objective results comparing the two-pass filter with the one-pass filter; negative numbers mean that one pass is better. In luma, PSNR is about 0.2 percent better, which is close to the noise range, but at least on the right side of zero. If you look at the chroma numbers, they are better, and in particular if you look at the CIEDE2000 numbers, which combine luma and chroma, we get about half a percent, which is not much, but it's nice for something.
G
The tests were done in AV1, but again I don't think that would be much different from what we would see in Thor. And as they mentioned, there was a significant preference for CDEF in some cases in the low-latency tests; for the high-latency cases there was no significant preference, but CDEF still got more votes than CLPF for every sequence, both in low delay and high delay; the numbers were just not significant.
G
G
So these are the results for all the sequences: red is the vote count for CLPF, gray are the tie counts, and CDEF is the green bar. In all cases there are more votes for CDEF, and in two cases the difference is significant; this is for low latency. If you move on to high latency on the next slide, there's no significant difference, but again CDEF has more votes than CLPF.
G
Now for how the compression and complexity trade-offs are looking. To assess that I have been using the regular objective-1-fast set. I didn't use objective-2-fast, because from time to time the objective-2-fast test breaks AV1; it might have been fixed now, but I made a test so that we could see how AV1 has been doing over time, so I needed to do it with the old test sets. I also selected a subset of the objective-1 test, which is just the video conferencing content.
G
G
I also ran VP9 and AV1 in both error-resilient and non-resilient modes, because Thor is always error resilient, so in order to do a proper apples-to-apples test I did both, and where I compare the different codecs I used Thor in high-complexity, low-latency mode as the BD-rate anchor. Next slide.
G
A
G
So if you look at the complexity, and here the y-axis is logarithmic and it shows frames per minute, not frames per second, it started a year ago at about twenty-three frames per minute, and the latest code will run the same sequences at 1.9 frames per second... I'm sorry, per minute.
B
D
Yeah, Tim Terriberry from Mozilla. I'm not sure exactly which commits Steinar measured, but one possibility is that there were some changes to select which reference frames to use for each block independently of searching all the possible coding modes for that block, which allowed you to make much quicker selections of which reference frames to use. That was something that happened after we expanded from three reference frames to six reference frames, so the expansion probably made it much slower, and then speeding up that selection may have made it faster again.
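(A hedged sketch of that kind of encoder shortcut: prune the reference-frame candidates for a block with a cheap cost estimate first, then run the full mode search only over the survivors. The cost function, keep-count, and return shapes are illustrative, not what the AV1 encoder actually does.)

    def select_references(block, candidate_refs, cheap_cost, keep=2):
        # Rank all candidate reference frames by a cheap cost estimate
        # (for example a fast motion-estimation SAD), then keep the best few.
        ranked = sorted(candidate_refs, key=lambda ref: cheap_cost(block, ref))
        return ranked[:keep]

    def encode_block(block, candidate_refs, cheap_cost, full_rd_search):
        # The expensive rate-distortion mode search only runs over the pruned
        # set, which is much cheaper than searching all modes for all six refs.
        refs = select_references(block, candidate_refs, cheap_cost)
        return min((full_rd_search(block, ref) for ref in refs),
                   key=lambda result: result[0])  # assumes (rd_cost, mode) tuples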
G
So over the last three or four months there have been a lot of new tools being added to the code base and enabled by default, and I'm sure that has happened without the optimizations being fully done; also some of the tools compete for the same gains, and there is still some work left to get a proper integration.
G
I could just have been unlucky in picking exactly those commits, because I selected whatever was in the repository on the 1st and 15th of every month, I think. But it does show a trend, and it roughly corresponds to the compression, so basically higher compression comes with a cost.
D
So, Tim Terriberry again: you're probably thinking of the Wikipedia clip, which is a screen capture of somebody scrolling through a Wikipedia article. There are also a few Twitch videos, one including Minecraft, which may benefit from the screen coding tools, but I don't think the benefit was nearly as large as the benefit for Wikipedia.
G
G
It will add some more complexity, obviously, but it's not that huge; we're talking about a few percent of running time, depending on the complexity setting. And the entropy coder will likely add some complexity, but again it's not doubling or anything like that. And the screen content tool, it hasn't been invented yet, so it's hard to tell.
G
B
D
There are a few, but they're small. Okay, all right. So I'm not Thomas Daede, but he did most of the work for this, so his name's on the slide. I basically wanted to go over this change, which was just something that we discovered while working with the VP9 RTP payload specification and thought, that's not great, maybe we could make that better. So basically we had a couple of requirements. Next slide.
D
If you want to do something like temporal scalability, it should be possible to determine and control which previously coded frames are dependencies of the current frame. So if I have a bunch of layers, I want to know when I can actually drop a frame, and I want to be able to construct the layers in such a way that I can drop frames without breaking anything.
D
D
Conversely, if I want to have error resilience, it should be possible to determine if the decoder is missing a frame that's required for decoding; that way I can ask for it again, or I can decide to drop some frames, and that lets you build a decoder that never shows a broken frame. So this is sort of like the previous case, but instead of intentionally deciding which frames to drop, sometimes I just won't get a frame and then I have to figure out how to handle it. Alright, next slide.
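(A small sketch of the checks both requirements boil down to, assuming the dependencies of each frame are explicitly known to the middlebox or receiver; the data layout is made up for illustration.)

    def is_decodable(frame_deps, received):
        # Error-resilience check: a frame can be decoded only if every frame
        # it depends on has been received.
        return all(dep in received for dep in frame_deps)

    def can_drop(frame_id, later_frames):
        # Temporal-scalability check: a frame can be dropped only if no
        # later frame lists it as a dependency.
        return all(frame_id not in deps for deps in later_frames.values())

    # Example: frame 4 depends on frames 1 and 3; frame 3 was lost.
    print(is_decodable({1, 3}, received={0, 1, 2, 4}))      # False
    # Frame 2 (an enhancement-layer frame) is referenced by nothing later.
    print(can_drop(2, later_frames={3: {1}, 4: {1, 3}}))    # True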
D
So let's talk about how this works for VP9. There are a bunch of reference frame dependencies. Basically, you're allowed up to three reference frames: each frame can reference up to three different frames out of a pool of eight that the decoder maintains, and these are implicitly or explicitly signaled with picture IDs in the RTP mapping. The implicit version is basically that you just set up a pattern that gets used over the whole group of pictures, and in the explicit version the frame header just has a list of up to three picture IDs, and those are the ones you reference. But then there's this other set of dependencies, which comes from what VP9 calls frame contexts.
D
What these basically are is the probabilities used for the entropy coding. VP9 stores probabilities that are backward-adapted based on data from previous frames, and the decoder maintains four independent sets of these probabilities. Each frame signals which one it wants to use, and can optionally write back to that same set with updates based on the data that was decoded from the current frame. This choice is completely uncorrelated with your reference pictures or picture IDs or any of that other stuff.
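(A toy model of the decoder-side state just described, only to make the hidden dependency concrete; the field names are invented for the sketch and are not VP9 syntax.)

    DEFAULT_PROBS = {"mode": 0.5, "mv": 0.5}   # stand-in for the static defaults

    class Vp9LikeDecoder:
        def __init__(self):
            # Four independent probability sets, separate from the eight
            # reference picture slots (not modelled here).
            self.frame_contexts = [dict(DEFAULT_PROBS) for _ in range(4)]

        def decode_frame(self, ctx_idx, write_back, adapt):
            # The frame header picks one of the four contexts by index,
            # completely independently of which reference pictures it uses.
            probs = dict(self.frame_contexts[ctx_idx])
            # ... entropy-decode the frame with `probs` ...
            updated = adapt(probs)  # backward adaptation from decoded symbols
            if write_back:
                # The hidden dependency: a later frame that reads this slot
                # now depends on *this* frame, even if it never references
                # its pixels.
                self.frame_contexts[ctx_idx] = updated

    # Example: frame N writes slot 2; a later frame that reads slot 2 cannot be
    # decoded correctly if frame N was lost, regardless of its reference pictures.
    dec = Vp9LikeDecoder()
    dec.decode_frame(2, write_back=True,
                     adapt=lambda p: {k: v * 0.9 for k, v in p.items()})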
D
So, next slide, you can imagine this creates some problems. If you lose a frame in the error-resiliency case, you don't know which slot it updated, so you actually no longer know if you can decode any frame; but also, the last frame to update the slot you're using might not have been one of your reference frames.
D
So if you're going through your RTP headers and saying, okay, do I have all the frames I need to be able to decode this for the current layer, or can I safely drop this frame and not break anybody else, you don't actually know unless you parse into the packet and figure out which of these frame context slots it's updating and what other frames that affects.
D
So there are a couple of ways that we could handle this, but basically what's happening here is that we've introduced this potential hidden fourth frame dependency, and for people who are designing RTP mappings this is surprising, because everybody thinks, oh, I told you what reference frames to use, that's all you needed to know, right? But actually there's this extra dependency, and the RTP mapping only signals three picture IDs. So there are a couple of ways we could fix this: we could signal a fourth picture ID.
D
D
Or we could do better. And then the final problem is that you can't fork the probabilities and evolve them independently, because of the requirement that you can only write back to the slot you read from. Basically, what that means is that every layer, if it wants to have its own independent set of probabilities, has to pay the cost of adapting them from the static defaults independently of all the other layers, so you don't get to share any of that overhead. Next slide.
D
So we've made a proposal for AV1, because AV1 basically has all these same problems, and then more problems on top of that. One change that AV1 did make is that it now explicitly signals the frame IDs in the codec payload instead of having them in the RTP header, and that's actually good; it means it gets done consistently the same way everywhere. It now allows up to six reference frames per frame, still drawn from a pool of eight.
D
D
Temporal motion vector prediction in the original design always picked the motion vectors from the last coded frame, and if you were coding things in such a way that the last coded frame wasn't going to be available, or you didn't want to rely on it being available, then you just didn't have temporal motion vector prediction; sorry, you couldn't use it. So that was sort of fixed up by this.
B
From the floor mic: just a comment on the first one, for resilience. I'm not sure what you meant by frame IDs there, but for resilience we also have these frame numbers now that have been added, which go beyond just which one of eight; you can actually have a much larger frame number, like a 10-bit or 12-bit frame number, so that if you drop one, you actually know that you dropped one.
D
Yeah, yeah, that's what I meant; it's basically the same as the picture IDs in the VP9 RTP mapping. I think they're not necessarily the same number of bits, but it's a similar idea. Okay.
D
So this is basically the situation now. You have this pool at the top of reference frames; each one of them has a buffer of actual pixels in it, and, as I said before, we have these temporal motion vectors that get saved for use in motion vector prediction in future frames, and with the temporal MV signaling, what they did is they just moved that buffer into the reference frame buffer.
D
So every reference frame has a copy of the motion vectors that were decoded with that reference frame, and then when you pick your list of references to use for the current frame, the first one becomes the one that you draw those motion vectors from. But then down at the bottom here there are these frame context slots that have all the probabilities in them, and for those you just point to some index in that table.
D
No, that's just coded in the header, completely independently of all the reference frames. And then you have this global motion data, which is just always taken from the previous frame, and if you don't want to use the previous frame, then, I'm sorry, you don't get to use global motion. All right, and so with our proposal it looks more like this: basically we move all the probabilities up into the reference frames as well, and also the global motion data, as this diagram shows.
D
D
So now what happens is that whatever is the first frame in your list of reference frames, you now draw from it not only the reference pixels but also all of your motion vectors, all of your probabilities, and the reference global motion data that you predict from; everything just comes out of that first slot you're pointing to. Jonathan?
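(A sketch of the proposed state layout, with everything hanging off the reference frame slots and the "extra" state drawn from the first listed reference; the names and the dictionary shapes are illustrative, not the actual AV1 structures.)

    from dataclasses import dataclass, field

    @dataclass
    class RefSlot:
        pixels: object = None            # the reconstructed frame
        motion_vectors: object = None    # temporal MVs saved with this frame
        probabilities: dict = field(default_factory=dict)  # entropy-coder state
        global_motion: object = None     # global motion parameters

    ref_pool = [RefSlot() for _ in range(8)]   # the decoder's pool of eight

    def start_frame(ref_indices):
        # The frame header lists up to six of the eight slots as references.
        refs = [ref_pool[i] for i in ref_indices]
        primary = refs[0]
        # Everything that used to live in separate frame-context / tempmv /
        # global-motion state is now drawn from the first listed reference.
        return primary.probabilities, primary.motion_vectors, primary.global_motion

    def finish_frame(dst_slots, decoded):
        # Storing the frame into a slot also stores its updated probabilities,
        # its motion vectors, and its global motion parameters.
        for i in dst_slots:
            ref_pool[i] = RefSlot(decoded["pixels"], decoded["mvs"],
                                  decoded["probs"], decoded["global_motion"])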
D
Why did you make it the first slot rather than a selectable slot? Because I don't want to pay the bits for the selection cost, and coding fewer things in the header is generally good from an IPR perspective.
D
So basically we remove all the frame context slots; those just become reference frame slots. We remove all the syntax elements for saving and restoring frame contexts, and it's actually more than three elements, because you also had to code a bit to say whether you wanted to write back to that buffer. Instead, we always save a frame context with a reference buffer: whenever we store a reference frame into one of those slots, we also store the updated probabilities from the current frame, the temporal motion vectors, the global motion data, and so on. We also no longer need syntax to reset frame contexts. On a key frame we just reset all the reference frames, which includes resetting the probabilities and everything else; on an intra-only frame we reset just the one specific reference slot that the intra frame gets stored back to, but not any of the others; that's an intra frame that is not a key frame. Next slide.
D
So there are a few complexities with this. There's now, as Jonathan points out, this interaction between a reference's position in that list and what its function is. In our current encoder the first reference was basically always the last frame, the most recent frame from the same layer, and so now you need to reorder that reference list if you want to control which frame you take the probabilities from.
D
D
So, as I said, if we have an intra-only frame that is not a keyframe, there's currently no way to use a previous frame context, so your probabilities always get reset. That's the same way things worked in VP9, and we didn't change it because we didn't think it was that useful. And finally, previously you could take probabilities from a non-reference frame, and now you can't, just because there's no way to code that.
D
But since we can now list up to six of our eight potential references as references for the current frame, the impact of that seems kind of low. Whichever frame you want to draw probabilities from is probably going to have some useful pixels in it to predict from, and if not, there's probably some other frame you could drop instead that wouldn't have been that important anyway, with a list of six out of eight.
D
So if you're decoding a non-reference frame... yes, you could in fact update a context and then use that in some future frame, correct. If nothing ever references it, then you can't use those probabilities; you have to write them back to be able to use them.
D
One caveat is that the global motion part is relatively recent, and we're doing that as part of the global motion proposal, which is not complete yet, so that hasn't happened yet. And while we were working on this, people started doing frame size prediction based on the previous frame as well, so now we need to move that in there too. But the main proposal is still to put everything in this frame context inside of a reference frame, so the main idea doesn't change; it can handle all these new things that people are adding. I think that's everything on this proposal. Anyone have any questions about any of that? All right, then I will switch gears and talk about a completely separate topic. I'm also not Luc Trudeau or David Michael Barr, but again, those are the people who actually did all the work that I'm about to talk about, and the tool I'm going to talk about is chroma from luma. Next slide. Basically, we have changed this a lot from previous proposals.
D
The stuff I'm going to talk about now is basically an evolution of the stuff we presented in draft-egge-netvc-cfl over a year ago. We've changed essentially everything, and this is complementary to the proposal in draft-midtskogen-netvc-chromapred, which is a variant of CfL used for inter prediction; what I'm going to talk about is solely used for intra prediction. Right, next slide. For those of you wondering what chroma from luma is: the idea is to try to exploit local correlation between the different color planes.
D
So originally we had designed CfL to work within Daala, which is a primarily frequency-domain based codec, so in Daala, chroma from luma predicted frequency-domain coefficients directly. That's hard to do in other codecs, particularly in AV1: for example, there are up to 16 different transform types, and the luma transform type might not match the chroma transform type, and the luma transform size may not match the chroma transform size.
D
That last one could sometimes happen in Daala, but since everything was a DCT, we sort of had a way to map from one to the other. If you have to expand that to work with all the different transform type combinations, it gets really complex and hard, so we gave up and said, maybe we should just do things in the spatial domain.
D
So when it works, it does okay, but when it doesn't work it can be really, really bad, and so we said, how about instead we just explicitly signal the model. When we did this in Daala, we actually got a small gain compared to trying to build the model implicitly, so that's what we're going to continue to do. All right, next slide. To compare this against things other people have done: LM mode is the HEVC proposal, the original one; Thor CfL is the draft-midtskogen one I talked about earlier;
D
Daala CfL is our previous work, and then the proposed thing over there on the right is what we're doing now. Compared to Daala CfL we've moved back from the frequency domain to the spatial domain; like Daala CfL, we are now doing explicit signaling of what the linear model is. The actual signaling is a little bit different because we're no longer using PVQ, which, you'll remember, is our perceptual vector quantization.
D
So we've basically just added a new intra prediction mode that is only used for the chroma planes, so it's UV-specific, and that signals when to use this; and as I said, we no longer require PVQ because we're doing everything in the spatial domain. And now, on the encoder side, we don't do an explicit model fit; we actually just search all the possibilities that we want to encode, and then on the decoder side...
D
So we average the luma pixels over the whole transform block, and then we also do any subsampling that we need to convert from 4:4:4 down to 4:2:0, and subtract off that constant offset, so now we're left with basically just the contribution to the AC coefficients, but still in the spatial domain, and we feed that into a search for the best linear parameters, one for each of the two color planes, Cb and Cr. Then on the bottom there, we take the original chroma pixels, and we also want to factor the DC term out of chroma; but since the decoder doesn't know what the reconstructed chroma looks like, we just do DC chroma prediction, which is sort of the best guess of what the DC will be, and we subtract that out and feed it into the search as well. Then the search goes over all of the possible choices of alpha for each color plane and explicitly codes that to the bitstream.
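(A simplified sketch of the prediction just described: subsampled luma with its block average removed, scaled by a signaled alpha per chroma plane, and added to the chroma DC prediction. Plain Python lists stand in for blocks; this shows the idea, not the actual AV1 integer arithmetic.)

    def cfl_predict_block(luma_sub, dc_pred, alpha):
        # luma_sub: luma already subsampled to the chroma resolution.
        avg = sum(sum(row) for row in luma_sub) / (len(luma_sub) * len(luma_sub[0]))
        # "AC" contribution: luma minus its average, still in the spatial domain.
        return [[dc_pred + alpha * (px - avg) for px in row] for row in luma_sub]

    def encoder_pick_alpha(luma_sub, chroma_orig, dc_pred, candidate_alphas):
        # Encoder side: no explicit model fit, just try every candidate alpha
        # and keep the one with the smallest squared error.
        def sse(alpha):
            pred = cfl_predict_block(luma_sub, dc_pred, alpha)
            return sum((c - p) ** 2
                       for crow, prow in zip(chroma_orig, pred)
                       for c, p in zip(crow, prow))
        return min(candidate_alphas, key=sse)

    # Example: a 2x2 chroma block whose variation tracks luma with slope 0.5.
    luma = [[100, 120], [80, 140]]
    chroma = [[60, 70], [50, 80]]
    alpha = encoder_pick_alpha(luma, chroma, dc_pred=65,
                               candidate_alphas=[-0.5, 0, 0.25, 0.5, 1.0])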
D
So there are a couple of choices here that we made for efficiency reasons. When we have a prediction block, we can subdivide it into multiple smaller transform blocks. When we do our luma average, we do it over just a transform block, which lets us do reconstruction transform block by transform block and basically minimizes the amount that needs to be buffered in hardware and things like that.
D
D
But doing the same for the chroma DC would make your search really hard, because every time you pick a different alpha, you would have to do a full transform and reconstruction to figure out what the DC prediction for the next transform block would be, just to figure out what the error impact of choosing that alpha would be for that transform block. So doing the DC prediction over the whole prediction block at once avoids that whole problem. All right, next slide. The decoder side is again pretty simple.
B
D
So, right now they're jointly coded. Basically what happens is that we code an angle in a plane of alphas: you have a two-dimensional plane of alphas, one axis for Cb and one for Cr, and we code a direction in that plane and then a magnitude along that direction. To the extent that they predict each other, the probabilities for those code points will increase, to the extent that the two are correlated. Okay.
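(An illustrative way to view that joint signaling: the (Cb, Cr) alpha pair mapped to a direction plus a magnitude, each of which would then be entropy-coded with its own adaptive probabilities. The quantization steps here are invented for the example and are not the actual codebook.)

    import math

    def alphas_to_symbols(alpha_cb, alpha_cr, num_angles=16, mag_step=0.0625):
        # Direction in the (alpha_cb, alpha_cr) plane, then a magnitude along it.
        angle = math.atan2(alpha_cr, alpha_cb) % (2 * math.pi)
        angle_idx = int(round(angle / (2 * math.pi) * num_angles)) % num_angles
        mag_idx = int(round(math.hypot(alpha_cb, alpha_cr) / mag_step))
        return angle_idx, mag_idx

    # Example: correlated alphas map to a small set of frequently used symbols.
    print(alphas_to_symbols(0.25, 0.25))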
D
D
And, of course, there are complications. The first one is sub-8x8 block sizes for 4:2:0, and also the other chroma subsampling formats. What happens for 4:2:0 is that if your luma blocks are smaller than 8x8, we don't want chroma transforms smaller than 4x4, so in our subsampled chroma we use one 4x4 transform, which then covers the same spatial extent as multiple luma blocks.
D
So as a result, that means you can actually have some of the blocks in this sub-8x8 region be inter coded while the chroma winds up being intra coded. So now we have to buffer luma from the inter-coded pixels as well as the intra-coded pixels in the sub-8x8 regions, and that might be a surprise; in fact, we've implemented this incorrectly at the moment, but we'll fix that. And then the next complication is doing chroma DC prediction for non-square blocks.
D
What happens is that the DC prediction works by basically summing up all the pixels to the left and summing up all the pixels above and then taking an average, and when your blocks aren't square, the number of pixels in that sum is not a power of two. So now you have to actually do a division, but the number of different cases there is pretty small.
D
So we can just implement that division with a lookup table, because dividing by either 2 or 3 is not that hard, and AV1 turns out to be adding rectangular transforms with rectangular prediction, so they're going to have to solve this problem anyway, and we'll probably wind up using the same mechanism they did when it comes time for that. Next slide. And then, finally, there's all sorts of fun at the boundaries of the frames.
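(A sketch of the lookup-table division idea mentioned above for the non-square DC average: the divisor is the count of left plus above pixels, which for 2:1 or 1:2 blocks is 3 times a power of two, so a small reciprocal table plus a shift covers every case. The table values are illustrative fixed-point reciprocals, not the AV1 constants.)

    # 16-bit fixed-point reciprocals for the few divisors that can occur.
    RECIP_Q16 = {2: 32768, 3: 21845, 4: 16384, 6: 10923, 8: 8192, 12: 5461}

    def dc_predict(left_pixels, above_pixels):
        total = sum(left_pixels) + sum(above_pixels)
        count = len(left_pixels) + len(above_pixels)
        # Replace the division by a multiply with a table reciprocal and a shift.
        return (total * RECIP_Q16[count] + (1 << 15)) >> 16

    # Example: an 8x4 chroma block has 4 left and 8 above pixels (count = 12).
    dc = dc_predict([100, 102, 98, 96], [100] * 8)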
D
B
D
Yeah, and maybe 128x128 someday. So if you have a large prediction block which overlaps this boundary, but smaller transform blocks inside that large prediction block, some of your transform blocks may be entirely outside that boundary, and those just don't get coded, which is fine, until you also realize that your chroma transform blocks can actually cover a larger area than the corresponding luma transform blocks. This might be easier to see if you go to the next slide with the picture. So here's an example of when this happens.
D
If I have a 32x32 prediction block with 8x8 transforms inside of it, in the luma plane it looks like this: as it runs into this frame boundary, the last four blocks there are just not coded; they just don't appear in the bitstream. But for a 32x32 prediction block with 8x8 luma transforms, the corresponding transform size in the chroma plane for 4:2:0 is also 8x8, which means it actually covers four times the area of a corresponding luma transform block. So now those blocks partially overlap that boundary, and since they're not completely outside, they still get coded, which means I now have a bunch of chroma pixels where there is no corresponding luma pixel to draw a prediction from. So currently what we're doing is just taking the last row of luma pixels and extending them downwards, which is simple and seems to work.
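(A tiny sketch of that boundary workaround: where the co-located luma rows were never coded, repeat the last available luma row downwards before running the CfL averaging; purely illustrative.)

    def extend_luma_rows(luma_rows, needed_rows):
        # Repeat the last decoded luma row to fill rows that were never coded
        # because they fall entirely outside the visible frame.
        out = list(luma_rows)
        while len(out) < needed_rows:
            out.append(list(out[-1]))
        return out

    # Example: only 2 of 4 co-located luma rows exist near the frame edge.
    padded = extend_luma_rows([[10, 12, 14, 16], [11, 13, 15, 17]], 4)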
D
D
It has a small effect on metrics; I doubt anyone would notice looking at the images, and ultimately I have a completely different set of proposals that I hope will clear all of this up, for simplifying how all of this is handled, and not just for CfL but for the whole codec, but I haven't done that work yet.
D
Ultimately your decoder isn't going to emit those things outside the visible frame anyway, is it? So what does it matter what you pick the chroma to be? Yes, exactly, and the answer is that you still have to make an encoder smart enough to encode something for them that doesn't affect other things, because transform coefficients ring across the whole block.
D
So I've included these nice outdated example images; they're about a month old, so they're not that outdated, but things have changed since they were generated. This is current AV1, and the next slide is when we add CfL; I don't know if you can see it, but basically we just get a huge amount of additional detail.
D
Right, and looking at the objective results: these are measured just on still images, since this is an intra prediction mode, so it does not have a large impact on inter frames. We're basically trading off about 0.3% BD-rate in PSNR to gain 5% in CIEDE2000, and CIEDE2000 is a metric which actually contains both luma and chroma, approximately perceptually weighted.
D
So it's doing a CIE Lab conversion and then computing Delta E, so it behaves very similarly to PSNR but has a perceptual weight for chroma in it. You usually sort of expect that small changes in luma can generate large changes in chroma, but once you take luma into account, trading 0.3 for 5 seems like a pretty reasonable trade-off. Still, we'd like to shift some of those gains back into luma, so we're also working on adjusting the luma-chroma balance in the encoder.
D
But currently none of that is at all sane in the way the encoder works; different parts of the encoder actually use completely different weights, so we're trying to sort that mess out, and then maybe we'll have a parameter we can tune to move some of those gains from chroma back into luma and have green numbers across the board. All right, I think that's it. Are there any questions?
D
So I think that's what Steinar was suggesting, right. We tried it in Daala; we have not tried it in AV1 with this new proposal.