From YouTube: IETF98-CODEC-20170330-0900
Description
CODEC meeting session at IETF98
2017/03/30 0900
C: A .patch, I don't think, is a viable format to submit slides in. You uploaded a .patch to the proceedings? Yes, we did.
A: We have two. One is the draft that you're talking about today, which was supposed to be done in November, so I guess we'll have to update that date. The other one was the bug-fix draft, which has passed working group last call and was waiting on us to figure out how to get a permanent URL for the new test vectors, which I think has now been sorted. So now it's just waiting on me to do the write-up and hand it off to them, so that should hopefully happen pretty soon.
G: Okay, so the agenda is fairly straightforward. We're going to start with a foundation on ambisonics, what that format is and how it relates to spatial audio and 3D audio, and then we'll discuss our spec proposal for adding ambisonics to Opus, including the new proposed mappings, as well as what kinds of calculations would be involved in those mappings.
G: We hear it as a composite signal and we interpret it as, oh, it's over here, it's over there. This is basically how HRTFs, head-related transfer functions, work: you have a filter that describes that path distance to each ear for a given position. What ambisonics allows us to do is not just model one point at a time; it allows us to model the entire sound field around the head. So this has a lot of advantages in video games, in VR audio, in 360 video.
G: Well, if we pretend that this blue sphere is not an omnidirectional microphone but a spherical microphone, some sort of spherical capsule that can capture sound across its whole surface, and if I were to place sources right there and there, let them ring out, and have them capture onto the sphere, you see that they would be captured at different points on the microphone. This is like a particular time snapshot, and this is telling you maybe the energy at that particular time snapshot.
G: Then what we have is some representation of two signals arriving at two different locations on the sphere. This is what reality would be, and we can use truncated spherical harmonics, otherwise known as ambisonics, as a way to represent an nth-order approximation to this function.
G: So if we consider first-order ambisonics: ambisonics in general consists of what are called spherical harmonics. Spherical harmonics are a set of orthogonal basis functions that describe shapes on a sphere, or positions on a sphere, however you want to think about it, and you can truncate the series. This basically controls the resolution, the spatial acuity, at which you can describe the shape contour of the pattern you're trying to describe. If we use just first order, we get four components.
G: We get the one at the top as an omnidirectional channel, and then we get these three directional modes, one along the X, one along the Y, one along the Z axis, and with just these four channels we can then express directional signals that arrive from all directions. So if we wanted to use just these four channels, we can get a first-order approximation, which looks like this. Now, that doesn't look particularly great, but it has the right directivity, more or less.
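As a rough sketch of what those four components carry, here is how a single mono sample at a given direction could be encoded into first-order W/X/Y/Z. The SN3D normalization and ACN channel order (W, Y, Z, X) are assumptions on my part; the talk does not pin down a convention:

```python
import math

def encode_first_order(sample, azimuth, elevation):
    """Encode one mono sample into first-order ambisonics.

    Returns channels in ACN order (W, Y, Z, X), SN3D-normalized.
    Angles are in radians; azimuth 0 is straight ahead.
    """
    w = sample                                          # omnidirectional
    y = sample * math.sin(azimuth) * math.cos(elevation)
    z = sample * math.sin(elevation)
    x = sample * math.cos(azimuth) * math.cos(elevation)
    return [w, y, z, x]
```

A source straight ahead (azimuth 0, elevation 0) lands entirely in W and X; a source directly overhead lands in W and Z.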
G: This is all fine and dandy for very wide, diffuse fields where you have, for example, like you're out near the ocean and you're just recording a soundscape around the ocean, something where the directivity doesn't matter as much. But in our example here we had something that had very, very sharp directivity to it, with two very distinct sources. So for that we would need to extend up to include more channels, otherwise known as higher-order ambisonics. What you see here is third-order ambisonics.
G: We introduce additional basis functions, and these are more spherical-harmonic shapes that contribute to effectively a higher spatial resolution that we can use to describe the scene. A third-order approximation gives us a much better fit to reality, and obviously, as you go up, you can get a finer and finer approximation. Currently there are a lot of systems that use third order, but we pretty much refer to anything above first order as higher-order ambisonics.
G: Third order is the goal right now, but, as you can see, the number of channels rises quadratically with the order. So, given more bandwidth and given better compression schemes, we might be able to expand to higher orders. But the point is that with ambisonics there is a defined number of channels for a given order, and the order can vary depending on the content, and you get closer and closer to what you actually had in reality.
G: Okay, so how these systems are typically rendered is usually through either a loudspeaker array, like what you see here, or through a virtual loudspeaker array, which would be the same kind of representation but replacing the physical loudspeakers with some set of HRTF filters corresponding to each loudspeaker. What you do, then, is you take the ambisonics signal, which has that representation of spherical-harmonic modes, and you project it into what's called the loudspeaker space.
G: The projection typically involves a pseudo-inverse of what's called the encoding matrix. So you get a decoding matrix that projects your ambisonics signal onto some defined loudspeaker array like this, and if you did this binaurally over headphones, you would have a corresponding HRTF for each one of these, for both the left and right ear. So in this case, I think this is 32 speakers.
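The "pseudo-inverse of the encoding matrix" step can be sketched in a few lines. Everything concrete below (a hypothetical 4-speaker horizontal square, first-order channels in ACN order W/Y/Z/X) is an illustrative assumption, not something specified in the talk:

```python
import numpy as np

# Hypothetical virtual array: 4 loudspeakers on the horizontal plane.
az = np.deg2rad([45.0, 135.0, 225.0, 315.0])

# Encoding matrix: column j holds the first-order spherical-harmonic
# values (ACN order W, Y, Z, X) for speaker direction j.
E = np.stack([np.ones_like(az),    # W (omni)
              np.sin(az),          # Y
              np.zeros_like(az),   # Z (all speakers at elevation 0)
              np.cos(az)])         # X

# Decoding matrix: pseudo-inverse of the encoding matrix.
D = np.linalg.pinv(E)              # shape: (speakers, ambisonic channels)

b = np.array([1.0, 0.0, 0.0, 1.0]) # first-order frame: source straight ahead
feeds = D @ b                      # one gain per loudspeaker
```

For this frontal source, the two front speakers (at +45 and -45 degrees) come out with equal gains that dominate the rear pair, which is the behavior a decoder should show.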
G: So for 32 speakers you would have 64 HRTFs that you would process, 32 for each ear, sum them all up, and then you would get the third-order (or whatever-order) ambisonic sound field rendered to the ears. So this gives you a sense of a sound field that's around your head, and, just like if you were within a loudspeaker array, you could actually tilt your head from side to side.
G: You can turn left or right, up or down; the loudspeakers would stay exactly where they are, and hence you're actually able to rotate the sound field with respect to your head orientation. So this allows us to give the sense that there's an actual 3D space to this. Sorry, this is not...
G: This is head-tracked. In addition to that, a lot of designers that work in this space also want to have, you know, dialogue or soundtracks, things that they don't want head-tracked but that they still think are integral to creating the scene, and so we represent that with this pair of headphones that are also on your head. So if you extend the visual metaphor, the ambisonics sound field gets projected to these loudspeakers and your head moves.
G: The loudspeakers stay put, and the headphones represent a set of non-diegetic audio, non-head-tracked audio, that would follow your ears no matter which orientation you have. So that's a basic overview of ambisonics without going too deep into the math, and now we want to discuss how to add ambisonics into Opus, starting first with the mappings and then the calculations.
G: But there is... yeah, we'll explain that in a second. Okay, right. So for channel mappings 2 and 3, there's an expected number of channels depending on the order n, which can be 0 through 14, and an additional parameter j, which can be either 0 or 1. n describes the ambisonic order and, like I said before, the number of channels goes up quadratically with respect to the order. You see this in the (1 + n)^2, and the 2j is the addition of that headphone track.
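The channel-count rule just described, (1 + n)^2 ambisonic channels plus 2j for the optional stereo pair, is simple enough to state as code (a sketch; the function name is mine, not from the draft):

```python
def expected_channels(n, j):
    """Channels for the proposed mapping families: (1 + n)^2 ambisonic
    channels for order n (0..14), plus 2 more when the non-diegetic
    stereo pair is present (j = 1)."""
    if not (0 <= n <= 14 and j in (0, 1)):
        raise ValueError("order must be 0..14 and j must be 0 or 1")
    return (1 + n) ** 2 + 2 * j
```

First order with the headphone track gives 6 channels; third order without it gives 16.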
G: Additionally, if you're going to use mixed-order ambisonics, mixed order would be, for example, you may want to use third order horizontally but only first order vertically, and you would not include some of the basis functions for that mixed order. You would simply zero out the channels that you don't use and still send the full set of ambisonic channels for that order.
G: These ambisonic channels are ordered by what is called the ambisonic channel number, the ACN. This is a defined standard from the people that produced AmbiX, and so we follow that as well. It follows a very straightforward scheme, and we simply extend the ACN by including the additional left and right channels for the optional non-diegetic stereo at the end of the ACN channel numbers.
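For reference, the ACN scheme maps spherical-harmonic degree l and index m to the channel number l(l + 1) + m, and the proposal then appends the optional stereo pair after the last ACN channel. A sketch (helper names are mine):

```python
def acn(l, m):
    """Ambisonic channel number for degree l, index m (AmbiX ordering)."""
    if not -l <= m <= l:
        raise ValueError("m must satisfy -l <= m <= l")
    return l * (l + 1) + m

def channel_order(n, j):
    """ACN channels 0..(1 + n)^2 - 1, then the optional non-diegetic L/R."""
    labels = [f"ACN{acn(l, m)}" for l in range(n + 1) for m in range(-l, l + 1)]
    if j:
        labels += ["non-diegetic L", "non-diegetic R"]
    return labels
```

So for first order with the stereo pair, the stream order is ACN 0 through 3 followed by the two non-diegetic channels.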
G: Now into the details of the calculations. In terms of the coding details, the differences between channel mapping 2 and 3: channel mapping 2 is a direct per-ambisonic-channel coding scheme. The way this works is we code each ambisonic channel directly, and we have a variable bitrate allocation for each of those channels; more bits are placed in the omnidirectional channel and fewer bits are placed in the directional channels. Channel mapping 3 is a little bit fancier, and the proposal here is that, because the sound field often has a lot of coherence...
G: It's very often that the most compact representation of your sound field may in fact be better expressed in a transform space other than spherical harmonics, for example the loudspeaker projection I mentioned before, or some other arbitrary projection. So we offer the ability of introducing a transform to the encoder, known as the mixing matrix, and another transform, known as the demixing matrix, from the coded streams back to the output streams.

G: In this example, the U vector is our input streams, which go up to C, which is the number of ACN channels, with or without that additional stereo count. The encoder applies some mixing matrix A, which can be a linear matrix like this, or it could be something else depending on implementation details. The number of streams you end up coding is K, which will be the number of streams plus the number of coupled streams. The way we do the coupling is we couple starting from the top.
G: We just couple starting from the top, as we assume you're transforming the space into some sort of coherent representation so that you can take advantage of coupling each pair. X is the set of coded streams, and the demixing matrix reprojects X back into your output streams. This is the matrix that we propose to store in the header, so that as the encoder handles the mixing process it will also store this demixing matrix, and the decoder can interpret that during the demixing.
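The mixing/demixing flow described above can be sketched end to end. The concrete shapes here (9 second-order input channels mixed down to 4 coded streams, and a random A) are made-up illustrations; a real encoder would choose A based on the content:

```python
import numpy as np

rng = np.random.default_rng(0)

C, K, frames = 9, 4, 960              # C input channels, K coded streams
U = rng.standard_normal((C, frames))  # input streams, ACN order

A = rng.standard_normal((K, C))       # mixing matrix (encoder side)
X = A @ U                             # the K streams that actually get coded

D = np.linalg.pinv(A)                 # demixing matrix, stored in the header
U_hat = D @ X                         # decoder reprojects to C output streams
```

Since K is smaller than C, the round trip is an approximation; the bet is that real sound fields are coherent enough that a well-chosen A loses little.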
C: Echo, you can ask to speak at the mic and then we'll enable your audio to the room.
C: So the question is: the channel mappings seem pretty straightforward, but when you get into these higher-order ambisonics and lots of channels, is there any consideration of work to actually make the encoding more efficient? Is there room to do that, besides just channel mapping?
G: Yeah, so in terms of the encoding efficiency, a lot of that can be taken on by designing that A matrix. Especially with very, very high-order ambisonics, unless you're dealing with very sharp directional signals, you're going to have a lot of coherence in the signal, and it's very likely that you can think of a matrix transform that would put your channel count into a much more compact representation.
G: So, for example, let's say you have fifth-order ambisonics, which would be 64 channels, but you only have a handful of sources, two or three sources in the room at a given time: maybe a violin over here, someone talking over here, and a whale behind you. So you might actually have a matrix that can go from 64 to 3, and then you end up only needing to encode three channels, and the demixing would be the re-expression of that back into the 64.
A: So I think we had discussed on the list briefly that we drafted the original Ogg Opus draft poorly, in the sense that it says that any channel mapping family not listed in the original draft should be treated as channel mapping family 255 if you don't recognize what it is. And I think we discussed some updates to that draft that would essentially go back and fix that language to say something more sensible, but I didn't see any of those in your latest ambisonics draft.
G: I mean, yeah, I believe perhaps that was not clear; I'll have to go back and look at that. I think, if I understand you correctly, the concern is that if, for example, the decoder sees a channel mapping that's not 2 or 3 and doesn't know what to do with it, it reverts back to 255. Is that right?
A: Reverting to 255 basically says to use the same kind of channel mapping table that channel mapping family 1 has, and then just, you know, decode each individual channel and not try to say anything about what that channel means. Which I think works basically fine for channel mapping family 2, but it's more problematic for channel mapping family 3.
G: This was why, if you'd seen the previous draft doc that we had sent out, the initial one that we proposed had both the mapping table that you saw in mapping family 1 and this mixing matrix. And there was some discussion, I believe Mark Harris and a few other people mentioned, that you could actually optimize that out by just including effectively that mapping table as part of the demixing matrix. So we removed the mapping table from the most recent doc, which I still think...
A: So we could do that, or we could just, you know, fix the draft, fix RFC 7845, to say what I meant to say when I originally drafted it, which is that the actual contents of the channel mapping table depend on what your channel mapping family is, and if you see one you don't know, you really can't do much with it.
G: With K, I see, yeah. The correct formula should be straight from the AmbiX specification, so we'll make sure that that's cleaned up; sorry about that. Yeah, one of those might be zero-based and the other one is one-based; I think that's what it is, I think it's some shift of...
G: Actually, I think it's correct. So if we start from 0, K = 0 would... yeah, actually, I think it's correct. K = 1 would be degree negative 1, because K = 1 would be the second channel, which would be first order, degree negative 1. But I'll double-check, I'll make sure that it's worked out correctly and get back to you later today on that.
G: Yeah, it will expect the stereo to be at the end on the input as well as the output. How we actually encode it: we'll probably send it to the coupled stream, so the mapping table will probably account for that and move the last two channels so that they can be coded as the coupled stream, and then move them back, obviously. Okay.
G: No, I think the plan is, or the way it works right now (I haven't upstreamed this code yet) is: if it does not detect non-diegetic stereo, and we just have some (n + 1)^2 number of channels, we just code a set of mono streams. But if we have that...
G: So I think the way the doc reads right now, and our understanding of how we expect the input to be, is that you would always signal some (n + 1)^2 (with or without that plus two) number of channels for either channel mapping, and if you were using partial order, you would need to submit zero-padded channels. Okay.
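Zero-padding a partial-order input up to the full (1 + n)^2 channel set, as just described, might look like this (a sketch; the function and argument names are mine, not from the draft):

```python
import numpy as np

def pad_to_full_order(present, n, frames):
    """Build the full (1 + n)^2-channel input, zero-padding unused channels.

    `present` maps ACN index -> signal array for the channels actually used,
    e.g. a mixed-order scene that omits some vertical components; every
    other channel is sent as silence.
    """
    full = np.zeros(((1 + n) ** 2, frames))
    for idx, signal in present.items():
        full[idx] = signal
    return full
```

For example, a first-order scene that only uses the omni channel (ACN 0) and X (ACN 3) would still be submitted as four channels, with ACN 1 and 2 all zeros.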
A: You reviewed the updated draft? Yeah, I think there are a few people who have actually looked at the ambisonics draft who we haven't been able to get to express an opinion on the list, but I know I've received a number of off-list comments about it along the lines of: we think this is great, what's happening?