From YouTube: Truly Excellent Digital Voice Quality: Opulent Voice
Description
Ham Expo presentation from September 2022 by Paul Williamson, KB5MU.
Transmitted voice can never be perfect. Some amateur radio voice modes, especially digital ones, are very much less than perfect. We propose and demonstrate a prototype of Opulent Voice, a higher bit rate digital voice mode that brings voice quality up to a new level.
Hi, I'm Paul Williamson, KB5MU, and I have a few words to say about voice quality in amateur radio. Since nearly the dawn of radio in the early 20th century, the spoken voice has been one of perhaps four main kinds of program material transmitted. Written text was the first: Morse code, radiotelegraph, and so on.
If not for these prohibitions, we would never have been allowed to share the amateur bands as we always have: informally, dynamically, and without much government involvement. Video has a different history. It's also strongly associated with broadcasting, but since video transmission was out of reach for amateurs in the early years, there was no need for rules prohibiting it.
The speaker creates sound, which is just pressure waves in the air. The listener detects these pressure waves using special hardware called ears, and processes the resulting measurements using even more special hardware called a brain. The subjective result created by that processing is what we experience as hearing. A perfect voice transmission system would be able to recreate that experience with no degradation whatsoever.
We have some powerful mathematics for single-channel signals. One thing we know mathematically is that any signal can be decomposed into components by frequency and reassembled without loss, so we can meaningfully describe a signal by its frequency distribution. Because of the way the human vocal tract works, a voice signal typically has a lowest frequency component, or fundamental, somewhere in the range of about 50 to 300-ish hertz.
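That decompose-and-reassemble claim is easy to check numerically. A minimal sketch with NumPy, using an invented test signal (a 150 Hz "pitch" plus two harmonics):

```python
import numpy as np

# Invented test signal: a 150 Hz fundamental plus two harmonics, at 8 kHz.
fs = 8000
t = np.arange(fs) / fs                # one second of sample times
signal = (np.sin(2 * np.pi * 150 * t)
          + 0.5 * np.sin(2 * np.pi * 300 * t)
          + 0.25 * np.sin(2 * np.pi * 450 * t))

# Decompose into frequency components...
spectrum = np.fft.rfft(signal)
# ...and reassemble: the round trip is lossless to numerical precision.
reconstructed = np.fft.irfft(spectrum, n=len(signal))

print(np.max(np.abs(signal - reconstructed)))  # on the order of 1e-15
```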
It turns out that human hearing doesn't necessarily need to actually detect the fundamental. The listener seems to hear the fundamental even if only its harmonics are present, so for voice communication systems it's common to assume that there's no need to transmit any frequency components below 300 hertz.
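The missing-fundamental effect can be sketched numerically: build a signal from harmonics of a hypothetical 100 Hz pitch, keeping only components at or above 300 Hz, and the waveform still repeats at the 100 Hz period even though no 100 Hz component is present.

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs        # one second, so FFT bins fall on 1 Hz boundaries
f0 = 100                      # hypothetical pitch; deliberately NOT transmitted

# Keep only harmonics at or above 300 Hz, as a telephone channel would.
signal = sum(np.sin(2 * np.pi * k * f0 * t) for k in range(3, 8))

# The spectrum contains no energy at the 100 Hz fundamental...
spectrum = np.abs(np.fft.rfft(signal)) / len(t)
print(round(spectrum[100], 3), round(spectrum[300], 3))  # → 0.0 0.5

# ...yet the waveform still repeats every 1/f0 = 10 ms, the periodicity
# the ear hears as a 100 Hz pitch.
period = fs // f0  # 80 samples
print(np.max(np.abs(signal[period:] - signal[:-period])) < 1e-6)  # → True
```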
A young person with undamaged hearing may be able to hear as high as 20 kilohertz, but this degrades with age and exposure to loud sounds to 12 kilohertz, 10 kilohertz, or even lower. Human hearing is also insensitive below about 20 hertz, so it's commonly said that the range of human hearing is 20 to 20,000 hertz. A system that reproduces all frequencies equally from 20 hertz to 20 kilohertz with perfect precision would certainly be considered a high fidelity system, suitable for the most demanding music listening.
Experiments with voice telephony have shown that the frequencies that are most important to voice intelligibility are those from 300 hertz to just 3,000 hertz. Listeners judge a voice as intelligible even if all the components below 300 hertz or above 3,000 hertz are removed. Again, that doesn't mean the listener can't tell that the frequencies are missing: any listener with normal hearing will certainly be able to notice the missing information in an otherwise clean environment. A hi-fi system with wide frequency response is a superior listening experience.
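That 300 to 3,000 hertz band-limiting can be mimicked crudely with an FFT mask. A sketch only: the tone frequencies are invented, and a real telephone channel would use a proper filter rather than zeroing FFT bins.

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs        # one second, so FFT bins fall on 1 Hz boundaries
# Invented tones at 100, 1000, and 3500 Hz; only the middle one is in-band.
x = (np.sin(2 * np.pi * 100 * t)
     + np.sin(2 * np.pi * 1000 * t)
     + np.sin(2 * np.pi * 3500 * t))

# Crude band-limiting: zero every component outside 300-3000 Hz.
X = np.fft.rfft(x)
freqs = np.fft.rfftfreq(len(x), d=1 / fs)
X[(freqs < 300) | (freqs > 3000)] = 0
y = np.fft.irfft(X, n=len(x))

amp = np.abs(np.fft.rfft(y)) / len(x)
print(round(amp[100], 3), round(amp[1000], 3), round(amp[3500], 3))
# → 0.0 0.5 0.0: only the in-band tone survives
```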
All the voices occupy roughly the same frequencies, so you might think that they would interfere with each other catastrophically and the listener would be unable to make any sense out of them, but we know from common experience that this is not the case at a crowded cocktail party, surrounded by many conversations.
This produces a radio signal that can be demodulated by a simple power detector. The simplest AM receiver and transmitter can each be implemented with just one transistor or vacuum tube. AM produces a radio signal that looks like this, viewed in the frequency domain, where amplitude is plotted against frequency. That spike in the middle is called the carrier.
That's the signal we feed into the variable gain amplifier in the transmitter. The stuff above the carrier is ideally an exact copy of the microphone signal; it's called the upper sideband. The stuff below the carrier, the lower sideband, is another exact copy of the microphone signal, but inverted in frequency. So low voice pitches make sidebands that are close to the carrier, and higher voice pitches make sidebands that are further away.
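The carrier-plus-two-sidebands picture falls straight out if you compute the spectrum of a tone-modulated AM signal. The carrier and tone frequencies below are arbitrary choices for the sketch, not anything from the talk.

```python
import numpy as np

fs = 100_000                  # simulation rate; one second gives 1 Hz FFT bins
t = np.arange(fs) / fs
fc, fm_tone = 10_000, 1_000   # arbitrary carrier and "voice" tone frequencies

# AM: the message scales the amplitude of the carrier.
am = (1 + 0.5 * np.cos(2 * np.pi * fm_tone * t)) * np.cos(2 * np.pi * fc * t)

amp = np.abs(np.fft.rfft(am)) / len(t)
# Carrier at fc, upper sideband at fc + fm_tone, lower sideband at fc - fm_tone.
print(round(amp[fc], 2), round(amp[fc + fm_tone], 3), round(amp[fc - fm_tone], 3))
# → 0.5 0.125 0.125
```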
Perhaps worse, a lot of our transmitter power is going into the carrier, which doesn't carry any information at all, except to provide a frequency reference. On the plus side, AM does not require any processing of the audio signal at all. The system is so simple: it just transmits a copy of the signal as it came from the microphone.
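How much power the carrier eats is standard AM arithmetic (this figure is textbook material, not stated in the talk): with a single modulating tone at modulation index m, total power is carrier power times (1 + m²/2), so even at 100 percent modulation the carrier takes two thirds of the transmitted power.

```python
# Single-tone AM power budget: total = carrier_power * (1 + m**2 / 2).
m = 1.0                                  # 100% modulation, best case for sidebands
carrier_fraction = 1 / (1 + m**2 / 2)    # fraction of total power in the carrier
print(round(carrier_fraction, 3))        # → 0.667
```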
Nonetheless, even today, this potential for high audio quality attracts a group of amateurs who operate vintage AM radios on the 75 meter band. With care, under perfect conditions, they just sound great. Inevitably, AM came to be replaced by a less wasteful method: SSB, which stands for single sideband.
You may have heard of the Collins S-Line radios. They were highly prized and expensive, mainly because they used very high quality SSB filters. With poor filters, which were more common, SSB can be pretty hard to listen to. Even with good filters, SSB has a problem: without a carrier there's no frequency reference, so reception depends on highly accurate and stable tuning.
This is less of a problem with modern radios, which usually have good frequency accuracy and stability. But if the tuned frequency is off, the received SSB signal comes out frequency shifted by the same amount. All the voice pitches are off. When the pitches are too high, it sounds a little bit like Donald Duck.
When SSB was introduced into amateur radio, many users objected. The voice quality was definitely worse, not for any unavoidable fundamental reason; perfectly implemented SSB can sound just as great as AM, in theory. In practice, though, SSB signals are usually not as good. Tuning them correctly and listening to them for long periods of time without fatigue are skills that can be difficult to learn.
If an interfering signal is present within the bandwidth of the desired signal, both AM and SSB pass the interference right through to the speaker. As radio conditions deteriorate and signal strength fades, the desired signal can be swallowed up by noise. However, the degradation is gradual and relatively easy to listen to.
The last main analog method of modulation is FM, frequency modulation. In FM, the voltage from the microphone is used to vary the frequency of an oscillator in the transmitter, instead of the gain of an amplifier as in an AM transmitter. Like AM, an FM signal has a central carrier and two sidebands, but the math works out differently.
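The "vary the frequency of an oscillator" description translates directly into code: integrate the instantaneous frequency to get the oscillator phase. The carrier, tone, and deviation values below are arbitrary illustrative choices.

```python
import numpy as np

fs = 100_000                             # simulation sample rate
t = np.arange(fs) / fs
fc, fm_tone, dev = 10_000, 1_000, 2_000  # carrier, tone, peak deviation (arbitrary)

# The microphone voltage (here a test tone) sets the instantaneous frequency;
# integrating (cumulatively summing) that frequency gives the phase.
msg = np.cos(2 * np.pi * fm_tone * t)
phase = 2 * np.pi * np.cumsum(fc + dev * msg) / fs
fm_signal = np.cos(phase)

# Unlike AM, the envelope stays constant: the information rides in the phase.
print(round(float(np.max(np.abs(fm_signal))), 2))  # → 1.0
```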
A commercial broadcast FM signal, with all its auxiliary subcarriers, occupies 200 kilohertz. In amateur radio, FM signals were originally allocated 60 kilohertz channels. This has been trimmed down to 30, 25, 20, 15, and even 12.5 kilohertz in some cases because of overcrowding. Even at these reduced channel spacings, FM inherently sounds pretty good.
If the signal is weak, random clicks that sound a bit like popcorn popping start to intrude on the received signal. Then, as the signal gets even weaker, noise comes up and overwhelms the received signal rather quickly. Because FM occupies a much wider bandwidth, its use is not allowed below the 10 meter band. On VHF and UHF, though, FM is king, for a number of reasons.
It may be hard to believe if you tune around the bands now, but our VHF and UHF bands, especially 2 meters and 70 centimeters, used to be overcrowded in major metropolitan areas. Every available repeater pair was in use, and during prime commuting time every repeater was busy with conversations. There was a demand for increased capacity.
The minimum standard for success was to fit two voice channels into the same 12.5 kilohertz channel that had accommodated just one. In the 1990s and early 2000s, industry developed several competing digital radio standards, including DMR and P25. The Japan Amateur Radio League developed D-STAR. In 2013, Yaesu introduced their system, Fusion, which is nearly identical to P25.
Here's where another important mathematical theorem comes into play. The Nyquist sampling theorem says that we can always sample a signal like this, as long as the sampling rate exceeds twice the maximum bandwidth of the signal. The stream of samples captures all of the information in the signal, and the signal can be losslessly reconstructed from the samples. That may not be intuitive, even if you've studied the proof of the theorem.
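The flip side of the theorem is easy to demonstrate: sample below twice the bandwidth and two different frequencies produce identical samples, so information really is lost. A sketch at an 8 kHz rate with arbitrarily chosen tones:

```python
import numpy as np

fs = 8000                      # sampling rate
n = np.arange(fs)              # one second of sample indices

# A 5 kHz tone violates the Nyquist limit at fs = 8 kHz. Its samples are
# identical to those of a 3 kHz tone (the alias at 8000 - 5000 Hz), so the
# two signals cannot be told apart after sampling.
above_nyquist = np.cos(2 * np.pi * 5000 * n / fs)
alias = np.cos(2 * np.pi * 3000 * n / fs)

print(np.max(np.abs(above_nyquist - alias)))  # effectively zero
```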
But it's a fact. So if we want to sample microphone data that contains frequencies up to 3,000 hertz for standard communications grade voice, we need to sample at least 6,000 times per second. The implementation turns out to be easier if we leave some extra room, so typically we will sample 8,000 times per second. Each sample needs enough bits to accurately capture the voltage; the convenient size is 16 bits per sample.
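Those two numbers pin down the raw, uncompressed bitrate:

```python
# Raw bitrate of the uncompressed voice stream described above.
sample_rate = 8000       # samples per second
bits_per_sample = 16
raw_bps = sample_rate * bits_per_sample
print(raw_bps)           # → 128000 bits per second, before any coding
```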
Speaker recognition, the ability to distinguish one person's voice from another, is significantly impaired. Listener fatigue is high; it's just not in any way pleasant to listen to. When the radio signal degrades enough to introduce some errors, these digital voice modes sound worse: distortion goes up, and intelligibility and speaker recognition go down. Even on internet connections, when radio transmission errors are not a factor, it isn't uncommon to be unable to understand what the other party is saying.
We do have room for this. The amateur bands are not that crowded anymore, and it's no longer difficult to use UHF frequencies instead of 2 meter VHF. The regulations, at least in the United States, allow much more bandwidth for digital transmissions in the 222 megahertz band and every band above that. In particular, for a digital satellite system that uses the 5 gigahertz and 10 gigahertz microwave bands for the uplink and downlink respectively, as proposed by ORI, there's plenty of room for a hundred of these channels.
We use the library Opus implementation, libopus, and we tell the Opus encoder that it is encoding single channel speech. We max out the input sample rate at 48,000 samples per second for full band frequency coverage, and choose one of their recommended frame sizes, 20 milliseconds. Importantly, we've chosen an output bitrate of 16,000 bits per second. That's between six and seven times as many bits as the AMBE-based digital voice modes use.
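A quick sketch of the framing arithmetic those settings imply. These are derived numbers only; the actual libopus calls that set them up are not shown here.

```python
# Derived framing numbers for the stated settings: 48 kHz single-channel
# input, 20 ms frames, 16,000 bits per second constant output.
sample_rate = 48_000
frame_ms = 20
bitrate_bps = 16_000

samples_per_frame = sample_rate * frame_ms // 1000      # input samples per frame
payload_bytes = bitrate_bps * frame_ms // (1000 * 8)    # encoded bytes per frame
frames_per_second = 1000 // frame_ms

print(samples_per_frame, payload_bytes, frames_per_second)  # → 960 40 50
```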
16 kilobits is actually near the low end of the bitrates supported by Opus. That's appropriate, since we're only encoding speech; we don't need high fidelity reproduction of music. To simplify the prototype design, we're currently using a constant output bitrate. This will probably change to a variable bitrate in the full implementation.
The M17 project is another effort to replace the AMBE-based digital voice modes with something better, but with somewhat different goals. Like us, they wanted to replace the AMBE voice codec with something free and open, and with better quality, but they also wanted to fit within traditional channel spacing, so they could not greatly increase the bitrate.
They use it at its maximum bit rate of 3,200 bits per second, only about one third faster than the AMBE modes. On top of that, they use a rate one-half convolutional forward error correction code to protect the voice bits from errors, and we've kept that, along with many other design decisions that we didn't see any need to change.
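A rate one-half convolutional code emits two output bits for every input bit. A toy encoder sketch: the generator polynomials here are the classic textbook K=3 pair, chosen for illustration, not necessarily the polynomials the M17 specification uses.

```python
# Toy rate one-half convolutional encoder with illustrative polynomials
# (G1 = 0b111, G2 = 0b101, constraint length K = 3).
def conv_encode(bits, g1=0b111, g2=0b101, k=3):
    state = 0
    out = []
    for b in bits + [0] * (k - 1):          # trailing zeros flush the encoder
        state = ((state << 1) | b) & ((1 << k) - 1)
        out.append(bin(state & g1).count("1") % 2)   # parity bit from G1
        out.append(bin(state & g2).count("1") % 2)   # parity bit from G2
    return out

encoded = conv_encode([1, 0, 1, 1])
# Two output bits per input bit (including the flush bits): rate one half.
print(len(encoded))  # → 12
```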
Voice quality through Opulent Voice is not perfect, but it is genuinely very, very good. At least I think so. What do you think? You've just listened to this entire talk through the prototype Opulent Voice implementation. As Madge would say: you're soaking in it. I'll be happy to answer any questions you may have in the remaining time.