Open Research Institute Presentations, 11 May 2022

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: Deep dive of an open source DVB S2 implementation for FPGA HAM Expo 2022

Description

This talk is by Andre Souto (subscribe to his YouTube at https://www.youtube.com/channel/UCqKVoVhkMJdDF4ThSSQ_vZw) presents a high level of the project then dives into how baseband scrambling, BCH encoding and LDPC encoding have been implemented. It also provides information about the how the DVB-S2 specifies these components.

A

Hi everyone thanks for joining the session. My name is andres otto and I'll make a deep dive of the dvbs 2 encoder implementation for fpga, one of the things we've been doing at phase four and firstly, I'd like to thank michelle, w5 and yv for all her phase. 4 work ron w6rz for all his amazing new radio work, without which this rcl implementation would would have been way way harder.

A

I also need to mention thomas and shu and paul for their contributions and help as well and before we dive in. um Let me provide some context by going over essentially three questions: what's the scope of the dvb s2 encoder how to interact with it and how does the system looks like? um So what is the scope? This is the function block diagram as it appears in the dvbs2 spec and I'm gonna start at the end of this diagram.

A

Again here we have the interface to the rf satellite channel and everything after this point um is platform dependent and ultimately depends on the interface to the rf transceivers.

A

So it's out of our scope here, the current implementation starts with the stream adaptation subsystem here and this effectively delineates the scope, so the current scope includes baseband, scrambler fact: encoding, bitmapping, physical layer, framing and modulation.

A

As of the time of this presentation, there is no padder and no physical layer, signaling and pilot insertion, but we do plan to include those as well.

A

The next thing I want to talk about is how to interact with it and the tvbs2 encoder receives baseband frames at the input and produces modulated physical layer frames at the output, and the interfaces we chose here are access stream for both and more on that in a second, the input stream interface carries the configuration for the frame and we have frame length, constellation and code rate, and these three make up what I call frame attributes and they define how the frame should be processed.

A

The remaining parts make up a regular axis stream. Okay, the output interface is very similar, but without the frame based parts. At this point, the data has been modulated, and so it's going to have inq components. Essentially, besides input and output streams, we also have one axillite interface for memory maps register space access.

A

This register space has three main uses that are configuration status and debugging.

A

We can configure the bitmapper tables, we can enable and disable dummy frame insertion, and we also need to configure the filter coefficients status and debugging are for things like monitoring, behavior, controlling the flow of data and both apply to several internal interfaces.

A

We can, for example, check how many frames in flight we currently have, or what's the biggest smallest frame that has passed through a certain stage, and we can also block a data stream for debug and allow, for example, a word at a time or a frame at a time and to check, for example, how the system behaves the input and output streams. Timing diagram looks like this. In this example, we can see an input frame configured as short frame qpsk and quadrate 1 4..

A

The encoder receives this frame and at some point in the future, we're going to see the output frame here and by the way.

A

This is what I mean when I say: frame parameters are constant for a frame during data transfer, and the last part I want to show on the quick intro topic is: how does the system looks like this is a high-level block diagram of the encoder itself as it's implemented in rtl, and it's very similar to the functional block diagram we've just seen these red blocks here are instances of the axis stream debug, which is the block I mentioned before. That allows us to monitor and control the flow of data.

A

A key difference, though, is the presence of this frame fifo. Besides the beat interleaver and it's used for qpsk frames, because qpsk has no beats interleaving.

A

Every block in the system looks similar to this, and most dvb specific stuff will also have frame attributes, but we can't really change them like this, because, for example, this green block here has nowhere to get data attributes from so to support every frame having a different configuration. We need to transport this configuration alongside the frame itself, and we do this by using a axis stream. Tid interface and essentially stid carries arbitrary data to mtid.

A

In this case, the arbitrary data is the configuration itself and which is encoded in this grey block. For example, we need frame length, consolation and code rate, so with the code tid to extract them and everything just works.

A

We can use this to transport. Anything that's constant during the frame, and that makes it very simple to connect multiple components you can see in the top level here, the frame configuration input is encoded into a vector and that's how it gets transported and to other blocks in the design- and this concludes the basics really so time to get into the cool stuff and we're going to start with the baseband scrambler, and this is a great first block to look at because it's very straightforward with few design decisions.

A

This is the text of the specification and the important bits here are. There is an ending as we need to follow. We need a feedback shift register with a certain polynomial and we need to load this shift register with a certain value whenever a frame starts.

A

The spec also provides a diagram that is actually very close to the rtl implementation. We have a 15 bit shift register and every time you get a bit at the input data in this shift register will move from the left to the right. We take bits, 14 and 15. Add them together to get the new value of bit 1. to get the output bit. We take lfs our bit 1 and xor. Slash, add with the input bit and then we're ready to get the next bit.

A

Looking at the block diagram, we can see that the baseband scrambler is quite simple and we have the lfsr here that is reset. Just like I described, and that's going to happen when the design as a whole is reset or during operation when stlas is asserted, and that indicates that a frame has completed.

A

So we reset the value to prepare for the next frame bit. 1 is then xored with input data and that will produce the output data. So remember, I said: there's few design decisions right here. I chose to register the ports and but when making this presentation, I realized it's actually better to move this to outside and register as needed and leave the baseband scrambler doing only what it's supposed to do but never mind.

A

Next, we have the fak encoding subsystem, which is made of three components: the bch encoder ldpc encoder and the bt interleaver. So dch encoding takes a baseband frame as an input and depends a parity check to that. The size of this parity check code is given indirectly in the spec, and the size is going to depend on the frame length and the code rate.

A

The size of the bb frame is given in the bch uncoded block column here and to get the size of the bch backward. We do bch coded size, minus bch, uncoded size and to calculate the actual parity check code. We use a set of polynomials and then we use frame length and code rates to identify which polynomial index to use and this index will then identify with the polynomial itself.

A

Moving on to the busy age, encoder block diagram and I'll coalesce data streams into a single line from now on, um so the internals of the encoder looks like this, and the crc max does the core clc calculation and that's going to include and select the appropriate polynomial.

A

This block has a constant latency so to make our life easier, and we add this extra stream delay to the data stream here at the bottom, to make both data and crc code. Words arrive at the same time in the output modes. So it's simple to choose which one we want to forward and the crc max has no back pressure, which means we only really need to tap off data to it, and we don't need to replicate the streams here.

A

Data and parity check code are multiplexed when the input frame completes and, and we need a counter to count how many bits the parity check code needs to fit. The output data with the counter essentially helps the shift register here to slice parity check code into data with chunks. um If that makes sense, we can look at this by looking at the flow of data right and so the baseband frame flows um to the output and through the crc max and once st last is asserted, indicating the input frame completed.

A

We count the number of words needed to shift data from the crc flip flop, to the output and and by the way the crc blocks here are auto generated.

A

You can find online crc generators out there and I'll leave a link to the one I used and somewhere in the slides, um but yeah any generator should suffice, really so uh moving on to the ldpc encoding. This is the most complex block in the system. Okay- and I think it's the most interesting as well.

A

um The faq encoding subsystem receives a baseband frame adds a bch fact code word and an ldpc fact code word: the spec gives the size of each one of these blocks in the same table as the bch stuff. We've just seen.

A

The size of the ldpc, fake, codeword is given by the difference between the ldpc coded block and the ldpc uncoded block, and the ldpc encoded block size only depends on the frame size really, but the size of the ldpc uncoded block, which, by the way is the same as the bch coded block, depends on both frame size and code rate.

A

This is the text for the encoding process, as it appears in the spec, but I figured it might be maybe a little too boring if I just read this. So what I'll try to do instead is show you how the process looks like in a minute or so we start with the data frame. This data frame has a certain length and for this example, it doesn't matter really what the length is, but the spec defined the frame as having kldpc bits.

A

This frame is then divided in groups of 360 bits. Each the parity bits will be stored in a memory, and I'm going to call this the parity memory.

A

Each data bit will be xored with values pointed by offsets given by the parity bit address, table and essentially each group of 360 bits and we'll use one row from the parity bit that is stable like this, and so the first bit of the first group will use row zero and I'm gonna put some numbers to help us visualize. Okay, um bit. Zero will then be added that is xored in offsets three five, seven and eleven.

A

The second bit also uses row zero, but we add q to the offset um q is an integer that depends on the frame length and code rate for this example. Let's assume q equals three so for before bit, one the offsets are gonna, be um three plus three: five plus three seven plus three and eleven plus three.

A

The third bit is exactly the same as the second bit, except instead of adding q, we add 2 q. So it's going to map to 3 plus 6 here and then the others.

A

The remaining bits of this group will use the same process um over row: zero uh incrementing, the multiplier of q. Each time the remaining groups will use the same process, but using their associate rows and every time we change rows, we reset the q multiplier to zero.

A

Once all groups are finished and the parity bit memory will be filled with the parity values. For the final step, we consider only this memory and to get the complete ldpc code word. We use a 2-bit sliding window to select the first two bits and add them together to get codeword bit 0.

A

next, then we slide this window by one bit and do the same to get um codeword bit one.

A

We repeat this process for the entire parity memory to get the complete code word and this completed code word is then appended to the input frame and and then we are ready to process the next frame so to understand the implementation.

A

um I'm gonna, divide and conquer the ldpc encoder is actually made up of two sub components: the ldpc table in the ldpc core and the ldpc table essentially unrolls. The parity bit addresses, which means it deals with selecting the correct base parity bit address, table, given the frame length and the code rate, and it will track groups of 360 bits, adds q to the offset correctly um and so on, um and the ldpc core actually does the parity calculation itself, and so, let's have a look at the ldpc table.

A

First, so first of all, a dedicated module makes testing ways simpler and faster, and next, if we go through the dvb spec appendix's bnc will find 21 tables and each one has a variable number of coefficients, but in total that is almost here, six and a half thousand and coefficients, and we need 16 bits to represent them um as we've seen previously. We need to unroll the tables um and we need to do that on the fly, because if we start the unrolled values, it would be just way too much.

A

Data to start internally here are four examples of how tables actually look like in the spec. These are all for normal frames, but the short frames ones are very similar. So, let's zoom in in the first one we can see it has two sections: the top one has 12 rows of 15 values and the bottom one has 30 rows of three values.

A

The next one looks very similar and it has two sections 20 by 12 at the top and 12 by 3 at the bottom.

A

The next table is the same pattern and two sections 36 by 12 at the top and 72 by 3 at the bottom, and in fact all tables have two sections and the number of rows and columns within each section is constant and we can use this to our advantage and and build a generic table reader that reads the shape from of each table from one memory and the actual table data from another memory, and this is going to help um with efficient resource usage.

A

So let's have a look at how this works. This is an example of a table yeah, it's small, just to keep stuff simple. Okay, what we're gonna do is put the table content in one memory and its size and in another memory and the tables themselves are always flattened by the way and to read this flat memory region. We need to know basically where the table starts, the shape of the top region and the shape of the bottom region.

A

So, let's say the next table looks like this again. We split stable data from the table size and like with the previous table. We need to know where this table starts the shape of the top and the shape of the bottom section, and we keep doing this for all 21 tables to actually unroll the table. We we also need the value of q, so we add this to the metadata table.

A

In the end, we have one rom with all flattened tables, stored continuously and a metadata rom that tells us how to read this table rom.

A

In summary, the coefficients rom has around 100 kilobits worth of data and it uses seven block ram 18 on xynx fpga. The metadata memory is much smaller only around one kilobit and it will likely be mapped to lookup tables, so piecing everything together, frame, type and code rate address the metadata table. The metadata table outputs, the relevant info to the unroll logic. The unroll logic reads: the parity address table um to generate offsets for the ldpc core calculation and the unraw logic here is just a bunch of nasal counters.

A

So I'm not going to go into details.

A

Timing, wise, a single beat, is going to start the process. The ldpc table then generates one offset per clock cycle.

A

Besides the coefficients, the ldpc table generates an output called m next, which marks the end of a row that is a new bit can be processed, and this means that this period here corresponding to offsets, 3, 5, 7 and 11, are related to bit 0 and then the following period corresponds to bit. 1 then bit 2 and so on.

A

This example shows the output for the ldpc in a minute slide, so q equals 3, and then you can see that uh when we change from bits zero to one, the offset is incremented by q and and that's going to be true for the second for the third offsets um and so on.

A

Now, let's have a look at how the parity calculation is done in the ldpc core encoder. This is the encoder's block diagram. This block takes data and table inputs. The input data stream is replicated in two branches by the way replicating means a single stream becomes multiple streams.

A

In this case it allows both output mugs and the input synchronizer blocks to back pressure, their inputs, and this back pressure is applied to the data input, so the branch on the top goes straight to the output and the other one goes through the input synchronizer and the ldpc accumulation.

A

The accumulation logic uses offsets from the ldpc table to xor data in the associate positions of the parity rim. While that's going on it will keep the output mux. Selecting data from the axis stream replicate. That's the baseband plus bch fact passing through once the input frame completes it will switch to receive data from the post frame accumulation, slash with conversion blocks.

A

The frame ram, uses a block ram internally and block ram's data width is naturally 16 bits technically 18 bits because ecc but anyway, that's why we have this width converter from 16 to n, to convert parity data to the output data width. The ldpc input sync is used to synchronize table offsets to data bits because parity bit offsets are given one per cycle. We need to convert the input data stream to one bit as well. This ensures that output, data and output offsets are synchronized.

A

Data passes through the width converter directly. Offsets pass through literally directly output will be valid when both width, converter and table have data. The system is ready to consume data from the width converter when the ldpc table indicates next and the reader is ready to consume this data as well. We can consume an offset from the ldpc table whenever there is a valid data bit and the reader is ready to consume this data bit as well.

A

The accumulation process is the heart of the ldpc encoding. This blocks connects directly to the input synchronizer, so here offsets are used to read the paired ram, because each ram address holds a 16 bit value. We divide this offset by 16..

A

The remainder of the division is used to operate only on the specified bits of the parity ram. Read data within xor the input data bit with the specified bits of data from the ram and write the results back to the ram itself.

A

Block rams have an intrinsic two cycle: latency between reading and address and data showing up in the read data port and because we are using the same address for both read and write, we're effectively limited to one operation every three cycles. So let's change this a little bit.

A

These flip flops here at the top are used to match the ram latency. This way we can decouple, read and write addresses in time, because we don't need to hold read address constant until right. Data is ready by the way we can call this a pipeline node.

A

However, this two-cycle latency also applies between write data and read data, and this is an issue in cases where we're reading data from an address- that's been written less than two cycles ago and write data won't make it in time. This is called a data hazard and it does happen in some tables in some coefficients.

A

This is solved inside the parity rim itself by doing some tricks around it, so the actual storage is still a regular block ram, but we've added some flip-flops responsible for storing pending rights, and these are rights that have not yet been committed to the block rim itself.

A

Read address is still connected to the blockram, but is also used to check if the address being read is pending, so that read, data will always have the most up-to-date data. In other words, read data can be either data from the block ram or from one of the pending rights.

A

Suppose we are reading and writing from the same address at the same cycle. Write address, write, data and read address will still go to the block rim, but we can compare the right address and the read address if they match we bypass write data.

A

If we're reading from an address written in the previous cycle, we can do the same with data after the first flip-flop and likewise for reading an address written two cycles ago, but with data after the second flip-flop for anything over two cycles, data will just come from the block rim itself.

A

Unfortunately, this session time is up, even though there is still more to cover this session only covered baseband scrambler, bch encoder and ldpc encoder, but there is plenty of interesting blocks too. If there is enough interest, I can cover those in the future. I would like to share the current state of the project, though we have verified all 84 valid configuration combinations in both simulation and hardware, using new radio results as a baseline, and this includes varying configuration on a per frame basis, and we don't need to stop data flow change parameters then restart.

A

Although there are some caveats, we also have a continuous integration pipeline running on github. That runs tests on gnu, radio plus ghdel and some tests with uses if you're not familiar ghdl, is an open source. Vhdl simulator and the os is an open source. Rtl synthesis suite we're currently bringing up an over-the-air setup to test against lab equipment in ori lab west located in san, diego california, and that's the image on the left.

A

Michelle and paul have been doing an amazing job in the lab. I am trying to get more time to help them out and I'm sure we can get this setup up and running with lab gear, commercial gear and that's gonna, be an incredible milestone. I'm I'm sure there will be some issues. I mean there's always issues, but we are making progress and that's great.

A

If this project got you interested and you'd like to get involved, you can check out the github repository phase foreground or phase 4 space websites.

A

Also, if you'd like to get in touch with myself here, are some links as well and by the way the github user is slightly different than my last name, but it's not a typo. So if you're watching this on home expo 2022, I am now happy to take questions. If not, I will see you next time.