From YouTube: Foundations of Data Availability Sampling
Description
A Data Availability Sampling (DAS) research paper by Mathias Hall-Andersen, Mark Simkin, Benedikt Wagner.
The lecture was given by Benedikt Wagner on April 19th, 2023.
B
So yeah, thanks for introducing me. I'm happy to speak here about this project on data availability sampling. Over the last couple of months we tried to understand: what is this data availability sampling, and how can we formalize it? So let's set the stage: what is the setting that we are looking at, and why do we care about it?
B
So let's say we have some peer-to-peer network, and we have a party, let's call her Alice. She holds some data, for example a block that should be posted in this peer-to-peer network, because this network runs a blockchain. This block has some associated header information, and all the parties in this network download the entire block along with the header. Okay, but that's not the end of the story.
B
We also have parties that we call clients, and these clients are restricted in terms of resources, so they are not able to download the entire block. Instead, they only download the small header, but they still participate in this protocol: they want to verify that the block is correct in some sense.
B
We don't care about any privacy here, so we typically don't model the clients as malicious. Okay, so that's the setting. Now, of course, you could ask the question: what does it mean formally that data is available? This is a very vague term, and it is not really clear what it means. And even if we have a formal definition of that, how can you construct a protocol? How can you implement this efficiently?
B
B
So Alice has the data, and let's say this data is K chunks of equal length. Now Alice wants to send it to the network, but before she does that, she commits to this data using a Merkle tree, which gives her a Merkle root, and that will be our header. So the clients download this Merkle root, and the data is sent to the network. Okay. Now, to check that the data is available, the clients will query the network.
B
They will sample an index at random, for example this index i, and then they get back from the network the i-th data chunk and an authentication path that convinces them that this data is consistent with the root. Here we already see why we need this Merkle root: we need it to make sure that both clients get consistent responses. This is an important aspect.
B
They need to make a lot of queries, though: essentially as many queries as they would need to download the entire data. So this is really not a good idea. It is a first try, but it's not really a good scheme. Okay, so let's think about where this problem comes from.
B
Well, it comes from the fact that we send the data to the network without adding any redundancy. That way, if we take the data that is stored in the network and delete just one symbol, then the original data is not available anymore. So what we will have to do is add redundancy. Okay, so let's give it a second try.
B
So let's add redundancy using a code. Again, say Alice has her K chunks of data. Now she treats them as the coefficients of a polynomial of degree K minus 1 over a finite field, and then she sends to the network not the data itself, but instead evaluations of this polynomial, say 2K of them. In other words, what she does is encode the data using a Reed-Solomon code. Okay, and as we saw before, we need some form of commitment to make sure the clients get consistent responses.
B
Here we deal with polynomials, so it's a natural idea to use a polynomial commitment scheme such as the KZG commitment. The clients will get the KZG commitment, which is just one group element, and then, along with the Reed-Solomon encoding, Alice stores all the opening proofs of the KZG commitment in the network. Then, when clients query random positions, they get back the evaluation along with the opening proof, and if this all verifies for enough queries, then they are happy. Okay.
B
So this is a good scheme, at least intuitively, but it's not really clear that this is the case from a formal perspective. So we ask the question: is this secure? To answer this question, we first need a security definition. I just sketched an adversary, but it's not really clear what this adversary is allowed to do. That's why we need a formal definition, and once we have a formal definition, we may use it to quantify how much better this scheme is compared to the first scheme.
B
So how much do we gain from using these codes? Okay, so that's the setting and two examples. I think we now have a feeling for what data availability sampling should be, intuitively. Now let's look at our results, and I want to split the rest of the talk, where I present the results, into three parts. First I will show you our formal definition of data availability sampling as a cryptographic primitive, and then we will look at the constructions, and for that we developed a framework.
B
I will explain that framework first, and then we will look at specific constructions. Okay, so let's look at the definition now: how can we define this as a cryptographic primitive? The first thing you should do when you define a cryptographic primitive is think about the parties that run this protocol and the algorithms that they run. So you specify the parties and the algorithms. Okay, we saw this before: we have our proposer.
B
The proposer takes some arbitrary data and encodes it using some algorithm. This outputs a commitment, sort of the header that the clients will download, and also an encoding, which is what is stored in the network. And here you see one aspect of our definition: we don't model this network as a set of many parties. Instead, we say there is this encoding, it is somehow stored within the network, and the clients can query it. Okay, so what do the clients look like?
B
Okay, so that's what we have seen so far. I think nothing really special has happened yet, but now the question is really: what is data availability? How can we say that data is available? We formalize this via an extractor. This is an algorithm that takes as input some set of transcripts and also the commitment, and it extracts the data. From now on, we will always say the data is available.
B
If this algorithm can extract it, then the data is available; so availability means the extractor can extract it. Okay, and this is not just some concept that we have in the definition: you can think of it as being run by any node that collects enough transcripts. So whenever these clients finish their interaction, they can just send their transcripts to everyone they know, or everyone they want to. And then, if you collect enough transcripts, you run this extractor and get back the data, so the data is still available.
B
Okay, so that's the syntax. Now let's think about the properties that we want: what kind of properties should these algorithms satisfy? The first thing you always want is some form of correctness or completeness definition, which just says: if everyone is honest and behaves as expected, following the protocol, then what you get is what you expect. In our case, this would be as follows.
B
If I take the data, encode it, and run these clients, then all clients should accept, because the data is available, and also what I extract in the end should be the same as what I put in at the beginning. So this is just completeness. But of course we cannot hope that everyone is honest, so we need some security notions, and I already mentioned that we assume the proposer and the network are dishonest: they want to convince these clients, or verifiers, that some statement is true.
B
Whenever you have something like this, it's kind of a proof system. You want some form of soundness, telling you that if these clients accept, then the statement is really true, and in our case this means the data is available. So what we want is: in this setting, where the encoding and the commitment are controlled by the adversary, if the clients accept, then the data is available.
B
Okay, so what does it mean that data is available? Well, we can extract it. But now, if we look at this picture, we see that the encoding is controlled by the adversary, and the commitment as well, so there is no original data. Okay, so that's the first challenge, and let me just explain what it means that the adversary controls this encoding.
B
Okay, I want to emphasize two things here. The first is this word "enough". Typically in a proof system you would assume that if one verifier accepts, then the statement should be true, so the data should be available. But we cannot really guarantee this for any reasonable system. So what we require is that enough clients accept, and let me explain this. Say you have an adversary, and this adversary targets the first client, in the sense that it takes some data and encodes it honestly.
B
It responds honestly to the first client, and then, by completeness, this client should accept. Now, for all the other clients, the adversary just says: I give you garbage, or random things, or nothing, as responses. Then, of course, every piece of information about the data that can be used in extraction comes from the first transcript. So if the first client makes only a few queries, then you cannot hope to extract. Okay.
C
Excuse me, yes, is it okay to ask questions now, or should we defer them?
C
I have a kind of notational question. You mentioned in the beginning that some data is stored in the network, but I can't fully understand the difference between the network and the proposer. When you say that the adversary can answer queries differently, this actually means that nothing is really stored in the network: the adversary just has some storage and answers arbitrarily, or by some algorithm. So what...
B
So, good question. In the honest case, we have this party that comes up with the data, or somehow has this original data, encodes it, and stores it in the network. In the dishonest case, here in the soundness notion, we just don't want to trust the network, because the network is, in the end, the thing that wants to convince us that the data is available. So we should model it as the adversary, and we will see in the constructions that answering inconsistently is not really possible.
B
Yeah, so that's a very strong notion. Okay, I hope this somehow answers the question.
B
Yeah, so where was I? Okay, yeah. The first thing is that we have to assume there are enough clients, just because of this adaptivity, and how many clients we need is a parameter of the scheme; we will come back to that later. And then I say we extract something, but I don't say anything about what we extract. So you could just output some default data and satisfy this notion, which tells us that this notion alone is not enough.
B
So again, as I said before, these clients take their transcripts and send them to some other parties, and whenever you receive enough transcripts, you can extract the original data. Now let's say you have two nodes that do this: you run this extraction on two nodes, for the same commitment, with two sets of transcripts. And now, of course, what you want is that they extract the same data.
B
You cannot say that they extract some specific original data, because the adversary controls the way this commitment is created, so there is no original data. But what you want is that they really extract the same data, and this is consistency. Now, if we look at this picture, this is still a very specific setting: there are two disjoint sets of clients, and you try to extract from their transcripts.
B
What we really want is a more powerful notion where we don't care where the transcripts come from, because if I try to extract, I don't really know who made up these transcripts. So in our consistency notion, we give the adversary the power to come up with these sets of transcripts. The adversary comes up with one commitment but two sets of transcripts, which do not have to be disjoint, and then I try to extract from both. I get two candidate data, and my requirement is that they should be the same.
B
So the commitment somehow binds you to the data. Of course, I can only require this if the extractions do not fail. And now we also see that outputting some default data is not a good scheme: it will not satisfy this notion, because the adversary can just behave honestly in the first set, for some random data that it encodes, and force the second extraction to output the default data, and then this will not work. So consistency tells us something about what we extract.
B
We have a soundness notion telling us: if enough clients accept, we extract something, so some data is available. And then we have a consistency notion which says something about what we extract, namely that we never extract two different things. So if you get a commitment, and some other party gets the same commitment, you are sure that you both get the same data. So that's our basic definition, and then we thought about extensions.
B
So how can we extend this definition with more functionality that may be useful in practice? The first extension we looked at is repairability, and here the idea is: what if your encoding is broken and you want to repair it, to go back to a stable state where you have something that is close to the original encoding and works as expected, but you don't want to change the commitment?
B
You don't want to redistribute all the commitments; you sort of want to do this repairing transparently, so you need some way to repair the encoding such that it still works with the old commitment. And then we have a second extension, which is local accessibility. Here the idea is: what if you are not interested in the entire data, but instead only in parts of it, for example one symbol? Then, of course, you could just wait until you have enough transcripts and reconstruct the entire data, but that sounds wasteful.
B
So we define syntax and also security notions for these two extensions, but I don't want to go into detail here, because I want to have more time for the constructions. So yeah, that's it for the definition side. Maybe now is a good point if anyone has another question about the definition.
E
Can you go back one slide? It's just one question about what you mean by repairability, because this way of modeling the network is a bit strange; just to make sure I understand it correctly. Does this repairability, or repairing, mean that some of the queries are not answered at all, or what is the setting?
B
When I say the encoding is broken, what I mean by that is just that I see that some clients don't accept; then I know something is wrong. And that means, okay, you're right, there's no explicit encoding anymore, because now I'm in an adversarial setting, but I see something is wrong, and then I want to go back to a state where everything is fine, without changing the commitment.
E
Yeah, it's just that I would probably really like to see the definition, maybe, but yeah.
B
Sure, yeah. It took us some time to define it, so you're right, this is a bit tricky. Okay, so any other questions about the definition? Okay, good. So now let's look at constructions. We have this definitional framework now, and we looked at the existing constructions and also thought about new ones, and what we saw is that they all follow sort of a similar framework.
B
So we abstracted this and came up with this construction framework, and it allows us to separate sort of the combinatorial parts of the constructions from the cryptographic parts. Because all of the constructions follow this framework, I want to explain the framework first. You can summarize it as data availability sampling from erasure codes.
B
So we want to construct a data availability sampling scheme as I defined it a few minutes ago. For that we will use three components. The first is a so-called erasure code, and then we have a commitment for this. This is a new cryptographic primitive we define, an erasure code commitment; you can think of it as a generalization of a polynomial commitment scheme. And then we have a third component, a so-called index sampler.
B
So this is our framework. I want to explain how it works, but we have already seen one example of this framework at the beginning, when we looked at the Reed-Solomon-code-based construction. There, the erasure code was the Reed-Solomon code, the erasure code commitment for it was the KZG commitment, and the clients sampled their indices uniformly with replacement.
B
So this is one example, and now let me give more details. The first thing we want to look at is erasure codes, and for that I need to recall some coding theory, just to get everyone on the same page. So what is a code? A code is just a mapping, taking some element from Gamma^K and mapping it to Lambda^N. So how does this look?
B
You take some message, which is K symbols over Gamma, and you map it to a codeword, which is N symbols over Lambda. So you add some redundancy here, and you can think of it as follows: you have this small set Gamma^K, and you injectively map it into a large space Lambda^N, and this is your code. I should say that I'm typically referring to the code as the mapping, but at some points I will also refer to the code as the image of the mapping.
B
Okay, so what is an example of a code? One very important class is the class of linear codes, where your alphabet is just a finite field and you map a vector x to its codeword by applying some matrix to it: you take your vector x, multiply it with a matrix, and you get your codeword. An alternative way to look at a linear code is via the so-called parity check matrix. This is another matrix that helps you identify codewords.
B
So a vector is in the code if its product with this matrix is zero. Think of it like this: you have this long codeword, you multiply it with the matrix, and if it's in the code, then this gives you zero. So you can check membership in this code by a linear check, and we will use this later.
B
Okay, so one example of such a linear code is the Reed-Solomon code. Now, what are erasure codes? Erasure codes are again just any code, but with a specific way of looking at it: we look at it from the perspective of erasures. What we want is to tolerate erasures. What does that mean? Well, we have our message.
B
And, of course, if you have the case where the alphabets are the same, then the best you can really hope for is reception efficiency K: you can reconstruct whenever you have K arbitrary symbols. You cannot hope to reconstruct from K minus one, just based on information theory. One example of such an ideal erasure code is the Reed-Solomon code: whenever I have K points of my degree K minus 1 polynomial, I can interpolate this polynomial. Okay, so now we know what an erasure code is.
B
Now let's look at this new commitment scheme that we define, the erasure code commitment scheme. As I said, this is a generalization of a polynomial commitment scheme: you can think of a polynomial commitment scheme as an erasure code commitment scheme for the Reed-Solomon code. So how does it look? Again, we have an arbitrary code, and now let's say Alice holds a codeword c, which is C(x) for some message x.
B
So this could be the evaluations of a polynomial. Now Alice wants to commit to this codeword, and Bob wants to query positions of this codeword. Bob will query a position and get back the symbol of the codeword along with an opening proof tau_i, and then, of course, you can verify it. This is just as you would expect, as in a polynomial commitment scheme: you commit to a polynomial, and then you get back evaluations along with opening proofs. Okay.
B
So for this kind of commitment we of course need some security notions. For a commitment scheme you typically consider hiding and binding. We are not caring about privacy, so we only consider binding notions here, and there are two that we consider. When we came up with them, on the one hand, we wanted to be consistent with the notions that are already there for polynomial commitment schemes.
B
On the other hand, the notions should be strong enough to suffice for constructing data availability sampling. Okay, and these are the two notions we came up with. The first is position binding. You also have that in polynomial commitment schemes: the adversary outputs a commitment, and then it tries to open this commitment at one position i to two different symbols. It looks something like this: the adversary tries to convince you that there is this codeword that it committed to, but at one position it opens it to two different things, and this should not be possible. The adversary wins when all openings are valid and the two symbols are different. So this is position binding, and it is something you can easily achieve just using a Merkle tree. So this alone cannot be what we want.
C
Sorry, a question: why did you decide to commit to the codewords? Because, well, I think in the very first slides, when you sketched the first solution, you seemed to commit to the data rather than to the codewords. At which point did you decide to... okay, yeah.
B
So you can also phrase this as committing to the data but opening codeword positions: you input the data. This is sort of equivalent, I would say, because the code is injective. When you commit to the codeword, and it really is a codeword, then it implicitly commits you to the data. What we will do, and we will see this in a few slides, is take the data and input it to the commitment. But later, what is important...
C
It's just that when you say we now work with codewords, and we decided to commit to codewords, this looks a little bit restrictive, because in general, of course, you don't have to encode with a linear code; you could answer queries as just functions. Okay.
C
As long as you can recover the data from the function outputs, you should be able to define... okay.
B
But we will see that you can write most of the constructions in this way, so I claim this is not really a loss of generality, especially for the constructions that we consider. If you have other constructions, sure, you can do it in a different way. But this helps us avoid redoing certain proof steps for all the constructions, so yeah.
A
To clarify, the goal here is not to come up with a definition that covers all possible imaginable constructions. Rather, it is to come up with an abstraction that covers a large part of the constructions, and then you show that if you have such a commitment scheme, you automatically get a data availability sampling scheme. Now you just need to focus on building the specific commitment scheme rather than building the more complex primitive. It is not meant to cover all imaginable constructions of data availability sampling.
B
Yes, but it does cover a lot; that's why we look at this. Okay, so yeah, good question. So now we have this position binding, but as I already said, you can easily satisfy it: just take your data, compute the codeword, and compute a Merkle tree for it. So this cannot be the end of the story. Instead, we need a notion that relates to the code.
B
So this is code binding, and here, again, Alice outputs a commitment, same as before, but now what she outputs is just a set of openings, a subset of this codeword, and Alice wins if these are not consistent with the code: there is no codeword that is consistent with these openings. Okay, what does that mean? It means that whenever Alice outputs openings...
B
Then, if this scheme is code binding, we know that all the openings are consistent with the code. Okay. So, for example, if you have the Reed-Solomon code and you're talking about polynomials, then it cannot happen that Alice outputs K plus 1 points that do not lie on a degree K minus 1 polynomial. And this is something that the original KZG polynomial commitment scheme paper actually does not have.
B
You don't have such a notion there, so this is a notion that will help us, and we will see that these two notions together are sufficient to construct data availability sampling. Okay. So now we know what an erasure code is, and we know what an erasure code commitment scheme for some code is. I don't want to go into the details of the index sampler; this is a very interesting combinatorial object, and we studied it for a while.
B
But let me instead just show you how we construct data availability sampling from these components. We start with the data, and the first thing we do is use our erasure code to compute the codeword, and this will be the first part of the encoding that we store in the network. Now, of course, as we saw at the beginning, we needed the Merkle tree to ensure consistency, and here we use our erasure code commitment scheme for that.
B
So our erasure code commitment scheme will give us openings for these codeword symbols, and we group together each codeword symbol with its corresponding opening. So one symbol of our encoding for the data availability sampling scheme is a tuple which contains the codeword symbol and its opening. Okay, so this is the encoding. Let's now look at the clients. The clients get the commitment from the erasure code commitment scheme, and now they should query this encoding to verify that the data is available. So what do they do?
B
They first run this index sampler, which, as we now know, is just an algorithm that outputs Q positions, and then they query the encoding at the positions that the sampler outputs. So the index sampler is just specifying how we sample the positions that we query. Then I verify all the openings: from every query I get a codeword symbol and an opening, and I can verify it with respect to the commitment. If all the checks pass, then I accept and think the data is available.
B
If some check fails, then I reject. Okay, so this is the construction, and when we analyze it, we of course want to use the properties of the erasure code commitment scheme and of the code. When we analyze this, we have to think about three properties: completeness, soundness, and consistency.
B
So here is a summary. For completeness and soundness, we just use the combinatorial properties of these objects; we do not use any computational assumption or anything like that. Completeness and soundness follow from the reception efficiency and from a measure that we define for the index sampler, which we call the quality. If we combine them, this tells you how many clients need to accept such that you can reconstruct the data. So remember what I told you in the definition of soundness.
B
This intuitively makes sense: if you need to collect a lot of symbols of this codeword to be able to reconstruct the data, then you also need more queries from these clients, or more clients in general, to reconstruct the data. Okay. The third notion is consistency, and this one is computational, so we show it based on computational assumptions: we show it based on the position binding and code binding of the erasure code commitment scheme. The intuition is as follows.
B
Let's say you have two clients and they query the same position. Then, by position binding of the erasure code commitment scheme, they get consistent responses. And then, if you look at all the responses together, by code binding you know that they are consistent with some codeword, and that means that whatever subset of them you take, as long as it's large enough, it will give you the same message, so it gives you the same data, and you have consistency. So this is just a high-level sketch of consistency.
B
Okay, so that's our framework. It tells us how to take an erasure code and an erasure code commitment scheme and construct data availability sampling. So, as Mark said, we can now focus just on constructing erasure code commitments for erasure codes, and this is the third part.
B
Let's now look at specific constructions. Okay, so here's an overview of the constructions that we looked at in this project. The first one, which we have already seen a few times now, is to take the Reed-Solomon code and any polynomial commitment scheme, for example the KZG commitment scheme. By the construction that I just showed you, we can use this to build a data availability sampling scheme, and this already sort of proves that this construction is secure, just by instantiating our framework.
B
Okay, then we have a very generic construction where you just take an arbitrary code, commit to the codeword using a vector commitment, and then, to get code binding, you use a SNARK on top of it. So this is generic, but it's also very inefficient, because the SNARK is computationally expensive, for example if you use a Merkle tree as the vector commitment. I look at this more as an educational example of the general strategy you want to follow when you construct these commitments.
B
Okay, and then there's the Ethereum construction. This is the construction that is currently used by Ethereum, and we can actually parse it and view it as an instantiation of our framework. For that we look at the so-called tensor code and do some row-wise commitment; we will see that in a minute. Okay, and then, motivated by the fact that these constructions sort of rely on a trusted setup, at least the Ethereum one and the Reed-Solomon one, we asked the question: can you construct this without a trusted setup, and ideally from minimal assumptions, such as just using hash functions? We came up with two constructions, one based on hash functions and one based on homomorphic hash functions, and I will show you one of them later.
B
So we take every row and apply the code to it, and then we get this rectangle. The second thing we do is extend it along the columns: we apply the code to every column. You can write this very concisely in this form: you take your data, and encoding row-wise means multiplying by G transposed, while encoding column-wise means multiplying by G, where G is the generator matrix that we talked about before.
B
If you look at this code, and say we use a Reed-Solomon code as the base code, then every row and every column is now the evaluations of a polynomial; every row and every column should be a codeword. That's our code. Before I show you how to commit to such a codeword using the erasure code commitment scheme, let's first think about the properties of this code, and in particular about its reception efficiency: how many symbols do we need to be able to reconstruct the data?
B
And why do we care about this? Again, this affects the soundness and completeness parameters that we get. Okay, and you can sort of see that the worst case is something like this: we have all the highlighted symbols of this codeword, but you are not able to interpolate any of these rows or columns, and so it is about three quarters of the data that you need to be sure that you can reconstruct.
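A small simulation (my own sketch, assuming a rate-1/2 base code for illustration) shows why such a worst case exists: if a (k+1) × (k+1) block of the n × n encoding is withheld, no row or column through that block retains the k symbols needed for interpolation, so iterative row/column decoding stalls even though most symbols are present:

```python
import numpy as np

k = 4
n = 2 * k                              # rate-1/2 base code: any k of n symbols suffice
avail = np.ones((n, n), dtype=bool)
avail[:k + 1, :k + 1] = False          # adversary withholds a (k+1) x (k+1) corner

def peel(avail):
    # iterative decoding: any row or column with at least k known symbols
    # can be fully interpolated; repeat until no more progress
    a = avail.copy()
    progress = True
    while progress:
        progress = False
        for i in range(n):
            for line in (a[i], a[:, i]):
                if not line.all() and line.sum() >= k:
                    line[:] = True     # interpolate the whole row/column
                    progress = True
    return a

stuck = peel(avail)
frac = avail.mean()                    # ~0.61 here; tends to 3/4 as k grows
assert not stuck.all()                 # decoding cannot complete
```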
B
This is the reception efficiency. Now let's look at the commitment. So I said every row is in the base code that we use, and now let's assume we have an erasure code commitment scheme for this base code.
B
So if the base code is the Reed-Solomon code, let's assume this is the KZG commitment scheme. Now, because every row is already a codeword, why not just commit to every row? That will give us a set of n commitments, and the first idea we could have is to just group them together; that is our commitment. Now we should think about the two binding notions: position binding and code binding. To understand that, let's first see what an opening looks like.
B
Let's say I want to open this symbol of the codeword. It is, of course, associated to some row commitment, and what I do is just open it with respect to this commitment. That is how you open it. Now let's think about position binding and code binding. Well, position binding is easy, because every position, as you see here, is associated to a row commitment.
B
Okay, now let's think about code binding. For code binding, I want to be sure that whenever the adversary opens a set of positions, this is consistent with the code. What does that mean? It means that every row and every column is consistent with the base code. So whatever the adversary opens, I know that every row is in the code and every column is also in the code. Okay, what if a row is not in the code? Well, then, because I commit to rows,
B
I break the code binding of the underlying construction. Okay, so this half of code binding is clear, because I commit to rows. The challenge is: what if some column is not in the code? In particular, if you just do it like this, with no additional change, the adversary can form all these rows independently. There is nothing that ties these rows together, and therefore this cannot be code binding yet. But we can make it code binding with one additional check.
B
What we will do is check that a column is in the code by checking it over the commitments, and for that we will use the fact that the commitments, for example KZG commitments, are homomorphic. So we will do a homomorphic check that verifies that all these commitments together are in the code, so that all the columns are also in the code. Here we will use the fact that checking membership in a code is a linear condition, so we will use the parity check matrix. Okay, and then with some care,
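The idea can be sketched with a toy discrete-log-style homomorphic commitment com(m) = g^m mod q (my own illustration; KZG commitments are homomorphic in the same sense): a parity-check row applied to the committed values turns into a product of commitment powers, so a verifier can test code membership without seeing the values.

```python
from math import prod

q, g = 23, 2          # g has order 11 mod 23, so exponents live in GF(11)
F = 11

def com(m):
    # toy homomorphic commitment: com(a) * com(b) mod q == com(a + b)
    return pow(g, m % F, q)

# RS[4, 2] over GF(11): codewords are (f(1), f(2), f(3), f(4)) with deg f < 2.
# Membership is linear: the second differences vanish, which gives the
# parity-check rows below (entries reduced mod 11).
H = [[1, F - 2, 1, 0], [0, 1, F - 2, 1]]

def check(coms):
    # parity check lifted to commitments:
    # prod_j com_j^{h_j} == com(0) == 1 for every parity-check row
    return all(prod(pow(c, h, q) for c, h in zip(coms, row)) % q == 1
               for row in H)

word = [(3 + 5 * x) % F for x in (1, 2, 3, 4)]    # a codeword: f(x) = 3 + 5x
assert check([com(c) for c in word])

bad = list(word); bad[0] = (bad[0] + 1) % F       # corrupt one symbol
assert not check([com(c) for c in bad])
```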
B
you can analyze this, and you can show that it is actually code binding. Okay, but what I want to show you next is not this proof in detail; I think I sketched parts of it. Now I want to show you our new construction, and again the motivation was that here you rely on a trusted setup and expensive public key operations. So the question was really: can we do it just from hash functions?
B
D
Question: yeah, so I mean, you could obviously use FRI as a polynomial commitment, but that would be inefficient, computing all of these openings. So what we are looking at is whether we can avoid that and just do the proximity test with FRI.
E
A
B
Yes, that is what we are currently working on. Okay, so let me show you this construction. It is for so-called interleaved codes, and we only use hash functions for it. So what is an interleaved code? Again, I want to start with the code, and then I want to show you how to commit to a codeword. As for the tensor construction, we start with our data arranged in a k-by-k matrix, and, as for the tensor construction, we take some base code and extend all the rows using this base code.
B
But now this is where the similarity stops. We won't extend it along the columns. Instead, we group together the columns as symbols. So now our new code will have n symbols, and each symbol of the codeword is an entire column of this matrix. Again, you can write this in matrix form, and that also means,
B
if we look at this code, that the reception efficiency is the same as for the underlying code, but with the caveat that the symbols are now larger. So if you get K symbols of this new codeword, you have these symbols in every row, and you can reconstruct every row.
B
But again, the symbols are larger, and that means whenever a client queries a symbol, it will get an entire column. Okay. So now the first thing, when we look at this and want to construct an erasure code commitment from it, is that we should ensure it is position binding. And here, let's say we do a very naive thing: we just commit to all of the symbols using a hash function. Then position binding just follows from the collision resistance of this hash function.
B
Okay, of course, I cannot check that for every row, because I don't have the entire rows. But what I can do is sort of compress these rows into one row using a random linear combination. So I sample some random vector r, and I take the first row times r1, the second row times r2, and so on, sum that up, and I get one compressed row w. And of course, if all of these rows are in the code, then w is also in the code.
B
This is just because the code is linear, and the hope is that if one of the rows is not in the code, then w will also not be in the code. That is the hope. So how can we implement this? Well, we can just take a hash function modeled as a random oracle, apply it to the hash values, and get our r.
B
Then we take w to be r times C, so that is just combining the rows. Okay, and now our commitment would be the hash values and the w. The first thing you check when you get a commitment is that w is in the code, to sort of implicitly check that all the rows are in the code. Okay. So how do you verify an opening then? One aspect of this construction is that there is no explicit opening proof; instead, I just give you the symbol.
B
A symbol, to recall, is a column in this matrix. So I give you this column ci. You check that w is in the code (you always check that); for position binding, you check that the hash is consistent; and then you also check that, along this column, the linear combination was formed correctly: you check that r times ci is wi. That is how you check it. And now let's think about whether this is actually code binding. So the intuition is that,
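The commit/open/verify flow just described can be sketched end to end. This is my own toy instantiation (a [4, 2] Reed-Solomon-style base code over GF(251), with SHA-256 both as the column hash and as the random oracle that derives r), not the paper's parameters:

```python
import hashlib

p = 251                     # toy prime field (all symbols fit in one byte)
k, n = 2, 4                 # base code: [n, k], evaluation points 1..n

def encode_row(msg):
    # Reed-Solomon-style encoding: evaluate the degree-<k polynomial
    # with coefficients `msg` at the points 1..n
    return [sum(m * pow(x, j, p) for j, m in enumerate(msg)) % p
            for x in range(1, n + 1)]

def in_code(word):
    # k = 2: codewords are evaluations of a line, so successive
    # differences are constant mod p
    d = (word[1] - word[0]) % p
    return all((word[j + 1] - word[j]) % p == d for j in range(len(word) - 1))

def commit(data):
    # data: k messages of length k -> C is the k x n matrix of row codewords
    C = [encode_row(row) for row in data]
    cols = [bytes(C[i][j] for i in range(k)) for j in range(n)]
    hashes = [hashlib.sha256(c).digest() for c in cols]        # one hash per symbol
    # random oracle: derive the combining vector r from the column hashes
    seed = hashlib.sha256(b"".join(hashes)).digest()
    r = [seed[i] % p for i in range(k)]
    w = [sum(r[i] * C[i][j] for i in range(k)) % p for j in range(n)]
    return (hashes, w, r), cols

def verify(com, j, col):
    hashes, w, r = com
    return (in_code(w)                                         # w is a codeword
            and hashlib.sha256(col).digest() == hashes[j]      # position binding
            and sum(ri * c for ri, c in zip(r, col)) % p == w[j])  # r . col == w_j

com, cols = commit([[1, 2], [3, 4]])
assert all(verify(com, j, cols[j]) for j in range(n))
```

Note that an opening is just the column itself; the compressed row w and the per-column hashes inside the commitment do all the checking.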
G
B
This adversary breaks code binding: it outputs a commitment and some openings. Okay, and now this adversary breaks code binding if all these openings together are inconsistent with the code, so there is no codeword consistent with these openings. And now we want to rule that out. Okay, how can we approach this? The first thing we can do is say: if we model this hash function again as a random oracle, then these hash values define their preimage, and we can in fact extract it.
B
So this is a preimage C*, which is the preimage of all these hash values, and now, by collision resistance of the hash function, the ci's have to be consistent with C*. That means, in particular, that C* has to be not in the code, because that is the winning condition of code binding: whatever is consistent with the ci's is not in the code. Okay, but also all the openings need to verify, so that tells us that w has to be in the code.
B
Okay, and now the bad event that we have to rule out during the security proof is that these openings are consistent with C*, and they are also, via this r, consistent with w, but C* is not in the code and w is in the code. And now, if we just look at this, then, believe me, you can bound this, but there is a huge problem, and this is the term here: we need to use a union bound.
B
So why is that? Well, we don't get to choose the set I where Alice opens the commitment. Alice can just choose the set I in an arbitrary way, and we don't know anything about it, except that it is large enough. That means we have to rule out this event for every set I, and the only thing we can do here is use a union bound. But there is an exponential number of such sets, and that means the union bound will kill us here.
B
So this is not a construction that we can prove secure. Now, how can we change it? The problem really is that Alice chooses this set I, and an idea that you could have is: what if, let's say, Bob chooses the set I, or somehow the set I is chosen at random and Alice cannot choose it? This is how we fix the problem. Again, we have the same hashes for position binding, and we compute this random linear combination.
B
This is everything as we had before, and now our fix is that, in addition to that, we sample some subset J of all the positions. This can be a rather small subset, of size roughly the security parameter or something like that, and then, within the commitment, we force Alice to open the commitment at the indices in this subset. So now, if we look at the security experiment that we had before, Alice no longer chooses the set; instead, the set is chosen at random, and so we don't have to do this union bound. That is roughly the intuition of this construction.
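One way such a fix can be realized non-interactively, Fiat-Shamir style, is to derive the subset J from the commitment itself, so the committer cannot influence which positions must be opened. A minimal sketch, assuming SHA-256 models the random oracle:

```python
import hashlib

def sample_subset(commitment: bytes, n: int, t: int):
    # derive a pseudorandom subset J of {0, ..., n-1} of size t from the
    # commitment (random-oracle style): the committer fixes the commitment
    # first, so it cannot bias which positions get opened
    J, ctr = set(), 0
    while len(J) < t:
        d = hashlib.sha256(commitment + ctr.to_bytes(4, "big")).digest()
        J.add(int.from_bytes(d[:4], "big") % n)
        ctr += 1
    return sorted(J)

J = sample_subset(b"example-commitment", n=64, t=8)
assert len(J) == 8 and all(0 <= j < 64 for j in J)
```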
Okay. So that is our hash-based construction, and before I wrap up, I would like to compare these constructions and tell you the trade-offs that we have here. We have seen this Reed-Solomon plus KZG construction, and here the real advantage is that the commitment is constant size: it is just one group element, independent of the data size, and the same holds true for the communication complexity per query.
B
Now, here it scales with the square root of the data size, but we have an additional nice feature, which is that you can do the reconstruction in a local way, in the sense that you can take one row, reconstruct this row, and not care about the rest of the encoding. And now, if we compare it to our new hash-based construction, the clear advantages of this construction are that you don't need a trusted setup, we have sort of minimal assumptions, and you can implement this efficiently because you don't need public key operations.
B
You can implement this over small fields, which gives you computational efficiency. Of course, at the same time, you have a larger commitment size and larger communication per query, but there is one final advantage that I want to mention, and this is the total communication needed to reconstruct the data. What we can do is look at how many samples we need to make in order to be sure, with a certain probability, that we can reconstruct the data, and then multiply that by the communication per query.
B
So you get the total communication complexity, over all clients, that is required to reconstruct the data. We computed this, and it turns out that for this construction it is smaller than for the other ones. The main reason is that, for example, if you compare it to the tensor scheme, there you need three quarters of the symbols, so you need to make more queries to reconstruct. Okay, so that was some comparison. Let me now summarize what we have seen and what the next steps are for us.
B
E
B
Okay, so I see there is a question in the chat: are you working in an asynchronous or synchronous setting? Okay, that is a tricky question. So we don't define the communication between these parties. The only thing we define is the clients, as oracle machines that get oracle access to the encoding. So what we assume is that you make a query and you instantly get back the response.
B
So you query position i, and you get back the i-th symbol of the encoding, or nothing, or something that the adversary chooses, but there is no delay. So, in that sense, you should think of it more as a synchronous setting.
G
B
F
G
Of saying that you don't know when the clients are going to respond: you know that you will get a response, assuming they are online and available, but whether you get a response in a minute from now or ten minutes from now would be something you would need to model actively. Well, for partial synchrony you would need to consider what the actual time bounds are, and for full asynchrony you would need to say we can't model time at all.
G
A
F
A
G
That is not to say that... I mean, I guess you would probably need to set that timeout to be very high if you are doing it like that, because otherwise you have to start considering the setting where sometimes they are not going to be able to respond within 10 seconds, but they will respond within 20. So what do we do then? So it probably needs to have a very generous margin.
F
Yeah, but yes, that affects liveness. So, if you also want to consider whether this can actually work: yes.
A
Maybe a short comment on what Francesco wrote in the chat. That is a question of how you model it. You could say that completeness says: under good network conditions, with honest participants and an honest encoding, everything works well, in which case it would just be an additional assumption in your completeness, and then the data not being delivered would be part of the soundness. But you could also model it as saying you have completeness in an asynchronous network, and then you would have an extra definition.
A
G
D
A
After the paper was done... maybe one small comment regarding the fact that we abstracted away the network. In the talk, Benedikt was talking about the index sampler, and one thing that we did try to model, at least a little bit, is the fact that it is
D
A
uniformly random queries: in practice, you would do the encoding and split it across a bunch of machines. Now, if I do the uniform sampling, so I am one client and I collect 50 samples, I might potentially need to contact 50 different machines to get my samples. So that is why we have the abstraction of an index sampler, where we say:
A
maybe you don't want to have 50 uniformly random indices, but instead one random index and then 50 consecutive samples, in which case you would probably need to contact much, much fewer servers. So we will provide benchmarks and comparisons.
B
Yeah, as far as I understand, this is also what is done in the Ethereum construction, right? You group together symbols, and you don't query one symbol alone. You always query... I don't recall, I think it was 16 symbols or something like that.
D
B
D
A
And the other thing, regarding what Gotti asked in the beginning about repairability: maybe it is also worth mentioning that informally saying "oh, I want it to be repairable" is very easy, but actually writing down any sensible definition turns out to be extremely hard, because you have an adversary, and you want a repairability notion that says something is correct, but you didn't start with the correct object. So it is very tricky. We have one in the paper, formally written down,
A
which does seem to model something that looks sensible to us, but even that definition may still be a bit too weak. So, for example, if you want to distinguish the Ethereum construction from a standard Reed-Solomon code plus KZG construction, this definition does not necessarily highlight some practical differences that you would have. But making a stronger definition that does highlight them is difficult, because, I mean, you want to make a definition
A
that is expressive but also understandable, because a definition can be very expressive yet useless to use, or the other way around. So we tried to find the middle ground, and there is still maybe some work to do to find an even stronger, but still useful, definition of repairability. That is a very tricky topic, actually; it seems very innocuous, but it is very hard to define.
F
Yep, thank you. Yeah, awesome, great work. I will have to run, so yeah, thanks, and yeah. Bye bye. Great, nice.