From YouTube: Data Together Presentation at PASIG Autumn 2017
Description
Presentation about applying decentralized patterns to the activities of Libraries, Archives and Museums. Given at PASIG, a conference of digital preservation specialists. There is a less technical version of this talk, which goes deeper into the economic context, in the recordings from the 2017 NDSR symposium at the World Bank: https://archive.org/details/ndsr-dc-2017
Hi, this is Matt Zumwalt, and this is a recording of the presentation I gave at PASIG at Oxford University in autumn 2017. This was a meeting of digital preservation specialists from libraries and archives around the world. I work at Protocol Labs, which is a research, development and deployment lab for networks. Our main projects are IPFS, the InterPlanetary File System, and Filecoin, which you may have heard about in the news recently.
It shows a client-server architecture where, if you imagine each of the dots in each of these diagrams as being a machine or a node on the network, in the client-server structure you have one server that serves many clients, and those clients are unable to connect to each other; they are only able to connect to the server.

In the middle diagram, which this paper called decentralized, you have a federated architecture, where some machines are able to connect with each other, but the other client nodes are only able to connect to those supernodes; they are not able to connect directly with each other. Then, over on the right, you have a peer-to-peer architecture, where any node is able to connect with any other node, and is also able to rely on other nodes as relays when passing information between itself and other nodes on the network.
A
Instead,
it's
forced
to
exist
in
some
pre-configured
system,
that's
controlled
by
other
parties
and
as
as
my
work
progressed
in
helping
libraries
and
archives
to
build
their
own
systems,
I
became
convinced
that
the
systems
we
were
creating
was
still
we're
still
participating
in
that
same
system.
Despite
the
best
intentions,
we
were
just
building
another
form
of
cage,
rather
than
helping
the
patrons
of
those
libraries
and
Archives
to
thrive
with
their
data
existing
in
the
wild
and
so
I
became
fairly
strongly
convinced
that
the
solution
is
to
rebuild
the
web.
Now let's look at an example. In response to the election in 2016, people became concerned that the new federal administration would start making climate data less available, either by turning off servers or obstructing availability in some other way, and so there was an upswelling of effort of people attempting to move that data to other, safer locations. From a preservation perspective, it's important to think about what this means. This was thousands of people spending days, weeks or months of their lives attempting to preserve other people's data.
Now, what happened in this effort was that there were some unintended consequences. The thing to keep in mind is that these people were attempting to help those federal agencies. These are people who care about the EPA, and they care about the data produced by the EPA, but when they made replicas of the data, that data became something that was in competition with the original.
By switching to this alternative pattern on the web, it brings up a new kind of conversation. Rather than a competition to define which locations we should rely on to retrieve content, it becomes a conversation about what data are at risk, what are the ways they're at risk, why those data are valuable, who they are valuable to, and who should be holding copies of them.
What this does, in an underlying way, is turn access, discovery and preservation into participatory activities; they become conversations between people who care about data and the institutions and community organizations that represent them or aggregate resources from those communities. So if we express this as a simple principle, it's that information should be possessed by the people who rely on it, and that communities and institutions should aggregate information and resources to support access, discovery and preservation.
Now, from a libraries-and-archives perspective, there shouldn't be anything revolutionary about these ideas; they're simple and straightforward, and that's because libraries and archives were created as a means of sharing possession of information resources. But let's look at how the current structure of the web stands as an obstacle to achieving this. The current structure of the web uses an approach of location addressing, where we identify content by its location, rather than content addressing, where you identify information by its content.
For a simple example, think about how we identify a book. When I recommend a book to you, I recommend it based on its content: you should read the book with this title, by this author. I might tell you when it was published or who the publisher was, and you can go and find any copy of that book that matches that description of its content, and read it, confident that you're reading the book I recommended.
Now, if we instead use links to locations, each copy of the content becomes a distinct thing in a distinct location, with its own identifier, which has its own profile of who relies on it and points to it. That's where you've created this situation where, even if you were replicating the content in an attempt to reinforce availability, you're actually undermining the original copy, because you're competing with it to be the location.
Now, the alternative to this is to use a content-addressed approach with digital information. The key benefit of content addressing is that, if anyone on the network has a copy of the content, you'll be able to find it and retrieve it. The key idea here is that location doesn't matter; what matters is that you are able to get the content that you requested, and to be confident that you are retrieving the content that you wanted. The way we achieve this is by using cryptographic hashing.

Now, for a digital preservation community, this should be a familiar concept, because we use this pattern with fixity checking: you can put any content through a cryptographic hashing algorithm, and that algorithm generates a unique string of letters and numbers that represents the content. If you put exactly the same content through that algorithm, you will always get the same hash.
You will always get the same string of letters and numbers as a result; but if you put in content that has changed, even a single bit within that content, you will get a different hash. So what this means is that those hashes are universally unique identifiers for precisely that content. With a content-addressed approach, we use those hashes as the identifier for the content, and what this achieves is that benefit: if anyone on the network has a copy of that content, you can find and retrieve it.
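As a minimal illustration of that fixity property, here is a sketch in Python using the standard hashlib library (the sample bytes and the choice of SHA-256 are ours, purely for demonstration):

```python
import hashlib

original = b"Global mean temperature anomalies, 1880-2016"
tampered = b"Global mean temperature anomalies, 1880-2017"

# The same bytes always produce the same hash.
assert hashlib.sha256(original).hexdigest() == hashlib.sha256(original).hexdigest()

# Changing even one character produces a completely different hash,
# so any alteration is detectable.
print(hashlib.sha256(original).hexdigest())
print(hashlib.sha256(tampered).hexdigest())
```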
Let's look at how this affects patterns like accession of data. If I have content that I've been sharing with my peers using a content-addressed approach, at any point I can share that identifier with a library, and the library can accession that content. I would never have to upload the content; all I have to do is pass the identifier to the library, and they can retrieve the content from the web on their own.
But you can do this in a more robust way, where I might also submit metadata about my content, such as a version history. So you could have a series of versions of a dataset that I think are important for the library to hold in their collections; all I need to do is submit the hash of that versioned history of the content (so I'm submitting the hash of the metadata), and the library can pull in the whole corpus of information, including the metadata, as they see fit.
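A sketch of what that accession flow could look like, assuming a hypothetical pin_hash helper on the library's side that stands in for "retrieve this hash from the network and keep a verifiable local replica" (all names here are invented for illustration):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AccessionRequest:
    # The depositor submits identifiers only, never the bytes themselves.
    content_hash: str   # hash of the dataset
    metadata_hash: str  # hash of its version history / metadata

def accession(request: AccessionRequest,
              pin_hash: Callable[[str], None]) -> None:
    """Library-side accession: fetch and hold both corpora by hash."""
    pin_hash(request.metadata_hash)  # pull in the version history
    pin_hash(request.content_hash)   # then the content it describes
```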
Underlying this pattern, the thing that makes it powerful is that you're using hash-linked data structures. This is a pattern whose impact we've already seen in areas of the technology industry, through technologies like git and Apache Spark, or Bitcoin and BitTorrent; underlying all of these technologies is the pattern of hash-linked data structures. So what are the benefits here? Why are we using hash-linked data structures?
Underlying that is this notion of cryptographic integrity checking. You can use the link value itself, that hash, to validate the content that you got, and so this is a powerful tool for ensuring the integrity of entire systems of data over time, and as they're transmitted over the web.
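A toy hash-linked structure in Python makes the idea concrete: every object is addressed by the hash of its bytes, and links between objects are just hashes, so holding one root hash lets you verify everything it links to. (This encoding is our own illustration, not the actual format used by git, IPFS or Bitcoin.)

```python
import hashlib
import json

store: dict[str, bytes] = {}  # toy content-addressed blob store

def put(obj: dict) -> str:
    """Store an object and return its content address (its hash)."""
    data = json.dumps(obj, sort_keys=True).encode()
    digest = hashlib.sha256(data).hexdigest()
    store[digest] = data
    return digest

def get(digest: str) -> dict:
    """Fetch by hash, verifying integrity before returning."""
    data = store[digest]
    if hashlib.sha256(data).hexdigest() != digest:
        raise ValueError("content does not match its hash")
    return json.loads(data)

# A record links to a dataset by hash rather than by location.
dataset_hash = put({"rows": [1, 2, 3]})
record_hash = put({"title": "Example dataset", "data": dataset_hash})
assert get(get(record_hash)["data"]) == {"rows": [1, 2, 3]}
```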
Now, the most fundamental thing that's making hash-linked data structures powerful is that what you're doing is creating immutable data structures. This notion of immutable data structures is prominent in the field of computer science; it's been around since the advent of programming languages. The main idea here is that you can create data structures that do not mutate, and whenever you create a new version of the data, you have a new identifier for that new version.
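In that spirit, updating data under content addressing never overwrites anything; it yields a new object with a new identifier. Reusing the toy put() store from the sketch above:

```python
v1 = put({"rows": [1, 2, 3]})
v2 = put({"rows": [1, 2, 3, 4]})       # a new version of the data...
assert v1 != v2                        # ...gets a new identifier
assert get(v1) == {"rows": [1, 2, 3]}  # the old version is untouched
```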
So if I have a file, or a whole corpus of files or data on my system, I can tell IPFS to add that content to its repository, and IPFS will return the cryptographic hash that identifies that content. I can then use that hash to request the content through any IPFS node anywhere on the network, and that node will be able to use the hash to retrieve the content, validate it and return it to me.
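With the IPFS command-line tool installed and a daemon running, that add-then-retrieve round trip looks roughly like this (a sketch; the filename is hypothetical):

```python
import subprocess

# Add a file to the local IPFS repository. The -Q flag prints only
# the resulting content identifier (the hash of the content).
cid = subprocess.run(
    ["ipfs", "add", "-Q", "dataset.csv"],
    capture_output=True, text=True, check=True,
).stdout.strip()
print("content identifier:", cid)

# Any IPFS node can now resolve that hash, fetch the bytes from
# whoever holds them, validate them against the hash, and return them.
content = subprocess.run(
    ["ipfs", "cat", cid], capture_output=True, check=True,
).stdout
```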
IPFS nodes are also backwards compatible with the HTTP web, the regular web as we know it today: you can use these hashes to ask any IPFS node to retrieve content for you over HTTP. This allows tools like web browsers to use HTTP to retrieve content that's actually stored on the peer-to-peer web. Now, to look at a concrete example: the text is a bit small here, but this is a real hash for a snapshot of the English version of Wikipedia.
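In practice, an IPFS gateway exposes any hash at an HTTP path of the form /ipfs/&lt;hash&gt;, so an ordinary HTTP client can fetch peer-to-peer content; a sketch with Python's standard library (the identifier below is a placeholder, not the Wikipedia hash from the slide):

```python
from urllib.request import urlopen

cid = "Qm..."  # placeholder: substitute a real content identifier

# A public gateway resolves the hash on the peer-to-peer network
# and returns the content over plain HTTP.
with urlopen(f"https://ipfs.io/ipfs/{cid}") as response:
    content = response.read()
```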
What this opens up is the possibility of a notion of pinning. If I have content on my machine and you want to hold a copy of that content on your machine, what you do is tell your IPFS node to pin that hash onto that machine. What that tells the node to do is retrieve the content corresponding with that hash and hold onto it on that machine, until or unless you unpin it and run some form of garbage collection to clean up afterwards.
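That lifecycle maps onto three IPFS commands; a sketch of the sequence (the identifier is again a placeholder):

```python
import subprocess

cid = "Qm..."  # placeholder content identifier

# Pin: fetch the content for this hash and keep a local copy of it.
subprocess.run(["ipfs", "pin", "add", cid], check=True)

# Later, release it: unpin, then garbage-collect unreferenced blocks.
subprocess.run(["ipfs", "pin", "rm", cid], check=True)
subprocess.run(["ipfs", "repo", "gc"], check=True)
```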
So let's look at how this impacts accessioning, access, discovery and preservation, the activities that libraries and archives engage in all the time. The main idea here is that we can use sets of pinned hashes and treat them as collections; this gives us content-addressed, peer-to-peer collections. Since this group is focused on preservation, let's start there. The first benefit is that replication becomes easy and transparent. That, in and of itself, is extremely useful in a preservation context.
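For example, a collection could be nothing more than a shared set of hashes that every participating node pins, so each participant becomes another verifiable replica; a minimal sketch (the collection contents are invented):

```python
import subprocess

# A collection is just a named set of content identifiers.
collection = {
    "Qm...a",  # placeholder hashes; a real collection would list
    "Qm...b",  # the identifiers of the datasets it preserves
}

# Joining the collection means pinning every hash in the set.
for cid in collection:
    subprocess.run(["ipfs", "pin", "add", cid], check=True)
```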
When you're moving data between different storage devices or different storage contexts, you have the ability to replicate and then to validate the results very easily. But there's another powerful pattern here: consider a downloaded copy of the content, possibly downloaded by one of your patrons. If they downloaded it using a content-addressed approach like IPFS, that downloaded copy is a valid replica that you can cryptographically validate at any point, which opens up the door to a pattern of participatory preservation.
This is enabled by the ability to do integrity checking automatically on the content. But there are also some ways in which this benefits just the day-to-day activities of preservation, such as format migrations and versioning of content; those things become much easier when you're using a content-addressed approach.
This also impacts accessioning. The main way it impacts accession of content is that it lowers the barriers to accessioning content in the first place, and it anticipates the next generation of tools: the people who create and share data in their daily work are going to be using tools like git and Apache Spark, which are already using content-addressed approaches.
It also impacts discovery in really interesting ways. In one sense, the metadata that we're tracking about our collections is a dataset in and of itself, a dataset you could publish over the decentralized web, so that anyone could fork that collection, perform machine analysis on it or enrich its metadata, and submit the results, which could be validated and then potentially integrated into the main or official version that an institution or community maintains.
It also provides a powerful means of deduplication: deduplication of content, deduplication of effort and deduplication of metadata. So, for example, if I already have metadata for a particular piece of information, you could request the metadata and match it against the content using a hash of the content itself.
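A toy sketch of that deduplication idea: because a hash identifies exactly one piece of content, a metadata index keyed by content hash lets two holders of the same bytes share a single record (the index structure is invented for illustration):

```python
# Metadata keyed by content hash: the hash identifies the content
# exactly, so identical content shares one metadata record.
metadata_index: dict[str, dict] = {}

def describe(content_hash: str, record: dict) -> None:
    # If a record already exists for this hash, the descriptive
    # work has already been done; nothing is duplicated.
    metadata_index.setdefault(content_hash, record)

def lookup(content_hash: str) -> dict | None:
    return metadata_index.get(content_hash)
```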
If you find these ideas interesting, or you'd like to participate in the conversation about where these technologies might go, we've been using the heading Data Together as a placeholder for those conversations. A number of organizations are already involved, and we'd love to hear your voice as well. We're not only interested in seeing how we can apply this to the data of the web as we know it today; we're also interested in looking at how we can apply these patterns to the next round of data that's arriving on the internet, such as the Internet of Things and the data being produced for mixed reality, augmented reality and virtual reality, which will play such a central role in the way that people encounter information over the coming decades. I hope you found these ideas interesting and inspiring, and I look forward to hearing from you.