From YouTube: Video and Metadata Standards with Livepeer - Yondon Fu
Description
Yondon will share learnings and observations from Livepeer, as they work across ecosystems to help steward shared open standards.
My name is Yondon and I'm one of the co-founders of the Livepeer project, which is a decentralized video streaming network. I'll talk a little bit more about Livepeer as we go through the presentation, but the title of today's presentation is Video and Metadata Standards, and I'm going to go through a few topics today.
Taking a step back, I'm going to talk about a decentralized storage video compute pipeline, which we think we can enable for the broader web3 community to support the next generation of enhanced video experiences on the internet. But before getting into all of that: metadata standards. This is the starting point, and the first question that I think is worth asking is, well, what do we mean when we say metadata? This might be an obvious question to a lot of folks here, but in the interest of establishing firm ground before moving forward: at the end of the day, metadata is the set of data that describes another piece of data, hence the term "meta." I think the concept that best illustrates this today is NFT metadata.
On this screen, on the top, we have a screenshot of OpenSea showing an Azuki NFT, and on the bottom we can see the actual metadata associated with this NFT. An NFT is an interesting form of on-chain asset that also carries on-chain property rights, and this notion of ownership on chain is super important. However, the off-chain metadata linked with that asset is just as important, because that information gives the NFT additional meaning, whether it's traits (in this case, the different attributes associated with the Azuki NFT) or links to other pieces of off-chain content such as images, videos, and so on. So I think this is a nice example that illustrates why metadata is relevant.
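To make that concrete, here is a minimal sketch of ERC-721-style NFT metadata, written in TypeScript for illustration. The field names follow the common marketplace metadata convention; the specific values and the CID are hypothetical placeholders, not the actual Azuki metadata.

```ts
// Metadata is data that describes another piece of data: here, the NFT.
interface NftMetadata {
  name: string;
  image: string; // often an ipfs:// URI pointing at off-chain content
  attributes: { trait_type: string; value: string }[];
}

// Hypothetical example in the spirit of the Azuki NFT shown on the slide.
const metadata: NftMetadata = {
  name: "Azuki #1234",
  image: "ipfs://<image-CID>", // placeholder, not a real CID
  attributes: [
    { trait_type: "Type", value: "Human" },
    { trait_type: "Hair", value: "Pink Hairband" },
  ],
};
```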
Generally speaking, this gets us to the question of, well, why do metadata standards matter in the first place? The answer I have on this slide is that if metadata formats are standardized, then multiple applications can build upon the same metadata, because they understand how to consume it, and we get a level of interoperability and portability where the same metadata format can be consumed anywhere. So this NFT here has the name Superfluous, and the same metadata is being read in both applications, but the user interface and user experience are different in each context, because they're targeting different types of consumer preferences and different product experiences that they'd like to create. This is one of the powerful things that standards help enable: these two applications can coexist on the same shared, fundamental data layer.
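As a rough sketch of what "consuming the same metadata" looks like in practice, here is one way an application might read an NFT's standardized metadata. This assumes ethers.js v6; the RPC URL, contract address, token ID, and gateway are hypothetical placeholders.

```ts
import { Contract, JsonRpcProvider } from "ethers";

// The only ABI fragment a consumer needs: ERC-721's standard tokenURI.
const ERC721_ABI = ["function tokenURI(uint256 tokenId) view returns (string)"];

async function fetchMetadata(address: string, tokenId: bigint) {
  const provider = new JsonRpcProvider("https://rpc.example.com"); // placeholder RPC
  const nft = new Contract(address, ERC721_ABI, provider);
  const uri: string = await nft.tokenURI(tokenId);
  // Resolve ipfs:// URIs through a public gateway for illustration.
  const url = uri.replace("ipfs://", "https://ipfs.io/ipfs/");
  return (await fetch(url)).json();
}
```

Any number of applications can run this same read path and then render the result however their own product experience calls for.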
Now, this is the gaming and metaverse track, so a good question is, well, why do metadata standards matter for gaming and metaverse applications? The answer I'd like to give here is itself a question: in the context of games and metaverse applications, how should content such as images and videos be referenced in asset metadata such that that content can be built upon in different contexts and different applications in this game world, in this metaverse world? One of the most powerful things about all the work that gaming and metaverse entrepreneurs and application developers are doing is that, given a simple starting state for a game world or a metaverse world, instead of having to build all the content and all the experiences yourself, you can drive that forward with the community, and developers can, in a permissionless way, extend the game world, extend game clients, and extend metaverse clients by reusing the same content and logic building blocks that were established previously by their predecessors. The example that hopefully illustrates some of the interesting things you can do when you're able to build permissionlessly on the work of others is this very simple one: imagine you have a trophy asset in a game that comes attached with an instant video replay of the victory.
What if we, as a developer, want to build an extension of the game client where, every single time the player holding the trophy asset enters a particular special zone in the grid, the instant video replay of the victory is played back in a virtual movie theater for everyone in the general proximity of that player to watch? This can be an experience that no one ever expected to build when the game world was first created, but someone can decide that it's an interesting experience they'd like to enable.
However, this requires you to be able to permissionlessly access that content. It requires you to be able to parse the metadata of these assets that are being created by other players and build on top of that. So I think that's one of the interesting ways we can approach metadata standards for gaming and metaverse applications. And lastly, this is a talk about video, so I'd be remiss not to talk about why metadata standards matter for video. The main thing I'm hoping people take away from this slide is this interesting question: in a video context, how should videos be referenced in metadata when there are so many different video formats and so many different possible renditions of the same exact visual content? When it comes to video, oftentimes the video that's produced is very rarely the video that you see. This might be because you have an MP4 version of the file and an MOV version of the file, or it might be because you've processed the file into different qualities.
Or you might have applied different filters. And this comes at odds with another property that we like in web3, which is verifiable data. The nice thing about CIDs in the IPFS ecosystem is that we can establish links in the metadata to a verifiable piece of content, because the CID is just the hash of that content. However, when it comes to dealing with this problem of video formats and renditions, a single CID on its own is not going to be able to represent all the possible renditions you might need today, nor, thinking forward, all the different versions of that video you might need in the future. What happens if someone comes out with a new format that you'd like to support, one that's only supported on certain devices? What happens if you want to enhance the video into a new quality that wasn't previously supported?
All of those things are extensions of the same content, and if you bake into the metadata just the CID of the single piece of content you have today, there are questions around how you should handle that in the future. On the other hand, in a web2 context, we can solve that problem: if we link a YouTube URL in the metadata, many possible renditions can be served from it. But unfortunately, that's a reference to a location and not to the data itself, and the link could break.
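A small sketch of that contrast, with illustrative type names:

```ts
// Location-addressed: one URL can serve many renditions, but the reference
// can rot, because the bytes behind it may change or disappear.
type LocationRef = { kind: "url"; url: string }; // e.g. a YouTube watch URL

// Content-addressed: verifiable (re-hash the bytes and compare to the CID),
// but it pins exactly one encoding of the video.
type ContentRef = { kind: "cid"; cid: string }; // hash of one specific file
```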
So this leads to an idea that we've been working on at Livepeer, and generally a way we can support the web3 ecosystem: this notion of video compute on top of decentralized storage, where everything starts with decentralized storage as the shared verifiable data layer, but you layer compute on top in order to augment and enhance the content that was already anchored into decentralized storage.
The nice thing about CIDs in metadata is that, as shared verifiable data, they can be the root of content built on by others. The CID, being just the hash of a particular piece of data, can serve as the original reference for the content, and additional renditions, whether they're different qualities or filtered versions of the original content, can be produced by the compute layers you stack on top of the CID. The ultimate root and input is still the CID of the original content, and a way that I like to think about it is that we can view CIDs as the base ingredients for enhanced renditions.
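One way to picture that, as a sketch with illustrative field names rather than a published Livepeer schema: renditions are derived records that always point back to the original content hash.

```ts
// A derived rendition always carries the CID of the root content it came from.
interface Rendition {
  cid: string;         // hash of the derived file
  label: string;       // e.g. "720p" or "4k-upscaled"
  derivedFrom: string; // CID of the original content (the root input)
  transform: string;   // the compute that produced it
}

const rootCid = "bafy...original"; // placeholder CID
const renditions: Rendition[] = [
  { cid: "bafy...720p", label: "720p", derivedFrom: rootCid, transform: "transcode" },
  { cid: "bafy...4k", label: "4k", derivedFrom: rootCid, transform: "super-resolution" },
];
```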
So something that I'll highlight here: let's say we have an NFT and the metadata for that NFT. I'm sure many people are familiar with the Rick Roll video, so I chose to use that as an example here, because there's actually an interesting news article that came out last year that I'll get into momentarily. Let's say we have a Rickroll NFT: we have a CID that references the Rickroll MP4 file, and now that file is permanently linked with the NFT.
The Rick Roll video, as many people know, is a Rick Astley music video that came out a long time ago. When it was created, we as video technologists didn't really have tools sophisticated enough to create really high-res versions of it. But in 2021 there was a Verge article noting that someone, using new AI-based video techniques, upscaled the original Rick Roll video. So now you can watch Rick Astley in 4K.
Now you can get rickrolled in 4K. That's an example of a post-processing step that was not available at the time the content was released. But naturally you want to see if you can enhance the experience of these videos, so you're going to want to explore applying these forms of compute in order to enhance what you already have. In this case, we can apply compute on top of the CID in order to get this enhanced video: Rickroll in 4K.
So this brings me to a demo that I want to show, and hopefully it will work. But before getting into it, the feature I talked about earlier is IPFS CID video streaming playback. The general idea here is that the plumbing for video streaming online is a process called transcoding. Transcoding is the process of taking an input video and transforming it into all of these different qualities, renditions, and formats, so that you could be on your Android device or your iOS device, on a high-bandwidth or a low-bandwidth connection, and continue watching the same content seamlessly. If you go to a website like YouTube and look at the UI, you can actually see that there are multiple different qualities you can pick from, and by default the player on the front end intelligently chooses for you, so as a user or a consumer you never have to touch anything.
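To illustrate, here is a sketch of a typical transcoding "rendition ladder" and a deliberately naive version of the quality selection a player performs; the profile numbers are illustrative, and real players use much smarter heuristics.

```ts
// One input video fans out into several output profiles, highest first.
const ladder = [
  { name: "1080p", width: 1920, height: 1080, bitrateKbps: 6000 },
  { name: "720p",  width: 1280, height: 720,  bitrateKbps: 3000 },
  { name: "480p",  width: 854,  height: 480,  bitrateKbps: 1200 },
  { name: "360p",  width: 640,  height: 360,  bitrateKbps: 600 },
];

// Pick the highest-bitrate rendition that fits the estimated bandwidth,
// falling back to the lowest rung if nothing fits.
function pickRendition(estimatedKbps: number) {
  return ladder.find((r) => r.bitrateKbps <= estimatedKbps) ?? ladder[ladder.length - 1];
}
```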
So this comes to what we've been doing at Livepeer. Livepeer is a protocol and a network for supporting global, open, decentralized video infrastructure, and one of the primary tasks and forms of compute that the network is responsible for today is video transcoding. Video transcoding is a very heavy process, but it can be hardware accelerated, so for someone such as a crypto miner with access to a GPU, that GPU can actually be quite efficient for this task that we work on in the network.
So, coming back to transcoding: as I mentioned previously, in a lot of cases, if you want this efficient playback experience, if you want things to work regardless of what device you're on and what internet connection speed you're on, transcoding can help there. But what happens if all you have is an MP4 file? What happens if you have an NFT whose only link is a CID pointing to a file that already exists on IPFS?
Okay, so I don't think I have sound here, so I'll just narrate instead. What is happening here is that you often see the effects of transcoding a lot more prominently when you're on a shaky internet connection. So what you can do in the browser, and what I did here, is simulate a 3G connection, so it feels like you're on a mobile connection, as opposed to your high-speed internet, which might be fiber optic wherever you are. First I'm going to switch over to that mobile connection, and then, while on it, I'm going to show the playback experience when you take an MP4 asset and play it back from IPFS, and then using the video streaming approach I mentioned just now.
Next, what I'm showing here is that we have a docs page for the Livepeer JS SDK, and what you can do is take the same CID that was being played back from the IPFS gateway and plug an ipfs:// URL with that CID into this page. What this shows is that it's going to try loading the video-streaming-based playback, and from here we can actually see the same exact content being played back via the CID that was inputted. The nice thing here is that we don't actually get any buffering. We're still on the same 3G connection, so it's still shaky; it's still not going to be fast enough for a lot of types of data downloads. But because we transcoded the video, we get access to these different versions that the player can switch between, and from an end-consumer point of view, playback continues and you don't get the spinning circle of death anymore.
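For a sense of the wiring involved, here is a sketch of adaptive playback in the browser. This is not the Livepeer SDK's API; it uses hls.js directly, and the manifest URL (a multi-rendition stream derived from the original CID) is a hypothetical placeholder.

```ts
import Hls from "hls.js";

const video = document.querySelector<HTMLVideoElement>("#player")!;
const manifestUrl = "https://playback.example.com/hls/<CID>/index.m3u8"; // placeholder

if (Hls.isSupported()) {
  const hls = new Hls();
  hls.loadSource(manifestUrl); // load the multi-rendition manifest
  hls.attachMedia(video);      // hls.js switches qualities automatically
} else {
  video.src = manifestUrl;     // Safari can play HLS natively
}
```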
So here you can get shared verifiable data via IPFS, but you can also apply compute on top in order to get this enhanced video experience. And that leads me to the last portion of my talk: we just showed this notion of IPFS CID video streaming playback, but taking a step back, we think we can generalize this feature, and when we generalize it, we end up with what we call a decentralized storage video compute pipeline. Transcoding is what I just demoed, and transcoding is a big part of supporting video on the internet.
Cool. So a first question is: with this video compute pipeline, what other forms of video compute could actually be interesting? What does that even mean? I present a few examples here that hopefully illustrate a little bit of what you can start exploring when you have this pipeline.
One example is transcription: even if you didn't start off with a transcription of the dialogue in a video, you can auto-generate one now, and it can actually work pretty well. Someone actually did that for a whole slew of videos on YouTube. The second screenshot shows the Lex Fridman podcast; all the videos are publicly available on YouTube, and someone downloaded all of them, transcribed them, and generated captions for every episode, which is pretty cool.
The next step you can take from there: once you have the captions, well, you want to support them in your video as well. So something you can do is auto-generate the captions and then also automatically insert them into the video stream, so that, similar to what you see on YouTube, where you can turn on closed captioning for a video, you can support this in more applications too; you don't necessarily need to be YouTube.
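As a sketch of that insertion step, here is how auto-generated transcript segments might be turned into WebVTT captions that standard players understand; the Segment shape is a hypothetical stand-in for speech-to-text output.

```ts
interface Segment { startSec: number; endSec: number; text: string }

// Format seconds as an HH:MM:SS.mmm WebVTT timestamp.
function toTimestamp(sec: number): string {
  const h = String(Math.floor(sec / 3600)).padStart(2, "0");
  const m = String(Math.floor((sec % 3600) / 60)).padStart(2, "0");
  const s = (sec % 60).toFixed(3).padStart(6, "0");
  return `${h}:${m}:${s}`;
}

// Emit a WebVTT file: header, then one cue per transcript segment.
function toWebVtt(segments: Segment[]): string {
  const cues = segments.map(
    (seg) => `${toTimestamp(seg.startSec)} --> ${toTimestamp(seg.endSec)}\n${seg.text}`
  );
  return ["WEBVTT", ...cues].join("\n\n");
}
```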
I think another good example is something called super resolution. That's just a fancy term for: can I take an original video that I have and increase its quality, so it looks crisper and more quote-unquote modern? We've seen a lot of examples of this where people try to restore old footage and then basically enhance it, so that maybe an animation that came out in 1995 can be restored to look like it came out in 2015 or 2020.
There's some interesting compute work here on the AI side, and one of the most popular models for this is called ESRGAN. On the left, I show a screenshot of something I did locally: I took an input video and then applied ESRGAN to it, so that we could upscale it and increase the resolution. You might not be able to see it perfectly here.
I think another interesting example: many people here might have bought into, or been checking out, the generative AI hype on Twitter. What's interesting about generative AI, and some of the stuff people have been working on with Stable Diffusion, is that on one hand people are working on models to transform text to video, but even before we get to the text-to-video models, we can already generate interesting videos from text using text-to-image models.
At the end of the day, the output of this pipeline is encoded video that can be transmitted on the internet, and we think this is an interesting way to think about how to extend the data you have in decentralized storage, augment it, and enhance it, so that you can not only have the experiences you're used to with media in web2, but go beyond that as well. And at the end of the day, the CIDs remain.
The last thing I'll mention, and I'll close with, is this notion of verified inputs and verified outputs. I think an important question to ask here, and one that we're looking into, concerns a world of many formats and renditions of the same original video. A CID is a verified input, in that you can verify that the data matches the hash of the CID; signed data is another, where someone takes their private key and signs the hash of the data.
Given a verified input, how do we link the video output of the pipeline back to that input? Ideally you have something that looks like the illustration I have on this slide, where, transmitted with the video output, there is a linkable provenance chain through the transformations that were applied, back to the verified input. So: verified input in, verified output out.
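As a sketch of what such a provenance chain might look like (an illustrative shape, not a published standard; hashes and signatures are placeholders):

```ts
// Each step records what it consumed, what it did, and what it produced.
interface ProvenanceStep {
  inputCid: string;          // CID of the data this step consumed
  transform: string;         // e.g. "transcode:720p" or "super-resolution"
  outputCid: string;         // CID of the data this step produced
  operatorSignature: string; // signature over (inputCid, transform, outputCid)
}

// The audit chain travels alongside the video; a verifier walks it from the
// verified input forward, checking that each step consumes the previous output.
function verifyChain(chain: ProvenanceStep[], rootCid: string): boolean {
  if (chain.length === 0 || chain[0].inputCid !== rootCid) return false;
  for (let i = 1; i < chain.length; i++) {
    if (chain[i].inputCid !== chain[i - 1].outputCid) return false;
  }
  // Per-step signature verification is omitted in this sketch.
  return true;
}
```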
That's something we're spending cycles thinking about at Livepeer, and we think there are some interesting things you can do here with cryptographic proofs: providing audit chains for the media, and having those audit chains transmitted alongside the video or piece of media, so that when you're able to view it, you're also able to consume, display, and use that provenance chain as a developer.
So CIDs can actually serve as the shared verifiable data layer that then provides inputs into this video compute pipeline based on decentralized storage. I mentioned CID video streaming playback as an example of what you can do with that pipeline rooted in decentralized storage. And lastly, this general video compute pipeline, anchored in decentralized storage, we think can be a general approach to creating enhanced video experiences with different types of video compute on top of decentralized storage.