From YouTube: Saving the web to IPFS while you browse with Webrecorder tools - @ikreymer - Browsers and the Web
A: Yeah, so I'll talk about saving the web to IPFS with Webrecorder tools, and I'm also going to try to do a live demo, so it should be interesting. So, about Webrecorder: Webrecorder is an open-source project. Some of our objectives are building quality, open-source web archiving tools; making web archives more accessible; decentralized technologies, obviously a big one; striving for the highest-fidelity archiving and replay, which I'll demonstrate later; and really empowering anyone to create, share, and use web archives however they would like to. We've also created a portable web archiving format. Really, our motto is "web archiving for all."
A: So, as it turns out, IPFS can help with many of these goals, at least these three. And I guess I should have introduced myself: I'm the lead developer and creator of the Webrecorder project. Just to cover the many different use cases: supporting archives, libraries, and digital preservation work is a major use case, and we work with a lot of libraries, including the National Library of Iceland, who is using some of our open-source tools; a brief shout-out to them. We also support community and personal archiving, including allowing individuals to archive their own content and allowing communities to archive shared content online that's important to them.
A: Data rescue and archiving at-risk content: anything that might be threatened by government censorship, war, or just link rot, for a variety of reasons. Also supporting verifiable web archives used as evidence; that's becoming a more and more important use case. And last but not least, of course, is bridging between the current web, or whatever you want to call Web 2.0, and the next generation of the web, whether it's the dweb, Web3, or the p2p web. Each of these use cases could, of course, be its own presentation.
A: So there are a lot of different use cases for this technology. Just a brief introduction to one of the workflows we have: ArchiveWeb.page is one of the tools, which allows you to archive with a browser extension, an Electron app, or just a regular website. Then we have a portable format, which we call the WACZ format, which I'll talk about later as well. And then we have a separate tool, ReplayWeb.page, which allows you to view the web archives; it runs as just a regular web app, a progressive web app, a desktop Electron app, or an embeddable web component. So that's sort of one of the workflows, and of course IPFS is key there, as one of the storage options for the portable data format that we have.
A: That's the manual archiving workflow, which I'll demo. There's also an automated workflow, where we replace ArchiveWeb.page with a crawler that can run in Docker as well as Kubernetes; there's an API and a GUI being developed. The CLI version can be deployed anywhere from a Raspberry Pi, in fact, to a larger cloud deployment. I won't go into this, but I'm just mentioning the scale of some of the things we'll want to support.
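The automated crawl described here can be sketched as a Docker invocation. This is an illustrative helper, not the project's own tooling: the image name matches browsertrix-crawler's published image, but the exact flags (`--url`, `--collection`, `--generateWACZ`) are assumptions to verify against the crawler's README.

```python
# Sketch: build a `docker run` argv for a single-site crawl that packages
# its output as a WACZ file. Flag names are assumed, not authoritative.
import shlex

def build_crawl_command(url, collection, output_dir="./crawls"):
    """Return the argv list for one browsertrix-crawler run."""
    return [
        "docker", "run", "--rm",
        "-v", f"{output_dir}:/crawls",            # persist output on the host
        "webrecorder/browsertrix-crawler", "crawl",
        "--url", url,
        "--collection", collection,
        "--generateWACZ",                          # package results as WACZ
    ]

cmd = build_crawl_command("https://example.com/", "demo")
print(shlex.join(cmd))
```

The same argv could be handed to `subprocess.run` on a machine with Docker installed.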
A: So I'll just skip right ahead to the live demo portion. I'll start in Brave: ArchiveWeb.page has this landing page, and from here you can install the extension. I've already done that, and it's right here. So what I can do is, let's say I want to go to the Webrecorder Twitter feed, for example. I can go into the extension, click "create new archive," just call it "demo," and then click start. This is using the Chrome debugging API, so it's showing that I'm debugging this browser; that's a part of Chromium, and it allows me to archive everything that's being loaded here. You can see that I've already loaded 10.4 megabytes, and it's deduplicated, so it's stored eight point something. I think that's accurate.
A: So let's try maybe something more interesting, like the #IPFSThing hashtag; there are some tweets in here. Let's say I want to archive this Twitter feed. I could click on each tweet, but that's going to be a little bit tedious. So we have this thing called autopilot, which will automate that for me: it will start clicking on each tweet, scrolling down, and also scrolling through the images, and as it's doing that, it's all being archived. We can see the size counter going up, and if I leave it running, it should just keep going through the entire hashtag.
A: And now it's showing me that I've archived this Twitter page. Actually, I guess, here's one of the complexities: I started at the @webrecorder_io feed and then Twitter dynamically changed the page, so it's not actually a separate page. So I should probably just start again on this page, so that it gets added as a page entry. The History API makes web archiving a lot harder. Actually, maybe I'll do something else: let's also click on another page, so I'll go here as well.
A: Hopefully it loads. I guess I already have it doing a lot here, so I just loaded it first, and then I can turn the archiving back on. Another thing to show while it's loading: this is now the replay, so now I'm viewing the archive, and this is exactly the view I saw. You can see that it's logged in as myself, so this is a view of twitter.com that is unique to me.
A: So it's very much a personal archive, and obviously anyone viewing the web, especially social media, would have a personalized view. Okay, I guess we're still loading; maybe I'll stop that. So, from the extension:
A: I'll just have it locally as a file, and now I click on the sharing link. I make three different options available, because that was necessary for testing in different environments. And because I'm in Brave, if I paste the link, obviously it's in the same instance, and now it's loading the same archive from IPFS. And so I should be able to click on things here.
A: By high fidelity, the idea is that we're archiving not just a static version of a site, but the actual interactive web application, including all the JavaScript that's involved. So this is one of the approaches, and it requires installing the browser extension and archiving exactly what's loaded in your browser.
A: It archives just a single page at a time. Later, we created a kind of simplified version of this, where you can archive a single page at a time without having to install an extension. Oh, and I should say, sharing to IPFS works well in Brave through there, but not anywhere else currently. But in express.archiveweb.page, you can just enter a URL and it will load that, and this should load in any browser.
A: So let's say I want to archive Dietrich's tweet. Actually, we also have a way of archiving just the tweet itself, and this is kind of special: oEmbed is a standard for embedding that allows for essentially loading an embed like this, and Twitter happens to support it.
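The oEmbed lookup mentioned here can be sketched with Twitter's publish.twitter.com endpoint, which returns embed HTML for a tweet URL. A minimal sketch; the fetch requires network access, so URL building is factored out, and the tweet URL below is a made-up example.

```python
# Sketch: build and (optionally) fetch a Twitter oEmbed response.
from urllib.parse import urlencode
from urllib.request import urlopen
import json

OEMBED_ENDPOINT = "https://publish.twitter.com/oembed"

def oembed_url(tweet_url):
    """Build the oEmbed API URL for a given tweet."""
    return OEMBED_ENDPOINT + "?" + urlencode({"url": tweet_url})

def fetch_embed_html(tweet_url):
    """Fetch the embeddable HTML for a tweet (network access required)."""
    with urlopen(oembed_url(tweet_url)) as resp:
        return json.load(resp)["html"]

print(oembed_url("https://twitter.com/webrecorder_io/status/123"))
```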
A: I can also download it, and we have this URL, which I can load again in... actually, that was in Chrome, so let's switch to Brave, so that the gateway link loads natively. And now we have it loading in Brave. Let's also try Agregore; two browsers aren't enough, let's try three. Actually, I'm going to copy the direct IPFS link.
A: And now we have it loading in Agregore as well. For other browsers, of course, there's still the direct gateway link; let's see if that actually loads. Oh no, I guess I copied the wrong one. Yeah, if I just click on that, it will probably take longer. Some of the other things we can do: we can also archive YouTube videos this way. Oh, actually, for YouTube:
A: And then we can also share that to IPFS as well. So you can re-archive pages from an archive, one page at a time, with this, and it will generate a different CID every time, because it's also including timestamps in there. But it's the same idea.
A: You can load it in Brave, and then we'll have a version of this. Okay, so that's part of it; probably for the first time, I'll move back to the slides. So yeah: if you need to archive a page to IPFS in a pinch, you could try express.archiveweb.page. It generally should work; as long as it's not something that's private, it should work for public data.
A: It's also proxying things through Cloudflare, because that's the only way we can access things on another website, due to CORS restrictions. So, what works really well: Brave access, and creating web archives on the desktop, works really well, as I've just shown. Also, access in the Agregore browser, thanks to Mauve, works really well on desktop, and there's also a mobile version of Agregore that can load IPFS links directly as well.
A: And, of course, using Web3.Storage has generally been really fast. Of course, that's just an HTTP API, but it generally works very robustly and reliably. And so, how much data can we store?
A: In the case of that GeoCities page, that was, I think, around 100K or so; that's one kind of lower bound. And then, as part of an effort to archive Ukrainian websites from a large crawl, not using the browser extension but using the crawling system, which produces data in the same format, we actually have an archive that's about one terabyte. And we can actually browse that.
A: So if I just load that here (I had it preloaded), it's also in the same format. I believe it's a page in four languages that has a whole bunch of YouTube videos; there's a whole media section that is just multiple hour-long YouTube videos, and the size is just over a terabyte. The idea is that this system should support anything in that range and, ideally, larger as well.
A: Chrome is trying to load it from the gateway, and that's going to be slow, so we'll just move on. So, a little bit about the WACZ format. It's a kind of standalone format: it packages some of the existing formats that we use in the web archiving world, including WARC, which lacks indexes; that's the main limitation of WARC.
A: It can include direct signatures of all the data, so everything is hashed and then signed at the end, in the manifest or in the digest file. And it's basically just a ZIP file, which allows for random access. This is the structure of the ZIP file, and there's a link to the specs. The idea is that this file format can be loaded from anywhere: from IPFS, from the local file system, over HTTP.
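The "it's basically just a ZIP" point can be illustrated in a few lines: ZIP members can be read individually, without unpacking the whole archive. The file names below mirror the WACZ layout (archive/, indexes/, pages/, datapackage.json), but the contents are stand-ins, not a real archive.

```python
# Toy WACZ-shaped ZIP: write four members, then read just one of them
# back by name, which is the random-access property replay relies on.
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("datapackage.json", '{"profile": "data-package"}')
    z.writestr("pages/pages.jsonl", '{"url": "https://example.com/", "ts": "2022"}\n')
    z.writestr("indexes/index.cdx", "com,example)/ 2022 ...\n")
    z.writestr("archive/data.warc.gz", b"\x1f\x8b...")  # placeholder bytes

# Random access: open one member directly, ignoring the rest.
with zipfile.ZipFile(buf) as z:
    pages = z.read("pages/pages.jsonl").decode()
print(pages.strip())
```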
A: It's basically the simplest thing that works. For putting it on IPFS, we didn't want to do anything special with IPLD to start; it's basically just a UnixFS directory structure with the WACZ file itself, plus this boilerplate of three files: index.html, which loads the web component, then a service worker, and the UI that provides the nav bar and everything for browsing.
A: And when express.archiveweb.page or ArchiveWeb.page puts data on IPFS, this is essentially what it's putting: a CID with this structure. So, how have we tried putting data on IPFS? In the browser extension, with js-ipfs in the browser: if you're using Chrome, that's what it will use; if you're using Brave, then it will use the native go-ipfs in Brave, and I think we'll probably do something similar with Agregore.
A: Then there's also an Electron version of ArchiveWeb.page, which runs js-ipfs in Node, and then finally, with express.archiveweb.page, just uploading to Web3.Storage through a regular HTTP API. And if I were to rank these approaches, this is kind of how it's been: js-ipfs in the browser extension is not really reliable; I can't just generate a URL, give it to users, and expect that to work, which sort of makes sense given the way it runs. The native support in Brave works much better.
A: But there are still some hacks involved. The Electron app with js-ipfs mostly works, but it's also not quite as reliable as it is in Brave. And then, of course, uploading to Web3.Storage has been super reliable and easy to use, so that gets five stars.
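The "upload over a regular HTTP API" path can be sketched like this. The endpoint and bearer-token header follow the Web3.Storage HTTP API as I understand it (POST /car with an API token); treat both as assumptions to check against current docs, and `YOUR_API_TOKEN` is a placeholder.

```python
# Sketch: construct the HTTP request that ships one CAR file to
# Web3.Storage. Building the request is separated from sending it,
# so the shape can be checked without network access.
from urllib.request import Request

API_URL = "https://api.web3.storage/car"  # assumed endpoint

def build_car_upload(car_bytes, token):
    """Build the POST request for uploading a single CAR file."""
    return Request(
        API_URL,
        data=car_bytes,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/car",
        },
    )

req = build_car_upload(b"\x00carv1...", "YOUR_API_TOKEN")
print(req.full_url, req.get_method())
```

Sending it would be one `urllib.request.urlopen(req)` call; the JSON response carries the root CID.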
A: Some more details about js-ipfs in the browser: we also attempted custom preloading, because we don't necessarily want to upload everything at once, and it's running in a service worker.
A: There are some limitations (you can't use WebRTC), so there have been a few challenges there. For the embedded mode in Brave, since there's no writable gateway yet, I basically have to port-scan for the API port that Brave runs on. It runs on one of five ports, depending on whether you're using the Brave release version or a dev build, and I also have to override CORS, because the gateway is not designed to be used in the browser. So basically, it's sort of a huge hack.
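The port-scan hack can be sketched as probing a handful of localhost ports and keeping the first that accepts a TCP connection. The port numbers below are illustrative stand-ins; the real extension checks a specific set that differs between Brave release and dev builds.

```python
# Sketch: find a local IPFS API port by attempting TCP connections.
import socket

CANDIDATE_PORTS = [45001, 45002, 45003, 45004, 45005]  # assumed values

def find_api_port(ports, host="127.0.0.1", timeout=0.2):
    """Return the first port accepting connections, or None."""
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return port
        except OSError:
            continue
    return None

port = find_api_port(CANDIDATE_PORTS)
print("IPFS API port:", port if port is not None else "not found")
```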
A: For the Electron API, I had to implement a custom IPC system, because the recording happens in the browser and then gets sent to the Electron Node process; that's also a very custom implementation. And then for Web3.Storage, it's just creating the CAR files and using the REST API. So: a set of very different ways of trying to do the same thing. So now, the challenges for reading from IPFS.
A: Number one, I think, is the need for reliable gateways, especially for small random-access reads. What often happens is that, since it's pulling small amounts of data, the gateways often time out with a 429 error; at least the Web3.Storage gateway does, and I think others do as well.
A: So it sort of depends on the gateway implementation. One of the reasons for web archiving is to be able to solve link rot and have a permanent URL to give to users. HTTP links are not reliable; IPFS links are content-addressed, which is great, but we need to choose a gateway to access IPFS, and so we're kind of back at link rot, because we need to give users an HTTP link to a gateway which may not always work. So that is a problem for reliability currently.
A: I think the three main key operations for a web archive that I need to focus on, from the user's perspective, are these. First, a user should be able to browse an existing web archive, and to do that, we need to load blocks from a CID by random access.
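Over a gateway, "load blocks by random access" reduces to HTTP Range requests: ask for just the bytes of the ZIP entry you need. A minimal sketch; the gateway host is a stand-in (any gateway honoring Range headers would do), and the CID and path are placeholders.

```python
# Sketch: build a ranged GET for a slice of a file stored under a CID.
from urllib.request import Request

def build_range_request(cid, path, start, length):
    """Request `length` bytes at offset `start` of a file under a CID."""
    url = f"https://ipfs.io/ipfs/{cid}/{path}"  # placeholder gateway
    end = start + length - 1                     # Range headers are inclusive
    return Request(url, headers={"Range": f"bytes={start}-{end}"})

req = build_range_request("bafy...cid", "archive.wacz", 1024, 512)
print(req.get_header("Range"))
```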
A: Second, we need to be able to make a copy of an existing web archive, and that's where we need to pin the whole thing, locally or maybe somewhere else, so that the user can know that it's their copy. And then, third, creating and sharing: serialize to WACZ, then pin that data in the structure I showed earlier. I make a distinction between one and two because, just to browse a web archive, you only want random access.
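The "pin the whole thing" operation maps onto a single call against a local node's HTTP API: POST /api/v0/pin/add with the CID as the `arg` parameter, which pins the DAG recursively. A sketch, assuming a daemon on the default API port 5001; the CID below is a made-up example.

```python
# Sketch: build the pin-add call for a local IPFS daemon.
from urllib.parse import urlencode
from urllib.request import Request

def build_pin_request(cid, api="http://127.0.0.1:5001"):
    """Build the request that recursively pins a CID on a local node."""
    qs = urlencode({"arg": cid, "recursive": "true"})
    return Request(f"{api}/api/v0/pin/add?{qs}", method="POST")

req = build_pin_request("bafybeigdyrztexamplecid")
print(req.full_url)
```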
A: You might want to just look at a few pages; if you actually want to make a copy, that's when you pull the whole thing. So random-access reads from a very large dataset are very important here. And IPFS should really be infrastructure: users of archives shouldn't have to know or think about using IPFS.
A: I think, ideally, the options in the left column are what I want to present to users, without telling them about what's happening on the right unless they're developers. So I think these are some of the key goals to be working towards. Just briefly, about future work: how to build large archive collections, and trying to standardize this data layout.
A: This is kind of what we have for single WACZ files, but obviously you can't put everything into one file; you want to be able to mutate and add additional files later. Maybe that's what the structure looks like; it's still being decided, and there might be another manifest involved there.
A: The data is actually stored in the ZIP uncompressed, to make it portable, so that users can download a single file. But if we're putting that data on IPFS and you aren't just downloading a file, maybe we just unzip it first, and maybe we create a CAR of this directory structure, and maybe that will become a standard. More work ahead: standardization, and authenticatable web archives. That's a really big one.
A: Since anyone can create an archive and put it on IPFS, like the ones I demoed or downloaded, how do you trust a web archive, whether as a WACZ file, as a CID, or in any format? We have several approaches for archives created in the cloud, but we're still trying to figure out how to do that with archives created in the browser. There's also interop between the cloud and browser-based archiving tools: making it easier to crawl a whole site with the cloud system and then patch in things that are highly interactive and require browser-based archiving. And possibly integration with IPFS Companion in some capacity.
A: Maybe, as Dave mentioned, that's something to be discussed. Private, encrypted web archives: something we don't have right now, but if you're archiving exactly what's loaded in your browser, including logged-in and paywalled content, you definitely want an option to make that private and encrypted. Search, either by URL, by date, or possibly full-text search; probably all of the above. And then finally, even further out: maybe putting not just web archives but whole Web 2.0 servers into emulators and running them on WebAssembly.
A: For that, you can look at OldWeb.today, which is another project I worked on that runs old browsers in emulators. That's way further out, but since there's a lot of talk about WebAssembly, I thought I'd put that in there as well. Thank you; you can find out more at the Webrecorder site, and yes, we're hiring if anyone wants to help. So, thank you.
D: Regarding your cloud thing, do you have any sort of deny lists or moderation features? Because it seems like this could be a way for people to just circumvent blocks, or start using your infrastructure to crawl things which they maybe shouldn't. So is there any moderation in place, or plans for it in the future? Or is there even a worry of DMCA takedowns?
A: Possibly. It's still in very early stages of development; these tools are a little bit further along right now. We're kind of focusing on supporting archives and institutions that would run crawls in a controlled way. If it's ever offered as a public service, then we'd definitely need to think about that. Of course, it's all open source, so someone else could run it on their own, and then they would have to worry about that.
B: [question inaudible]

A: Possibly, yes. It could be snapshots over time, or it could be that you've crawled one site or a bunch of sites and then realize you need to add more things. So it could be either snapshots of the same site, or additional sites, or any combination of that.
A: So the replay... and again, this is all something that will probably change. Right now, the replay is bundled with the archives themselves. There's an argument for possibly separating that out, because, since it's just a service worker that's loaded, you could load the replay and then pass a parameter that loads the actual archive from a different CID. So there are a few things to figure out there.
A: Yes, exactly. This is just the boilerplate, and this is the actual UI; that's basically what renders this. And the HTML is just basically that.
E: Thanks, that was really cool. Two questions. First one: the crawler that you're using behind the scenes to do all of this, is that open source somewhere? Can we integrate it into other stuff to do more of this?
A: Yeah, so there's a difference. What I showed now, with the browser extension, is basically all manual; nothing is being automated there. It wasn't crawling; it was archiving what I was loading in the browser. There's a separate tool that does crawling, which is also open source. I didn't quite have time to demo it, but I can talk more about it. It basically automates a browser with Puppeteer and produces output in the same format.
E: Very cool. Okay, follow-up question: if you do this on, like, a global scale, can you deduplicate, like, JavaScript libraries and CSS across the global archive that you're creating?
A: Currently, no, because the format includes timestamps. So currently, the format itself would not deduplicate; because it includes timestamps and other metadata, it would be a different WACZ file every time. The interesting part would be if, especially for large files, there were some sort of custom chunking.
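The chunking idea at the end can be illustrated with plain content hashing: if identical payloads (a shared JS library, a common CSS file) were stored as their own hash-addressed blocks, two archives containing them would share those blocks. A minimal sketch of that dedup step, not Webrecorder's actual storage:

```python
# Sketch: store each unique payload once, keyed by its SHA-256 digest,
# while every URL keeps a pointer to its payload's digest.
import hashlib

def dedupe(resources):
    """Map payloads to hash-addressed blocks; return (store, refs)."""
    store, refs = {}, {}
    for url, body in resources.items():
        digest = hashlib.sha256(body).hexdigest()
        store.setdefault(digest, body)   # stored once per unique payload
        refs[url] = digest               # every URL keeps a pointer
    return store, refs

resources = {
    "https://a.example/jquery.js": b"/* jquery */",
    "https://b.example/jquery.js": b"/* jquery */",  # identical copy
    "https://a.example/site.css": b"body{}",
}
store, refs = dedupe(resources)
print(len(resources), "resources,", len(store), "unique blocks")
```

Content-defined chunking would go further, splitting large files so that even partially identical payloads share blocks.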