IPFS Virtual Meetups, 26 Feb 2021

From YouTube: Webrecorder: Web archiving for all - Ilya Kreymer

Description

Join Ilya as he demos high-fidelity web archiving using the ArchiveWeb.page browser extension, and talks about how web archives created in the browser can be directly shared with others using an experimental feature that uses js-ipfs in the browser.

For more information on IPFS
- visit the project website: https://ipfs.io
- or follow IPFS on Twitter: https://twitter.com/IPFS

Join your local IPFS meetup to attend our next event: https://www.meetup.com/pro/ipfs/

Sign up to get IPFS news, including releases, ecosystem updates, and community announcements in your inbox, each Tuesday: http://eepurl.com/gL2Pi5

A

Yeah, thank you for having me. I wanted to talk about high-fidelity web archiving and do a quick demo of some of the tools that I've been working on, so we'll go ahead and start. I work on a project called Webrecorder, and the motto of Webrecorder is "web archiving for all."

A

So the idea is to allow anyone to create web archives of exactly what you see in your browser and to archive them at full fidelity. Very briefly, what are some of the goals of the Webrecorder project? It's basically to focus on capture and replay, and to archive things as accurately as possible.

A

All the tools are fully open source, and another key goal, of course, is to make web archiving more accessible via decentralized technologies. So, to start: what is web archiving, and what does it mean to actually save a page and view it later?

A

So you might think that there are some obvious approaches to start with, including, for example, the browser's "Save Page As". Every browser has that, and you can go to a page and try to save it. If you actually do that on any page that is not mostly just a static HTML document, you'll quickly find that what you save doesn't generally work very well when you try to load it back up.

A

Part of it is that it doesn't save the network traffic that got you to that point. It only saves a static snapshot, and it doesn't save any of the state or the JavaScript, and oftentimes modern web pages, which are really complex applications, don't really work when just loaded from your local file system.

A

Another approach that is often tried is simple crawling and scraping.

A

You could actually use wget and point it at a website, and it'll retrieve the HTML, extract all the links, and repeat recursively. But again, there's no JavaScript run, and you'll find that you get the static assets from a site, but anything that's loaded dynamically, anything that's loaded through JavaScript, generally doesn't work.

A

So these are what I would call the lower-fidelity approaches to web archiving. High-fidelity web archiving is attempting to archive exactly what you see and hear and do in the browser, and essentially to capture the interactive experience of websites while keeping them interactive.
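
To illustrate why the static crawling approach described above misses dynamic content, here is a minimal sketch of the link-extraction step such a crawler performs. It is not how wget is implemented; the sample page, its URLs, and the helper names are all made up for illustration.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects URLs from href/src attributes, as a static crawler would."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.urls.append(value)


def extract_static_urls(html):
    """Return every URL that is visible in the markup without running scripts."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.urls


# Hypothetical page: two URLs sit in the markup, a third is only
# requested when the script actually runs in a browser.
PAGE = """
<html><body>
  <a href="/about.html">About</a>
  <img src="/static/logo.png">
  <script>
    fetch("/api/timeline.json").then(r => r.json());
  </script>
</body></html>
"""

found = extract_static_urls(PAGE)
print(found)
```

The URL inside the script never shows up in `found`, because nothing executes the JavaScript. That is the gap that capturing the browser's real network traffic is meant to close.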

A

And so, since modern websites are still made up of HTTP network requests, that's basically what we're attempting to capture.

A

So if we look, for example, at a site like Twitter, which is highly dynamic, and we look at the dev tools, we'll see that when we load Twitter, everything that's being loaded is being served over the network, and you can actually see the network requests coming in in DevTools. The browser obviously already has all of this in order to create the page.

A

So what if we're able to capture all this network traffic and simply recreate it later? That's basically the idea with high-fidelity web archiving. For that we have a browser extension that's available at ArchiveWeb.page, so it's easy to remember, and you can actually go to archiveweb.page to get it.

A

From there, if you're using a Chromium-based browser, it'll take you to the Chrome Web Store. You can also download it as a desktop app and run it that way. What this extension does is essentially archive the exact network traffic that's being loaded in the browser. Before I show that, just very quickly: the way it works is that it archives all the traffic via the Chrome debug protocol, which is what DevTools also uses, and it stores the data in the browser in IndexedDB.

A

And then it can serialize that data into a file format that's downloadable, and that format can really be stored anywhere, including IPFS.
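
The capture, store, and serialize steps just described can be sketched in a few lines. This is only a toy model under stated assumptions: the real extension records traffic through the browser's debugging interface and keeps it in IndexedDB, whereas here a plain dictionary stands in for the store, JSON stands in for the file format, and `MiniArchive` and `Record` are hypothetical names.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Record:
    """One captured HTTP response (body kept as text for simplicity)."""
    url: str
    status: int
    headers: dict
    body: str


class MiniArchive:
    """Toy stand-in for the in-browser store: (method, URL) -> response."""

    def __init__(self):
        self._records = {}

    def capture(self, method, url, status, headers, body):
        # Called once per observed network response.
        self._records[(method, url)] = Record(url, status, headers, body)

    def replay(self, method, url):
        # Answer from the archive only; never touch the live network.
        return self._records.get((method, url))

    def dumps(self):
        # Serialize to a portable form that could be stored anywhere.
        items = [{"method": m, **asdict(r)} for (m, _), r in self._records.items()]
        return json.dumps(items)

    @classmethod
    def loads(cls, text):
        archive = cls()
        for item in json.loads(text):
            archive.capture(item["method"], item["url"], item["status"],
                            item["headers"], item["body"])
        return archive


archive = MiniArchive()
archive.capture("GET", "https://example.com/", 200,
                {"content-type": "text/html"}, "<h1>hello</h1>")
restored = MiniArchive.loads(archive.dumps())
print(restored.replay("GET", "https://example.com/").body)
```

Replay is then a lookup: any request the page makes is answered from the recorded traffic, and a request that was never captured simply has no answer.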

A

And what you can do after you've stored that data, of course, is replay it, and replaying websites is actually even harder than capturing them. You have to rewrite the URLs and you have to emulate the JavaScript environment. It's basically a mini Wayback Machine that's running entirely in your browser. We won't have time to cover all of that, but that's the idea behind this, and I'll go ahead and start a quick demo.
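
As a rough illustration of the URL rewriting mentioned above, the sketch below rewrites `href` and `src` attributes so that they point back into a replay prefix instead of at the live site. The `/replay/` prefix and the function name are hypothetical, and a real replay system has to handle far more than this (CSS, JavaScript, dynamically built URLs), so treat it as a sketch of the idea only.

```python
import re
from urllib.parse import urljoin

# Hypothetical prefix under which archived content is served back.
REPLAY_PREFIX = "/replay/"

# Matches href="..." or src='...' with either quote style.
ATTR_RE = re.compile(
    r'''(?P<attr>\b(?:href|src))=(?P<q>["'])(?P<url>.*?)(?P=q)''',
    re.IGNORECASE,
)


def rewrite_html(html, page_url, prefix=REPLAY_PREFIX):
    """Point href/src attributes at the replay prefix instead of the live web."""

    def rewrite(match):
        # Resolve relative URLs against the page they were captured from.
        absolute = urljoin(page_url, match.group("url"))
        quote = match.group("q")
        return f'{match.group("attr")}={quote}{prefix}{absolute}{quote}'

    return ATTR_RE.sub(rewrite, html)


print(rewrite_html('<img src="/logo.png">', "https://example.com/page"))
```

With rewriting in place, a click or an image load inside the replayed page resolves to archived content rather than leaking out to the live site.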

A

So let's say I'm on a Twitter page here, and I have this extension installed. I'll just create a new demo here, called demo2, and I'll click Start, and you can see this size counter going up. That's all of the network traffic that we just saw in DevTools being archived into the browser. As more things load on this Twitter page, you can see the size counter going up.

A

The extension also tells you whether more requests are being loaded or whether it's done. When it's green like this, that means it's no longer loading anything additional. I can scroll down, and then it'll start loading additional things and the size counter will go up. And since it's using the debug protocol, it tells me that ArchiveWeb.page is debugging the browser.

A

This is a Chromium security setting, so it's always there. I could also click on another page. Let's say I click on the Webrecorder home page; then it'll archive this page as well. Now let's say I'm done archiving: I can click Stop, and then I can go to Browse Archive, and this shows me the two pages that I have just archived.

A

I can then click on each of these, and now I'm loading twitter.com entirely from the network requests that are stored in the browser. Again, what's been archived is exactly what was loaded, so I can scroll down as far as I did before.

A

It'll probably stop at some point, since I only went that far. I can also click on this site. Oh well, maybe that didn't work because there's a redirect, but I can also click on it like this and load the home page this way. You'll also notice that when I look at this page, I'm logged in as myself.

A

So this is my view of twitter.com/webrecorder_io, just the Webrecorder Twitter page, but it's logged in as me. This is a unique view of the web: if someone else goes to this URL they'll see something different, because it'll be logged in as them, or it won't be logged in as me. So this extension really allows you to archive exactly what you see in the browser, your own unique view of the web, which for most social media sites, and many other sites, is entirely different for each individual.

A

What we also have is the sharing option. This is where I can click "Start Sharing", and now this archive has been written to IPFS. Why don't I go ahead and paste the link into the chat, because it might take a little bit of time. Let's see here.

A

I'll just stop sharing for a second and paste this link. It might take a little bit for it to load, and in the meantime I'll go ahead and cover what is happening here.

A

So again, the idea is that you can share your unique view of the web with others.

A

So how is this data actually serialized to IPFS? We have a format called WACZ, which stands for Web Archive Collection Zipped, and it's a ZIP-based format. Inside of it, it stores the data in another format called WARC, which is a standardized format created by the Internet Archive, and it also stores the raw indices into the data.

A

So it basically has the HTTP request/response traffic and an index to look that up, and also a list of pages, and there's a manifest containing all of the files. A key property of this format is that, since it's a ZIP file, it can be randomly accessed.
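
The random-access property of a ZIP container can be demonstrated with the Python standard library. The sketch below builds a small in-memory ZIP whose layout loosely mirrors the parts listed above (raw data, an index, a page list, a manifest) and then reads just one small member while counting how many bytes were actually touched. The member names and contents are illustrative assumptions, not a faithful WACZ file.

```python
import io
import json
import zipfile


class CountingReader(io.BytesIO):
    """In-memory file that counts how many bytes are actually read."""

    def __init__(self, data):
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk


def build_demo_archive():
    """Build a ZIP with one large member and a few small ones (names assumed)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr("archive/data.warc", b"W" * 1_000_000)  # stand-in for raw traffic
        zf.writestr("indexes/index.cdx", b"placeholder index line\n")
        zf.writestr("pages/pages.jsonl", json.dumps({"url": "https://example.com/"}) + "\n")
        zf.writestr("datapackage.json", json.dumps({"resources": ["archive/data.warc"]}))
    return buf.getvalue()


def read_member(data, name):
    """Read a single member and report how many bytes of the ZIP were touched."""
    reader = CountingReader(data)
    with zipfile.ZipFile(reader) as zf:
        content = zf.read(name)
    return content, reader.bytes_read


data = build_demo_archive()
pages, touched = read_member(data, "pages/pages.jsonl")
print(f"archive is {len(data)} bytes, reading the page list touched {touched} bytes")
```

The ZIP directory at the end of the file records where each member starts, so a reader can jump straight to the page list without scanning the megabyte of data stored before it.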

A

And so the idea is that, even if you have a large archive, you don't have to load everything all at once, and that's a key requirement for this to work. So what's actually written to IPFS? What's written is actually four files: the index, a service worker (sw.js), a UI file (ui.js), and then the actual archive in this WACZ format, which is also stored as part of the multihash.

A

And so what I just shared in the chat is basically a link to this multihash that was just created directly in the browser. I shared a link to an HTTP gateway. That's one way to load the data, but there are actually multiple ways to do so.

A

The sharing options include the following.

A

This is basically the sharing menu in the extension. It allows you to get the multihash that was just shared; to get a shareable URL through ReplayWeb.page, which is another hosted site that will then use js-ipfs to load that hash; or to just get an HTTPS gateway link, which is perhaps the most compatible, I would say, but not necessarily the fastest.

A

Here at the top is the status. As you probably noticed, I was using the Brave browser, and the reason for that is that Brave has native support for go-ipfs, which is really great. That allows me to connect to the native go-ipfs daemon that has been started by Brave. Currently, the way this works is that I had to enable it in Brave manually, and then I check which port the IPFS API server is running on. It's on one of these predefined ports, depending on whether it's a Brave nightly or a Brave production release.
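
Checking a handful of predefined ports for a running IPFS API server, as described above, could look roughly like the following sketch. The port numbers are placeholders and not confirmed values for any Brave channel; the probe posts to `/api/v0/version`, which is part of the go-ipfs HTTP API, and it is injectable so the search logic can be exercised without a daemon running.

```python
import http.client
import urllib.error
import urllib.request

# Placeholder ports for illustration only; the real per-channel values
# are not given in the talk.
CANDIDATE_PORTS = (45001, 45002, 45003, 45004, 45005)


def api_responds(port, timeout=0.5):
    """Return True if something answers the IPFS version endpoint on this port."""
    url = f"http://127.0.0.1:{port}/api/v0/version"
    request = urllib.request.Request(url, data=b"", method="POST")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        return False


def find_api_port(ports=CANDIDATE_PORTS, probe=api_responds):
    """Return the first port whose probe succeeds, or None if none respond."""
    for port in ports:
        if probe(port):
            return port
    return None


# Example with a fake probe, so no daemon or network access is needed.
print(find_api_port(ports=(1, 2, 3), probe=lambda port: port == 2))
```

A dedicated browser API that reports the port directly would make this kind of guessing unnecessary.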

A

Eventually, I would like this to be possible to determine through an API. Brave does actually have this API, chrome.ipfs, for extensions, but currently it's only available to IPFS Companion. I'm working with Brave, and they're trying to make it more generically accessible to other extensions as well, which will make this a little bit simpler.

A

So that's sharing in Brave, and that probably works the best.

A

You could also share in the Electron app, which I didn't show. You can download ArchiveWeb.page as an app, which has essentially the same UI, and this app starts a local js-ipfs daemon node, so it connects to a locally running instance, and that also generally works pretty well. Then, of course, the more general use case is using this in Chrome, where there are still some issues to resolve.

A

Of course, since there are no direct p2p connections in Chrome, everything must be served over a WebSocket. There's no WebRTC either, because I'm using a service worker. So the way connecting to IPFS works in Chrome, or in any browser that doesn't have native support, is by connecting to a preload node, which then loads the entire hash, essentially as a kind of proxy.

A

That can be problematic, especially if your archive is very large, because you essentially have to sync everything that was just archived over that WebSocket. Also, the WACZ format is designed for random access, whereas preloading is not. So those are some of the current limitations.
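
The contrast between random access and preloading comes down to how many bytes have to move. Random access over HTTP is normally done with `Range` headers; the sketch below shows how a server might resolve a single `bytes=start-end` range against a stored blob. The function names are hypothetical, and multi-range requests are deliberately left out.

```python
def parse_range(header, total_size):
    """Parse a single 'bytes=start-end' header into (start, end_exclusive)."""
    unit, _, spec = header.partition("=")
    if unit.strip() != "bytes" or "," in spec:
        raise ValueError(f"unsupported range header: {header!r}")
    first, _, last = spec.strip().partition("-")
    if first == "":
        # Suffix form 'bytes=-N': the last N bytes of the resource.
        start = max(total_size - int(last), 0)
        end = total_size
    else:
        start = int(first)
        # The end position in the header is inclusive; an open end means "to EOF".
        end = total_size if last == "" else min(int(last) + 1, total_size)
    if start >= end:
        raise ValueError(f"unsatisfiable range: {header!r}")
    return start, end


def serve_range(blob, header):
    """Return only the requested slice instead of the whole blob."""
    start, end = parse_range(header, len(blob))
    return blob[start:end]


blob = b"0123456789"
print(serve_range(blob, "bytes=2-5"))
```

A replay client that only needs the ZIP directory and one record can issue a few such requests, whereas a preload step transfers the whole archive regardless of what is actually viewed.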

A

One-minute warning to wrap up. I think some of the current limitations of the system are that the preloading isn't yet working as reliably as I would like it to. When loading over a gateway, there are occasionally timeouts, since it's making a lot of small range requests, and when using ReplayWeb.page, it also requires a preload server. That's basically it, and maybe I'll also quickly share another.

A

So here's another archive that was made earlier, and I'll put it in the chat.

A

And yeah, that's basically it. The idea is that you can create archives directly in your browser and then share them with others. If you're using Brave it works really well; if you're using Chrome, hopefully we'll have it working better in the future. And so you can share your unique view of the web.

A