From YouTube: W13 Labs WG: Scraper bots using selenium and lxml
Description
Special TEC Lab episode where we explore the development of scraper bots using selenium and lxml. We make a system that scrapes data from medium accounts.
🙏 Thank you for watching! Hit 👍 and subscribe 🚩 to support this work
🌱Join the Community🌱
on Discord https://discord.gg/uM4ZWDjNfK
or say hello on Telegram https://t.me/tecommons
Join the conversation https://forum.tecommons.org/
Follow us on Twitter: https://twitter.com/tecmns
Learn more http://tecommons.org/
Okay, awesome, good group today. It's nice to see all the diversity here. I'm pretty much going to get started. Today's a pretty fun session: we're going to take a step aside from the rewards research series that we've been doing. Just to get people caught up, we've basically been doing two series this fall.
There's a development task force in the TEC that is aiming to upgrade the praise system: to make it simpler to facilitate, more equitable overall, and a bit more automated, and to capture a lot of the nuances and subtle details of the work being done in this community specifically, while also lending itself to being more general for other communities that choose to implement it as well.
Often as a token engineer you want to collect some data sources, and you might want to create a system that can pull information from the web and put it into a spreadsheet that can then be analyzed. So we're going to build something today; I think we can build it in an hour, I hope, and it's pretty cool. The inspiration for this came over the weekend.
So KlimaDAO's purpose is to accelerate the price appreciation of carbon assets in order to force quicker adaptation to the realities of climate change and drive additional finance toward low-carbon technologies. As further voluntary commitments and compliance regimes come online, the world's companies will be forced to compensate, i.e. offset, their carbon emissions.
The costlier the negative externalities of their damage become, the more economic the decision to reduce emissions and invest in green alternatives. As described on their website, they are building a community that's resolute on solving climate change by creating a black hole for carbon, to accelerate value pressure past the event horizon of traditional markets while creating synergies between DeFi and carbon markets. KlimaDAO, as it turns out, is actually a fork of OlympusDAO, a really interesting protocol that uses staking and bonding mechanisms to accumulate assets inside of its treasury.
So basically (this part is from OlympusDAO): whenever the price of OHM goes up, the protocol mints more tokens and sells them on the market in order to accumulate other tokens and lock them in the treasury. Those tokens might be, say, wrapped ETH or USDC, things like that.
So whenever the price of the token is appreciating, the protocol brings the price back down by minting more tokens, selling them, and accumulating other assets, so the treasury is perpetually growing. KlimaDAO forked that model and said: we're not going to lock just any assets in our treasury, we're going to lock specifically carbon credits, which is possible thanks to another protocol called Toucan, which is tokenizing tons of carbon credits. Really cool. So anyway, KlimaDAO is fascinating: they launched over the weekend, and this is really revolutionary.
If we look at KLIMA on CoinGecko, what's it going for? Oh my goodness, it's up 78 percent overnight. This is crazy, because one KLIMA token represents one base ton of carbon. So it's one ton of carbon.
But I think it's the same thing; I think they're one and the same, yes. This is called Alpha Klima: before their full protocol is released, this is just a simple ERC-20 placeholder for the KLIMA tokens, and you're right, they're going to be on Polygon, I believe.
So the price is normally 2,500. How did I find that? On dex.guru just now, but I guess it's on CoinGecko too.
But yeah, Mark Cuban's into this; that's interesting. What you've got to understand is that the point of KLIMA is to raise the price of a ton of carbon, and you can think of one KLIMA token as representing one ton of carbon. Right now, if KLIMA is trading at 2,500 USD, that means you can buy a carbon credit from a verifier somewhere in the world.
There are a few really reputable verifiers, like Verra or South Pole. You can tokenize a credit through the Toucan protocol and then get it into KlimaDAO in exchange for one KLIMA token. Before KLIMA launched, the price of carbon varied all over the world, but on average one ton of carbon was trading for about 12 USD.
So basically overnight, this DAO pulled the price of captured carbon from 12 USD to 2,500 USD, and I would say that accomplishes their mission. Now, I would guess that this price has got to drift down, because that is such a differential, from twelve dollars to twenty-five hundred dollars, but imagine if it stays constant.
Every person in the world who's running a regenerative operation, whether it's regenerative agriculture or just straight carbon capture, whatever it may be, is now able to redeem 2,500 USD for every ton of carbon they capture. To give you some context: if you grow hemp, for example, one acre will sequester about 15 tons of carbon per crop cycle, and you can do two or three crop cycles in a year if you have optimal conditions.
So what's 15 times 2,500? It's about 40,000, so a farmer is getting an extra 40,000 USD per acre per crop rotation. It's just really interesting how these cyber-physical systems work. This is created out of financial engineering, but I do believe it's going to have a very, very significant impact on the shift of economies toward carbon capture and regenerative infrastructure.
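As a quick sanity check on that arithmetic, using only the numbers quoted above (15 tons per acre per crop cycle, 2,500 USD per ton):

```python
# Back-of-the-envelope check of the figure from the talk.
tons_per_acre_per_cycle = 15
usd_per_ton = 2_500
print(tons_per_acre_per_cycle * usd_per_ton)  # 37500, roughly the 40,000 quoted
```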
So here's Klima. They write really good articles; I don't know who their writer is, but they're really, really good, and together the articles tell one really good story arc. There are 15 articles in total, so I copied them all into this spreadsheet. I have a link to each article, the date it was published, the recommended reading time, and the number of claps it got, so that I can get a single overview of the protocol and the whole narrative they're telling.
They have a three-part series introducing Klima, then a little more on carbon and how they're forking OlympusDAO. They did an initial Discord offering, which is pretty cool: a token sale offered initially to Discord members, a pretty interesting model. They did some NFT art, wrote more about carbon markets, then covered their fair launch as a liquidity bootstrapping pool, how to participate, financing forest protection (that's super interesting), and then incentive alignment, carbon sourcing, and their launch.
I want to be able to do this for more Medium sources, because it's kind of a good way to do research, but this took me about 30 minutes to do by hand: I copied and pasted all the titles and dates and made sure the links were working and everything. So it'd be nice to have an automated system that could take care of that.
So that's what we're going to build today, and I'm pretty much ready to jump into it. I hope we can do it in 45 minutes. Does anyone have any questions or comments before I get started? It'll be pretty much hacking for about 40 minutes.
As for a license, I'm not going to add one for now.
Okay, now I'm going to open up a bit of a template: some old code that I worked on a couple of years ago for looking at real estate data. I have it here; I called it van land db.
We'd probably have to install these packages, but it seems like I already have lxml and selenium installed, which is kind of cool. Now, I made a sort of function here, but how do we get started? Okay, let's name our URL.
Selenium, like I said, drives a web browser: we're going to use Selenium to actually open up a browser, and we'll be able to navigate to this website.
So let's name our URL and see if this works. Now, you have to have a certain kind of driver installed; let's see if I have it. From selenium we're importing webdriver, and I'm going to try to open up Firefox.
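A minimal sketch of that setup, not the exact code from the session: it assumes selenium is installed, geckodriver is on PATH, and the URL is the KlimaDAO Medium page we've been looking at.

```python
def fetch_page_source(url):
    """Open a real Firefox window via Selenium and return the rendered HTML."""
    # Imported inside the function so the sketch stays self-contained;
    # assumes the selenium package and a working geckodriver.
    from selenium import webdriver

    browser = webdriver.Firefox()
    browser.get(url)
    html = browser.page_source  # the rendered page, scripts already executed
    browser.quit()
    return html

# fetch_page_source("https://klimadao.medium.com/")
```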
Then html equals browser.page_source, and we get all the HTML. If we open this up (I think if you hit Ctrl+Shift+I in pretty much any browser, it opens your developer tools), you can see right here all of the HTML.
So that's pretty neat. Let's just close that for now; oh, actually, we're going to need it. So now, what do we want to do? We have all the HTML, so we could probably pull out the titles.
In HTML, an href attribute is a link: it gives the relative location of the link, and the element's text holds the actual title, which is "KlimaDAO Launch: Reflections on Our Manifesto". So we're going to see if, based on the properties of this link, we're able to grab all of the titles that we want.
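To show that xpath step concretely, here is a made-up miniature of the article list (the markup and slugs are illustrative, not Medium's real structure):

```python
from lxml import html

# A hypothetical fragment shaped like the article list described above.
page = """
<div>
  <h1><a href="/klimadao-launch-abc123?source=user_profile">KlimaDAO Launch: Reflections on Our Manifesto</a></h1>
  <h1><a href="/introducing-klimadao-def456?source=user_profile">Introducing KlimaDAO</a></h1>
</div>
"""
tree = html.fromstring(page)
titles = tree.xpath("//h1/a")   # every article-title link element
print(titles[0].text)           # KlimaDAO Launch: Reflections on Our Manifesto
print(titles[0].get("href"))    # /klimadao-launch-abc123?source=user_profile
```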
So let's just grab one of them and see what we can do: e (for element) equals titles[0], just taking the first one, and let's see what attributes we have. Text, cool.
So we now have all the titles of the articles that have loaded. It doesn't give us all of the articles, though, because there's a button here called "Show more" which we would have to click. So we're going to have to write a little bit of logic where we check whether there's a "Show more" button and, if there is, load more. But it looks like we know how to get all of the titles that are loaded; let's see if we can also get the link that corresponds to each of them.
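That "check for the button, click it if present" logic might look something like this sketch. The xpath selector for the button is an assumption about the page, and the By-style locator is Selenium 4 API, which may differ from what was used in the session:

```python
def click_show_more(browser):
    """Click Medium's 'Show more' button if it exists; return whether it did."""
    # Imports inside the function keep the sketch self-contained.
    from selenium.webdriver.common.by import By
    from selenium.common.exceptions import NoSuchElementException

    try:
        # Assumed selector: a button whose visible text is exactly "Show more".
        button = browser.find_element(By.XPATH, "//button[text()='Show more']")
    except NoSuchElementException:
        return False  # nothing left to expand; all articles are loaded
    button.click()
    return True
```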
But it has this weird user-profile query string on the end. So let's see if we can remove everything after the question mark. This is a string, and strings in Python, I believe, have a .split method. Is that right? Let's just check: say a equals "abc" and we split it. And slicing with [:-1] means take all the characters except the last one.
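The two string tricks just mentioned, with a made-up href to stand in for the scraped one:

```python
# A scraped href with Medium's tracking query string on the end (made-up slug).
href = "/klimadao-launch-abc123?source=collection_home---------0----------"
clean = href.split("?")[0]  # split on '?', keep the part before it
print(clean)                # /klimadao-launch-abc123

s = "abc"
print(s[:-1])               # ab -- all characters except the last one
```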
It starts with the double forward slash. Okay, so this is how we can get a link. I know it looks kind of ugly at this point, but it works at least for a single case; let's hope it generalizes to all of them. We know how to get all the titles, so let's see if we can just do a list comprehension.
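That comprehension could look like this sketch. The (text, href) tuples stand in for the lxml elements, and the base URL and slugs are assumptions:

```python
# Stand-ins for the scraped elements: (title text, raw href) pairs.
base = "https://klimadao.medium.com"  # assumed base URL for relative links
scraped = [
    ("Introducing KlimaDAO", "/introducing-klimadao-abc?source=x"),
    ("KlimaDAO Launch", "/klimadao-launch-def?source=y"),
]
# One comprehension: pair each title with its cleaned absolute link.
rows = [(text, base + href.split("?")[0]) for text, href in scraped]
print(rows[0])
```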
That gives us the titles and their links. I just made this a markdown cell for a little bit of notation, and I can also merge these cells, because this is all just one step: initialize the browser and get the HTML of the page.
These are probably all elements like we saw before, so we should be able to get their text: e.text for e in dates. Oops, this one is the reading time.
You'll notice we're working on what's called a tree, this tree object that we've created out of the HTML text. If you've worked with JavaScript, you're probably familiar with the DOM, the Document Object Model. This is actually how all web pages are constructed: every web page is a tree data structure. There's a root node at the very top of the tree that represents the entire page, and then you have your initial containers inside of that, which will usually be something like your background.
B
For
example,
it
will
be
near
the
top
of
that
tree
and
you
know,
and
then
you
you
have
containers
inside
of
containers
inside
of
containers
inside
of
containers
inside
of
containers,
and
so
you
get
this
sort
of
tree
structure,
which
is
how
every
web
page
is
rendered,
and
so
that's
what
we're
doing
here-
we're
sort
of
navigating
the
tree.
So
so
I
had
gotten
this
html
object
here,
which
is
the
p
the
paragraph
and
nested
inside
of
that
is
the
span
object.
B
Oh, it's empty because it's not a paragraph; it's a button. I think, yes.
I'll just say: get all of the titles, URLs, dates, reading times, and clap numbers. Let's see what this looks like: title maps to the title column. Yeah, okay, I think that's what we want. Date maps to the date column, reading time maps to the reading-time column, and claps maps to the claps column. Oh, but we don't have the URLs.
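Assembling those columns into a DataFrame might look like this sketch; the lists here are stand-ins for what the xpath queries would actually return from the page:

```python
import pandas as pd

# Stand-in values; the real ones come from the scraped elements.
titles = ["Introducing KlimaDAO", "KlimaDAO Launch"]
dates = ["Oct 1, 2021", "Nov 8, 2021"]
reading_times = ["5 min read", "7 min read"]
claps = [120, 250]

# Each list becomes one column, exactly the title/date/reading-time/claps
# mapping described above.
df = pd.DataFrame({
    "title": titles,
    "date": dates,
    "reading time": reading_times,
    "claps": claps,
})
print(df)
```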
It's kind of cool that pandas has a formatter: if you use the format option, pandas can actually render HTML. Okay, let's try something like this.
Let's make a function that takes a column. Wait, oh yeah, what if we apply a hyperlink formatter to the DataFrame? The idea, for any keen reader who wants to attempt it, is combining a title column and a URL column into a single cell, like this, right? These are now actual hyperlinks. Alternatively, you could just take this data and export it into our spreadsheets.
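One possible way to do both, sketched with a made-up row: build an HTML anchor per row (which renders as a clickable link when the DataFrame is displayed as HTML in a notebook), and write a plain CSV for the spreadsheet route. The slug and column names are assumptions:

```python
from io import StringIO
import pandas as pd

df = pd.DataFrame({
    "title": ["Introducing KlimaDAO"],
    "url": ["https://klimadao.medium.com/introducing-klimadao-abc"],  # made-up slug
})
# Merge the title and url columns into one HTML anchor per row.
df["link"] = "<a href='" + df["url"] + "'>" + df["title"] + "</a>"

# Plain export for the spreadsheet route; a real file path works the same way.
buf = StringIO()
df.to_csv(buf, index=False)
print(buf.getvalue().splitlines()[0])  # header row: title,url,link
```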
So that's the process of putting together web scrapers. It's really fun and, as you can see, quite simple in the end. We simply import our packages (selenium, lxml, and pandas), grab the URL, open up the browser, plug in the URL, and make a tree structure out of the page. This is what we call boilerplate, just the standard code you'll have every time. And then what you do is open up the developer tools.
You use the inspector element to pick out what you want to grab, find any unique identifiers (like the class or the HTML tag type), and use the xpath function. It might look a little scary, but when you have a template to go off of it's really easy, and you can always look up "python lxml xpath" and you'll find all sorts of Stack Overflow posts and tutorials and that kind of stuff.
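For example, an xpath query keyed on a class attribute, the "unique identifier" pattern just described (the class names here are illustrative, not Medium's real ones):

```python
from lxml import html

# A hypothetical fragment with a class attribute to key the query on.
tree = html.fromstring(
    '<div class="postArticle"><h3 class="graf--title">Hello Klima</h3></div>'
)
hits = tree.xpath('//h3[@class="graf--title"]')  # match on tag + class
print(hits[0].text)  # Hello Klima
```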
B
In
this
case,
we
had
to
actually
get
children
and
then
index
into
that
list
and
then
grab
the
text.
But
it's
usually
something
like
that.
And
then
you
clean
up
your
data
and
you
can
pipe
it
all
into
a
data
frame
to
have
a
nice
clean
output,
and
then
this
can
be
automated
right.
So
now
the
idea
is
well:
let's
just
try
this:
let's
go.
B
We could take this and point it at another page, and it'll probably break; there'll probably be some slight difference. As you apply it to more and more things you might have to tweak some aspects manually, but eventually you can come up with a general system that will scrape any web page, as long as it follows a sort of standard format.
So let's see if this breaks or outputs something. Okay, so it didn't get anything here. Somewhere along the chain there's some slight difference between KlimaDAO and OlympusDAO on Medium, and we could go through and figure out what that is, but maybe that's a future episode: turning this into a generalized Medium scraper bot. I'll push this up; we've got a repo here, and I'll go ahead and push that now, since we're at the top of the hour.
So I just want to thank everyone for joining. This was kind of long and heady, but fun stuff too: in 45 minutes we were able to scrape all the Medium articles. It's pretty powerful technology.
Okay, the code is all pushed, so feel free to go check it out.