From YouTube: W13 Labs WG: Scraper bots using selenium and lxml
Description
Special TEC Lab episode where we explore the development of scraper bots using selenium and lxml. We make a system that scrapes data from medium accounts.
🙏 Thank you for watching! Hit 👍 and subscribe 🚩 to support this work
🌱Join the Community🌱
on Discord https://discord.gg/uM4ZWDjNfK
or say hello on Telegram https://t.me/tecommons
Join the conversation https://forum.tecommons.org/
Follow us on Twitter: https://twitter.com/tecmns
Learn more http://tecommons.org/
Okay, awesome, good group today. It's nice to see all the diversity here. I'm pretty much going to get started. Today's a pretty fun session: we're going to take a step aside from the rewards research series that we've been doing. Just to get people caught up, we've basically been doing two series this fall.
There's a development task force in the TEC that is aiming to upgrade the praise system: to make it simpler to facilitate, more equitable overall, and a bit more automated, and to capture a lot of the nuances and subtle details of the work being done in this community specifically, while also lending itself to being more general for other communities that choose to implement it as well.
Often as a token engineer you want to collect some data sources, and you might want to create a system that can pull information from the web and put it into a spreadsheet that can then be analyzed. So we're going to build something today; I think we can build it in an hour, I hope, and it's pretty cool. The inspiration for this came over the weekend.
So KlimaDAO's purpose is to accelerate the price appreciation of carbon assets in order to force quicker adaptation to the realities of climate change and drive additional finance toward low-carbon technologies. As further voluntary commitments and compliance regimes come online, the world's companies will be forced to compensate, i.e. offset, their carbon emissions.
The costlier the negative externalities of their damage become, the more economic the decision to reduce emissions and invest in green alternatives. As described on their website, they are building a community that's resolute on solving climate change by creating a black hole for carbon, to accelerate value pressure past the event horizon of traditional markets while creating synergies between DeFi and carbon markets. KlimaDAO, as it turns out, is actually a fork of OlympusDAO, a really interesting protocol that uses staking and bonding mechanisms to accumulate assets inside of its treasury.
So basically (this part is from OlympusDAO): whenever the price of OHM goes up, the protocol mints more tokens and sells them on the market in order to accumulate other tokens and lock them in the treasury. Those tokens might be, say, wrapped ETH or USDC, things like that.
So whenever the price of the token is appreciating, the protocol brings the price back down by minting more tokens, selling them, and accumulating other assets, so the treasury is perpetually growing. KlimaDAO forked that model and said: we're not going to lock just any assets in our treasury, we're going to lock specifically carbon credits, which is possible thanks to another protocol called Toucan, which is tokenizing tons of carbon credits. Really cool. So anyway, KlimaDAO is fascinating: they launched over the weekend, and this is really revolutionary.
If we look at KLIMA on CoinGecko, what's it going for? Oh my goodness, it's up 78 percent overnight. This is crazy, because one KLIMA token represents one base ton of carbon. So it's one ton of carbon.
But I think it's the same thing; I think they're one and the same, yes. This is called Alpha Klima: before their full protocol is released, this is just a simple ERC-20 placeholder for the KLIMA tokens, and you're right, they're going to be on Polygon, I believe.
So the price is normally 2,500. How did I find that? On dex.guru just now, but I guess it's on CoinGecko too.
But yeah, Mark Cuban's into this; that's interesting. What you've got to understand is that the point of KLIMA is to raise the price of a ton of carbon, and you can think of one KLIMA token as representing one ton of carbon. Right now, if KLIMA is trading at 2,500 USD, that means you can buy a carbon credit from a verifier somewhere in the world.
There are a few really reputable verifiers, like Verra or South Pole. You can tokenize a credit through the Toucan protocol and then get it into KlimaDAO in exchange for one KLIMA token. Before KLIMA launched, the price of carbon varied all over the world, but on average one ton of carbon was trading for about 12 USD.
So basically overnight, this DAO pulled the price of captured carbon from 12 USD to 2,500 USD, and I would say that accomplishes their mission. Now, I would guess that this price has got to drift down, because that is such a differential, from twelve dollars to twenty-five hundred dollars, but imagine if it stays constant.
Every person in the world who's running a regenerative operation, whether it's regenerative agriculture or just straight carbon capture, whatever it may be, is now able to redeem 2,500 USD for every ton of carbon they capture. To give you some context: if you grow hemp, for example, one acre will sequester about 15 tons of carbon per crop cycle, and you can do two or three crop cycles in a year if you have optimal conditions.
So what's 15 times 2,500? It's about 40,000, so a farmer is getting an extra 40,000 USD per acre per crop rotation. It's just really interesting how these cyber-physical systems work. This is created out of financial engineering, but I do believe it's going to have a very, very significant impact on the shift of economies toward carbon capture and regenerative infrastructure.
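As a quick sanity check on that arithmetic, using only the numbers quoted above (15 tons per acre per crop cycle, 2,500 USD per ton):

```python
# Back-of-the-envelope check of the figure from the talk.
tons_per_acre_per_cycle = 15
usd_per_ton = 2_500
print(tons_per_acre_per_cycle * usd_per_ton)  # 37500, roughly the 40,000 quoted
```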
So here's Klima. They write really good articles; I don't know who their writer is, but they're really, really good, and together the articles tell one really good story arc. There are 15 articles in total, so I copied them all into this spreadsheet. I have a link to each article, the date it was published, the recommended reading time, and the number of claps it got, so that I can get a single overview of the protocol and the whole narrative they're telling.
They have a three-part series introducing Klima, then a little more on carbon and how they're forking OlympusDAO. They did an initial Discord offering, which is pretty cool: a token sale offered initially to Discord members, a pretty interesting model. They did some NFT art, wrote more about carbon markets, then covered their fair launch as a liquidity bootstrapping pool, how to participate, financing forest protection (that's super interesting), and then incentive alignment, carbon sourcing, and their launch.
I want to be able to do this for more Medium sources, because it's kind of a good way to do research, but this took me about 30 minutes to do by hand: I copied and pasted all the titles and dates and made sure the links were working and everything. So it'd be nice to have an automated system that could take care of that.
So that's what we're going to build today, and I'm pretty much ready to jump into it. I hope we can do it in 45 minutes. Does anyone have any questions or comments before I get started? It'll be pretty much hacking for about 40 minutes.
As for a license, I'm not going to add one for now.
Okay, now I'm going to open up a bit of a template: some old code that I worked on a couple of years ago for looking at real estate data. I have it here; I called it van land db.
We'd probably have to install these packages, but it seems like I already have lxml and selenium installed, which is kind of cool. Now, I made a sort of function here, but how do we get started? Okay, let's name our URL.
Selenium, like I said, drives a web browser: we're going to use Selenium to actually open up a browser, and we'll be able to navigate to this website.
So let's name our URL and see if this works. Now, you have to have a certain kind of driver installed; let's see if I have it. From selenium we're importing webdriver, and I'm going to try to open up Firefox.
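A minimal sketch of that setup, not the exact code from the session: it assumes selenium is installed, geckodriver is on PATH, and the URL is the KlimaDAO Medium page we've been looking at.

```python
def fetch_page_source(url):
    """Open a real Firefox window via Selenium and return the rendered HTML."""
    # Imported inside the function so the sketch stays self-contained;
    # assumes the selenium package and a working geckodriver.
    from selenium import webdriver

    browser = webdriver.Firefox()
    browser.get(url)
    html = browser.page_source  # the rendered page, scripts already executed
    browser.quit()
    return html

# fetch_page_source("https://klimadao.medium.com/")
```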
Then html equals browser.page_source, and we get all the HTML. If we open this up (I think if you hit Ctrl+Shift+I in pretty much any browser, it opens your developer tools), you can see right here all of the HTML.
So that's pretty neat. Let's just close that for now; oh, actually, we're going to need it. So now, what do we want to do? We have all the HTML, so we could probably pull out the titles.
In HTML, an href attribute is a link: it gives the relative location of the link, and the element's text holds the actual title, which is "KlimaDAO Launch: Reflections on Our Manifesto". So we're going to see if, based on the properties of this link, we're able to grab all of the titles that we want.
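To show that xpath step concretely, here is a made-up miniature of the article list (the markup and slugs are illustrative, not Medium's real structure):

```python
from lxml import html

# A hypothetical fragment shaped like the article list described above.
page = """
<div>
  <h1><a href="/klimadao-launch-abc123?source=user_profile">KlimaDAO Launch: Reflections on Our Manifesto</a></h1>
  <h1><a href="/introducing-klimadao-def456?source=user_profile">Introducing KlimaDAO</a></h1>
</div>
"""
tree = html.fromstring(page)
titles = tree.xpath("//h1/a")   # every article-title link element
print(titles[0].text)           # KlimaDAO Launch: Reflections on Our Manifesto
print(titles[0].get("href"))    # /klimadao-launch-abc123?source=user_profile
```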
So let's just grab one of them and see what we can do: e (for element) equals titles[0], just taking the first one, and let's see what attributes we have. Text, cool.
So we now have all the titles of the articles that have loaded. It doesn't give us all of the articles, though, because there's a button here called "Show more" which we would have to click. So we're going to have to write a little bit of logic where we check whether there's a "Show more" button and, if there is, load more. But it looks like we know how to get all of the titles that are loaded; let's see if we can also get the link that corresponds to each of them.
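That "check for the button, click it if present" logic might look something like this sketch. The xpath selector for the button is an assumption about the page, and the By-style locator is Selenium 4 API, which may differ from what was used in the session:

```python
def click_show_more(browser):
    """Click Medium's 'Show more' button if it exists; return whether it did."""
    # Imports inside the function keep the sketch self-contained.
    from selenium.webdriver.common.by import By
    from selenium.common.exceptions import NoSuchElementException

    try:
        # Assumed selector: a button whose visible text is exactly "Show more".
        button = browser.find_element(By.XPATH, "//button[text()='Show more']")
    except NoSuchElementException:
        return False  # nothing left to expand; all articles are loaded
    button.click()
    return True
```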
But it has this weird user-profile query string on the end. So let's see if we can remove everything after the question mark. This is a string, and strings in Python, I believe, have a .split method. Is that right? Let's just check: say a equals "abc" and we split it. And slicing with [:-1] means take all the characters except the last one.
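The two string tricks just mentioned, with a made-up href to stand in for the scraped one:

```python
# A scraped href with Medium's tracking query string on the end (made-up slug).
href = "/klimadao-launch-abc123?source=collection_home---------0----------"
clean = href.split("?")[0]  # split on '?', keep the part before it
print(clean)                # /klimadao-launch-abc123

s = "abc"
print(s[:-1])               # ab -- all characters except the last one
```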
It starts with the double forward slash. Okay, so this is how we can get a link. I know it looks kind of ugly at this point, but it works at least for a single case; let's hope it generalizes to all of them. We know how to get all the titles, so let's see if we can just do a list comprehension.
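That comprehension could look like this sketch. The (text, href) tuples stand in for the lxml elements, and the base URL and slugs are assumptions:

```python
# Stand-ins for the scraped elements: (title text, raw href) pairs.
base = "https://klimadao.medium.com"  # assumed base URL for relative links
scraped = [
    ("Introducing KlimaDAO", "/introducing-klimadao-abc?source=x"),
    ("KlimaDAO Launch", "/klimadao-launch-def?source=y"),
]
# One comprehension: pair each title with its cleaned absolute link.
rows = [(text, base + href.split("?")[0]) for text, href in scraped]
print(rows[0])
```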
That gives us the titles and their links. I just made this a markdown cell for a little bit of notation, and I can also merge these cells, because this is all just one step: initialize the browser and get the HTML of the page.
These are probably all elements like we saw before, so we should be able to get their text: e.text for e in dates. Oops, this one is the reading time.
You'll notice we're working on what's called a tree, this tree object that we've created out of the HTML text. If you've worked with JavaScript, you're probably familiar with the DOM, the Document Object Model. This is actually how all web pages are constructed: every web page is a tree data structure. There's a root node at the very top of the tree that represents the entire page, and then you have your initial containers inside of that, which will usually be something like your background.
B
For
example,
it
will
be
near
the
top
of
that
tree
and
you
know,
and
then
you
you
have
containers
inside
of
containers
inside
of
containers
inside
of
containers
inside
of
containers,
and
so
you
get
this
sort
of
tree
structure,
which
is
how
every
web
page
is
rendered,
and
so
that's
what
we're
doing
here-
we're
sort
of
navigating
the
tree.
So
so
I
had
gotten
this
html
object
here,
which
is
the
p
the
paragraph
and
nested
inside
of
that
is
the
span
object.
B
Oh, it's empty because it's not a paragraph; it's a button. I think, yes.
I'll just say: get all of the titles, URLs, dates, reading times, and clap numbers. Let's see what this looks like: title maps to the title column. Yeah, okay, I think that's what we want. Date maps to the date column, reading time maps to the reading-time column, and claps maps to the claps column. Oh, but we don't have the URLs.
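Assembling those columns into a DataFrame might look like this sketch; the lists here are stand-ins for what the xpath queries would actually return from the page:

```python
import pandas as pd

# Stand-in values; the real ones come from the scraped elements.
titles = ["Introducing KlimaDAO", "KlimaDAO Launch"]
dates = ["Oct 1, 2021", "Nov 8, 2021"]
reading_times = ["5 min read", "7 min read"]
claps = [120, 250]

# Each list becomes one column, exactly the title/date/reading-time/claps
# mapping described above.
df = pd.DataFrame({
    "title": titles,
    "date": dates,
    "reading time": reading_times,
    "claps": claps,
})
print(df)
```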
It's kind of cool that pandas has a formatter: if you use the format option, pandas can actually render HTML. Okay, let's try something like this.
Let's make a function that takes a column. Wait, oh yeah, what if we apply a hyperlink formatter to the DataFrame? The idea, for any keen reader who wants to attempt it, is combining a title column and a URL column into a single cell, like this, right? These are now actual hyperlinks. Alternatively, you could just take this data and export it into our spreadsheets.
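One possible way to do both, sketched with a made-up row: build an HTML anchor per row (which renders as a clickable link when the DataFrame is displayed as HTML in a notebook), and write a plain CSV for the spreadsheet route. The slug and column names are assumptions:

```python
from io import StringIO
import pandas as pd

df = pd.DataFrame({
    "title": ["Introducing KlimaDAO"],
    "url": ["https://klimadao.medium.com/introducing-klimadao-abc"],  # made-up slug
})
# Merge the title and url columns into one HTML anchor per row.
df["link"] = "<a href='" + df["url"] + "'>" + df["title"] + "</a>"

# Plain export for the spreadsheet route; a real file path works the same way.
buf = StringIO()
df.to_csv(buf, index=False)
print(buf.getvalue().splitlines()[0])  # header row: title,url,link
```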
So that's the process of putting together web scrapers. It's really fun and, as you can see, quite simple in the end. We simply import our packages (selenium, lxml, and pandas), grab the URL, open up the browser, plug in the URL, and make a tree structure out of the page. This is what we call boilerplate, just the standard code you'll have every time. And then what you do is open up the developer tools.
You use the inspector element to pick out what you want to grab, find any unique identifiers (like the class or the HTML tag type), and use the xpath function. It might look a little scary, but when you have a template to go off of it's really easy, and you can always look up "python lxml xpath" and you'll find all sorts of Stack Overflow posts and tutorials and that kind of stuff.
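For example, an xpath query keyed on a class attribute, the "unique identifier" pattern just described (the class names here are illustrative, not Medium's real ones):

```python
from lxml import html

# A hypothetical fragment with a class attribute to key the query on.
tree = html.fromstring(
    '<div class="postArticle"><h3 class="graf--title">Hello Klima</h3></div>'
)
hits = tree.xpath('//h3[@class="graf--title"]')  # match on tag + class
print(hits[0].text)  # Hello Klima
```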
B
In
this
case,
we
had
to
actually
get
children
and
then
index
into
that
list
and
then
grab
the
text.
But
it's
usually
something
like
that.
And
then
you
clean
up
your
data
and
you
can
pipe
it
all
into
a
data
frame
to
have
a
nice
clean
output,
and
then
this
can
be
automated
right.
So
now
the
idea
is
well:
let's
just
try
this:
let's
go.
B
We could take this and point it at another page, and it'll probably break; there'll probably be some slight difference. As you apply it to more and more things you might have to tweak some aspects manually, but eventually you can come up with a general system that will scrape any web page, as long as it follows a sort of standard format.
So let's see if this breaks or outputs something. Okay, so it didn't get anything here. Somewhere along the chain there's some slight difference between KlimaDAO and OlympusDAO on Medium, and we could go through and figure out what that is, but maybe that's a future episode: turning this into a generalized Medium scraper bot. I'll push this up; we've got a repo here, and I'll go ahead and push that now, since we're at the top of the hour.
So I just want to thank everyone for joining. This was kind of long and heady, but fun stuff too: in 45 minutes we were able to scrape all the Medium articles. It's pretty powerful technology.
Okay, the code is all pushed, so feel free to go check it out.