From YouTube: DataOps TUTORIAL: Data engineering / DAGSTER end-to-end example with BigQuery, dbt, Spark, Jupyter
Description
#dataOps #dataengineering #dagster #dbt #bigQuery #SPARK
In this video we explore the principles for designing a modern data engineering platform, and we build a data processing pipeline with dagster (https://dagster.io/), a data pipeline orchestrator, using:
* BigQuery as the DWH
* dbt as the SQL data transformation tool
* Dataproc/PySpark to process data at scale with Spark
* A Jupyter notebook to explore and visualize the result
Code from the video: https://github.com/velascoluis/dagster_gcp
Follow me on:
👉twitter: @luisvelasco
👉medium: https://medium.com/@velascoluis
👉github: https://github.com/velascoluis
As the intro said, this past year we have been focusing on analyzing everything that happens in the final part of a much larger pipeline. We have seen different families of models and how to process audio, text and video; and to run all of these models you need data, and you need to know how to generate, treat and process it.
That is, broadly, the field of data engineering, and with it comes the concept of data warehousing as we know it today. The data warehouse, this centralized place where the data of the entire organization is stored, integrated and normalized, was born in the 60s and 70s, and Inmon popularized the term in the 70s. Data warehouses lived through a golden age in the 80s and 90s, with technologies such as Oracle and Teradata; then, truth be told, they experienced a small decline with the advent of big data, but now they seem to be living a second youth with the rise of cloud data warehouses.
Well, here on screen you can see the classic framework used to logically describe a data warehouse. Data engineering is the discipline that brings together everything related to processing data and getting it ready for subsequent exploitation. Within it we find concepts as diverse as real-time data processing, quality rules, or how to generate a central logical model with its dimensions, its fact tables, etc.
A
Okay,,
and
now
what
I
propose
is
to
analyze
what
major
trends
are
currently
occurring
in
the
market
in
the
industry
that
are
in
some
way
redefining
the
entire
concept.
In
data
engineering.
We
are
going
to
analyze
three
major
trends,,
the
first
of
which
is
what
I
call
the
common
separation
of
responsibilities,
specialization
or
hyper
specialization,.
A
What does this mean from a technological point of view? However you look at it, we are moving from a process-centric world to a data-centric world. In the 80s, 90s and 2000s, the kings of business software were undoubtedly the ERP and CRM applications, such as SAP or Siebel, which digitalized key business processes in the organization. From roughly 2010 onwards, we are seeing that the main source of value is not so much the process itself, but the analytical treatment of the data that these processes generate in order, as you already know, to improve decision making by placing the data at the center of everything. And what is happening is that these platforms, which at the beginning were somewhat more monolithic, are being disaggregated.
They are specializing into different components, each of which adds a lot of value to the complete stack. Before, we had one database that pretty much did it all. Now we are seeing that database break down into a storage layer, SQL compute engines, real-time stream processing, a semantic model, a scheduler separate from the database, different tools to generate and ingest the data and, of course, the different modes of consumption, both on-premises and in the cloud.
This central role is being divided into multiple specializations, and we find titles as specialized as, for example, pipeline orchestration engineer, machine learning engineer or data infrastructure engineer, and on the exploitation side business analysts, BI analysts, etc. Finally, as a summary of the concept of separation of responsibilities: what hyper-specialization does is break monolithic components into distinct, more agile, highly specialized subcomponents with greater autonomy, as you can see.
This approach has multiple advantages, but also some disadvantages, such as the need to integrate all of these components so that they talk to each other naturally and we do not have to make a brutal effort on integration. And I am telling you this today, but tomorrow it can change: it is a very fast-moving ecosystem, where practically every month or every quarter we see new components appear within this framework. Some of them are successful, such as the concept of the feature store; others not so much, and what we see is that they tend to disappear.
It is, therefore, a super dynamic and vibrant ecosystem. Well, the second macro trend that I want to talk about within the world of data is "everything as code". When we talk about data engineering, images like the ones you see on screen come to mind: those famous drag-and-drop applications of little boxes for composing your transformations, applications such as Informatica PowerCenter, Oracle Data Integrator, Talend, IBM DataStage, etc.
Although the learning curve of these tools is relatively gentle, they present a series of problems that I call the sins of ETL. The first is scalability and performance: in general, these integration tools collect data from various data sources, integrate it in an internal engine, and finally deposit the result in the data warehouse. And what we are seeing is that the raw compute power of today's cloud data warehouses is brutal.
Therefore, it is better to flip the paradigm and do everything within the data warehouse, once the data has been loaded into it: this is the concept of ELT versus ETL. The truth is that most of these tools were not born with this concept embedded. Many of them are trying to make the change, but they are tools designed much more for data integration than for working with the new cloud data warehouses.
They therefore present scalability and performance problems. And, as we said before, we are moving from a process-centric world to a data-centric world: data volumes will keep growing and growing, and it is super important that we have tools that scale as our data estates grow. The second sin of ETL brings together a large package of things, such as configurability, reliability, rigidity, collaboration and automation. These tools are, by definition, quite rigid.
They provide a series of out-of-the-box transformations, such as a join, an aggregation, some load modes; but the moment you step slightly off the beaten path and want to do something a little more complex, what you end up with is a little box that is really a script with a lot of code you have developed yourself. We therefore lose the whole point of these tools, which end up relegated to being mere script launchers. On the other hand, there is version control: how do you diff two pipelines of little boxes, how do you do, for example, a pull request or a merge? They also present serious deficiencies for collaborative development and limit the level of automation we can reach.
And finally, the third sin of ETL is vendor lock-in versus standards and open source. In the end, all of these tools have a favorite: for example, if you use Oracle Data Integrator, it is naturally very well thought out and integrated with the Oracle database; the same goes for IBM DataStage with the IBM stack. That is to say, they are solutions with quite significant lock-in, and if you want to migrate from one to another it is practically a nightmare, because in the end they are not based on open standards; they are proprietary. Moving from one to another is therefore super complicated, and you become a bit of a prisoner of the vendor you have chosen. The new data engineering tools break directly with this approach and, as I said, they have made real inroads.
So what we have done at the data level is look at software engineering and apply its concepts: version control, continuous integration and continuous deployment, a basis of standards, extensibility of the frameworks, and portability. We are taking a programmatic approach to the world of data transformation and combining it with the concept I told you about previously, the disaggregation of integration: we now have specific tools for, for example, extraction and loading, and specific tools for transformation within the warehouse using open standards such as SQL. The promise of DataOps, therefore, is to somewhat fill the gap that exists between data engineering and software engineering, bringing these good principles and this maturity to the discipline. That, in short, is the "everything as code" approach. The third great trend in the world of data is the embrace of ecosystems and open standards. And what does this mean?
I remember a few years ago, around 2012 or 2013, when everything was big data: every day a new framework appeared, and the Hadoop ecosystem looked like a zoo, there were so many animals. What we have seen as this discipline has matured is that the years have gone by and the dust has settled; and when the dust settles, what remains standing is what adds value. What added only smoke did not survive, and in fact not that many things have remained. For example, all the promise of Hadoop and everything that appeared around it has practically been distilled into Spark. Python has more or less been chosen as the default language for programming these data pipelines, as we will see later. The concept of containers has also remained, with Kubernetes as the platform where those containers run, and we have seen a resurgence of SQL.
Today there can hardly be anyone in the world of data engineering who does not know SQL; it is a must, and you have to start somewhere. I always recommend that you start with SQL and then look at some other frameworks like Pandas DataFrames, etc. But in the end, notice that the new platforms we are seeing are fleeing from the concept of the one-stop shop, where a single vendor would give you everything packaged and pre-integrated, but you could not move beyond it.
We are opening that up: the appeal now is the concept of the ecosystem, the concept of the platform, where today you can be running a pipeline in Python, but if something new comes out tomorrow that makes sense, it is relatively easy to integrate. You do not have the lock-in of the slightly more monolithic one-stop-shop concept that dominated the previous era. And finally, last but not least...
...cutting across everything is the use of the cloud as the default medium for deploying practically all of our pipelines. So, summarizing data engineering in 2021: one, separation of responsibilities and disaggregation of components; we break with the monolithic one-stop-shop concept and embrace very specialized components, based on open standards, that communicate well with each other. Secondly, everything as code, DataOps: the application of software lifecycle principles, testing, automation, to the world of data. And lastly...
A
We
embrace
ecosystems,,
we
flee
from
approaches
with
veedor,
locking
and
all
of
this
is
deployed
in
the
cloud
today,
so
that
this
introduction
a
bit
and
what
I
propose
to
you
now
is
that
we
are
going
to
build
together
a
modern
pack
line
using
all
these
technologies
well
and
that
we
are
going
to
build.
We
can
start
deploying
a
day
the
warehouse
in
the
cloud.
In this case we are going to use Google Cloud and BigQuery, within which we are going to have some public data, the Stack Overflow dataset, which is what we are going to work on. Specifically, we are going to work with the Stack Overflow data, the posts and the comments, and do some processing with them: we are going to process the data within BigQuery, using SQL, with dbt.
dbt is a framework for building these data transformation pipelines using SQL within the database, everything, as I always say, from a programmatic approach. We are going to generate some tables and some aggregations, and then we are going to export this data in a format that is very standard nowadays, such as, for example, JSON.
We will export them from BigQuery, to show how easy it is to integrate these platforms and how complete the ecosystem is, to Google Cloud Storage, a distributed object storage system in the cloud. And what are we going to do then? We are going to process this data with a Spark cluster: we are going to implement a little PySpark program.
Orchestrating it all will be Dagster, a data pipeline orchestrator. Notice how it attends to the principles we talked about before. First, the principle of hyper-specialization: we go from a generic scheduler, such as Apache Airflow, for example, which is a super cool framework but generic, to an orchestrator hyper-specialized for data pipelines, with all the advantages that brings. Second point: everything we develop with Dagster is going to be code, all of it Python code.
Now we will see in an example how simple it is to develop this pipeline. And third, it is all based on open standards, where we will be able to integrate things like dbt, Spark, SQL or Jupyter notebooks using Papermill, all of it extensible and open. Here is the GitHub repository, where you can see that it is totally free software.
So we are going to develop this end-to-end pipeline using Dagster and all of these components we have been seeing, BigQuery, dbt and Dataproc, with data in JSON and Parquet, and visualization afterwards with Jupyter. Very well: I am here in PyCharm, and what we are going to do is create a new project to show you.
We install Dagster. In Dagster there are two super important abstractions: the concept of the solid and the pipeline. A solid is a piece of code that holds the logic of one step of our pipeline. A solid can run SQL code, a solid can run dbt, it can run Spark code, it can launch a Jupyter notebook, it can run Snowflake, etcetera, etcetera. And what a pipeline does is compose several of these solids.
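To make the two abstractions concrete, here is a minimal sketch in the 0.x Dagster API the video uses; this is illustrative rather than the exact code from the repo, and the names are made up:

```python
from dagster import pipeline, solid

@solid
def say_hello(context):
    # One step of the pipeline: any Python logic can live here.
    context.log.info("Hello from a solid!")

@pipeline
def hello_pipeline():
    # A pipeline composes one or more solids into a graph.
    say_hello()
```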
And what we are going to do within our pipeline is call the solid functions, so that we end up with a pipeline that simply uses this solid you can see on screen. And how do we run it? With dagit, the Dagster tool that executes our pipeline. So we return here to the command line and pass the file to it. Dagster also has, of course, a Python SDK, but it has this command-line interface as well.
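For reference, a run can also be launched from the Python SDK instead of dagit; a minimal sketch against the hello_pipeline from the earlier example, assuming Dagster 0.x:

```python
from dagster import execute_pipeline

# Programmatic equivalent of launching a run from dagit;
# `hello_pipeline` is the sketch pipeline defined earlier.
result = execute_pipeline(hello_pipeline)
assert result.success
```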
We launch it and it opens for us here on a port, because the dagit application I am showing you is this one right here. Here is where the pipeline is graphed; in this case it only has this one step. We can see all the metadata, which is honestly quite rich, and in the Playground we can launch executions.
So what we are going to do is launch this execution, which in principle will only run that solid we said should write to the screen. And there it has run: we can watch it as it executes, with the timings and so on, and below we see the execution steps. We can inspect the output; there is a warning here that is simply a debug-level matter, and in the output, well, we have what we wrote. So this is a super simple Dagster pipeline.
What we are going to do now is complicate this a little more, and the first thing I propose is to play a bit with BigQuery and dbt. Here is some slightly more complicated code that we are going to analyze. First of all, I want to show you a bit of the structure I have: I have a dbt folder, which contains a dbt project.
I have a jupyter folder, where I have a Jupyter notebook, and I have a spark folder where I have some PySpark code. I think the best way is to first run it, look at the resulting graph a bit, and then analyze it step by step. This is our Dagster pipeline.
First we create the cluster, then we execute the Spark job, and then we delete it. Look how cool: we will only pay for the time that this cluster is up in the cloud. Then what we are going to do is download the resulting file locally, in Parquet format as I already said, and we are going to visualize it with a Jupyter notebook that we already have created. Here we get a link to be able to view the notebook; we will see it, okay.
Take a quick look at this: this is my file where I have defined the solids to, for example, create a Dataproc cluster, delete it with this one here, or launch the Spark job. It is interesting to note that Dagster already has native integrations with Dataproc, so we will be able to create a Dataproc cluster simply with this create_cluster function here, and we will be able to launch a Spark job simply with submit_job. Nothing else is necessary; everything is already pre-integrated, based on open standards, in this case Hadoop and Spark. And to delete the cluster, the same applies. That is to say, it is already integrated with both Dataproc and BigQuery.
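The video uses Dagster's built-in Dataproc integration; as a hedged alternative sketch of the same create/run/delete chain with custom solids and the plain google-cloud-dataproc client (all names, paths and the minimal cluster spec are placeholders, and the call shapes assume google-cloud-dataproc 2.x):

```python
from dagster import InputDefinition, Nothing, pipeline, solid
from google.cloud import dataproc_v1

# Placeholders: adjust to your own project, region and cluster name.
PROJECT, REGION, CLUSTER = "my-project", "us-central1", "ephemeral-spark"
ENDPOINT = {"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}

@solid
def create_cluster(context):
    # Minimal cluster spec; in practice, paste the "equivalent JSON" that
    # the Dataproc console generates (shown later in the video).
    op = dataproc_v1.ClusterControllerClient(client_options=ENDPOINT).create_cluster(
        request={
            "project_id": PROJECT,
            "region": REGION,
            "cluster": {"project_id": PROJECT, "cluster_name": CLUSTER, "config": {}},
        }
    )
    op.result()  # block until the cluster is ready

@solid(input_defs=[InputDefinition("start", Nothing)])
def submit_spark_job(context):
    dataproc_v1.JobControllerClient(client_options=ENDPOINT).submit_job(
        request={
            "project_id": PROJECT,
            "region": REGION,
            "job": {
                "placement": {"cluster_name": CLUSTER},
                "pyspark_job": {"main_python_file_uri": "gs://my-bucket/job.py"},
            },
        }
    )

@solid(input_defs=[InputDefinition("start", Nothing)])
def delete_cluster(context):
    dataproc_v1.ClusterControllerClient(client_options=ENDPOINT).delete_cluster(
        request={"project_id": PROJECT, "region": REGION, "cluster_name": CLUSTER}
    )

@pipeline
def spark_pipeline():
    # Nothing-typed inputs express pure ordering: create, then run, then delete.
    delete_cluster(submit_spark_job(create_cluster()))
```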
Well, here we have, for example, another solid; this one is a little longer. What it does is download a file from a GCS bucket. Later we will look in a bit of detail at these input parameters we pass here, but the point I simply want you to see is that they are all solids.
Let's start analyzing how Dagster is integrated with dbt. The only thing we have to tell it is where our dbt project directory is, and it will execute it; we can also pass it arguments and it will execute them through the dbt CLI. So we can pass it all the flags and parameters we want. The first thing our pipeline does, therefore, is call dbt. As I said, in dbt I have already created this project.
We are not going to go into the details of everything dbt is, but stay with this idea: it is an open-source framework for executing transformations within cloud data warehouses such as BigQuery or Snowflake; in this case, as I said, we are connecting it to BigQuery. Within dbt, the main concept is the model, which is nothing more than the table we are going to generate. There will be two tables derived from one another; in this case, notice that I simply have two models here.
The first aggregation I do is this one, called stackoverflow_staging, let's say. What I am going to do is generate a table, a copy of sorts, of this table here, which is the BigQuery public Stack Overflow table. Let's see it in BigQuery. We are now here in BigQuery, and here is the table: one row per question.
It has the number of answers, the number of comments, when it was created, the user who asked it, the tags it has been given, and so on. As I said, we do not bring all of it across exactly; what I am doing here is generating, in this case, a view over that staging data, which I can then refer to by name from this other aggregation.
If we run "dbt run" in the project, this transformation is executed using dbt. Notice that it found two models: it generates the view, it does the aggregation, and the final table remains, which works perfectly.
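Dagster ships a dedicated dbt integration, but as a hedged stand-in, a solid that simply shells out to the dbt CLI could look like this (the project path is a placeholder):

```python
import subprocess

from dagster import solid

@solid
def run_dbt(context):
    # Invoke the dbt CLI against the project folder shown earlier.
    result = subprocess.run(
        ["dbt", "run", "--project-dir", "dbt"],
        capture_output=True,
        text=True,
        check=True,  # fail the solid if dbt returns a non-zero exit code
    )
    context.log.info(result.stdout)
```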
What I want to do now is the same thing I have done from the command line, but from Dagster, chaining all the steps together. How can you do that? It is as simple as I said before: we come to the pipeline and we already have this command here; the only thing it is going to do is execute what we want. We are going to execute it in standalone mode first; for that, I am going to comment out the rest of my pipeline and leave only one solid, this dbt one.
Now we have this run executing and, as you can see, it is practically just calling dbt; it does exactly the same thing we saw before. The nice part, as I keep telling you, is that Dagster already integrates the calls to dbt for us, which is quite cool. The run has ended successfully. Looking below, we can see all the metadata being generated and all the output from the warehouse. Here is the direct output: it generated, as I say, the view and this table. Very well.
So what we are going to do now is keep composing this pipeline, making it more and more complex. How are we going to define the relationships between the different solids, so that one is executed before and another after? What we are going to do is embed the calls here, uncommenting what we had commented out before.
If you look at this solid at the end, I am passing it by parameter to another solid that is here: bq_solid_for_queries. What is this? It is a solid that executes queries inside BigQuery, given an SQL string. What I am practically doing is telling it to first execute the inner solid, this one here, and then the outer one. I have laid it out this way to make it a little clearer, because it has many parameters here.
So it is a little more readable. What I am doing here is taking the output of sql_process and, as you can see here, feeding it into the creation of the Dataproc cluster, to generate, let's say, this dependency between steps.
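A minimal sketch of how such a query solid can be declared, assuming the dagster_gcp 0.x API the video appears to use (the project, dataset and table names are placeholders):

```python
from dagster import ModeDefinition, pipeline
from dagster_gcp import bigquery_resource, bq_solid_for_queries

# Factory that builds a solid running the given queries against BigQuery.
run_queries = bq_solid_for_queries(
    ["SELECT * FROM `my-project.my_dataset.stackoverflow_agg`"]
)

@pipeline(mode_defs=[ModeDefinition(resource_defs={"bigquery": bigquery_resource})])
def bq_pipeline():
    run_queries.alias("sql_process")()
```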
Well, so how does this other solid here, bq_solid_for_queries, work? As I was saying, it is another integration that Dagster brings directly, and it brings a lot of them. We can go to the documentation and see all the integrations it has: for example with BigQuery, to create a dataset, delete it, or run a query string, and the one we are going to use now, Dataproc, with GCS, etcetera, etcetera. It has a lot more, as I said: for example with Jupyter, with Airflow; the truth is that the integrations that have been built are amazing. We can see the entire list here: also Dask, for distributed execution; Great Expectations, very cool for all the testing part; MySQL; PagerDuty, to send alerts, for example; Pandas, for processing data frames; PySpark; Slack. All the integrations that have been generated are very cool. Now let's look at the execution of a query.
So we have bq_solid_for_queries, to which we simply have to pass a SQL query. In my case I have defined it in another configuration file, for simplicity and readability, along with some other cloud configuration generated as JSON. So look at the string I have here: it is this export. I am telling it to export to a GCS bucket, in JSON format, with overwrite, and the query I want to run is practically just a string that collects everything from that final table we generated before with dbt.
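The export itself can be expressed as a single BigQuery EXPORT DATA statement; a hedged sketch of such a query string (bucket, dataset and table names are placeholders):

```python
# EXPORT DATA writes query results straight to GCS from inside BigQuery.
EXPORT_QUERY = """
EXPORT DATA OPTIONS(
  uri = 'gs://my-bucket/stackoverflow/export-*.json',
  format = 'JSON',
  overwrite = true
) AS
SELECT * FROM `my-project.my_dataset.stackoverflow_agg`
"""
```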
Okay then: this solid is quite simple. It executes custom queries, in this case inside BigQuery. And what does it do? It instantiates a BigQuery client, builds a QueryJobConfig object, launches the job, and gets the result.
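In plain google-cloud-bigquery terms, that sequence is roughly the following (the project id is a placeholder, and in the video the export query string goes where the trivial query is):

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")         # instantiate the client
job_config = bigquery.QueryJobConfig()                 # default job configuration
job = client.query("SELECT 1", job_config=job_config)  # launch the query job
rows = job.result()                                    # block until it finishes
```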
Well then, connecting this with the pipeline: up to here we have already run some SQL transformations with dbt and exported the table. As I said, the second thing we are going to do now is process the data extracted from BigQuery with PySpark. To process with PySpark we need a Spark cluster, as we have said, and Dagster is also integrated with Dataproc. Dataproc is the service within Google Cloud that allows us to execute Hadoop, Spark, Hive, Pig, etc. on ephemeral clusters. We will therefore have to do three steps: create the Spark cluster, run the Spark job, and delete the Spark cluster.
The good thing is that Dagster, as I said, already has these integrations. Look here at what I am doing. The last thing in this chain of solids is deleting the Spark cluster, which depends on launching the Spark job, which depends on the creation of the cluster, which depends on sql_process, that is, on the previous output.
Therefore, the way to read it is from right to left. So, how do we create the Dataproc cluster? We come here, and notice that we already have a function, as I said, integrated into Dagster, that allows us to create this cluster. What we need is to give it the configuration of the cluster we want to create, and this, as I was saying, is also taken from this other configuration file that is here, which I have generated.
In the Dataproc console we click on create cluster and define everything we want: whether we want to install Conda, for example, and the size of the nodes, say n1-standard-8 machines for the master and six workers. And we can keep customizing it if we want: internal IPs only, properties, etc. Once we have defined the cluster we want to create, the console gives us the equivalent cluster configuration in JSON format, which is exactly what we need to pass here in Dagster. So it is an easier way to generate this configuration, and we make sure we have no errors. It is exactly the same for generating the job.
I point and click, set all the parameters, and at the end it also generates the equivalent REST payload below, which we can just copy. Okay, so notice that here I have already generated this PySpark code; it is here, and I have uploaded it to a Google Cloud Storage bucket, but I also have the code here, and you should take a look at it, because basically what it does, starting from that table we had before, is the typical word count; nothing special about the process. The PySpark job processes this JSON file and then saves the result in Parquet format, right here. And that is the whole of the processing itself. So, nothing special: we use, in that context, this little PySpark program, which you can take a look at, and we keep the pipeline that we had before. So look.
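A hedged sketch of this kind of PySpark word count over the exported JSON (the paths and the column name are placeholders; the real job in the repo may differ):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("so-wordcount").getOrCreate()

# Read the JSON that the BigQuery EXPORT DATA step left in GCS.
posts = spark.read.json("gs://my-bucket/stackoverflow/export-*.json")

# Split titles into lowercase words and count occurrences.
counts = (
    posts.select(F.explode(F.split(F.lower(F.col("title")), r"\s+")).alias("word"))
    .groupBy("word")
    .count()
    .orderBy(F.desc("count"))
)

# Spark writes Parquet with Snappy compression by default.
counts.write.mode("overwrite").parquet("gs://my-bucket/stackoverflow/wordcount")
```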
And finally, what we have is the data we processed with Spark, saved in Parquet format. What has been generated is a small notebook, which we cannot see around here, since the free Community edition of PyCharm, as you know, does not render notebooks well. So what I have done is open it in a local Jupyter notebook; in fact, let's do that, and that way we will see it better.
It is super simple: all it does is read the data in Parquet format and simply plot it with matplotlib, which in the end is a bit crude, because it is a count of all the words, as I said, in the titles of the Stack Overflow posts, and here is the one that is repeated the most, which is fine. The point is simply to show the integration of Dagster with notebooks, which is also very cool. Very good; now that we have analyzed practically all the code, let's make a small summary of the pipeline. As I said, what the pipeline is going to do is: one, call dbt; two, export the data using the SQL statement through the integration with BigQuery; three, generate a Dataproc cluster; four, launch a PySpark job on this cluster; five, delete the cluster. And then, lastly, what we are going to do is download the data (this is the step I specifically want to point out) and, finally, run the notebook, which runs with Papermill. That is the integration that, as we know, runs these notebooks; Papermill comes from the very cool people at Netflix.
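The notebook step uses Dagster's Papermill integration, dagstermill; a minimal sketch of how such a solid can be defined (the notebook path and names are placeholders):

```python
from dagster import InputDefinition, Nothing, file_relative_path
from dagstermill import define_dagstermill_solid

# Wrap a Jupyter notebook as a solid; Papermill executes it with injected
# parameters. The Nothing input only orders it after the download step.
visualize_results = define_dagstermill_solid(
    "visualize_results",
    notebook_path=file_relative_path(__file__, "jupyter/visualize.ipynb"),
    input_defs=[InputDefinition("start", Nothing)],
)
```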
Well, we now have everything ready, and what we are going to do is execute the complete pipeline and take a look at how it goes. So now we simply run dagit with -f and pass it the file.
Here you have all the steps. Okay, we come to the Playground, and what we are going to do is the execution. But careful: look here in red, it is telling me that we need to configure this download_data section, and this is what I want to show you. Notice that this is a solid; we have talked a lot about them. We can give it a series of configuration metadata for many things: for example, a description, which is simply good practice, bringing the documentation closer to the code.
We can define the data that is going to enter it, for example if we were to pass a DataFrame from solid to solid; in our case, as you have seen, everything is written to disk as soon as possible and read back in later steps. And the most important thing is this right here: the config schema, the input parameters used to configure the step, let's say the typical parameters of any process. In this case, what it defines is that this solid admits three parameters. We can therefore have reusable solids, reusable components.
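As a hedged sketch of what such a configurable solid could look like (the three parameter names are illustrative, not necessarily those in the repo):

```python
from dagster import Field, String, solid
from google.cloud import storage

@solid(
    description="Download a blob from a GCS bucket to a local file.",
    config_schema={
        "uri": Field(String),         # bucket to download from
        "file_name": Field(String),   # blob name inside the bucket
        "local_path": Field(String),  # where to leave the file locally
    },
)
def download_data(context):
    client = storage.Client()
    bucket = client.bucket(context.solid_config["uri"])
    blob = bucket.blob(context.solid_config["file_name"])
    blob.download_to_filename(context.solid_config["local_path"])
```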
We are telling it the URI (a URI of the bucket from which to download the file), the name of the file, and where to leave it locally, okay? So that is what Dagster is asking me for here, and it even generates a stub of this configuration in YAML format. We could obviously provide it beforehand or at runtime. I already have it prepared, so I can paste it in; I think it is this one.
I.
A
?
I
put,,
let's
say,,
all
the
parts
of
it
from
the
file
in
parquet
and
until
then
you
already
configured
the
input
data
to
the
input
products.
For
that
step.
This
obviously
is
extensible
to
all
the
solids
that
we
want.
What
happens
that
as
I
say
our
pipeline,
the
It's
true
that
we
have
n't
used
it,
a
lot.
Yes,,
we're
all
giving
it
this
this.
This
entry
description
of
saying
racing,
because
what
we're
simply
chaining
together
is
these
steps,
but
without
passing
anything
between
them,.
So
we
do
have
to
define
this
here.
but
hey,.
A
It's
a
more
technical,
issue,
okay,,
it's
already
launched,
execution,.
It
gives
you
something,
I'll
quote
you
below,.
It
looks
good
here,,
it's
down,
here,,
launch
the
execution,
and
well,.
It's
going
to
start
to
be
done.
Here,.
All
this
movie
that
we've
been
talking,
about,
we're
going
to
analyze
it
a
bit
all
the
steps.
Okay, we can watch it in different ways. The first thing to be executed is the dbt step; then it will export the data for us; then comes the Dataproc part; and at the end we will download the data and view it from the Spark notebook.
As I said, what we have computed is simple: for each post we have taken the titles and generated word counts, ordered, it seems to me, by frequency, and so on. That, in short, is the flow. Let's go back to Dataproc to see if the ephemeral cluster has already been created.
And indeed it has; it was created during the Spark job step that we already see here. It is effectively a PySpark job: we submit it and, as I said, we can see the whole output of the Spark job we executed on the ephemeral cluster. What we have done is simply count the frequency of words, the typical example, and finally, as you can see, we save the result in Parquet format, with Snappy compression. And that is it up to here.
Okay, the end-to-end pipeline has now finished executing. The truth is that this little Jupyter notebook step took quite a while; we can check the timings to confirm that this step is indeed the one that took the longest. Here we can see the notebook output, as I was saying. So that is it: we have taken a quick tour of Dagster, and it is a very interesting tool.
Obviously we have only seen a level-100 view, a broad brushstroke of Dagster; it is much more powerful. I, for example, am researching how to deploy it, for example on a Kubernetes cluster or things like that. And, to finish the video, I think we have illustrated quite well these three principles of data engineering...
...in 2021 that we have been discussing. Separation of responsibilities and specialization: we have specific object storage for the data lake; we have a specific repository, BigQuery, for SQL-based data processing; we have ephemeral Dataproc clusters to run code at Spark scale; and we have notebooks for visualization and data storytelling. And then, obviously, there are many more tools appearing in this space: the ingestion part, the streaming part, etcetera, etcetera.
Second, everything as code: note that we only used the visual interfaces for what they are for, visualizing, not for developing. All the code we wrote in Python, where we are going to have versioning in Git, so that we can apply good software principles such as testing, CI/CD, documentation close to the code, etcetera. And finally, ecosystems, no vendor lock-in, and open standards: we have used Python, we have used SQL, we have used Spark; they are all standards and all open-source projects, precisely to avoid vendor lock-in.
And lastly, we have also seen deployment in the cloud, to take advantage of the cloud; the only thing running locally in this case was the orchestrator itself, but obviously, as I said before, I am now investigating Dagster deployments on GCP, for example on a Kubernetes cluster, which works very well. So that is it for the video. I hope you have found this introduction and this reflection on data engineering in 2021 and its modern principles interesting: stepping back a bit from the machine learning phase to, as I said, this more initial part of the ingestion, processing and transformation of data.