From YouTube: Implementing DataOps with GitLab
Description
Emilie Schario talks about implementing DataOps with GitLab and answers questions from the Customer Success organization at GitLab.
A: Hello everyone, and welcome to another exciting installment of the CS Skills Exchange. Today is going to be an abbreviated session, since we're going to be ending at the top of the hour for Michael McBride's AMA. But today we're going to be talking about how to implement DataOps using GitLab, and we have Emilie Schario here to give us an overview and answer any of your burning questions. So without further ado, Emilie.
B: Thanks, Chris. Hey folks, my name is Emilie Schario. I've been at GitLab for almost two years now. I was the first data analyst at GitLab, I moved into a data engineering role, and now I'm on the Chief of Staff team as an internal strategy consultant. Chris put together an awesome issue which links to the video of a talk I gave at GitLab Commit in San Francisco. I'm not going to give that talk here, but I want to share the general ideas behind it. So the first thing is: what is DataOps?
We all know what the DevOps figure eight is, right? I don't need to explain that piece to you, but DataOps builds on top of it. Let me switch marker colors. DataOps is about applying the principles that we've seen be effective in DevOps to data. The key difference is that data comes into this: you treat it, you develop on it, the way you would in DevOps, the way you would build a software application, and then what comes out is your analysis.
The bottom of the pyramid is reporting. These are what I call facts. Reporting is things like how many new users signed up on our website, or how many pipelines were run last month. They're very fact-based questions: you get the information. Then the next level is what I call insights. Insights are where you combine two pieces of information from different data sources, so that you actually provide business value.
As a really practical example around your work at GitLab, you might be interested to know which stages of the product people use that indicate they're going to keep using the product a year from now. (Sorry, quick aside.) So that's an insight: two pieces from two data sources combining to give you valuable information. And then the final piece, the top of the pyramid, is what I call predictions. We're not talking about fancy machine learning models or anything like that. It's just saying: okay, based on X and Y, where do we expect to be 12 months from now?

Most data teams are still stuck in the bottom square right here, reporting. That's because they spend a lot of their time building analyses to answer simple fact-based questions that break when something in their data model changes, so they spend all their time maintaining here and never get to move up the stack. The big problems data teams have are data integrity, data quality, and data reliability, and DataOps is the way to combat that. So that's the short version.
C: [question not captured in the recording]

B: Great question. There's one thing I'll start by answering, and that's by saying there are two general approaches to data movement. The first you may have heard of: this is ETL, right? Extract, transform, load. Depending on who you're talking to, this is going to be the norm they talk about. It's kind of an older way of working; the norm today is ELT: extract, load, and transform. And the reason for that is that your business logic will change.
Let's think of a practical example, where we're an e-commerce company and we have a website. We might say that a new user is someone who has never visited our website before, period. Over time, as you get more data on your users, you're going to be able to connect that: there are their cell-phone visits and their computer visits, and then also... I don't know if you can see, there's a laptop right there.
B
That's
my
personal
laptop,
that's
right
there,
that's
my
personal
laptop
and
so,
as
you
collecting
more
data,
you're
gonna
want
to
be
able
to
see
like
someone
who
connects
from
their
work
laptop
and
their
personal
laptop
and
their
iPad,
or
multiple
cell
phones
and
all
that
kind
of
stuff,
and
so
your
definition
of
new
user
will
evolve
as
you
get
additional
information.
So
the
appeal
of
the
e-elt
approach
is
that
when
you're
business
logic
changes,
you
don't
need
to
move
all
your
data
again
because
it's
already
stored
in
your
data
warehouse.
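To make that concrete, here's a minimal SQL sketch of how a "new user" definition might live in the transform layer under ELT. This is purely illustrative; the table and column names are hypothetical, not GitLab's actual schema:

```sql
-- Hypothetical transformation: resolve each visit to a canonical user
-- where an identity mapping exists, falling back to the anonymous id.
-- Because the raw visits already sit in the warehouse (ELT), tightening
-- this definition later just means re-running the query, not re-loading.
with resolved as (
    select
        coalesce(im.canonical_user_id, v.anonymous_id) as user_id,
        v.visited_at
    from raw.website.visits as v
    left join analytics.identity_map as im
        on v.anonymous_id = im.anonymous_id
)

-- A "new user" is anyone seeing the site for the first time.
select
    user_id,
    min(visited_at) as first_visit_at
from resolved
group by 1
```

Under ETL, the same change in logic would mean re-extracting and re-loading history before the new definition could be applied.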
There are three major data warehouses on the market: Snowflake, BigQuery, and Redshift, with BigQuery being Google's and Redshift being Amazon's. Those are definitely the three biggest players; I've seen them cover probably 85% of the conversations I've had, with an asterisk being that a big part of what's left is teams that are using Postgres for analytical purposes. That's okay to get started with. It won't scale as well as an analytical data warehouse, but it's a great place for teams to start.
Okay, so data comes from wherever, it gets loaded into your database, then you do things to it, and then it's ready for consumption, and your consumption is handled by a BI tool or something like a Jupyter notebook. At GitLab, data comes from a myriad of sources, like Salesforce and Zuora and product usage and things like that. It gets loaded into Snowflake, so everything in this section happens in Snowflake, and then our BI tool is Sisense.
This is a pretty standard structure for a lot of data teams, where machine learning and data science happen here, but this middle part is pretty similar. dbt, which is what this logo is, is kind of the premier open-source tool that does this. There are a couple of others: Dataform is a paid one, Matillion is another, and people have homegrown versions of this, but dbt is growing in popularity. They've been around for about four years.
D: [question not captured in the recording]

B: So, transform: this is GitLab's dbt project. By the way, this project is public, and on the sales calls that I've been on, people have found it really useful to point customers to it after calls, so take note of this link. This is our dbt project, which has multiple parts to it. Models are kind of the core part of what dbt does, and you'll see that all of these models... I'm thinking of a good example.
Let's use Salesforce, because it's a pretty canonical example. Salesforce opportunities: this is the model that underlies the Salesforce opportunity analysis. You know, we take things from the opportunity table, we do some cleaning and renaming, and we clean some stuff up; this is our transformation. So if we look back, this is the transformation that's being stored in SCM. Our dbt testing is also stored in SCM: if we go back to our dbt project and go to tests here, same thing, you can find the tests, and we'll see snapshots.
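For a rough sense of the shape of such a model (this is an illustrative sketch, not the actual model from the public project, and the column names are simplified assumptions):

```sql
-- Sketch of a cleaning-and-renaming dbt model over the raw Salesforce
-- opportunity table; names here are assumptions for illustration.
with source as (

    select * from {{ source('salesforce', 'opportunity') }}

),

renamed as (

    select
        id        as opportunity_id,
        accountid as account_id,
        stagename as stage_name,
        amount    as opportunity_amount,
        closedate as close_date,
        isdeleted as is_deleted
    from source

)

select *
from renamed
where not is_deleted  -- drop soft-deleted records as part of cleanup
```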
Documentation: dbt ships with really great documentation. We host ours on GitLab Pages, at dbt.gitlabdata.com. You can see everything that's going on there; it takes a second to load because it's JavaScript, but then we can go see that same Salesforce model I just showed you, that sfdc_opportunity right there, and its documentation: the columns, what they are, what's going on here, and you can see kind of the DAG.
So this is why dbt is really great in terms of where all of the rest of GitLab, outside of SCM, fits in. Plan: the GitLab data team uses the Plan features just like the engineering teams would. CI is a really great way for teams to get started. When we use merge requests, because we are making changes against those things that we just saw, we use a process leveraging one of the Snowflake features, called zero-copy clones, where we make a clone of the production environment.
That's a dev environment, and it's created whenever a merge request is created. I can link you to the GitLab CI configuration, but it allows the merge request to run in a clone of production, and that way we can see the data after the new changes and compare its state in the merge request to production, to see what the side effects are.
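The clone itself is cheap because Snowflake shares the underlying storage until data diverges. A sketch of what a CI job might run (database names are hypothetical):

```sql
-- Zero-copy clone of production for a merge request's pipeline to build into.
create or replace database analytics_mr_1234 clone analytics;

-- Torn down when the merge request closes.
drop database if exists analytics_mr_1234;
```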
So that's one way we use CI. The other thing is that some teams use CI pipelines for actually running their dbt deployments. That's hard when you have a large dbt deployment. When I started at GitLab, the data team was three people, and back then we were able to use GitLab CI to manage our deployments, and that was really great. As your deployment gets bigger, so does your dependency graph; it's called a DAG, a directed acyclic graph. When it looks like this, these are only the things related to Salesforce opportunities.
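The DAG comes from dbt parsing ref() calls: any model that selects from another model declares a dependency edge. A hypothetical downstream model, for example:

```sql
-- dbt sees the ref() below and will only build this model after
-- sfdc_opportunity is built; that's how the dependency graph grows.
select
    owner_id,
    count(*) as open_opportunity_count
from {{ ref('sfdc_opportunity') }}
where stage_name not in ('Closed Won', 'Closed Lost')  -- assumed stage values
group by 1
```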
If we were to look at everything in this graph (we're going to update it... you can see how big... and there you go), it's big and complicated and there are a lot of different slices to it, and so we use Airflow, which is the canonical tool here; we host Airflow ourselves, and I'm happy to link you to the docs on that. Does that answer your question, Kurt?
D: Yeah, it does. I mean, the question I get, or at least recently got, is, "How does GitLab do data science?", and that's such a broad topic, right? So for me, my thing is how to filter that down, whether that's with that triangle graph you showed: are you doing reporting? Are you doing analysis? That's the story I'm trying to tell. So I think that gives a high level. You know, data science is new to me and probably new to a lot of other people, so there are still some concepts I'm trying to grok, and some of the tools and the logos I'm trying to understand, like what they do and why one is better than the other. But yeah, it does, I think it's...
B: Two things jump out to me from that question. One: see if you can push to understand what they mean when they say data science, because it's a really ambiguous term. Some people use it to mean data analytics. For example, Lyft and Stitch Fix, two very well-known companies in the data space, say data science and include data analytics in it, just simple, straightforward reporting. That's really different from how we use the term, how I even use the term. I don't think data science is an umbrella term;
I think data is the umbrella term, and data science is a subset of that. But, you know, everyone has their own language. So, one: push to see if you can get a better answer around what they're doing. Do they mean analysis? Do they mean data engineering, which is the movement of data? Do they mean machine learning? Do they mean advanced statistical methods, which is what I would actually mean by data science?
Understanding what they mean when they say data science is going to be the better way to figure out how to push the conversation. Because if the answer is advanced statistical methods, what they mean is Jupyter notebooks. I'm not a fan of Jupyter notebooks, so unless that's what you're looking for, you probably shouldn't recommend that; but GitLab does a lot to make that part easy. So that's number one. And then, number two, I'll work with Chris to get another session on the calendar in the future. Yeah.
A: Absolutely. And please (here's a shameless plug) fill in those issues with comments and questions, as the more you all give us, the more detailed we can be and the more we can do deep dives where they're most beneficial. Joe, I see you typing today, but we're happy to dive deep into any of these different topics we've touched on today. Joe?
B: Oh, actually, while I'm doing that, can you share my screen? Here is our data structure page; I'll drop it into the doc next, but you can see here what the diagram for our data structure looks like. There are a number of third-party data sources, right: Zuora, Zendesk, Salesforce, NetSuite, we've all heard of these. We use off-the-shelf ETL tools such as Stitch and Fivetran to move the data into our Snowflake data warehouse. That's what this big box is, and then we do those transformations.
We talked about them; see those little dbt logos there. That triggers kind of what's ready for analysis, and then people write queries in Sisense that run on Snowflake. So this is the really short version. All of the orchestration that happens here happens with Airflow, so I will drop this link in the chat, but that's a good place to start.
I just muted myself accidentally, sorry. We use a hosted Kubernetes cluster for that. And one of the cool things about the data team project, like I said, is that it's all public: gitlab.com/gitlab-data.
E: An awkward question, which might be naive, but I was thinking: there's this whole ELT which we're running in GitLab, and that determines the target schema into which you process the raw data. Every time you change it, would you rerun those transformations for all the data, including the historical data? How long would that take? What are the performance implications of this?
B: So the answer is: it depends on the volume of data. For some volumes of data, where it's small (data sources other than GitLab.com, for example, like smaller data sources), then yes, we rerun everything every time, because they're small enough that it's not a performance hit, and it's actually better to just rerun the whole thing. Where we're talking about large quantities of data, we only transform the new slice.
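In dbt terms, "only transform the new slice" is an incremental model. A minimal sketch, with assumed source and column names:

```sql
-- Incremental materialization: a full refresh rebuilds the table, but a
-- normal run restricts work to rows newer than what's already built.
{{ config(materialized='incremental', unique_key='event_id') }}

select
    event_id,
    user_id,
    event_type,
    occurred_at
from {{ source('product', 'usage_events') }}

{% if is_incremental() %}
  -- {{ this }} refers to the already-built table in the warehouse
  where occurred_at > (select max(occurred_at) from {{ this }})
{% endif %}
```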
E: [reply not captured in the recording]

B: For the record, there's also another whiteboard you can't see; I just have it positioned for calls. But cool. So we've got a minute left. Oh, it looks like there's more demand, so we'll definitely work on getting something scheduled for the future. Does anyone have a final question they want to get out?
B: Yeah, so there are two parts to answering this question; it's actually a super complicated question. Because if they're saying, "Hey, we're getting started with data science," my first thing is: do you already have a data analytics organization, or are you actually getting started with a data analytics organization? Because if that's the case, then my answer to them is going to be that they need to start with source-controlling their transformations, just getting started with Git.
For a long time, data scientists and machine learning engineers have been coming from academia, where best practices like version control are still a foreign concept. So we need to spend time educating on that, and we need to really help them think about how to grow and scale and create reproducible analyses for their org. In the same way you think of selling to developers, where you don't have to convince them of version control, here you need to teach what version control is and convince them that they need version control.