From YouTube: Pipelines for MLOps Use Cases: why does it become CREATE? [IncEng MLOps - March 30th 2022]
Description
This week we talk more about pipelines, what we have worked on so far and why they are so important for MLOps.
Pipelines with Stubbed Jobs: https://gitlab.com/groups/gitlab-org/-/epics/7681
Citer: https://gitlab.com/gitlab-org/incubation-engineering/mlops/poc/citer
Glyter: https://gitlab.com/gitlab-org/incubation-engineering/mlops/glyter/-/tree/poc/glyter
All updates: https://gitlab.com/gitlab-org/incubation-engineering/mlops/meta/-/issues/16
This Update: https://gitlab.com/gitlab-org/incubation-engineering/mlops/meta/-/issues/48
Hello everyone, and welcome to another update for Incubation Engineering at GitLab for MLOps. My name is Eduardo, and today we're going to talk about pipelines as part of the MLOps Create stage. I've been touching on this a little bit in my past few updates, on some of the efforts that I have already done or some that I want to do, but I have never taken the time to talk about it from a higher, bird's-eye view.
So today I want to talk about the vision that I have for this. Where do we want to get? Why is this important for MLOps? How do data scientists create, and how does the pipelines feature fit into this Create process? This is going to be a longer update than usual, but hopefully it's going to be an interesting one.
So, for those that don't know: at GitLab we split DevOps into a lot of different stages, so you have Create, Plan, Verify, and so on. If you want to learn more about the stages, you can follow the link on the previous slide, but today I'm going to talk specifically about how data scientists create.
First and foremost, these are the tools of the trade: data scientists mostly use Python, R, and Jupyter to create their things, along with whatever IDE they use for Python and R scripts, like RStudio, VS Code, and whatnot. But Python, R, and Jupyter are almost the de facto tools of the trade. And something very interesting about how data scientists create in Jupyter is that it's very iterative by default. You have one cell that you edit, then you run the next cell, then you go back to the previous cell, and you go forward and backward and forward again. From this you can already see how pipelines are taking shape. Think about it: each of these steps, each of these cells, is a job.
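To make that mapping concrete, here is a toy notebook session written as plain Python, with a comment marking the pipeline job each cell would naturally become; the computations themselves are just placeholders:

```python
# Illustration only: three notebook "cells", each a natural pipeline job.

# cell 1: prepare the data (would become a "prepare" job)
data = [x / 10 for x in range(100)]

# cell 2: fit a toy model (would become a "train" job; in practice the slow one)
mean = sum(data) / len(data)

# cell 3: evaluate (would become an "evaluate" job); while iterating you hop
# back and forth between these cells, re-running them out of order
error = sum((x - mean) ** 2 for x in data) / len(data)
print(error)
```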
So you can already see that this is shaping its way towards a pipeline. But not only that: even when we don't use Jupyter, it's very common that we need remote code execution. So, for example, I need to run something long; I need to train a model, but training a model can take eight hours, or a lot of time, and I want to run multiple trainings at the same time.
So I can delegate this to a cluster or something like that, and this is part of Create. I'm not talking here about CI/CD, I'm talking about creation. This sometimes happens when I don't even have the code on Git yet; it's still living on my computer, in my local development setup. Take Ray, for example: Ray is a distributed solution, a company that implements distributed computing solutions for data science. You can see they already have, for example, ray.remote, which says that a function should execute on a specific box or a remote instance.
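For context, this is roughly what that looks like with Ray's actual API; the training body and the parameter grid here are placeholders:

```python
import ray

ray.init()  # connect to a local or remote Ray cluster

@ray.remote
def train(params):
    # the long-running training work happens on a Ray worker, not locally
    return {"params": params, "score": sum(params.values())}

# dispatch several trainings in parallel, then gather the results
param_grid = [{"lr": lr} for lr in (0.1, 0.01, 0.001)]
futures = [train.remote(p) for p in param_grid]
results = ray.get(futures)
print(results)
```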
Same thing with Kubeflow Pipelines, which is also commonly used by data scientists to implement these pipelines: you build a DAG to run your data science workflows. And again, this is on the Create side; we're not talking about Verify's CI/CD.
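As a sketch of what "building a DAG" means in practice, here is a minimal Kubeflow Pipelines definition using the current KFP SDK; the component bodies are placeholders, and the exact decorator syntax has shifted across KFP versions:

```python
from kfp import dsl

@dsl.component
def fetch_data() -> str:
    # placeholder: would pull the real training data
    return "s3://bucket/dataset"

@dsl.component
def train(data: str) -> str:
    # placeholder: would train and persist a model
    return f"model trained on {data}"

@dsl.pipeline(name="toy-training-pipeline")
def pipeline():
    # the DAG: train depends on fetch_data through its output
    data_task = fetch_data()
    train(data=data_task.output)
```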
So you can see how often we have to train a model right from the Create stage, and training a model is not cheap. It's not fast. It's often not able to run on your own machine; you have to offload it to a more powerful machine that has a better GPU or whatnot. The same goes for model debugging. For example, if I want to debug some bias/variance issues with my model, I can plot graphs like this, but note that each point on this graph is a trained model in itself.
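As an illustration of that last point (not from the talk): a scikit-learn validation curve, a standard way to debug bias/variance, trains one model per parameter value per cross-validation fold:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
param_range = np.logspace(-3, 2, 6)

# 6 parameter values x 5 CV folds = 30 trained models behind one plot
train_scores, test_scores = validation_curve(
    SVC(), X, y, param_name="C", param_range=param_range, cv=5
)
print(train_scores.shape, test_scores.shape)  # (6, 5) (6, 5)
```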
And just to show a little bit more how striking this is: these are all the offerings from Ray, from their website. They have all of these great libraries, all of these great products, and they live right on Create; all the distributed libraries that they have are already about Create. So yeah, it's just essential for MLOps that pipelines be part of Create, that they be seen from the Create perspective, not only later down the line. And in this space there are very big user pain points that have not been addressed. First of all, it's a fragmented market.
You have a lot of open source tools, where each one does something small, or doesn't do it very well, or whatever, but they all require you to learn that specific tool, they all require you to maintain that specific tool, and they don't always have the best UI. It's just really complicated for data scientists to come in and choose or bring in a tool in this area, because they are also not part of the infrastructure team. Often you have the infrastructure team and you have the data science team, and communication is not always the best between them at some companies, so it's really hard for the data scientists to get the tooling that they want. I think we can fill in this space. I think we have a lot of opportunity here for GitLab to grow into this area: we are already part of the stack at many of these companies, so there's no additional cost of maintaining yet another tool in the stack, and we have great documentation.
The UI and the use cases were created with pipelines as part of CI/CD, and what we need to do is expand these pipelines to be part of both Verify and the Create stage. When we take off the glasses that only look at this from Verify and start looking from the Create stage, then we can start seeing what other features we can add and how we can improve this tooling for this use case specifically. So, for example, one of the things we can build on here: we have the whole infrastructure of GitLab Runners already implemented.
The DevOps team at the company, at the users' company, already knows how to create a box or a runner or anything. We can use these GitLab Runners as a backend for remote code execution.
It also means that, while I'm iterating, and imagine that some jobs can take six, seven, eight hours, I'm not going to be committing every change that I make just so that it gets published or starts a pipeline. Wait for a git commit before anything runs? No, it needs to be faster than that.
So I have already created a pipeline before; I already have my steps, I already have my stages over there. Why can't I just port that pipeline to Verify, to the CI/CD part?
So when I talk about runners, I talk about GitLab as a backend: when you say, for example, ray.remote, that call goes to a GitLab Runner. Or, in a similar model, suppose that we had a gitlab.remote, for example, that calls a runner that runs the job and returns the answer to the local machine, so that it collects all the answers together. Or, if I'm working on a Jupyter notebook, it could be that each cell runs remotely on a runner.
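To be clear, gitlab.remote is purely hypothetical; nothing like it exists today. Here is a toy sketch of the ergonomics being described, with a local thread pool standing in for the runner fleet so the snippet actually runs:

```python
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor()  # stand-in for a fleet of GitLab Runners

def remote(fn):
    """Hypothetical gitlab.remote-style decorator."""
    class RemoteFn:
        def remote(self, *args, **kwargs):
            # The real version would package fn and its arguments, ship
            # them to a GitLab Runner, and hand back a job handle.
            return _pool.submit(fn, *args, **kwargs)
    return RemoteFn()

@remote
def train(epochs):
    return f"model trained for {epochs} epochs"

future = train.remote(8)   # dispatched; the local session stays responsive
print(future.result())     # the answer comes back to the local machine
```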
Even though the notebook document itself is served locally, the code is executed on a different runner, one that is a lot more powerful, that has a GPU, that has however much memory you need. That is what I mean. But the problem here is that pipeline execution right now is heavily coupled to CI and to the Git flow.
You can't really do this kind of process without going through the Git flow, without going through git commit, git push, start a pipeline, because you need to push code changes as well. So you'd be running pipelines where the code is not on the server; it's not in a Git repository yet, it's still on your local machine, it still only exists there. It's not in the repository yet. So the current setup of GitLab doesn't work for this; it's not prepared for this.
It was never coded with this use case in mind. What I did so far is Citer. Citer is a POC that I created a few weeks ago: you have your local code, I created a pre-configured project on GitLab, and I used a lot of creative engineering to send the code to it through a triggered dynamic pipeline that runs on a runner. My local process keeps watching the runner, and once the runner is done, it returns the results.
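A minimal sketch of that watch-and-return loop using the python-gitlab client; the project ID and token are placeholders, and Citer's actual mechanics for shipping the local code up are more involved than this:

```python
import time
import gitlab

gl = gitlab.Gitlab("https://gitlab.com", private_token="YOUR_TOKEN")
project = gl.projects.get("YOUR_PROJECT_ID")

# Kick off a pipeline on the pre-configured project; in Citer the local
# code has already been sent up by this point.
pipeline = project.pipelines.create({"ref": "main"})

# The local process keeps watching the runner until it finishes.
while pipeline.status in ("created", "pending", "running"):
    time.sleep(10)
    pipeline.refresh()

print("pipeline finished with status:", pipeline.status)
```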
So, from the perspective of the developer who is working on the code, everything happens on their machine. They don't need to open GitLab for anything; everything stays local. It's still very limited, very hacky, but it shows a little bit of how we can do this. Citer could easily be evolved into, for example, a VS Code plugin, so that we can run pipelines directly against the repository, or on GitLab.
The second problem: imagine that I have this pipeline with five steps. I'm testing them out, or not even testing, I'm actually working on them, and job D fails. Now suppose that A is preparing the environment, B is fetching the data, C is training the model, and D is uploading the model, and suppose that the first three steps take 24 hours to finish, which is not uncommon at all.
Rerunning entire pipelines for MLOps is really, really costly, and one way to avoid this is with the concept of stubbed jobs, which is: OK, if I'm going to rerun from D, let's find the last successful run of C and use the output of that pipeline, so that I can run the new D on top of it.
I have an epic for this over there, where I'm gathering not support but feedback: how we could approach this problem, how we could work towards minimum viable products for this. This is really game-changing for MLOps, and for many other use cases too, but for MLOps it is very important.
A
It
allows
teams
to
focus
on
iterating
on
only
one
step
of
the
time
that
one
step
at
a
time
and
reuse.
What
was
done
before
so
cube
flow
implements.
This
with
cache
plumber
implements
this
as
well
a
lot
of
the
the
the
pipeline
tooling.
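For intuition, here is a minimal sketch of the stubbed-jobs/caching idea, not GitLab's or Kubeflow's implementation: fingerprint a job by its name, inputs, and code, and reuse the output of the last successful identical run instead of re-executing it:

```python
import hashlib
import json

def job_fingerprint(job_name, inputs, script):
    """Identity of a job run: its name, its inputs, and its code."""
    payload = json.dumps(
        {"job": job_name, "inputs": inputs, "script": script}, sort_keys=True
    )
    return hashlib.sha256(payload.encode()).hexdigest()

def run_or_stub(job_name, inputs, script, cache, execute):
    key = job_fingerprint(job_name, inputs, script)
    if key in cache:              # an identical run already succeeded
        return cache[key]         # stub the job: reuse its output
    output = execute(inputs)      # otherwise actually run it
    cache[key] = output
    return output

# usage: rerunning "train" with unchanged inputs/code skips the 24 hours
cache = {}
run_or_stub("train", {"data": "v1"}, "train.py", cache, lambda i: "model-1")
run_or_stub("train", {"data": "v1"}, "train.py", cache, lambda i: "model-1")
```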
Why couldn't I just have something that automatically takes the pipeline I have already created, translates it into the YAML that GitLab is going to read, and runs that for me? No need to transform anything by hand; it works by default and speaks the language of the user. That is what Glyter is. It's another POC that I have; it currently transforms a Jupyter notebook into a GitLab pipeline. It's also very limited, but it showcases a little bit of the experience that we can offer users on the Create side.
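A minimal sketch of that translation, not the actual Glyter code: split a notebook into one script per code cell, then emit a .gitlab-ci.yml that runs each cell as one job, in cell order:

```python
import json
import yaml  # PyYAML

def notebook_to_gitlab_ci(notebook_path):
    with open(notebook_path) as f:
        nb = json.load(f)

    ci = {"stages": []}
    cells = [c for c in nb["cells"] if c["cell_type"] == "code"]
    for i, cell in enumerate(cells):
        script_name = f"cell_{i}.py"
        with open(script_name, "w") as f:
            f.write("".join(cell["source"]))   # one script per code cell
        stage = f"cell-{i}"
        ci["stages"].append(stage)
        ci[f"run-cell-{i}"] = {                # one CI job per cell
            "stage": stage,                    # stages preserve cell order
            "image": "python:3.10",
            "script": [f"python {script_name}"],
        }

    with open(".gitlab-ci.yml", "w") as f:
        yaml.safe_dump(ci, f, sort_keys=False)
```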
You cannot talk about MLOps and not talk about pipelines in the Create stage. It's just not possible; you need pipelines in the Create stage. It's just so important from the beginning of working on an ML model, of working on a data science use case, starting from DataOps; it's already in there. So we need to step back a little bit and put on these new goggles for our pipelines, looking at them from the Create stage: what are the pain points when creating a pipeline?
Because when you use it for Verify, you configure the pipeline once and then it runs; it keeps running for all the commits or whatnot. It's a lot more reusable than in the Create stage, where you're always creating a new pipeline: you start a new piece of code, you start a new pipeline.
A
So
just
to
summarize
what
I've
been
working
on
and
what
I
want
to
work
on.
Citer
is
about
decoupling
running
a
pipeline
from
ci.
It's
about
the
coupling
running
up
or
the
the
the
running
the
pipeline
from
git
flow.
It's
about
running
a
pipeline
that
is
not
there
where
the
code
is
not
even
not
even
the
configuration
or
the
code
or
the
necessary
files
are
even
on
the
repository
yet
so
I
need
to
upload
these
files
or
it
downloads
from
somewhere
else
running
pipelines
with
stub
jobs
is
about
making
it.
It's
about.
Running pipelines with stubbed jobs is about reusing; it's about optimizing time and resource usage for our users, making sure a job only runs when it needs to run while iterating. And Glyter is about making it that simple to transform a pipeline that you already had in the Create stage into a CI configuration that you can run in the Verify stage. And that's what I had for today. Thanks for sticking with me, have a good one. Bye.