From YouTube: Applied ML - Building MLOps pipeline in GitLab for Suggested Reviewer - "The first MLOps template"
Description
This video walks through the steps of building an end-to-end ML pipeline, automated from data extraction to bot service.
A: Okay, hi everyone. Today is really exciting for us: we will be showcasing how we built the first MLOps pipeline for the Applied ML team.
We've had a lot of questions on how we've used GitLab CI in building machine learning models, and here is our full pipeline that goes from DataOps, to MLOps, to connecting to the frontend. We'll go into a lot more detail on it. So to begin with, I'll start by sharing my screen: this is just a little bit of the basics of how the reviewer recommender process actually works.
In the background, the first part is really data. We are using the merge request API to extract data. So the first phase is the GitLab CI service triggering that process: pre-extraction, setting up the right environment, extracting, ingesting, and processing. Then it goes from that DataOps part to the MLOps part, where we trigger the training of the model, the tuning, selection, and serialization of the model. All of that is done in Google Cloud Storage, and that is then connected to our final step, where we serve the model and send the output to the bot.
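As a rough sketch, that overall flow could be expressed as CI stages like the following (the stage names here are illustrative, not the team's actual configuration):

```yaml
# Illustrative stage layout for the end-to-end pipeline
stages:
  - extract     # DataOps: pull merge request data via the API
  - transform   # DataOps: prepare training and test data sets
  - train       # MLOps: tune, select, train, serialize
  - publish     # push the serialized model to Google Cloud Storage
```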
That's the bot that our internal customers see, suggesting the reviewer for a certain MR. The frontend can also change later, with a lot more detail: we would add the model monitoring and observability part, and also change the way we serve the experience, from a bot to an actual frontend UI.
So now I'm going to let Andreas and Alexander go into a lot more detail on the pipeline.
B: And show you what these pipelines actually look like. For each project, we create a scheduled pipeline which triggers every three days.
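In GitLab CI, jobs can be gated so they only run when the pipeline is started by a schedule; a minimal sketch (the job name and script are placeholders):

```yaml
extract:
  rules:
    # Run this job only for scheduled pipelines
    - if: '$CI_PIPELINE_SOURCE == "schedule"'
  script:
    - python extract.py
```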
The reason why we decided to combine them into this one single YAML file: well, there are two reasons. First, we wanted to keep the sequential flow, making sure the extraction comes before the transformation and the transformation comes before the training, and this seemed like a very convenient way to do that. But also, each of these jobs has its own repository, with its own pipeline for running unit tests and doing dependency scanning, and this way we can keep that separate from the actual model training pipeline.
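That combination can be done with GitLab's remote includes; a sketch, assuming hypothetical project paths and file names:

```yaml
# One pipeline definition assembled from per-repository CI files
include:
  - project: applied-ml/extractor      # hypothetical project paths
    file: /ci/extract.yml
  - project: applied-ml/transformer
    file: /ci/transform.yml
  - project: applied-ml/trainer
    file: /ci/train.yml

stages:
  - extract
  - transform
  - train
```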
B
Otherwise
we
just
get
the
data
since
the
last
job,
then
we
run
the
actual
extraction
job.
As we do here for the transform job, we need to get the actual main.py that we run to transform the file. We need to fetch it, because when we include a remote YAML we only include the instructions, not the actual repository, so these files we need to clone as well. We do the same for the training job. Again, we pass this along with artifacts, and once the transformation and the training are done, we use some small bash scripts to actually persist this into our database.
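A minimal sketch of that pattern, assuming a hypothetical repository URL and file names: the job clones the code that the remote include refers to, runs it, and hands its output to the next stage as an artifact.

```yaml
transform:
  stage: transform
  before_script:
    # The remote include only brings in CI instructions, not code,
    # so clone the repository that actually contains main.py
    - git clone "https://gitlab-ci-token:${CI_JOB_TOKEN}@gitlab.com/applied-ml/transformer.git"
  script:
    - python transformer/main.py --input extracted.json --output transformed.json
  artifacts:
    paths:
      - transformed.json   # passed on to the training job
```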
A: That's great. Actually, going back to the jobs, Alexander, could you explain all the different stages, from pre-extract to extract to pre-transform, all the way to the post-pipeline stage?
C: Let me first share the same pipeline. So, for instance, this is the MLOps pipeline for our internal handbook.
So just to sum up what Andreas said: the pipeline consists of many jobs, most of them just for housekeeping, but mainly there are three main stages. The first is to extract data, so this one right here. Then there is the stage to transform that data: we use Dataflow jobs to transform the data and also to move it. Underneath we also use Pub/Sub between the extract stage and the transform stage, and we have some Dataflow jobs that move the data from Pub/Sub to Google Cloud Storage. Then we also have the transform job.
This is also a Dataflow job, used to transform the data into prepared training and test data sets. And finally there is a very important stage: the training stage. First we tune hyperparameters, we select the best model for a given project, and then finally we train the final model, which will be published to Google Cloud Storage and served later on each request.
So now let me focus on each of these three stages. We can also check the MLOps CI file that we have right now. As Andreas said, these three stages are located in individual projects. So this is the extract stage; right here we have the transform stage; and here we have all the jobs that relate to the training stage. So let's go.
If you go to our extractor repo, to the ci folder, you will find the YAML file that is included in each MLOps pipeline. If we check this YAML file, we'll see that it has only one extract-merge-requests CI job, which mainly extracts all merge requests from one date to another date. We take these dates from the Postgres database that we use underneath our MLOps pipeline.
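A sketch of what such a job might look like (the script name and variables are placeholders; the dates would be resolved from the Postgres bookkeeping database):

```yaml
extract_merge_requests:
  stage: extract
  script:
    # FROM_DATE and TO_DATE come from the Postgres database
    # that tracks what has already been extracted
    - python extract.py --project-id "$PROJECT_ID" --from "$FROM_DATE" --to "$TO_DATE"
```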
So this is just one command to extract, as I said, merge requests with approvers and also with the diffs, because right now the model works based on changed files. That's why, for each merge request, we also need to extract the changed files. That's all for the first extract stage. One more thing: we use batches of about 50 requests.
Okay, if we go to this ci folder, we will find almost the same file as in the extractor repo. We have only one job here as well, just to transform our extracted merge requests and prepare the training and test data sets. This is a Python project: we use the Python SDK to write the Dataflow job, and using this command we create the Dataflow job. So this is the runner.
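A sketch of launching an Apache Beam pipeline (written with the Python SDK) on Dataflow from CI; the project, region, and bucket names are placeholders:

```yaml
transform:
  stage: transform
  script:
    # Submit the Beam pipeline to Google Cloud Dataflow
    - >
      python main.py
      --runner DataflowRunner
      --project my-gcp-project
      --region us-central1
      --temp_location gs://my-bucket/tmp
```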
Okay, and maybe this is the most difficult stage in terms of the number of jobs that we have. First we have the preprocess-dataset job, which just downloads everything that we have for a given project from Google Cloud Storage and zips all these files for the next job; that is the goal of this step. Then the next job is to tune the hyperparameters.
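A sketch of that preprocessing job (the bucket path is hypothetical):

```yaml
preprocess_dataset:
  stage: train
  script:
    # Download everything for the given project from GCS and zip it up
    - gsutil -m cp -r "gs://my-bucket/datasets/$PROJECT_ID" ./dataset
    - zip -r dataset.zip dataset
  artifacts:
    paths:
      - dataset.zip   # consumed by the tuning job
```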
So we need to find the best hyperparameters in order to find the model that gives us the best results. We take the zipped data set from the previous job, and we use this data set here just to tune the hyperparameters. And finally, when we find the best hyperparameters, we train the final model and store it. This is the job that relates to this step.
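The tuning job can pick up the zipped data set as an artifact from the previous job; a sketch with placeholder script names:

```yaml
tune_hyperparameters:
  stage: train
  needs: ["preprocess_dataset"]   # receives dataset.zip as an artifact
  script:
    - unzip dataset.zip
    - python tune.py --data ./dataset --output best_params.yml
  artifacts:
    paths:
      - best_params.yml
```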
Here we extract the best hyperparameters from the file, then we put them into a special YAML file used to train the model, and then finally we train our model. And yes, as the last step, we also need to publish this model. Right now we push everything to Google Cloud Storage: first we serialize the model, then we push it to Google Cloud Storage, and later our backend part will take these models, deserialize them, and provide recommendations.
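A sketch of training and publishing, again with placeholder names:

```yaml
train_final_model:
  stage: train
  needs: ["tune_hyperparameters"]
  script:
    # Train with the selected hyperparameters and serialize the model
    - python train.py --params best_params.yml --output model.pkl
    # Publish to GCS; the backend later deserializes it to serve recommendations
    - gsutil cp model.pkl "gs://my-bucket/models/$PROJECT_ID/model.pkl"
```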
This is what we are trying to find: the best number of factors, the best regularization, the best number of iterations. So that's just a config file used by the model to select the hyperparameters, in order to select the best model again.
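Such a config file might look like this (the value ranges are purely illustrative):

```yaml
# Hypothetical hyperparameter search space
factors: [16, 32, 64]
regularization: [0.01, 0.1, 1.0]
iterations: [10, 25, 50]
```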
A: Yeah, thanks, Alexander. I would also like to point out, which is definitely in our templates, how we also include security scanning in this process, which is something quite rare for machine learning engineers: including SAST and DAST testing as part of their CI/CD template. And then, I think, the last part.
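In GitLab, this scanning can be added by including the built-in security templates; for example:

```yaml
include:
  # GitLab's maintained security scanning templates
  - template: Security/SAST.gitlab-ci.yml
  - template: Security/Dependency-Scanning.gitlab-ci.yml
```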
Yeah, so for this one, first, just quickly, the architecture: we have another video that goes into a lot more depth on all the different parts and what we use. But if you think about the CI file that we've built, the full MLOps template that we are calling, it actually starts all the way from that extractor, connecting to Pub/Sub and Dataflow, into Google Cloud, where the ML model training is done, and then deserializing into the backend, into the projects. So that's the full workflow of it.
C: Yeah, maybe one thing we forgot to say: we create the scheduled pipeline on project registration, so this is done automatically. Once we include these CI templates, the template will register a scheduled pipeline for the given project, and then this MLOps pipeline will run every three days and automatically update the model and the dataset.
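One way to automate that registration is to call the pipeline-schedules API from a job when the template is first included; a sketch, assuming an API token is available as a CI variable (a real setup would also need to guard against creating duplicate schedules):

```yaml
register_schedule:
  stage: .pre
  script:
    # Create a schedule that reruns the MLOps pipeline every three days
    - >
      curl --request POST
      --header "PRIVATE-TOKEN: $API_TOKEN"
      --data "description=MLOps retraining"
      --data "ref=main"
      --data "cron=0 0 */3 * *"
      "$CI_API_V4_URL/projects/$CI_PROJECT_ID/pipeline_schedules"
```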
A: Yeah, I think that's a wrap.
Well, I hope this was very informative. If anyone is keen on understanding how to build an MLOps pipeline using GitLab CI, or has any questions on the pipeline for the reviewer recommender, please do drop a note for us in our Applied ML Slack channel, or reach out to me (Juan), Alexander, or Andreas. We are really happy to help with any part of this journey of building MLOps.