Description
This time, we explore how to create a very simple pipeline to perform Hyperparameter Optimization, using SKLearn's GridSearchCV and posting the results to the MR.
Codebase: https://gitlab.com/gitlab-org/incubation-engineering/mlops/hyperparameter-tuning-exploration
Exploring Hyperparameter Optimization with GitLab: https://gitlab.com/groups/gitlab-org/incubation-engineering/mlops/-/epics/6
Hello and welcome, everyone, to another session on exploring GitLab for machine learning and data science use cases. Today we're continuing our path of exploring GitLab pipelines for hyperparameter optimization, and this time we're going to build the simplest hyperparameter optimization pipeline possible. Just going back a bit: why are we doing this? Hyperparameter optimization is a very tedious, long process within a machine learning workflow.
So it's the perfect candidate for a CI pipeline. It's also the first step towards AutoML: if you consider that the algorithm you choose is itself a hyperparameter, you can think of AutoML as an extension of hyperparameter tuning. And with this in mind, is GitLab CI a good tool for this use case? It makes sense within the GitLab ecosystem, but is it the tool that we currently have ready for this?
So if you want to follow along, this is just part one. I have a summary of everything on the hyperparameter exploration epic, and I recently published part 0 of this series. If you don't know what hyperparameter optimization is, you might want to check that out; it's an explanation of the concepts behind hyperparameter optimization.
So what are we going to do today? We're going to create kind of the "hello world" of this, laying the foundations of what we're going to work on in the next few parts. It's the most basic hyperparameter optimization pipeline we can think of: we're going to use synthetic data that is small, a very simple model, and the simplest hyperparameter optimization method we can think of. It's not going to be parallelized.
It's just going to run serialized, one run after the other. We're going to have only a small number of parameters, and the results are going to be posted directly into the MR. So no external tools, no model registry for comparison, no Hyperopt, no Bayesian approach, no parallelization, nothing like that. All of that will come soon, but this gives us something nice to build upon. So without further ado,
let's see it in action. Suppose I have this repository, which I'm using for this code. I have a simple model that runs; I'll go through the code later, but what is important at this point is that I have these hyperparameters over here, and I have created a GitLab pipeline that will run the optimization whenever a new merge request is created.
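A pipeline like that could look roughly as follows. This is just a minimal sketch: the job name, image, and script file names are my assumptions, not the repository's exact configuration.

```yaml
# Hypothetical .gitlab-ci.yml sketch: run the optimization on every merge request
optimize:
  image: python:3.10
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  script:
    - pip install -r requirements.txt
    - python generate_data.py      # create the small synthetic dataset
    - python optimize_sklearn.py   # run GridSearchCV over the hyperparameter file
    - python format_results.py     # turn the CSV results into a markdown table
    - python publish_to_mr.py      # post the markdown as a comment on the MR
```

The `rules` clause restricts the job to merge request pipelines, which matches the behavior described: nothing runs on plain branch pushes, only when an MR exists.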
So let's say I want to change the hyperparameters. I'm going to come over here and edit this file: okay, I want to test an additional value for min_samples_split, say 30. The exact value doesn't really matter. I can create a new branch with the new parameters, which will start a new merge request. I could write a nice message, but I won't do that right now. So if I look over here, the pipeline is checking, checking, checking.
You can see here that it has already installed all the packages that we need, everything that was defined in requirements.txt.
Okay, so now we can see that it ran all the trainings: five models trained for each combination that we have, each one taking about a second, so it trained the model about 90 different times. It then formatted the results and published them to the MR.
So now we can go back to the MR that I just created, and we can see that it actually posts a comment on the MR with the best accuracy and a table of the parameters and results. You can see that the difference between the worst and the best cases is about 1.5 percent.
That is actually quite big. Here it doesn't matter, but if you're running a company with multi-million dollar revenue and you can increase your revenue by 1.5 percent, that's quite a lot. So this is the simplest thing we can do, and now I can go a bit over the code itself. It's a very simple pipeline.
I start by generating fake data. Instead of using real data or one of the datasets out there, I generate seven different variables randomly and then create a target y that is true or false depending on an equation that I, honestly, typed more or less at random. The point is that y is either true or false.
So this is a classification problem, and y is completely determined by those variables. Then I remove three of those variables, so instead of having seven variables for prediction, I only have four. This means I don't have the entire information available to me, which makes it interesting for machine learning: we can use the data to try to recover the actual y.
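The data-generation step can be sketched like this. It's my own minimal reconstruction: the exact equation, column names, and which three columns get dropped are assumptions, not the project's actual code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Seven random feature columns.
X = pd.DataFrame(rng.random((1000, 7)), columns=[f"x{i}" for i in range(7)])

# y is fully determined by the features via an arbitrary equation,
# so the label is a deterministic True/False classification target.
y = (
    X["x0"] + 2 * X["x1"] * X["x2"] - X["x3"] ** 2
    + X["x4"] - X["x5"] + X["x6"]
) > 1.5

# Drop three of the seven columns: the model only sees partial
# information, which makes recovering y a non-trivial learning problem.
X_partial = X.drop(columns=["x4", "x5", "x6"])
```

With only four of the seven inputs, the model can't reach perfect accuracy, so the grid search has something meaningful to trade off.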
Then I run the optimize-sklearn script, which is also very simple: it trains the model, a random forest classifier with a fixed random state, and uses GridSearchCV from scikit-learn to optimize the hyperparameters. It's a very simple algorithm: it just tries all available combinations. So if you have three parameters, one with three different values and the other two with two values each, that means we have twelve combinations (3 × 2 × 2).
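As a sketch of what that grid search looks like: the parameter names match scikit-learn's random forest, but the specific values and the toy data here are illustrative assumptions, not the repository's actual grid.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = rng.integers(0, 2, 200)

# Illustrative grid: 3 * 2 * 2 = 12 parameter combinations.
param_grid = {
    "n_estimators": [10, 50, 100],
    "max_depth": [3, 5],
    "min_samples_split": [2, 30],
}

# Fixed random_state for reproducibility; cv=5 means each of the
# 12 combinations is trained and scored 5 times (60 fits in total).
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print(len(search.cv_results_["params"]))  # 12 combinations tried
```

`cv_results_` is also what you'd dump to CSV for the formatting step, since it holds the parameters and mean test score for every combination.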
Then it tries each one of these combinations five times. It loads the hyperparameters from the hyperparameter file that I showed before, so you can configure them very easily and quickly, and it passes them directly into the optimizer. Then a very simple script does the formatting: it picks up the CSV results and transforms them into markdown. And then a final one just publishes this to the MR. That last step is not related to this project in particular.
It can be used in any project you have: it just takes the message passed down to it and posts it as a comment on the MR.
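Those last two steps could be sketched like this: convert the CSV results into a GitLab-flavored markdown table, then post it through the merge request notes API. The function names and the `GITLAB_TOKEN` variable are my assumptions; the CI variables and the notes endpoint are standard GitLab, but the project's actual scripts may differ.

```python
import csv
import io
import os
import urllib.parse
import urllib.request


def csv_to_markdown(csv_text: str) -> str:
    """Turn CSV text into a markdown table GitLab can render in a comment."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def post_mr_comment(markdown: str) -> None:
    """Post the table as a comment on the current MR.

    GitLab CI provides CI_API_V4_URL, CI_PROJECT_ID, and
    CI_MERGE_REQUEST_IID automatically; the API token would need to be
    supplied as a CI/CD variable (GITLAB_TOKEN here is an assumed name).
    """
    url = (
        f"{os.environ['CI_API_V4_URL']}/projects/{os.environ['CI_PROJECT_ID']}"
        f"/merge_requests/{os.environ['CI_MERGE_REQUEST_IID']}/notes"
    )
    data = urllib.parse.urlencode({"body": markdown}).encode()
    req = urllib.request.Request(
        url, data=data, headers={"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]}
    )
    urllib.request.urlopen(req)
```

Because the publishing half only depends on CI variables and the notes API, it is indeed project-agnostic, which is why the same script could post any message to any MR.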
So this was a very simple example. It's the start of our exploration, but we can already see some really important points here, and I think the biggest pain point for me is the iteration speed: changing the code, committing, testing on the GitLab UI, waiting for the GitLab pipeline to run, checking, changing again. This whole cycle takes a really long time, and it could be a lot faster if it ran locally, but it doesn't.
We do have some tools, like the pipeline editor and the GitLab validation extension on the VS Code marketplace, which help a lot. But the problem is that you need to test the thing actually running, and it only runs on GitLab; it doesn't run locally, and there's no solution for running pipelines locally. So that's very unfortunate. Up next, part two of this series is about trying to make it parallel.
As you saw before, it just runs a single pipeline; it doesn't parallelize the runs, which is not optimal. In this case it's not a problem, because each run takes one second to finish, but imagine that it takes five or six hours to train a model, which is not unreasonable. It's common to have applications where training takes hours, days, sometimes even weeks, so making it parallel is quite important, and that's what we're going to do in the next session. Then in part three, going a little bit beyond, instead of using these predetermined approaches, where every single combination is already computed from the get-go, we'll use somewhat more iterative algorithms that update the possible values on every iteration.