From YouTube: OpenShift Case Study cnvrg.io ML Platform | OpenShift Commons Gathering | Virtual Red Hat Summit 2020
Description
OpenShift Case Study cnvrg.io
Building ML Platform on OpenShift
OpenShift Commons Gathering
@ Virtual Red Hat Summit 2020
Hi everyone, my name is Yochay and I'm CEO and co-founder of cnvrg.io. cnvrg.io is a machine learning platform built on top of OpenShift and Kubernetes. We help teams manage, build, and deploy machine learning all the way from research to production. We help bridge data science and engineering teams, and we provide IT with an environment to manage all machine learning resources, utilization, infrastructure, and more. We started cnvrg.io because data scientists are spending 65 percent of their time on DevOps, and 85 percent of models don't get to production.
This is happening because there are two different audiences in the machine learning world. First is the IT side, focused on production machine learning and on infrastructure, opex and capex. On the other side you have the data science team, usually focused on algorithms and insights, and they spend 65% of their time on DevOps.
Now, what cnvrg.io and Red Hat provide is a solution to solve exactly that. We provide everything data scientists and DevOps need out of the box: a managed Kubernetes deployment on any cloud or on-prem environment, fully automated installation and lifecycle management of the application, and all the tools data scientists need for machine learning and AI from research to production, in an open, flexible, container-based, code-first data science platform that integrates with any tools you already have in your ecosystem.
Machine learning today is fragmented, broken between a lot of different tools, scripts, plugins, and disconnected stacks. On the left side you have the MLOps and DevOps work: a lot of effort around configuration and installation, scheduling, resource management, lifecycle, collaboration, and more. On the right side you have the whole data science workflow, from data selection to data preparation to model research, which probably needs to be versioned in the middle. Then you have a lot of experimentation: training different models, visualizing models, validating models, tuning, and deployment.
Once you deploy the model, it doesn't stop there. You also need to monitor and proactively iterate: if there is some sort of model decay, or new data coming into the model, how do you re-trigger this kind of pipeline? This pipeline involves both research and production deployment in a single continuous training and continuous deployment mechanism. Because of this complex environment and these procedures, 85% of models don't get to production. This is a problem identified by a lot of different companies.
This is a paper published by Google a few years ago describing the hidden technical debt in machine learning systems. What happens when a company tries to get machine learning from prototyping to real production scale? Suddenly they face a lot of different challenges: challenges around resource management (who is using which GPU), infrastructure, and monitoring of models both in training and in production. There is a lot of plumbing and very little actual machine learning magic.
cnvrg.io is full stack and container based, just like OpenShift, and it's open, meaning that you can use any kind of framework and any kind of container to build your models. cnvrg.io accelerates everything from research to production across any infrastructure. The OpenShift and cnvrg.io solutions work side by side: we use cnvrg.io on OpenShift to distribute jobs across the different compute resources. Think of it as having one control plane for all AI, to which you can attach different compute resources.
So if you have OpenShift on-prem, OpenShift in the cloud, or a hybrid mix of the two, you can have all the different clusters unified in one environment, and then your data scientists can run machine learning workflows on any of the workers, on any of the pods or containers that they have access to.
So you get one platform to manage training of models and to manage research, in Jupyter notebooks, in VS Code, or in something else. You can use auto-scaling, cloud bursting, and a lot of nice features built in.
In terms of pipelines, cnvrg.io provides a solution to build machine learning pipelines across all your different machine learning compute and jobs. Each component in the graph built with cnvrg.io can run on a different OpenShift cluster.
So you can have the preparation step running on an on-premise Spark cluster or a CPU cluster built on top of OpenShift. Then you can have GPU training in the cloud or on-prem, also using the OpenShift platform, and the last step, the deployment, can be deployed on a public cloud cluster. Flows are also an automation tool for building models, so you can run this pipeline of pre-processing, model selection, and model deployment every day, every week, or even based on new data coming into the flow: whenever there is a new version of the data, the flow can be triggered automatically. Flows are versioned and tracked at runtime and also while the flow is being built, so for every model you build you can always see exactly how it was built: with what data, what metrics, what hyperparameters, what algorithms. Everything is centralized.
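The version-triggered flow described above can be sketched in plain Python. This is only an illustration of the idea, not cnvrg.io's actual API: the `run_flow` callback and the integer version numbers are hypothetical stand-ins.

```python
# Illustrative sketch: re-run a pipeline whenever a newer data
# version than the last one seen appears. The run_flow callback
# and version numbers are hypothetical, not a real platform API.

def make_watcher(run_flow):
    """Return a check() that runs the flow on any new data version."""
    last_seen = {"version": 0}

    def check(current_version):
        if current_version > last_seen["version"]:
            last_seen["version"] = current_version
            return run_flow(current_version)
        return None  # nothing new, nothing to do

    return check

runs = []
check = make_watcher(lambda v: runs.append(v) or f"flow run on v{v}")

# Simulate polling: versions 1 and 2 repeat, so only 1, 2, 3 trigger runs.
outcomes = [check(v) for v in [1, 1, 2, 2, 3]]
```

A real deployment would poll (or subscribe to) the dataset's version metadata instead of being handed integers, but the trigger logic is the same.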
One of the nice, unique things that we have together with OpenShift is that, besides the fact that each node on the graph, each component, can run on different compute resources, cnvrg.io automatically scales up the cluster and frees the resources when the job is over. In this case I'm using multiple clusters, but think of it this way: you can have one cluster for Spark, one for deep learning, and one for classic machine learning, and cnvrg.io will orchestrate all the different jobs with the help of OpenShift.
Okay, so this is the cnvrg.io UI. You can think of cnvrg.io as sort of a GitHub designed for data science; it makes everything really simple. You can share models, resources, experiments, research; everything can be shared, and you can have your whole data science team on one single platform: data scientists, data engineers, and IT. Now, before we dive into one of the use cases here: cnvrg.io relies on OpenShift compute, so you can attach multiple clusters.
cnvrg.io itself can be installed on OpenShift, and then you can attach OpenShift and Kubernetes clusters directly to the platform from the UI. One of the nice things here is that you can track utilization, so you can make sure you're using all the resources, and if you're not, see exactly what the holdup is. We also provide Grafana, Kibana, and other nice open-source tools built into the platform and into each cluster that you connect.
All right, we'll go into MNIST to show you how you build and deploy a model in cnvrg.io. At cnvrg.io we support a lot of different ways to start building models. We even support the most basic way, which is spinning up a Jupyter notebook or even VS Code. You can choose to run VS Code on any of the OpenShift clusters that you have, and you can choose any compute template you want.
Once you spin up a resource, cnvrg.io will allocate the CPU and memory that you need, spin up the container, and get your code from Git or from cnvrg.io, and you'll have a working environment up and running in a few seconds instead of a few hours of setup. This is all within the high security standards that OpenShift provides, and it's the fastest way to get VS Code running on a remote machine.
Alright, next up is Flows, which is what I showed in the slides earlier. This is a really fast way to build any kind of machine learning pipeline you want. In this case I'm loading data from an object storage; it could be MinIO, your own object store, a cloud object store, or anything you want. Then I'm running it through a Spark pre-processing step.
Each of those components, like I said, can run on a different compute resource: this one can run on a Spark cluster, this one on a GPU in the cloud, this one on my on-premise GPU, this one on a CPU. So I get the flexibility to run any kind of task on any kind of compute node that I want. I can have one pipeline spread across all the different compute resources I have, which guarantees high utilization and the best tool for each job.
So what's going to happen here: I'm loading data from an object store, then running a pre-processing step using Spark, and cnvrg.io automatically passes the output data as input to those three different models. Each of the models will run as many times as I set in its internal parameters. Then I'm going to automatically pick the best model I have, based on accuracy, and deploy it as a web service.
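The pick-the-best-model step described here can be sketched generically in Python. To be clear, this is an illustration of the selection logic only; the candidate models, the accuracy values, and the `deploy` stub are made up, not cnvrg.io's real interface.

```python
# Illustrative sketch of "train candidates, pick best by accuracy,
# deploy". The candidates, scores, and deploy() stub are hypothetical.

def evaluate(candidates):
    """Return (name, accuracy) pairs for each trained candidate."""
    return [(name, acc_fn()) for name, acc_fn in candidates.items()]

def pick_best(results):
    """Select the candidate with the highest accuracy."""
    return max(results, key=lambda r: r[1])

def deploy(name):
    """Stand-in for deploying the chosen model as a web service."""
    return f"endpoint for {name}"

# Three mock candidates, as in the three-branch flow; each lambda
# stands in for a training run that reports test accuracy.
candidates = {
    "logreg": lambda: 0.91,
    "random_forest": lambda: 0.94,
    "neural_net": lambda: 0.93,
}

best_name, best_acc = pick_best(evaluate(candidates))
endpoint = deploy(best_name)
```

Swapping accuracy for any other metric, as mentioned next, just means changing the key passed to `max`.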
Of course, I can customize it based on any metric I want, and I'm going to pick the best model and deploy it as a web endpoint. This is really cool, because in one pipeline I've done pre-processing on Spark, training on GPUs or CPUs, and then deployment to a remote OpenShift cluster. Once this pipeline is triggered, cnvrg.io automatically tracks everything: all the input and all the output are automatically versioned.
Data is versioned across the different components, and you get a table that you can track in real time, where you can see all the different executions of the graph. You can go into a specific experiment of the graph and see resource utilization; hyperparameters and metrics are automatically tracked. You can see metadata about the run, and we also automatically plot your accuracy and other metrics.
So it's great for research too: you can track and monitor the different models and see what kind of hyperparameters work best with what kind of models. You can even select a few models like this and click compare, and then you see all the different models side by side and exactly what happened in each model at every step. In terms of serving: we did model selection, and the best model was automatically deployed as an endpoint, so we use OpenShift as the backend for this as well.
We deploy your file and function as a web service on OpenShift, and the cool thing is, besides the fact that we completely automated the DevOps, we help data scientists and data engineers get sort of an X-ray of the model. You can see everything that's happening; all the input and all the output are automatically tracked, so you can build new datasets using this data, or you can see the activity of the model.
You can see what happened recently, and you can also deploy new versions of the model using a canary release, which helps you gradually roll out new models and continuously test them as they're being deployed.
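The canary idea, sending a small, growing share of traffic to the new model version while the old one keeps serving the rest, can be sketched like this. The routing function, version names, and the 10% weight are illustrative assumptions, not the platform's actual mechanism.

```python
# Illustrative sketch of canary routing: each request goes to the
# new version "v2" with probability canary_weight, else to "v1".
# Weight, names, and seeded RNG are hypothetical choices.
import random

def make_router(canary_weight, rng=random.Random(42)):
    """Return a per-request router for a weighted canary rollout."""
    def route(_request):
        return "v2" if rng.random() < canary_weight else "v1"
    return route

# Start the rollout with 10% of traffic on the canary model.
route = make_router(0.10)
versions = [route(i) for i in range(1000)]
canary_share = versions.count("v2") / len(versions)
```

Gradual rollout then just means raising `canary_weight` step by step while monitoring the canary's metrics, and rolling back to 0 if they degrade.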
We also have Kibana and Grafana, which are great tools for IT and DevOps to maintain and monitor the endpoint. And the last part is continual learning. This allows data scientists or engineers to monitor models in production. Today, let's say you have five or ten data scientists.
If each of them has five models in production, very soon you'll have around 50 models in production to monitor. That's quite hard, and each model requires a different kind of monitoring.
What we did is make it extremely simple for engineers to add alerts to their models. You can track model confidence, data quality, or any parameters you want, and then trigger an email to the data scientist. So let's say your model receives bad input: you can automatically send an email to the data scientist.
Another cool example that we see a lot with our customers is being able to track the confidence of the predictions: if the prediction confidence drops below 0.5, for example, over a specific period of time, then automatically retrain the model. It will basically trigger the pipeline that we just built before and make sure the new data is fetched.
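The continual-learning mechanism described, retraining when prediction confidence stays low over a window of time, can be sketched generically. The 0.5 threshold comes from the talk; the sliding window size and the `on_decay` callback are hypothetical stand-ins for the real retraining trigger.

```python
# Illustrative sketch of a confidence monitor that fires a retraining
# trigger when average confidence over a sliding window drops below a
# threshold. Window size and callback are hypothetical.
from collections import deque

class ConfidenceMonitor:
    def __init__(self, threshold=0.5, window=5, on_decay=None):
        self.threshold = threshold
        self.scores = deque(maxlen=window)
        self.on_decay = on_decay or (lambda: "retraining triggered")
        self.triggered = False

    def observe(self, confidence):
        """Record one prediction's confidence; fire when decay is seen."""
        self.scores.append(confidence)
        window_full = len(self.scores) == self.scores.maxlen
        if window_full and sum(self.scores) / len(self.scores) < self.threshold:
            self.triggered = True
            return self.on_decay()
        return None

monitor = ConfidenceMonitor(threshold=0.5, window=5)
# Confidence decays over time; the trigger fires once the 5-observation
# average falls below 0.5.
results = [monitor.observe(c) for c in [0.9, 0.8, 0.4, 0.45, 0.4, 0.35, 0.3]]
```

In the setup described in the talk, `on_decay` would kick off the flow built earlier (fetch new data, pre-process, retrain, redeploy) rather than just return a string.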