Description
OpenDataHub: The Open Source Toolbox for Data Scientists
Audrey Reznik (Red Hat)
http://opendatahub.io/
OpenShift Commons Gathering on Data Science
January 28, 2021
https://commons.openshift.org/gatherings/OpenShift_Commons_Gathering_on_Data_Science.html
To find out more about OpenShift Commons, please visit: https://commons.openshift.org
So hello, everyone. My name is Audrey Reznik, and I'm part of the Red Hat OpenShift Data Science team. As a data scientist, I've had the pleasure, or maybe the torment, of delivering AI/ML models to production in a world of Jupyter notebooks, terminal servers, GitLab runners, s2i containers, and OpenShift.
You don't know how glad I am to have discovered Open Data Hub. In this presentation, I'll give you a brief background on how Open Data Hub started and what Open Data Hub is, and hopefully I'll have enough time to conclude with a quick demo. It won't be a live one, it'll be done with slides, but the demo will show you how to deliver an ML model that dwells in the world of fraud detection.

A bit of history on how Open Data Hub started.
It started internally within Red Hat as a platform on OpenShift for data scientists to store their data and run their data analysis workloads, hence the phrase "data hub." Fairly early on, it was realized that data scientists' and data engineers' requirements for tools, and really anything to do with AI/ML components, were pretty different from DevOps requirements. Data scientists, and I can attest to this as a data scientist, are mostly UI driven.
So the main points are sharing machine learning workflows done in notebooks, moving a model to production, and managing the model while in production: monitoring it, making sure that your predictions are accurate, and watching for any data drift, resource usage, GPU memory, and so on. Those are all very important to us, and these are the things that were combined, as multiple tools and components, to obtain an end-to-end AI/ML platform. Hence Open Data Hub is not a single application but really a platform with multiple tools running on OpenShift.
So Open Data Hub is really how Red Hat does artificial intelligence and machine learning internally on OpenShift, and we've learned quite a lot from running machine learning workflows on OpenShift. We faced, and still face, a lot of challenges and issues that we try to resolve and provide solutions for in Open Data Hub. There are maybe three or four issues and challenges. First of all, there are the people: in AI/ML projects there's always a team of data scientists, data engineers, DevOps, product owners, and business developers who need to collaborate and work together.
Secondly, sharing and collaborating around ML development is difficult; sometimes, in fact most of the time, it can be manual and really error prone. Thirdly, another important challenge is just the compute resources themselves. AI/ML workloads are compute heavy, and CPU, memory, and storage are not unlimited resources. I think we all know that, and they're definitely not unlimited in any development or production environment that we're working with. And fourthly, the final challenge, and one that is very critical, is delivering to production and the production development life cycle.
Sometimes that's not as easy as it sounds. So today, Open Data Hub internally runs AI/ML workloads such as application logs: in our internal Open Data Hub clusters, we run anomaly detection on multiple Red Hat application logs. We also have cluster metrics: we gather and analyze the cluster metrics, or sorry, the cluster logs, from OpenShift clusters, and we have an AIOps team dedicated to finding or predicting any issues that may occur there. And finally, we have customer support data.
When we look at what the data scientists do, we're really looking at the next phase, which is model development. What we're doing is analyzing our data, picking certain features, creating a model, training it, and then doing some model validation.
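As a rough sketch of that loop, assuming a hypothetical transactions.csv and scikit-learn (the column names and the choice of RandomForestClassifier are illustrative, not from the talk), it might look something like this:

    # Sketch of the model development phase: analyze data, pick features,
    # create a model, train it, and validate it. File and column names are
    # hypothetical placeholders.
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("transactions.csv")  # hypothetical transaction data

    # Pick certain features and the label column.
    features = ["amount", "merchant_risk_score", "hour_of_day"]
    X, y = df[features], df["is_fraud"]

    # Hold out part of the data for model validation.
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )

    # Create and train the model.
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    # Validate: compare predictions against the held-out labels.
    print(classification_report(y_val, model.predict(X_val)))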
The very last phase goes back up into the DevOps realm, and that's really moving and serving the model into production. This phase is not a static, one-stop model serving delivery phase but a constant optimization phase: the cycle of monitoring, optimizing, and serving is a constant cycle that happens for the lifetime of your model. And again, at the end of the day, it's that collaboration between your data engineers, your data scientists, your DevOps, and any business developers that you have.
So next, what I wanted to show you is just a diagram of where you can actually find Open Data Hub. First and foremost, Open Data Hub is an operator that's installed from the OpenShift OperatorHub. You can see that I have an OpenShift screen here; we can go ahead and choose Open Data Hub, and when you look at it, you're able to see the various tools that you might be able to use.
So we take all these different open source projects, such as Kubeflow, adapt them to run on OpenShift, package them within an operator, and then offer it on OperatorHub. Of course, Kubeflow is pretty big and is the central component in Open Data Hub, and we add other components, which you can see on the screen there: things such as Grafana, Spark, Prometheus, JupyterHub, Kafka, etc.
This slide really shows all the different tools and components that are provided by the Open Data Hub platform; each addresses a specific functionality in the end-to-end AI/ML workflow. Again, this will look very similar to the slide that we saw just two slides ago, where, first of all, we focus on data analysis.
We have storage integration, which could be our Ceph storage, working with PostgreSQL or MySQL. We have to have some way of doing data exploration, so we might use Superset or Hue. If we're interested in our metadata, we might have something like the Hive Metastore. Then, for big data processing, we may use something like Spark.
Those are things that the data engineer and the business analyst are very interested in. Then we move on to the artificial intelligence and machine learning, the data scientist's domain. A data scientist may jump into an interactive notebook such as Jupyter and do some of their work in there. If they want to train or fine-tune their model, or work with a distributed model, they may use something like PyTorch. They may use something like Spark for the machine learning applications themselves.
There might be various libraries that they're interested in; in that case they can use the Open Data Hub AI Library. And then finally they're going to look at how they can deliver some of the services for their model, or deliver the model itself, through Kubeflow Pipelines or maybe Airflow.
Some of those services, again, might use pipelines such as Kubeflow Pipelines or maybe Argo, and then finally, if we want to actually take a look at what's going on with our model, we'll use some sort of monitoring tool such as Grafana or Prometheus. So Open Data Hub comes with an ecosystem, and again this is provided by Red Hat and certified partners, basically to help enable our customers. We built this ecosystem around Open Data Hub, and we feel that it provides our customers with a faster go-to-market strategy.
We work with third-party vendors to get them certified to use UBI images and certified operators. These partners then become certified partners that provide support for their tools, integrated with Open Data Hub, and we can look at things such as Seldon or Anaconda, anything that we might use for model serving, etc.
So the first thing that we're going to do is just log into your OpenShift account. From there, we're going to proceed to the Open Data Hub dashboard. We're logged in as a developer, and to do any of the navigation we would use the left navigation panel; right now we're looking at the topology.
So I would just proceed to the Open Data Hub dashboard by clicking on the ODH dashboard operator and then clicking the Open URL button. What'll happen is we'll be presented with the Open Data Hub screen, and we'll have a large choice of options to choose from. As I mentioned, ODH contains a number of tools with which you can build, manage, and deploy your models. We're going to take on the role of a data scientist and work on a fraud detection model.
So when we open JupyterHub, we're first going to have the option to determine the type of notebook that we're going to use. We're just going to use a basic machine learning workflow notebook that we can use to deploy a fraud detection model, and again, just a reminder, we're looking at legitimate and fraudulent transactions in a bank. So we would go ahead, accept the other defaults, and choose Spawn to continue.
When we put the model into production, we actually go back to the OpenShift side and use pipelines. So we're deploying the machine learning pipelines into production with OpenShift Pipelines, and we'll see how we can use the services to make predictions. When we go back to the main OpenShift console and select Pipelines, you'll see that in this case there's a pipeline that we've already created. What we can do is click on the pipeline and see the pipeline details; remember, this pipeline is going to help deliver our model.
So once the pipeline is finished, we have a model, or rather a REST service, that's built with Source-to-Image (s2i). At this point what we'll want to do is take that pipeline service, more specifically its URL, because we're going to be using that URL. You'll see at the bottom I have a service URL (the pipeline operator one for data hub user one, etc., etc.), and we'll be using the requests library in Python to interact with the REST service that we've just managed to deploy.
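To give a rough idea of what such an s2i-built prediction service might look like, here is a minimal sketch assuming Flask and a serialized model file; the /predict route, payload shape, and model.joblib are assumptions, not the actual code behind the demo service.

    # Minimal sketch of an s2i-buildable Python REST service for predictions.
    # Route name, payload fields, and the model file are illustrative assumptions.
    from flask import Flask, jsonify, request
    from joblib import load

    app = Flask(__name__)
    model = load("model.joblib")  # hypothetical serialized model

    @app.route("/predict", methods=["POST"])
    def predict():
        # Expect a JSON body such as {"features": [[...], [...]]}.
        payload = request.get_json(force=True)
        predictions = model.predict(payload["features"]).tolist()
        return jsonify({"predictions": predictions})

    if __name__ == "__main__":
        # The Python s2i builder image typically runs app.py; OpenShift routes
        # traffic to port 8080 by convention.
        app.run(host="0.0.0.0", port=8080)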
So if we jump into the Jupyter notebook to interact with our model service, we'll just replace our default host with the generated URL from the pipeline services that we have running. Then, if we run our services, make the request, and run our model, we'll have the model making its predictions.
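A minimal sketch of that interaction with the requests library, assuming a hypothetical service URL, a /predict route, and the same illustrative feature columns as above:

    # Call the deployed REST service from the notebook with the requests library.
    # The host URL, route, and payload fields are illustrative assumptions;
    # substitute the URL generated by your own pipeline service.
    import requests

    MODEL_HOST = "http://pipeline-operator-odh.apps.example.com"  # hypothetical

    payload = {"features": [[120.50, 0.12, 14], [9800.00, 0.91, 3]]}

    response = requests.post(f"{MODEL_HOST}/predict", json=payload, timeout=30)
    response.raise_for_status()

    # Expect something like {"predictions": [0, 1]}, where 1 would flag a
    # fraudulent transaction.
    print(response.json())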
In this case, we have a lot of legitimate predictions; on the right-hand side of the screen you can see them under Predictions. That could mean that we're doing very well. If we ran this model a little bit longer, we'd probably see some fraudulent predictions coming up. All in all, that looks very good.
So then, what we want to do is take a look, graphically, at what our legitimate and fraudulent transactions look like over time. We can go back to ODH and launch Grafana.
I apologize for the screen capture being sort of fuzzy, but what it's showing you is, over the course of a day, the number of legitimate transactions, which should be a lot larger than the number of fraudulent transactions we are detecting.
Now we have the ability to visually monitor our service for fraudulent and legitimate transactions, and that's all done using the Open Data Hub services: we were able to deploy a Jupyter notebook, get our model running within that notebook, and then go back into Open Data Hub for another tool that lets us actually see some of the services that we have running from our model in the back end.