OpenDataHub Fraud Detection
ML SIG
OpenShift Commons
July 12 2019
Open Data Hub project is a reference architecture for an AI and Machine Learning as a service platform for OpenShift built using open source tools.
A: All right. In this presentation I'll give a little introduction to what Open Data Hub is, then dive deeper into the fraud detection use case that we implemented using Open Data Hub, and at the end I'll have a demo. It is a recorded demo, not a live demo, just because I don't have the cluster with everything running at the moment. So let's start.
A: We also run this platform internally within Red Hat, and we have a lot of internal customers who use it. That's where we learn what the pain points are, and we try to bring in the tools to solve any pain points or issues in an end-to-end AI platform. I just want to say that end-to-end AI/ML is very complicated. It is not like any other simple software engineering platform.
A: It involves many kinds of users of the platform, a lot more tools, and a lot more complexity than the software engineering problems we're used to. So with this in mind we came up with the Open Data Hub project, and today it is an operator that you can download from the community operators on OpenShift. It's basically a one-stop, easy install for all the tools that you need to run your AI/ML.
A: As for the tools that we bring in, we started out with the basics, and I'll show you in a little bit what our roadmap looks like. Basically, we thought about the users of AI/ML. We started with the data scientist, so we provided Jupyter notebooks. We thought of the data engineer, and we provided Ceph storage. Then we also thought about the processing part, which is Spark, and I'll talk in depth about this in a second. We also have tools for DevOps.
A: AI/ML is not just training models. It's also serving them, monitoring them, and being able to have that monitoring feedback flow back into the AI/ML models. All right, let's move on to the next slide. This is kind of a busy slide; it is a high-level reference architecture for end-to-end AI.
A: Everything is OpenShift and Kubernetes native, so everything runs on that platform. As you can see, and as I mentioned earlier, we have multiple users for the platform: the data scientists, the business analysts, the data engineers, the DevOps engineers. All these people need to use the tools in the platform. Starting from the bottom: as with any software engineering problem in AI/ML, we always start with data.
A
Your
data
can
be
in
motion
or
can
be
static
somewhere
in
storage
such
as
in-memory
databases
or
data
lakes,
or
a
relational
databases,
or
it
could
be
in
motions,
for
example,
it
could
be
in
Kafka
or
coming
on
an
SVG
interface
from
self
etc.
So
after
we
get
the
data,
we
can
tag
the
data
or
data
clean,
the
data
etc
and
then
give
it
to
the
data
analyst,
which
is
the
next
green
level.
Here
that
we
see
the
data
analyst
will
take
the
data
and
they
will
create
the
models
they
will
analyze
it.
A
They
will
try
to
make
meaningful
predictions
or
meaningful
studies
out
of
it.
Once
that's
done,
then
you
basically
have
models
that
you
need
to
serve
and
that's
where
Selden
comes
into
place
for
the
model,
lifecycle,
ml
flow
or
I
think
we
demoed
that
here
previously
for
stats
coming
out
of
the
model,
the
applications
that
use
the
model.
A: We also have to secure the model, so network security and governance come into play. As you can see, it's a pretty complicated end-to-end platform, all the way from the platform at the bottom up to actually serving the model and monitoring it. On the left side I have just a very simple sample of a workflow that we use most of the time, and it's very related to the fraud detection use case that you will see coming up here. We have the data stored in Ceph; we take that data from Ceph.
A
We
transform
the
data
who
create
the
models
using
Jupiter,
hub,
Jupiter,
notebook
and
spark
or
tensorflow.
We
run
experiments
on
it
and
get
data
out
of
Emma
flow.
Then,
after
we're
happy
and
we're
satisfied,
we
deploy
using
Selden
and
openshift,
and
now
the
model
is
served
and
once
it's
served
together,
metrics
and
we
display
it
on
the
graph
on
our
dashboard
and
sultan,
has
interfaces
for
Prometheus
so
that
you
can
extract
a
metrics
from
the
southern
system
itself
and
the
model
itself
and
you'll
see
that
in
the
demo,
that's
coming
up
for
fraud
detection.
A: On our next slide, just a brief outline of the roadmap for Open Data Hub: what we have today and what's coming down the pipeline. The initial release, which was earlier this year, included, like I said, the basics for a data scientist to grab data and do some analysis. It included JupyterHub, and it's multi-user: multi-user JupyterHub, multi-user Spark clusters. If you have multiple users using the same Open Data Hub installation, they can each have their own JupyterHub and their own Spark cluster. It also included Ceph Nano.
A
The
latest
release
that
we
have
that
we
released
a
couple
of
weeks
ago
included
Selden
for
serving
beaker
X,
which
is
a
notebook
that
has
the
notebook
image.
It
has
a
lot
of
good
tools
for
better
easier
data
analysis.
It
also
includes
included
GPU
support
and
Jupiter
hub
and
we
added
prometheus
and
go
fauna
so
that
you
can
do
your
monitoring
and
came
out
of
the
box
already
monitoring
the
spark.
That's
one
coming
down
a
lot,
a
flying
and
August
and
of
August
release,
we're
gonna,
add
a
lot
of
really
interesting
tools.
A
We're
gonna,
add
the
AI
library,
open
data.
Have
the
air
library
which
I
did
not
talk
about
this
here,
but
I
think
we
demoed
this
some
time
and
the
ml
stick
will
come
again,
probably
demo.
It
again
include
our
cargo,
which
is
the
native
workflow
for
AI
ml.
Many
is
offensive,
will
also
have
stuff
installed
by
rook
and
that's
basically
at
a
high
level.
That's
it
for
the
roadmap.
B: I'm just unmuting everybody now, so you're all unmuted; you were muted before. I wanted to ask a couple of questions. I know that we've created this open platform, Open Data Hub, but I've heard it is already being used; you mentioned the Mass Open Cloud. Are there other places where it's being used in production?

A: Yeah.
A: And also, on the S3 interface: we already have that. We have examples for you to use the S3 interface with any storage, not necessarily Ceph; S3 is the interface that we use right now. So we have Ceph: in the first release we had a Ceph Nano pod running, and the interface we used was S3. So if you have another pod there running other storage that exposes S3, you can do the same thing.
A: So what is the fraudulent credit card transaction use case? We wanted to come up with a use case that captures the whole end-to-end AI/ML. In this case, the other end, which is the feedback loop back to where you serve the model, isn't really covered, but we capture the beginning: getting the data, getting the data scientists to explore the data, and then, after the scientist decides this is the best model, we serve the model.
A
Well,
what
we
did
is
we
grabbed
some
data
from
Cagle.
It's
credit
card
transaction
data.
This
data
set
included
time
of
the
transaction
amount
of
the
transaction
and
21
hidden
features
of
the
transaction
and
they're
hidden
to
protect
consumer
neighbor.
So
we
took
that
data
and
what
we
did
is
we
used
all
the
tools
that
we
have
in
the
open
data
hub
to
kind
of
flow.
Through
this
exploring
the
data
fixing
the
data
baiting
the
model
and
serving
tomorrow,
we
wanted
to
create
a
model
that
can
predict
a
fraud
transaction,
so
you
feed
it.
A
One
of
these
credit-card
transaction-
it
will
tell
you
this
is
fraud
or
this
is
not
fraud.
You
also
wanted
to
monitor
the
model,
so
we
collected
models
on
the
model
and
we
showed
these
metrics
using
a
graph
on
the
dashboards.
Of
course,
metrics
were
collected
from
Prometheus
alright.
So
let's
move
on
to
a
high-level
architecture
slide
again
a
little
bit
more
busy
that
I
want
to
talk
to
it.
A: On the left side you'll see the users of this use case. We have the data scientist and the end user. The data scientist is the person who is creating the models; the end user, whom I'll talk about in a little bit, is the person doing the credit card transactions. Then we have the DevOps person, who is monitoring and making sure everything is running. So we start with the data.
A: We downloaded the data and saved it in Ceph. That's the credit card transaction data, around two hundred thousand transactions. Then we gave it to the data scientists and said: here, take this data and tell us how we can predict a fraudulent transaction. The data scientists used the JupyterHub notebooks and did their analysis. They used Spark for some of the analysis to get the data; as you can see here in the gray Spark box, they have their own Spark cluster.
A
They
have
their
own
dripper
hub,
notebook,
that's
fun
to
play
with
and
then
after
they
came
up
with
the
best
model,
the
way
they
think
and
they
analyze
and
I'll
show
you
an
example,
notes
book
that
we
have
here.
They
know
they
came
up
and
said.
Ok,
this
is
the
best
model
that
we
can
come
up
with.
You
took
that
model.
We
saved
it
as
a
file
called
model
pickle
and
we
save
it
in
ourselves.
A: Then comes Seldon: we created a Seldon custom resource. What the custom resource does is grab that model from Ceph and serve it, and it serves it as an endpoint. Now, to simulate: we don't have real transactions coming into the platform, right? So we had to simulate that.
A: What we did is we created a Kafka producer. That Kafka producer reads one of the credit card transactions every one to five seconds, randomly, and hits the Seldon REST interface for the model, which brings back a prediction saying: OK, this transaction was fraud, or this transaction was not fraud.
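A minimal, stdlib-only sketch of what such a producer loop might look like. The endpoint URL and transaction layout are assumptions; the `{"data": {"ndarray": ...}}` envelope is the shape Seldon's REST protocol expects for prediction requests.

```python
import json
import random
import time
import urllib.request

SELDON_URL = "http://model-endpoint/api/v0.1/predictions"  # hypothetical route

def build_payload(transaction):
    # Seldon's REST protocol wraps feature rows in data.ndarray
    return {"data": {"ndarray": [transaction]}}

def score(transaction, url=SELDON_URL):
    """POST one transaction to the model endpoint and return the response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_payload(transaction)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def produce(transactions, send=score):
    # Replay one transaction every 1-5 seconds, as in the demo
    for tx in transactions:
        prediction = send(tx)
        print(prediction)
        time.sleep(random.uniform(1, 5))
```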
A
But
all
this
is
happening
all
this
metrics
and
data
is
collected
by
Prometheus,
and
it's
shown
in
Griffin
our
battery
boards
for
the
dev
ops
to
kind
of
watch
and
see
how
things
are
operating
and
that's
at
a
high
level
of
what
this
use
case
is,
and
it's
a
that
I
have
next.
Basically,
like
I
said
these
are
the
transactions
most
of
the
stuff
that
needed
storage
was
and
stuff.
A: We used a notebook for data exploration, used Spark for bringing the data into a data frame, and used scikit-learn to create the model, which was a random forest classifier.
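That modeling step might look roughly like this; a sketch on synthetic data rather than the Kaggle set, assuming scikit-learn's `RandomForestClassifier` and the pickle serialization the demo describes.

```python
import pickle
import random

from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the credit-card features: label rows whose two
# feature values sum high as "fraud" (1), the rest as "not fraud" (0).
random.seed(0)
X = [[random.random(), random.random()] for _ in range(200)]
y = [1 if row[0] + row[1] > 1.5 else 0 for row in X]

model = RandomForestClassifier(n_estimators=20, random_state=0)
model.fit(X, y)

# Serialize the way the demo does: a model.pkl-style blob that a server
# (Seldon, in the talk) can load and call predict() on.
blob = pickle.dumps(model)
restored = pickle.loads(blob)
```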
We saved the model in a file, used Seldon to serve it, used a Kafka producer and consumer to simulate the transactions, and then Prometheus and Grafana for metrics display.
A: I think that's all I have right now. If there are no questions, I can move on to the video of the demo, so I'm going to switch over here. What we should be seeing right now is the OpenShift portal. Yes? Right, all right. We'll start by quickly showing the pods in the platform that we just described, and then we'll move on from there.
A
So
you'll
see
here
that
we
have
the
Kafka
operator
and
you
will
see
in
a
little
bit
that
we
also
have
multiple
Kafka
pas
they're
all
running
you'll
see
the
girl
on
a
pod
in
the
Jupiter
hub.
Jupiter
have
database
I
just
want
to
point
out.
There's
this
to
Jupiter.
We
had
two
users
using
it
at
this
point.
This
was
an
open
to
TLC's,
a
user
and
user
11,
so
they
both
have
their
own
Jupiter
hub.
You'll
see
the
model.
There's
two
models
are
being
served.
A
One
is
on
the
full
200
K
and
one
is
just
the
example
that
I'll
show
just
thinking.
Let's
see
the
seldom
core
and
the
Prometheus
in
Southern
Cross
we're
here
the
spark
cluster
and
again
same
thing,
the
spark
cluster.
We
have
two
users,
you'll
see
to
spark
clusters
with
workers
and
masters
for
each
cluster
and
then
at
the
end,
the
sisters,
the
string
operator.
So
that's
it
for
alright,
let's
move
on
so
this
is
a
notebook
that
our
data
scientists
use
to
kind
of
explore.
A
What's
the
best
way-
and
this
is
just
a
sample-
I-
wouldn't
say
that
this
is
you
know,
production
already
or
anything
like
that,
so
we
uploaded
the
credit
card
data
to
that
nano
on
a
bucket
called
open.
We
get
a
200
request
back
for
only
a
sample
of
the
data
which
is
tanking
over
there
and
that's
just
the
exploration
part,
but
not
the
actual
production
part.
A
So
we
uploaded
this
bucket
called
open
and
we
used
spark
our
spark
cluster
and
you'll
see
that
once
you
open
this
notebook,
you
already
have
a
pointer
to
your
to
your
own
spark
cluster
in
iOS
environment.
Very
well,
then
you
just
connect
to
that
spark
cluster
and
you
get
a
handle
and
a
session
to
the
spark
cluster.
A: Here we're just reading the CSV file from our Ceph storage. It will take some time, and you'll see that we read only 10K out of the 200K records. It will show you here in a little bit that only a small fraction of these transactions are actually fraud, which makes this data set skewed; but that's okay for this little demo. Reading the transactions will take a little time here.
A: We take only the feature columns, dropping time and class, and use the class column, which says fraud or not fraud, as the prediction vector. We do the model fit, creating the model using random forests, and you'll see all these V features; these are the features that are hidden. You'll see that it trained on 7,500 transactions and the remaining 2,500 were left for test, and you'll see that in a little bit, once it's done creating the model.
A
Number,
so
we
took
all
the
features
at
first,
that's
what
the
data
scientist
says
did
they
took
all
the
features?
First
indicated
that
model,
which
is
pretty
big.
Normally
you
don't
have
all
these
features.
You
want
to
pick
the
most
important
features
and
you'll
see
that
so
we
did
the
confusion.
Matrix
and
I
won't
go
through
this
deeper.
Basically, the confusion matrix shows you what was predicted as fraud versus what's really fraud, and what was predicted as not fraud, with the counts for each. You'll see those matrices here; it's just one way of seeing how good or how bad your model is.
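To make that concrete, a small stdlib-only sketch of computing those counts (the labels here are invented, with 1 meaning fraud):

```python
def confusion_matrix(actual, predicted):
    """Return counts keyed by (actual, predicted); 1 = fraud, 0 = not fraud."""
    counts = {(1, 1): 0, (1, 0): 0, (0, 1): 0, (0, 0): 0}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return counts

actual    = [0, 0, 1, 1, 0, 1, 0, 0]
predicted = [0, 0, 1, 0, 0, 1, 1, 0]
cm = confusion_matrix(actual, predicted)
# cm[(1, 1)] true positives, cm[(1, 0)] missed fraud,
# cm[(0, 1)] false alarms,   cm[(0, 0)] true negatives
```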
A: And again, this is just testing for this use case. It's not an in-depth test, just to show what you can do and what the tool is. So we check the important features here.
A
We
took
the
model
and
we
plotted
the
features
based
on
importance,
and
you
can
see
here
that
at
the
top,
seven
or
top
nine
are
important
and
then
the
tailor's
off
the
rest
were
not.
So
we
took
the
seven
important
features
that
you
see
here:
the
V
numbers
and
then
we
recreated
the
model
again.
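A sketch of that selection step. The importance scores below are invented for illustration (in the real notebook they come from the trained model's feature importances):

```python
def top_features(importances, k):
    """Keep the k features with the highest importance scores."""
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical scores in the spirit of the demo's hidden V features
importances = {"V17": 0.21, "V10": 0.17, "V12": 0.14, "V14": 0.12,
               "V16": 0.08, "V11": 0.06, "V9": 0.05, "V3": 0.01}
selected = top_features(importances, 7)  # then refit the model on these only
```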
A: So we recreated the model using the important features, and we did play around with the parameters that you see, the estimators and the depth; the data scientists did some tweaking here and there. So we do the model fit again.
A
Well,
the
output
is
modeled
optical
and
the
number
of
features
this
time
are
only
eight,
well
the
confusion
matrix
again.
We
do
it
again.
This
is
just
calling
up
and
doing
here,
but
this
is
just
to
show
that
you
can
do
it
and
look
at
it
again
and
then
before
we
serve
the
model,
we
do
like
a
little
test.
We
give
it
not.
A: We filter the test data for not-fraud and send it a not-fraud transaction, and the output is zero, as you see. We do it again with fraud.
A
You
can
see
one
here
for
all
the
fraud,
it's
being
son,
so
we're
happy
with
the
model
data
scientists
happy
with
the
model.
What
he
does
is
ok,
I'm
gonna
load
it
to
tough
again
and
they
put
it
in
a
specific
cold
model,
and
now
the
dev
ops
part
comes
may
be,
or
it
could
also
get
a
data
science
student
it
for
for
actually
serving
the
world,
so
we're
just
showing
here
that
it's
being
uploaded-
and
it's
successful
this
next
steps
that
we
she
will
be
shown
which
is
logging
into
the
open
shift
cluster.
A
You
can
do
that
on
terminal,
we're
just
showing
it
here
in
the
notebook
just
to
make
it
easier
for
us,
so
you
log
into
the
cluster
and
you
create
a
new
project
and
we
create
a
new
custom
resource
for
Sheldon
called
seldom
deployment.
You'll
see
it
here
and
that's
all
them
deployment.
What
it
does
is
it
grabs
the
model,
that's
incest
and
serves
it,
and
it
exposes
a
rest
interface.
So
you
can
see
we
have
two.
A: All right, so that's it for this notebook; not-fraud is giving a zero. Now for the dashboards that we have running on the cluster. This is the first dashboard that we see in Grafana, and it is actually showing all the metrics coming out of the model. The first graph is graphing probability of fraud versus amount; nothing interesting there. You'll see the red spikes are the spikes that are saying fraud. The next one is the probability of fraud versus V17, and here I think there's something interesting.
A
We
see
dips
for
317
every
time,
we're
from
fraud
same
with
the
second
one,
which
is
v10.
We
also
see
dips.
So
this
is
just
some
interesting
things
that
you
know
you
can
look
at
for
a
lot
of
time
and
try
to
come
up
with
something.
This
is
the
core
metrics
for
Selden.
It
just
shows
what
the
errors
are:
HTTP
errors
and
the
success
rate
and
requests
per
second
to
you,
the
model.
A
Then
we
move
on
to
another
dashboard,
which
is
the
calf
cow.
So,
like
I
said,
we
used
cough
a
cough,
got
to
kind
of
simulate
the
transactions,
and
here
it's
showing
us
how
many
brokers,
how
many
partitions,
how
many
messaged
rates
are
coming
in
and,
like
I
said
we
are
randomly
generating
messages
between
1
and
second
to
5
seconds.
A
Moving
on
to
the
cluster
monitoring
board,
and
here
you'll
see
all
the
monitoring
coming
from
open
shift
clusters
such
as
memory
usage,
much
memory,
you
are
using
how
much
CPU
were
using
there's
interesting
part
here,
but
CPU
pod
usage
per
pods,
either
operator
hub
using
the
top
CPU
and
then
I
think
this
is
really
interesting
and
then
pod
memory
usage.
You
can
see
the
spark
cluster
here.
We
use
really
love
and
using
a
lot
of
it,
not
a
lot
of
money
but
top
memory.