Description
OpenShift Commons Gathering in Seattle on Nov 7, 2016 talk by Red Hat's Will Benton on Big Data, Apache Spark, OpenShift and Kubernetes.
Good morning. My name is Will Benton and I'm a software engineer at Red Hat. We've seen this morning how OpenShift makes it possible to develop and deploy applications. I'm going to be talking about a particular kind of application: data-driven applications. I'm going to talk about my team's experience developing analytic applications and using data science at scale, about how our infrastructure requirements changed as we went from doing analytics as a workload to developing, deploying, and maintaining analytic applications, and about how OpenShift makes that possible.
For a little bit of background: for a little more than two years now, I've been leading a team focused on data science in Red Hat's emerging technology group, and we really wanted to figure out what the data-driven applications of tomorrow are going to look like on open-source infrastructure. So we wanted to look for problems inside Red Hat and help people take things that they may have prototyped on a single machine and bring them to production scale.
We wanted to go from models that were sort of black boxes for predictions to user-interpretable models that told you something about the real world and didn't just give you a yes or no. And ultimately we wanted to do all this while having at least as good predictive performance as the existing solutions for the things we were looking at.
So when we started off on this effort, we had a lot of experience with Apache Spark, which is a framework for data processing that I'll talk more about in a minute. We had a bunch of Spark machines in a dedicated compute cluster, we had a clustered, networked POSIX file system in essentially the same rack as our compute nodes, and we were orchestrating everything with Apache Mesos. This worked pretty well when we were just a small development team: we could sort of coordinate with one another, say "hey, I need this many machines for this model I'm going to train," and informally allocate things as we needed to. Where it sort of broke down is when we wanted to take the things we'd built, put them into production, and share our applications and our data with our collaborators. Ultimately, what we ran into is that this was a great way to run analytics workloads, but the problem is that analytics isn't really a separate workload anymore. We really put analytics into production today as part of contemporary data-driven applications.
So what we wanted going forward was a way to use a shared Spark cluster both for development work and for production work, so both interactively and at scale. We wanted an easy way to share our cluster and our work with collaborators, and we wanted a really nice continuous deployment workflow so that we didn't have to do a lot of extra work to update our applications. And we wanted to do all of this without becoming expert system administrators or scheduler-policy ninjas.
So in the rest of the talk, I'm going to introduce Apache Spark for those of you who are just becoming familiar with it, and talk about why it's a great fit for contemporary microservice architectures. I'll talk about some architectures for contemporary data-driven applications. I'll talk about two of the particular issues we had to solve to bring Spark to OpenShift, scheduling tasks and dealing with persistent storage, and then I'll sort of show you how you can get involved and use this stuff yourself.
So we'll start with some background. How many people in here have heard of Apache Spark? Great. How many people have used it before? Great, great. So all this will be review for some of you, but we'll go through it just so everyone has the same context. The tagline for Spark is that it's a fast and general framework for distributed data processing, but that's really only part of the story, right? Fast and general:
that's great, but I think the really compelling thing about Spark is that it's actually easy to use, unlike a lot of other frameworks for distributed data processing. If you think of MPI or Hadoop MapReduce, these other frameworks are based on an execution model that's easy to execute in parallel.
Imagine a conventional sequential collection in a programming language. It contains some values, probably all of the same type, and if we want to distribute it, we have a couple of ways we can do that. We can divide it up into partitions and put each partition on a separate machine. We can divide it into partitions in a bunch of different ways, but some of the most common are taking ranges of contiguous elements and putting each of those in its own partition, or hashing each element and putting everything that lands in the same hash bucket in its own partition.
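These two partitioning schemes can be sketched in a few lines of plain Python. This is an illustrative sketch, not how Spark itself partitions data; the function names and the choice of three partitions are just for the example.

```python
# Illustrative sketch (not Spark code): two common ways to split a
# sequential collection into partitions, as described above.

def range_partition(xs, n):
    """Put ranges of contiguous elements each in their own partition."""
    size = (len(xs) + n - 1) // n  # ceiling division: elements per partition
    return [xs[i * size:(i + 1) * size] for i in range(n)]

def hash_partition(xs, n):
    """Put everything that lands in the same hash bucket in one partition."""
    parts = [[] for _ in range(n)]
    for x in xs:
        parts[hash(x) % n].append(x)
    return parts

data = list(range(10))
range_partition(data, 3)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Either way, every element ends up in exactly one partition; the schemes differ only in which machine a given element lands on.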
Failures, right? We're never going to have failures. So these partitions can go away when the machines that are holding them (or maybe some other machine) go away, but we have a way to reconstruct them, and this is where immutability and laziness really start to pay off. When you have one of these RDDs (resilient distributed datasets), you have some values, either in an immutable collection in the heap of your application program or on stable storage somewhere, that form the basis for the RDD, and you build RDDs up by doing operations on them.
You have two kinds of operations. You have transformations; this is where immutability comes in. These don't actually do anything to the underlying collection: they create a new collection. And because we're lazy, they don't actually create a new materialized collection; they create sort of a recipe for a new collection.
So in this case, we have a very small collection and we want to apply an operation to it, filtering out all of the even numbers. We haven't actually created a new collection here; we've just created a way to say, given this underlying collection, how would we get to a new one? We can keep applying operations on this RDD, say multiplying every element by three, or expanding the collection so that we have twice as many elements, including each element and its successor, and we still haven't computed anything here.
We just have a recipe for how to get the collection we ultimately want. If we do a different kind of RDD operation, called an action, we'll actually get a value out of this. In this case, we'll do an action called collect, which schedules these computations on our cluster and materializes the collection as a sort of array in our application program.
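The sequence above (filter the evens out, multiply by three, include each element and its successor, then collect) can be sketched in a few lines. This is a pure-Python illustration of the lazy "recipe" idea, not the actual Spark API; the class and method names are made up for the example.

```python
# Pure-Python sketch of lazy transformations (an assumed illustration,
# not Spark itself): each transformation wraps the previous recipe
# without computing anything; the collect() action materializes values.

class Recipe:
    def __init__(self, gen_fn):
        self._gen_fn = gen_fn          # zero-argument function yielding values

    def filter(self, pred):
        return Recipe(lambda: (x for x in self._gen_fn() if pred(x)))

    def map(self, f):
        return Recipe(lambda: (f(x) for x in self._gen_fn()))

    def flat_map(self, f):
        return Recipe(lambda: (y for x in self._gen_fn() for y in f(x)))

    def collect(self):                 # the "action": compute everything now
        return list(self._gen_fn())

base = Recipe(lambda: iter([1, 2, 3, 4]))
pipeline = (base.filter(lambda x: x % 2 != 0)    # drop the even numbers
                .map(lambda x: x * 3)            # multiply each by three
                .flat_map(lambda x: [x, x + 1])) # each element and successor
# Nothing has been computed yet; collect() finally materializes the result.
pipeline.collect()  # [3, 4, 9, 10]
```

Note that `base` is untouched by all of this: every transformation returned a new recipe, which is the immutability half of the story.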
Obviously, if we want to use a collection more than once, this is sort of wasteful, so Spark gives us a way to suggest that something is going to be used again and to cache these intermediate results. We can see what this looks like operationally by looking at the high-level architecture of a Spark application. The main application is called the driver, and it interacts with a cluster manager, which schedules executors: basically just microservices that compute partitions of these RDDs. So the driver will serialize a function out to these executors.
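The caching idea can be illustrated with a standalone sketch (again plain Python, not the Spark API): the first action materializes the recipe once, and later actions reuse the stored result instead of recomputing.

```python
# Standalone sketch (assumed illustration, not Spark): caching an
# intermediate result so that reusing it doesn't rerun the whole recipe.

class CachedRecipe:
    def __init__(self, compute_fn):
        self._compute_fn = compute_fn
        self._cached = None

    def collect(self):
        if self._cached is None:          # first action: compute and keep
            self._cached = list(self._compute_fn())
        return self._cached               # later actions: reuse the cache

calls = []
def expensive():
    calls.append(1)                       # track how often we recompute
    return (x * x for x in range(5))

r = CachedRecipe(expensive)
r.collect()
r.collect()
len(calls)  # the pipeline ran only once despite two actions
```

In real Spark the cache is a hint rather than a guarantee (cached partitions can be evicted and rebuilt from the recipe), which is why it stays purely an optimization.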
So this RDD is pretty pleasant to use, but it's really a lot like an assembly language or an intermediate representation for distributed computation. It's general, it's usable, it provides all the primitives you need, but a lot of programmers might want to work at a higher level. And the Spark ecosystem, which is built on this RDD and this scheduler concept, includes a lot of higher-level libraries for special-purpose tasks, since being able to cache these intermediate results is a big benefit for things that we want to iterate over.
You can imagine that a lot of these libraries are things where you have to do iteration: think of graph traversals, think of processing structured queries like with a database query planner, think about optimizing machine learning models. Spark even provides a way to treat a stream of values as many small RDDs, one for each window over the stream, and thus use the same abstraction to program streaming and batch workloads. And out of the box, Spark provides a few different ways to deploy this. You can use a sort of self-managed standalone cluster manager.
You can use Apache Mesos, or you can use Hadoop YARN if you have an existing Hadoop installation. What I want to tell you today is that we can actually run these standalone Spark clusters on top of OpenShift, and there's work going on in the Kubernetes community, and with some people on my team, to actually have native support for scheduling Spark on Kubernetes as well.
Anyone delighted by microservices? Everyone else somewhere in between? Is that fair to say? Yeah, I assume most people would be in between. Microservices aren't a panacea, right: we do get some extra complexity in having to define these interfaces and orchestrate these components, but we get a lot of compelling benefits in return. And one really nice benefit is that Spark is a really natural fit for these kinds of architectures. So just to review why this makes sense: if we have stateless microservices, our services are commodities, and if one of them goes away, we can simply replace it.
But it's a lot easier to say "I want to examine the surface area of this well-defined API" than it is to say "I want to explore some hidden state inside a stateful service and produce a test fixture that will reproduce a bug." This can hold for performance problems too. Those of you who've done large-scale systems performance work know that it's a lot easier to look at an individual component with an SLA
that's not meeting that SLA and say "something needs to change here" than it is to look at a big system and figure out where things are going wrong. There are a lot of social benefits of microservices too, but I'm not going to be the fourth person to talk about those this morning, so we can leave that for later. Another nice thing about Spark is that it's really a natural fit for these architectures. As I mentioned, these executors are really microservices that compute partitions of distributed collections.
They are essentially stateless if we ignore caching, and caching is just an optimization. So they do what they're told, and when one of them goes away, we know how to replace it, because we have the underlying collection and we have the operations to apply to it to get back to that value.
So the good news is that if we want to make a contemporary data-driven application on OpenShift, Spark is a great place to start. But if we really want to take advantage of contemporary application architectures, we need to think about analytics as a component of an application and not just as a separate workload. So let's start by taking a high-level look at what a data-driven application actually has to do.
A data-driven application is really a lot like any other application, except that typically you're transforming and aggregating data from different sources, using that to train predictive models, and then using those predictive models to transform and make predictions from your data. You're saving predictions and raw data to archival storage, and you're supporting a few different kinds of user interface. Maybe you're letting developers or data scientists install new models or modify how models are trained. You have, of course, a typical end-user web or mobile UI.
You have a reporting UI for the business side, and you also have a management interface so that you can tune how your application is deployed and make sure it's performing well. So we'll start our review of architectures by looking really quickly at a couple of good analytics-as-a-workload architectures. The first one is probably going to be familiar to a lot of people: this is the conventional data warehouse architecture.
We think about a stream of events arriving; we're going to transform those somehow, apply some business logic rules, and put them in a database that we've optimized for fast concurrent updates. We're going to call this the transaction processing layer. Stop me if you've heard this before. Now, this is great, because these fast databases can deal with a high volume of events, but they don't have any analytic capabilities. To extend this with analytic capabilities, we're going to periodically send the contents of our transaction processing database to another database
that's optimized for a lot of concurrent reads and complex queries. We can then do some analysis on that: typically, folding multi-dimensional data into a spreadsheet so that it's comprehensible and you can make a report out of it. We can also feed those analysis results back to the business logic for use in the transaction processing half.
Finally, we may want to allow analysts to engage in sort of interactive exploration of our data as well. We call this the analytic processing layer. So this architecture has worked really well for a long time for a lot of applications. The downside is that it's really hard to do this with stateless services, and it's really hard to scale out the transaction processing side.
One approach to actually scaling out compute and storage is the data lake approach that was popularized by the Apache Hadoop project. The idea here is that you have a uniform abstraction for federating all the data you care about: a distributed file system. If you run out of space, you can add more space, and those nodes are going to stick around for a while. When you get application events that you care about, you append them to files in the distributed file system.
Those events aren't going to be in the format you care about for training a predictive model or running your data-driven application, so you're going to schedule some jobs to process them, and the way the scheduling works is that jobs that operate on particular parts of the data will migrate to the machines that store that data. So you get scale-out compute on top of your scale-out storage.
This was a great approach in that it let people really exploit commodity hardware and get scale-out, and the really compelling benefit of being able to store a lot of data was huge for a lot of organizations. The downside is that the programming model is pretty low-level, and it's not really a great fit for the kind of elastic architecture we want to see in a cloud-native application: you can't scale your storage and compute independently
if this is what you're doing. So the legacy architectures we've just seen solved a lot of problems, but they're not the best way to have these kinds of applications in production now, in 2016. We're going to look at some architectures now that are more suitable for contemporary containerized data-driven applications.
The lambda architecture is sort of an interesting approach to modernizing that classic data warehouse. Instead of having two databases, you have sort of two analysis layers: you get a stream of events and you multiplex those both to a stream processing layer and to a distributed file system.
The kappa architecture, in turn, reflects the fact that the way we design streaming algorithms and stream processing systems has changed a lot, and the idea there is that everything is on the queue.
So we have events, and we put them on a message queue with a raw-data topic. We transform those and write them to a preprocessed-data topic. Our analysis jobs simply read from the preprocessed-data topic and write to an analysis-results topic, and then our UI components just subscribe to the analysis-results topic and present those results to users in different ways.
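The topic pipeline described above can be sketched with in-memory lists standing in for broker topics. This is an assumed illustration: the topic names match the talk, but the parsing step and the toy odd/even "analysis" are made up for the example, and a real deployment would use an actual message broker rather than Python lists.

```python
# Minimal sketch (assumed illustration, not a real broker) of the
# everything-is-a-queue pipeline: raw -> preprocessed -> results,
# with each stage reading one topic and writing to the next.
from collections import defaultdict

topics = defaultdict(list)          # topic name -> list of pending messages

def publish(topic, msg):
    topics[topic].append(msg)

def consume(topic):
    while topics[topic]:
        yield topics[topic].pop(0)

# Events arrive on the raw-data topic.
for event in ["3", "4", "oops", "7"]:
    publish("raw-data", event)

# Transform stage: parse/clean raw events onto the preprocessed topic.
for msg in consume("raw-data"):
    if msg.isdigit():
        publish("preprocessed-data", int(msg))

# Analysis stage: read preprocessed data, write analysis results.
for value in consume("preprocessed-data"):
    publish("analysis-results", {"value": value, "odd": value % 2 == 1})

# UI components would subscribe to analysis-results and render these.
results = list(consume("analysis-results"))
```

The point of the design is that each stage only knows topic names, not the other stages, so stages can be scaled, replaced, or re-run independently.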
So this is really a nicer way to do analysis, where everything is a stream, but it does assume that you have a sophisticated stream processing system and that the analyses you want to do are actually expressible as streaming algorithms. Now, we can do a lot of things with streaming algorithms; we can't necessarily do everything, so those are good assumptions to keep in mind.
We know that we can run this in a contemporary containerized environment, but Spark is also general enough to implement the analytic processing side of a conventional data warehouse, the stream or batch processing sides of the lambda architecture, or the stream processing side of the kappa architecture as well. So Spark is really, really general and gets us a lot of benefits there. Now, is it enough to just have Spark as the basis for our contemporary data-driven applications? Well, you can still go wrong.
You can still say "I have Spark and I'm going to deploy it like something that's just analytics as a workload." If you do that, you have one resource manager for your applications and one resource manager for your compute cluster, and you get into this situation where you have to make scheduling decisions in two different places, and you're probably not using your resources as efficiently as you can.
A better way to do this is to schedule one Spark cluster, either logical or actual, per application, so that your unit of scheduling granularity is actually the application.
Then your resource manager is able to make the right decisions for each application, making sure that application components are co-scheduled with the compute they depend on. The compute requirements of our applications can change over time, though. In this example, we have an application with a huge task backlog but only a single Spark executor. Now, in a conventional standalone Spark deployment,
we can use Spark's support for dynamically allocating resources to temporarily give it more executors. We can do something similar here, but since we're wired into OpenShift, we actually have a service that reads the metrics from our running Spark application and says "give me some more nodes to run on." So we can extend our elasticity a layer below the Spark cluster, into the control plane.
So, in my team's original deployment, we had a Gluster file system living in the same rack as some of our compute nodes. POSIX file systems are really great, right? You can run grep on them; you can do all of these kinds of things that you're familiar with. But this was a problem for us from a management perspective, because we wanted to share our data with other users.
Managing access control became sort of a pain, and it also meant that we were depending on a path rather than a service. Another option that people use for storage is to actually have HDFS as a peer to their application scheduler. This can work pretty well, but you don't get that advantage of co-locating your storage and compute, which is something that HDFS depends on to some extent for performance.
So I want to talk about a different solution for storage, which is just using object stores. There are a lot of different implementations of the S3 API: Amazon's, obviously, the Swift project from OpenStack, and Ceph. You get fine-grained access control with these, and you can access them from a range of libraries.
Now, the idea that you need to co-locate your storage and compute was sort of inspired by the popularity of HDFS, and it's an interesting assumption that I want to take a minute to address. You can scan these QR codes and read these papers in full, and I'd encourage you to do that. But one of the things that you see in contemporary workloads is that a lot of analytic processing workloads actually do fit in memory once you've preprocessed your data.
Finally, frameworks that are designed to depend on having local storage may make some assumptions about how expensive storage is. Now, we know that writing to physical disks isn't actually cheap, but people assume that it is, and so you get cases like this one, where you have a distributed file system used to glue jobs together and you're spending almost a third of your time just moving temporary files around.
So really, the other thing to consider about storage is that you're reading that big data from your storage once, and then you're preprocessing it and operating on medium data that's going to be cached as you train your model. We're spending most of our time operating on data that fits in memory. In this case, we could optimize the part where we're reading the big raw data off of storage, but that's going to get us diminishing returns compared to making the in-memory part faster.
So you may not need to optimize by co-locating compute and storage; there are both practical and philosophical reasons why you might not need to do this. But if you do, it's possible to get rack or data-center locality pretty easily without sacrificing a lot of flexibility, and there are smart people working on making it possible to get even same-machine colocation without sacrificing a lot of flexibility either.