Description
OpenShift at Royal Bank of Canada
Raj Channa, Royal Bank of Canada (RBC)
Dhwanil Raval, Royal Bank of Canada (RBC)
OpenShift Commons Gathering 2019, Red Hat Summit
Case Study: OpenShift @ Royal Bank of Canada (RBC)
A: I'm Raj Channa. I'm part of RBC's tech infrastructure group, where I run the technology strategy and research function, and with me is my colleague Dhwanil Raval, who is going to help us with a demo at the end of my talk. The talk today is about containerizing Spark on Kubernetes and OpenShift. We're going to go through RBC's journey of containerizing it: what options were on the table, what we chose, and why we chose it.
A: But before we go on, let's take a quick look at what the traditional big data ecosystem is. At the very bottom we have the storage layer, which consists of a bunch of commodity servers hooked up together. It's usually a Hadoop HDFS file system that takes care of the whole storage subsystem, and while it says storage there, generally the compute runs there as well, because the whole Hadoop philosophy is to run the compute where the data is.
A: That's one of those things, but we really wanted customizability. In a big enterprise like ours, where you have multi-tenancy and a lot of people running their applications on the platform, you always have one or two application teams who want to go to a newer version of Spark, for example, or a newer version of some other runtime, and you generally can't do that easily in a traditional Spark, a traditional big data system.
A: We also wanted to make on-demand provisioning of Spark clusters available to our users, so they could provision them on demand and autoscale as well. Predictability for SLAs was a big issue for us, because some of the jobs were taking a long time to run, they were not meeting SLAs, the real-time jobs were interrupting the batch jobs, and so on.
A: Now, in the last picture we showed, we saw there were a lot of other open source frameworks on top of it, and one of the things you generally can't do very well in a traditional big data system is scan those open source packages efficiently and on a regular basis.
A: So we also wanted to be able to scan these things at each runtime and provide a better security posture for ourselves. The last thing is infrastructure optimization: what we've seen is that running the same Spark job on a traditional system versus running it on OpenShift and Kubernetes needed about 33% less hardware, so there were a lot of efficiencies there as well. Now I just want to go over a high-level overview of what we ended up doing. This is the general picture; this particular piece is the Spark core engine.
A: These are the few orchestrators that allow you to orchestrate workloads underneath Spark. We spoke about YARN and Mesos; the standalone Spark scheduler is the default scheduler that comes with the Spark distribution; and then you have Kubernetes. Now, in the previous slides I showed you, we went from YARN to Kubernetes, but in reality we didn't end up doing exactly that. As we went through it, we actually used Kubernetes in conjunction with the standalone scheduler, and this is how we ended up running our Spark jobs.
A: So you have Kubernetes at the bottom, we put the standalone scheduler on top, and then the Spark jobs run through that standalone scheduler. This might look a bit odd to you, but we'll go over why we did it this way, and some of the shortcomings of running everything directly on Kubernetes as well. Before we dive into it, there's one more technical thing I need to go through: the difference between the two different Spark deployment modes.
A: This is the standard OpenShift architecture; I'm not going to go over it. But let's look at how cluster mode actually worked for us. In cluster mode, the client submits the spark-submit job to OpenShift and it hits the API server. If we blow this up, the spark-submit job is actually going directly to the Kubernetes API server host, and the deploy mode is cluster.
A: This becomes important later, because we couldn't actually use this cluster mode in the end. From the API authentication server, the request goes to the scheduler, the driver gets started on a node, and then the driver starts all the executors. That's the general overview of cluster mode. We could not use it, for a couple of reasons.
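(For illustration, a cluster-mode submission against the Kubernetes API server looks roughly like the sketch below. The API server URL, namespace, and container image are placeholders, not RBC's actual values.)

    # Cluster mode: spark-submit talks to the Kubernetes API server, which
    # schedules the driver pod; the driver then requests executor pods.
    spark-submit \
      --master k8s://https://api.openshift.example.com:6443 \
      --deploy-mode cluster \
      --name spark-pi \
      --class org.apache.spark.examples.SparkPi \
      --conf spark.executor.instances=2 \
      --conf spark.kubernetes.namespace=spark-demo \
      --conf spark.kubernetes.container.image=example/spark:2.3.0 \
      local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar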
A: One, we were using Spark 2.3. 2.4 is the latest version right now, but we had to use 2.3 because our apps were not ready to go to 2.4 yet. The 2.3 Spark distribution did not support client mode, which was very important to us. It did not have PySpark support. And the most important thing: it did not have Kerberos support for the Hadoop file system itself.
A: Now, when it starts the driver and the worker nodes, it's really just pods that start up there. There is no StatefulSet or controller, so we also did not have a liveness probe or a readiness probe along with it. So we went ahead and looked at what other options existed out there in the industry and in the market, and one of the things out there was Oshinko, which gave us the standalone client mode.
A: So we tested this out as well. Here, with the Oshinko CLI, we deploy the Spark cluster. We start off with zero workers initially and then scale it up: the master runs on one of the nodes, then we scale up the worker nodes, and once we do that the worker pods come up, and then we do the spark-submit.
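(A minimal sketch of that scale-then-submit flow, assuming the worker controller is a StatefulSet named spark-worker with an app=spark-worker label; both names are hypothetical, and the exact Oshinko CLI invocation is not shown.)

    # Cluster was deployed with zero workers; scale the workers up before submitting.
    oc scale statefulset/spark-worker --replicas=3
    oc get pods -l app=spark-worker    # wait until the worker pods are Running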
A: So the standalone mode fixed two of the issues we had: one was client mode support, and it did have PySpark support as well. But what it still did not have was support for Kerberos and the Hadoop file system itself. So this is where we got stuck.
We spent probably over a month trying to figure out how to fix this problem, and during this time we were doing a lot of research and contacting industry experts, trying to see what was out there in the market that could help us solve this problem, because without Kerberos support there was no moving forward for us.
So the next two slides will talk about what we did to resolve this and what we had to develop. After trying to find something from the industry and not finding anything, what we did was start looking at what YARN was actually doing to solve this problem. These are the different components you have: the Kerberos domain controller (KDC) here, and the client.
A: So what we did was mimic that on the worker nodes. There was a bootstrapping script, or startup script, where we did a similar kinit as well, and once we did that, as the next step the Spark workers convert the Kerberos tickets into delegation tokens with some supporting scripts, after which the Spark client submits the job and it gets access to the HDFS file system. And on OpenShift, this is how it looks.
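(A minimal sketch of what such a pod startup script might do, assuming the keytab is mounted into the pod at /etc/secrets/spark.keytab and the principal is spark-user@EXAMPLE.COM; both are placeholders.)

    #!/bin/bash
    # Pod bootstrap script: obtain a Kerberos ticket before the Spark worker starts,
    # so it can later be exchanged for an HDFS delegation token.
    kinit -kt /etc/secrets/spark.keytab spark-user@EXAMPLE.COM
    klist    # sanity check: confirm the ticket cache was populated

    # ... start the Spark worker process here; per the talk, the workers then
    # convert the Kerberos ticket into an HDFS delegation token for the job.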
A: We have a customized Spark template. We deploy that with zero worker nodes initially, and the master gets deployed, after which we scale up, similar to the previous slide, and the worker nodes get deployed. And now the client submits the job. The client, again, is not submitting it to the k8s (Kubernetes) API server; it's submitting it to the Spark master. The deploy mode here is client.
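(For illustration, a client-mode submission against the standalone master looks roughly like this; the spark-master service name, port, resource values, script, and HDFS path are placeholders.)

    # Client mode: the driver runs where spark-submit runs, and the job goes to
    # the Spark standalone master rather than the Kubernetes API server.
    spark-submit \
      --master spark://spark-master:7077 \
      --deploy-mode client \
      --executor-memory 2G \
      --total-executor-cores 4 \
      wordcount.py hdfs:///user/spark/input.txt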
A: So we got PySpark and client mode support. The top three things there come from the standalone distribution of Spark itself, and the bottom few things actually come from the Kubernetes StatefulSet. I would say this is not ideal, but it works for us right now. We would love to see this completely supported and integrated with Kubernetes, especially the Kerberos part of it. So, a little bit on logging and monitoring and what we did.
A: Essentially, we have Filebeat installed, which goes and collects the logs from the different nodes, and from there it goes to Elasticsearch and then Kibana. For monitoring we use Prometheus and Grafana. All the pods expose metrics on a particular port, and from there it goes into Grafana.
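(As a rough sketch, each Spark pod exposes a metrics endpoint that Prometheus scrapes; the pod name and port below are placeholders, not the values used at RBC.)

    # Spot-check the metrics endpoint a worker pod exposes for Prometheus.
    oc exec spark-worker-0 -- curl -s http://localhost:7777/metrics | head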
B: So now we're going to see some real action: Spark running on OpenShift. Essentially, I'm going to divide my demo into three parts for simplicity. First is going to be the environment setup; second, I'm going to talk about the template we have put together and the components it builds on OpenShift; and finally, the spark-submit job.
B: So on your screen you're looking at the OpenShift environment we have put together for this demo, and along with that we've got an EMR cluster, which stands in for the Hadoop service, with its own KDC to simulate a secure environment, a secure HDFS environment. Now, as you can see, I'm doing a kinit to get a fresh ticket so that I can go and talk to HDFS, and I'm querying the HDFS data. The input file, input.txt, is the data set against which I'm going to execute my job.
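(The commands behind that step look roughly like this; the principal, keytab path, and HDFS paths are placeholders.)

    # Get a fresh Kerberos ticket, then query the secured HDFS data set.
    kinit -kt ./user.keytab demo-user@EXAMPLE.COM
    hdfs dfs -ls /user/demo
    hdfs dfs -cat /user/demo/input.txt | head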
B: This is the template we have put together for the StatefulSet, along with the Elastic and monitoring pieces. So now let's go ahead and deploy this template with oc apply. We have the Spark template up and running, and now we can go and deploy it.
B: Essentially there are a whole bunch of parameters you can tune when deploying this, but I'm tuning just a few of them: a PVC, where I'm asking for 10 gig of PVC to store my logs for the master and workers; the compute requirements for my master pod, the Spark master, which is two cores and two gig of memory; and I'm asking for zero workers. I'm starting with zero so I can scale up later, when I'm ready to start my job. And finally I'm asking for the worker compute, which is four cores of CPU and four gig of memory. At the last, I'm passing some of the HDFS parameters: the user principal and the keytab. Here I'm getting this keytab into the worker pods using a config map, but you could use an OpenShift secret or a Vault secret.
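(A sketch of that deployment, assuming a template file named spark-template.yaml; the parameter names below are hypothetical, not the exact names in RBC's template.)

    # Register the custom Spark template, then instantiate it with tuned parameters.
    oc apply -f spark-template.yaml
    oc process spark-template \
        -p PVC_SIZE=10Gi \
        -p MASTER_CPU=2 -p MASTER_MEMORY=2Gi \
        -p WORKER_REPLICAS=0 \
        -p WORKER_CPU=4 -p WORKER_MEMORY=4Gi \
        -p HDFS_PRINCIPAL=spark-user@EXAMPLE.COM \
        -p KEYTAB_CONFIGMAP=spark-keytab \
      | oc apply -f -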
B: As you can see, it has created a whole bunch of objects in my project. For better visualization, let's move over to the OpenShift console to look at what's going on there. Essentially it has created three applications: one is for logging, Prometheus for monitoring, and the Spark environment itself, that's a master and zero workers for now, because I'm going to scale up later.
B: So it's at zero replicas, that's fine; I've got the master up and running. Now let's look at the Spark master environment: zero workers, no running application, no compute, that's fine, but the master is up and running. Now let's look at Prometheus. I'm using OAuth to log in, passing my OpenShift credential to log into Prometheus.
B: We've also got Elasticsearch up and running as well. So, as you can see, in just a few seconds we got a fully operational Spark environment up and running with its own logging and monitoring solution, and we are good to go to point our Spark job against this Spark cluster we have just spun up. Essentially, this is the simple shell script we put together. It has three parts.
B: First, it's going to scale the workers: I'm asking for two replicas of the workers. The second step is going to make sure that all the workers I'm asking for, based on the number of replicas, are up and running. And the next step actually executes the Spark job; I'm just looping it, because it's a very small job. I'm pointing it to the Spark master I just created, with deploy mode client and some of the executor memory and compute requirements. And if you see here, that's the Spark token: I'm passing the HDFS delegation token with the spark-submit, and I'm going to run my wordcount.py. Again, it's the HDFS data; I don't need to pass the whole URL because I'm using the defaultFS in core-site.xml. And finally it's going to tear down the workers to zero, and you can even choose to tear down the whole cluster and spin it up again when you're good to go, so both options are available.
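(A minimal sketch of that script, assuming a worker StatefulSet named spark-worker and a master service named spark-master; names, paths, and resource values are placeholders, and the exact way the HDFS delegation token is attached is omitted.)

    #!/bin/bash
    # Part 1: scale the workers up to two replicas.
    oc scale statefulset/spark-worker --replicas=2

    # Part 2: loop until all requested worker replicas are ready.
    until [ "$(oc get statefulset/spark-worker -o jsonpath='{.status.readyReplicas}')" == "2" ]; do
      sleep 2
    done

    # Part 3: run the job in client mode against the Spark master, then tear down.
    spark-submit \
      --master spark://spark-master:7077 \
      --deploy-mode client \
      --executor-memory 2G \
      --total-executor-cores 4 \
      wordcount.py /user/demo/input.txt   # defaultFS comes from core-site.xml, so no full URL

    oc scale statefulset/spark-worker --replicas=0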
B: So now let's go in and execute the script. The StatefulSet was patched; as you can see, it's moving from zero to two, and it's not going to take more than five seconds. Now let's look at the logs. Here you can see it now has the HDFS token, so all the workers can go and talk to HDFS, and it should be the same for worker one.
B: So if you see, the Spark job has started already, the Spark master now has two workers and all the compute required to run the Spark job, and one job is already running. Now let's look at the monitoring. Prometheus has already detected that the workers are up and running, and it has even started scraping them. So now let's look at the metrics being generated: the workers are generating a whole bunch of metrics, and Prometheus is pulling a whole bunch of data from the workers.
B: Let's look at the heap usage. If you concentrate on the graph on the right side, you can see some data points generated for the worker, and as you scale up you can see more data points here. Similarly, it has some metrics for the master as well, like the number of alive workers, which went from 0 to 2, and you can put some kind of alerting on that before you start your job, and similarly for the application.
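(For example, the master's alive-worker count can be queried from Prometheus and alerted on before a job starts; the Prometheus URL and the metric name below are hypothetical and depend on how the Spark metrics sink names it.)

    # Ask Prometheus for the Spark master's alive-worker gauge (hypothetical metric name).
    curl -s 'http://prometheus-demo.example.com/api/v1/query' \
      --data-urlencode 'query=spark_master_aliveWorkers'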
B: So with the Prometheus and Spark integration, we got full observability into the Spark environment's state, including the JVM, like how my job is doing, and so on. That was more about the monitoring aspects of it. Now let's look at the logging. Like Raj mentioned, we are using Filebeat to ship our logs from the master and workers to Elasticsearch, and now I'm using the Kibana console to see those logs.
B: So now I'm streaming logs live, all the way from all the workers and masters to here, so that a big data developer can troubleshoot their job in real time. They don't need to wait for the logs to be made available at some central location, and so on. So this was more about the logging aspects of this whole operationalization.