Description
How do you build global, cloud-native SaaS applications at the speed of DevOps? Ram is a product leader on the Oracle Cloud Native team. Via real-life case studies, he will discuss scaling to hundreds of clusters across geographies, achieving high availability, and providing the necessary monitoring and tracing, while covering all bases from a security standpoint.
Presenters:
Ram Kailasanathan, Senior Director Product Management @Oracle
Richard Bair, Senior Director Engineering @Oracle
Jerry: Okay, let's get started. Thank you, everyone, for joining us today. Welcome to today's CNCF webinar, "Highly Scalable Software-as-a-Service Apps on Oracle Kubernetes Engine: Real-Life Case Studies." I'm Jerry Fallon, and I'll be moderating today's webinar. We'd like to welcome our presenters today: Ram Kailasanathan, Senior Director of Product Management at Oracle, and Richard Bair, Senior Director of Engineering at Oracle.

Just a few housekeeping items before we get started. During the webinar you are not able to talk as an attendee. There is a Q&A box at the bottom of your screen; please feel free to drop your questions in there, and we'll get to as many as we can at the end. This is an official webinar of the CNCF and, as such, is subject to the CNCF code of conduct. Please do not add anything to the chat or questions that would be in violation of the code of conduct, and please be respectful of your fellow participants and all presenters.
Ram: Thank you, Jerry. Hello, everyone, and thank you all for joining the webinar. This is Ram Kailasanathan. I'm a Senior Director of Product Management on the Oracle Cloud Native team.

Richard: And my name is Richard Bair. I'm the Senior Director of Engineering on the Oracle IoT Applications and Blockchain Applications teams, and we are users of OKE.
Ram: Thank you, Richard. It's been a great experience working with Richard over the last year and a half. Folks, the goal of this webinar is to share how one of our marquee customers, the IoT Apps and Blockchain team within Oracle, has onboarded onto and scaled on Oracle Kubernetes Engine, and to share all the learnings and best practices that came with that.

In the CNCF survey, 78 percent of respondents are using Kubernetes in production, an impressive jump from 58 percent last year. In that same survey, the public cloud was the most preferred destination for running Kubernetes. And there was one other takeaway with regard to containers: containers are making their way into production environments. We saw 84 percent of respondents mention that they are using containers in production, an impressive jump from 73 percent in 2018.
When you take a look at Oracle Cloud Infrastructure, think of it as a set of high-performance servers, storage, and database services supporting millions of IOPS and scaling up and down based on your need. This is the true enterprise cloud you've heard mentioned. We don't focus on micro instances or VMs with time-sliced, fractional CPU allocations. We focus on providing what businesses need to run real production workloads: workloads that have to scale up as well as scale out; workloads that may require the reliability of solid traditional hardware infrastructure in addition to the plan-to-fail approach of the cloud; workloads that need low-latency access to storage networks. And we provide businesses with simple pricing and predictable costs.
If you take a look at this diagram, we offer a whole gamut of cloud services, all the way from containers to data-movement services, to autonomous databases, to events and streaming for AI/ML workloads, and we offer a whole slew of choices for our customers from a compute, storage, and networking perspective.

Our philosophy of building cloud services is fairly consistent. If you take a look at any of the services I mentioned on the previous slide, it's fundamentally about delivering tools and services that are complete, integrated, and open. By that I mean delivering tools and services across the whole lifecycle: in the case of cloud native, everything from continuous integration to deployment, to registry, to orchestration and scheduling, to management and operations when you start thinking about day two and beyond. And we invest in and leverage open source technologies as the fundamental basis for portability.
We actively participate in community-driven open source forums, investing obviously in Kubernetes and Docker, and we have our Fn open source initiative for serverless, which is basically a serverless offering — the only serverless offering that is based on open source. We engage very actively with the CNCF and a whole slew of other open source forums, as I mentioned before, through contributions and sponsorships, and we have active support for several other initiatives, including Helidon, a Java initiative, just to mention a few.

Moving on to the last pillar, managed services: we differentiate on quality of implementation and on service and operational excellence, providing full and transparent management that is open source compatible and standards compliant. You can deploy all your applications onto Oracle Cloud Infrastructure and get enterprise-grade performance, security, high availability, and governance.
When we sat down with several of our enterprise customers and startups who wanted to deploy their applications onto Kubernetes, we did a number of customer-requirements-gathering sessions — sitting down with real customers, obviously, but also leveraging the CNCF survey. One thing that really came out was that Kubernetes is very important — think of it as a hub for the modern, cloud-native way of developing applications — but Kubernetes by itself is not everything. Customers consistently mentioned a number of observability tools, including logging and monitoring, and also mentioned service meshes, additional networking technologies, storage interfaces and, most importantly, security and compliance, in addition to the core Kubernetes offering. That has become our primary data point for building and offering additional services on top of Kubernetes.
We put our customers front and center, with design thinking from day one, and it starts with being easy to start and adopt. We offer a console that's very intuitive and easy to use, and a whole slew of additional capabilities for you to try things out before you start serious application development on Oracle Cloud. Then we get into the core application development aspects. We are living in the world of APIs, and API design is very important. In the cloud we offer simple-to-use services such as Cloud Shell and Resource Manager — think of the latter as an automated service for infrastructure as code — and then, obviously, source control and container registry are all very important aspects of the overall development life cycle.
When we get to deployment — we're obviously going to share a bunch of things about Kubernetes today, but in addition to Kubernetes we also offer, as I mentioned before, Functions and Streaming services for workloads that might want to use those technologies. Then, when you go to day two and beyond, we offer a number of observability tools, including monitoring and logging, and we also offer an Events service, a Notification service, and an Email service for a bunch of use cases. And being an enterprise-focused company, we think about security and compliance from day one, and that is reflected across all the services that we've built thus far.
Before transitioning, I want to say that, as I mentioned before, Richard and I have worked together very closely over the last year and a half, and IoT Apps is one of the marquee customers of OKE. In the second part of the webinar, Richard is going to take you through a bunch of learnings from a customer's perspective. Thanks, Richard.
Richard: All right, thanks, Ram. I'm really excited to do this presentation, and hopefully some of the content I have here will be useful to the folks who are on the webinar and those who are watching online.

Today I really have two items on the agenda. I'm going to talk a little bit about the IoT project itself, so that you have some context for what we're doing and how we're using Kubernetes, and then I'm going to talk about four lessons that we've learned through our own failure and experience.
The first thing I need to do is brush up my bona fides. We are running north of 17,000 cores across our Kubernetes clusters, and we have more than 50 different OKE clusters that we run workloads on. So we've got some pretty big clusters and we have a lot of clusters — we're looking at scale in two dimensions here, vertical and horizontal — and there are a lot of things you have to solve when you're facing those problems. That's the genesis of what we're going to be talking about today.

The Internet of Things products at Oracle are SaaS applications. We have five different SaaS applications that we sell, and on the blockchain application side we have one for supply chain management called Track and Trace. Each one of these applications is designed in essentially the same manner, and all of them are using OKE.
As I mentioned, we started using OKE around late 2018. I noticed on your slide earlier that there were two interesting categories. One was the number of people reporting that they had some sort of cultural change they needed to make on their team, and the other was the number of people who needed training on Kubernetes. The fascinating thing was that the percentage of people reporting that in the earlier survey was close to zero, while the percentage reporting it in the later survey was substantially higher, which indicated that a lot of teams were in a transition period toward production in late 2018. That was certainly our experience.
Our products have been out there since about 2016. We were on a couple of different clouds inside Oracle before we came to OCI, and internally we were already using Docker for development and for testing, but when it came to production we were deploying on VMs. So when we made the transition to OCI, we wanted to change the way we were doing things so that we had the container workflow all the way from development through to production.

We did some research and tried out OKE, and it was super easy to get started with, so we said: let's do it — we're going to use OKE and OCI for everything we do, from development and testing through staging and production. We literally don't have any other virtual machines that we use for anything that's not somewhere in OKE or on OCI. Every developer on our team has some quota on a development OKE cluster, and all of our automated SQE tests run on the development or staging clusters.
So we're really happy with being able to use this stuff in development as well. And I should acknowledge, now that we have a year and a half under our belt, that using OKE makes Kubernetes a lot easier than it would have been if we were running it ourselves. The really essential thing — anybody who's done this in production knows this is true — is not that there's another team providing you a service; the essential thing is that the team providing the service does it well, and the OKE team does a really good job providing their service. We just don't run into problems with their hosting; it's worked really, really well for us. So we're using OKE all the way.

Let's go ahead and dive into some of the lessons we've learned — like I say, through failure and experience, which is the best teacher.
Lesson number one: things die. Operating anything at scale requires careful attention to architecture and design to make sure that everything is highly available, and Kubernetes provides all the tools we need to be highly available. Kubernetes gives us the concepts of Deployments and StatefulSets, which it uses to automatically maintain a configured number of replicas. So if we say we need ten web servers, it's going to try to make sure there are ten; if one of them dies, Kubernetes automatically tries to bring one back up.
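The ten-web-servers example maps to a Deployment with `replicas: 10`. A minimal sketch — the names and image here are illustrative, not from the talk:

```yaml
# Hypothetical Deployment: the controller continuously reconciles toward
# 10 running pods; a dead pod is automatically replaced.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 10
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.25   # placeholder image
          ports:
            - containerPort: 80
```

If a pod is deleted or its node dies, the Deployment controller notices the replica count dropped below ten and schedules a replacement.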
Honestly, when I was evaluating Kubernetes to begin with, the fact that it helped do this was really powerful to me. It meant it wasn't something I had to build myself on top of VMs or some other way — Kubernetes had built-in support for bringing systems back up if they die. Kubernetes monitors every pod's readiness and liveness: it uses health checks to figure out whether something is healthy or not. So even if your container is running but it's not healthy, it can report that, and Kubernetes will kill it and get you a new one. Those sorts of things are really handy in production for making sure your system stays alive through all the bumps and bruises that happen in life.
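That readiness/liveness behavior is configured per container. A hedged sketch — the endpoint paths, port, and timings are assumptions for illustration:

```yaml
# Hypothetical container spec fragment: Kubernetes restarts the container
# when the liveness probe fails repeatedly, and stops routing traffic to the
# pod while the readiness probe fails.
containers:
  - name: app
    image: example/app:1.0      # placeholder image
    livenessProbe:
      httpGet:
        path: /healthz          # endpoint name is an assumption
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 10
      failureThreshold: 3       # ~30s of failures before the container is killed
    readinessProbe:
      httpGet:
        path: /ready            # endpoint name is an assumption
        port: 8080
      periodSeconds: 5
```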
This happens in real life, in real clouds, and it's a great feature that we're able to make use of. Kubernetes also gives us a way to declare how many pods in a Deployment or StatefulSet can go down at once before we lose availability. This is called a pod disruption budget, and it's another really great feature that Kubernetes provides. It makes some things a lot easier to automate, because you can tell Kubernetes, "Hey, I want to take down these five nodes in my node pool," and it says, "Okay, I'm going to take those down" — but then it looks at your pod disruption budget and says, "I realize I can't take all five down at once without hurting your availability, so I'll just take down these two that I can right now, and once those containers are back up and running, I'll move on to the other pods that I can take down." That kind of automation is really handy, because there are many times when you need to go through your entire fleet of nodes and recycle them, which we'll talk about in a bit. So pod disruption budgets are another great feature that Kubernetes gives you.
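A pod disruption budget like the one described could look like this sketch (the numbers and labels are illustrative): with ten replicas and `maxUnavailable: 2`, a node drain may evict at most two of these pods at a time.

```yaml
# Hypothetical PodDisruptionBudget: voluntary disruptions such as node
# drains may take down at most 2 matching pods at once.
# (Use policy/v1beta1 on clusters older than Kubernetes 1.21.)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  maxUnavailable: 2
  selector:
    matchLabels:
      app: web
```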
Okay, so Kubernetes provides all these tools I need for high availability. Now, a lot of people, when they come to this — if you're new to the cloud — are used to programming on their own computer. Your computer doesn't die that often; maybe you've had a few workstations, and yeah, every now and again one dies, but not that often. So why is it such a big deal that everything is highly available? The answer is a little more complicated. There are a lot of reasons why a pod may die, and it's not instability in the cloud — that's not it. It's the architectural benefits you get if you design your system so that things can be rapidly killed. Probably the people on this call are familiar with the concept: cattle, not pets.
So there are a lot of reasons why you may have a container or a pod die. A pod could die, for example, because it just becomes unhealthy — all of us who write software are familiar with this: you ran out of memory, or you had a core dump, or the system just stopped responding for some reason. All of those could be reasons why it became unhealthy, and the easiest thing to do is just reboot it, which essentially means killing the pod and getting a new one. Somebody could also have killed it manually: Kubernetes provides controls for deleting pods, and if somebody deleted a pod, it needs to come back up.
The cluster could also have been scaled out — and when I say the cluster, I mean the Kubernetes cluster itself. Maybe you had 10 nodes in your cluster before, and now you have 20. All of your pods are still on the original 10 nodes; you just added 10 new ones, and you would like to rebalance your workload. That involves deleting pods on one node and bringing them back up on another. So that's another reason why parts of your system might go down, and that's what I mean by "architecturally."
It's also possible that the node — the virtual machine your pod runs on — will itself die, and you might have a fair number of those. This can happen under normal circumstances: for example, the virtual machine that is that node might need a new image. On our team, you may find that the kubelet needs to be upgraded, and that requires recreating the node. And sometimes things just stop working. We found that with some virtual machine shapes, sometimes the VM would just be unhappy, and the recourse was to restart that virtual machine, or to delete it and get a new one created.
All of those things are just normal and common, and if you've designed for high availability, they become easy and you get to reap the benefits. So that was the first lesson we learned: things die, and you need to be prepared to deal with it. It's totally common — it's not a once-a-quarter thing, it's not a once-a-year thing. If you've designed for it, you can actually use it to your advantage; you'll experience it a lot, and it won't cause you any problems at all. In fact, it's just a normal part of your daily business.
Okay, so lesson number two. This year I'm starting to feel like I'm getting a little bit older, so I had a little joke here: persistent pains — not just something you get in old age.

Kubernetes has the concept of a persistent volume, which we usually just call a PV. Each PV is backed by some actual storage volume — typically an OCI block storage volume — and those block storage volumes are awesome. They have a lot of great features; for example, you've got the ability to take immediate snapshots.
So, for example, if you're backing up data on that volume and you take snapshots, it's fast and really cool — those are the kinds of things we've been able to do with those persistent volumes, those block volumes. But there's kind of a dark side. We found that there are some real issues if you overuse persistent volumes. The first time you come to something like Kubernetes, you might have the mindset that each one of your pods is like a virtual machine, and every virtual machine has a block volume, so you'll have some kind of persistent volume associated with every pod in your Kubernetes cluster. But that runs into a bunch of different problems, and you really don't want to go down that road.
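Concretely, the one-volume-per-pod mindset means every pod carries a PersistentVolumeClaim along these lines. This is a sketch of the pattern being warned against — the claim name is illustrative, and the storage class name is an assumption about OKE's block-volume CSI class:

```yaml
# Hypothetical PVC: each claim like this is backed by a dedicated block
# volume and consumes one of the node's limited attachment slots.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes:
    - ReadWriteOnce          # a block volume attaches to one node at a time
  storageClassName: oci-bv   # assumed OCI block-volume CSI storage class
  resources:
    requests:
      storage: 50Gi          # OCI's minimum block volume size, per the talk
```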
In our experience, we found that we were using persistent volumes too freely, and it was causing us a lot of problems. Here are six of the problems we ran into. One problem is just the mechanics of it: those block volumes don't come for free, and your container is not a virtual machine.
This is really important, because it means that if my underlying node only supports, let's say, 32 mount points for block volumes, and every single one of my pods had a persistent volume, I would be limited to 32 pods per node — which is kind of a shame, because nodes can really support more than 100 pods. So unless your pods are using a tremendous amount of CPU, you would find that you were limited in scale not by CPU or memory, but by the number of mount points. The more persistent volumes you use, the fewer pods you're going to be able to run per node. That was the first thing we found.
We also encountered some bugs, as you'll find in any software. We found that sometimes our persistent volumes got mounted as read-only — there's a fix for this, which I'll talk about in a bit — but I think the underlying principle is that persistent volumes are just another piece of technology, and the more pieces of technology you employ, the more likely something is going to go wrong. Limiting that extra piece reduces the number of potential issues.
Another issue is that persistent volumes are per availability domain. An availability domain is like an availability zone in AWS: you go to some region and you've got three distinct data centers, each one a different AD, all part of the same region. These block volumes belong to a particular data center — but pods do not, and your Kubernetes cluster spans all three availability domains in a region. So if a pod moves from one availability domain to another, it will fail to attach to the persistent volume it left behind. It's kind of like moving states and finding out you left your pet behind: you moved to a new availability domain, but your data was left behind and it wouldn't detach. This was a problem we ran into, but there's a new driver model in Kubernetes called CSI — a plug-in model for persistent volumes — and OKE has now implemented it, and it's designed to help resolve this exact issue. So hopefully issue number three has gone away at this point; we're not running into it right now, but it was a problem we had encountered previously.
Another issue we had is kind of specific to OCI — I mean, every cloud has these sorts of limits, but I'm not up on what the specific limits are for everybody else. In OCI, persistent volumes have a minimum size of 50 gigabytes, and the problem was that this was just too expensive for development. We have a big development team and we're building a lot of stuff, so we were creating a whole lot of these persistent volumes, and that was just a lot of disk — even though in development we only needed maybe 500 megabytes for each one, we had to get them in a minimum 50-gigabyte slice. In production this isn't a problem for us, because our data storage for Internet of Things is much larger than 50 gigabytes, but in development it was an issue. So what we ended up doing was setting up our own Ceph cluster, which is kind of an off-the-shelf block volume system, if you will. We got some of the dense-IO bare metal machines from OCI, set Ceph up on them, and it's able to give us persistent volumes that are a lot smaller in size.
Another issue we ran into with persistent volumes: we found that there is a limit to the number of persistent volumes you can have. Again, CSI should have addressed this, and we haven't seen the problem recently, but just so you're aware — these things are sometimes deeper than what you see at the surface. The problem was that once we exceeded a certain number of block volumes, there was so much API traffic going to the block volume service that the service was rate-limiting it, and that led to attach errors on the Kubernetes side. Kubernetes was saying, "Oh, I can't attach this block volume, because I can't talk to the block volume service — you've been rate limited." That happened because the old FlexVolume implementation in Kubernetes wasn't doing any caching: periodically, Kubernetes asks the kubelet to run a report, basically, to find out which persistent volumes are still attached and which are not, so it can update its database on the control plane. When the kubelet did that, it would go to the block volume service and get rate limited.
There was one other thing that was kind of interesting: if you're doing IO-intensive tasks with block volumes, they're a lot slower than, say, a local disk — unless you use really large block volumes, because the number of IOPS you can get on a large OCI block volume is really high. Block volumes are really cool because of the snapshotting they do, and everything is encrypted at rest; but because those block volumes can be sliced from large to small, you get less bandwidth when you go with the small ones. Even though they're network-attached, they're really fast, really good hardware — but you have to have a large block volume. So again, when you're talking about testing and things like that, if you're using small block volumes, you'll find there's a disadvantage when it comes to performance.
So you might think, "okay, persistent volumes are all bad — what do we do?" A few things. First off, if you need a block volume — a persistent volume, I should say — for storing customer data, then you should use a PV. It's not like they don't work; they work really well, right out of the box. The problem is when you have lots and lots and lots of little PVs: that's when you start to run into some scale issues.
But persistent volumes don't have to be backed by a block volume: you can also back them with what in OCI is called FSS, which is basically NFS for OCI, and we actually make use of this for some of the things we do. It's really cool. One of the things we do: our developers want to have access to a heap dump if a pod runs out of memory. You might think, "I need to attach a persistent volume to this thing so that the heap dump — or crash dump, or whatever — will persist across pod restarts," and you'd be right. But, as I mentioned, having one of those on every single pod is going to cause a problem. Instead, what we found was that we could create a persistent volume backed by FSS and attach it to all of our pods. Then, when a pod does a heap dump — either because it ran out of memory, or because we manually invoked the command to make it dump, or because it's going to write a crash log or something like that — it gets written into this FSS-backed persistent volume. We have completely separate jobs running in Kubernetes that go through that file system, find these dumps, process them, and turn them into something we can share with the developer — all that kind of stuff, behind the scenes. Having a shared file system for that is really cool, because multiple pods can all use the same underlying storage, and FSS scales with your usage: you pay for what you use. Unlike a block volume, where it has to be 50 gig minimum, with FSS, however much you use, it just keeps scaling for you. So that was one use case we found that was really neat for persistent volumes in Kubernetes.
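A shared, FSS-backed volume like the heap-dump example can be declared once and mounted by many pods, since NFS supports ReadWriteMany. A hedged sketch — the server IP, export path, and names are placeholders, not the team's actual setup:

```yaml
# Hypothetical FSS-backed PersistentVolume: one NFS export shared by all pods.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: heap-dumps
spec:
  capacity:
    storage: 100Gi           # nominal; FSS actually grows with usage
  accessModes:
    - ReadWriteMany          # many pods, across nodes, can mount it
  nfs:
    server: 10.0.0.10        # placeholder FSS mount target IP
    path: /heap-dumps        # placeholder export path
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: heap-dumps
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""       # bind to the pre-created PV above
  resources:
    requests:
      storage: 100Gi
```

Each pod then mounts the same claim (say, at /dumps), and for a JVM workload the dump location could be pointed there with a flag like `-XX:HeapDumpPath=/dumps`.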
Another point I wanted to make is about something we learned on how to deal with logs. One of the main problems we ran into early on — it seems silly, but it's true — was just our logs. We would be logging things to disk, because that's the default behavior of most log systems, and eventually the log file would get so large that it would fill up all the space on the node. Remember, the way Kubernetes and Docker work is that you have your node, which is your virtual machine, and the containers share its disk.
So what we learned was that logs need to be given an explicit size and a rotation so that they can't fill up the space — that's the primary thing. Because the first thing we did when we saw that was say, "oh, we need to get those logs off of the node's hard disk, so we'll attach a block volume to this thing that it can write its logs to."
But then we found that that ran into all those other problems with having too many persistent volumes. So we asked: okay, what else can we do? What we ended up doing: we already had a Kibana/Elasticsearch (ELK) system set up that we were sending all of our logs to anyway. So actually, the only logs we needed locally were just what we needed as a buffer until we could send them off to our logging system.
So we scaled down the size of the local logs to a small number — like 100K or 250K or whatnot — with a rotation on it, so we just have a couple of files rotated, and that's it. It's just a buffer from which we send everything off to the ELK servers, where we can actually inspect all the logs for our entire system.
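That fix — cap each local log file at roughly 100–250K and keep only a couple of rotated files, treating local logs purely as a buffer before shipping to the central system — can be sketched with Python's standard logging module. The sizes and file names here are illustrative, not the team's actual configuration:

```python
# Minimal sketch of size-capped, rotated buffer logs: no matter how much is
# logged, local disk usage stays bounded at ~3 small files.
import logging
import logging.handlers
import os
import tempfile

log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "app.log")

handler = logging.handlers.RotatingFileHandler(
    log_path,
    maxBytes=100 * 1024,  # ~100K per file, as mentioned in the talk
    backupCount=2,        # keep only app.log.1 and app.log.2 as backups
)
logger = logging.getLogger("buffered")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Write far more than the cap to force several rotations.
for i in range(10_000):
    logger.info("event %d: request details to ship to the log system", i)

files = sorted(os.listdir(log_dir))
print(files)  # only app.log plus at most two backups remain
```

The same idea applies regardless of language: the log library enforces the cap, and a shipper (here, whatever forwards to ELK) drains the buffer.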
And every pod in Kubernetes can have ephemeral storage, which is super great for small disk usage. You'd never use this for storing customer data — all the data in ephemeral storage just vaporizes whenever the pod goes away; it really is ephemeral, it doesn't last, it's not persistent. But if all you're doing is storing some buffer for some little things, or you just need a little scratch space or something like that, just use ephemeral storage and don't even bother with any type of persistent volume.
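Scratch space like that is typically declared with an `emptyDir` volume. A minimal sketch — names, image, and the size cap are illustrative:

```yaml
# Hypothetical pod spec fragment: /scratch lives with the pod on the node
# and is deleted together with the pod -- fine for buffers, never for
# customer data.
volumes:
  - name: scratch
    emptyDir:
      sizeLimit: 250Mi        # cap so a runaway buffer can't fill the node disk
containers:
  - name: app
    image: example/app:1.0    # placeholder image
    volumeMounts:
      - name: scratch
        mountPath: /scratch
```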
When we first got started using Kubernetes, we never put requests and limits on anything — we just didn't. And we started seeing pods getting evicted all over the place. We found out that it was because resources started to get overused, and then Kubernetes was evicting pods without any guidance on what to evict, when, or where. So, for our particular workload:
It's very much memory intensive, so we set the request and the limit to be the same for memory, so that we have dedicated memory for each one of our pods. But CPU for us is totally over-provisioned, so we set the CPU request to something really, really small, and we either don't bother with the limit — use as much as you want — or we set the limit to something high, so one pod doesn't use all the CPU on the system. But it's not strict, because I want to be able to over-provision CPU on that node.
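The policy just described — memory request equal to the limit for a dedicated reservation, a tiny CPU request, and a loose CPU ceiling — might look like this sketch (the numbers are illustrative):

```yaml
# Hypothetical container resources: dedicated memory, over-provisioned CPU.
resources:
  requests:
    memory: 2Gi    # equal to the limit: the pod gets dedicated memory
    cpu: 100m      # deliberately tiny so CPU can be over-provisioned
  limits:
    memory: 2Gi
    cpu: "4"       # generous ceiling so one pod can't monopolize the node
```

With request == limit on memory, the scheduler reserves the full amount up front, which also gives the eviction logic the guidance Richard says was missing.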
Kubernetes gives you the ability to do this, and if you make use of it, it's really helpful, because it gives you a little bit better control over what gets scheduled where, and over what happens when you start running into your memory limits. I think for us, the biggest deal for stability was getting our memory settings just right.
C
When we set our memory, we had to make sure it was set large enough to account for everything. We run a lot of Java, so we had to make sure we had enough JVM memory and enough OS memory, system memory if you will, for everything else that's going to run. The same thing applies if you're running Go or something else: you just have to make sure to get those sizes right, and once we did, we found that things got really stable.
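(One common way to keep a JVM inside the container's memory budget; a sketch only, the flag and percentage are illustrative and not the speakers' stated settings. The container limit has to cover heap plus metaspace, thread stacks, and native overhead, not just the heap.)

```yaml
# Container env fragment for a Java workload: let the JVM size its heap
# from the container's memory limit instead of a hard-coded -Xmx.
env:
  - name: JAVA_TOOL_OPTIONS
    value: "-XX:MaxRAMPercentage=75.0"  # leave ~25% headroom for non-heap memory
```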
C
Okay, last lesson. Lesson number four is absolutely essential: we would have absolutely no chance of running anything at scale without good monitoring tools. We invest a lot in making sure all the monitoring is working really well for us, and it's something we're continually working on. Every one of our OKE clusters runs a Prometheus, which gathers all the metrics together, and we have a global Thanos, which is basically a distributor: it distributes Prometheus queries to all those little Prometheus servers running in all the clusters, and it gives us a single place to go and get metrics for all of our production systems. Similarly, we have a single Kibana and Elasticsearch cluster for all of our logs, so we have a single place to look for logs across all of our systems. And we use Alertmanager, the technology that comes with Prometheus, to raise our PagerDuty alerts.
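(As an illustration of the kind of rule that feeds such alerts; this uses the Prometheus Operator's PrometheusRule resource and a kube-state-metrics counter, and is a hypothetical example rather than one of the team's actual rules.)

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-health-alerts        # hypothetical name
spec:
  groups:
    - name: availability
      rules:
        - alert: PodCrashLooping
          # fires when a container has restarted repeatedly over 15 minutes
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 5m
          labels:
            severity: page       # routed by Alertmanager, e.g. to PagerDuty
          annotations:
            summary: "{{ $labels.pod }} is restarting frequently"
```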
C
One really important point is the health check. In addition to Prometheus, we have a health check. The health check is used by default by Kubernetes, but we add a lot more to it than just what Kubernetes makes use of, and the health check data also goes into our Prometheus system, so that we can do some really advanced visualizations that are useful for knowing what the health of our system is. The important point here is that you'll probably be doing monitoring for your core infrastructure anyway, but the real value comes when you're also doing monitoring at the application level.
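(The Kubernetes side of such a health check is a probe; a minimal sketch, where the paths and port are illustrative. The same endpoint can return richer detail that a scraper ships on to Prometheus.)

```yaml
# Container spec fragment: Kubernetes polls these endpoints by itself.
livenessProbe:
  httpGet:
    path: /health          # hypothetical endpoint
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health/ready    # hypothetical endpoint
    port: 8080
  periodSeconds: 5
```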
C
You can select the type of cluster, which is a taxonomy we use for our system, so it doesn't apply to everybody, and then you can pick a specific cluster or say all clusters, and you can give it a time range, and it'll tell you how many alerts you had during that time, what kind of alerts they were, and so forth.
C
We have some dashboards that show us the health of particular clusters, or again all clusters, or specific VMs, or specific availability domains: different ways of slicing and dicing it. But the dashboard will show us how many nodes we have in a cluster, how many failed nodes and where they are, what their health is like, and so on and so forth.
C
Those are the application-level concepts, the different things that I depend on as an application, shown as up or down. This has been really, really useful for us, and it comes from our advanced health checks, which tell us in more detail what's going on in a particular service instance, so that when we're running into a problem it's very easy to see, because it comes up red. And then we have some of the normal gadgets that also come with Prometheus, where we can see the CPU usage for all of the different pods in a service instance, all the memory being used by them, and also the HTTP error rates, the golden signals that tell us what customers are experiencing on that service instance.
C
So those are the four lessons I wanted to share with you today. The first one is that things die, so design for HA from the beginning. The second is to use your persistent volumes wisely: they can be really, really useful, and we use them a lot, but they do come at a cost, so don't just assume you can use persistent volumes liberally. Third, limits and requests really help to make things stable. And the fourth one: as somebody who's responsible for watching this stuff in operations (my background is that I was an application developer, then I joined Sun Microsystems and was an architect on the Java platform, and then I came to work on IoT and SaaS applications), until I was in this position of running things in production on the cloud,
C
I didn't appreciate how crucial monitoring is. But it truly is crucial to your applications, and you want to make sure you've got really good monitoring in place for the stuff that you do. So, given that I've been using OKE and Kubernetes in production, and all my customers are running on this thing, do I recommend it? My answer is yes. The slide is a little bit long-winded, but the essential point is there at the bottom: the higher up the stack, the better.
C
You know, we're in the business of delivering Internet of Things SaaS applications; we're in the business of delivering supply chain management applications. The rest of the stuff we have to do to make that work is just the underpinnings, not necessarily something we want to be investing a huge amount in. So the higher up the stack we go, the better, and OKE lets us live a lot higher up the stack than any of the other technologies we had at our disposal.
C
So the OKE team has done a fantastic job of supporting us, and their management of the system has been solid, so it really has taken a big weight off our shoulders in terms of how to manage these environments. It's been a really good experience for us, and I definitely recommend it. With that, there are some resources here, and you'll have these slides afterwards. I want to stop there so that we can open things up for questions and answers.
C
Yeah, so there are a couple of things. The first is that we do not accept any kind of code from customers. Our initial concern, which I think is common to a lot of people running on these clusters, was: if I have a security breach in one of these containers, what can somebody do? And they can do a lot. So one of the ways we secure our system is by not accepting any type of code from customers whatsoever.
C
When I say code, I don't just mean Java or Python or something; I mean no SQL either. There's nothing like that you can feed to us, and if you do feed us something like an image, we run it through checks outside of Kubernetes. It has nothing to do with Kubernetes, but any inputs that come through end up going through antivirus software and all that kind of stuff, so we can make sure that what's running in there is solid.
C
The other thing, and this is kind of common to all the teams building on OKE inside Oracle today, is that you can use something like Calico to restrict which pods are able to access which pods. Calico gives you a way to do network security at the Kubernetes level, and that's another technique that can be used.
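(Calico enforces standard Kubernetes NetworkPolicy objects; a sketch of restricting which pods can reach which, with illustrative labels rather than the team's actual setup.)

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-allow-frontend-only   # hypothetical name
spec:
  podSelector:
    matchLabels:
      app: backend            # policy applies to backend pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend   # only frontend pods may connect
      ports:
        - protocol: TCP
          port: 8080
```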
B
In addition, OKE also supports pod security policies, and it supports private worker nodes, and you also get all the benefits of the core security and network isolation that you would get from adopting OCI in general. That's not specific to OKE, but you just naturally benefit from all the OCI-related security features.
C
That's a great point. For example, private endpoints on the database: when you create your Kubernetes cluster, you create it inside some virtual cloud network, and when you set it up in there, your first line of defense is, of course, the OCI security boundary that you get from security lists and things like that, and your subnets and such in your VCN.
C
But you can also use the private endpoints on ATP or some other services, which will basically give you the ATP service inside your VCN, so that it's not routable externally at all. So then you have this nice, secure boundary.
B
So there was a question on chaos engineering. Richard, anything you can share from a customer perspective?
C
Good question. Our team is, I'd say, sort of in adolescence when it comes to chaos engineering; when I say our team, I mean the IoT team. At the moment, most of our chaos engineering is manual: we'll set up a cluster and then go take down all the nodes in an availability domain and see how our system responds, things of that nature.
B
Yeah, and we also see some interesting startups offering chaos engineering as a service, and from a product standpoint we're looking to really work with some of these startups to offer this as a marketplace service. That's definitely something we're looking at.
A
C
I personally don't recommend running any databases inside Kubernetes itself. Well, I guess it depends on how you define a database: we run Elasticsearch inside our Kubernetes, and we run Oracle NoSQL inside our Kubernetes, and I guess both of those could be considered databases by some. But I'll be honest with you.
C
I'm going to be totally honest: before, when I was at Sun, and before I was at Sun, Oracle Database was very difficult to set up and administer. Ever since Oracle Cloud came around, it makes Oracle Database so easy.
C
It's so easy and so powerful that Oracle Database is my go-to for a database, and I'm not saying that as an Oracle guy, even though I'm an Oracle employee; I'm telling you that as an engineer. In my experience with Oracle databases, we use the Autonomous Database ourselves, and it's just so much easier than hosting your own database inside Kubernetes. It's so much easier and more powerful that I would definitely look at it as a first option before I looked at anything else.
C
We have not yet, but even our use of Calico was not going to replace Flannel; it was just going to be used for the network-security-level stuff. We're actually also looking at another project, I think it's called Cilium, which also does the network-level security stuff, but we haven't moved our production clusters onto it yet.
C
That's a great question; in fact, I would like to do an entire presentation on that sometime. We do not use Helm. We started off with Helm, Helm 2, thinking, let's try to make use of this. There were various reasons why we didn't go with it, one of them being that ConfigMaps didn't get updated properly; there was some bug, and we dug into it, and it was one of those cases where it just doesn't really work that well. Maybe in Helm 3 it'll work.
C
Deploying Kubernetes YAML files is part of the story, but it's not the whole story, especially if you're trying to do something like blue-green deployments. You can't just apply YAML files; there's a whole orchestration process that has to go on there. Spinnaker is supposed to address some of that, but I've had some problems with Spinnaker. I'm not really sold on it, even though I want to be sold on it.
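(To illustrate why plain `kubectl apply` isn't the whole story: one common blue-green sketch, with illustrative labels rather than the team's actual setup, keeps two Deployments, `blue` and `green`, and flips a Service selector between them. The cutover, patching the selector only once the green Deployment is healthy, is exactly the orchestration step a tool has to sequence on top of applying manifests.)

```yaml
apiVersion: v1
kind: Service
metadata:
  name: app            # hypothetical name
spec:
  selector:
    app: myapp
    track: blue        # patch to "green" to cut traffic over to the new version
  ports:
    - port: 80
      targetPort: 8080
```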
A
Okay, well, thank you both very much for a wonderful presentation. That's all the time we have for today. As I said earlier, the presentation and slides will be available on the CNCF website later on. I'd like to thank everyone for joining us today, and our presenters. Everyone, take care of yourselves, stay safe, and we'll see you next time.