From YouTube: OpenShift Commons Gathering 2019, Santa Clara. "Building Operators at Uber" with Paul Schooss and Matt Schallert (Uber), on M3DB.
Matt: Hey everyone. I'm Matt, and I'm Paul. As Diane said, we're from Uber; we're SREs on the observability team, where we work on Uber's open-source metrics platform, and today we're going to share with you some of our experience building an operator for part of it. We work on M3DB, which is Uber's open-source time series database. It's part of M3, which is Uber's open-source metrics stack, and M3DB is the core distributed time series database that sits at the center of M3.
Matt: We also have other components, like our fault-tolerant aggregation tier and our distributed query engine, all of which act as a way to use M3DB either as a general time series database or as a long-term store for Prometheus. To give you a sense of M3DB's usage at Uber: we have a little over 1,100 instances running M3DB.
Matt: All together they ingest, post-aggregation, 33 million writes per second, which comes out to about 55 gigabits per second, and we store a little over 9 billion unique metric IDs.
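As a rough sanity check (our arithmetic, not a figure from the talk), the quoted write rate and bandwidth imply an average on-the-wire size of roughly 200 bytes per write:

```python
# Back-of-envelope check on the scale figures above:
# 33M writes/s at ~55 Gbit/s works out to about 208 bytes per write.
writes_per_sec = 33_000_000
bits_per_sec = 55 * 10**9

bytes_per_write = bits_per_sec / writes_per_sec / 8
print(f"~{bytes_per_write:.0f} bytes per write")  # ~208 bytes
```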
When we first built M3DB in 2016, operating it was pretty simple. We ran M3DB in two data centers, with one cluster per data center, and the clusters shared a static configuration. Our only use case back then was real-time alerting and dashboard introspection.
Matt: So we just stored metrics at 10-second resolution for two days. But fast forward to 2019, and things aren't nearly as simple. We run M3DB in more data centers, there are more clusters per data center, the clusters are larger, and they no longer share a static configuration. We have some clusters that store high-resolution metrics at, say, 1-second resolution for 6 hours, and some clusters that store low-resolution metrics at, say, one-hour resolution for up to 5 years.
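To make the retention trade-off concrete, here is a small illustrative calculation (our numbers, using the resolutions and retentions mentioned above) of how many datapoints a single series accumulates under each policy:

```python
# Datapoints one series accumulates under each retention policy.
def points_per_series(resolution_sec: int, retention_sec: int) -> int:
    return retention_sec // resolution_sec

HOUR, DAY, YEAR = 3600, 86400, 365 * 86400

high_res = points_per_series(1, 6 * HOUR)      # 1s resolution for 6 hours
low_res = points_per_series(HOUR, 5 * YEAR)    # 1h resolution for 5 years
old_policy = points_per_series(10, 2 * DAY)    # 10s for 2 days (the 2016 setup)

print(high_res, low_res, old_policy)  # 21600 43800 17280
```

Interestingly, the high-resolution/short-retention and low-resolution/long-retention policies end up the same order of magnitude per series; the difference is in which workloads they can serve.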
Matt: On top of all this configuration complexity, we also run M3DB in multiple cloud providers where Uber has a presence. When you put this all together: the clusters grew by something like 20 times, the number of engineers stayed relatively the same, and while M3DB itself is pretty easy to manage at the level of a single cluster, we really needed something that would help automate the whole cluster lifecycle. Paul's going to talk a bit about what that looks like. Thanks.
Paul: There's an action plan that gets formulated, and an engineer has to go out and actually enact that plan, and that can take a lot of time away from that engineer actually doing core feature development. So let's talk about why we went the operator route: we wanted to make this experience a lot smoother for our engineers. First, before we even went down the path of thinking about how to automate this, we had to understand the problem space that we were in.
Paul: So let's talk about some of the features of M3DB at its core, to understand what we had to automate. First and foremost, M3DB is a sharded, distributed time series database. Each piece of data, upon ingestion into M3, is physically sharded into a respective partition, to separate the failure domains that we have inside the database.
Paul: Each of these shards is furthermore replicated, to separate out failure domains within the database even further. All shards are replicated by a factor of three under normal settings, and the replicas are synced in real time, at all times. So, with those primitives given, let's talk about some of the features that we set out to automate.
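A minimal sketch of those two primitives: hashing each series onto a fixed set of shards, and replicating every shard across three failure domains. This is our illustration of the idea; M3DB's actual placement logic, shard counts, and node naming are more involved than this:

```python
# Toy model of sharding + replication (illustrative, not M3DB's algorithm).
import hashlib

NUM_SHARDS = 64
REPLICATION_FACTOR = 3
ZONES = ["zone-a", "zone-b", "zone-c"]  # hypothetical failure domains

def shard_for(series_id: str) -> int:
    # Deterministic hash, so every writer and reader agrees on placement.
    digest = hashlib.sha1(series_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

def replicas_for(shard: int) -> list[str]:
    # One replica per zone: losing any single zone leaves 2 of 3 copies.
    return [f"{zone}/node-{shard % 4}" for zone in ZONES]

shard = shard_for("cpu.usage{host=web42}")
print(shard, replicas_for(shard))
```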
Matt: Yeah, so when we were talking through our requirements for a platform, given that we're working on a database, one of the main things that stood out was storage. Whatever system we chose had to have features to support a demanding stateful workload: M3DB requires access to low-latency storage, and we require access to that storage across all of our clusters.
Matt: We have internal workloads at Uber that query large time series ranges, from hours to years, for things like alerting, SLA tracking, and capacity planning, and our stakeholders expect to have access to all of this data in a real-time manner. To give you a sense of what we mean by performant stateful primitives, let's take a look at some of the ways you might store state in a containerized application. One option would be to not store your state at all: on M3DB instances, data would live only as long as its container.
Matt: This might seem nice, as you would have direct access to the host disk and all the speed that comes with that, but otherwise it would be pretty terrible. For one, there's no durability of the data: as containers are upgraded or restarted, even in place, they would have to re-stream all of the data that they're responsible for from their peers, which could be terabytes of data. And in addition to being inefficient, that has dangerous reliability implications.
Matt: When an M3DB instance is coming up, or bootstrapping as we call it, it takes writes but isn't yet available for reads, meaning that while an instance is streaming data from its peers, you're at a reduced read replication factor. Another option for storing your state in a more durable manner might be to use remote storage. This is pretty popular on your neighborhood cloud provider; you can use something like a block store. In this case we still weren't really happy, as there's increased latency.
Matt: Additionally, it seemed pretty inefficient for us to pay someone to replicate data at the block layer that we're already replicating at the application layer, and to do that more slowly, with potentially varying performance characteristics. But honestly, the main thing was that this is less portable for us across cloud providers and our own data centers: different cloud providers' remote block store solutions have different performance characteristics, and we also just weren't comfortable using network storage inside of our data centers.
Matt: The last option was an object store. This is kind of nice, since you're only paying to store one copy of the data, but there's even worse latency introduced, and it still brings back the problem that, as containers are moved around, they could potentially have to pull all of the data that they're responsible for back down from the object store. So this left us searching.
Paul: So you can see the metrics we had right here: durability, performance, and efficiency, and the other approaches that Matt mentioned really didn't check all the boxes that we were looking for. So we went out and did some evaluation, and we settled on Kubernetes, because it eventually checked all of these boxes for us. One of the first features was the local persistent volumes that it provides; that type of abstraction was very portable for us, on-prem as well as in the cloud, and met our needs.
Matt: Additionally, some more Kubernetes features that were really helpful for us were the combination of node affinity and StatefulSets. With StatefulSets we got things like strict ordering of pod operations, whether that's a maintenance rolling restart or a cluster version upgrade, and combining that with node affinity we were able to express our requirements for failure domains.
Matt: To give you an example of what I mean by that: in our internal data centers, we use Uber's provisioning tooling to make sure that the physical hosts that M3DB runs on are provisioned and scheduled across racks and other failure domains in the data center. Once those instances are up, M3DB's own placement algorithms take care of distributing the shards across those failure domains, such that the loss of no single failure domain can cause the cluster to lose quorum.
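That quorum property can be sketched in a few lines, assuming simple majority-quorum semantics (our simplification of M3DB's consistency model):

```python
# With replication factor 3 and one replica per failure domain,
# losing any single domain still leaves a majority of replicas.
REPLICATION_FACTOR = 3
DOMAINS = {"rack-1", "rack-2", "rack-3"}  # hypothetical domain names

def has_quorum(live_domains: set) -> bool:
    majority = REPLICATION_FACTOR // 2 + 1  # 2 of 3
    return len(live_domains) >= majority

assert has_quorum(DOMAINS - {"rack-2"})  # one domain lost: quorum holds
assert not has_quorum({"rack-1"})        # two domains lost: quorum gone
```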
So, using node affinity, we were able to express this really nicely in our cloud environments: we could pin StatefulSets of M3DB to zones in the cloud, and then we'd be guaranteed that the pods were evenly distributed across all of these failure domains. And again, when the M3DB instances come up, they just distribute the shards across all the zones.
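As a sketch of that zone-pinning setup: the field names below follow the Kubernetes node-affinity API, but the zone names are hypothetical and this is our illustration, not the actual spec the M3DB operator emits:

```python
# Build the pod affinity stanza that pins one StatefulSet to one zone.
def statefulset_affinity(zone: str) -> dict:
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [{
                    "matchExpressions": [{
                        # Well-known Kubernetes zone label.
                        "key": "topology.kubernetes.io/zone",
                        "operator": "In",
                        "values": [zone],
                    }]
                }]
            }
        }
    }

# One StatefulSet pinned to each of three zones gives an even spread
# of pods across failure domains.
specs = {z: statefulset_affinity(z)
         for z in ["us-east1-b", "us-east1-c", "us-east1-d"]}
```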
Paul: Let's talk about some of the findings that recurred while we were developing the operator itself. First and foremost, development and CI are going to be at the forefront of your mindset. With the blistering pace of the Kubernetes release cycle, every quarter, there were a lot of breaking changes, so one of the first things we discovered was to have proper end-to-end tests that would install the operator, do some rudimentary checks, and make sure it's happy.
Paul: Everybody loves unit tests, but when you actually see the full end-to-end lifecycle work, it makes you much more confident. Those tests helped us catch a lot of bugs from breaking changes in the Kubernetes APIs. Another thing that we ran into was local development.
Paul: Originally everyone set forth and used Minikube, but Minikube with node affinity is much different from a full-blown Kubernetes cluster like you would have in a cloud or on-premise. So it's very important to know that these environmental differences can shape how your operator is developed, and what is going to change over the course of time. Another thing that we ran into is installation.
Paul: In addition to this, if you go at it right off the bat, it takes away from your core feature development, and you'll start making assumptions that you wouldn't have made otherwise; it kind of leads you down a rabbit hole. Matt will go into that a little bit more.
Matt: What we mean by this is that we built this operator with all the assumptions of how we operate M3DB at Uber, which is that we run highly available clusters that are spread across multiple failure domains. We realized that we ended up baking some of those assumptions about how we operate M3DB into our operator, and they might not be totally portable for everyone.
Matt: We'll talk about that in a second, but one thing we realized is that users are at different phases of what you might call the Kubernetes maturity lifecycle, in the sense that you have some people that are still testing the waters; maybe they're just installing your operator on a local Minikube instance.
Matt: Our operator actually requires that you have a Kubernetes cluster spread across three failure domains if you're creating an M3DB cluster with a replication factor of 3, because that's how we operate them at Uber. But not everyone is really there yet, and you kind of have to meet your users halfway.
Paul: So, let's talk about some of the successes, some of the results that we actually achieved from this. We have some clusters out there, in three different zones. Currently we have the operator running a subset of Uber's metrics, for services that are hosted inside our data centers. One thing we realized right off the bat: we were excited about this, but it presents a different paradigm from how we typically operated clusters.
Paul: We had to shift over to how an operator is used and how that would interface with our engineers. So let's talk about what it really helped out with. Let's go back to those two cycles. We have the reactive cycle, which changed dramatically: now, a node becomes problematic, it signals to our operator that something's going on, the operator triages based on whatever cases it catches, and it tries to bring the cluster back into a healthy state.
Paul: Previously we would have those meetings about capacity planning and how we're going to manage our clusters, and it would take a lot of time out of people's hands, especially when you had to enact the changes that we would set forth. With a stateful service like this, you have to reach certain goal states with your placement on addition or removal of nodes, and that was done manually: an operations engineer would have to go in.
Paul: Ideally, it takes that down to 20 minutes a week; it's no longer this whole big parade of getting everybody together to figure things out. It can be done with a config change in a Git repo of some sort; just the config change that's there. So let's go over some advice for large stateful workloads, as I was mentioning, because this is a tough thing to do. Stateless apps are great, but stateful presents a lot more challenges to engineering.
Paul: So, first and foremost, we didn't just jump to Kubernetes. As Matt mentioned earlier, M3 has been around for quite some time; the whole ecosystem is four years old, and we invested a lot of time and effort to make this platform reliable before it was ever inside an automated operator such as the one on Kubernetes. So first of all, make sure your tooling is there, make sure your platform is there, nice and stable. Don't just do a POC and say, "you know what, we need to put this on Kubernetes."
Paul: The second thing is that we only really approached this, as mentioned before, when we hit scaling challenges. Forty-plus clusters for an on-call engineer is kind of a lot to take on; it's burdensome. So we only started looking at these automated solutions when operational complexity was at such a high scale that we needed to remediate it in a proper way; we only started into this when scaling became an actual challenge for us. With that said, keep in mind this is not a magic bullet.
Paul: It's not going to solve all your problems; it's actually going to add more complexity to your platform. So be very mindful of adding yet another layer of operation on top of your cluster management, because it's yet another thing that you're going to have to think about while you're debugging, or while actually developing the automation that's there.
Matt: Another piece of advice, if you're considering creating an operator for your own stateful workload, is to embrace the declarative approach to infrastructure management that Kubernetes has made so popular. We were already doing this, in the sense that M3DB's cluster topologies are actually modeled as desired states, stored in etcd, that the M3DB instances themselves work to converge on, updating the state when that convergence has occurred.
Matt: So this actually meant that pretty much all our operator had to do was exchange this series of desired states between Kubernetes and M3DB, taking feedback from one to inform decisions about the other; it just took part in this holistic reconciliation loop. Compare this to a database where, if you're initiating a replace, you need to know that you initiated the replace and somehow keep track of that yourself; with M3DB, that intent simply lives in the desired state.
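The idea can be sketched as a toy reconciliation loop (our sketch, not the operator's actual code): instead of tracking in-flight imperative actions like "I started a replace", each pass just diffs the desired state against the actual state and emits the next steps:

```python
# Toy declarative reconciliation: diff desired vs. actual membership.
def reconcile(desired: set, actual: set) -> tuple:
    actions = []
    for node in sorted(desired - actual):
        actions.append(f"add {node}")     # node should exist but doesn't
    for node in sorted(actual - desired):
        actions.append(f"remove {node}")  # node exists but shouldn't
    return set(desired), actions

desired = {"m3db-0", "m3db-1", "m3db-2"}
actual = {"m3db-0", "m3db-1", "m3db-9"}  # m3db-9 is being replaced by m3db-2

new_state, actions = reconcile(desired, actual)
print(actions)  # ['add m3db-2', 'remove m3db-9']
```

Note there is no persistent "replace in progress" flag anywhere: if the loop crashes halfway, the next pass recomputes the same diff and carries on.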
Matt: This meant that even if the Kubernetes API is down or malfunctioning, so long as the pods themselves are still up (which Kubernetes makes great effort to ensure is the case under failure scenarios), we could still make emergency changes to our cluster topologies by just modifying the state that's stored in etcd. So, how are people starting to use this?
Matt: This is probably the part that's been most exciting for us. It's only been a few months that our operator has been out now, and we already have people that are starting to use it and are working on deploying it in production. One thing that's been really cool to see: we built M3DB as, you know, this long-term storage for Prometheus.
Matt: And additionally, we have users that have told us, "oh yeah, I'm using the etcd operator to make my etcd clusters that M3DB talks to, then I'm using the M3DB operator to make these M3DB clusters, and I'm using the Prometheus operator to make these Prometheus clusters that are reading and writing from M3DB." And again, it's this theme that we've heard today: operators everywhere. That has been really cool to see.
Matt: Looking at OpenShift has actually been really valuable for us because, again, talking about those assumptions that we built into our operator: we're running M3DB in these completely walled-off, single-tenant environments, and we hadn't thought as much about things like locking down our pod privileges. The stricter default security policies that come with OpenShift have actually caused us to go back and make sure that our clusters support the security configurations that more mature users might need.
Paul: To wrap up very quickly, we have some future work. More CRDs: right now we're pretty limited, with everything coupled together, both our API/query service that provides the read and write endpoints, and the actual storage node itself. We want to start separating these out and build a richer reliability and scaling story for folks. Beyond that, we want even more CRDs: M3DB is only one part of M3, and M3 is an entire ecosystem. We have a collection tier.
Paul: We have a very reliable aggregation tier and an ingestion tier, so we want to break this out a little bit more and provide people the entire ecosystem that we use inside Uber, in an automated, operator-driven fashion. Lastly, we would like to thank our entire team that's not on stage right now (it's just the two of us) for helping to contribute to this project and make it an actual reality and a possibility. For more information…