From YouTube: Scaling Monitoring at Databricks from Prometheus to M3
A: Yeah, so for those of you who haven't heard of Databricks: we were founded in 2013 by the original creators of Apache Spark. We're a data-and-AI unified analytics platform as a service, and we serve over 5,000 customers. We're still a startup, but we've grown pretty big: we have more than 1,500 employees, with more than 400 engineers, and an ARR of $400 million or more.
A: In this talk we'll cover how we monitored Databricks before M3, then I'll talk about how we deployed M3 at Databricks, including architecture and migration, and then Nick will cover some of the lessons we learned in this process, including operational advice, important things to monitor in an M3 cluster, and how we do updates and upgrades.
A: First, I'm going to provide some context about the role that monitoring plays for Databricks engineers. We have two main metric sources. The first is our internal services, which run on Kubernetes clusters that we manage in-house. The second is external services running co-located with the VMs for customer Spark workloads; these run in customer environments.
A: We've been running a Prometheus-based monitoring system to monitor these targets since 2016, and all service teams rely heavily on it. Service owners write and emit their own metrics from their own services and use those metrics for dashboarding. We use Grafana for dashboards, engineers write their own queries (most engineers are PromQL-literate), and engineers maintain their own alerting rules.
A: For those of you who are not too familiar with Prometheus: it's a single-node monitoring solution used for event metrics and alerting. It uses a pull-based model: it scrapes metrics from other services, stores them as time-series data, and can then serve queries and alerts using this data.
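For reference, the pull model just described corresponds to a minimal Prometheus scrape configuration like the sketch below; the job name and target are invented for illustration, not Databricks' actual setup:

```yaml
# prometheus.yml -- minimal sketch of the pull model described above.
# Job name and target are illustrative.
global:
  scrape_interval: 30s        # the talk later mentions a 30-second interval

scrape_configs:
  - job_name: my-service
    static_configs:
      - targets: ['my-service:9090']   # Prometheus pulls /metrics from here
```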
A: So in each region we have two different Proms for the two types of monitoring targets that we have. The first is prometheus-normal: the instance that scrapes all the Kubernetes pods local to the cluster. Then we have a prometheus-proxied instance for proxied metrics from services that live outside of our Kubernetes cluster.

A: For this second Prometheus instance, we maintain a whitelist so we only ingest some metrics, to reduce metric volume, since the metrics from our customer environments are the higher-cardinality workloads. The reason we have two separate Prometheus servers, instead of just one server per region, is scaling limitations: we found that the metrics from both of these sources would not fit on a single Prometheus server.
A: Users interact with this monitoring system in two main ways. The first is alerting: each of these Prometheus servers evaluates alert rules and issues alerts to Alertmanager, and Alertmanager forwards the alerts to response channels like PagerDuty or Slack. The second workflow is querying and dashboarding.
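To make the alerting workflow concrete, here is a hedged sketch of a Prometheus alerting rule of the kind service owners maintain; the metric, threshold, and labels are invented for illustration:

```yaml
# rules.yml -- illustrative alerting rule; metric names and threshold are made up.
groups:
  - name: example-service
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{job="example-service",code=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="example-service"}[5m])) > 0.05
        for: 10m                 # must keep firing for 10 minutes before alerting
        labels:
          severity: page         # Alertmanager routes on labels like this
        annotations:
          summary: "5xx ratio above 5% for 10 minutes"
```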
A: Here are some numbers to give you a picture of the scale we were running this Prometheus monitoring system at. We ran in more than 50 regions across different cloud providers, and we monitored 100+ microservices, with an infrastructure footprint of 4 million+ VMs of Databricks services and customer Apache Spark workers. And because of our architecture, a single Prometheus server handled all the metrics from the customer environments.
A: That server was pretty huge. At its peak it was handling close to a million samples per second; it had a pretty high metric churn rate, with a lot of metrics from short-lived Spark jobs persisting for less than 100 minutes; and the disk usage was pretty high, at 4 terabytes for only 15 days of retention. We were also running on a really big machine, with 64 CPU cores and 2 terabytes of RAM.
A: In terms of the user experience, users had to deal with a sharded view of metrics: they had to be aware of which metric store they wanted to query, whether it was prometheus-normal or prometheus-proxied. Users also had to deal with query slowness: big queries would take a long time, sometimes they would never complete, and sometimes they would even cause the Prometheus server to fall over.
A: Users also had to deal with a shorter retention period: they could only see metrics from the last 15 days instead of the last 90 days, which would have been ideal, and they couldn't really see metrics spanning different release cycles, since our release cycle is only two weeks. Users also had to deal with the metric whitelist, which could only include a small subset of metrics.

A: And occasionally, for us as the operators, when we were dealing with capacity issues, we would even have to actively remove metrics from the whitelist just to keep our Prometheus server running.
A: So, with these scaling bottlenecks and pain points, we really needed to find a more scalable monitoring solution. Some of our requirements were: the system needs to be able to handle high metric volume, cardinality, and churn rate, since Databricks was growing rapidly and we needed a system that could keep up.
A: We also wanted 90-day retention, so that engineers can monitor their service release over release. Also, importantly, we needed it to be PromQL-compatible, since everything is already built on top of Prometheus: we didn't want to migrate workflows like alerting rules, queries, and dashboards to another system and teach users a different language.
A: We also wanted seamless operations: no metric gaps during updates, and less manual work for updates. It would also be nice if the system had been battle-tested in a large-scale production environment, and good if the system was open source, so that we'd have more freedom to manage it on our own and make it more suitable to our metric workload; we also felt there was more transparency into the cost of running an open-source monitoring system. Here are some of the alternatives we considered in mid-2019.
A: We considered some open-source solutions like Cortex and Thanos. We prototyped Thanos in late 2018; it wasn't that mature back then, and we weren't really comfortable using it at our scale in production. We also prototyped Cortex, and we found it wasn't really suited to metric workloads with a pretty high churn rate.
A: We also considered some hosted solutions like Datadog and SignalFx, but they were too expensive. So, given our requirements and the alternatives we considered, why did we pick M3? M3 fulfilled all our hard requirements: it was designed for large-scale workloads, it's horizontally scalable, and it exposes a Prometheus-compatible query API endpoint as well.
A: So now I'm going to cover how we deployed M3 at Databricks: specifically, the different decisions we made along the way and how we arrived at our final architecture.
A: The first setup we considered, remote-writing from our existing Prometheus servers, would result in some improvements. First, since all metrics would be remote-written into the M3DB storage database in the region, we would get rid of the sharded view of metrics within a region: users could have a consolidated view of both prometheus-normal and prometheus-proxied metrics in one region, rather than having to query both separately.
A: Though this architecture was really simple, and would have been the least amount of work to incorporate M3 into our system, we did find some trouble with remote-writing from Prometheus servers. Specifically, we couldn't remote-write at the scale we needed, especially for the prometheus-proxied instance, which was handling all the higher-cardinality metrics from our customer environments.
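For reference, remote write from Prometheus is configured roughly as in the sketch below; the queue parameters are the knobs you would tune when pushing at high volume. The endpoint URL and values are illustrative, not Databricks' settings:

```yaml
# prometheus.yml fragment -- hedged sketch of remote write to an M3 coordinator.
# URL and queue values are illustrative only.
remote_write:
  - url: http://m3coordinator:7201/api/v1/prom/remote/write
    queue_config:
      capacity: 10000            # samples buffered per shard
      max_shards: 200            # parallelism of the send loop
      max_samples_per_send: 500  # batch size per request
```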
A: Unfortunately, M3 doesn't have an out-of-the-box rule evaluation engine; it mainly just serves as the metric storage database for writing to and querying from. This led us to building our own rule engine, which was more in the spirit of designing an architecture with more lightweight and scalable components, each serving a narrower purpose. For this, we ripped out the rule management code from open-source Prometheus and deployed that as our Prometheus-API-friendly rule engine.
A: We pass our original Prometheus monitoring system's alerting rule configurations into it. The rule engine issues alert queries to M3DB; M3 handles the evaluation of the query and returns the query result back to the rule engine; the rule engine does some extra processing, for example checking the `for` duration of the alert and adding any extra external labels to the alerts; and then it issues the alert onward to the user.
A: So far, we've covered the components we set up to interact with M3: the scrape agents, the metric proxy service with the remote writing, and the rule engine.
A: The storage cluster consists of multiple replicas; in our case we use three. Each replica has multiple pods, and each pod has a disk attached to store metrics. To scale up the cluster, we just increase the number of pods in each replica. Then we have the M3 coordinators: these let us interact with the storage cluster to read and write metrics. The coordinators have M3 Query embedded in them, so for write requests, a coordinator receives the request, unpacks it, and writes it into all replicas of the storage cluster; read requests are served similarly through the coordinators.
A: We also run the m3db-operator. This is optional in an M3 system, but we use it because it's really useful for a Kubernetes setup: it automates scaling the cluster up and down, and also automates creating and deleting the storage cluster, so we don't have to manually manage the three different replicas.
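To sketch what the operator manages, here is a hedged example of an m3db-operator cluster spec with three replicas (isolation groups), scaled by changing the instance counts; the image tag, counts, and namespace preset are illustrative, not Databricks' values:

```yaml
# Hedged sketch of an m3db-operator M3DBCluster resource. Values illustrative.
apiVersion: operator.m3db.io/v1alpha1
kind: M3DBCluster
metadata:
  name: example-m3db
spec:
  image: quay.io/m3db/m3dbnode:v1.0.0   # pin a version you have tested
  replicationFactor: 3                  # the three replicas described above
  numberOfShards: 256
  isolationGroups:                      # one group per replica/zone
    - name: group1
      numInstances: 6                   # scale up by raising this count
    - name: group2
      numInstances: 6
    - name: group3
      numInstances: 6
  namespaces:
    - name: metrics
      preset: 10s:2d                    # built-in preset: 10s resolution, 2d retention
```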
A: One issue we did have initially with this basic setup was that the M3 coordinators had a lot of noisy-neighbor issues. For example, if a user submits a heavy query, it might take up all the CPU and memory on the coordinator, which might impact the write path and cause data loss, or impact the rule-evaluation workload from the rule engine and affect the availability of alerts.
A: So we created four different groups of coordinators. We have a group that handles writing: many small replicas of write coordinators to handle incoming requests from our scrape agents and the metrics proxy service. Then we have a group designated for the rule engine; this handles a regular, more predictable workload of querying for rule evaluation,
A: since the rule engine just submits the same set of alert rules at regular intervals to M3. And then we have two groups to handle ad-hoc querying. The first is the regional group, which returns query results based on the metrics in the regional cluster. The second is the global querying group, which reads from M3DB clusters across regions and provides a global view to our users.
A: We wanted to separate the regional and global coordinator groups since the global group's configuration is really different: it requires setting up connectivity across clusters and has different security configurations. But, more importantly, our users mostly use the regional view of metrics, and we knew that if we just exposed the global view, it was unlikely that users would make the extra effort to always add a region label filter to each query.
A: Now that we'd separated the coordinator groups, our two most important workloads were stable: the write path, where stability is important to prevent data loss, and the rule-evaluation path, which is highly critical for us to always have alerts.
A: To monitor M3 itself, we decided to set up a vanilla, lightweight Prometheus server. This Prometheus server only scrapes M3-related components; it has no disk attached, and its retention is only a couple of hours, so it's very easy to maintain, since we consider it to be stateless and restarts happen really quickly.
A: The metric retention period is short, but it's sufficient for us, since we mainly use this Prometheus to alert us if any M3 components are down, which doesn't require looking back at metrics over the past couple of days. This Prometheus server issues alerts straight to Alertmanager, and it's completely independent from the M3 system.
A: We also have a global M3-monitoring Prom for a longer-term view of our M3-related metrics, for example to track disk usage, memory usage, or the number of reads per second. We use the Prometheus federation feature here to federate all metrics from the regional M3-monitoring Proms to be persistently stored in this global Prom.
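Prometheus federation is configured as an ordinary scrape job against the /federate endpoint. Here is a minimal hedged sketch; the job name, matcher, and target are illustrative:

```yaml
# Global Prom scrape config -- hedged sketch of Prometheus federation.
# Matcher and target are illustrative.
scrape_configs:
  - job_name: federate-m3-monitoring
    honor_labels: true                 # keep the original regional labels
    metrics_path: /federate
    params:
      'match[]':
        - '{job=~"m3.*"}'              # pull only M3-related series
    static_configs:
      - targets: ['m3-monitoring-prom.us-west.example:9090']
```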
A: During the migration, we were dual-writing metrics to both Prom and M3 storage, and we were evaluating alerts in both the old Prometheus system and the new M3 system. So we just sent all the alerts from the new M3 system to a blackhole receiver that didn't fire alerts to any real receivers, and we also opened up a querying endpoint, but only internally to the observability team and not to the rest of the eng org, so that we could do some behavior validation.
A: The third step was an incremental rollout of querying traffic and alerts. For ad-hoc querying traffic, we staged it across environments and did a percentage-based rollout of traffic from Prometheus over to M3. For alerts, we did a per-service migration: we replaced alerts emitted by Prom with alerts emitted by M3 for less critical services first.
A: Here's a diagram to illustrate how we did the rollout of the ad-hoc querying traffic. It's a pretty simple setup: we just put an NGINX in front of the querying endpoints of Prometheus and M3, split the query traffic across both, and over time slowly increased the percentage of traffic directed to M3.
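In a Kubernetes setting, one way to express this kind of percentage split is with ingress-nginx canary annotations. This is a sketch of the idea, not necessarily how Databricks configured their NGINX; hostnames and service names are invented, and a primary Ingress for the same host pointing at the Prometheus endpoint is assumed to exist:

```yaml
# Hedged sketch: send 10% of query traffic to the M3 endpoint via
# ingress-nginx canary weights. Names are invented.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: query-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"   # raise gradually toward 100
spec:
  rules:
    - host: metrics-query.example.internal
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: m3-query-coordinators   # hypothetical service name
                port:
                  number: 7201
```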
A: In addition to this label, we added a source label to indicate whether the alert was submitted by the old Prometheus system or the new M3 system, and then we made some routing configuration changes in Alertmanager. If we want to roll out M3 alerts for a less critical service first, the M3 alerts from that service go to the real receiver, while the equivalent Prometheus alerts for that service go to the blackhole receiver that doesn't alert anyone.
A: We found this rollout strategy to be really nifty, because it was all controlled in Alertmanager at the routing level, and Alertmanager can hot-reload any new config quickly without a restart. So it was very easy to advance the rollout, but, more importantly, it was easy to roll back if we found any issues, which was good, since alerting is a highly critical service: rollback should be able to happen quickly.
A: The outcome of this one-year migration is that M3 now runs as the sole metrics provider in all environments across clouds. The global querying endpoint is available via M3 for all metrics; it's still in beta, so we're still testing it and rolling it out. And the user experience is largely unchanged: our users still use PromQL for alerting rules and dashboards, and we still use Alertmanager for all our alerts.
A: So now we have much higher confidence that we can continue to scale this system in the upcoming years, since Databricks is continuing to grow rapidly as a company and will keep processing larger and larger workloads. And, most importantly, the observability team doesn't have to deal with a giant Prometheus server anymore that runs on two terabytes of RAM and takes multiple hours to restart. Okay, on to Nick for lessons learned.
B: Thanks, everybody. Cool, so I'm going to cover some of our lessons learned in operating M3 over the past year or so. Some of the things I want to talk about are: the system metrics you should be looking at if you're monitoring M3, some general operational advice, some things we found really helpful to alert on, and a little bit about how we do upgrades and updates.
B: I just want to give a brief overview first, because when you're talking from the trenches, the perspective can sometimes sound negative, since I will be talking about issues we've run into. But I want to emphasize at the start that overall, M3 has been amazingly stable for us. Y talked about how much trouble we had operating Prometheus; it was a constant source of alerts and trouble for us.
B: And we operate more than 50 different deployments of M3, and it's just really stable. We have a few places where we've run into issues, and I'll be talking about those; they are obviously the ones at the highest scale, where you're really pushing against the limits of what we can do.
B: But overall it's been an extremely stable thing for us, and honestly, the biggest problem we have is just that we keep running out of disk space in places because our metric load keeps going up, and that's obviously not M3's fault; that's our fault for needing to be better about how we deal with incoming metrics. So that's the positive side. We have had some problematic things, so I'm going to dive into those, because dealing with problems is obviously an interesting thing to hear about.
B: Before I do that, just a little bit more about how it runs. Like I said, we have a large number of clusters, so we have to automate things. We use a combination of Spinnaker and Jenkins to do templated applies to update things. That's where having the operator is really nice, because it makes it pretty easy to do those updates in our bigger clusters.
B: We process close to a million samples per second and about 200,000 reads, so we are more write-heavy; that's definitely the workload we have at Databricks. Cool, so I wanted to jump into, at the top level, the things we found most important to keep an eye on while you're operating. We look at how much memory is being used; these are on the M3DB pods.
B: We have seen that if you are steadily over 60% memory usage, that can be bad, mostly because certain things happen that can cause memory spikes, and if you're consistently over 60%, those can get you all the way up over 100%, and then boom, you OOM. It's nice that, because it's distributed and highly available, if only one pod OOMs it's not a big deal: it recovers, nobody even notices. We don't even get alerted when that happens for a single one.
B: But if all of your pods are consistently over 60%, you have a good chance of multiple ones OOMing, and then things can be bad. So how can you resolve being steadily over 60%? You can scale up your cluster, or you can reduce incoming metric load; those are the two primary ways we've found. Obviously, if you're in a more read-heavy workload, you may need to do something like reduce the amount of reads that are happening.
B: One thing I'll mention here: the new version of M3 has all these nice limits for reading and writing, and they're a really great additional way to put limits on the memory used. We've already set those, so that's not how we resolve these issues, but if you haven't set the limits, that would be another way to try to reduce the amount of memory.
B: I've included the queries here; you obviously don't need to memorize them, I've just put them in as a reference for the way we look at these things. So we look at this particular metric, filtered for our pods. We also need to alert on disk space. Like I mentioned, this is a problem for us just because, as things grow, the cluster can get bigger and bigger, and your disks can fill up. We use predict_linear to watch how full the disks are getting.
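The slides aren't reproduced in this transcript, so as a stand-in, here is a hedged sketch of alerts in the same spirit; the exact metric names, label filters, mountpoint, and thresholds are illustrative, not Databricks' actual rules:

```yaml
# Hedged sketch of the two alerts described above. Metric names, label
# filters, and thresholds are illustrative.
groups:
  - name: m3db-capacity
    rules:
      - alert: M3DBMemoryHigh
        # Working-set memory over 60% of the container limit, sustained.
        expr: |
          container_memory_working_set_bytes{pod=~"m3db.*"}
            / container_spec_memory_limit_bytes{pod=~"m3db.*"} > 0.60
        for: 30m
      - alert: M3DBDiskFillingUp
        # predict_linear extrapolates the last day of usage two weeks out,
        # alerting early so there is time to scale up.
        expr: |
          predict_linear(node_filesystem_avail_bytes{mountpoint="/var/lib/m3db"}[1d],
                         14 * 24 * 3600) < 0
```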
B: Disk-space usage is actually very easy to predict, so the prediction is pretty accurate, and we alert very early, mostly just to give ourselves lots of time to react; there are always other things going on.
B: Sometimes it takes a little bit of time to get to it, and we've also found that scaling up can take a significant amount of time in the really big regions, so it's nice to give ourselves enough time to deal with this. And again, ways you can fix running out of disk space: scaling up the cluster, like I said; reducing retention, which obviously frees up some disk space; or reducing the incoming metric load. And again, here's the query.
B: Like I mentioned, cluster scale-up can be slow. There has been a good amount of work on improving this bootstrapping time in the newer versions of M3.
B: We are a little bit behind on the update schedule, so we're hoping to see some improvements in our cluster scale-up time as we upgrade to the newer versions of M3. But I would encourage you, if you're operating M3, to do some testing around how long this cluster scale-up takes, so that you can set your limits, and how far in the future you need to alert for these kinds of things, so that you know how to react to them. Cool.
B: So those were probably the most important things we need to alert on; I'll get to some of the other, smaller things in a little bit. But I also wanted to give a little bit of general advice that we've accrued over our time operating M3. One thing is: try not to add a lot of custom annotations, labels, and configs on top of the deployments.
B: Let the operator just do its thing. Like I mentioned, do observe your query rates and set limits: look at how your cluster is being used in a good state, and try to set limits so that you can prevent a bad state from occurring if a giant query comes in, or things like that.

B: One thing that I think we waited much too long to do at Databricks was to have a really good testing environment. We rolled this out in all of our clusters and it was working pretty well, but a monitoring system is something that everybody relies on all the time, which means that your dev clusters are where other engineers do their development.
B: So for us, our dev clusters are actually quite important to have good monitoring on, because people care about observing how their test clusters are running. So we needed sort of a dev-dev, I guess, an M3 dev cluster, which we now have; it took us too long, but it was really important.
B: I think it's important to have a place where you can quickly iterate on rolling out new versions and testing load. It's important to be able to just throw away the data there, so that if you're testing some stuff out and it doesn't work, you're not scared of ruining your data. And I think it's also important to try to have that at scale, and this is non-trivial, with load generation and so on, because if it's truly dev-dev, you're not going to have a lot of stuff running there naturally.
B: So there is some work to doing that, but I think it's very valuable to have, so that you can be aware of how your production clusters are going to behave without testing it in production, because only testing high load in production is not a recipe for great success. I would also encourage you to have a look at some of the M3 dashboards that are out there and learn what these metrics mean; it can be really helpful.
B: As Rob mentioned, M3 has a ton of features, and as a metrics system it also has a lot of its own internal metrics. The dashboard I've linked here is the one the M3 web page mentions. I would say that this one linked here, from grafana.com, is somewhat developer-focused.
B: I think it's built to help people who are working on M3 understand what's happening and debug issues, and it's useful for that, but I would suggest looking through it, understanding which of the metrics are more useful from an operational perspective, and making a dashboard with your own key metrics. I'm not going to cover all of it; for future reference, this is what one of our internal dashboards looks like. We look at a lot of the more high-level things that show how CPU is doing and how memory is doing, and you can see in that memory panel how it goes up and down a little, but we try to keep it at about the 50 to 60 percent level of what's available. I'm not going to cover all the other stuff that's on here, but basically these are some of the things we have found really useful to monitor for understanding what's happening, from a more operational perspective rather than the really internal perspective of a developer working on M3. Cool; so, a few other things that we alert on that are worth looking at.
B: We do try to filter the incoming metrics and prevent them from coming in late, but I will say that although the Grafana agent has been pretty good for us, one problem it does have is a tendency to sometimes try to write old data, and it's hard to get it to not do that. So you can sometimes see spikes in this from that, and then you need to go kick the agent to make it stop. We look at the rates of both write errors and fetch errors.
B: These are good ones to be watching, mostly because they represent the user perspective. They say: as a user of M3 trying to do something like write a metric in, or trying to do something like issue a query, am I seeing errors? There can be all kinds of other errors happening under the covers, but if these two metrics are steady, then from a user perspective you're kind of okay; you're meeting your SLO. So we monitor those,
B: and they issue some high-priority alerts: if you're getting write errors, something's bad, you're not able to write metrics into the cluster; if you're getting fetch errors, something's bad, you're not able to query the cluster, and your users will be seeing issues.
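A hedged sketch of such user-facing error alerts is below. The metric names here are placeholders, not confirmed M3 metric names; substitute the error counters your coordinators actually expose (check their /metrics endpoint):

```yaml
# Hedged sketch of write/fetch error-rate alerts. Metric names are
# placeholders -- replace with the counters your coordinators expose.
groups:
  - name: m3-user-facing-errors
    rules:
      - alert: M3WriteErrors
        expr: rate(coordinator_write_errors_total[5m]) > 0   # placeholder metric
        for: 5m
        labels:
          severity: page
      - alert: M3FetchErrors
        expr: rate(coordinator_fetch_errors_total[5m]) > 0   # placeholder metric
        for: 5m
        labels:
          severity: page
```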
B: Another one that we monitor, and I wanted to mention this one because for us it can be a really big issue, though it may not be depending on your deployment: we look at how many out-of-order samples are being written. The reason we do this is that if you are double-scraping pods or services, that can cause all kinds of craziness in your metrics, because values can bounce around, and the counter semantics of Prometheus make it think that crazy stuff is happening; and because we operate a largely pull-based architecture,
B: this can cause a lot of false alerts for our customers. It is a little bit of a tricky one to monitor, because some amount of out-of-order arrival is expected and doesn't mean there's a problem. So I put an X here, because you'll probably need to look at how it looks in your cluster.
B: You'll want to inhibit this during node startup, but in normal operation you should be able to get a good baseline for what your out-of-order write rate is, and then if that spikes, it can be indicative that you've somehow messed up the configuration of your scraping. So that's another one that has bitten us. Great.
B: So, talking a little bit about upgrades and updates: I think Rob mentioned that M3 really focuses on having good forward and backward compatibility, and we have experienced that. We are not scared of doing updates to new versions; we have not seen really any issues.
B: I only mention this because it's literally the only issue we've run into: there was a tiny query-evaluation regression, where I think they changed something in the query engine; it was not a big deal and was fixed relatively quickly. We've been running M3 for a while, through a lot of the early pre-1.0 releases, so there was a relatively large number of updates we've gone through, and it's been very smooth for us.
B: So that's been great. We are now slowly moving to 1.0 throughout our clusters, and that has also been very smooth. Rob kind of mentioned that there were some config changes and made it sound like a big deal, but honestly, there weren't that many. We have a pretty complex system for programmatically generating our configs, so it was pretty easy for us; I suppose if you have manual configs it might be more work, but for us it really wasn't a big deal to update to 1.0.
B: One thing to be aware of: there were some API changes. This is probably only relevant if you have built up institutional knowledge around M3 already, but we had to go through a lot of our runbooks and change some of the API paths we tell people to hit for things like changing retention or updating placements. That's sort of advanced usage, so as a normal user of M3 you probably won't run into any of that.
B: I mentioned that we manage all of our upgrades and updates via Spinnaker and Jenkins. The one sort of minor issue we have had here is that, up until now, there has been a lack of fully self-driving updates. We're very bought into being a Kubernetes shop, and we count on our pipelines being able to do updates by just calling kubectl apply on a new template.
B: This was not fully working until recently, but as Rob mentioned, with the 0.13 version of the operator this should now be available. That's a relatively recent release; we have not had time to fully test it, but we do believe we'll be able to move to it in the near future.
B: One thing I do want to mention: if you're not doing fully self-driving updates, where you just apply and then have to do some kind of manual intervention to get the operator to do the update, you have to be vigilant that the configs, which can be updated by just calling kubectl apply, stay in sync with the M3DB version.
B: We've had some issues where our pipeline goes and deploys a new version specification and a new config, but the old version is left running, and then we restart it and it says: I don't understand this new config. So that's one thing to be careful of. And then one suggestion I will make, which is probably generally good, but which we found really important during the upgrade process: have a readiness check for your coordinators.
B: Try to make sure that your coordinators, as they come up, are able to talk to M3DB. I'm not going to cover update strategies for Kubernetes, but if you're familiar with Kubernetes, have a rolling update strategy that doesn't restart too many coordinators at once. What we found is that these coordinators are super lightweight, which is great, and we run a whole bunch of them.
B: But that means that if they all restart quickly, which is what happens if you don't have a readiness check, there are so many services restarting that Kubernetes has a little bit of trouble dealing with the churn, and it can leave a number of the coordinators unable to connect to M3DB, just because it hasn't had time to update all the endpoints and make the various service updates it needs to make for that.
B: And then you have a little bit of downtime, and because the Kubernetes control loop is not super fast, it can actually take a while before it churns through and updates everything. So having this readiness check, with a connect-consistency check on the coordinator, will enforce that the coordinators restart slowly and re-establish their connections before the next one goes down, and you can have zero-downtime updates. Cool.
B: So that's what I wanted to mention about upgrades and updates. Then just a little color about metric spikes: in any high-volume system that you operate, you're going to have to have a way to deal with spikes. An example of something we sometimes have: somebody goes and adds a new label to a metric, and it has an absurdly high cardinality, so suddenly your number of time series is going up by a huge amount. This happens.
B: You don't control all the services that get deployed; they can do crazy things. So a great way to deal with this is to be able to identify where it's coming from: have some metrics that you can look at that cover this. The Grafana agents can expose this, and we have our own metrics inside our other systems that push directly into M3. So always have some metric that you can use to see
B: who is producing all of these metrics, and then also be able to cut that source off easily. We have good ways of quickly blacklisting things, because that is extremely preferable to OOMing your cluster. I'm much happier to be able to go to an internal customer and say, "hey, your service is currently not getting any metrics because you did something silly," rather than having to send an email out to the entire company saying, "hey, our metrics system is down in production right now."
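One common way to implement that kind of blacklisting, if your scrape agents speak Prometheus configuration, is a metric relabel rule that drops the offending series at ingestion time; the job and metric names here are invented for illustration:

```yaml
# Hedged sketch: drop a misbehaving metric at scrape time via relabeling.
# Job name and metric regex are invented.
scrape_configs:
  - job_name: noisy-service
    static_configs:
      - targets: ['noisy-service:9090']
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'runaway_metric_with_huge_cardinality.*'
        action: drop              # the series never reaches storage
```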
B: It's never a nice message to send, so that's something I would recommend. Just a little bit about how we do capacity: obviously it's going to depend a little bit on your workload, but we found that about one M3DB replica node per 50,000 incoming time series works pretty well for us. But we are very write-heavy, so that may be different depending on how your cluster looks.
B: I saw that there was a question about how many nodes we use on that big cluster: we're currently running about 18 nodes per replica.
B: You can do the math for how many we have all together (with three replicas, that's roughly 54 nodes), and we run about 50 write coordinators in each of two different deployments, so about 100. Like I said, we're just happy to run lots and lots of these for stability; we probably could get away with fewer, but this gives us the buffer that we need, and that's okay. These are just numbers that we've arrived at organically.
B: So I don't have a formula like "this many nodes can do this many things"; it would be nice if I did, but I think it's partially just testing, because it will depend on your workload. Cool. I did want to talk a little bit about some of the things we want to do in the future. I would say at this point we have fully completed our migration away from Prometheus; we are removing Prometheus everywhere, since we don't need it anymore,
B: aside from the lightweight monitoring ones that Y mentioned. And we're now getting to the point where we can look towards the future and what nice new things we can do. Previously we were in this bad state where Prometheus kept crashing and everything was unhappy, and we've now had a very successful, but somewhat long,
B: migration, so we now have a stable set of monitoring clusters running, and we can start to look forward and say: hey, now that we have this nifty new M3 thing, what can we do? Some of the things we're looking to do in the future are to start downsampling our older metrics. This is something where we expect to see significant savings in disk space, and probably also in query performance over older data.
B: We run with a 30-second scrape interval, and it's a little bit unreasonable to expect that when you query data that's 60 days old, you'll get it at 30-second resolution. We will also be looking at using different namespaces for metrics that have different retention requirements. Like I mentioned, we run a lot of test things in our dev clusters; it's a bit unreasonable to expect that these test clusters will have really long retention on their metrics.
B: A really nice feature that M3 has is that you can put those metrics in a different namespace and then have a shorter retention on that namespace. So your test clusters get five days or whatever of retention, and that's great, and then for things that matter more, you can have longer retention. Again, this is to deal with needing less disk space and putting lower load on the system.
B: And then another thing we're excited about that M3 has enabled: because M3 supports pushing metrics into it, it allows us to enable use cases where you're in a mode that's actually very hard to scrape. There are things like Databricks jobs, which we leverage a lot, where Spark clusters are running and processing lots of data. Our data science teams really want to track those jobs, and historically it's been very difficult, because those jobs are running somewhere else in an isolated cluster, and how would Prometheus reach over there and scrape the metrics out of it? So being able to build a little proxy so that they can push metrics directly through into M3 is really great. Another one our dev tools team is looking at: they want to monitor things from developer laptops, to know, hey,
How
long
is
it
taking
to
do
certain
developer
operations,
so
they
can
improve
the
developer
experience
at
databricks
and
they
want
to
also
be
able
to
put
metrics
in,
because,
obviously,
your
scraping
system
is
not
going
to
be
able
to
to
reach
out
and
scrape
metrics
from
your
lap
from
your
developers
laptops.
So
that's
another.
These
are
another
feature
that
m3
has
that's
kind
of
like
a
new
thing
that
we
can
do.
There
obviously
are
ways
to
push
metrics
into
a
proxy
for
for
prometheus,
but
we
found
them.
We
we
do
operate.
B: but we did operate some of those, and we found them to be less than reliable, just because of the nature of needing to cache the data and then re-scrape it. It's much less reliable to do that than to be able to just directly push the metric in. Caching is just a difficult problem with metrics, so I'll leave it at that; we spent a lot of time trying to fix various metric-caching issues anyway.
B: So, I want to leave a little bit of time for questions, so I'll just conclude and then we'll leave about 10 minutes. I would say, in conclusion, this has been a very successful migration for us; we're very happy with where we've landed. Overall, the community has been extremely helpful. We've worked a lot with the people at Chronosphere, who have been extremely helpful.
B: We've worked with the open community as well, and it's been a great experience. There are a lot of great new things on the horizon for us, and we're really excited to be shifting gears into making the overall metrics experience at Databricks better from a feature perspective,
B: not just from a stability perspective. So, cool, that's what I wanted to say, and I think we can maybe shift to questions. I see a lot of them have been marked as answered, but okay, let's see. I see: why didn't Cortex suit the high churn rate? We didn't dive very deeply into this, other than that we did actually talk to
B: the main person who started Cortex at Grafana, whose name is escaping me right now, and he did kind of identify this problem that Cortex has with ingesting a lot of short-lived metrics. And I see that Y would like to answer this question live, but... oh.
B: Oh, okay, thanks. So that is something we ran into, and we actually did try quite hard to deal with it. Next: how many production issues are we facing daily on over 50 clusters? If I exclude disk-space issues, I would say we average maybe one or two actual production issues a week across more than 50 clusters. So, like I said, it's really very stable at this point.
B: I would say probably the main thing we are focusing on right now is getting a handle on the increasing load. Databricks is scaling very quickly in terms of how many people use the platform, so our metric load just continues to go up, and we get a lot of alerts that say, hey, you're going to be running out of disk space, and then we have to decide: are we going to reduce retention here?
B: Are we going to try to scale up the cluster? Are we going to try to figure out which are the worst offenders and make them reduce their usage? Those are sort of a separate set of issues, mostly because I wouldn't really blame them on M3; that would be an issue no matter what system we were using.
B: So in terms of M3 issues, I would say on the order of one or two a week, and it's usually something like high memory usage. We also end up,
B: you know, with 50 clusters, inherently hitting some underlying infrastructure issues, so a lot of the time it'll be some cluster where your node just can't schedule, because there's some underlying issue with the cloud platform, or something like that. But, like I said, not a ton of issues overall, so hopefully that answers that one. How do we communicate with the community, is it through the Slack channel? Yes. We have not taken up the Chronosphere office hours; we may start doing that.
B: But yes, currently our communication with the community has been through the Slack channel, and through filing GitHub issues. I mentioned the operator self-driving thing; that's an issue we reported, there were some bugs in there, and they've now fixed those, so that's been good. So I would say our primary ways of communicating are the Slack channel, where people are pretty responsive, and GitHub for code issues. Next question: what node size are we using for the large cluster?
B: And, let's see, I saw that somebody else asked if we were planning to open-source our rule-manager engine, and the answer is yes, we do want to at some point in the future. We don't have a roadmap for doing that right at the moment, but as soon as we're able to prioritize it, I think it is something we would definitely like to prioritize and get open-sourced.
B: We are a company that likes to open-source things if possible. So, yeah, let's see if there are any other questions; I don't know if I should go through them all, and I see Y has answered a lot of these questions by typing, so thanks, Y. Are we using StatefulSet disks? Yes, we use a StatefulSet, and we use SSDs under the covers. Like I said, we
B: run across multiple clouds, and one of the nice things about that is that Kubernetes can hide some of those details for us. We just have persistent volume claims that let us get disks for the clusters, and then they magically have disks, and it doesn't matter which cloud you're running in. Cool. Well, I don't know if there are any other questions; otherwise, I think...
C: Thank you both. I'll now hand it back to Gibbs, but yeah, that was, I mean, extreme deep dives and really valuable, I think, for everyone to hear about. So thanks for putting so much work into it; it's really great to see under the covers, for the other folks out there running M3, obviously. So I just wanted to say a big thank-you for all the detailed materials here.