Description
Presented by: Yaarit Hatuka | IBM
To increase product observability and robustness, Ceph’s telemetry module allows users to automatically report anonymized data about their clusters. Ceph’s telemetry backend runs tools that analyze this data to help developers understand how Ceph is used and what problems users may be experiencing. In this session, we will overview the various aspects of Ceph’s upstream telemetry and its benefits for users, and explore how telemetry can be deployed independently as a tool for fleet observability.
A: Can you hear me well? Awesome. Hey everyone. Today I will talk about Ceph's telemetry project. We'll have an overview of the project, talk about the motivation behind it and its architecture, have a (sort of) dashboard demo, hear some success stories, and talk briefly about how to deploy your own telemetry service.
A: It's important to emphasize that by default, telemetry reporting is off. Users need to explicitly opt in by agreeing to the CDLA license, and they can do that either with a CLI command, ceph telemetry on, or through the dashboard wizard. Users can also see a preview of a telemetry report with ceph telemetry show-all or, since Quincy, ceph telemetry preview-all.
A: The telemetry report is compiled daily from several channels, each with a different type of information, and once a user is opted in to telemetry, the channels can be turned on or off individually. Currently there are five channels in telemetry. The first one is the basic channel, which is on by default and collects and reports basic information about the cluster, like Ceph and kernel versions, the cluster size, the number of daemons, and so on.
A: We have another channel, the ident channel, which is off by default and allows users to identify themselves. Even though everything is anonymized, some users choose to identify themselves, and this channel allows them to provide their name, email, and organization. In Quincy we added the perf channel, which is also off by default and reports various performance counters from the cluster.
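To make the channel structure concrete, here is a rough sketch of what a compiled report could look like. The field names below are illustrative only and are not the exact telemetry schema.

    # Illustrative only: field names are examples, not the exact telemetry report schema.
    example_report = {
        "report_version": 2,          # bumped when the data model changes
        "report_id": "9a3c...",       # random UUID specific to telemetry, never the fsid
        "channels": ["basic", "crash", "device"],
        "basic": {
            "ceph_version": "17.2.5",
            "kernel_versions": ["5.14.0"],
            "total_capacity_bytes": 1_200_000_000_000_000,
            "daemon_counts": {"mon": 3, "mgr": 2, "osd": 120, "mds": 2},
        },
        "crash": [],   # one entry per crash event (see the crash section later)
        "device": [],  # anonymized drive health metrics
        "ident": {},   # only populated if the user opts in: name, email, organization
    }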
A: It is very important for us to emphasize that we care deeply about the privacy of our users, and we take several steps to ensure it. Whenever we introduce a new data model version, we require users to re-opt in on a Ceph upgrade; otherwise they keep sending the collection version they are currently opted in to.
A: Reports do not contain any sensitive or identifying data like pool or host names, or object names or contents. We also anonymize the report: each cluster is assigned a random UUID which is specific to telemetry. The fsid is not reported in telemetry, we also redact the disk serial numbers, and we never, ever store IP addresses.
A: We also want to learn about the upgrade cadence and the version adoption rate from users, and the device channel helps us learn which hard disk and SSD models users are deploying. In the long run we are building a really great open data set of device health metrics, aimed at research into more accurate drive failure prediction models.
A: With the crash data we can learn about new bugs as soon as possible, and it also helps us prioritize issues by focusing on the most common bugs: we can see how many clusters are experiencing a specific issue. It also allows us to discover trends across versions for specific issues, and once we have a solution for an issue that we see in the wild, we can actually verify that the solution works by identifying regressions; we'll talk about that shortly.
A: For users, the telemetry data can be used to validate their own installations by looking at what is common, and by opting in to the device channel they're contributing to an open data set of drive health metrics, which is pioneering; there's only one company in the market that does that, and we want to create an even better open data set in order to allow research on better failure prediction models, which eventually will lead to preemptive mitigation of failing devices and reduced downtime.
A: Another great win for users is that they don't need to actively report issues or open tickets for each crash. The process is automated: all the crashes that happen in the cluster are simply reported to us, and users can also use the open data set of crashes to better understand an issue. They can view it on our bug tracking system, tracker.ceph.com, and learn whether the issue they're experiencing was fixed in a specific version.
A: Let's briefly talk about the architecture. Every day the Ceph manager daemon collects data from all of the daemons in the cluster (the data collection mechanism is built into Ceph) and then compiles the reports: one report for the cluster and other reports for the device health metrics. It sends them home to our backend at telemetry.ceph.com, where we have an Apache server, a PostgreSQL database, and a couple of Grafana instances, one for the public dashboard and one for the private dashboard.
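As a rough sketch of the "phone home" step described above (the endpoint URL, HTTP method, and helper name are assumptions for illustration, not the module's actual code):

    import json
    import urllib.request

    TELEMETRY_ENDPOINT = "https://telemetry.ceph.com/report"  # illustrative; the module's endpoint is configurable

    def send_daily_report(report: dict, endpoint: str = TELEMETRY_ENDPOINT) -> int:
        """Upload one compiled report as JSON over HTTPS and return the HTTP status."""
        data = json.dumps(report).encode("utf-8")
        req = urllib.request.Request(
            endpoint,
            data=data,
            headers={"Content-Type": "application/json"},
            method="PUT",  # assumption: an idempotent upload; the real module may differ
        )
        with urllib.request.urlopen(req) as resp:
            return resp.status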
A: We're going to take a look at the demo. Sorry about that, it's a screenshot; I did not know that I would not be able to connect my laptop. But you can go to telemetry-public.ceph.com and you'll see the public dashboard.
A: We can also see a breakdown of the versions for the clusters that report. At the top of the page you can select a breakdown by major or minor version; here we have a breakdown for Quincy, so we can see the adoption rate for the versions and how users are migrating to the most recent release, 17.2.5.
A: It's also useful to understand the volume of capacity that's coming from the different clusters, so not just the breakdown by versions and daemon count, but where most of the volume is coming from. Currently, for the clusters that report telemetry, most of it is coming from Pacific.
A: We can also learn about the capacity that the reporting clusters are going to need. This is a dynamic dashboard; it currently exists on the private dashboards. We can look forward to the next 90 days and get a list of all the clusters that will reach an 80 percent capacity threshold, and we can also look forward to the next year.
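The forecast behind that view can be approximated with a simple linear extrapolation; the sketch below only illustrates the idea and is not the query the dashboard actually runs:

    from typing import Optional

    def days_until_threshold(used_bytes: float, total_bytes: float,
                             daily_growth_bytes: float, threshold: float = 0.80) -> Optional[float]:
        """Estimate the days until a cluster crosses the capacity threshold,
        assuming growth stays roughly linear. Returns None if usage is not growing."""
        if daily_growth_bytes <= 0:
            return None
        remaining = threshold * total_bytes - used_bytes
        return max(remaining, 0.0) / daily_growth_bytes

    # Example: 600 TB used out of 1 PB, growing by 1 TB/day -> about 200 days to reach 80%.
    print(days_until_threshold(600e12, 1e15, 1e12))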
A: We have this graph to see the total extra capacity needed over the next year for all the clusters that report, and we can choose the threshold to look at. With the device data we can see the breakdown by the popular vendors that are out there in the wild: currently Seagate is the most popular vendor whose drives users are deploying, and right after that, I believe, HGST and Western Digital.
A: We can also see a breakdown of devices by interface, for either flash or spinning media. NVMe, we can see, is about as popular as SATA for SSDs, and SAS is a lot more popular than SATA for hard drives. Altogether we can see devices by type: mostly hard disks are reported, with SSDs and NVMe drives further down.
A: So that was a breakdown of all devices by type, across all vendors; we can also get a breakdown for a specific vendor by choosing it from the pull-down at the top of the dashboard page. Now I want to talk a little about the crashes. Let's look at what a crash report contains: each raw crash report that we receive contains the crash ID, which is a combination of a timestamp and a random UUID.
A: It also contains the daemon type and name, the Ceph version of that daemon, the backtrace, some information about the OS, and the assert message and condition if applicable. We can also look at data from the basic channel, in case the user is opted in to that channel as well, to see a bigger picture of the cluster that is reporting that specific crash.
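As a rough illustration of the fields just listed (the names below are approximate, not the exact crash report schema):

    # Approximate shape of one anonymized raw crash event; field names are illustrative.
    example_crash = {
        "crash_id": "2023-01-12T03:14:15.926535Z_4b6f0c2e-...",  # timestamp plus a random UUID
        "daemon": "osd",                       # daemon type and (anonymized) name
        "ceph_version": "17.2.5",
        "backtrace": ["ceph_abort()", "OSD::do_op(...)"],
        "os_name": "CentOS Stream",
        "kernel_version": "5.14.0",
        "assert_func": "void OSD::do_op(...)",     # present only for assertion failures
        "assert_condition": "op->may_write()",
    }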
A: We receive a lot of raw crash reports, and we need to find patterns among them.
A: The crash processor supports multiple generations, or recipes, of signatures, which allows for backward compatibility. In case we want to enhance the recipe and create even better grouping (I know that there are some duplicates in the reports right now), we can keep enhancing the recipe and still have backward compatibility. The processor then populates the database, so we have a database of signatures mapped to raw crashes. But just having the data itself is not enough; we want to turn it into action.
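A minimal sketch of the grouping idea: each recipe (generation) turns a raw crash into a stable signature, typically by hashing a sanitized view of the stack, and newer recipes can group more aggressively while older signatures stay queryable. This is an illustration, not the actual crash processor:

    import hashlib
    import re
    from typing import Callable

    def _sanitize(frame: str) -> str:
        # Drop addresses and other volatile tokens so identical stacks hash identically.
        return re.sub(r"0x[0-9a-f]+", "ADDR", frame)

    def signature_v1(crash: dict) -> str:
        frames = [_sanitize(f) for f in crash.get("backtrace", [])]
        return hashlib.sha256("\n".join(frames).encode()).hexdigest()

    def signature_v2(crash: dict) -> str:
        # A later generation: also fold in the assert condition to separate unrelated stacks.
        frames = [_sanitize(f) for f in crash.get("backtrace", [])]
        key = "\n".join(frames + [crash.get("assert_condition", "")])
        return hashlib.sha256(key.encode()).hexdigest()

    # Keeping every generation around is what gives backward compatibility:
    # old raw crashes can always be re-grouped under any recipe.
    RECIPES: dict[int, Callable[[dict], str]] = {1: signature_v1, 2: signature_v2}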
A: This is where the Redmine telemetry bot comes into place. The way it behaves: it queries the PostgreSQL database that we have for the most recent crash signatures, and then it maps each signature to a Redmine issue. The way it does that is it searches Redmine for these signatures and updates an existing issue in case it finds one.
A: Otherwise, it creates a new issue with the data of this signature, which contains the affected versions (the versions that the raw crashes of this signature were reported from), the sanitized and the raw backtrace, and a link to a dynamic dashboard.
A: What's nice about the bot is that it also identifies regressions. Say we have a tracker issue that is mapped to a signature, we solved that bug, and now we receive new reports, new crash events, from a newer Ceph version that basically should not contain this issue. The bot will open a new ticket in Redmine, refer to the original fixed one, and leave a heads-up for developers.
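Roughly, the bot's behaviour can be sketched as below. The redmine client object and its methods are hypothetical placeholders; the real bot talks to PostgreSQL and the Redmine REST API:

    # Hypothetical sketch of the telemetry bot's flow; the redmine client and its
    # methods are placeholders, not a real Redmine API wrapper.
    def process_signature(sig: dict, redmine, fixed_in_version: dict) -> None:
        """sig: {'id': str, 'affected_versions': list[str], 'dashboard_url': str}"""
        description = (
            f"Affected versions: {', '.join(sig['affected_versions'])}\n"
            f"Dashboard: {sig['dashboard_url']}\n"
        )
        issue = redmine.find_issue_for_signature(sig["id"])      # placeholder call
        if issue is None:
            redmine.create_issue(subject=f"crash signature {sig['id'][:12]}",
                                 description=description)
            return

        redmine.update_issue(issue, notes=description)

        # Regression check: the issue was marked fixed in some version, yet new crash
        # events arrive from a newer version, so open a heads-up ticket for developers.
        fixed = fixed_in_version.get(issue.id)                    # e.g. "16.2.0"
        newer = [v for v in sig["affected_versions"] if v > fixed] if fixed else []
        # (naive string comparison; a real implementation would parse the versions)
        if newer:
            redmine.create_issue(
                subject=f"possible regression of #{issue.id}",
                description=f"New crashes from {', '.join(newer)} after the fix in {fixed}; "
                            f"see issue #{issue.id}.")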
A: We can see the number of affected clusters for each signature and learn about the crash status. We can also drill down to cluster information, so you can learn about the capacity of the cluster and the current version that the cluster is running, and we can also see the trends across versions.
A: This is another static demo, excuse me for that. Here we can see that there are a lot of fields we can search by, matching what we just mentioned earlier. In this case we just wanted to see all the crashes for the most recent Quincy version, 17.2.5, and we can see the number of crashes in a time frame. Here the time frame is very big, so it is similar to the total number of occurrences that each signature has, and we can learn how many clusters are seeing this issue.
A: If we click on one of the signatures, we see a signature page which tells us when the issue was first and last reported. We can learn about the count of raw crash events that were reported and how many clusters were affected.
A: Going back to that page, we have some more information on that specific signature page, such as the assert function and condition, and then we have the daily occurrences; we can zoom in and learn some more. We have a list of the affected clusters with lots of information about each of them, and then we have the list of raw crashes.
A: Yeah, you can just see it on the tracker, or if anyone's interested I can share the links. Just a side note: not all crashes are Ceph bugs. There can be hardware and OS issues, environment, resource, or configuration issues, and there can also be issues in other dependencies. So not all crashes are actually bugs.
A: All right, let's talk about some cool stuff we saw. It is essential to monitor crashes of new releases, like we just saw with 17.2.5, and here we have examples of bug fixes for crashes that were reported through telemetry: they were opened in the tracker, picked up by the teams, and there are pull requests and backports that fix those bugs. This is pretty awesome; it is all automated, and the only manual thing here is the fix by our engineers.
A: This is an example; oh, actually, I think I did have another screenshot for that. So that's an example of a tracker issue that was opened by the telemetry bot. You can see that it populates the tracker with all the signatures that are relevant for this issue and all the affected versions, then it has an example of a raw crash report and a link to the dynamic dashboard where the engineers can look and get some more details, and here there's a pull request.
A: That pull request had the issue fixed. Sometimes users report issues manually, and the crash dashboard helps us understand when the issue was first introduced. In this example, a user was reporting an issue that happened in Quincy, but with the telemetry dashboard we could search for that assert condition and function, and we realized that it started as early as Pacific 16.2.5.
A: And sometimes users themselves respond to telemetry crash reports: the telemetry bot will open a ticket, and users chime in and say, hey, I also see this issue, I can provide logs or any other information that you need. That's very helpful, because we cannot always reproduce issues in our own lab. And here is an example of a bug that was fixed, where the latest affected version specified in the tracker was, I think, 16.2.0, if I remember correctly.
A: But then we saw some new crash events from 16.2.5 and 16.2.9, so the telemetry crash bot opened a new issue and gave engineers a heads-up: this issue was fixed, but now we're seeing new reports from a newer Ceph version, so you might want to take a look. And like we mentioned, we have the ident channel.
A: Sometimes users want to identify themselves, so we can contact them when we need more information about crashes, additional logs, or anything like that, and I'm very happy to share that users are more than happy to help with that.
A: We also use telemetry to understand how features are used in the wild. For example, it helped us understand whether we could announce FileStore as deprecated.
A: Did I add this one? Yes, I did, okay. So we had a full dashboard with a lot of analysis of what we see in the wild for the breakdown of FileStore versus BlueStore usage, and we saw that it was pretty safe to announce that FileStore is being deprecated.
A: We also wanted to know whether the Clay erasure code plugin is being used in the field, so we could check that with telemetry as well, and we wanted to know whether there were crashes related to this plugin; thankfully there were not. Again, this increases observability and lets us understand how things are being used.
A: The dashboard team also needed some help with tuning; they wanted to understand the magnitude of RBD images that are deployed in the wild, and this information was essential for them as well. We have a dashboard for that, but I did not have a screenshot, sorry. All right, two more slides. Let's say you want to deploy your own telemetry service; the reason for that would be that sometimes you might have constraints.
A: The cluster can be air-gapped, or you want to have some more observability in your own fleet. You need to do two things: bring up a telemetry server and configure the telemetry module. In order to bring up the server, you'll need a PostgreSQL database, the web server, and Grafana, and we have a detailed install guide in our repo.
A: On the telemetry module side, you'll need to change the endpoints, because right now they phone home to upstream Ceph; you'll need to opt in, because telemetry is never opted in by default; and you'll need to enable the channels that you want. You can do that either with the orchestrator, like cephadm, or you can change the module defaults when you build Ceph, or you can do it at runtime: if you have a running cluster, you can just configure the manager module.
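For the server side, the install guide in the telemetry repo covers the real setup (PostgreSQL, a web server, Grafana). Purely as an illustration of the shape of the receiving end, a toy endpoint that accepts report uploads could look like this; it is not the actual telemetry server:

    # Toy stand-in for a self-hosted telemetry endpoint: accept JSON reports over HTTP
    # and append them to a file. The real server stores reports in PostgreSQL behind a
    # proper web server; see the install guide in the ceph telemetry repo.
    import json
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class ReportHandler(BaseHTTPRequestHandler):
        def do_PUT(self):
            length = int(self.headers.get("Content-Length", 0))
            report = json.loads(self.rfile.read(length))
            with open("reports.jsonl", "a") as f:
                f.write(json.dumps(report) + "\n")
            self.send_response(200)
            self.end_headers()

        do_POST = do_PUT  # accept either method; the real endpoint defines its own contract

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 8080), ReportHandler).serve_forever()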
A: Please contribute, please join us, and opt in to telemetry. I know we are over time, but if you have any questions I'd be happy to take them.
B: [question inaudible]

A: Yeah, great question. We do get I/O errors that are reported with the crash metadata, so we just filter those out and do not open tracker issues for them.
C: A follow-up question on that, regarding the device health metrics that are being captured: there is also device failure prediction, which you can set to off or local. How effective has local proven to be? We have tried turning it on on certain clusters, but although there are so many crashes, none of that was actually really helpful; I mean, we haven't seen any alerts from the Ceph side.
C: Is there an effort to grow that local database, to understand more predictive failure scenarios?
A: Yes. Predicting a drive failure is not simple. The different vendors have specific metrics, and there's no one-size-fits-all prediction model. Of course, there are different types of drives, like flash and spinning, and we need many models to cater to all of them.
A: We do not have information about how well the disk failure prediction module is functioning in the wild, but we are working on improving the models; we're collaborating with vendors and working on collecting some more vendor-specific metrics in order to improve that.
A: So first, we're in the process of analyzing the metrics in our back end. And about overhead, you mean in generating and compiling the reports and collecting the data? From our experience it can take a short while to compile the report, but it happens daily and it should not interrupt the cluster's operation, and we're talking about really, really big clusters.
E: Thanks. We had a lot of conversations on the OSD side about how frequently we gathered some of that data and about trying to make sure it wasn't impactful; it was actually a really big concern. So I think, hopefully, we're in good shape, right? Yeah, right.
F: The graphs were really great. I noticed that one of the graphs showed the minor versions and how, say, 17.2.0 slowly shrinks in terms of usage as we go forward in time. Does that include clusters that have stopped reporting, or does it only include clusters that are reporting at that given time period?
F: 17.2.0, 17.2.1, 17.2.2, and we see, as we go forward in time, most clusters are on the latest version of version 17, but there are still some clusters on 17.2.0. So it brings to mind the question: are those clusters the ones that originally reported early on in the 17 life cycle and just stopped reporting because they went away for whatever reason, or are they still reporting that they're using that version?
A: I believe, and I don't have the exact breakdown for this specific version, but we could look at that, I believe that these are the original clusters that were reporting all along. I find it hard to believe that these are new clusters that started reporting to telemetry with this version, but it can happen. So we don't have a differentiation on the dashboard to see when they joined and started to report, or whether they upgraded to a different version.
A: But it's a good question; I'll be happy to take a look at that.
F: Kind of related to that: it's not in the data, but it would be interesting to see cluster retention. Do we get reports from a cluster, how often do we get continual reports, or do they just disappear? I think that has an important impact on things like drive failure prediction; you can't declare that a drive died if the cluster disappears, things like that, right?
A: So we were talking about that as well, having an offline telemetry capability; even with Mark we discussed that at some point, like having a thumb drive for air-gapped clusters, so they can just choose what and when to report. Yes, this is something that we discussed.
G: Mike, I already got authorization for one last question; it's a good one. Backblaze, and now Digitec, a Swiss electronics seller, get a lot of publicity for publishing reliability statistics and warranty rates for products. Ceph could get a lot of publicity if we made some kind of public report about the reliability, and maybe the performance, of different drive models. But I understand this can also be a bit of a delicate thing for us to publish; what are the limits of what we can do there?
A: It's a very good question. I think, for a start, we want to use this open data set in research and see what we could learn from it, because one thing we are missing is the labels: we don't know whether a drive actually failed. We can only say that the host is still active but that drive is no longer reporting, and that's a pretty good heuristic for understanding that something happened to the drive.
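That heuristic is easy to state in code; a sketch, with hypothetical inputs (last-seen timestamps per host and per device):

    from datetime import datetime, timedelta

    def probably_failed(device_last_seen: str, host_last_seen: str,
                        grace: timedelta = timedelta(days=14)) -> bool:
        """Flag a drive as probably failed or replaced when its host kept reporting
        well past the drive's last report. Timestamps are ISO-8601 strings."""
        device = datetime.fromisoformat(device_last_seen)
        host = datetime.fromisoformat(host_last_seen)
        return host - device > grace

    print(probably_failed("2023-01-01T00:00:00", "2023-03-01T00:00:00"))  # True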
A: So we're now in the process of playing with this database and getting some more insights from it. And yes, definitely, when we have some good insights we want to make them even more public, because that was the idea behind it: to have an open drive health metrics data set covering more models and vendors. I guess the next question is: when's lunch? So, yes.