From YouTube: Ceph Tech Talk: Telemetry Dashboard
Description
A presentation on Ceph Telemetry Dashboard, emphasizing crash telemetry work and use cases for Developers and Operators.
Presented by: Yarrit Hatuka
Join us monthly for Ceph Tech Talks: https://ceph.io/en/community/tech-talks/
Ceph website: https://ceph.io
Ceph blog: https://ceph.io/en/news/blog/
Contribute to Ceph: https://ceph.io/en/developers/contribute/
What is Ceph: https://ceph.io/en/discover/
A: So hey everyone, my name is Yarrit; I've been working on the telemetry project for a couple of years now, maybe even more. Let's see what we have. Today we're going to have an overview of telemetry, we'll get a better understanding of the motivation for it, I'll talk about some architecture, we'll have some dashboard demos, and we'll see some of the success stories that we've had so far.

A: So Ceph telemetry means that clusters phone home to report anonymized, non-identifying data about your installation, configuration and so on.
A: If you want to see a sample telemetry report, you can do that with a CLI command: ceph telemetry show all prior to Quincy, and in Quincy we use ceph telemetry preview all.
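For reference, a minimal sketch of the commands involved (names follow the upstream telemetry module and may differ by release; check "ceph telemetry --help" on your version):

    # Print the report of everything you are opted into:
    ceph telemetry show
    # Quincy: preview the report, including collections you have
    # not (re-)opted into yet:
    ceph telemetry preview-all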
A: The telemetry report is broken down into several channels, each with a different type of information, and once the user is opted in, telemetry channels can be turned on or off.
A: We currently have five channels. The first one is the basic channel, which has information about the Ceph and kernel versions, the cluster size, how many daemons are in the cluster and so on. This channel is on by default, again in case the user is opted into telemetry. Then we have the crash channel, which has information about where in the Ceph code the crash occurred.

A: This one is also on by default. Then the ident channel gives users the option to share their contact details, like their email and what organization they're from; this one is of course off by default and has to be explicitly turned on. And in Quincy we added the perf channel, which has all sorts of perf counters from the cluster, and this one is also off by default.
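A sketch of toggling channels from the CLI (the "enable channel" form is from the Quincy-era telemetry module; on older releases channels are plain config options):

    # List channels and whether they are enabled (Quincy):
    ceph telemetry channel ls
    # Toggle individual channels once opted in:
    ceph telemetry enable channel perf
    ceph telemetry disable channel ident
    # Pre-Quincy equivalent via config options:
    ceph config set mgr mgr/telemetry/channel_ident true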
A: I want to touch on privacy; this is really important to us. In case users are opted into telemetry and we add new data to the reports, we require users to opt in again on a Ceph upgrade. In Quincy we changed this design a little: it allows users to keep sending whatever they're currently opted into, the current data model version, but they need to re-opt in for any new deltas.
A
The
reports
do
not
contain
any
sensitive
or
identifying
data
like
pool
names,
host
names,
object,
names
or
object.
Contents.
We
really
just
care
about.
We
don't
we
don't
care
about
who
owns
the
cluster?
We
just
care
about
the
telemetry
information
in
it.
A: We also redact the disk serial ID; this is relevant for the device channel. IPs are never stored on the back end, and in order to enhance privacy we send two separate telemetry reports, one with the anonymized cluster data and the other with the anonymized device health metrics. These are sent to two different endpoints.
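The two reports can be inspected separately; a sketch (command names from the telemetry module):

    # Cluster report (anonymized cluster data):
    ceph telemetry show
    # Device report (anonymized device health metrics),
    # sent to its own endpoint:
    ceph telemetry show-device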
A: We can also learn about and discover crash trends across versions, and once we find solutions for those bugs, we can verify that they actually work by identifying regressions if they occur; this is all thanks to the crash channel. And for users: users can validate their installations by looking at what is common, what Ceph users usually deploy. They can preemptively mitigate failing devices, and by contributing their SMART data, and this is a bit of a longer-term goal that we have here, once we have more accurate device failure prediction models, we can help the user understand that a device is about to fail. That helps to reduce downtime in the cluster and to shift downtime to a maintenance window rather than have it at a peak hour.
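A sketch of the cluster-side pieces this builds on (the ceph device commands come from the mgr; the prediction command assumes a diskprediction module is enabled):

    # Devices known to the cluster and which daemons use them:
    ceph device ls
    # Raw SMART health metrics scraped for one device:
    ceph device get-health-metrics <devid>
    # Current life-expectancy prediction, if a diskprediction
    # module is enabled:
    ceph device predict-life-expectancy <devid>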
A
Another
big
motivation
for
users
is:
they
don't
need
to
actively
report
issues
or
open
tickets
for
each
crash
that
they
have
in
their
clusters
and
they
can
use
the
open
data
set
of
the
crashes
to
better
understand
an
issue.
So
if
you
see
a
specific
issue
on
your
cluster,
you
can
search
it
on
our
bug
tracking
system
and
see
if
it's
a
real
bug
and
if
it's
a
real
bug,
you
can
learn
what
version
it
is
fixed
in.
A
Now,
let's
see
what
we
have
with
the
cluster
data
so
far,
so
we
can
learn
about
breakdown
by
versions
and
to
see
the
upgrade
cadence
and
version
adopt
of
the
story
reduction
rate
that
I
mentioned
earlier,
and
we
also
have
panels
to
learn
about
capacity
density
in
in
the
clusters
that
are
reporting
in
the
wild.
A: The blue bump here is Quincy. We released it a couple of weeks ago, and you can see that there are already about 70-something clusters reporting in Quincy. Then we can see the actual number of daemons that report Quincy as well: about 3,700 daemons were upgraded to Quincy already. And we can see all sorts of other breakdowns of information, like the total capacity by version.
A
So
this
help
help
us
understand
if
users
in
the
wild
are
adopting
new
versions
and
how
quickly
that
happens.
This
dashboard
lets
you
see
breakdowns
by
major
and
minor
versions.
So,
let's
see
we
want
to
take
a
look,
for
example,
at
a
pacific,
but
we
want
to
see
all
the
breakdown
by
minor
versions
in
pacific,
so
we'll
just
ask
for
a
display
by
minor
and
we'll
ask
specifically
for
pacific.
A
So
here
you
can
see
the
adoption
rate.
This
purple
is
16
to
7
and
it
was
released
about
the
the
end.
At
about
this.
It
was
december
right
of
2021,
so
you
can
see
how
it
is
being
adopted
by
users
in
the
wild.
A
We
have
some
other
panels
here.
I
will
not
get
into
everything
I
I
really
encourage
you
to
to
take
a
look
at
them,
and
here
we
have
a
complete
dashboard
just
for
the
breakdown
for
capacity
density
of
all
the
reporting
clusters.
A
Let's
take
a
look
at
a
cluster
x-ray
page.
This
is
a
page
of
the
giveaway
cluster
that
we
have
in
our
lab,
so
we
can
learn
about
how
old
this
cluster
is.
How
many
hosts
it
has
the
total
and
use
capacity
for
in
a
time
series
manner
and
then
to
learn
about
the
pools
and
pgenum
of
this
cluster
to
learn
about
the
latest
metadata
of
this
cluster.
A
We
can
see
the
reports
like
to.
We
can
see
individual
raw
reports
for
for
each
of
these
clusters,
like
in
the
cluster
x-ray
page.
We
have
latest
pools
information
and
we
can
learn
about
the
recent
crashes
that
happened
in
this
cluster
as
well.
A
B: All right, there was one; Neha answered it, but the question was: are these 1,840 clusters deployed at customer sites? And Neha answered that they're all upstream.
C: So what is the typical case? Is it only for the upstream deployments? If somebody deploys, say, a Red Hat or a Canonical distribution, it will likely not report.
A: They can; it's up to them. As I mentioned at the beginning, every user, every operator has their own choice. So if the customer is not air-gapped, they can opt into telemetry if they want to, but they have to do that explicitly. It does not happen in the background or anything like that.
A
They
explicitly
have
to
opt
in,
so
it
can
either
be
via
the
cli,
with
a
cell
telemetry,
on
command
or
via
the
dashboard
and
with
the
self
telemetry
on
command.
We
do
require
to
enter
the
license,
so
it
has
to.
It
has
to
be
explicitly
manually
done.
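A sketch of the opt-in flow (the license flag is required by the CLI):

    # Explicit opt-in; requires acknowledging the data-sharing license:
    ceph telemetry on --license sharing-1-0
    # Review what is enabled and when the last report went out:
    ceph telemetry status
    # Opting out is just as explicit:
    ceph telemetry off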
C: Just asking, because the comment was that all those clusters in telemetry are upstream clusters. So that means there are no clusters running a version from a distributor.
A: It is mostly upstream. We did see some reports from different distributions, such as Red Hat's and SUSE's, but those could also be, you know, test clusters or anything like that. We don't know these clusters; it's all anonymized, and unless users choose to identify themselves, we have no idea what organization they're from.
D: I guess, to emphasize that: I think the answer is that most likely they're all upstream. It is quite possible that there are one-off cases where a distribution like Red Hat or Canonical Ceph has enabled telemetry, but at the moment we don't differentiate.
A: So you can see that Seagate is easily the most popular device that users are deploying, with 31,000, nearly 32,000, devices. We can take a look at a breakdown by models and learn that there are eight devices that report 20 terabytes each.
A: So, the reason we collect health metrics, which currently is just SMART metrics, though we are working on adding vendor-specific metrics as well: the end goal here is to provide a disk failure prediction service. Everyone knows that in order to have a good model we need a lot of training data, and currently the only open data set out there is from Backblaze, which is really nice of them to provide, but the problem is that it is limited and not diverse enough.
A: So we are opening up the device telemetry data. We have an open data set, which can be downloaded from our website, and we call on researchers to do open research on this data and come up with better models for predicting failures. We also have plans to collaborate with other projects in order to create a larger data set for this.
A
All
right,
the
crash
data
that
we
have
from
the
crash
channel
has
raw
crash
reports
which
contain
each
one
of
them
contain
a
crash
id
which
is
basically
a
timestamp
plus
a
random
uuid.
A
It
has
information
about
the
demon,
type
and
name
the
ceph
version
of
the
daemon.
It
has
its
text
trace
the
vectors
of
of
that
graph
specifically,
and
it
has
information
about
the
distribution
and
the
kernel
version,
and
if
it
was,
if
the
crash
happened
due
to
an
assert
we'll
have
this
information
as
well.
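A sketch of inspecting the same crash data locally on a cluster, via the ceph crash commands:

    # List crashes; IDs are <timestamp>_<random uuid>:
    ceph crash ls
    # Full dump for one crash: daemon type/name, ceph version,
    # backtrace, and assert details if it came from an assert:
    ceph crash info <id>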
A: It makes more sense for developers to take a look at these. The problem is that the same issue can have different backtraces, and this can happen due to different versions: the code has changed, so the backtrace looks a bit different. There can also be differences due to different compiler versions or compiler optimizations.
A
So
in
the
back
end
we
have
a
crash
processor
that
looks
at
all
these
raw
crashes
and
identify
similarities
among
them
by
taking
those
raw
crashes
and
sanitize
their
back
traces,
and
he
does
it
by
removing
the
offsets
and
addresses
from
all
the
frames,
and
then
it
applies
some
search
and
replace
patterns
and
filter
out
patterns
from
some
frames
that
are
just
noise
in
the
back
trace,
and
then
it
adds
the
assert
data
if
it's
there
and
it
calculates
the
signature
using
a
256.,
and
this
processor
supports
multiple
generations
of
recipes
of
signatures
which
allows
for
backward
compatibility.
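A hypothetical sketch of the idea only (the real processor lives in the telemetry back end and its recipes are more involved; the clone/start_thread filter here is just illustrative of "noise" frames): strip addresses and offsets from every frame, drop noise-only frames, then hash what remains into a stable signature:

    # Sanitize one crash's backtrace and derive a signature:
    ceph crash info <id> | jq -r '.backtrace[]' \
      | sed -E 's/0x[0-9a-f]+//g; s/\+[0-9]+//g' \
      | grep -Ev 'clone|start_thread' \
      | sha256sum | awk '{print $1}'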
A
So
it
also
supports
the
version
of
the
crash
signatures
that
we
have
on
the
cluster
side
and
then
it
populates
the
database
creating
all
of
these
signatures
for
the
raw
crashes,
but
just
having
the
data
is
not
enough.
We
need
to
take
action
in
order
to
to
do
something
with
those
crash
reports.
A
Otherwise
it
creates
a
new
issue
and
it
knows
how
to
pick
up
the
the
right
project
in
mind
for
that.
A: Another important thing the bot does is identify regressions. Say there was a crash report that it synced with Redmine, and we found that it was a real bug and we fixed it, but then we receive new crash reports with a newer version than the one that has already been fixed.
A: So we have custom queries for these crashes. The first one would be the crash triage queue; this one specifically is for the entire Ceph project, so we'll have here all the crashes that are new, that were opened by the telemetry bot. If you want to look, for example, at just the BlueStore crashes that were opened by the bot, you can choose those custom queries from the sidebar here.

A: We have both the queue and the triage: the queue has everything which is open, and the triage has just the new ones.
A
The
latest
telemetry
crashes
sink
that
we
had
was
mainly
for
16
to
7
crashes,
and
it's
very
important
to
emphasize
here
that
not
all
crashes
mean
safe,
bugs
could
be
hardware
issues,
it
could
be
environment
or
resource
limitations
or
configuration
issues,
or
it
could
be
issues
with
other
dependencies
as
well.
So
there
might
be
many
signatures
linked
with
red
mine,
but
they
not
all
represent
real
sandbags.
It's
really
important
to
emphasize
that
all
right,
so
we
can
take
a
quick
look
at
the
architecture
on
the
server
side
for
the
crash
telemetry.
A: So here, the telemetry report lands on the REST API; it goes to the database, and then the crash processor sanitizes the backtrace and generates the correct signature. There's a Grafana instance that knows how to query this database.

A: Of course, the crash processor updates the database with all the new signatures, and then the Redmine bot syncs those signatures with Redmine. There's another component that I will not talk about today whose essence is to improve signature creation for the crashes, so that basically we have better deduplication for raw reports.
A: It allows searching by sanitized backtrace frames; by versions, either major or minor; across all revisions of signatures; by assert function and condition; by the number of affected clusters; and by crash status. And it allows a drill-down to cluster information, as I mentioned earlier, if you want to take a look at how big the clusters that experience a certain crash are, what versions they currently run, and so on. All right.
A: In order to access the dashboards, developers need to have access to the Sepia lab and to be members of the Ceph organization on GitHub. Users can search Redmine for their backtraces, for specific frames in their backtraces, and for crash signatures. And I just want to emphasize here: if you manually create a Redmine tracker and you add a crash dump there, please do not remove the stack_sig key. It is not a secret, and it really helps the crash bot sync similar issues that we receive through telemetry.
A: All right, let's take a look at the crashes landing page. Here we have all sorts of panels that give us a bird's-eye view of time-series data of all the crashes and their signatures by day. If we want to look at the new crash signatures, for example in the last 30 days, we can do that, and we have a breakdown by versions here, either major or minor; we can learn how many clusters are experiencing them.
A: All right, so we can learn a lot about all of the crashes that we've seen in telemetry in the last 30 days. For example, we can see that there are six clusters experiencing a certain issue that happens only on certain Quincy versions: we can see it in 17.1.0 and 17.2.0.
A: We can click on that and... yeah, it's too big; this is why it's a bit broken now. We can see that there's a total of 11 raw crashes reported, with a breakdown by versions: just one happened in 17.1.0 and 10 in 17.2.0. We can have a look at the sanitized backtraces, and here we can click on a sanitized frame.

A: This is a Python crash, so we can click on that and see all of the other crashes that have this exact frame in them; so not necessarily the same issue.
A: So we can see, for example, their usage, how big they are, and we can learn about their current and recent versions. You can see that one of them has mixed versions, so not all of the cluster is necessarily upgraded to Quincy. And here we can see the actual raw reports.
A: All right, I want to take a look at a few examples of success stories that we've had with telemetry. As you know, we launched Quincy a couple of weeks ago, and telemetry really helps us monitor crash reports of new releases; we used it for Quincy as well. In the example that we just saw, there were a few crashes that happened in Quincy too.
A: But look at the time frame: we took a very big window here, which is of course too big; then in the versions we chose just 17.2.0, which was released a couple of weeks ago, and here we can take a look. There are currently 28 crash signatures reported so far, and we can see that some of them happened in other versions as well, not necessarily Quincy. And again, as I mentioned, not all of them are real Ceph bugs, but it does help us monitor and better understand. For example, this one has many affected clusters, but it could just be a problem with hardware, or anything that is not related to Ceph.
A
All
right,
let's
take
a
look
at
some
bug,
fixes
that
happened
thanks
to
the
sink
with
the
red
mine,
so
this
one
was
created
by
the
telemetry
bot,
this
tracker,
it
assigned
it
to
the
cfs
project
and
it
filled
up
all
the
relevant
versions
and
the
crash
signatures
that
it
saw
in
the
wild
and
also
one
that
was
created
on
the
back
end
and
then
in
the
description.
A
It
has
a
link
to
to
the
dashboard
and
has
information
about
the
assert
that
happened,
the
sanitized
spectres
and
a
sample
of
of
a
raw
crash
dump,
and
it
was
picked
up
by
the
developers
of
ffs
and
there
is
a
pull
request
that
is
fixing
this
issue
that
was
seen
in
the
wild.
A
This
is
the
page
that
was
linked
from
the
tracker,
so
you
can
see
that
there
are
a
total
of
two
affected
clusters
that
reported
this
issue
with
a
total
of
13
row
reports
and
here's
the
breakdown
by
version
here.
We
can
sorry
click
on
any
of
these
frames
and
see
if
they
happened
in
other
if
they
occurred
in
other
crashes
as
well.
A
But
here
we
see
just
one
example,
which
is
the
one
that
we're
looking
at
and
we
we
can
see
again
the
daily
occurrences
and
if
we're
curious,
what
version
the
clusters
currently
have.
So
we
can
see
that
one
of
them
actually
upgraded
to
quincy.
So
maybe
this
can
help
us
narrow
down.
whether the crash happens just in Pacific and not in Quincy. And here, basically, we would have the contact details for users that identified themselves and experienced those issues.
A
All
right
now
we
have
another
example
for
another
crash
that
was
reported
through
telemetry
and
was
also
picked
up
by
this
time,
rgw
team
and
they
had
even
backwards.
So
this
issue
happened.
It
was
reported
for
1627,
but
they
realized
that
it
actually
went.
Sorry
yeah.
It
also
happened
in
octopus,
so
it
helped
us
discover
an
issue
that,
even
though
it
was
reported
just
for
one
version
needed
to
be
backported
even
further.
A: All right; then I want to talk about this tracker real quick. So this issue was first... oh sorry, is it this one?

A: Yes. So it was opened by the telemetry bot, and when it was found during the bug scrub, we discovered that we need more information to debug a crash like this. The user actually found this tracker by searching for it, and they provided us with some additional information; so users can respond to whatever we see in telemetry through the bug tracking system. And I mentioned earlier that the bot can detect regressions.
A
So
we
can
take
a
look
at
this
tracker
here
that
it
is
resolved
and
the
version
here
is
1528,
but
we
can
see
that
a
new
tracker
was
opened
recently
by
the
telemetry
bot
and
it
says
that
new
crash
events
were
reported
via
telemetry
with
newer
versions
then
encountered
so
far.
This
happened
because
that
tracker
is
related
to
other
trackers.
This
is
why
it
picked
up
16
to
zero,
but
it
linked
it
to
the
previous
issue,
so
might
be
a
regression
might
not
be
a
regression.
A
So,
like
I
mentioned
sometimes
just
the
raw
crash
reports
are
not
enough
and
users
identify
themselves.
So
we
can
contact
them
and
ask
for
more
information
to
better
debug
an
issue,
and
this
issue
was
first
reported
in
a
bugzilla.
A
You
can
see
about
a
year
ago
and
it
was
picked
up
by
by
the
bot,
and
that's
thanks
to
to
the
fact
that
we
had
the
stack
signature
here.
So
there
were
similar
crash
reports
through
telemetry
that
the
the
bot
could
scan
redmine
and
update
an
issue
instead
of
opening
a
new
one
and-
and
we
saw
that
we
have
links
here
to
to
these
in
telemetry.
We
can
see
that
there
are
49
affected
clusters
by
it
and
this
helped
to
prioritize
this
issue.
A
D: Yeah, so I think this one in particular is interesting, because this is an issue that we saw, as Gary mentioned, downstream, but we hadn't seen it upstream, and when you look at the crash it seems very intuitive, like it should have shown up. That was where my curiosity arose, and I checked the dashboard and saw that there are users hitting it. Clearly there was something missing in our integration tests that was not catching it.
D: Junior was assigned this bug, and he did a great job of identifying why we were not catching it. Going into the specifics: there is a way in cephadm to remove daemons, and there were no tests that were actually exercising reducing daemons in a cluster, in this case monitors, which is why we would see this. It also turns out that if you use the regular manual procedure of removing monitors, you wouldn't hit this crash, which explains why the other tests weren't catching it. So essentially that's where one extra data point helped us prioritize this bug, and this is a real issue which we are fixing and now also reproducing in Teuthology.
A: Thanks. I want to take a look again at this tracker and see how, let's say that for some reason the telemetry bot had not synced what we saw in telemetry with Redmine, maybe because it was an older version or we just hadn't synced it yet, we could manually search the telemetry dashboard for it.
A: So, for example, we can see that we have the backtrace included here, and if we scroll all the way down we can see the function name. We can copy even a small part of it, go to the search page, and search just for this specific function. We can leave the five-year window, that's fine, and we can see that there are seven crash fingerprints, or crash signatures, reporting a pretty similar issue. The reason that they are not all grouped together, in case they really are the same, is that the backtraces were different enough and the filtering did not detect that it is indeed the same issue. This is one thing that we're still working on improving.
A
So
so
again,
even
even
if
you
see
any
problem
there
out
into
even
in
tautology
or
in
downstream
or
wherever-
and
you
don't
find
it
in
tracker-
please
use
the
dashboard.
You
can
again
search
just
for
the
assert
function.
There
was
also
an
asserta
condition
in
this
case.
So
if
we
can
well
this
one,
I
guess
it
was
good
enough,
but
sometimes
the
search
condition
is
not
it's
not
very,
not
very
I'll.
Give
us
too
many
details.
A
So
in
this
case,
yeah
still
7
same
thing
or
we
could
just
use
some
frames
in
the
back
trace,
but
here
it
will
not
help
us
to
search
for
frames
that
we
filter
out,
because
we
we
will
search
only
in
the
sanitized
vectorize.
So,
for
example,
if
I
search
for
this
frame
over
here,
it
would
better
it
would
probably
find
better
results
so
yeah.
So
you
can
see
that
there
are
three
crashes
that
were
not
mentioned
here,
probably
a
different
way
of
execution.
A
So-
and
this
is
this
is
an
important
point
here.
If
you
don't
find
it
in
tracker,
please
use
the
dashboard
it
can
it.
It
might
not
be
synced
with
tracker,
yet
so
that's
important
to
to
emphasize
yeah,
and
there
are
some
other
use
cases
that
telemetry
was
very
useful.
A
So,
for
example,
we
wanted
to
know
whether
a
file
store
can
be
deprecated,
so
we
looked
at
the
data
that
we
have
so
far
with
telemetry
and
produced
these
panels
to
see
how
many
files
or
versus
blue
store
osd's
are
out
there,
and
you
can
see
that
we
have
a
breakdown
here
by
major
versions.
A
So,
for
example,
in
pacific
there
are
very,
very
few
demons
that
are
reporting
files
or,
and
if
you
want,
we
can
just
have
a
breakdown
by
just
specific.
For
example,
if
you
want
to
see
the
minor
versions
that
are
reporting.
A
And
I
think
I
think
we
announced
that
it
will
be
deprecated,
and
this
data
point
was
very
helpful,
so
we
we
did
use
the
survey
for
that
as
well
and
probably
mailing
lists,
but
telemetry
give
us
real
real-time
data
and
it
makes
your
voice
heard
as
users.
So
so
it
helps
us
to
better
understand
what's
going
on
in
the
wild,
so
you
can
see
the
ratio
for
blue
store
versus
file
store
in
16
two
to
seven.
A
We
were
also
asked
whether
the
regular
code,
clay
plugin,
is
being
used
in
the
field,
so
we
did
the
same.
We
had
panels
for
that
and
we
saw
that
it
is
being
used
by
real
clusters
in
the
field
and
there
were
no
related
crash
reports
to
that.
So
another
real
data
point
to
make
decisions,
and
I
think
now
we
are
even
developing
it
further.
If
I'm
not
mistaken,
it
was
this.
This
code
was
donated
by
a
researcher
and
but
we
did
not.
A
So
for
all
the
users
out
there,
please
join
us
with
opting
into
telemetry
with
the
staff
dome
trion.
You
can
see
that
it
is
super
super
useful
for
us
and
it
helps
us
make
a
better
product
more
robust
and
have
a
higher
quality.
A
Here,
yes,
so
I
will.
I
will
have
a
link,
a
link
to
that
as
well,
but,
basically,
on
that
on
that
page
you
can
just
click
here
and
see
all
of
the
related
all
the
related
dashboards.
But
I'll
I
will,
I
will
add,
a
link
to
that
in
the
ether
pad
as
well.
That's
a
good
question
and
how
can
developers
access
collected
reports
so,
as
I
I
showed
with
them
faster
in
the
cluster
page.
We
have
all
the
reports
here.
A
We
can
see
raw
reports
here,
but
if,
if
there's
a
need
to
have
them
in
another
format,
we
we
need
to
have
access
to
the
database.
C: And it's understandable; but if the reporting comes mainly from the upstream deployments, it is likely very skewed towards developers and maybe less towards production, and then if you make decisions for future directions based on those reports, maybe they're a little bit biased towards those developer upstream deployments, right? So are there any plans to promote telemetry adoption by the people who use the distributions rather than upstream?
A: Another thing is that there are clusters whose admins really want to report telemetry but cannot, because they're air-gapped, and we're thinking of supplying a solution for that. So these are, again, real deployments, but they cannot contribute their data because of this issue of being air-gapped.
D: Going back to the development versus real cluster question: I think the scale of the cluster tells us a story about whether it's a real cluster or a dev cluster, and the other thing is that most of the development only happens on the master branch, or the main branch, so the version number is an indication of whether it's a development cluster or a real cluster as well.
E: The demo, oh sorry, yeah: the demo you did, just for a brief period, with the bug that I was working on, was really helpful; just showing how we could find the actual crash signature using strings and using assert functions. That was nice, so thank you for doing that.
A: Thanks, I'm very happy to hear that. All of the information is there, just not yet linked with Redmine, because we want to better dedupe the crashes so that we're not overwhelming developers with crashes that could be better deduped. So it's there; we just need to actively search for it. Yeah.
B: Yeah, I just had a quick detail to add about the developments we're doing, or the idea we have, to collect or track unavailable data in clusters. That's not a data point that's currently being collected, but it's planned for Reef.
B: So essentially the idea is that in Ceph clusters there are ways to identify unavailable data, through PG states and when PGs were last active; and there are ways to see that in Ceph clusters right now, through warnings that pop up about data availability and by looking at the PG map.
B: Excuse me; but there aren't ways to track that data over time right now. So we are thinking about ways to take that data, look at the PG states and when they were last active, and calculate some sort of data availability score that can indicate when data was available; so that, say, over the course of a week, your data availability score was 80, something like that.
B: And then the goal is to include that in telemetry, so that we can have reports of data availability tracked over time and collected through opting into telemetry. That is an idea for the next release, Reef.
A: Yes, thanks for mentioning that. This data is going to be collected in the perf channel; we might add some highlight information in the basic channel as well, but yeah, this can really help us better understand deployments. So please, again, like we said: when you want to make your voice heard, please opt in to telemetry. We really just care about the data, and we're very open to feedback.
A
Let
us
know
if
you
have
any
ideas
for
improvement
or
anything
we
can
do
better,
we'll
be
very
happy
to
do
so
and
developers.
Please
use
the
dashboard,
the
crash
dashboard.
B
I
have
just
one
more
question
about
the
the
general
ui
of
the
dashboard,
so
the
the
public
telemetry
link
is
that
providing
more
of
just
an
overview
of
all
of
the
data
collected
versus
the
sepia
link,
which
is
where
developers
can
find
the
crashes.
Is
that
the
difference
there?
Yes.
A: Yes. That stack search page, I will add a link to it in the Etherpad as well. And please remember to check the time frame here; this is really important, because sometimes it's just the last 30 days but you actually want to see some more data, so you need to go back a little bit more, and make sure the fields are right. So if, for example, I now want to see, just in the last 30 days, all the crashes that happened in, let's say, 16.2.7,
A
And
then
I
don't
understand
that.
Okay,
there
are
just
two
crashes,
but
that's
because
I
have
this
search
for
this
string
in
the
batteries.
So
if
I
move
it
I'll,
just
I'll
see
everything
everything
on
the
last
30
days
for
16
to
7..
A
I
can
also
see,
but
let's
say
that
I'm
curious
what
also
occurs
in
quincy
any
version
of
quincy
so
I'll.
Add
the
major
affected
version
here
and
I'll
see
that
there
are
17
crashes,
that
happened
in
1627
and
any
version
of
quincy
and
of
course
they
could
happen
in
other
versions
as
well.
But
this
would
be
the
defaults
and
also
we
can
see
only
a
new
fingerprints
in
this
time
frame.
So,
for
example,
this
signature
here
was
first
seen
in
2019
and
last
occurred
in
2022
like
yesterday.
A
So
let's
say
I
just
care
about
new
fingerprints
in
the
last
30
days,
so
I
will
change
that
to
only
new
fingerprints
and
there
aren't
everything
that
happened.
Everything
that
we
saw
happened
prior
to
the
last
30
days,
so
so
that
that
can
really
help
also
to
narrow
down
any
any
issues
that
we
want
to
to
look
at.
Of
course,
you
can
search
by
demons
here
as
well.
A
You
can
search
by
the
stack
signature,
the
crash
signature,
sorry
either
version
two
or
version
one,
and
if
you
want
to
search
by
more
than
one
string
in
the
back
trace,
you
have
three.
You
have
three
substrings
that
you
can
search
for,
so
this
can
also
help
narrow
down
relevant
issues.
A
Yeah
there
are
some
status
search
as
well.
It's
very
it's
very
extensive,
so.
A
Sure
pleasure,
yeah
and
if
you
have
any
questions,
please
reach
out
and
currently
again,
as
I
mentioned,
there's
work
done
to
improve
the
duplication
of
the
crash
signatures.
So
casey
helped
with
his
feedback
with
the
the
bug
scrub
that
they
did
on
the
most
recent
sync
for
16
to
7..
It's
not
an
easy
problem
and
try
to
apply
some
ai
tools
in
order
to
better
dedupe
that
so
work
in
progress.
A
All
right
thanks,
everyone
and
we'll
see
you
in
the
next
tech
talk
or
any
other.