From YouTube: Ceph Developer Summit Quincy: RADOS
Description
00:00 - Beginning
00:23 - Dashboard and rados: overview of current status and next steps
34:11 - Review crash telemetry panels: https://pad.ceph.com/p/telemetry-crashes
48:05 - Immutable Content Optimizations
1:10:37 - Common mgr pool
1:46:14 structured confvals to generate document (and c++ source code): https://pad.ceph.com/p/confva-yaml-doc
2:02:08 - automated auth key rotation: https://pad.ceph.com/p/auth-key-rotations
https://ceph.io/cds/ceph-developer-summit-quincy/
A: So we've got quite a few topics already on the agenda. Should we begin?

A: All right, I'm just going to follow the order in which the topics are listed. If anybody wants a topic to be brought up earlier or later, let me know. Okay, so the first topic is the dashboard and RADOS: an overview of current status and next steps.

A: This, I believe, will be led by Ernesto and Alfonso.

B: I think you're going to run the demo; we're basically going to walk around the support for RADOS-specific stuff in the dashboard, and the things to come. We will mainly try to get your feedback and suggestions, and also learn about the things that are new to RADOS that you see as key for usability and that an operator should have in the dashboard.
C: Here we have the services that are currently running, and if we click on a service we can see the daemon type, the ID, the container ID and some other details, the status and other information that could be relevant. The plan here is to collect feedback on what we should be showing, or on other relevant info that could be important, that kind of thing.

C: Right now I have no OSDs, so I will proceed to create some. I can go to the physical devices list. This is the current way of adding OSDs: you try to apply some filters in order to proceed. So, for example, I will go for the type and select HDD, so I have applied a filter.
B: Regarding this specific workflow, we recently got a request from downstream QE about being able to specify the path for the filter in the drive group. I think that's supported by the specification, but I'm not sure how useful it is. Are there specific use cases where we can foresee that a user would need to filter by a specific path, apart from the other fields that we are exposing right now?

E: I think that depends a bit on what you want to do. If you have a day-one situation and you want to create a lot of OSDs on a lot of hosts, then specifying the path might be a bit problematic. On the other hand, if you just want to add a few OSDs on one specific new host or so, then having the path would make sense. So I think we kind of need both.
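The two filtering styles being contrasted here could be sketched roughly as follows. This is an illustrative model only; the field names (`rotational`, `min_size_gb`, `paths`) are simplified stand-ins and not the exact Ceph service-spec schema:

```python
# Illustrative sketch of drive-group-style device filtering. Field names
# are simplified stand-ins, not the real Ceph OSD service spec schema.

def select_devices(inventory, spec):
    """Return the devices on a host matching a drive-group-like filter.

    inventory: list of dicts like {"path": ..., "rotational": ..., "size_gb": ...}
    spec: dict that may constrain "rotational", "min_size_gb", or exact "paths".
    """
    selected = []
    for dev in inventory:
        # An explicit path list (the day-two "this one disk" case) wins.
        if "paths" in spec:
            if dev["path"] in spec["paths"]:
                selected.append(dev)
            continue
        # Otherwise apply broad attribute filters (the day-one "many hosts" case).
        if "rotational" in spec and dev["rotational"] != spec["rotational"]:
            continue
        if "min_size_gb" in spec and dev["size_gb"] < spec["min_size_gb"]:
            continue
        selected.append(dev)
    return selected


inventory = [
    {"path": "/dev/sda", "rotational": False, "size_gb": 500},
    {"path": "/dev/sdb", "rotational": True, "size_gb": 4000},
    {"path": "/dev/sdc", "rotational": True, "size_gb": 8000},
]

# Broad filter: all HDDs (rotational devices) on the host.
hdds = select_devices(inventory, {"rotational": True})
# Targeted filter: exactly one known device path.
one = select_devices(inventory, {"paths": ["/dev/sdc"]})
```

The point of the discussion is that the broad filter scales to many hosts, while the path filter only makes sense when the operator already knows the one device they mean.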
E: They probably won't understand them and will make mistakes. On the other hand, using the path on a lot of hosts is also going to be problematic.

E: Yeah, I don't have a clear answer; that's the problem I have.

D: But I guess in my mind the case is: there's a specific drive on a specific host that I want to go create an OSD out of, and it seems like that's a slightly different workflow.

D: It's almost like you want to go to the device inventory list, you see a disk that's available, and you want a button that says "create an OSD out of this disk", and it just does it, right? That's what's going to happen when you've replaced a disk or you've swapped something or whatever; you just want to go "okay, where's the disk? There it is."

H: By the way, I noticed something like sda listed in some examples here. I think you want to be careful about using persistent names for the disks as well. Things like sda or sdb are liable to change with hardware changes or across reboots, potentially. I'm not sure if this is a split that needs to be considered in cephadm or the dashboard.
C: Okay, so we showed the way that the dashboard has implemented the OSD specification. In my cluster I should add more hosts in order to allow more OSD creation, but the thing is: we have to find ways to simplify this OSD creation.

C: Okay, so we will take note of the suggestion. Do we have an official document for collecting all the feedback, I mean for all the components, or should each component maintain its own document in order to have the minutes of all this?

B: I'm writing it down in a notepad, but I can put it in the pad, though perhaps it's going to be a bit messy if we put everything there.
B: I think that's okay. I perhaps just wanted to go back to OSDs, because I don't think you mentioned, for example, the flag settings for OSDs, and also the recovery profile; those things are maybe worth mentioning.

C: Options... I don't know. From our understanding, right now this is already available to the user, but I don't know if...

C: We have put some info there, but I don't know if it's enough. Are the brief descriptions that we have alongside this information sufficient, or should we put more info for the user? Or should we consider that the info we have is enough, for example at...
A: Yeah, essentially you can override some of the defaults, but under the hood it auto-tunes all the values at the moment. With Pacific we also have the new mClock-based profiles, which we could add to the dashboard, which will always have a default. But if the user wants to change the profile, it should just be a select from a list, and we'll do everything under the hood. So all the recovery sleep and max backfills settings should just go away.

B: Yeah, okay, so that's only for recovery, right, not for scrubbing? There were no profiles previously.

A: Yeah, with the mClock scheduler change it applies to scrub as well, and some scrub settings, let's just say that. So that's what I said: let's have a separate meeting to decide which of those settings should not appear when we're using the mClock scheduler.
D: Just a separate comment on the UI. It's a little bit confusing to me that there's this dropdown with cluster-wide stuff, but it's on a page that has all these OSDs listed. So when a dialog pops up, it's not obvious whether it came up because you clicked on something on an OSD or whatever. I wonder if it would make more sense if these cluster-wide options were just different tabs.

E: Useful information and advanced information.

B: Yeah, I understand: having a kind of advanced mode, or basic and advanced modes, so you can narrow down the amount of information you're displaying. Yeah.
A: We can have some more iterations over these. I think for the whole OSD configuration, how we display stuff and what is important and what is not, we can have separate discussions. In the interest of time, we have 13 topics and we're almost half an hour into the meeting. What other major topic areas do you want to cover, apart from the OSD section in the dashboard, Ernesto?

C: Yeah, we were just showing the cluster area of the dashboard. If there is any missing feature that you, from the RADOS team, think should be here in the dashboard, and you have noticed that something is not there that should be, that can be great feedback for us. If we finish the walkthrough quickly, we have the CRUSH map viewer.
C: You also have the manager modules, so you can, for example, select a module from the dashboard. You have the details about its settings, and you can click on edit, and then, if I want, I can edit some settings, and this will be reflected like a `ceph config dump`. For example, if I want to set a secret key and an access key, I can do it in a graphical manner in order to get access to the object gateway. Then you update, and it gets updated, and if you do a `ceph config dump` you will see the changes. In the list of the manager modules we can see which ones are enabled and which ones are always-on, for example. I don't know if we should add any other relevant info here about the modules, or if we are okay here.

B: Yeah, but you have to select the erasure coded pool.
J: I think, for the sake of time, there are a lot of items on the list today, and we got on the agenda because we basically don't think enough people use this and give us feedback. So overall we'd love to see you come to the dashboard meeting, use this on occasion, and give us the feedback we need. Otherwise this product won't be what our customers need, let alone our engineers. I'd love to see everyone use this on a regular basis.

J: So overall, please send feedback. Please help us with the features that we require moving forward; it's an absolute must. Otherwise it will not be successful. At some point Sebastian's going to talk about the dashboard at SUSE and their successes; hopefully at our next dev meeting there'll be some interesting discussions about SUSE.
D: And just for the benefit of everyone here, I see on the calendar there's a Ceph dashboard stand-up almost daily, but it's at like 3:30 a.m. my time. Is there another dashboard stand-up that's at a later time?

B: We have the bi-weekly thing now. I'm not sure what time that is for you, but it's at three p.m. European time, so we can try that; I think in the past you joined it.

B: So we may try to look for a slot where we can all gather, okay?

A: I get the idea, so let's meet more often; we can discuss how. Okay, thanks, all right.
A: The second topic here is about reviewing the crash telemetry panels. Yuri, do you want to lead this? Sure.

M: So yeah, I linked another etherpad that has a few links; there's a link to a Sentry dashboard. First of all, I don't know if everybody knows: we're collecting telemetry data, some basic data about the cluster, data about crashes, and sometimes data about the disks of the clusters, and now we're going to focus specifically on the crash data collected.

M: So the Sentry link is the first one. I don't know if you guys can...

M: ...access that; you need to have the Ceph membership on GitHub, both the team and the...

M: No, no, it's okay. It might be even faster; I'm still having network problems, so I can try if you want.

M: Cool. So right now, only the last 30 days of the data that we collected from crashes could be imported to Sentry.
M: There is an issue with importing the older events, and there's another issue with the latest seven days, but we're handling that. The main idea is to integrate the review of these crashes daily, when we're doing bug scrubbing, so that we actually get value out of it. So, Neha, Josh, if you want to suggest ideas of how we can actually integrate looking at them daily...

M: There is one thing that's still in progress, which is to enhance the daily emails. Once we know where exactly to link them to, we will have in the daily emails a summary of the latest 24 hours and 14 days with the new crashes that were reported through telemetry, and then we can link to either Sentry or the other Grafana instance and decide how to proceed with opening correlating tracker issues for them.
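The 24-hour/14-day summary described here amounts to tracking when each crash fingerprint was first seen. A minimal sketch, with an invented record shape (the real telemetry schema differs):

```python
import datetime

# Hedged sketch of the daily-email summary: count crash fingerprints first
# seen in the last 24 hours and the last 14 days. The record fields
# ("fingerprint", "timestamp") are illustrative, not the real schema.

def summarize_new_crashes(reports, now):
    """Track the first-seen time per fingerprint, then bucket by recency."""
    first_seen = {}
    for r in reports:
        fp = r["fingerprint"]
        if fp not in first_seen or r["timestamp"] < first_seen[fp]:
            first_seen[fp] = r["timestamp"]
    day = datetime.timedelta(days=1)
    new_24h = [fp for fp, t in first_seen.items() if now - t <= day]
    new_14d = [fp for fp, t in first_seen.items() if now - t <= 14 * day]
    return {"new_24h": sorted(new_24h), "new_14d": sorted(new_14d)}


now = datetime.datetime(2021, 6, 1)
reports = [
    {"fingerprint": "a", "timestamp": now - datetime.timedelta(hours=3)},
    {"fingerprint": "b", "timestamp": now - datetime.timedelta(days=5)},
    # "b" reported again recently, but it is not *new* in the last 24h:
    {"fingerprint": "b", "timestamp": now - datetime.timedelta(hours=1)},
    {"fingerprint": "c", "timestamp": now - datetime.timedelta(days=30)},
]
summary = summarize_new_crashes(reports, now)
# "a" is new in the last 24 hours; "a" and "b" within 14 days; "c" is old.
```

The key design point is using the first-seen time, so a long-known crash reported again today does not show up as "new".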
D: Yeah, I mean, ideally we'd have this nicely linked; the Redmine plug-in for Sentry is pretty useless, so I think we probably need to write something ourselves, just so we can link these to tickets. But even without that, I'm hoping we can incorporate this into our bug scrub routine, so that, in addition to looking at the tracker issues, we also look here. Some we can obviously ignore; you know, here's a random issue...

D: That one had only one event, but if we sort by events and find things that users are actually hitting... like here's an RGW issue that is apparently affecting two different users, and it looks like it's just crashing repeatedly on this particular thing. You can tell what versions are being affected.

D: Here's some more stuff: the version, the frequency over time. Ideally we'd have this linked to the trackers. Until then, maybe we can come up with something like: we ignore it once it's solved or resolved, or... I don't know; it's kind of hard without actually having it linked. I guess you can't have a custom state.
A: At least the way I use Sentry at the moment is not to do bug scrub, but when I see issues that can be tracked using Sentry, I do link them back to the tracker; or, if I open a tracker, I link the Sentry event associated with it, so anybody looking later can go and see what the frequency is, or when it started occurring, and all that kind of stuff.

A: I think it would be useful if we could make this, you know, the unique failures that are happening over a week; this could be part of our bug scrub routine.

D: Yeah, the bug scrubs are usually once a week, which isn't ideal, which is why I guess those daily emails will probably still be important. But it seems like at least once a week we should be checking this, just to see what crashes people are still hitting on the latest stable point release, because that should be guiding our work.
A: That is a clear data point of what we need to focus on, or whether there is a regression in particular. I think the question we had was when you start tracking; we currently have these 14 days of crashes, so we could make that more useful and have separate sections.

M: Okay. Maybe it's worth mentioning that the entire database is browsable through the other dashboard, and there are lots of search options there as well. So if there are new crashes reported on the latest point release, you can also see if they were reported prior to that.

M: Yes, okay, cool. So the other dashboard has all sorts of panels with statistics according to versions and other parameters as well. And I just wanted to focus on maybe the most interesting part: you can see how many new crashes there are. We call them fingerprints, because we try to group together similar crashes.
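The "fingerprint" idea (grouping similar crashes into one signature) can be sketched as normalizing the backtrace before hashing it, so reports differing only in addresses or line numbers collapse together. The real telemetry code is more involved; this only illustrates the principle:

```python
import hashlib
import re

# Hedged sketch of crash fingerprinting: strip the volatile parts of each
# backtrace frame (addresses, line numbers), then hash the normalized
# frames so similar crashes share one signature.

def fingerprint(backtrace_frames):
    normalized = []
    for frame in backtrace_frames:
        frame = re.sub(r"0x[0-9a-fA-F]+", "0xADDR", frame)  # mask addresses
        frame = re.sub(r":\d+", ":LINE", frame)             # mask line numbers
        normalized.append(frame)
    return hashlib.sha256("\n".join(normalized).encode()).hexdigest()[:16]


# Same code path, different addresses and line numbers:
crash1 = ["OSD::do_op() at osd.cc:1234", "handler 0xdeadbeef"]
crash2 = ["OSD::do_op() at osd.cc:1299", "handler 0xfeedface"]
# Both reports normalize to identical frames, hence one fingerprint.
```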
M: So, for example, in the last 30 days you can see that there are about 300 new fingerprints that were never seen before. So obviously these are all 15.x...

M: Well, it's not that accurate, because if all of a sudden there's a cluster that started reporting and they're running Mimic, we will see new crash fingerprints for 13.x here. So these will be outliers.

M: You can specifically filter a certain point release here and focus just on that. The "minor affected" column will show you all the minor versions affected by that specific crash signature, and that's on purpose, because if all of a sudden you see something weird on a certain point release, you can see all the other versions that were affected by it as well.
M: And then you can do some more drill-downs. You can see when it was first reported; you can see the actual backtraces of it.

M: You can see just the crash ID, and here you have the actual trace that wasn't filtered, and all of the captured details from that crash. So it's easily browsable. And by the way, if you see a certain function that catches your attention, you can just click on it and you'll see all of the fingerprints that actually contain it, and you can keep filtering by more functions on the stack traces that share the same one.

M: So it would be nice if it were used more often.

A: No, we can hear you fine. Okay, this is it; anything else you want to cover, or are you good?
A: The next topic is about BlueStore block cache improvements, but I do not see any of the interested parties on the call. So maybe we will hold that for now, and if Igor, Adam or Mark join later, we can cover it.
O: Can you hear me? Yes? Cool. So we'll try to keep it short, because it's kind of a little bit off topic. There are three of us. First, Massimo is from the University of Pisa, and his focus is on streaming data and high-performance computing. There is also Nicolas Dandrimont, who is here listening, and he is working for Software Heritage, which is an initiative to collect all the source code available in the world, produced by humanity, and keep it safe.

O: The same way archive.org does. So the problem that we faced... and there is me, Loïc, and I'm actually from a company which is called Easter-eggs. That's not a joke; it's the actual company name.

O: It also shares commonalities with EOS, which is the storage system of CERN, in which they store much bigger objects, but they have billions of them.
O: So we tried to use Ceph for these small objects, and there are tens of billions of them. The problem that we faced is twofold, really. First, it's space amplification.

O: That is, we want to save space, so we use an erasure-coded pool, but unfortunately there is a space amplification that in Ceph grows to over 35% of the total storage.
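Back-of-the-envelope arithmetic shows why small objects amplify so badly under erasure coding. The numbers below are assumptions for illustration (a 4+2 EC profile and a 4 KiB BlueStore allocation unit), not measurements from the Software Heritage cluster:

```python
import math

# Rough model of small-object amplification in an erasure-coded pool:
# each of the k data chunks and m parity chunks is rounded up to the
# on-disk allocation unit. Profile and allocation size are assumptions.

def raw_bytes_used(obj_size, k=4, m=2, alloc_unit=4096):
    """Raw bytes consumed by one object in a k+m EC pool with allocation rounding."""
    chunk = math.ceil(obj_size / k)                        # logical bytes per data chunk
    chunk_on_disk = math.ceil(chunk / alloc_unit) * alloc_unit
    return chunk_on_disk * (k + m)                         # data + parity chunks


small = 3_000                 # a 3 KB artifact (a source file, say)
actual = raw_bytes_used(small)      # 24576: six 4 KiB allocations
ideal_ratio = (4 + 2) / 4           # what 4+2 EC should cost: 1.5x
actual_ratio = actual / small       # ~8.2x for the tiny object

big = 100 * 1024**2           # a 100 MiB packed blob
big_ratio = raw_bytes_used(big) / big   # back to exactly 1.5x
```

This is the motivation for packing: once millions of tiny objects live inside one large blob, the allocation rounding becomes negligible and the overhead returns to the nominal EC ratio.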
O: So no matter how hard we try, there is a very significant space amplification, because the objects are too small. And the second problem is enumeration; that is what Software Heritage is about: the users of Software Heritage...

O: So, when trying to figure out how to help with both problems, we stumbled upon the solution that LinkedIn found in their system, which is called Ambry. Essentially, what they did is group the objects together; that's a fairly simple solution. They take millions of objects, put them in one big 100-gigabyte container, and that allows them to operate more efficiently.
O: All this wouldn't work at all if these were read-write objects; it's not flexible enough. But the commonality of all these workloads is that they are immutable objects, and so we can leverage this property by packing them together, and that actually works; there is no drawback. So, for Software Heritage, what we're doing this year is to use Ceph and pack the objects into an RBD image, so that we get the benefit of everything that we just talked about.
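The Ambry-style packing being described can be sketched in a few lines: immutable small objects concatenated into one large blob, with an index at the front mapping each name to an (offset, length) pair. The on-disk format here is invented for illustration; the real designs (Ambry, Haystack, the Software Heritage work) differ in detail:

```python
import json
import struct

# Minimal sketch of pack-with-index for immutable small objects.
# Blob layout (invented): [4-byte index length][JSON index][payload bytes].

def pack(objects):
    """objects: dict name -> bytes. Returns one self-describing blob."""
    index, payload, offset = {}, b"", 0
    for name, data in sorted(objects.items()):
        index[name] = (offset, len(data))
        payload += data
        offset += len(data)
    index_bytes = json.dumps(index).encode()
    return struct.pack(">I", len(index_bytes)) + index_bytes + payload

def lookup(blob, name):
    """Read one object back using the index plus a single ranged read."""
    (index_len,) = struct.unpack(">I", blob[:4])
    index = json.loads(blob[4:4 + index_len])
    offset, length = index[name]
    start = 4 + index_len + offset
    return blob[start:start + length]


blob = pack({"sha1:aa": b"print('hi')", "sha1:bb": b"# readme"})
```

Enumeration becomes one sequential scan of the index, and a lookup is one ranged read, instead of one RADOS object (and its per-chunk allocation overhead) per tiny item; immutability is what makes the append-once, never-edit layout safe.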
O: The thing is, it's on top of Ceph, and it will be fine, but it's yet another piece of software to solve this fairly old problem. The first time Facebook published something about this specific workload was about 10 years ago, and nowadays you would think that distributed storage natively provides something that solves the problem, so that you don't have to do anything; but it's not the case.

O: So there we go: what if Ceph provided something that would help this use case, so that Software Heritage, Facebook, LinkedIn just need to use Ceph out of the box and it just works? There is no space amplification, and if you want to enumerate all the objects, it goes as fast as if you were doing an rsync on 100 gigabytes' worth of volumes. I don't know if it's possible, and before I go into the ideas that we got, I would like to probe everyone here.
H: I think one of the major aspects of why this is difficult is the listing piece at the RADOS level. Since Ceph is distributing objects across the cluster, you're treating each of them as an individual RADOS object and trying to do listing across all of those. It's a very different sort of design than when you're talking about packing a bunch of objects together: you're no longer hashing them according to their name; you're trying to group them together in some way.

O: And yes, you heard that right.
I: Well, Zipper is part of an extension system being developed for RGW that allows filtering, basically stackable drivers and things. I think what we're imagining is a filter driver for RGW, but I think in general RGW is where we would want to put this. We're definitely interested in incorporating packing support, and we see some challenges in the...

D: I think one of my questions is how this is used. In your case, when you prepare these big RBD images, is there this whole sort of offline phase where you pack it all together, build an index, put the index at the front, and then import that into...

O: Exactly.
D: Once this is in Ceph, you want the index at the front so you can do fast listing of the names of objects and look up a specific object that's in this big blob. That's a little bit challenging to do if you're sort of slowly accumulating these over time. But yeah, I think it would be helpful to have a set of constraints to better understand how it would be used, like what the...

D: What are the ingestion workloads like? Is it a trickle of files that are randomly accumulated, or is it a serial import of a linear sweep of a file system, or whatever? And then also, what are the subsequent workloads, like batch export or static file serving or so on? I think understanding what those workloads look like, and what the expectations are, would help drive the design.
O: Yeah, I suppose that's part of the problem, because it varies. I focused on the workload of Software Heritage, which is fairly easy to understand, because they collect elements and then stack them, and then you can say, after 10 million objects or 100 gigabytes: okay.

O: Now I close this, and this is immutable, and we add an index, as you said. Then, when you read, you can look up the index, and if it's a perfect hash table you have fairly quick access to the individual object. And then there is the mirroring part, which is again fairly easy, because you just want to send the 100-gigabyte object to someone else. Now, if it's implemented as an RGW filter driver, as suggested, would that allow for that kind of order-of-magnitude change?

O: Would it be feasible to stack them into gigabytes and gigabytes of a single object? That's what I don't quite see.
I: Well, I'm not sure that the filter driver explains all of it, but RGW might be using that filter in order to do listings, or to incorporate edits. There's also... and this is valuable; your post reminded us to think about this more. We have other...

I: There are other workloads that have been proposed for object packing with S3 object storage; historically these came in as a group, and they are used in hyperscale deployments, I'm not sure which ones, but where you have lots of smaller S3 objects and you'd like to pack them into some intermediate form. That probably is not the kind of object scale that your proposal has, but we might be able to reuse some machinery.
D: I mean, we have had some discussions around RADOS pools that have different access semantics, where they might be immutable, or where, when you create a new object, you know that it has a unique name or something like that, because those would affect the way that the replication and recovery algorithms work. But I think those are different, because they don't really address the problem of tiny objects and bulk import and export and so on.

O: So we can actually skip the first idea, because we already talked about it. The second idea we can actually skip too, I think, because it would be more relevant in RGW and the bulk reads.
O: I guess that's where I'm a little hesitant, because when I thought about it, the benefit of packing, let's say, at least one million objects together, having at least 10 gigabytes' worth of something that you can mirror, seemed like a good way to change the order of magnitude of the problem, and I still don't see if RGW can provide that kind of change, if that makes any sense.

D: I don't think Zipper does by itself, but Zipper is a way to do inline edits of the requests, to redirect them to do something specific. But my sense is this probably involves... I mean, I want you to jump in here too; we've talked about this a bit... the object manifest that RGW has allows for objects to have alternate structures that can be arbitrary, and an async...
I: Another piece probably would be asynchronous processing after ingest, to move these into their packed form, or to do later compactions if we make edits; your system never makes edits, but others might. And Zipper, I think, gives you inline access to the underlying target, to make more flexible access patterns inside RGW. If we see that we're looking for something that has a more complicated alternate access path, other than our bucket index, then we have some...
P: Yeah, well, Zipper is not going to solve everything magically; it's just the framework where everything will need to be created. That's one thing. Just a side note: that access pattern, where you have multiple separate objects and then you can read everything in one go, reminded me of the Swift large-object implementation.

P: So RGW already does something like that, where you can pack multiple objects as one logical entity, but that wouldn't be the way to go.

I: Because yeah, that's the static-large-object sort of implementation, but it does do that. And the pieces are... as in your workflow, the pieces are sort of the inverse of it; in Swift object storage, for these large objects, the pieces are themselves objects that something uploaded, and then it uploaded a rule for combining them, right?
R: I think the broader concern is that most of these projects have very different ratios of data to metadata, and of how many disk IOs they can afford per packed object to write it or to read it. And we don't have a great place for building the giant memory caches that something like Haystack tended to have to amortize the lookup times. I mean, RGW is the place to build something like this in the Ceph ecosystem.

R: But you need to look very carefully at whether objects are pre-packed before RGW gets them, or whether we can afford to write a bunch of four-kilobyte objects into RGW and then have RGW read them out, rewrite them into a packed form and delete them; and whether lookups can afford to first...

R: ...do the generic RGW object lookup to find the packed location, then look at the index of that packed object, and then do the actual read. That's my concern with something like this, especially if we're trying to do a generic implementation that we expect to work across a large range of anything. Because I think the Haystack paper had a lot of math to demonstrate what their constraints were and that their design hit them.
O: True. Unfortunately, we don't have the implementation, but yeah, it did. That's okay, yeah, sorry.

D: One last comment: again, I think it would be really helpful to have just something written down to better understand what that ingest workload is. For example, if you're ingesting a project at a time, or a tarball of a particular version of software or whatever, so the files are all localized within a tree or a branch of the hierarchy, all grouped together already, then that's easier to graft into some larger view of reality; whereas if you're just getting random files spread across all different parts of the namespace, all at once...
O: And so there was one last item: it's the ability to stream the objects out of Ceph. The general idea is that there seems to be an ecosystem developing of layers that transform database updates, for instance, into streams, like Kafka does, and people are developing software to analyze these streams to do various things, and they are interested in having different backends from which they can read objects.

O: So currently there is an ecosystem, and they know Postgres, they know MySQL, they know other backends; they don't know Ceph. And I was wondering, and that's where Massimo is most interested: it could maybe be RGW that allows plugging in this kind of backend, so it's made available as a stream of objects somewhere, which would help again with mirroring; and again, the speed of this mirroring would matter.

O: Sure, okay. Well, unless Nicolas or Massimo have anything else to add, that's all I had.

O: Well, thank you very much, and we'll be looking forward to examining how RGW does that or could do that.
A: All right, the next topic is about Ceph manager improvements. This would be Josh and me, probably.

H: Sure. So we talked about a number of shorter-term improvements at the CDS in November, so today I wanted to focus more on the larger-scale improvements, like scaling out the manager and dealing with scalability issues in general.

H: So there are kind of two categories here, but let's start with scaling out the manager in general.
H: This gets us out of the problem of using sub-interpreters in Python, which aren't actually supported anymore by Cython, and it isolates the modules from each other, so that if one has a problem it doesn't take down another one. Plus, it lets us easily, say, restart a particular module, or reload a particular module, without affecting the others.

H: It does probably make the deployment model a little bit more complex, because you need something like cephadm to understand that it needs to deploy a bunch of processes and manage them, and you need to do all of the same kind of failure detection and restarting that you would for a single manager process, but with many processes.
H: But the bright side is that these are all stateless, so any time one goes down it can be restarted; it's not a very complex operation, and it doesn't require much complex recovery. And the key piece that you need there is a proxy layer, like HAProxy, which we already often deploy with the dashboard, to be able to stand in front of whatever endpoints we may be exposing; but that's something that we already need.

H: ...this information periodically, say every 5 seconds, 10 seconds, 60 seconds. That's plenty of load.
B: Yeah, for a single user it's every five seconds. We're right now working on a caching layer for the dashboard, so we don't really need to call the manager API so often; we can keep the data inside the module itself.
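The module-side caching being described boils down to a time-to-live cache in front of the manager API. A minimal sketch, with illustrative names (the actual dashboard caching work lives in the tracker issue mentioned below):

```python
import time

# Hedged sketch of a module-side TTL cache: remember a fetched value for a
# few seconds so repeated dashboard requests don't each hit the manager API.

class TTLCache:
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable clock, handy for testing
        self._store = {}            # key -> (expiry_time, value)

    def get_or_fetch(self, key, fetch):
        """Return a cached value if still fresh, else call fetch() and cache it."""
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        value = fetch()
        self._store[key] = (now + self.ttl, value)
        return value


calls = []
def expensive():                    # stands in for a manager API call
    calls.append(1)
    return {"osds": 12}

fake_now = [0.0]
cache = TTLCache(ttl_seconds=5, clock=lambda: fake_now[0])
cache.get_or_fetch("osd_map", expensive)   # miss: performs the real call
cache.get_or_fetch("osd_map", expensive)   # hit: served from the cache
fake_now[0] = 6.0
cache.get_or_fetch("osd_map", expensive)   # TTL expired: fetches again
# Three requests, but only two real API calls.
```

As the next exchange notes, a structure like this could be shared across modules rather than being dashboard-only.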
A: So this caching structure could be a shared caching structure across modules, which doesn't necessarily need to be only for the dashboard.

B: Yeah, that's what we are planning right now. I will share the link to the tracker issue here. In fact, that's what Pere is going to do; I'm going to explore that part, so we can generalize it and also include that side, if we want to pursue it.
C
Yeah, we started doing some profiling first in order to apply this cache, because it can be done at several levels, not only at the manager level or the API level or whatever; it can be at several layers.
B
Well, right now there are some existing caches. For example, in the HTTP controllers there's this view cache, which is caching things like the RBD images or the pools. And apart from that, I think we have some caching in the front end as well, where we're also keeping the browser from polling the back end. But I think in the case of the dashboard we will probably need this kind of layered approach, because there are multiple places where the dashboard can have multiple users.
H
H
What this doesn't address is modules that might potentially need to scale beyond a single process. I'm not sure that we have any of those currently. I think the one that has been a bottleneck is Prometheus, and maybe a better approach there would be to change the way we're reporting metrics data, to give it directly to Prometheus itself from each of the nodes, instead of funneling it all through a single, or even a few, mgr processes. I'm not sure how much value we're getting out of that extra processing there.
D
D
Because I'm wondering: if the approach we take to scale the manager is to basically separate out modules into separate manager processes, then that limits the ceiling of what we can get to the number of manager modules, right? If we have 10 manager modules, then at most we can scale 10 times bigger. But I think the reality is that we don't actually have that many; only a few of the manager modules are actually ailing.
D
H
Yeah, the other ones that have been problematic are the progress module and the insights module, because of that non-polling behavior. So if you address them, make them poll instead of processing every update, I think they won't be an issue.
G
D
Yeah, I guess it seems like if we address those piecemeal, then you can count on one hand the remaining modules that are problematic, and so I'm wondering if the cost-benefit of the complexity of breaking these things out is going to be limited; the benefit will be limited to that one hand. I guess that makes sense.
H
I don't think the HTTP runtime is an issue per se; that's only used by a few modules, not by everything.
H
I don't think it requires a rewrite, but using more than one process in Python is possible.
D
G
G
D
H
C
Yeah, for example, creating a lot of OSDs, a lot of RBDs, thousands of buckets, for example, for the dashboard. I can imagine several tests retrieving that info; for example, retrieving thousands of buckets, 10,000 for example, or 1,000 RBD images, because we are doing listings and we are trying every time to improve the way we retrieve the info.
C
I don't know; for example, an upstream consumer like CERN. It would be great, at least; I don't know if the stress that CERN puts on Ceph is something more realistic. I don't know what the exact figures are that CERN is using for Ceph, how many OSDs or how many RBDs they have, but if we can retrieve this data we could create some more realistic tests.
A
H
A
Q
G
D
Of fake OSDs that are sending manager reports, whatever it is. I guess my high-level comment, or thought, is basically that it seems like there is some lower-hanging fruit, where there are existing issues that, if independently addressed, would make things perform better even with the current architecture, like the notifies, yeah.
H
D
And if we just address those independently, we can, you know, keep splitting up the manager in our back pocket. You know, don't close that door, but hopefully not need it; I'm not sure, yeah.
H
I agree, it's not the first thing to do. I think I definitely want to fix the progress module and insights notification consumption first, but we've already seen the balancer and other modules using a lot of CPU, so it might be time to think about that splitting out, even on a single host, into multiple processes.
D
It might be that the balancer module is a good example, where it doesn't actually have to talk to other manager modules. It only...
H
D
Talks to the manager, so it could be in a separate interpreter, in a separate, well, I don't know; it might be easier to break that one out than the others, I guess. That's what I was getting at, but the Python interpreter constraints are sort of bizarre.
H
G
H
A
A
A
I guess one more point before we move to the auto scaler. I wanted to bring this up: we had this discussion about using a common pool for some of the modules, like insights and device health and stuff. I think that's still a good idea; we'd just have one manager module pool instead of, you know, different pools, and inside it each is doing its own thing.
D
D
A
Okay, cool. For the autoscaler...
A
N
Sure. So me and Josh have been working on creating a new behavior for how the autoscaler would work. Just a bit of background on the problem with the old autoscaler: it starts out with a minimum number of PGs per pool, and it scales up when the pools get used more.
N
This is a problem for out-of-the-box Ceph users if they're creating pools in a large cluster: they will have low performance at the start, because there will be the minimum number of PGs, and it would only scale when there is pressure, assuming they don't know how much usage the cluster in general is going to get.
N
So the new algorithm, in high-level terms, starts out with a full complement of PGs; that is, how many PGs the pools in general should be using. It depends on the number of OSDs and the mon target PG per OSD. So let's say there are four OSDs; four times 100 would be 400, assuming the replication size of each pool is one.
N
The full complement in this case would be 400, and let's say there are four pools. When you start out with four pools, each pool would get 400 divided by four; that's a hundred PGs each, and I guess if you round it to a power of two that's 128 PGs each. But let's say pool one started using 50% of the capacity of the space. So how it works is that pool one would get...
N
Fifty percent of the full complement of PGs, which in this case is two hundred; rounded to a power of two that's 256. And the rest of the pools, pools two, three and four, would get the remaining 200 divided by three, which is about 67 each; rounded to a power of two, each would get 64 PGs. So that's just how it works, assuming, let's say, the bias is one. The problem with this that I encountered is that, if you can see in the...
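The arithmetic just described can be sketched as follows. This is an illustration of the idea, not the actual pg_autoscaler code, and it assumes a nearest-power-of-two rounding rule, which matches the numbers in the example (100 to 128, 200 to 256, 67 to 64):

```python
def nearest_pow2(n):
    """Round to the nearest power of two (assumed rounding rule)."""
    if n < 1:
        return 1
    lo = 1 << (n.bit_length() - 1)   # largest power of two <= n
    hi = lo * 2
    return lo if (n - lo) <= (hi - n) else hi

def distribute_pgs(num_osds, usage_ratios, target_pg_per_osd=100, replica_size=1):
    """Sketch of the 'full complement' idea from the discussion: the
    whole PG budget is handed out up front, each active pool's share
    driven by its fraction of used capacity, and the unclaimed
    remainder split evenly among the idle pools."""
    budget = num_osds * target_pg_per_osd // replica_size
    used = sum(usage_ratios)
    idle = [r for r in usage_ratios if r == 0]
    out = []
    for r in usage_ratios:
        if r > 0:
            share = budget * r                      # proportional share
        else:
            share = budget * (1 - used) / len(idle) # even split of the rest
        out.append(nearest_pow2(int(share)))
    return out

# 4 OSDs x 100 target PGs; pool 1 using 50% of capacity, pools 2-4 idle:
print(distribute_pgs(4, [0.5, 0, 0, 0]))  # -> [256, 64, 64, 64]
```

With all four pools idle, the same function reproduces the initial even split of 128 PGs each.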
N
There's a problem for RGW with this new autoscaler. When you start out, the pool for the device health monitor starts up with 128 PGs, and when you're creating the RGW pools, the device health monitor pool did not scale down in time. Therefore it hits the mon max PG per OSD cap. One of the solutions that I've been working on to fix that is to create a pg_num_max; it's another feature.
N
So basically just capping the number of PGs each pool is allowed, similar to how pg_num_min works. We could possibly do that on the device health monitor pool. That's where I wanted to discuss more what we should do about that.
D
H
So we do have ways that CephFS and RGW create their metadata pools, and they're already applying certain settings to them, so we could apply similar kinds of max settings for those metadata pools at least. I was thinking maybe this max should be a percentage of the PG budget rather than a fixed absolute number, though; on a very large cluster you might want to have more parallelism for your metadata.
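A minimal sketch of how such a cap could compose with pg_num_min, including the percentage-of-budget variant floated here. The names and signature are illustrative, not the final interface:

```python
def clamp_pg_num(suggested, pg_num_min=None, pg_num_max=None,
                 budget=None, max_budget_fraction=None):
    """Clamp the autoscaler's suggested pg_num (illustrative sketch):
    an optional absolute pg_num_max, an optional max expressed as a
    fraction of the cluster's PG budget, and a pg_num_min floor."""
    n = suggested
    if max_budget_fraction is not None and budget is not None:
        n = min(n, int(budget * max_budget_fraction))  # budget-relative cap
    if pg_num_max is not None:
        n = min(n, pg_num_max)                         # absolute cap
    if pg_num_min is not None:
        n = max(n, pg_num_min)                         # floor wins last
    return n

# A metadata pool capped at 5% of a 400-PG budget instead of a hard 128:
print(clamp_pg_num(128, pg_num_min=8, budget=400,
                   max_budget_fraction=0.05))  # -> 20
```

The fraction-based cap scales up with the cluster automatically, which is the advantage over a fixed number discussed above.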
A
Yeah, that definitely sounds better than a hard cap. As for what the max should be, yeah, what you said is right. I mean, if we know what the application is going to be, maybe we can determine those caps based on whether it's an RGW data pool or metadata pool or whatever. But if it's just a generic pool the user is trying to create, for those it'll be hard to determine what those caps should look like, or we just have to go with something.
H
These days, the data pools, you kind of want them to expand so they fill the whole cluster. It's only the pools that aren't going to get much data, that can't use that level of parallelism, where you don't want to have that many PGs.
G
D
Seems like there's no substitute for actually having some information about what the pool is going to be. If we can somehow induce a user to set the target ratio on a pool, then we're going to pick the right number the first time, and then we're not going to have to split or merge. But in the absence of that information we basically have two options.
D
I think actually the scaling up is more conservative, but starting with lots of PGs and scaling down is probably more likely to not move data, because in general the pools are going to get created at the beginning of the cluster, before there's anything in there, so the PGs will fill up in place.
D
H
H
Yeah, I think when we were hitting this in testing, it actually wasn't even a pool that was being created with a particular pg_num, so it was just using the default low number. But because the autoscaler had already kind of used up the entire budget, and because of the rounding, it ended up being close enough to the cap that we just went a little bit over.
D
G
D
D
D
I guess the one other sort of elephant in the room is that this was such a shock when it happened while I was testing the Pacific upgrade, because as soon as the manager upgraded, which was like the very first step of the upgrade, suddenly PGs went crazy, right? There was a bunch of splitting and whatever, everything went nuts. And so I think on upgrade, I don't know that we want to...
H
D
Yeah, or even an autoscaler profile or something. So there's the conservative profile that has the current behavior, where things start small and scale up, and there's a new profile that we introduce that has this new behavior, but isn't the default for upgraded clusters; it is for new clusters, or something like that. Yeah.
G
A
A
H
So the thing coming up related to the autoscaler was that we have some of these warnings, like the warning about object skew, and I'm not sure if there are maybe a few others like this, that we introduced before the autoscaler existed, to try to warn people about PGs being imbalanced, or having too few or too many, or something like that.
H
H
Yeah, we've seen this one with object skew happen, I think, just because there was something that was used a lot more than another pool, or something like that. I forget what the scenario was.
A
A
D
H
G
R
D
Yeah, I wonder if probably just a pass over what telemetry is collecting, making sure it has all of the relevant inputs that the autoscaler would be using.
D
But it could have a list of pools with, e.g., the min and max, the number of objects, the number of bytes, the number of whatever, all the sort of relevant inputs that would get fed into the autoscaler algorithm, minus the names and application tags or whatever.
H
D
Yeah, there might be some refactoring needed with the autoscaler and the balancer code, so that you could actually feed a telemetry input into the algorithm and see what it would do.
D
A
You're talking about the balancer, right? But the osdmaptool has an option to simulate some of this. If you feed it an existing osdmap, it will tell you what kind of balancing it would do, a dry run; it does that already, David added that at some point.
A
A
A
Anything else on this topic? Yeah, okay, on to the next one. This is about avoiding having cluster log messages go through Paxos and be stored in the mon.
A
E
H
This, but yeah, I guess fundamentally we've run into this a few times, where we've had a lot of extra detail coming into the cluster log, like slow request reports, causing the monitor to store them in the database and the database to eventually fill up, since we weren't matching the ingest rate with the deletion rate.
H
T
Oh yeah, sure. So the change that's been merged: basically what it does is dynamically change the amount of logs that are trimmed. Before this change we used to have an upper bound specified by paxos_service_trim_max; now, according to the log ingest rate, we just change the max accordingly.
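The behavior described can be sketched like this. It is an illustration of the idea (trim at least as fast as new entries arrive), not the actual LogMonitor change, and all names are made up:

```python
def trim_target(first, last, max_keep, base_trim_max, ingested_since_last):
    """How many log entries to trim this round (illustrative sketch):
    instead of a fixed paxos_service_trim_max upper bound, scale the
    per-round trim with the ingest rate, so trimming can never fall
    permanently behind ingestion."""
    backlog = (last - first) - max_keep   # entries beyond what we keep
    if backlog <= 0:
        return 0
    # Trim at least the fixed bound, and at least what just came in.
    return min(backlog, max(base_trim_max, ingested_since_last))

# A burst of 2000 slow-request messages raises the per-round trim
# above the old fixed bound of 500:
print(trim_target(first=0, last=10000, max_keep=500,
                  base_trim_max=500, ingested_since_last=2000))  # -> 2000
```

With a low ingest rate the function degrades to the old fixed-bound behavior, which is the point of making the max dynamic rather than replacing it.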
T
A
I guess, I mean, what this PR does is much better than what we had earlier, where somebody had to manually go change some setting, after the monitor DB had already filled up, to maintain the ingest rate versus the trimming rate. But I guess the bigger question is: is there a need for all of this to go through Paxos and be stored in the DB at all? Like, what is the historical significance of it?
D
The historic reason was just to have a consistent view of what the cluster log contained, and because everything was persisted through Paxos it was easy. I mean, I kind of like the simplicity of everything that the monitor stores always being consistent and always going through...
D
Paxos. It feels like if there are specific issues with that, then either it doesn't belong in the monitor at all, or we need to make it work right; like the ingest versus trimming, that's something that we just need to fix, right? That was really just, yeah.
D
I suspect that a fairly big part of the problem, maybe not, yeah, one part of the problem, is also just that the way the LogMonitor is implemented is not efficient at all. It has a log summary class or something that has the last 100 entries, and basically that entire structure is rewritten on every commit. The way it's persisted is just totally stupid, and that was just because it was expedient, I think.
L
But I think the way we are using the cluster log is wrong, because I think the cluster log is for human-readable messages which are very critical and which should get the attention of the administrator immediately, which of course call for human intervention at that very moment, instead of for some slow operations. I think the slow operations should be sent to a manager module, for example the alerts module you created.
L
It could be repurposed for collecting, for subscribing to the slow messages. For example, when the alerts module notices that there is some slow operation going on, it is supposed to collect the details from the daemons which have slow messages and log them into a local...
L
A local database, which is not hosted by the Ceph cluster. And later on, when the administrator logs on to the system using the dashboard or something, he or she will notice that something went wrong, and can correlate the data along with the timestamp and figure out what was going on by looking into the log or whatever, instead of looking at the cluster log. The cluster log is for when the system is going wrong or something big happens.
H
So the thing I like with the slow ops, though, is that that information is very helpful in the cluster log, because it's a fairly simple place to match up what is going on with the cluster state. I don't think we need to have all the details of every single slow request, which is what we currently do.
L
H
D
G
A
S
I would say that you need to know which PGs are experiencing slow ops, so even just a stat line in the PG stats would do the job; you don't even need to list the slow requests at all. Yeah, aggregated: just, this PG, as of this OSD map epoch or as of this reporting interval, is seeing slow ops. That's it.
S
A
I guess the idea is that if you don't have the ability to do live debugging, or to capture the state of the OSD at that moment, having the initial information of, you know, where the initial slow ops were or why they were there, in the cluster log, as an afterthought, has often been useful.
S
C
G
S
H
D
L
L
S
I mean, the managers could offer a general-purpose sampled logging mechanism without needing to go through the monitor's Paxos system. I'm saying we could have a logging system that isn't the cluster log, that serves the same purpose the cluster log currently does, but without screwing up the monitors.
G
D
H
D
I think it's nice because there's a `log last` command that dumps the last few messages, and actually one of the things that I wanted to do was change it. When you do `ceph -w` you can give it the channel name; I wanted to combine that with the `log last` command so that you could do an equivalent of a tail -f: give it the channel, the number of recent entries, and then also block and poll or whatever to follow the log.
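The tail -f idea could look roughly like this sketch, where `fetch_recent()` stands in for a hypothetical call along the lines of `ceph log last <n> <channel>`; the sequence numbers and polling loop are assumptions for illustration:

```python
import time

def follow_cluster_log(fetch_recent, poll_interval=0.0, max_polls=3):
    """Sketch of a tail -f over the cluster log: poll for recent
    entries, remember the last sequence number seen, and yield only
    the new entries. A generator, so the caller can stop anytime."""
    last_seq = -1
    for _ in range(max_polls):
        for seq, line in fetch_recent():
            if seq > last_seq:
                last_seq = seq
                yield line
        time.sleep(poll_interval)

# Simulated log source that grows between polls:
log = [(0, "osd.3 slow ops"), (1, "pool created")]
def fetch_recent():
    return list(log)

gen = follow_cluster_log(fetch_recent)
seen = [next(gen)]                     # first entry from the first poll
log.append((2, "health ok"))           # new entry arrives
seen += [line for line in gen]         # drain the remaining polls
print(seen)  # -> ['osd.3 slow ops', 'pool created', 'health ok']
```

A real implementation would more likely block server-side instead of polling, but the dedup-by-sequence-number shape would be the same.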
D
R
H
R
H
L
G
G
H
D
A
It'd be nice to have the summary, you know, as what Sage was suggesting: when it started happening and how many slow ops, yeah.
H
H
A
Okay, I think we're at the top of the hour. Let's move on to the next topics, maybe we'll have...
A
L
L
L
The first place is the legacy options file, legacy_config_opts.h, and another place is options.cc, which is where we can read it from using get_val. And the third place is the rst file, which is rendered into a Sphinx document. So I think the better way to do it is to add an option in a single place.
L
I propose a solution: we can add the option in a YAML file, with a predefined scheme which is flexible enough that we can even write some simple C++ code in it, and structured enough that we can use a Python script to extract the interesting information from it and generate the rst file. And the more interesting thing is that we could generate different option versions from a single source-of-truth YAML file.
L
For example, if a certain option is only consumed by the OSD, we can extract the partial options which are read by ceph-common, and the subset of options which are only of interest to the OSD; that could potentially reduce the memory footprint a little bit. And that's a subject that we've been thinking about anyway, to split Message.h into some smaller pieces, right, because some messages are never seen by an OSD.
H
L
F
G
A
D
One thing we probably want to think about, if we're going to the point where we're generating the rst docs: the other place where options exist is in pybind/mgr, whatever module.py. Yes.
D
L
Yes, we can do that.
L
G
D
I
H
L
I think we could start with some basic things. The first step is to convert options.cc to a YAML file and whip up a Python script to generate the .cc file and the legacy .h file from this YAML file, and then expand this script so it can generate the subset of the options and render it, using for example a Jinja template, to rst, and render that with Sphinx. Later on we can expand it by adding more labels and schemes.
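A minimal sketch of that pipeline, with a plain dict standing in for the YAML file and string formatting standing in for the Jinja template so the example is dependency-free. The schema fields and option entries are illustrative, not the final scheme:

```python
# The option definitions would live in a YAML file; a plain list of
# dicts stands in here so the sketch needs no dependencies.
OPTIONS = [
    {"name": "osd_max_backfills", "type": "uint", "default": 1,
     "services": ["osd"],
     "desc": "Maximum number of concurrent backfills per OSD."},
    {"name": "mon_max_pg_per_osd", "type": "uint", "default": 250,
     "services": ["mon", "mgr"],
     "desc": "Maximum PGs per OSD before a health warning."},
]

RST_TEMPLATE = """``{name}``

:Type: {type}
:Default: ``{default}``
:Description: {desc}
"""

def render_rst(options, service=None):
    """Render the option subset for one service to rst (the real
    script would use a Jinja template and feed the result to Sphinx)."""
    picked = [o for o in options
              if service is None or service in o["services"]]
    return "\n".join(RST_TEMPLATE.format(**o) for o in picked)

def render_cc(options):
    """Emit Option() constructor calls, the generated options.cc side."""
    return "\n".join(
        'Option("{name}", Option::TYPE_{t}, Option::LEVEL_ADVANCED)'
        '.set_default({default}),'.format(t=o["type"].upper(), **o)
        for o in options)

print(render_rst(OPTIONS, service="osd"))
```

The same per-service filter that drives the docs could drive the generated C++ subsets mentioned above, so both outputs come from the one source of truth.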
H
H
Yeah, thinking of our existing configuration documentation, there's a lot of expository text about the general category of options and then the options themselves. So if we do the filtering, we'd want to be able to keep that general text.
A
D
Yeah, so this was ignored for a long time, because we didn't have encrypted sessions between daemons and the monitor, and so any sort of automated key rotation would just start sending keys over the wire, which would actually be worse. But now we have messenger v2 secure mode, so no more excuses. I think that there are two big challenges.
D
The first is that we basically need a two-phase commit, because we need to update the key both in the monitor database and on the client, and if we update only one and not the other and the system restarts or something like that, then something can't authenticate, so we have to be a little bit careful. And then the other challenge is that it's one thing to do this such that you update the key and then restart the daemon, but that's kind of disruptive; it'd be nice to be able to update the keys without restarting the thing, the OSD for example, especially when you have warm caches and all the rest of it. So, at a high level...
D
There's, I think, one really big decision point here, and there are two options for doing that two-phase thing. Either the monitor keeps track of both the old key and the pending new key, and either key works for an interim period while the client is being updated, which means that on the monitor all the auth paths need to be updated so that if a client tries to authenticate with either the old key or the new key, they'll both work, which is sort of hairy.
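Option one can be sketched as a tiny state machine. This is purely illustrative; the real cephx paths are far more involved:

```python
class AuthEntity:
    """Toy model of 'option one': the monitor keeps both the old key
    and a pending new key, and accepts either during the interim, so
    a client being updated never hits a window where nothing works."""

    def __init__(self, key):
        self.keys = {key}          # currently accepted keys

    def begin_rotation(self, new_key):
        self.keys.add(new_key)     # phase 1: both keys valid

    def finish_rotation(self, new_key):
        self.keys = {new_key}      # phase 2: retire the old key

    def authenticate(self, key):
        return key in self.keys

mon = AuthEntity("old-secret")
mon.begin_rotation("new-secret")        # client update in flight
assert mon.authenticate("old-secret")   # a restart mid-rotation still works
assert mon.authenticate("new-secret")
mon.finish_rotation("new-secret")       # client confirmed the new key
assert not mon.authenticate("old-secret")
```

This makes concrete why a crash between the two phases is safe in option one: at every point at least one of the keys the client might hold is accepted.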
D
D
R
D
I think the goal is compliance. This is something that users ask for, and it also just seems like a bad security practice to have a key that's been in use for, like, three years, that's been sitting on a drive, and who knows whether it's been exposed during that period or not.
P
So that's not related to the daemon key rotation, the internal key rotation; a completely different topic.
D
P
You mean in cephx; there are the daemons themselves, you know, you have a service key.
D
H
I mean, the other stuff that would be separate would be the disk keys, like the LUKS encryption keys. Is there some kind of versioning? I thought there was some kind of version of LUKS where you could rotate those keys by having a key that's encrypted with another key, and you rotate the encryption key or something like that.
D
Yeah, I mean, we do have key encryption keys. Obviously you can't change the actual encryption key, but right.
G
D
Could be rotated, so probably a similar mechanism would need to be used in that case too. That one's probably a little bit easier, because there's only one place where the dm-crypt is started up, and so identifying the path that needs to try two different keys would be a little bit simpler. But I think even then it's a similar question of which...
P
What the OSDs and the clients authenticate against are the session keys, and these are based on the rotating keys, the temporary ones, that are every 30 minutes.
D
Yeah, I think we would automate the OSD keys and MDS keys and manager keys; those would be the ones that would be easier to automate. Clients are harder, because you don't necessarily have control over how the client is being used, although I think it might be possible to do that too.
D
There's a, I have a question down here about that. But right, so I guess there are these two options, right: either you have the old key and the new key both available on the monitor, and the client will use whatever key it has, and if it's old it'll work and if it's new it'll work; or you flip it around, and you have...
D
I'm kind of leaning toward option one, having it done on the monitor, because that's sort of the way we can narrow down the cases, and it might even be that the idea of having multiple keys associated with the same auth entity might actually be a good thing in general, too. I'm not really entirely sure.
D
Maybe not, maybe that's silly, but it seems like it's a little bit tidier that way. But that's sort of the first question to answer. The next is around, maybe not a question, but the next real challenge is making the MonClient and the auth protocol such that you can...
D
Actually do a key rotation, so that when you're renewing your ticket you could try using the new key and it would transition seamlessly from the old key, so that an existing session would be able to transition from an old key to a new key without being interrupted. And that, I think, needs a more careful read of the auth code to see exactly how it would be implemented. I don't think Delia is here.
D
Unfortunately. But we probably need to brainstorm on that, figure out exactly how it would work, and that's assuming that we actually want to be able to do this without restarting daemons. If we think it's tolerable to just, if you rotate every six months, for example, restart daemons as you do it, then you can do sort of a simpler solution.
H
D
Yeah, so I think it's a good idea to teach the MonClient, basically, and the auth protocol or whatever, how to transition from an old key to a new key while it's online. And so there are some hypothetical scenarios here: the client would generate a new key and then install it on the monitor, and then it would re-authenticate using the new one, and then the monitor, perhaps as soon as it sees that...
D
D
There are a few things. There's sort of another question of how you actually update the key on the client. Like, would you want cephadm to update the keyring file, or would you want the process, the MonClient, for example, that you just told to go do this sort of online key change, to rewrite its own keyring file? And in the case of BlueStore, you also have to remember that the keyring file is actually on a tmpfs.
D
D
And then the last challenge is around kernel clients, because that's sort of a separate implementation and it's a little bit more awkward. Because, you know, mount.ceph probably needs to be able to accept, if you go with the two keys on the client, it needs to have those keys; but the kernel is not going to rewrite its keyring file, so you're going to have to have some other helper utility or whatever.
D
That runs on the client host, that pokes sysfs or whatever to tell ceph to go change its key, and then updates wherever the keyring file is stored, in /etc/ceph or whatever it is. So it'll be a little bit tricky. We need to make sure that we standardize the way that those kernel clients are managed; right now I think mount.ceph doesn't take a traditional keyring file, which has sort of been a feature ticket.
D
Q
No, it does now, Sage. Oh, it does? It actually reads the config file, the Ceph config, and then uses the monitor IPs and also the default keyring. Oh good.
D
D
D
D
D
I was going to say, I think that probably the practical implication of this, having the daemon rewrite its keyring file, is just the way that the permissions are set up; we need to make sure that the file is writable by the process that runs it, or whatever.
D
Q
We don't allow access to /etc, or you can get read-only access to /etc, but I don't think we do that with cephadm. In fact, this is kind of off-topic, but I'm not sure what kind of protections we have in the system for the files we generate with cephadm. Maybe that's something we should look at, because there's a divergence there.
D
D
In general, daemons don't use /etc/ceph at all. The only time cephadm uses /etc/ceph is when you use the shell command, and that's mostly just so that, if there is an /etc/ceph on the host, that's sort of the default cluster that the cephadm shell will bring up, so you don't have to pass all the arguments all the time.
D
D
Okay, well, I guess the action items are probably just, Neha, you could check the source and see if there are any additional constraints, and then I'm going to lean towards seeing if option one is feasible. If there is a small enough number of paths that the monitor can do it, I think that's probably going to make a little bit more sense, and then...