Description
Led by: Kevin Hrpcek
Ceph Month 2021 schedule: https://pad.ceph.com/p/ceph-month-june-2021
A
All right, well, I guess it's time for the birds of a feather for Ceph and research and scientific computing.
A
I'm Kevin. A group of us kind of just do this every other month, and that's on the Ceph community calendar if anybody ever wants to join in on that, or you can just contact me and I'll add you to my email list as well.
A
Otherwise, there's a pad in the chat if you want to add topics, anything. It's a birds of a feather, so I have no presentation. There are no set topics, just ideas; it's whatever we want to talk about with Ceph and scientific computing.
A
So if anybody's interested in sharing, if they have interesting use cases they're doing with Ceph, or fun experiments they've done to, you know, push it to its limits and make it work for what they're doing, we'd love to hear about it.
B
One thing I won't say much about, because next week our new fellow Arthur, who's on the line, will give a presentation on it, is RBD mirroring.
B
S3-wise, well, Enrico's on the line; I don't know if he wants to say something about S3, maybe we already just covered that. FS-wise, we've been doing some tests. We're adding a new CephFS region and, until now, we've never had snapshots; we've never used snapshots at any scale, and for this next cluster we want to enable snapshots from the beginning.
B
So we've been kind of stress testing it to see what the limitations are, and we learned a couple of things that maybe people already know, maybe this is obvious, but we didn't know. What we did is we just untarred Linux a bunch of times, hundreds and hundreds of times, and then we were taking snapshots and deleting things, and we realized that deleted files go into the stray directory, just like deleted hard links.
B
So there's a limit in Nautilus and Octopus of something like one million files, and after you've taken snapshots of one million files and then deleted them from the head, you can no longer remove files which are in snapshots. The good news is that in Pacific this is fixed, because the stray directories can be fragmented.
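For reference, a minimal way to watch the stray count being described, assuming you have access to the MDS admin socket (the daemon name is a placeholder):

    # num_strays counts deleted-but-still-referenced inodes parked in the
    # stray directories; in Nautilus/Octopus this is capped around 1 million
    ceph daemon mds.<name> perf dump mds_cache | grep num_strays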
B
So I think we've come to the conclusion that for this big cluster we're going to need to upgrade to Pacific in order to go production with that. And the other thing we realized: we started deleting all of these hundreds of snapshots, covering millions and millions of files, to then trim the snapshots, and we didn't see any progress for hours in the snaptrim PGs. The cluster was stable, but there was no progress.
B
So then we looked into how it's implemented, and in fact you only delete something like 15 files per second from each OSD with the default configuration, so this is actually really slow. So if we have a lot of churn, the PGs will be in snaptrim, like, all the time. We were able to change the configuration, though; I think we had something like 50 million files that needed to be trimmed, and we managed to trim them within half a day.
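As a sketch of the kind of configuration change involved (the values here are illustrative, not necessarily what was used):

    # osd_snap_trim_sleep adds a delay between trim operations; lowering it
    # (and raising per-PG concurrency) speeds up snap trimming at the cost
    # of more load on the OSDs
    ceph config set osd osd_snap_trim_sleep 0.1
    ceph config set osd osd_pg_max_concurrent_snap_trims 4
    # watch progress: PGs should move out of the snaptrim states
    ceph status | grep -i snaptrim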
B
Sorry, this is on Octopus. Arthur, this is an Octopus cluster, yeah? Yeah, this is Octopus, I'm wrong. It was 15.2.13.
C
Yeah, I mean, I think in principle what the new scheduler should do on the OSD is that it should, you know, go at full speed, but at lower priority than client work.
C
I'm not sure if they looked at this specifically, so this would probably be a good, like, micro benchmark test or whatever: queue up a whole bunch of snaptrim work and then make sure it makes good progress on an idle cluster and also, I guess, in competition with a client workload.
B
No, we're just upgrading the first clusters to Octopus now; we're still on Nautilus for most things. There was that long thread about the cephadm containers stuff; we didn't chime in on that because we don't have anything useful to comment there. So this week we started playing with cephadm so that we can have more, like, tangible feedback.
C
Yep, okay, on the time synchronization: my recollection is that all cephadm does is make sure that either the ntpd or chronyd systemd unit is turned on, but it doesn't try to configure it for you. It just makes sure.
C
All your Puppet stuff should really just make sure that Docker or Podman is installed, that time synchronization is turned on, that the lvm2 package is installed, and that Python 3 is available. That's really basically it.
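A minimal sketch of that host preparation, assuming an EL-family distribution:

    # cephadm's only real host prerequisites: a container runtime,
    # time sync enabled, LVM, and Python 3
    dnf install -y podman lvm2 python3
    systemctl enable --now chronyd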
B
Yeah, the one other thing I just remembered: we have to do everything with Stream 8 now, so we're using Stream 8, and there's an issue with Podman in Stream 8. It uses Podman 3.1 and there's some kind of bug; I don't remember, some caps, CAP_SYS-something, I forgot. There's a BZ, a Bugzilla, about this.
C
Right, so I believe that's fixed; it's certainly fixed in Pacific, and I'm pretty sure we backported that fix to Octopus too, but yeah, basically.
C
Yeah, it depends. When you use the bootstrap script, that's used for, like, starting up the initial daemons, but everything else is deployed by the manager and it will redeploy things, and so it matters more what container image you use. So if you're installing Octopus, it might have still... anyway, the bug that I'm thinking of was that if you specify privileged to Podman and also cap-add, it gives you an error, as they're redundant or mutually exclusive or whatever.
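The failure mode being described looks roughly like this (the image name is just an example):

    # podman 3.x rejects --privileged combined with an explicit --cap-add;
    # it fails with an error like the comment below
    podman run --rm --privileged --cap-add=SYS_ADMIN docker.io/library/alpine true
    # Error: invalid config provided: CapAdd and privileged are mutually exclusive options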
D
On the upgrade question: we haven't gone anywhere near Pacific yet, but we did just recently upgrade our big production cluster from Luminous to Octopus. I mean, we stopped briefly at Mimic, but it was a sort of one-day process.
D
We started the day at Luminous and ended the day at Octopus. It's about a 20 petabyte raw capacity cluster, and we got all that upgraded without, I think, any disruption to service, which was quite nice, and we're now going to probably look at Pacific on our test cluster. We did encounter almost an outage, though: the CVE fix about insecure global_id reclaim.
D
So, on our test cluster, we got the warnings about, you know, you've got clients using it and your mons allow it. We upgraded all the packages and we restarted all the Ceph daemons, and then the cluster said you don't have any clients using the old behaviour anymore, so we disabled it, and then our OpenStack started misbehaving. The thing that isn't very obvious from the documentation is that you have to restart all of your virtual machines.
D
...having done the upgrade, before you can safely turn off the insecure global_id thing. And I think there could have been a bigger, scarier flag in the documentation about that, because I thought, you know, we wait till the cluster says you aren't running clients using the old behaviour anymore, and it's safe; and it turns out that's not quite the case.
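For reference, the knob and health warnings in question; the order matters, since clients (including long-running librbd clients inside VMs) must reconnect securely before reclaim is disabled:

    # list anything still reclaiming global_ids insecurely
    ceph health detail | grep AUTH_INSECURE_GLOBAL_ID
    # only after upgrading and restarting all clients, including VMs:
    ceph config set mon auth_allow_insecure_global_id_reclaim false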
F
You have to restart those virtual machines that are using the old libraries, yeah, yeah. That's really important for OpenStack environments.
C
Okay, I'm surprised the warning went away even though you still had those clients, you'd think.
D
Yeah, it might just be that none of the machines were doing anything much at that point, because it's just a test cluster, so sometimes there's very little activity from the OpenStack clients there. So I don't know if it was that or something.
C
Oh yeah, yeah, it has to be, like, an active connection. So if the virtual machine is actually running, then the warning should have shown up. But if, like, Cinder has a Ceph credential that isn't actually, like, issuing any commands or something like that, then it won't show up.
A
Matthew, when you did your upgrade, you run, like, Debian or Ubuntu, right?
D
Yeah, and the way Canonical produce packages, you can keep the Ceph upgrade and the underlying operating system upgrade distinct. So we're going to upgrade to 20.04, probably later this year. That won't be me, because I'm leaving the Sanger, but that's another thing. But we can now upgrade from 18.04 to 20.04 and the Ceph version will stay the same. So that's quite...
F
Did you change any kind of defaults? Because I bet that with that size of cluster you have some specific parameters to tune the cluster. So, did you see anything when you were upgrading from the older one to the newer release, any timeouts or things like that?
D
I don't think we had to make any specific tuning changes between Luminous and Octopus. We use ceph-ansible, and obviously our ceph-ansible needed quite a bit of reworking; and then ceph-ansible changed what the RADOS Gateway daemons are called, which is a bit annoying, so that meant we had to be quite careful about starting and stopping and restarting the daemons with the new service names.
D
But I don't think we found particular tunables that we needed to change so far from Luminous to Octopus. I mean, some of that will be ceph-ansible setting some sensible defaults for you. Other than that, I think we kept the tunables we set previously and haven't really done that much with them, at least so far.
D
There's a bug in the version of Octopus we've got deployed on our production cluster that means turning on telemetry doesn't work, but it's because it's a slightly old version of Octopus, and Canonical have now rolled out a new version of the Octopus packages that will fix that. So that's slated for next week's at-risk period, and thereafter we are planning on turning telemetry on.
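For anyone else planning the same thing, enabling telemetry is roughly:

    ceph telemetry show                      # preview exactly what would be reported
    ceph telemetry on --license sharing-1-0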
C
Are you... you might have said this and maybe I missed it, but are you planning a transition to cephadm now that you are on Octopus?
D
I don't know yet. We use Ansible for everything, and so using ceph-ansible to manage Ceph makes quite a lot of sense, because we have ceph-ansible and then we've got some of our local roles, and we do everything like that. So I don't know; I think not immediately. It's one of those things, you know: we've got a lot of experience with ceph-ansible now and we're quite comfortable with it.
G
So that took me a little while to be brave enough to do, and afterwards it was fine and they flushed immediately, which was a bit strange, and I'm also not quite sure what we did, other than... So we've got a CephFS; it's not particularly massive, but I was doing another thing with a 40 terabyte file space with millions of files. So maybe the MDS got too busy in some way, and we've got two active MDSs.
G
So that right now is a bit strange. And the other bit that's a bit more exciting is we get to switch off our cluster over the weekend for some power upgrade, and so I figured that what we are going to do is switch off all the clients first, disable CephFS, and then follow the instructions for switching off the cluster?
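The documented procedure being referred to boils down to setting the cluster-wide flags before powering down (and unsetting them afterwards), something like:

    # stop clients first, then freeze cluster state
    ceph osd set noout
    ceph osd set norecover
    ceph osd set norebalance
    ceph osd set nobackfill
    ceph osd set nodown
    ceph osd set pause
    # power off OSD nodes, then MDS/MGR nodes, then MON nodes last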
F
Which Ceph version were you running?
G
Nautilus.
C
Yeah, I mean, the way that the MDS works with the log segment trimming, there's basically just, like, a reference count that has to go to zero before it can drop the log segment from memory. And so there's probably just some really subtle reference counting issue where it's not letting go of that particular segment, and just restarting the MDS clears that out, because it's sort of reading it fresh from the...
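A sketch of the workaround being described, assuming rank 0 is the stuck MDS:

    # the health warning looks like: MDS_TRIM ... Behind on trimming
    ceph health detail | grep -i trim
    # failing the rank makes a standby take over and replay the journal
    # fresh, clearing the stuck log-segment reference
    ceph mds fail 0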
C
Yeah, yeah, one of those, like, mostly harmless bugs that are really hard to track down, probably, but who knows. I wish Patrick was on, because he would probably remember whether there was anything that's been fixed there recently or whether this is something that he's...
A
I
guess
a
general
update
from
me
is
we're
considering
switching
from
using
our
web
radars
safa
sas,
we'll
probably
make
a
few
people
on
this
call.
Happy
we've
been
doing
some
performance.
A
Testing
of
you
know,
throwing
our
three
or
four
thousand
cores
at
it
and
whatnot,
and
it's
been
looking
pretty
good
to
support
processing
just
straight
on
this
ffs
volumes
without
copying
data
to
like
the
local
processing
nodes-
and
you
know
it
seems
like
a
file
store,
this
bug
report
or
whatever
out
there
to
deprecate
it
in
a
couple
versions.
C
Yep, yeah, the, like, percentage of FileStore OSDs is sort of steadily dropping, but there's still quite a lot of them. I think it's...
C
If you look at the telemetry, it's only like 20 clusters, I think, total, that have telemetry on and are still reporting FileStore. So...
C
It used to be the plan that, once we had sort of orchestrator support for the OSDs with cephadm, then we could use that to automate the transition from FileStore to BlueStore.
A
Yeah, pretty much. I think what we'll end up doing, since we're still on Nautilus, is I'd like to at least get up to Octopus, and then take the OSDs on a server or two, pull them out, rebuild them as BlueStore, and build a new CephFS on that with erasure coding. Then just start the slow transition process by moving from librados onto the CephFS and, at the same time, just start shrinking the FileStore OSDs into BlueStore ones and coordinate it that way.
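A per-OSD sketch of that shrink-and-rebuild loop (the OSD id and device are placeholders):

    ceph osd out 42
    # wait for data to migrate off and the cluster to return to HEALTH_OK
    systemctl stop ceph-osd@42
    ceph osd destroy 42 --yes-i-really-mean-it
    ceph-volume lvm zap /dev/sdX --destroy
    # recreate the same OSD id as BlueStore
    ceph-volume lvm create --bluestore --data /dev/sdX --osd-id 42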
A
I'm not sure what the CentOS Stream rebuilds are or what version they're packaging; we just use the primary repo.
F
I
have
a
question,
may
maybe
someone
here
can
enlighten
me?
There
is
a
some
bugs
or
missing
functionalities
in
nautilus
and
in
pacific
current
back
porch,
and
there
is
a
on
the
both
things
that
I'm
waiting
or
I
would
like
to
get
on
on
production.
F
There
is
a
code
for
for
it
in
the
master,
but
bringing
that
on
on
a
on
a
test
environment,
for
example.
So
back
porting
yourself,
it's
crap
getting
really
complicated,
because
the
code
base
in
rados
gateway
in
a
rudder's
gateway,
for
example,
is
is
getting
at
least
for
me.
I
I
see
that
that
is
difficult
to
backport
or
you
just
would
like
to
well.
F
We
discussed
that
earlier
and
I
think
that
walowski
did
some
patches
on
a
master,
but
they
they
are
in
a
limbo.
Currently,
I
cannot
test
them
because
I
don't
know
how
many
different
parts
I
have
to
back
port
in
order
to
get
them
on
on
a
my
test,
environment
or
how
to
compile
them
properly
without
pulling
them
from
the
full
cef
development
code.
F
C
Yeah, it looks like Casey's not on. I mean, I think the larger context here is just that there's been a lot of refactoring going into RADOS Gateway recently, and so there's a lot of churn there every release, and so backporting in general is hard.
F
If
there
is
a
refactoring
of
rados
gateway,
is
there
any
possibility
to
bring
those
brothers
gateway,
functionalities
on
all
kind
of
stable
releases
with
the
same
time?
So
I
know
the
nautilus
is
going
away,
but
still
on
octopus
and
pacific,
I
would
like
to
see
the
same
code
base
that
we
have.
We
are
running
in
master
because.
E
H
C
B
We use it only... we don't use it for... so we have everything integrated with OpenStack via magic, which I don't understand. We use the Swift interface only because the users can query their bytes used and quota through the Swift API, but they can't get that through the S3 API.
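For example, with Swift credentials exported, account usage and quota come back as plain headers (the values here are made up):

    swift stat
    #   Bytes: 123456789
    #   Meta Quota-Bytes: 1099511627776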
D
I
happen
to
know
that
wikimedia
used
the
swift
interface
for
the
rattles
gateway
for
their
media
internally,
even
though
they
don't
run
openstack.
I
don't
know
why
they
use
it
rather
than
s3,
but
that
that's
primarily
what
they're
using
set
for.
F
F
So you would... you wouldn't give your credentials on an HPC environment, like, fully open, like with S3, where you have to give your passphrase somewhere. But with Swift you authenticate yourself and then it's active for a while, and then it's gone. And that's my second challenge: I would like to get that STS working better on the Ceph RADOS Gateway.
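A hedged sketch of what that looks like against RGW's STS endpoint (the endpoint and role ARN are hypothetical):

    # exchange long-lived credentials for temporary ones; the returned
    # AccessKeyId/SecretAccessKey/SessionToken expire on their own
    aws --endpoint-url https://rgw.example.com sts assume-role \
        --role-arn "arn:aws:iam:::role/S3ReadOnly" \
        --role-session-name hpc-job-demo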
I
Pcc,
this
is
a
computing
center.
In
poland,
we
are
keeping
swift
interface
because
some
users,
actually
internal
users,
especially
the
digital
repository
people,
started
to
use
c
suite
like
five
years
ago,
and
then
they
don't
want
to
stop.
We
are
killing
this,
the
original
swift
instance
and
moving
people
to
self-based,
swift
and
actually
partially
done,
but
still
it's
hard
to
them
to
get
rid
of
swift
because
of
some
software
dependencies.
I
F
C
I
didn't
realize
that
wikimedia
had
switched
over
to
using
lyft
on
stuff.
Last
time
I
talked
to
them.
It
was
probably
like
eight
years
ago
or
something,
but
they
had
they're
actually
deploying
both
proper.
A
I
guess
regarding,
like
our
you
know:
bi
monthly
get
together,
I'm
kind
of
leaning
towards
skipping
july,
since
a
lot
of
people
on
vacation-
including
me,
probably,
and
maybe
we
just
do
august
or
september-
for
the
next
memory
chats
send
out
my
usual
emails
and
let
everybody
know.