Description
Lightning Talk: Exascale Shoal Architecture in Ceph with Kyle Bader of Red Hat.
Filmed on October 28th, 2019 in San Francisco.
So, Kyle Bader. I'm with the storage BU at Red Hat. I've been working on this stuff for a while, and one of the things that has been coming up recently is: how do we take Ceph, and take object storage, to the next level of scale?
I have this old Ceph logo from like 2007 on here; we used to say it was petabyte-scale storage. Well, that's not cutting the mustard anymore. So how do we adapt?
Some of the largest tests we've done were with maybe 10,000 OSDs. Every now and then CERN, the same place that collides the particles, will say: "Hey, we're getting a new shipment of hardware in. Do you guys want to run some tests on it for three weeks?" So we've done this a series of times. The most recent one was with 10,000 drives: we built a 10,000-drive Ceph cluster, ran it for a few weeks, and each of these times we've pushed it to make a bigger cluster.
If you look at the highest-capacity drives you can get these days, they're around 16 terabytes, and if you were able to have 10,000 of them, you're going to get on the order of 160 petabytes, which is a really big data store. But we have customers that are starting to say: "Hey, I want 100, I want 200 petabytes of storage."
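The capacity arithmetic above is easy to sketch. This assumes 10,000 drives at 16 TB each and decimal units (1 PB = 1000 TB), and counts raw capacity only, before replication or erasure-coding overhead:

```python
# Rough raw-capacity estimate for a hypothetical 10,000-drive cluster.
drives = 10_000
tb_per_drive = 16                       # ~16 TB, the largest drives at the time
raw_pb = drives * tb_per_drive / 1000   # 1 PB = 1000 TB (decimal units)
print(f"{raw_pb:.0f} PB raw")           # 160 PB raw, before redundancy overhead
```

Usable capacity would be lower once you subtract replication or erasure-coding overhead.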
What does this look like? They also want really demanding throughput, particularly if they're going to be using it for machine learning applications, where they're saying: "Hey, I'm going to have all these cars pumping data into this thing, on the order of petabytes per day, and then every so many months I need to go through all of the data and retrain with the fresh data."
So you're looking at a very serious amount of throughput demand, and we were thinking: okay, how can we solve this? How can we cater to these really demanding use cases where people need hundreds of gigabytes per second of throughput and 200-petabyte clusters? What can we do?
One of the things that was done a number of years ago: we were working with the Yahoo folks and helped co-develop an architecture where they had multiple Ceph clusters backing the storage for Flickr. I think this same sort of architectural approach is still relevant today: you can create an architecture that's kind of like a shoal, like a group of squids.
So what can you get out of this? In terms of sub-clusters, we were seeing what sort of throughput you can get in and out of a relatively modest-sized cluster, and then you can extrapolate. Even with a relatively modest cluster of about 700 spindles, we were able to do a little bit over a petabyte in 24 hours, which validates the use cases people are describing.
We tested, I think, up to 250 million objects in a single bucket, which is a lot. You can see that after an initial step in latency, where we internally shard the bucket metadata, we had very consistent latency, even with 250 million objects in a single bucket, which we considered to be a lot.
But we also wanted to test at global scale (this is a lightning talk, so quickly): let's put billions of objects into the cluster. And so we did. We put over a billion objects into the cluster and observed how performance changed over time. The red bar is the latency, and the blue bar is the object population.
Our latency stayed relatively flat until the point where we were taking up all of the SSD with our metadata and slowly starting to spill some of it over to disk. You can think of it like a cache miss: as less of the metadata was on SSD, more of it had to come from hard disk, which is why the latency was starting to creep up over time, and similarly for reads.
With those individual sub-clusters out of the way, what would a shoal look like? Well, in a Ceph multi-site topology you have these ideas of zones and zone groups. Zones and zone groups were originally put into place in order to do replication between them, but that doesn't necessarily have to be how you use them.
By creating a realm, which is like an S3 global namespace for buckets, a bucket can live in exactly one zone group. You can potentially configure multiple zones in each zone group and do replication between them. But if you only have one zone in each zone group, you're basically just partitioning the namespace of buckets, so each bucket lives in exactly one zone group, and when you're interacting with the object store you don't necessarily need to know about any of this.
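As a rough sketch of how such a partitioned namespace could be set up with `radosgw-admin`: one realm, one zone group per cluster, a single zone in each zone group, and no replication configured between them. The realm, zone-group, zone, and endpoint names here are hypothetical.

```shell
# Create one realm: the global S3 bucket namespace shared by all clusters.
radosgw-admin realm create --rgw-realm=shoal --default

# On cluster 1: one zone group containing a single zone.
radosgw-admin zonegroup create --rgw-zonegroup=zg1 --rgw-realm=shoal \
    --endpoints=http://s3-zg1.example.com --master --default
radosgw-admin zone create --rgw-zonegroup=zg1 --rgw-zone=zg1-a \
    --endpoints=http://s3-zg1.example.com --master --default

# On cluster 2: a second zone group, again with a single zone.
radosgw-admin zonegroup create --rgw-zonegroup=zg2 --rgw-realm=shoal \
    --endpoints=http://s3-zg2.example.com
radosgw-admin zone create --rgw-zonegroup=zg2 --rgw-zone=zg2-a \
    --endpoints=http://s3-zg2.example.com

# Commit the period so the new topology takes effect.
radosgw-admin period update --commit
```

Because each zone group has only one zone, no data replication happens between clusters; the realm just partitions which bucket lives where.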
It's DNS, right? If you go to your s3.example.com, it's going to map to the main IP address, or if you use path-style access you're going to go through s3.example.com, and in the most simplistic sense you could just round-robin this across the clusters. But you could also potentially do something more sophisticated.
For example, in HAProxy, where you have some sort of Lua-based routing that actually looks into the cluster to find out where a bucket lives. I don't know if anyone is familiar with an SDN controller, but it's kind of like a first-packet approach: the first packet gets resolved by the control plane, which then embeds something in a lookup table so that subsequent packets don't have to go through it.
So say Kyle is the name of my bucket: I go to kyle, you go to the DNS, and you'd have a DNS plugin. Right now someone has written one for PowerDNS that talks to Ceph and says, okay, this bucket is in this one, and then it responds with the record that routes you to that cluster.
Well, you could do the same thing, and I think where you could take this is to put it into CoreDNS. Then, if you had an operator deploying multiple clusters inside of OpenShift, it could automatically configure all of this wiring and route the traffic appropriately. So that was my quick little talk. Thanks.
Thank you very much.