Description
A detailed look at Medallia's approach and implementation of database workloads on top of Ceph. This presentation was run as a part of Ceph's monthly Tech Talk series (http://ceph.com/ceph-tech-talks/)
Associated slides: http://www.slideshare.net/Inktank_Ceph/2016jan28-high-performance-production-databases-on-ceph-57620014
A: So, well, I'm here today to talk about high-performance production databases running on Ceph. I'm an architect at Medallia. For those of you who don't know us: we collect, we analyze, and we display terabytes of structured and unstructured data for our multibillion-dollar clients, and we do all of this in real time. Now, I've been here since 2010, and we've grown from 70 to 700 employees. The reason I'm here today is that we actually run real, live, high-performance databases in production, with customer data, on Ceph. Our journey there has actually been a little bit of a long one. It started about
a year ago. We have an in-house analytics engine which is really high performance. We had a new version of it that was even higher performance and scaled to thousands of servers, and then we took a peek at our production environment, which was hundreds of servers,
not thousands. All these servers had individual names, and all of them were individual, tiny, precious little snowflakes: they were all entirely different. All the services running there had been manually placed somewhere, and so had the servers themselves. It literally was: "there seems to be space in this rack, therefore this is where the server goes; there's no connectivity on this switch, but there's connectivity on a switch over there, so let's have a long cable."
This quickly ended up with all the snowflakes in an environment where, as the mantra goes, "don't touch it." And the most precious things we had were the database servers. They were really "don't touch it": don't upgrade the BIOS, don't upgrade the firmware. If there's a critical bug fix for the storage controller, well, applying it will imply downtime, and there's a possibility that along with the critical bug fix comes a new bug.
So do you really, really need to apply a bug fix that may or may not cause data corruption? It was clear we needed to do something. In fact, we needed to do quite a lot of something. So we said: OK, instead of incrementally going step by step by step, we're going to jump directly into the future. We're going to skip ahead two or three generations and go directly to the next generation. That means we looked at microservices, we looked at containers, we looked at, well, pretty much every industry buzzword out there.
We did set up a proof of concept. Our proof of concept uses 40-gigabit networking end-to-end, all the way down to the server, and it's non-blocking. It's open networking. We use Ceph for the storage layer and Docker for the containerization, and this proof of concept really, really blew us away. We did have to modify Docker and tune the networking in Ceph a little bit to get to the level we wanted, but in the end it was so resilient that it was actually a major problem for us to test the resiliency.
We had to kill so many servers to get to the point where it does not survive that we actually ran out of capacity before the resiliency gave out. And it was extremely performant — in fact, performant enough that it was on par with our current dedicated database servers, and those are purpose-built beasts that run databases as fast as they can. Now, when the performance is at that level and the resilience is at that level, the question becomes: can we run everything we have on this new infrastructure?
Now, to do that, we went and said: OK, we're going to need some design principles here. First of all, we want commodity products: that means we want commodity components, and we want supported open standards. We really like things where we can go look at the source code. While that doesn't always allow us to actually fix issues, it gives us the illusion that we can, it gives us a sense of control, and it gives us a little bit of insight into what a company or provider actually thinks about code quality. We want fully automated provisioning and reinstalls.
We want things that are cheap and scalable. And "cheap" here doesn't mean I want to spend less money; it means I want more compute power, or more storage capacity, or more performance, or more networking for every single dollar that I spend. And we want to be scalable: we really want something where, if we need more capacity, if we need more performance, if we need more of anything, we just add more servers to it.
Lastly, we want immutable servers, and that really goes towards: once a server is up and it's serving traffic of some kind, it shouldn't change. If you've ever done production-level debugging and figured out that somebody had upgraded the JVM, but the JVM that was running when the application was started was a different version — these are not so fun to debug.
We also very much wanted a setup where there are no special machines: no magic appliance sitting in the corner, no little thing sitting over there that everybody knows is essential but nobody really knows what it does. All of those have to go. There shouldn't be a single service that is tied to specific hardware. That means every component must be able to run anywhere, as much as possible.
We run redundancy at the software layer instead of having individual servers with double power supplies, double networking, and more layers of RAID. If you really stop to think about it, these things will inevitably, sooner or later, fail, regardless of how many cables you've purchased for the box. So if you design for failure — accept that it is going to fail sooner or later, design around the failure, and design for self-healing — you end up with something that is ultimately much, much easier to maintain.
The most important principle, though, is: keep it simple. Because if the design for your entire infrastructure is simple, then it's also simple to fix, simple to diagnose, and simple to understand. If somebody new joins your organization, it's simple to explain to them how everything works, so that they will be productive in a very short time, and you don't end up in the case where, when something breaks,
everybody is looking for a consultant to come fix it, because nobody actually understands it, because it is complex. And mostly, simple means that in a short presentation like this one, I can actually explain the major design goals and the major components of how it works, hopefully to a level that someone else can, you know, replicate. Now, we ended up with a standard rack. Our standard rack is 22 compute nodes; the compute nodes are Intel CPUs running Linux, with memory, 40-gigabit networking, and a small SSD.
This SSD is used mostly for the OS and the containers. Our storage nodes are also Linux with an Intel CPU, some memory, the same 40-gigabit networking, a lot more storage, and PCIe NVRAM for journals. The NVRAM for journals is being rolled out right now, actually, as we speak, so by the end of the day, or tomorrow, we should have that up and running — one of the great advantages of Ceph: you can do on-the-fly updates. I love it. On the networking side, we have three switches per rack.
These are also Linux, with Cumulus, and they're also Intel. Granted, it's a lot less powerful a CPU, but it's the same concept — a binary is portable across all of them — it has memory, and it has a lot more networking. In fact, if you log into these switches, they look like any other server that just happens to have 32 network cards. So this is unified.
Now, the challenge, if you're running everything as containers, is really: where do you draw the line? You can run your application in a relocatable container, and in fact that's been done quite a lot. You can run your load balancer in a relocatable container; that has been done, but it's a little bit more challenging. You can run your DNS server in a relocatable container; that actually turns out to be quite challenging. And finally, you can run your database in relocatable containers — and running any database in relocatable
containers is a major problem, because a database sort of requires resilient storage, and it really wants the storage to be the same after a power loss. Also, unlike my application — my application talks to ZooKeeper and has no problem doing dynamic discovery — my database, Postgres, has no concept of what ZooKeeper is. It talks IP addresses.
So there are some challenges we needed to overcome. The first one is networking. Let's follow the life of a web request. A web request comes in and hits the datacenter firewall. The firewall forwards it to the host where it's been told a load balancer is running — in this case nginx. Now, our nginx is modified to go talk to ZooKeeper, and ZooKeeper holds the information "here is the application for this URL," so the request is forwarded there. Now, at some point
In
the
past
application
talk
to
the
zookeeper
with
this
basic
set
up,
which
you
can
just
get
off
the
shelf
almost
anywhere.
Your
application
can
be
relocated
anywhere
because
it
doesn't
really
matter
what
the
IP
address
for
it
is
it's
gonna
talk
to
the
zookeeper
drop
that
itself
when
it
boots.
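In outline, that self-registration is just an ephemeral znode. Here's a minimal sketch using the kazoo Python client — the connection string, znode path, and payload layout are illustrative assumptions, not Medallia's actual scheme:

```python
# Minimal self-registration sketch using the kazoo ZooKeeper client.
# The connection string, znode path, and payload layout are
# illustrative assumptions, not Medallia's actual schema.
import json
import socket

from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1.example.com:2181,zk2.example.com:2181")
zk.start()

# An ephemeral node vanishes automatically when this process (or its
# ZooKeeper session) dies, so the load balancer stops routing here.
payload = json.dumps({
    "host": socket.getfqdn(),
    "port": 8080,
}).encode()

zk.create(
    "/services/myapp/instance-",  # sequence=True appends a counter
    payload,
    ephemeral=True,
    sequence=True,
    makepath=True,
)
```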
But what happens when one of those servers dies?
Generally speaking, having automation that updates your datacenter firewall from the network itself is not really something your security group is going to enjoy. And for ZooKeeper: how do you find ZooKeeper if you don't use IP addresses, and what if the ZooKeepers change IP address? Well, ZooKeeper does have quorums — you have five of them — but sooner or later, if you just wait long enough, you will have lost every one of them. So you could say, "I'll find ZooKeeper through DNS."
What we use to do this propagation of IP addresses — which you need in order to figure out where an IP address is located right now — is OSPF. You could also use BGP; we just picked OSPF because it is fantastically easy to set up. OSPF is a link-state database. It's supported by every vendor, and this support by multiple vendors is important to us. Of all the components we looked at for OSPF — all the major storage providers, network routers, and quite a few of the whitebox providers — every single one has a working OSPF implementation, and we were not able to find a pair of vendors whose implementations were incompatible. That means it's very easy for us to switch to another provider if we sour on one. And this gives us fully relocatable IP addresses.
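Mechanically, relocation can be as simple as attaching the service's /32 to an interface on the new host and letting the routing daemon flood it. A minimal sketch, assuming the host runs Quagga (or FRR) configured to redistribute connected routes — the address is made up:

```python
# Sketch: make a service /32 follow its container to a new host.
# Assumes the host's routing daemon (Quagga/FRR with "redistribute
# connected") floods whatever appears on the loopback; the address
# below is made up.
import subprocess

SERVICE_IP = "10.20.30.40/32"  # hypothetical relocatable address

def announce(ip: str) -> None:
    # Attaching the address to lo creates a connected route, which the
    # OSPF daemon injects into the link-state database.
    subprocess.run(["ip", "addr", "add", ip, "dev", "lo"], check=True)

def withdraw(ip: str) -> None:
    # Removing it withdraws the route; fabric-wide convergence is what
    # makes the sub-50 ms relocation mentioned later possible.
    subprocess.run(["ip", "addr", "del", ip, "dev", "lo"], check=True)

if __name__ == "__main__":
    announce(SERVICE_IP)
```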
Now, let's say I'm not talking about a web application; I'm talking about a Postgres database. Let's say we're running along, the application is talking to Postgres, and the server dies. That's unfortunate, but hey, I can relocate the Postgres database instance, and it maintains the same IP address, so the application will be able to connect to it again real soon. There's only one small problem: the storage it was using is still on the host that is now dead.
This is where we went to Ceph, because the problem here is that Docker images are ephemeral. Docker has persistent volumes, which work great on your local machine, and there are a lot of solutions, both proprietary and open source, for Docker volumes to be relocated. But if you want something that is actually full-on high availability — in other words, it will survive not just a voluntary shutdown but a "whoops, the power went out" —
you have to go look at iSCSI, for which you have to go talk to a large storage vendor, which sells you an appliance. And they're very happy, because the large storage vendor makes a lot of money on that appliance. You can talk to the same large storage vendor and say, "I want NFS, because I want something that is a little bit more filesystem-like." Great:
the large storage vendor is now super happy, because this is even more expensive and even less performant. Or you can try pNFS, which seems to be the direction the storage vendor industry wants to go. We tried pNFS: it did work, for a very short amount of time, and then it really didn't work anymore. And most of these proprietary solutions are scale-up; if you want to do scale-out with them, it really ends up being you buying multiple appliances and saying "these filesystems are here, those filesystems are over there."
The really major one for us, though, was the SLA. All of these large vendors offer four-hour on-site hardware support. They will have a tech on site in four hours to tell you that you have a problem. You already knew you had a problem. And if you have a customer on the phone, and this customer is telling you, "hey, I'm paying you a lot of money — where is my data? I can't access my application right now," then four-hour support isn't good enough. You need something that is simple enough, easy enough to
diagnose and repair that your own people in-house can fix it in a matter of minutes, not hours. So, this being the Ceph Tech Talks, I'm not going to go too deep into how Ceph works. The important parts for us are that there's no need to communicate with a metadata service in the hot path, that it truly is a scale-out solution, and that it is a very, very clean design.
There's a white paper on how Ceph works; I recommend you go read it. Reading it was easy enough for us to understand that we could go fix some of the basic problems ourselves, and it gives us the confidence that in the future, if there is a problem, we can actually go fix it — and there's a large enough community that will make sure the problem doesn't appear again. And, you know, if you need more capacity, you just add more servers; if you need more aggregate performance, you just add more servers.
But with this, the storage problem is solved, because if my Postgres host now dies, I just start the server somewhere else, and it's connected to the same replicated cluster — yes, with the same IP address. To the application this just looks like a temporary network glitch, at which point it will have reconnected to the Postgres database and everything is fine again. Now, if you have relocatable infrastructure, you can actually have these things piggyback on each other — because what happens when the server for your Ceph monitor dies?
If the machine hosting the monitor dies, we just start the Ceph monitor somewhere else with the same IP. At that point the monitor is going to come up and conclude that it is very out of date — in fact, it has no data whatsoever — so it will happily sync the data from the other monitors, and then it's up and running. Now, this is not automated.
The foobar potential here is very high, because if you have a split-brain scenario — if the system decides that the server that was running the monitor is dead, but it turns out it actually isn't — you now have multiple monitors with the same identity, and hilarity ensues. So, so far, this is a task where a human being must go in and say:
"Yes, the server is really dead; yes, really start it somewhere else." But it does give us relocatable monitors. There's human intervention, but no human has to go to the data center. And with this setup we are really in the space where, for our servers, if a physical server dies — and it doesn't matter which server it is — or if a physical switch dies — it doesn't matter which switch it is — well, it's a problem to be solved
next week. We just mark the server as down, and once a month we have our hardware vendor come on site with spare parts and fix the servers that are broken. There's no rushing to the data center in the middle of the night to fix a physical server. Now, for provisioning and orchestration, this is a twofold thing. The first part is that we always network-boot our servers — both the storage servers and the compute servers.
We have a small PXELINUX and initramfs, which is actually the distribution installer with a few small modifications and extensions. We use this to handle self-encrypting drives: we have data-at-rest encryption for everything, including the boot drives of all the compute nodes, and the key is never known by the runtime OS. This literally means that when a server powers on, it cannot boot, because the boot drive is encrypted and the server doesn't have the key; it has to network-boot
to get the key. What this gives us, first of all, is that we meet certain data-at-rest encryption requirements. More importantly, it means that if I have a problem with a server, it is actually completely safe to unrack that server and toss it in the trash, because all the data on it is encrypted; there's no way to get it out.
The other things the network-boot environment does: it checks the machine's state. If told to do so, it will go update the firmware, check the BIOS version, check the BIOS config, and make sure the OS is at the right release level; when all of these are done, it will boot the OS. Now, this firmware and BIOS version-and-config checking is actually really, really important.
Doing this gives us completely uniform machines. We don't have any half-installed, half-forgotten state; all of them are always completely identical. For us that means that when we do performance improvements, we do them on one machine, and once that one machine looks good, we say: this is the new golden state, please clone it for everybody else, including firmware settings. For orchestration we use Aurora and Mesos.
Aurora lets you program against the data center like it's a pool of resources, and it's fairly straightforward. Mesos keeps track of your data center; in other words, it keeps track of where you have spare capacity and how much capacity there is. There are a lot of different schedulers, one of which is Aurora, and with Aurora you tell it: I want to run this particular job;
I need this many CPU cores, I need this much memory, and I want this many instances. You can also give it fairly complex rules — much like Ceph's CRUSH rules, you can say: hey, I want three instances, and I want them to be in different racks. For us this is really important, because we stripped our servers down as far as we could, so they run a single power supply and single networking, and we have resilience at the higher levels. So our failure domain is one rack.
So when we run three instances of something, it's important to us that they run in different racks; the sketch below shows roughly what such a job definition looks like.
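Aurora job files are themselves Python, so a rack-spread service looks roughly like this. This is a hedged sketch, not our production config: the names, resources, and the 'rack' attribute are illustrative, and the Mesos agents must advertise that attribute for the limit constraint to bite.

```python
# Sketch of an Aurora job definition (.aurora files are Python that the
# aurora client evaluates, so Process/Task/Resources/Service and GB are
# provided by the DSL, not imported). Names, resources, and the 'rack'
# attribute are illustrative.
pg_task = Task(
    name="postgres",
    processes=[Process(name="postgres", cmdline="./run-postgres.sh")],
    resources=Resources(cpu=8, ram=64 * GB, disk=10 * GB),
)

jobs = [
    Service(
        cluster="our-cluster",
        role="dba",
        environment="prod",
        name="postgres",
        task=pg_task,
        instances=3,
        # 'limit:1' means at most one instance per distinct value of
        # the 'rack' attribute: three instances, three racks.
        constraints={"rack": "limit:1"},
    )
]
```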
We did have to extend Docker a little bit, specifically for the ability to give a container this unique IP and have it use layer-3 routing — there's no layer-2 bridging — and we also modified Docker to directly mount Ceph volumes. Behind the scenes it will do the `rbd map`, and if the image you're asking for doesn't exist — like, in this case, if there is no RBD image for "demo" — it will create one.
Since the images are thinly provisioned, our default size for everything is 10 terabytes. That may come back and haunt me at some point. It will then run mkfs and mount it if needed; if the filesystem is already there, it's just mounted, with the discard option. And in this case we're telling it specifically that this is a Ceph volume and that we want it read-write.
And this is all it takes to actually get a fully relocatable thing: I can run this, do some modifications inside this demo volume, shut down the server, and run it somewhere else. The sketch below shows roughly the steps involved.
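In outline, the volume plumbing our Docker modification performs looks something like the following. This is a minimal sketch — pool, image, size, filesystem, and mountpoint are illustrative, and real code would need more error handling:

```python
# Sketch of what the modified Docker does behind the scenes for a Ceph
# volume: create the thin-provisioned image on first use, map it, make
# a filesystem if none exists, and mount with discard.
import subprocess

def run(*cmd: str) -> str:
    return subprocess.run(cmd, check=True, capture_output=True,
                          text=True).stdout.strip()

def ensure_mounted(pool: str, image: str, mountpoint: str) -> None:
    if image not in run("rbd", "ls", pool).splitlines():
        # Thin provisioning: 10 TB (here given in MB) costs nothing
        # until blocks are actually written.
        run("rbd", "create", "--size", "10485760", f"{pool}/{image}")
    dev = run("rbd", "map", f"{pool}/{image}")  # prints /dev/rbdN
    if subprocess.run(["blkid", dev]).returncode != 0:
        # blkid exits non-zero when the device has no filesystem yet.
        run("mkfs.ext4", dev)
    run("mount", "-o", "discard", dev, mountpoint)

ensure_mounted("rbd", "demo", "/var/lib/postgresql")
```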
How fast is it? That's where things get interesting. On the networking side, we get about five microseconds of latency with the super-high-tech performance tool called ping. In reality, latency can be lower if you're running full-on RDMA, but five microseconds is small enough that I don't really care, for our application.
A lot of that, unfortunately, is still dominated by single-stream TCP. The performance for single-stream TCP is either 22 or 38 gigabits per second — I'll get back to why in one slide. Multi-stream TCP maxes out at close to 40 gigabits. And we can relocate IP addresses across the entire thing in less than 50 milliseconds; in other words, if an IP address was at one end of the data center and it needs to be routed at the other end, that takes less than 50 milliseconds. For storage, single-stream
I/O is 550 megabytes per second — that's completely limited by the SSDs in these particular storage nodes. With multi-stream I/O — in other words, if you do a lot of I/O across multiple RBD images at the same time — we can get close to 4 gigabytes per second, and we can reattach an image, again, in less than 50 milliseconds. In reality I think the real reattach time is a little bit less, but that was the granularity of the timer in my test.
So why 22 or 38 gigabits? Consider the case where your network card hangs off the PCI Express bus of one CPU and the SAS or SATA controller hangs off the PCI Express bus of the other CPU. For a single request, imagine that all we want to do is copy data from the SATA or SAS controller to the network card, and let's assume the thread that wants to do this work runs on CPU zero. When it wants to talk to the SAS or SATA controller, it has to go over the inter-CPU link
to take control of the necessary resources and do that remote I/O, and every single command or completion crosses that link; once it talks to the NIC, it is really fast, because it's local. This turns out to be a little bit of a performance bottleneck, and you have the exact same problem if your task runs on CPU number one: now talking to the storage controller is really fast, but the NIC is slow.
If you are on lower-speed networking or lower-speed SATA controllers, this bus is not a problem at all. But the second you really, really care about latency, really care about single-stream performance, it is a problem. We have solved that on the storage side by running single-socket: all of the storage nodes run a single socket, which means the SATA controller, the NIC, and everything else is attached to that single CPU. There is no NUMA; there is no bus that is the bottleneck.
This does, of course, limit the number of cores you can have per storage node. Today we run E5-2667 CPUs: that's eight cores that deliver above three gigahertz, which, as it turns out, is complete overkill for an OSD node. Demo time! I do apologize — I had to record this demo in advance; being connected to the VPN and on BlueJeans at the same time turned out to be a little bit of a challenge.
So let me open this. This is my demo. The URL here, by the way, should be live, so you can actually go and look at it, though it's going to show the end state. This is a small application written in Node.js — I do apologize; this is my first Node.js application ever, so there are probably quite a lot of style things that are wrong here — but it connects to a database, does a little bit of a SELECT, and just shows the result on the screen.
Now, if I have a client and I connect to this database, I can insert values into it. Everything's looking good; I can load it up, and yes — it's actually a database. Great. Now, where is this database running? On my Mesos console I can figure out that, OK, it's running on this particular host right now, so I can open another shell into that host. And how do we actually demo that something is resilient?
We just power off the host that was running the database. Now, how does this look to the application? Well, the database is now gone. Where is it? To the client, it looks like somebody rebooted — or restarted — the Postgres server. In reality it has now been relocated to a different host, gets the same IP, connects to the same storage, and our INSERT statements work and the application is still responsive.
So let's talk performance — and specifically real-world performance, because real-world performance and what you can read on data sheets are two very different things, especially when it comes to databases. On the data sheet you will see that an SSD does 100K 4K random-write IOPS. Fantastic. Great. That is absolutely true — if you have a very deep I/O pipeline and you never need to acknowledge the writes. In the real world, databases work differently: they don't have an I/O depth of 64,
they have an I/O depth of 1, because they will usually read an index block, then figure out what's actually in that index block — and if it's a B-tree, it could be several layers of index, too. It's really: read an index block, process the index block, read the next index block, process it. It's this iterative process of "please give me one piece of data, process it, then give me the next piece of data," which means it's effectively a random access pattern with processing in between, so you need a full round trip for each request.
More important, though, are the writes, because in a database, when you write a transaction — especially when you issue this word known as COMMIT — the guarantee to the user is that the data is now persisted on durable storage. In other words, if somebody were to pull the power on absolutely every single thing in the data center right this nanosecond, that transaction will survive and be there when you come back up.
Now, in the real world, a dedicated database server has a lot of buffer cache — and these prices are just taken from Newegg: a one-terabyte entry-level enterprise SSD with supercaps, if you buy 24 of them, that's 15K worth of hardware; 500 gigabytes of DDR4 RAM is 4K. So yes, dedicated database servers have a lot of buffer cache. Now, for us, we have two types of tables: a few-gigabyte tables and a few-terabyte tables, and our application does very, very, very heavy caching.
So there are few read requests, and even for the few read requests that are there — well, a database container (or, for that matter, a dedicated database server, but we'll stick with the database container) has plenty of memory, so most of the indexes and most of the small tables sit completely in buffer cache. So your read performance is dominated by how fast you can read from the buffer cache, which loosely translates to
how fast your memory is — and memory is fast. But if a user actually modifies something, then there's a transaction, which means our bottleneck is around fdatasync — especially when there are multiple users with multiple transactions at the same time, because as long as a transaction is running, it still holds locks. So it's really important that this fdatasync returns quickly, so that the locks the transaction holds can be released and the other transactions can start grabbing them.
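To make that concrete, here is a small sketch of measuring the fdatasync round trip that gates every commit — an 8 KB write followed by fdatasync, loosely approximating a WAL flush. The path and iteration count are illustrative:

```python
# Sketch: measure the fdatasync round trip that gates every COMMIT.
import os
import time

path = "/mnt/volume-under-test/walsim"  # put this on the volume under test
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
block = b"\0" * 8192  # one Postgres-sized page

samples = []
for _ in range(1000):
    os.write(fd, block)
    t0 = time.perf_counter()
    os.fdatasync(fd)  # the durability guarantee lives here
    samples.append(time.perf_counter() - t0)
os.close(fd)

samples.sort()
print(f"median {samples[499] * 1e6:.0f} us, "
      f"p99 {samples[989] * 1e6:.0f} us")
```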
Now, if you look at Ceph, specifically RBD, there are three ways to mount it. You can mount it via FUSE, which is easy and gives you low performance: on a mixed read/write workload with fdatasync, it gives about 640 IOPS. You can use the iSCSI target, which is actually a lot harder to use, and it's slow — its mixed read/write I/O is on par with what FUSE does. And then there's KRBD, the in-kernel RBD, which is actually really easy to use — it's just `rbd map` — and it's a lot faster: in the same mixed read/write test I did for the other two, it ended up at roughly 5,500 IOPS per job. The downside of the kernel client is that it doesn't have the fancy image features: you lose exclusive-locking support and you lose striping.
Hopefully that will be coming in a subsequent kernel release, I should say. Now, a problem in doing realistic testing with fio is that you need something that resembles Postgres. We can, and do, use pgbench. The problem is that the pgbench workload and our real application workload differ quite a lot: our real application workload has a lot of very large transactions with very large objects, where each row is a humongous beast,
whereas pgbench deals in small, well-formed transactions. So what we've done is observe the production I/O pattern and try to tune fio to replicate the same pattern, and whenever something provides good results in fio, we apply it to the real database. We have seen that if we get an increase of 25% in fio, our real-world database transaction rate will also increase by about 25%.
The important things you need to do: first of all, allow the buffer cache. By default, fio and most testing tools bypass the buffer cache in order to test the underlying storage — but hey, in production, and in all real-world scenarios, you have a lot of buffer cache, so leave it on and make the I/O buffered. That's one. Run multiple jobs, and use 8-kilobyte blocks, because that is what the database does.
I would love it if the database used asynchronous I/O, or used the direct kernel interfaces, or had a very deep I/O depth — but it doesn't. So you need to actually make your fio parameters "suck" so that they replicate exactly what the database is doing. That also means issuing fdatasync every hundredth block or so, using very, very large files, and using semi-random access: the reality is that even for read requests, it is often a few subsequent blocks and then it's random again. A sketch of such a job file follows.
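Here is a hedged sketch of that kind of fio run — buffered 8 KB I/O at queue depth 1, fdatasync every hundred writes, large files, multiple jobs. The specific numbers and paths are illustrative, not our tuned values:

```python
# Sketch: drive fio with parameters that mimic the database pattern
# described above. The numbers and paths are illustrative.
import subprocess

JOB = """
[global]
; buffered I/O keeps the page cache in play, as in production
direct=0
ioengine=sync
; the database does 8 KB pages at queue depth 1
bs=8k
iodepth=1
; flush roughly every hundredth write, like commits
fdatasync=100
; files big enough that caches cannot hide the device
size=100g
runtime=300
time_based=1

[pglike]
rw=randrw
numjobs=8
directory=/mnt/volume-under-test
"""

with open("pglike.fio", "w") as f:
    f.write(JOB)

subprocess.run(["fio", "pglike.fio"], check=True)
```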
So, again: Postgres doesn't use asynchronous I/O, so neither does the benchmark. Now, since I know my read caches will cover up most of the reads, I focus mostly on write performance. I have two comparison targets.
One is local software RAID 0 with three Samsung 850 Pros. For the writes here, the latency is fairly low, and the IOPS per job are also fairly low; this is just a sanity check for "how fast can it be if I just do some local hacking, tossing together
some hardware." I also have a proper hardware RAID controller running RAID 6 with 24 drives. An interesting thing there is that this one does a better job on IOPS per job — it does have a battery-backed write cache and so forth — but if we compare the local RAID 0 with the controller-based RAID 6 and look at latency, the 99.99th-percentile mark on the RAID controller is measured in milliseconds — quite a lot of milliseconds. The target, of course, is for 4K RBD to be beating
these. And KRBD does beat the local RAID 0 in number of IOPS. It does not beat the local RAID 6 in number of IOPS, but on latency — which, at the end of the day, is what dominates transactions for us — it is much better than even the dedicated RAID controller. And unlike the dedicated RAID controller, KRBD survives controller failure. Battery-backed caches are fantastic, right up until the controller dies and you have data sitting in a battery-backed cache for a controller that doesn't work anymore. This is when you start finding out
how good your backups are, or you go look at your Postgres slave and figure out whether it was up to date and whether you can fail over. Whereas with KRBD, we just start the same job somewhere else.
So, current challenges for us. First of all, we have a little bit of a challenge around locking. If you run ext4 on RBD, let's assume we have the following scenario: the database is running, and somebody reboots the switch — maybe it lost power, or it's maintenance, or something like that. At this point, a few minutes later, Aurora is going to detect that, hey, this compute node seems to be dead; let me start the job somewhere else. Now your database is started somewhere else;
it's happily relocated, and in its new location it maps the RBD image and mounts the ext4 filesystem. Great — the database is now back up. And then the switch finishes rebooting, and the old job is still running. As soon as Aurora can talk to that node again, it will tell it: whoops, please stop that job, because I already started it elsewhere. But there are going to be a few seconds where it still runs, with the RBD filesystem still mounted, happily writing to it. And if that happens, you get to figure out how to repair a broken ext4 filesystem with a database on top of it. We, thankfully, tested all these failure scenarios before we rolled this out to production, and I strongly recommend you test your failure scenarios in a lab before you roll anything out to production.
The stock `rbd map` doesn't check the lock, but you can do so. So now we try to lock the image, and if we don't get the lock — in other words, somebody else holds it — then we check the status of the image. If there's a watcher on the image, that means somebody else is holding the lock and they're still alive; in other words, whoever just told us to map this image is wrong, because the original job is definitely still running.
We check for the watcher three times, 15 seconds apart, and if we find it, we just abort. If we get all the way through, then for the last 45 seconds we haven't seen a watcher on this image, so we know the original holder of the lock is gone, but its lock is still there. At this point we blacklist the original lock holder, steal the lock, and map. On unmap we remove the lock, and when the node is rebooted, it un-blacklists itself. A sketch of this check-then-steal sequence follows.
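A minimal sketch of that guarded map, shelling out to the rbd and ceph CLIs. The image spec and lock name are illustrative, the lock-list parsing is deliberately simplified (real code should use `--format json`), and "blacklist" reflects the pre-Octopus command naming in use at the time:

```python
# Sketch of the guarded map: never steal a lock while the image still
# has a live watcher.
import subprocess
import time

IMAGE = "rbd/demo"          # illustrative image spec
LOCK_ID = "docker-volume"   # illustrative lock name

def run(*cmd: str) -> subprocess.CompletedProcess:
    return subprocess.run(cmd, capture_output=True, text=True)

def guarded_map() -> None:
    if run("rbd", "lock", "add", IMAGE, LOCK_ID).returncode != 0:
        # Someone holds the lock. A live holder keeps a watch on the
        # image header, so check three times, 15 seconds apart.
        for _ in range(3):
            if "watcher=" in run("rbd", "status", IMAGE).stdout:
                raise SystemExit("image in use elsewhere; aborting")
            time.sleep(15)
        # ~45 s with no watcher: the holder is dead. Blacklist it so a
        # zombie host can never write again, then take over the lock.
        locker, _lock, addr = (
            run("rbd", "lock", "ls", IMAGE).stdout.splitlines()[-1].split()[:3]
        )
        run("ceph", "osd", "blacklist", "add", addr)
        run("rbd", "lock", "remove", IMAGE, LOCK_ID, locker)
        run("rbd", "lock", "add", IMAGE, LOCK_ID)
    subprocess.run(["rbd", "map", IMAGE], check=True)

guarded_map()
```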
So this means that if one of these nodes goes away for more than 45 seconds or a minute and its jobs are relocated, then when that node comes back up, it's blacklisted. If you reboot the node, it will un-blacklist itself, because when a node comes back up fresh, it has no jobs and no mappings — it will wait to be told what to do by the masters.
We also need to make this faster. We beat legacy hardware for latency in the 99.9% range, but the problem is the 50% and 90% latency marks: there we're actually beaten, and we don't want any compromise on performance. We are currently rolling out NVRAM for the Ceph journals. It did take a little while to get that out there, because we have a requirement that all storage at rest must be encrypted — and if your RAM is non-volatile, then yes, it does count as storage.
If it survives a complete power loss and comes back in an intact state, then it is storage and it needs to be encrypted. PMC actually modified their firmware to support encryption for us, so we're really happy with that. We do have single-storage-server test results, which actually look awesome.
The hardware is being installed this week — some was installed yesterday, some today. One of the advantages of Ceph: we are doing this while the system is live, just taking down one rack at a time, installing the NVRAM, and bringing it back up. So we should have large-scale testing results in about two weeks.
If you want to try any of this out, you can go to github.com/medallia, where you will find our modifications to both Docker and Aurora. We'll also put our automated provisioning there — the always-network-boot, automatic BIOS-and-firmware handling. That will be there as soon as we can take some things like the SSH keys out of the repository; we just need to move those into a separate repository, and then we're going to put that project out there as open source as well.
If you want an exact replica of what we did — you know, I'll share the slides; you can go look at them — it's the compute nodes, the storage nodes, the networking, this idea that everything is Linux, everything is something you can just shell into, the same tools for everything, and open source as much as possible for all of it.
B: Excellent, thank you. That was a great presentation. For everybody that's interested: I will be unmuting all, so if you would like to ask a question, you can unmute yourself in BlueJeans and ask, or you can just go ahead and type it into the comment box. Any questions? Maybe I should open this up, then.
A: We run the OSPF daemons on the host, so we treat servers and switches as all part of the same OSPF domain. The container specifically does not run OSPF — if it did, that would be a little bit of a security problem if anyone broke into the container — but since it runs on the host, we run Quagga on the host.
B: Sounds like your presentation was feature-complete — everybody got everything they needed. So thank you very much. And remember, folks: next month we'll be back here, on the 25th of February at the same time, to hear about the latest and greatest on CephFS — so, yeah, I'll be sure to be there for that.
All right — and do keep an eye out on the mailing lists and social media and everything; I'll make sure to get out the links to this YouTube recording and to the slides, which I'll post up on SlideShare after he sends them to me. So thank you very much; we really appreciate it. This was great. Thank you.