From YouTube: Averting crisis at multi-petabyte scale - Piotr Dałek
Description
Cephalocon APAC 2018
March 22-23, 2018 - Beijing, China
Piotr Dałek. OVH Software Engineer
A
We might have a problem: for some reason the mons go out of quorum and clients can't connect, or the performance degradation is so severe that customers can't access their data, or their access is very slow. We had these problems in the past, so we decided to fix this for the customers, because we cannot afford it. This is no fun, not only for us but also for customers. People who don't like it go to the competition, or they will try to sue us, and so on. So they are generally unhappy.
A
So we had to do something, and the first thing we came up with was to downscale things a bit: instead of having one four-petabyte cluster, we divided it into four. Now we have four one-petabyte clusters. What does it change? When we have similar issues, they affect only a limited number of customers. We still have unhappy customers who are going to sue us, go to the competition, and so on. But at least the number of these customers is limited.
A
The actual reason is not important at this point, but that's what happened. Ceph can handle this, of course; it is not a problem for it. The problem is that there are 514 terabytes of data to recover, which is a lot. It will take a long time and will degrade performance for many users, and this is going to hurt. But if we downscale this and divide it, then in exactly the same scenario we have only 135 terabytes to recover, which is a lot less and can...
A
There are many reasons, but some users are rogue. For example, they are issuing trims on a lot of images, so this is going to hurt a lot, because delete is a costly operation and the performance for everyone is going to degrade. Or they are creating a lot of clones and then removing them. This is also costly in terms of performance, for the cluster and for the users.
A
So if we downscale this, we still have the same problem but on a smaller scale, because it affects only one cluster and not every customer. And speaking of rogue users: we had an issue with latency spikes. We had latency spikes exactly every five minutes, which, as you can see, looks exactly like a cron job, so it could hardly be a coincidence. So we started digging. We started asking around.
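The "exactly every five minutes" observation can be checked mechanically rather than by eye. The following is a minimal sketch, assuming you already have the timestamps of latency spikes (in seconds) from your own monitoring; the function name `looks_periodic` and the tolerance value are hypothetical, not something shown in the talk.

```python
def looks_periodic(spike_times, period_s=300.0, tolerance_s=10.0):
    """Return True if consecutive spikes are all roughly period_s apart."""
    if len(spike_times) < 3:
        return False  # too few samples to call it a pattern
    ordered = sorted(spike_times)
    # Gaps between each spike and the next one.
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return all(abs(gap - period_s) <= tolerance_s for gap in gaps)
```

A steady five-minute gap between spikes is a strong hint that something scheduled, such as a cron job on a client, is driving the load.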
A
And it started to affect customers. What can you do at this point? We increased logging everywhere so we could narrow down the number of affected PGs. Then we gathered the intelligence, so now we had the hot PGs. Then we increased logging on those PGs even further, and then we tracked down the data. We increased the objclass debug level to 20.
A
Usually this is safe to do, because objclass doesn't log much. Then you can use Graylog or a similar log-gathering tool to pinpoint the affected OSDs. On the screen you can see that Graylog can show you which OSDs are showing unusual activity due to the increased debug level of objclass. Now on the screen you can see that we have the add_child and remove_child operations repeating every five minutes, and what is important to note here is that...
A
Now that we have collected the data, we know the IP address of the client affecting us, what kind of client it is (in this case, OpenStack Cinder), and the operation affecting us, which tells us this is cloning of an image. So we took those IPs, searched our databases, and found out that these were two guys who had previously been telling us that they weren't doing anything. Case solved.
A
But it's not the end of the story, because it turns out that this was the reason for an unexpected disk space increase. Within three months, we jumped from nearly 85 percent to 74 percent, a percentage that is close to deadly for the cluster, and we had little time to find out what was going on. We already guessed that it could be caused by adding clones and removing clones, but we had no exact evidence. So we started the game again.
A
An interesting thing here is that Ceph was reporting different space usage than the operating system, so we started digging deeper, and we realized that the PG query returns something in the snap trim queue field. It's not empty, which probably means that the PGs were still snap trimming. The first thing we did was to increase the snap trim frequency and concurrency through the OSD settings: a sleep of only 50 milliseconds and more concurrent snap trims. You can do this at runtime, and we reversed the trend.
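The check described here, whether a PG still has snapshots waiting to be trimmed, can be sketched as a small parser. This is only an illustration under assumptions: it assumes the PG query output is JSON with a top-level `snap_trimq` string that reads `[]` when the queue is empty, and the sample document below is made up, not real cluster output.

```python
import json

def pending_snap_trims(pg_query_json):
    """Return the raw snap trim queue string if non-empty, else None."""
    info = json.loads(pg_query_json)
    # Assumed field name and format: a string such as "[]" or "[1a~3,2f~1]".
    trimq = str(info.get("snap_trimq", "[]")).strip()
    return None if trimq in ("", "[]") else trimq

# Hypothetical sample, shaped like the assumed output.
sample = '{"state": "active+clean", "snap_trimq": "[1a~3,2f~1]"}'
```

Running this over every PG and counting the non-`None` results gives a rough picture of how much trimming work is still outstanding.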
A
This is a pull request we wrote. It also improves integration with monitoring, because you don't need to parse the snap trim queue field anymore. It's just a simple integer which you can put into Nagios or Munin to graph it, or put it into whatever you want. It was merged into master and it was backported to Luminous, but unfortunately a full backport to Jewel is not possible. So if you need this feature, you will have to upgrade, unfortunately.
A
We keep prebuilt VM images and clone them for virtual machines, so we can set them up in no time. So we have two pools and two users. User A, the administrative user, manages pool A, which contains the original parent images. And we have user B, who manages the clones of those images: it has full rights to pool B, which contains the clones, but it doesn't have write access to pool A.
A
So how do we fix this? First you need to understand how this works internally in Ceph. Ceph maintains an object called rbd_children, which contains an omap database consisting of keys and values, with the key containing the parent and snapshot information and the value containing the child information. Of course, what we have to do in this case is to remove the offending clone from this.
A
When you look at this information, which is taken directly from the Ceph code, you realize that we have to decrease the count field and find the offending clone ID field and, of course, remove it. In the key there is information about the parent pool ID, which is the ID of the pool containing the parent image.
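To make the layout concrete, here is a minimal sketch that works purely on bytes in memory and never touches a cluster. The layout is an assumption for illustration: a key made of a little-endian 64-bit pool ID, a length-prefixed image ID, and a 64-bit snapshot ID, and a value made of a 32-bit count followed by length-prefixed child IDs. All function names are hypothetical, and this is not the tool mentioned in the talk.

```python
import struct

def decode_parent_key(key: bytes):
    """Split a key into (parent pool id, parent image id, snapshot id)."""
    pool_id, id_len = struct.unpack_from("<qI", key, 0)
    image_id = key[12:12 + id_len].decode("ascii")
    (snap_id,) = struct.unpack_from("<Q", key, 12 + id_len)
    return pool_id, image_id, snap_id

def decode_children(value: bytes):
    """Read the count field, then that many length-prefixed child ids."""
    (count,) = struct.unpack_from("<I", value, 0)
    offset, children = 4, []
    for _ in range(count):
        (length,) = struct.unpack_from("<I", value, offset)
        offset += 4
        children.append(value[offset:offset + length].decode("ascii"))
        offset += length
    return children

def encode_children(children):
    """Inverse of decode_children: count field first, then each id."""
    out = struct.pack("<I", len(children))
    for child in children:
        raw = child.encode("ascii")
        out += struct.pack("<I", len(raw)) + raw
    return out

def drop_child(value: bytes, offending_id: str) -> bytes:
    """Remove one child id; re-encoding also decreases the count field."""
    kept = [c for c in decode_children(value) if c != offending_id]
    return encode_children(kept)
```

Even in this simplified form, both the key and the value are binary and full of zero bytes, which is why editing them by hand is error-prone.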
A
So, as you can see, there is no easy solution to this, and instead of doing it by hand, we wrote our own tool. We haven't published it yet, but we are going to do so in the near future. Unfortunately, we stumbled upon a problem while writing it because, as you saw, the key and value contain null characters, which the rados tool doesn't accept. It stops at the first null character, which means we have a problem when we want to manage those keys and values for rbd_children.
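The effect of stopping at the first null character is easy to reproduce. A minimal sketch, assuming a binary key that begins with a little-endian 64-bit integer followed by a 32-bit length and a short ID (an illustrative layout, not a dump from a real cluster); `as_c_string` is a hypothetical helper that mimics C-string handling.

```python
def as_c_string(raw: bytes) -> bytes:
    """Mimic C-string handling of binary data: cut at the first NUL byte."""
    return raw.split(b"\x00", 1)[0]

# Illustrative binary key: 64-bit pool id 5, 32-bit length 3, then "abc".
key = (5).to_bytes(8, "little") + (3).to_bytes(4, "little") + b"abc"
truncated = as_c_string(key)
```

Here `key` is 15 bytes long, but `truncated` keeps only the single byte before the first zero, so a tool that treats keys as C strings cannot address the real key at all.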
A
Again, don't try to fix this manually, because you can break even more, and if you are already in this situation, don't hesitate to ask us for help. We can help you out. And in the end, don't let this happen to you, because it's a tricky situation, and if you are caught with this in the middle of the night, it might be even more troubling.
A
Actually, this is not our story, but I'm going to share it. Some of you on the ceph-users mailing list will have seen a message in which someone was complaining about increasing disk space usage on their cluster, but it wasn't the case we faced. After some time they confessed what the issue was. The issue was, as they called it, a monitoring plugin which was invoking rados bench every minute, and sometimes this task didn't finish, so it was leaving stray data in the main production pool.
A
This is mostly the case when we are doing recovery or some other stuff. Again, another issue is that it only performs writes, and what if your workload is mostly reads? It doesn't account for this, and in the case of Ceph, writes are always slower than reads, so you have different data: again, no relation to reality.
A
Finally, the final nail in the coffin. There is a timeout, strictly enforced, which can be problematic for many reasons. For example, the cluster writes so many objects so fast that it doesn't manage to erase them in the remaining 20 seconds. It happens, it can happen. Or simply the primary OSD for the objects being written can die in the middle of the test and the test cannot go on, so it hangs, the timeout kills it, and the stray objects remain.
A
What you can do instead is to issue ceph daemon perf dump on any interesting OSD daemon and collect the data. To get the latency in milliseconds, you divide the delta of sum by the delta of avgcount and multiply by a thousand. I'm talking about deltas because these are total accumulated numbers across the entire OSD's lifetime.
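The calculation described here can be sketched as follows. This is a minimal illustration, assuming each sample is a dictionary with the accumulated `sum` (in seconds) and `avgcount` of one latency counter, taken from two dumps of the same daemon; the function name and the sample numbers are made up.

```python
def avg_latency_ms(earlier, later):
    """Average latency in milliseconds between two counter samples."""
    delta_sum = later["sum"] - earlier["sum"]
    delta_count = later["avgcount"] - earlier["avgcount"]
    if delta_count <= 0:
        return None  # no operations in the interval, or the daemon restarted
    return delta_sum / delta_count * 1000.0

# Hypothetical samples: 500 operations took 2.5 seconds in total,
# which works out to about 5 ms per operation.
first = {"avgcount": 1000, "sum": 2.0}
second = {"avgcount": 1500, "sum": 4.5}
latency = avg_latency_ms(first, second)
```

Dividing the raw `sum` by the raw `avgcount` instead would give the average over the daemon's whole lifetime, which hides any recent change.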
A
So you should take a sample at one point in time, then a second one at another point in time, calculate the difference between them, and use that in your equation. This will give you the latency in milliseconds. And if you are in this mess already and you have stray objects, you can always clean them up with the rados cleanup command; you also need to provide the prefix benchmark_data. This will delete the stray objects without affecting anything else. So you don't need to copy the pool, you don't need to do any weird stuff; just issue it.
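As a rough sketch of that cleanup step, the helper below only builds the argument list for the `rados` command-line tool and does not run anything. The pool name `volumes` in the usage note is a placeholder and the helper name is hypothetical; check the exact options against the `rados` manual for your release before using it.

```python
def cleanup_command(pool, prefix="benchmark_data"):
    """Build the argv for removing leftover benchmark objects from a pool."""
    if not pool:
        raise ValueError("a pool name is required")
    # Only objects whose names start with the prefix are targeted.
    return ["rados", "-p", pool, "cleanup", "--prefix", prefix]
```

For example, `cleanup_command("volumes")` yields a list you could pass to `subprocess.run` once you have confirmed that the pool and prefix are the ones you intend.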