Cephalocon APAC 2018
March 22-23, 2018 - Beijing, China
Varada Kari, Flipkart System Engineer
Good afternoon, everyone. I am Varada, from Flipkart, an e-commerce company in India, and I have some teammates here with me. I am going to talk about our journey so far with Ceph: how we deploy and use Ceph internally at Flipkart, what challenges we faced, how we take care of all our internal workloads, and what kind of QoS mechanisms we have deployed in the RGWs. We mostly use it as an object storage service.
Flipkart is arguably the largest e-commerce company in India right now. We have 80 million products across 70-plus product categories, around 100,000 sellers, and 30 million daily visitors to Flipkart, with 8 million shipments per month; at peak load that roughly doubles to around 16 million shipments. To cater to all these needs of Flipkart, we have developed a highly scalable and reliable storage infrastructure based on Ceph, and we are using it as an object storage service.
It has been in production for approximately the past three years. We have two data centers with around 20,000 hosts each, and it is a completely virtualized infrastructure: we run our own infrastructure, and the virtual machines are carved out for the Ceph use case. As I mentioned, it is mostly a scale-out object storage service on Ceph.
We mostly use the S3-compatible RGWs, and our key focus, or the key feature of Ceph for us, is mostly durability.
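Since the service is consumed through the S3-compatible API, internal teams talk to it with ordinary S3 SDKs. A minimal sketch of what such a client looks like against an RGW endpoint — the endpoint URL, credentials, bucket, and key below are placeholders for illustration, not Flipkart's actual values:

```python
import boto3

# Hypothetical RGW endpoint and credentials -- placeholders only.
s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.internal:7480",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Create a bucket and store a small object -- the dominant pattern on the
# low-latency workload described below.
s3.create_bucket(Bucket="invoices")
s3.put_object(Bucket="invoices", Key="order-1234.pdf", Body=b"...")

# Read it back.
obj = s3.get_object(Bucket="invoices", Key="order-1234.pdf")
print(obj["ContentLength"])
```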
What we offer for Flipkart: one workload is mostly a low-latency platform, with smaller object sizes but a need for low latency, and the objects are large in number. The other workload is mostly the backup use case, where most of the critical databases — MySQL, Redis, Elasticsearch, Kafka — are backed up onto one of our own Ceph clusters, which takes care of all the backup workloads; this one is mostly bandwidth optimized, with medium to bigger objects, since we are backing up whole databases.
For this we have one big cluster of around two petabytes of raw storage, and it is growing every day. It has around 750 OSDs right now, with a mix of SSDs and hard drives, and it holds around 1.5 billion objects — more than that by now. We have 200-plus internal customers who use the service, and we grow by about 50 million objects every month; at peak we grow around 100 million per month. So I am going to give an overview of our tech stack, how we deploy Ceph in our clusters, and the challenges we faced to arrive at this organization of the stack.
We have one endpoint behind which multiple clusters sit; there are multiple clusters inside a single endpoint, and depending on the user and the SLAs, requests are mapped to different clusters.
We have one load balancer, a software load balancer based on HAProxy. It is actually owned by a different team; that is the reason it sits at a separate layer right now. It reaches out to a bunch of Nginx nodes, which work as a request forwarder for us, and then on to the RGWs. To meet the SLAs, we have split the RGWs into two instance groups, what we call the read group and the write group. There is another load balancer attached to these, again an HAProxy-based software load balancer with two different endpoints, to handle the read traffic and the write traffic: Nginx forwards each request, looking at the headers, to either the read or the write RGW group depending on whether it is a read or a write request. These RGWs then reach out to the OSDs, which are VMs in our virtualized infrastructure.
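The talk does not show the actual Nginx or HAProxy configuration; conceptually, though, the split amounts to classifying each request and forwarding it to the read or write RGW pool. A rough sketch of that idea in Python — the upstream group names and hosts are invented for illustration:

```python
# Sketch of the read/write split done at the Nginx layer: requests are
# classified (here simply by HTTP method) and forwarded to separate RGW
# instance groups. Upstream endpoints are hypothetical placeholders.
READ_METHODS = {"GET", "HEAD", "OPTIONS"}

UPSTREAMS = {
    "read": ["rgw-read-1:7480", "rgw-read-2:7480"],
    "write": ["rgw-write-1:7480", "rgw-write-2:7480"],
}

def pick_upstream_group(method: str) -> str:
    """Return which RGW group should serve this request."""
    return "read" if method.upper() in READ_METHODS else "write"

assert pick_upstream_group("GET") == "read"
assert pick_upstream_group("PUT") == "write"
```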
Each node here is a VM. We have Mons, cache-tier nodes, index nodes, and then the OSD nodes, which have a combination of SSDs and HDDs. These are all carved out of bare-metal hardware — again a Debian-based hypervisor host for us — with different VMs carved out from each hypervisor.
This slide just shows the clients we handle every day in the Flipkart workload. We have a bunch of internal customers who use it as a repo service: all the Debian packages built from internal code, whatever has been developed, are packed into a Debian and pushed into this cluster.
There is also a CDN serving the public interface of Flipkart, which accesses all the invoices, images, and other details of the products; this constitutes the user-facing part of it. And we have our own health monitoring system, based on a time-series database, OpenTSDB, with a visual UI on top of it using Grafana, which we will see in more detail in the later slides.
Any instance, as I mentioned, is a VM for us, and any instance has the following properties: a host OS, which is Debian for us; specific network guarantees depending on the instance type; a number of virtual CPUs; a type of disk — SSDs, hard drives, or a mix of both; and the amount of main memory we want to give to each VM. When we translate this to the Ceph workload: for the monitors we need more CPU and RAM to handle the database, the LevelDB workload and so on; for the RGWs we need more CPUs and more network bandwidth, and we have a group of these RGWs, all of the same type.
When it comes to the OSDs, we have SSD-only instances, which are used for the cache tier and the index nodes, but with different configurations. One is a low-latency flavor with larger-capacity SSDs; the cache tier is mostly disk-bound, so we do not need too many cores there, nor too much RAM, and we use a different VM configuration for it. For the base OSDs we have higher capacity, but it is a hybrid flavor: SSDs and hard drives together in the same VM instance.
Putting everything together, we want to guarantee certain QoS aspects to the internal customers reaching out to the Flipkart cluster. To guarantee this QoS we have around three to four layers of guarantees. One is embedded into the RGW code right now: we look at the headers, figure out what kind of request it is, and then take a call on whether we should serve it or not.
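The throttling logic itself lives in our RGW changes and is not shown in the talk; as a minimal sketch of the idea only (the limits, user extraction, and helper names are assumptions for illustration, not the actual RGW patch), the serve-or-reject decision is essentially a per-user token bucket:

```python
import time

# Minimal per-user token-bucket sketch of header-based throttling.
class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

buckets = {}  # user id -> TokenBucket

def should_serve(user: str, rate: float = 100.0, burst: int = 200) -> bool:
    """Return False when the request should be rejected (e.g. with HTTP 429)."""
    bucket = buckets.setdefault(user, TokenBucket(rate, burst))
    return bucket.allow()
```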
At the fourth level of guarantee, what we are trying to do is at the CRUSH level: we physically separate the hardware into different PDU-level groups mapped to different pools, so that if one pool has a problem — a down OSD or some hardware failure — it does not affect the whole cluster.
That is the fourth level of QoS, which we want to guarantee at the cluster level. Coming to the health monitoring, or administration, part of it: as I mentioned before, we collect different kinds of stats from each component in the cluster, and we mostly categorize them into two groups. One is the system side, where we collect all the system-specific stats — how much RAM is used, what the disk capacity is, how much latency each component sees, and so on. Then we have what we call periodic stats, which we run on each of the nodes — OSDs, RGWs, Mons — and we have dedicated admin nodes to run all of these stat collectors together.
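The collectors feed these data points into OpenTSDB, and Grafana reads them back for the dashboards shown later. A small sketch of what one such push could look like over OpenTSDB's HTTP API — the host, metric name, and tags here are invented for illustration:

```python
import json
import time
import urllib.request

# Push one data point to OpenTSDB's HTTP API (/api/put). Host, metric
# name, and tags are hypothetical; Grafana then graphs these series.
def push_metric(metric: str, value: float, tags: dict) -> None:
    point = {
        "metric": metric,
        "timestamp": int(time.time()),
        "value": value,
        "tags": tags,
    }
    req = urllib.request.Request(
        "http://opentsdb.example.internal:4242/api/put",
        data=json.dumps(point).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

# Example: report OSD disk utilisation collected by a periodic stats job.
# push_metric("ceph.osd.disk_used_pct", 71.4, {"host": "osd-vm-12", "osd": "37"})
```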
When it comes to the Mons, we collect the DB stats: how much space they are using, what the quorum status is, and so on. When it comes to the RGWs, we capture how many requests we are serving per user, so that if any user is breaching the contract, or if there is a cluster issue that is causing slow requests for the user, we will be alerted. And if there is a down OSD or a down node, or any failure in an OSD — such as going out of memory — we capture all of it and report it, so the on-call person gets a call, a mail, and an SMS to attend to the cluster immediately.
this
is
one
of
the
graphs
to
show
how
many
requests
what
we
going
to
cater
so
the
green.
Actually,
so
this
is,
you
can
see
on
the
right
hand,
side
I,
don't
know
the
scare
what
kind
of
HTTP
error
status
we
are
actually
HTTP
status.
We
are
actually
reading
to
the
user
so
to
hundreds
or
404s
or
429
or
5-0
threes.
What
kind
of
history
deepest
status
is
returned
to
the
user
from
GW?
We
capture
that
the
green
actually
is
reflecting
the
200s.
A
What
we
have,
we
typically
cater
around
eight
key
requests
in
in
average,
and
then
the
lower
bots
talk
about
different
error,
different
codes
and
error
reports
there
so
and
we
can
actually
see
on
the
right
hand,
side
again
there.
We
can
actually
go
and
look
at
it
the
level
of
what
is
operation,
what
each
user
is
doing
and
the
user,
and
what
bucket
you
can
actually
go
and
look
at
how
many
ops
are
being
sent
at
the
peak.
A
Okay,
so
that's
that's
how
we
use
it
so
far
and
then
the
journey
is
not
so
smooth
so
far,
so
we
have
different
challenges
to
reach
it
to
this
level
of
organizing
our
own
tech
stack
and
then
handling
all
the
challenges,
the
problems.
So
we
started
with
this
with
a
hard
drive
based
cluster
and
then
we
start
to
kind
of
grow
and
to
the
need
right.
So
the
first
problem,
so
we
have
a
lot
of
writers,
was
coming
in
which
we
do
around.
A
We
used
to
do
around
2
to
3
K
before
it's
a
hard
drive,
blaze,
cluster
and
suddenly
one
of
the
hosts
is
failed,
and
then
we
started
having
rebalancing
that
triggered
a
lot
of
splits
and
merges
on
the
file
store
back-end
and
we
are
not
able
to
serve
any
of
the
I
where
it
was
so
it's
kind
of
an
outage
for
us,
so
we
were
not
able
to
save
any
I.
Was
at
that
point
so
then
we
started
thinking
about
okay.
This
is
not
going
to
handle
it
now.
I
have
to
do.
A
A
So
to
have
all
these
things,
then
we
have
separated
them
at
nginx
layer
to
have
different,
write
and
read
groups.
That
is
the
first
challenge.
Then there was another outage, with the metadata pools — again a learning for us. Some of the customers went beyond a hundred million objects in a single bucket, which means a huge index per bucket, and we also have a usage pool which logs all the operations for each user — the same problem again. One of the disks in that pool failed, a lot of rebalancing happened, and the cluster was down on its knees; you can see the trend there — for a couple of hours or so we were not able to serve any of the IOs. The mitigation: we went back and started sharding all these buckets manually, and we came up with a shard count per bucket. This was on Hammer, so we did not have dynamic resharding at that time.
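Hammer has no dynamic resharding, so the shard count has to be picked up front (for new buckets, via the rgw_override_bucket_index_max_shards option) and oversized buckets have to be re-created with it. The exact numbers used are not given in the talk; a hedged sketch of the common rule of thumb — roughly 100k index entries per shard — looks like this:

```python
import math

# Rule-of-thumb bucket-index shard count: ~100k objects per shard. The
# target value is a common guideline, not the exact figure from the talk.
OBJECTS_PER_SHARD = 100_000

def bucket_index_shards(expected_objects: int) -> int:
    return max(1, math.ceil(expected_objects / OBJECTS_PER_SHARD))

# A bucket expected to hold 100M+ objects, like the one behind the outage,
# needs on the order of a thousand index shards.
print(bucket_index_shards(100_000_000))  # -> 1000
```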
The third challenge was an application error from a user: he was in a loop where, whenever he tried to put an object into his bucket, he also issued a create-bucket — a create-bucket followed immediately by a put-object, in a loop. That again caused lock contention on the LevelDB, and the cluster was again down on its knees. We were not able to take any requests because of the lock contention on LevelDB, plus the sharding work that was going on, everything together. So there was no other way: we looked back again and said, okay, we have to enable rate limits on the bucket-creation requests as well. Rate limits were then applied to bucket creation and deletion — how many a user can do per minute or per hour, depending on what he has signed up for. That is mostly the usage side of it.
Then we faced some more problems with coordination and some missing features. For example, we were leaking objects from incomplete multipart uploads. You can abort the multipart upload, but we were not able to clean up all the incomplete objects; there is something like 'radosgw-admin orphans find', but we are into some billions of objects, and at that scale it did not really work for us.
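On the client side, one mitigation is to periodically list and abort incomplete multipart uploads per bucket; aborting is the standard way to release the parts, though in the case described above orphaned RADOS objects still needed server-side tooling. A sketch, reusing the boto3 client from the earlier example (the bucket name would be a placeholder):

```python
# Sketch: abort incomplete multipart uploads left behind in a bucket,
# using the boto3 client `s3` from the earlier example.
def abort_incomplete_uploads(s3, bucket: str) -> int:
    aborted = 0
    paginator = s3.get_paginator("list_multipart_uploads")
    for page in paginator.paginate(Bucket=bucket):
        for upload in page.get("Uploads", []):
            s3.abort_multipart_upload(
                Bucket=bucket,
                Key=upload["Key"],
                UploadId=upload["UploadId"],
            )
            aborted += 1
    return aborted
```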
The second thing is again on the multipart side: there was a data loss caused by a race in the multipart path. One of the objects was a bigger object, so there were multiple parts to it. When the user did a complete-multipart, it took a long time to update the metadata — longer than the client's timeout window or something — so he tried to complete the multipart again with another POST. Then there was a race, and we ended up deleting all the objects other than the head object. So there was a data loss for us, and we have another fix to address these issues: taking a lock and making sure the two completions do not race at the same point in time. To account for the lost data we have our own tools again, not the usual orphans find.
Then we said, okay, we will do something else. Luminous came along, which would address the FileStore problems of splits and merges we had been facing, and we thought, okay, we will use BlueStore for this. Again, the path from Hammer to Luminous is not so smooth: there is no direct path. We have to go to Jewel, then upgrade from there to Luminous, and then each FileStore OSD has to be destroyed and added back with a BlueStore backend. This introduced some wear and tear on the hardware — hard drives started failing, SSDs started failing — so we are still coming up with different ways to upgrade from Hammer to Luminous right now.
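The per-OSD conversion is destructive: the FileStore OSD is stopped, destroyed, and re-created with a BlueStore backend, and backfill then refills it. A hedged sketch of that sequence on a Luminous cluster — the OSD id and device path are placeholders, and real deployments usually drive this through their own tooling (ceph-disk on older releases, ceph-volume on later ones):

```python
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Hedged sketch of converting one FileStore OSD to BlueStore on Luminous.
def convert_osd_to_bluestore(osd_id: int, device: str):
    run(["systemctl", "stop", f"ceph-osd@{osd_id}"])
    # 'destroy' keeps the OSD id so the new BlueStore OSD can take its
    # place and be refilled by backfill.
    run(["ceph", "osd", "destroy", str(osd_id), "--yes-i-really-mean-it"])
    run(["ceph-volume", "lvm", "zap", device, "--destroy"])
    run(["ceph-volume", "lvm", "create", "--bluestore",
         "--data", device, "--osd-id", str(osd_id)])

# convert_osd_to_bluestore(37, "/dev/sdc")
```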
Meanwhile, being on Hammer, we have occasional PG-incomplete problems which result in data loss for us — this has happened a couple of times now — because of the cache tier we have: whenever one of the OSDs goes down and comes back again, there is a complete acting-set change, which triggers some other problem, and we lose some of the objects. So we had to go back and develop offline tools to figure out which objects we lost and try to reconstruct them or re-ingest them into the cluster.
From all of this journey, what we see is that there are a lot of missing pieces to put together. What we lack right now is end-to-end QoS from Ceph. We are running a live service, an e-commerce business, so we have to be enterprise-ready, but there are some missing pieces we need to add to reach that level of guaranteeing the service in terms of reliability. So, end-to-end QoS — and we are actually looking at some of the issues there; there are a bunch of issues I could list, but we do not have the time to go through them. One of the measures is: how do you predict OSD failures?
So we have actually introduced our own piece of software there: we have some per-user configuration, and ultimately what we want to do is tie it up with the user quotas — how much a user can actually do, how many GETs and PUTs he can do, how many bucket creates he can do. That is the reason we introduced it.
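Tying the per-user limits to quotas could lean on the quota support RGW already ships, with the request-rate limits staying in the custom layer described above. A sketch of setting a per-user object quota with radosgw-admin — the uid and numbers are placeholders:

```python
import subprocess

# Sketch: set and enable a per-user quota through radosgw-admin. The uid
# and limits are placeholders; the GET/PUT rate limits stay in the custom
# per-user configuration described above.
def set_user_quota(uid: str, max_objects: int, max_size_kb: int) -> None:
    subprocess.run(["radosgw-admin", "quota", "set",
                    "--quota-scope=user", f"--uid={uid}",
                    f"--max-objects={max_objects}",
                    f"--max-size-kb={max_size_kb}"], check=True)
    subprocess.run(["radosgw-admin", "quota", "enable",
                    "--quota-scope=user", f"--uid={uid}"], check=True)

# set_user_quota("catalog-service", max_objects=200_000_000, max_size_kb=-1)
```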