From YouTube: Introducing YugabyteDB
Description
View this talk presented by Karthik Ranganathan, Yugabyte Co-Founder and CTO, to get an introduction to YugabyteDB.
This talk was presented at Microservices and Cloud Native Apps SF Bay Area Meetup on March 29, 2018.
Learn more: https://www.yugabyte.com/
We worked on the database side of Facebook during the period when they went from a pre-cloud-native architecture to a cloud-native, microservices-type architecture, one that really held up under a lot of requests. That's where we learned about planet scale, what the challenges are with the data tier, and so on. We're about 25 people now, a lot of us from modern web companies: Facebook, Nutanix, Google, LinkedIn, and so forth.
We've built a whole bunch of apps on top of NoSQL platforms. I worked on Cassandra before it was called Cassandra, before it was open source, way back before NoSQL was even a thing, and a whole bunch of us worked on HBase. We put it in production for some very large use cases, including Facebook Messages, Facebook Messages search, and our operational data tier, which did alerting, metrics, spam detection, and a whole bunch of others.
So our current company is a distillation of all of that. But before we go into what we do now, I wanted to quickly talk about how people build planet-scale apps. The Docker plus Kubernetes ecosystem is great for the stateless tier: it's very intent-based, really easy to deploy, there are a lot of great tools, and that's pretty much where everyone is headed. But how do you deal with the data tier? How does that look? And of course, we're talking about a world where public clouds have become the mainstay.
A
There's
like
70
80
90
data
centers
available
to
you,
I
just
have
two
data:
centers
is
no
longer
an
excuse
and
your
users
don't
really
care.
They
want
their
answers
quickly.
If
not
they're
going
to
the
next
app
right
or
Amazon
will
come,
get
you
depending
on
your
field.
So,
okay,
how
did
people?
How
do
people
do
this?
Well?
Typically,
there's
a
shortage,
sequel,
master/slave
for
the
most
critical
data
and
then
there's
one
or
more,
no
sequel
tiers
for,
depending
on
your
velocity
of
data
density
of
data,
type
of
access
pattern,
etc,
etc.
So you now put your data across these tiers, and because you have your data in so many tiers, you need to bring whatever the user wants into something that can serve it really fast. So now begins the fun dance: figure out which pieces of data go where, figure out which pieces of data go into the cache because the user cares about them, and the user's behavior constantly changes.
So your cache has to adapt. Now figure out how to replicate each of those tiers, and finally, of course, there are going to be failures in this sort of setup: figure out what went wrong and how, and do it quickly, all while you keep changing stuff, changing machine types, changing data centers. You get a generous three to six months to do all of that, but it's a full-time effort from the app team, the ops team, the database team, everybody, and we've seen this story play out over and over again.
You have Aurora or RDS taking care of your SQL tier, but it's essentially the same architecture underneath, and you have DynamoDB, which replaces the NoSQL tier. So you've taken care of the machines and the patching, but the app complexity and the complexity of running this system with data integrity in production are still the same. So what did we want to do at Yugabyte? Well, we just started with "solve world hunger" and put it all in a single database.
A
Well,
of
course,
I'm
joking
right,
so
there
are
certain
trade-offs
that
you
have
to
make
when
you're
trying
to
build
scale
out
applications
in
a
cloud
right.
So
we've
made
those
trade-offs
and
we've
seen
what
it
takes
to
put
a
database
for
a
variety
of
different
workloads
and
what
are
the
common
asks
right,
and
so
that's
what
we're
going
to
be
looking
at
okay.
So
what
are
your
choices
really
right
and
where
is
the
world
evolving?
How
do
you
make
sense
of
this
space
if
you
want
a
strongly
consistent
database
to
build
global
applications?
What really are your choices? You have the traditional SQL databases, but they don't really scale out and they don't really do planet scale. You wouldn't be able to deploy your data across different data centers, or scale out when you need to add a whole bunch of machines because you need to serve more queries or store more data, and so on.
So the NoSQL guys came up with a solution: we'll take care of the scale-out. They gave up on transactions; they said we'll be highly performant and we'll do scale-out. But there's still that 20 to 30% of your workload that needs transactional guarantees, and what do you have to do for it? You need another database, which then brings in a cache, and you're back where you started. So databases like Spanner took another tack.
A
They
said
we're
going
to
push
sequel
to
its
limit,
we're
going
to
take
it
to
the
bleeding
edge
right,
so
they
give
you
transactions,
they
give
you
scale
out,
but
they're,
not
the
database.
You
would
go
to
if
you
want
to
write
a
lot
of
data
or
serve
them
really
really
quickly.
So
performance
is
what
they
gave
up
on.
So
what
are
we
doing
here
at
Yoga
byte
we're
trying
to
bring
in
the
best
of
both
worlds
into
a
single
database?
A
So
if
you
want
to
distribute
your
data
across
the
planet,
if
you
want
to
do
it
with
serve
your
data
with
high
performance,
so
if
you
just
want
a
unified,
integrated
serving
tier
and
if
you
want
the
ability
to
do
transactions,
whether
it's
single
row,
acid
or
multi,
row
acid,
like
distributed
transactions,
both
we
wanted
to
make
a
single
data
platform,
because
we
think
that
the
apps
of
the
future
are
going
to
increasingly
look
like
this
okay.
So
we
started
to
build
this
database.
What were our design goals? Transactional, high-performance, and planet-scale at the core database layer, plus cloud native and open source. On the transactional side, we built a data platform that understands single-row versus multi-row distributed operations, and it keeps the single-row path extremely fast and the multi-row path extremely consistent. On the high-performance side, we didn't go with Java like a lot of others; we went with C++, because C++ is still the language to go to if you want extremely high performance, so it's written ground-up in C++. For planet scale, we have the ability to distribute data across various regions, far and near, and we can support reads from nearby data centers so that you can achieve low-latency reads while being able to write from anywhere and maintain consistency. And of course, at the core of the database is automatic sharding and rebalancing, so you don't have to think about sharding or rebalancing.
A
The
database
does
that
for
you
on
the
cloud
native
side,
it's
all
built
for
the
cloud
and
the
container
era
so
makes
it
very
easy
to
deploy,
run
and
manage,
and
it's
self-healing
and
fault
tolerance.
So
if
there
are
any
failures
and
there's
a
lot
of
failures
in
the
cloud,
you
don't
have
to
wake
up
at
3
a.m.
to
go
figure
out
what
was
wrong
with
your
database
and
believe
me:
we've
done
that
as
a
team.
A
Many
times
you
don't
want
to
do
it,
or
some
of
you
may
already
know
that
so
and
the
self-healing
nature
is
that
you
will
still
respect
the
users
promise,
if
possible,
of
how
many
faults
you
can
tolerate
right.
As
long
as
it
is
possible
we're
an
Apache
to
ATO
project,
you
can
check
us
out
on
github
and
we
started
out
not
only
being
open
source
but
also
with
open
api's,
which
means
you
don't
have
to
learn
a
new
language
right,
we're
a
multi
model
database
and
we're
going
to
look
at
what
api
is.
We support the Cassandra Query Language and Redis, and we're working on Postgres, so these are well-known languages. So what were our core database goals? We wanted it to be very similar to Google Spanner at the core. In terms of the CAP theorem, we're a consistent and partition-tolerant database, a CP database, but we can offer HA. Isn't that like bending the laws of physics? Well, no, not really, because what we give up is just a few seconds of availability.
On a failure, we are able to re-elect a master and a leader in just a matter of a few seconds, and so you will still get your high availability. There's another way to look at it, which is the PACELC theorem, but the long and the short of it is: you can get all three elements of the CAP theorem, consistency, availability and partition tolerance,
as long as you don't hit failures. The trouble really starts when you hit a failure, and when you do, you trade a little bit of latency for consistency, because it takes a little bit of time for a new leader to get elected. At the core of the database is a document model. It can support SQL-like APIs on top; it has distributed transactions but is single-row transactional by default; and it also has the ability to support a native JSON data type, like a lot of NoSQL, MongoDB-type databases.
Okay, on the API side, we started with well-known APIs, because some of these APIs are really good for modeling distributed systems. We started with the Cassandra Query Language, which knows how to partition data across nodes, and knows, when you add a new node, how to go and read the subset of data that the new node now contains with very low latency. Or a Redis-like API, which is really good at modeling data structures, except this Redis is a full-blown database:
you don't have to worry about persisting your data elsewhere. And we're working on Postgres now. For Postgres, we have to teach it how to split data across nodes, and how to make the driver aware of multiple nodes to get performance out of a multi-node system; those are things we're working on. The core of the database, irrespective of which API you use, contains the same set of features.
You can do tunable read consistency from all of these APIs, which means you can read from your nearest data center and get a cache-like experience, or go read from the actual leader and use it like a database. It's write-optimized for large data sets, so we do streaming ingest really well, and batch reads and writes are pretty good. Data expiry with TTL (time to live) is supported by the database; most SQL databases do not have this.
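As a rough illustration of what that tunable read consistency could look like from an application, here is a minimal sketch using the standard Python cassandra-driver against the Cassandra-compatible API. The host, keyspace, and table names are made up, and mapping consistency level ONE to a nearby-replica read is an assumption for illustration, not a statement of how YugabyteDB implements follower reads.

```python
# Minimal sketch only: the same query issued at two consistency levels
# through the open-source Python cassandra-driver. Host, keyspace and
# table names are illustrative, not from the talk.
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel

session = Cluster(['127.0.0.1']).connect('yugastore')  # hypothetical keyspace

# Strongly consistent read: goes to the tablet leader, database-like behavior.
leader_read = SimpleStatement(
    "SELECT * FROM products WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM)
print(session.execute(leader_read, [1]).one())

# Relaxed read: may be served by a nearby replica, cache-like behavior.
nearby_read = SimpleStatement(
    "SELECT * FROM products WHERE id = %s",
    consistency_level=ConsistencyLevel.ONE)
print(session.execute(nearby_read, [1]).one())
```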
Okay, a very high-level, quick look at the architecture; I'm hoping you guys will ask me questions if something is not clear. At the core, it's a scale-out system: you just take a whole bunch of commodity machines or containers and put them together, and the database recognizes them and presents itself as a coherent whole. Each of these nodes or pods contains a document store, which is a heavily extended and modified version of RocksDB that works well with a Raft-based replication engine.
But we have additional things: we have extended Cassandra to add begin-transaction/end-transaction type semantics to do distributed transactions, which Cassandra does not support, as well as a JSON data type and so on. For those you will need to use our drivers, but they are in the open source as well.
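As a hedged sketch of what those extensions look like in practice, the DDL below uses the transactions table property and the jsonb column type as they appear in YugabyteDB's YCQL documentation; the exact spelling may have differed in the version shown in this talk, and the keyspace and table names are invented.

```python
# Illustrative DDL sent through the regular Python cassandra-driver; the
# 'transactions' table property and the 'jsonb' type are Yugabyte CQL extensions.
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect()
session.execute("CREATE KEYSPACE IF NOT EXISTS demo")

# Opt this table in to multi-row distributed transactions and give it a
# native JSON column.
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.orders (
        order_id text PRIMARY KEY,
        details  jsonb
    ) WITH transactions = { 'enabled' : true }
""")
```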
A tablet is a uniform partitioning of the key space: you take the entire key space and we chunk it up into a lot of small tablets. Each node hosts some set of tablets. The tablets have replicas that go across multiple nodes, depending on the replication factor. Each tablet, with its peers, will elect a leader, and they are a completely independent unit. There is no external dependency, so no ZooKeeper, nothing; each tablet does its leader election by itself, internally.
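A purely illustrative sketch of that idea, not YugabyteDB code: the hash space is pre-split into a fixed number of tablets, every key lands deterministically in one tablet, and tablets are spread across nodes. In the real system each tablet's Raft group elects its own leader; the round-robin placement below is just for the sketch.

```python
# Conceptual sketch only: uniform partitioning of a hash space into tablets.
import hashlib

NUM_TABLETS = 16          # arbitrary for the sketch
HASH_SPACE = 2 ** 16      # pretend 16-bit hash space

def tablet_for_key(key: str) -> int:
    """Hash the key and map it into one of the uniform tablet ranges."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16) % HASH_SPACE
    return h * NUM_TABLETS // HASH_SPACE

nodes = ["node-1", "node-2", "node-3"]
# Spread tablets across nodes round-robin, just to illustrate placement.
placement = {t: nodes[t % len(nodes)] for t in range(NUM_TABLETS)}

key = "user:1234"
t = tablet_for_key(key)
print(f"key {key!r} -> tablet {t} hosted on {placement[t]}")
```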
[Audience question]

Yeah, great question. The question was: if two sets of data have a relationship, do you put them into the same tablet even if they're in different tables? No, we do not; they're in different sets of tablets. However, we allow these tablets to be co-located on the same machine. This is a feature that we are working on, it's on our roadmap; we call them co-partitioned tables, so these tables are co-located and split the same way.
Okay, so all of this runs on top of anything: no external dependencies, it just needs memory, CPU and disk, and it can go off and do its thing. And because of the API compatibility, it's very well integrated with the ecosystem. Spark works on top of the database, you can use KairosDB for time-series data, or JanusGraph as a graph database, or a Spring Boot type ecosystem to build apps on top. So it's really easy. Yes, please, question.
[Audience question]

Great question. By the way, I think you guys are asking great questions; I'm supposed to be giving you stickers and I've forgotten, but anyway, you know who you are. So yes, it follows the laws of physics, you're right: if you're trying to achieve consistency across faraway geographies, it is going to take a longer time; there's just nothing you can do to get around that. But one of the things that we have is this:
we have extended Raft to have an observer node, like an async replica. So you can have a read-only replica that's placed far away, or an async replication pipeline, and that will not incur write latency. You'll be able to easily design it so that one set of machines that are geographically relatively close offers you strong consistency,
while you place a whole bunch of replicas far away, which will still get the writes in a timeline-consistent fashion, not eventually consistent but timeline consistent, and you'd be able to serve data from there. You'll be able to write data there as well; it will redirect the write to the right place and get the data written. So I think there was another question, or maybe not.
[Audience question]

Yeah, StatefulSets, I'm getting to that. The question was how it works with StatefulSets, and we'll get to that. So, all right, I'll quickly go through where we are with the project: a 0.0.9 beta version is publicly available on GitHub. You guys can try it out on your laptop, you can try the various scenarios, and it's pretty easy to do that.
We've stress tested up to about 50 nodes. We could keep going higher; we're very comfortable as a team running hundreds of nodes in a cluster, because we did that at Facebook, and the same design goes in here. What you'll notice is that even at 50 nodes we don't give up our low read latencies: read latencies are in the hundreds of microseconds for a simple key-value type read. In this case we're doing like 2.6 million reads at 200 microseconds, and 1.2 million writes at just 3 milliseconds, in a 50-node cluster.
Well, it turns out that if you architect it right and you go all the way down to the bottom, all the way to how the data is being replicated or how it's being written, you can actually get very good performance out of the system. These are our YCSB benchmark results; we've written a blog post about how it is that we can make our system have high performance along with consistency, and you can check it out on the blog at yugabyte.com. We support distributed ACID.
A
So
on
top
is
a
classical
bank
accounts
kind
of
table.
There
is
a
row
which
is
an
account
name.
Another
row,
which
is
an
account
type
and
a
third
row,
which
is
a
balance
which
is
the
total.
The
interesting
thing
with
Cassandra
query
language
is:
it
will
tell
you
how
you
can
partition
your
data?
It
looks
very
sequel
like
right,
but
you
can
partition
your
data
across
nodes
right.
We
want
to
keep
all
the
data
for
a
single
account
together
right.
So
that's
what
this
thing
is
doing.
A
It
is
charting
the
data
by
the
account
ID
and
keeping
all
the
account
types
very
close
on
disk.
So
it's
really
really
fast
right
and
the
transaction
that's
being
done
below
is
for
a
user
John
to
transfer
200
dollars
from
his
checking
account
to
a
user
Smith
who
could
like
live
on,
and
these
guys
could
live
on
different
nodes.
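A sketch of that example as it could be written against the Cassandra-compatible API with the Python driver: the table layout mirrors the description above, and the BEGIN/END TRANSACTION block follows YCQL documentation, so treat the exact column names and syntax as illustrative rather than a copy of the slide.

```python
# Illustrative sketch of the bank-transfer example described above.
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect()
session.execute("CREATE KEYSPACE IF NOT EXISTS bank")
session.execute("""
    CREATE TABLE IF NOT EXISTS bank.accounts (
        account_name text,
        account_type text,
        balance      bigint,
        PRIMARY KEY ((account_name), account_type)  -- partitioned by account, clustered by type
    ) WITH transactions = { 'enabled' : true }
""")

# Move $200 from John's checking account to Smith's checking account.
# The two rows may live on different nodes; the block commits atomically.
session.execute("""
    BEGIN TRANSACTION
      UPDATE bank.accounts SET balance = balance - 200
        WHERE account_name = 'John'  AND account_type = 'checking';
      UPDATE bank.accounts SET balance = balance + 200
        WHERE account_name = 'Smith' AND account_type = 'checking';
    END TRANSACTION;
""")
```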
Okay, so we're deployed at a whole bunch of customers, but this is one of the deployments that people really like, and I think this goes to the question you asked. This setup is geo-distributed: two copies of the data in US West, two copies in US East, and one copy in Tokyo, so completely planet-scale. This is an example of users logging in and changing their password, with the application running on all the data centers across the globe.
What we find is around 200-microsecond latencies for people trying to log in, and about 200-millisecond latency on average for people trying to change their password, because for consistency your data really has to travel; and in this particular example, it can survive a whole data center failure. If you think about users logging in and changing their password, a slightly higher latency on the order of 200 milliseconds is okay for changing a password, but you have to be absolutely quick to log in, and at 200 microseconds
that means you don't need a cache in front. It's multi-cloud, so it works with Amazon AWS, Google, on-premises, and we're working on Azure. It works with StatefulSets, which is what this whole talk is about, but anyway, the point is it has no dependencies and is easy to deploy anywhere. All right, so for the demo, we're going to look at YugaStore, which is a React.js and Node.js Express app, and the app is a shopping cart app.
It's going to be a bookstore type of app, and the app itself is on GitHub, so you guys can download it and try it out. You'd need to set up Yugabyte as a StatefulSet; there's a whole bunch of instructions there that you can check out. Because it's not very interesting to watch all these things get set up, I have already done these steps and kept the system ready, but effectively that's the one line you need to run in order to bring up Yugabyte in a StatefulSet. The second thing I did was containerize the application as well; it's the standard stack, a pretty popular stack, so I containerized that and ran the command up there. Again, you can find these commands on the GitHub repo. That brought the app up and running.
So let's go take a look at the app. Before that, this is my deployment on localhost, just showing the Kubernetes dashboard. What you see is a three-node cluster with three tablet servers, so it's got three nodes, or slaves, that do the actual work. Its replication factor is three, and that's why there are three masters: it's replicated three ways and can survive the failure of any one node. And finally, this guy here is the service
that's running the stateless application, and you can have a whole bunch of them; you can scale them out pretty easily, and so on. For the actual storage we use a persistent volume claim, so we're mounting the disks into each of the pods of the database and it uses that to store data. Yugabyte can also work fine with ephemeral local storage; just local disk is fine.
Okay, so I'm going to go into a deployment that I've done on Google Cloud. This is a GCP GKE install, a Google Kubernetes Engine based install. I ran the same set of commands after installing the gcloud tooling. What this shows is a four-node cluster replicated three ways, and these are the masters that form a highly available Raft group and keep track of the metadata in the system. There is one table in the system:
it's a products table in the yugastore keyspace. If you're familiar with SQL, a keyspace is like a namespace and a table is just a table. Okay, so if we go look at the tablet servers, or the slaves that are doing the work, you see that it's all a single cloud, single data center (which is like a rack or a region), single availability zone install.
So this is not a multi-cloud installation, but it can very easily become one; it's easy enough to deploy Yugabyte that way. Okay, so here's the app itself. The app is a bookstore app, and what it shows is a whole bunch of books with some data.
The books are split into categories: there are some static categories like business books, cookbooks, mystery books and so on, and there are some dynamic categories like "show me the books with the highest rating" or "show me the books that have the most reviews", typically what you would expect from a store. Now, how do we build this? Let me go back here. The multi-API strategy turns out to really simplify this, because you have Redis as a database that's still highly performant, and Cassandra as a highly consistent and performant database.
It's easy to put the less dynamic pieces of data, like the title of the book or the description and so on, into the Cassandra database, which just looks like a SQL table. On the Redis side, you can put the very dynamic pieces of data into Redis, for example the number of reviews or the total rating.
All of those go into Redis. So querying for the books in a category just looks like a SQL query, select star from something where category equals something, while you can use a Redis sorted set to keep track of your books sorted by the number of reviews or the ratings. That makes it really easy to model the app, because you have the right data structures at your fingertips, and you don't have to worry about what happens now that you've put your data in Redis.
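A rough sketch of that modeling, with invented keyspace, table and key names, using the Python cassandra-driver for the Cassandra-compatible API and redis-py for the Redis-compatible API; the WHERE clause assumes the category column is part of the key or indexed.

```python
# Illustrative only: static attributes in a Cassandra-style table, fast-moving
# rankings in a Redis sorted set, both served by the same cluster.
from cassandra.cluster import Cluster
import redis

cql = Cluster(['127.0.0.1']).connect('yugastore')   # hypothetical keyspace
r = redis.Redis(host='127.0.0.1', port=6379)        # Redis-compatible endpoint

# "Give me the books in the business category" reads like a SQL query.
books = cql.execute(
    "SELECT * FROM products WHERE category = %s", ['business'])

# Each new review bumps the book's score in a sorted set.
r.zincrby('books:by_num_reviews', 1, 'book:1234')

# "Top 10 books by number of reviews" is a single sorted-set read.
top10 = r.zrevrange('books:by_num_reviews', 0, 9, withscores=True)
print(list(books), top10)
```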
I'm actually going to show you this with respect to the queries, but the data is split into small shards. The shards are automatically moved across different nodes, so as you add nodes and remove nodes, the shards of data automatically rebalance themselves, and you don't have to worry about it.
Yes,
so
question
is:
if
you
add
nodes,
would
you
need
to
pay
the
balancing
cost
so
in
yoga
bite?
Because
it's
strongly
consistent?
We
actually
copy
compressed
file
so
as
the
equivalent
of
that,
so
the
database
automatically
takes
compressed
files
and
copies
them
over
exactly
the
subset
of
files
that
are
moving
to
the
new
node.
So,
yes,
there's
data
copy
cost,
but
the
data
is
very
efficiently
copied
and
this
does
not
affect
your
foreground
app
right.
So
you
because
you're
not
reading
through
the
database
you're
just
taking
the
file.
A
So
a
database
has
to
read
some
subset
of
data
from
the
disk
uncompress
it
and
then
do
a
whole
bunch
of
reads
and
resolves
on
them
in
order
to
serve
a
user
query
right
and
most
typical
databases
today
like
if
you
take
a
Cassandra
or
a
MongoDB
or
most
of
these
databases.
If
you
add
a
new
node,
will
read
almost
through
the
database
to
resolve
the
data
and
then
get
the
data
out
and
copied
into
the
new
node
while
with
yoga
bite.
Besides that, you don't have to worry about any of that stuff. Typically the expectation is that if you enable distributed transactions, you will be subject to clock skew. One of the things that Yugabyte can do, and we're working on it, is to integrate with an external atomic clock service, something like Amazon's Time Sync Service, at which point your distributed transactions are reasonably performant as well.
But if your query pattern becomes single-row, it automatically starts getting the benefit of high performance, and we do the work to compute at what point a single-row ACID operation conflicts with, say, a provisional write for a multi-row transaction; we do all of that tracking. All right, so what I was going to show is that you can easily run a query like that, and that's the type of query you would run to say, hey, give me the books in the business category.
A
Give
me
3
books
in
the
business
category
or
you
could
say,
give
me
3
books
in
the
give
me
all
the
books
in
the
business
category
that
are
hardcover
books
right,
because
your
app
keeps
evolving.
You
have
to
show
different
things
right
and
that's
what
we
just
did
with
a
Kassandra
query
language,
but
connecting
to
the
same
cluster
right,
no
difference
right,
I'm
now
connecting
to
the
same
cluster
and
as
a
database
I
can
interact
with
it
using
the
Redis
CLI,
which
tells
me
give
me
the
top
10
books
sorted
by
the
number
of
stars.
So the question was: is this table the same products table that was there on the Cassandra side? And the answer is no, it's not. These are two different tables internally, but they're both persisted with strong consistency, so you can write to Redis and to Cassandra. What you would have traditionally done is write to Cassandra, or SQL, for one type of data; the other type
A
You
would
keep
it
in
memory
in
Redis
and
you
would
have
figured
out
another
place
to
persist
their
underlying
data
for
Redis.
So
if
reddy's
dies,
you
can
go,
read
it
from
there
repopulated
right.
What
here,
what
it
happened.
Redis
is
a
different
DB
Cassandra,
our
sequel
is
a
different
DB
and
you'll
have
a
third
DB.
When
you
do
a
backup,
you'll
have
to
figure
out
all
of
this
stuff.
In
this
case,
you
just
write
into
one
database.
A
[Audience question]

They use a different format, but the files are in a similar format that the core database understands. The core database is the same; it has different API layers on top. Think of it as Yugabyte tables at the core: there's one Yugabyte table for the Cassandra products and a different Yugabyte table for Redis, and they both write data into the underlying DocDB in a common format, but each will lay data out the way it wants, in the way that is optimal for Redis or Cassandra.
[Audience question]

The question is: in case a node fails, what is the reconstruction time? Let's split that into two parts. A node fails: how will my queries be affected? If a node fails, only the tablets that have leaders on that node will get affected, but in a matter of seconds, which is a configurable value, currently a few seconds, two to three seconds,
you will have new guys serving the values. Let's take a three-node example: each piece of data is replicated three ways, and each piece of data is in a Raft group. So the minute a node dies, within a few seconds, two to three seconds, the next guy will take over as the leader and start serving, and the client is smart enough to know this guy died.
[Audience question]

Yeah, we'll get to that, that's a great question. If you don't have a fourth node, you can't do much. But let's say you had four nodes with a replication factor of three. If a node fails, there's a configurable time where we wait for the node to come back and don't do anything, because you don't want to simply do unnecessary work. That configurable timeout is five minutes at this point; after five minutes we will automatically re-replicate its data to an existing node, if there is one.
The re-replication will go on in the background; your foreground queries are not affected, and your app is still running. He's asked me to speed up, but I will take your question, just give me one second. The other thing I wanted to show you guys is the load: users, or the Bangladeshi click farm for those of you who watch Silicon Valley. So there you go, farm activated. Now we can go and see what's going on in the database. The minute I did a refresh,
what you see is that the reads and writes start getting scattered across the different nodes, and depending on the app and the type of keys, you will see that the load distribution will vary. One of the things I would also like to show is that we can actually scale the number of replicas of the StatefulSet on the fly, so we'll be able to just add a new guy.
What you notice is that a new node just got added to the system. At Yugabyte we're all about running in production, so we won't rush into copying all the data; you can throttle how slowly you want the new node to start taking data. It's going to happen in the background in a way that your app should not get affected; that's the most important thing. So what you'll notice is that,
A
If
you
look
at
the
number
of
shards
right,
you
have
a
certain
number
of
shards
and
you're,
seeing
that
the
load
is
slowly
getting
moved
to
the
new
node
while
the
load
is
like
the
data
is
being
served
by
all
the
other
guys.
So
if
I
refresh
you'll
see
that
more
shards
slowly
start
moving
and
the
data
gets
copied,
the
node
now
comes
into
the
system.
It
starts
serving
data,
and
all
of
this
is
online
right.
I mean, we can wait for it to finish, but I think you'll probably kill me, so let's move on. Our database is Apache 2.0, it's in the open source, and we'd love for you guys to get involved. It's C++, but hey, it's still a good language, and it's highly performant. Anyway, download it, try it; you can play with a lot of failure scenarios on your laptop.
[Audience question]

It's a great question. We offer tunable consistency, so if you want to treat it as a database, the read has to go to the leader node, because that's just the way it works. But if you want to read from a follower, we allow that option, and you can also optionally read from the follower in the data center nearest to the app.
[Audience question]

Okay, so it depends on a lot, but I'll answer in a slightly different way. We do prefix compression for the data in memory, and then on top of the prefix compression we do Snappy compression and store it on disk. So it's pretty well compressed, but that's a hard question to answer exactly. Yes, please: it's a modified RocksDB, in the sense that we have taken out RocksDB's write-ahead log, and we took out RocksDB's MVCC.
Yeah,
so
his
question
is:
is
it
a
Roxie
B
at
the
bottom
and
and
the
second
follow-up
question?
Was
it's
still
a
level
DB
fork,
so
it
has
a
very
large
write
amplification.
We
have
changed
the
way
the
compactions
work
based
on
our
HBase
learnings,
so
we
actually
do
a
size
tiered
compaction.
So
we
try
to
strike
a
strike,
a
trade-off
between
size,
amplification
and
number
of
rewrites
right
so
and
today's
systems,
IO,
is
the
bigger
bottleneck,
then
draw
disk
size
right,
but
but
there's
I
mean
the
answer.
A
is also not that straightforward, because it's not a single tablet on a machine. Suppose you have 500 gigs of data on a machine: if it's a very simple single-shard system, you would need another 500 gigs of space to do compaction, so you can only use 50%. But we do sharding inside, so it's really 500 gigs divided by the number of shards, and there's a whole bunch of computation which decreases the space needed.
[Audience question]

So, it's on our roadmap, but here's our main thing with joins: as you add nodes, your join performance could get worse for the wrong kind of join. This is the one sticking point. A lot of companies, Facebook for example, have an entire SQL tier, but they don't do joins, they don't do cross-node joins; they try to place data in such a way that the data you want to join comes together, which was the co-partitioned table kind of example.
[Audience question]

Yeah, so the question was: what kind of assumptions does Yugabyte make about storage, and how do we qualify it for performance, for transactions? We assume we run on top of a file system; we typically recommend XFS because we have a lot of experience with it, but any file system will do. We assume that the file system has fsync support, so we are able to store data durably, and so on. At the core,
A
We
assume
that
you
have
a
SSD
type
device,
because
SSD
is
the
norm
now
we
are
not
like
you
know,
still
designing
for
the
hard
disk
age.
That
said,
however,
we
have
enough
architectural
depth
in
the
inside
gigabyte
to
enable
tearing
of
data
to
hard
disk
like
tears
automatically
like
in
the
background,
so
you
will
still
be
able
to
move
older,
colder
data
to
cheaper
tiers,
but
keep
your
heart
data
in
the
in
the
serving
tier
right.
A
So
and
now,
as
far
as
transactions
go,
it's
a
split
like
transactions
depend
on
like
distributed,
transactions
depend
on
clock,
skew
single
row.
Transactions
depend
on
your
network
as
well,
so
it's
not
just
storage.
So
from
a
storage
perspective
we
just
require
SSDs
the
faster.
It
is
the
faster
you
get,
but
at
a
certain
point
it
will
be
diminishing
returns
because
these
other
guys
will
take
over
sure.
Oh.
Yeah, one point that Kannan was mentioning is that we don't depend on software RAID, so we can just use JBOD, just a bunch of disks. Any more questions, guys? All right, I've got stickers; I forgot to hand them out to all the people who asked good questions, but please do come get them, and apparently there are more in the back as well. So please do stop by and say hi. Thank you. Thanks.