From YouTube: The Evolution of TiKV
Hello everyone, thanks for joining us today. I am Charles, a software engineer at PingCAP. Today I'm going to talk about the evolution of TiKV. The talk will be divided into four sections. First, we will briefly introduce the history of TiKV and summarize the major requirements for a cloud native key-value store.
OK, let's get started. First, why did we want to build TiKV in the first place? There are so many key-value stores out there, so why did none of them fit our case, and what are the requirements for a cloud native key-value store? To answer that, we have to talk a little bit about the history of TiDB, a cloud native database that supports both OLTP and OLAP workloads. Back in 2015, when we built the first version of TiDB, we built it on top of HBase and HDFS.
If you have ever dealt with the Hadoop software stack, you may have the same feeling: it is not easy to work with. In addition, since we wanted to develop a database supporting distributed transactions, we had to add distributed transaction support on top of HBase, which caused poor performance and high operation costs.
After suffering enough, we finally decided to build our own storage engine with the following requirements. First, it must be able to leverage modern cloud infrastructure to easily scale out like other NoSQL systems. Second, it should support high-performance I/O. Third, it should support distributed transactions inherently. Fourth, it should use a modern data replication protocol, and we chose the Raft protocol. Fifth, we want to make sure it is easy to maintain, so it must use a clean architecture. And last but not least, it should be easy to use.
Anyone who has used another key-value store should be able to pick it up very quickly. So from day one, we decided to develop a storage engine that can be a solid building block for other distributed systems and really help people out, and this principle applies to all software from the PingCAP community. In general, we can divide the aforementioned requirements into four categories, which is to say the cloud native storage engine we wanted to build should first be highly scalable.
Second, it should ensure data consistency. As we mentioned before, we want to build a storage engine that supports distributed transactions inherently and uses a modern data consensus protocol. Third, it should support high-performance I/O: if we have hundreds of clients trying to read or write the storage at the same time, it is critical that they can get a response within a short time. And of course, it should be extremely reliable, since a key-value store usually serves as the bedrock of the whole system.
Next, I will try to elaborate on how TiKV meets all these requirements. But before that, let me briefly introduce the overall system architecture of TiKV at a high level, and hopefully it can help you better understand the rest of this talk.
A TiKV cluster looks like this. The TiKV nodes are at the bottom; you can have as many as you like, but we usually suggest having at least three. We split the data into regions. I know the name is confusing, but you can just treat a region as a subset of the data. As shown in the graph, we split the complete data set into five subsets, which are five regions. Each region contains multiple replicas, which are spread across different TiKV nodes. In most cases, three replicas should be enough. On the top we have the TiKV client, communicating with the TiKV cluster through the gRPC protocol, and on the top right we have the placement driver, which is the scheduler of the TiKV cluster.
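To make that client path a little more concrete, here is a minimal sketch assuming the community tikv-client Rust crate and a placement driver listening on 127.0.0.1:2379 (both are my assumptions for illustration, not something stated in the talk): the client asks PD where each key lives and then talks to the TiKV nodes over gRPC.

    // Minimal sketch; assumes the `tikv-client` and `tokio` crates as dependencies
    // and a PD endpoint at 127.0.0.1:2379 (assumptions, not from the talk).
    use tikv_client::RawClient;

    #[tokio::main]
    async fn main() -> Result<(), Box<dyn std::error::Error>> {
        // The client contacts the placement driver first, which tells it
        // which TiKV node holds the region leader for each key.
        let client = RawClient::new(vec!["127.0.0.1:2379"]).await?;

        // Reads and writes then go over gRPC to the relevant TiKV nodes.
        client.put("hello".to_owned(), "world".to_owned()).await?;
        let value = client.get("hello".to_owned()).await?;
        println!("value = {:?}", value);
        Ok(())
    }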
One level up, we have the consensus layer. This layer presents an abstraction over the multiple replicas, which allows the upper layers to feel like they are dealing with one piece of data instead of multiple replicas. Like etcd, we use the Raft consensus protocol to ensure data consistency between the replicas across multiple TiKV nodes, but unlike etcd, we use multiple Raft groups (Multi-Raft) instead of a single Raft group to improve scalability and I/O throughput, which we will cover in the following slides. On top of the consensus layer,
we have the transaction layer. We use multiversion concurrency control (MVCC) to implement the Google Percolator protocol, which commits transactions using a two-phase commit algorithm. A lot of database terms, right? If you feel like it is a little bit too much, that's normal; it is not easy for database experts either. In my opinion, distributed transactions are one of the hardest parts of a database management system to understand, but fortunately TiKV has already implemented them and we can simply rely on it.
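To make those terms a little more concrete, here is a minimal sketch of what the transaction layer looks like from the client's side, again assuming the tikv-client crate's transactional API and a local PD endpoint (my assumptions); the Percolator-style two-phase commit and the MVCC bookkeeping happen inside commit().

    // Minimal sketch; assumes the `tikv-client` and `tokio` crates and a PD
    // endpoint at 127.0.0.1:2379. Illustrative only.
    use tikv_client::TransactionClient;

    #[tokio::main]
    async fn main() -> Result<(), Box<dyn std::error::Error>> {
        let client = TransactionClient::new(vec!["127.0.0.1:2379"]).await?;

        // Begin a transaction: it gets a start timestamp and, with it,
        // its own consistent snapshot of the database.
        let mut txn = client.begin_optimistic().await?;

        // Read-modify-write inside one transaction.
        let existing = txn.get("counter".to_owned()).await?;
        let new_value = if existing.is_some() { "updated" } else { "initial" };
        txn.put("counter".to_owned(), new_value.to_owned()).await?;

        // commit() runs the Percolator-style two-phase commit (prewrite, then
        // commit) under the hood; we only see success or a conflict error.
        txn.commit().await?;
        Ok(())
    }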
First, let's try to find out: is TiKV highly scalable? In TiKV, we divide large data sets into small regions, with each region containing several replicas that make up their own Raft group. The number of replicas may change depending on how the TiKV nodes are spread geographically, but in most cases each single TiKV node does not need to store a full copy of the data. That is to say, TiKV is horizontally scalable.
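As a toy illustration of that idea (my sketch, not TiKV's actual code), the snippet below splits a key space into a handful of regions and spreads three replicas of each region across a set of nodes, so no single node holds the full data set.

    // Toy sketch of range-based regions with replicas spread across nodes.
    // Illustrative only, not TiKV's real data structures.

    #[derive(Debug)]
    struct Region {
        start_key: String,    // inclusive
        end_key: String,      // exclusive ("" means unbounded)
        replicas: Vec<usize>, // node ids holding a replica of this region
    }

    fn place_replicas(num_regions: usize, num_nodes: usize, replication: usize) -> Vec<Region> {
        (0..num_regions)
            .map(|i| Region {
                start_key: format!("k{:04}", i * 1000),
                end_key: if i + 1 == num_regions {
                    String::new()
                } else {
                    format!("k{:04}", (i + 1) * 1000)
                },
                // Spread replicas round-robin so no node holds a full copy.
                replicas: (0..replication).map(|r| (i + r) % num_nodes).collect(),
            })
            .collect()
    }

    fn main() {
        for region in place_replicas(5, 4, 3) {
            println!("{:?}", region);
        }
    }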
If you remember, we have a central component called the placement driver which acts as the scheduler of the TiKV cluster. As the replicas of each region make up their own Raft group, the leader of each group holds the complete information about the region, like the number of active replicas, the topology of the replicas, and so on, and the leader of each group sends a heartbeat to the placement driver periodically carrying all this information.

Then, when the TiKV cluster receives a request from a client, it will first query the placement driver to get the target node and then route the request to that node. The placement driver also allows users to create highly customized configurations, like the maximum number of replicas in each region, how to rebalance the leaders across nodes, and the scheduling throughput, because too much data balancing and migration may affect the performance of online services.
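Here is a toy sketch of that routing step (illustrative only; the real placement driver protocol is gRPC-based and much richer): find the region whose key range covers the key, look up that region's leader node, and send the request there.

    // Toy sketch of client-side routing via a placement-driver-like lookup.
    use std::collections::BTreeMap;

    struct RegionInfo {
        leader_node: String,
    }

    // Map from region start key -> region info, as a client-side cache of what
    // the placement driver hands back.
    fn route<'a>(regions: &'a BTreeMap<String, RegionInfo>, key: &str) -> Option<&'a RegionInfo> {
        // The region whose start key is the greatest one <= key covers the key.
        regions
            .range(..=key.to_string())
            .next_back()
            .map(|(_, info)| info)
    }

    fn main() {
        let mut regions = BTreeMap::new();
        regions.insert("a".to_string(), RegionInfo { leader_node: "tikv-1".into() });
        regions.insert("m".to_string(), RegionInfo { leader_node: "tikv-2".into() });

        if let Some(info) = route(&regions, "hello") {
            println!("send request to {}", info.leader_node); // tikv-1
        }
    }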
To sum up: by breaking the data into partitions and spreading the replicas of each partition across nodes, a TiKV cluster can be easily scaled out to hundreds of nodes. Here are two large real-world TiKV clusters. The left one belongs to one of our users: it is a TiDB cluster backed by a large TiKV cluster consisting of 168 nodes, with 1,820 billion rows and 318 terabytes of data, which needs to support 100 million reads and 87,000 writes per second. It used to be the largest one, but soon we had a larger one, which consists of 212 nodes and can hold up to 827 terabytes of data.
We have not tried to push further yet, but we did not experience any pressure with this cluster, so I believe we can actually grow larger. OK, now we have met the scalability requirement. Next, we need to ensure data consistency. When we are talking about the data consistency of a distributed database or data store, we usually need to achieve two goals: one is isolating transactions from each other, and the other is keeping the data consistent between replicas.
These are two complicated topics, each worth a whole session to discuss, and we do not have time for that today, so let's try to keep it simple. In short, a transaction is an operation unit containing multiple operations. For example, you may want to read an entry and update it depending on the existing value; this involves one read operation and one write operation. Isolation between transactions means two transactions will not interfere with each other. Meanwhile, we have multiple replicas for each data partition,
so we need to make sure the contents of the replicas are consistent with each other. As we can see in the graph, the left subtree contains the different transaction isolation levels and the right subtree contains the local object consistency models. If you still remember, TiKV internally contains four layers, and the two layers in the middle are the transaction layer and the consensus layer. The transaction layer ensures two things. One is snapshot isolation, which means we create an independent, consistent snapshot of the database for each transaction.
Its changes are visible only to that transaction until commit time. This allows us to handle transactions concurrently if they do not interfere with each other. The other is repeatable read, which means any data that has been read cannot change: if the transaction reads the same data again, it will find the previously read data in place, unchanged. As for the consensus layer, by using the Raft consensus protocol TiKV ensures strong consistency, also known as linearizability, among the replicas of each object.
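To give a feel for how MVCC makes those guarantees possible, here is a toy versioned store (my illustration, not TiKV's actual MVCC layout): every committed write adds a new version at a timestamp, and a read at a snapshot timestamp only ever sees versions committed at or before it, so repeated reads return the same data.

    // Toy MVCC store: each key maps to a list of (commit_ts, value) versions.
    // A read at snapshot `ts` returns the newest version with commit_ts <= ts.
    use std::collections::{BTreeMap, HashMap};

    #[derive(Default)]
    struct MvccStore {
        versions: HashMap<String, BTreeMap<u64, String>>,
    }

    impl MvccStore {
        fn write(&mut self, key: &str, commit_ts: u64, value: &str) {
            self.versions
                .entry(key.to_string())
                .or_default()
                .insert(commit_ts, value.to_string());
        }

        fn read(&self, key: &str, snapshot_ts: u64) -> Option<&String> {
            self.versions
                .get(key)?
                .range(..=snapshot_ts)
                .next_back()
                .map(|(_, v)| v)
        }
    }

    fn main() {
        let mut store = MvccStore::default();
        store.write("k", 10, "v1");
        store.write("k", 20, "v2");

        // A transaction whose snapshot is at ts=15 keeps seeing "v1", even
        // though a later transaction committed "v2" at ts=20.
        assert_eq!(store.read("k", 15), Some(&"v1".to_string()));
        assert_eq!(store.read("k", 25), Some(&"v2".to_string()));
        println!("snapshot reads are repeatable");
    }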
The next requirement we need to meet is high-performance I/O. On the read side, we use multiversion concurrency control to implement two-phase commit, which means we keep a version with each key-value pair and create an isolated snapshot for each single read and write. This prevents a read request from being blocked if there is an ongoing write on the same entry. We also apply many other optimization approaches.
For example, we support follower reads, which allow us to spread the read workload across all replicas, and we prioritize small reads to prevent the overall throughput from being affected by a few large reads. As for the writes: since we divide data into partitions, with the replicas of each partition making up their own Raft group, write requests on key-value pairs belonging to different partitions can be handled concurrently.
Large values would otherwise amplify this write work; Titan avoids that by storing large values outside of RocksDB. As for scans: since we do range-based sharding, which divides the data into continuous ranges determined by the sorted keys, scanning keys that share the same prefix can be very efficient, since they are usually located in a handful of regions.
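Here is a toy example of why prefix scans are cheap over range-sharded, sorted keys (illustrative only): keys that share a prefix are adjacent in sort order, so the scan walks one contiguous range instead of the whole data set.

    // Toy prefix scan over sorted keys. Because keys sharing a prefix are
    // adjacent in sorted order, the scan touches one contiguous range, which
    // in TiKV terms usually means a handful of regions.
    use std::collections::BTreeMap;

    fn scan_prefix<'a>(data: &'a BTreeMap<String, String>, prefix: &str) -> Vec<(&'a String, &'a String)> {
        data.range(prefix.to_string()..)
            .take_while(|(k, _)| k.starts_with(prefix))
            .collect()
    }

    fn main() {
        let mut data = BTreeMap::new();
        data.insert("user:1:name".to_string(), "alice".to_string());
        data.insert("user:1:email".to_string(), "a@example.com".to_string());
        data.insert("user:2:name".to_string(), "bob".to_string());
        data.insert("order:9".to_string(), "pending".to_string());

        // Only the two "user:1:" entries are visited.
        for (k, v) in scan_prefix(&data, "user:1:") {
            println!("{} = {}", k, v);
        }
    }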
Last but not least, we need to make sure that TiKV is reliable; otherwise, all the aforementioned advantages would be meaningless. We usually recommend users to run at least three TiKV nodes and keep three replicas for each partition. By default, we will spread the replicas across nodes; therefore, we can tolerate the failure of some of the nodes. If you plan to deploy TiKV on a public cloud, you can also spread the TiKV nodes across multiple availability zones, and the placement driver will place a complete copy of the data in each zone.
In addition, we also make sure that TiKV is well tested. We run regression tests frequently to ensure that new features are backward compatible and will not break historical versions. We also run large multi-day performance tests that use real-world data to ensure the performance is predictable, since many users adopt TiKV as the backend storage engine to build their own software-as-a-service products.
Predictable performance is critical for making service level agreements for higher-level services. If you are familiar with Jepsen, they run tests on various distributed systems to validate their consistency claims. We also run a range of Jepsen tests on TiKV to ensure we deliver the transactional promises we have made.
As we can see, TiKV is a highly scalable, distributed key-value store with high-performance I/O and high reliability. It works pretty well in an on-premise environment; however, we are seeing an ever-growing demand for deploying TiKV on public clouds, and public cloud infrastructure is very different from on-premise infrastructure.
Nowadays, most public cloud vendors provide virtualized disks. Those disks can be mounted to a local file system and they appear like a local disk as well, but internally they forward I/O to multiple remote disks that are potentially shared by multiple users. AWS EBS, for example, will replicate every write I/O to three different locations.
Ideally, we want a large TiKV cluster to behave similarly to a traditional relational database system, but unfortunately that is actually hard to accomplish on cloud storage. First, we want to build a scalable service, but scale means more failures. To be more specific, we are worried that as the system scales, its storage hardware performance is very likely to degrade.
Cost is a problem too, because every storage operation is charged by the cloud vendors. Users now have more reason to care about exactly how and why our system is using those resources, and by that I mean read and write amplification, which is the amount of I/O the system needs to issue before finishing one user request.
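Put another way (my phrasing, not a formula from the talk), write amplification can be expressed roughly as:

    write amplification ~= bytes physically written to storage (user writes + compaction + GC + replication)
                           / bytes written by the user

and read amplification is defined analogously for reads.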
Here is a simple graph demonstrating our system's runtime usage of I/O resources over time. The user writes are very stable, as you can see from the yellow bar here, but they are amplified multiple times because of background writes, which include compaction (since TiKV uses RocksDB internally) and garbage collection. In addition to that, large events incur extra I/O that is usually not predictable from this graph.
Now, let's talk about how we accomplish the primary goal. Raft Engine maintains an in-memory index of all log entries. The reason we do that is not to improve read performance; it is actually about reducing the background work. In RocksDB, compaction is needed to keep all data sorted and to clean up deleted data, but in Raft Engine we don't need to sort anything, and garbage collection doesn't need to read out obsolete data, because we have a map of all active data in memory. So for Raft Engine, no additional I/O tuning is required and there is no extra overhead.
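A toy version of that idea (my sketch, not Raft Engine's actual design): keep log entries in append-only files plus a small in-memory map from (Raft group, log index) to the entry's file location, so garbage collection just forgets index entries and deletes whole files once nothing in them is live, without ever reading the obsolete data back.

    // Toy sketch of an append-only log with an in-memory index, in the spirit
    // of what the talk describes for Raft Engine. Illustrative only.
    use std::collections::HashMap;

    #[derive(Clone, Copy, Debug)]
    struct Location {
        file_id: u64,
        offset: u64,
    }

    #[derive(Default)]
    struct ToyLogEngine {
        // (raft group id, log index) -> where the entry lives on disk.
        index: HashMap<(u64, u64), Location>,
        // How many live entries each append-only file still holds.
        live_per_file: HashMap<u64, u64>,
    }

    impl ToyLogEngine {
        fn append(&mut self, group: u64, log_index: u64, loc: Location) {
            self.index.insert((group, log_index), loc);
            *self.live_per_file.entry(loc.file_id).or_insert(0) += 1;
        }

        // Garbage collection: forget entries up to `up_to` for a group. We never
        // read obsolete data; files with no live entries can simply be deleted.
        fn compact(&mut self, group: u64, up_to: u64) -> Vec<u64> {
            let stale: Vec<(u64, u64)> = self
                .index
                .keys()
                .filter(|(g, idx)| *g == group && *idx <= up_to)
                .copied()
                .collect();
            for key in stale {
                if let Some(loc) = self.index.remove(&key) {
                    let live = self.live_per_file.entry(loc.file_id).or_insert(0);
                    *live = live.saturating_sub(1);
                }
            }
            // Files that are now fully obsolete can be removed as a whole.
            self.live_per_file
                .iter()
                .filter(|(_, live)| **live == 0)
                .map(|(file, _)| *file)
                .collect()
        }
    }

    fn main() {
        let mut engine = ToyLogEngine::default();
        engine.append(1, 1, Location { file_id: 1, offset: 0 });
        engine.append(1, 2, Location { file_id: 1, offset: 64 });
        engine.append(1, 3, Location { file_id: 2, offset: 0 });
        let deletable = engine.compact(1, 2);
        println!("files safe to delete: {:?}", deletable); // [1]
    }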
We categorize the writes of the system I/O into three different priorities; let's call them priority A, priority B, and priority C here. During execution, we periodically assign individual I/O limits to those priorities. At the beginning, the I/O limits are high and all I/O runs with no restrictions; eventually, the I/O usage approaches the predefined global limit.
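A toy sketch of priority-based I/O budgeting along those lines (my illustration, not TiKV's actual rate limiter): as long as total demand stays under the global limit nothing is throttled, and once it goes over, the remaining budget is handed out from the highest priority down.

    // Toy sketch of priority-based I/O budgeting: high-priority traffic is
    // served first, and lower priorities only get whatever budget is left once
    // usage exceeds the global limit. Illustrative only.

    #[derive(Debug, Clone, Copy)]
    enum Priority {
        High,
        Medium,
        Low,
    }

    // Given recent demand (bytes/sec) per priority, highest priority first,
    // and a global limit, return the per-priority budgets for the next interval.
    fn assign_budgets(demand: [(Priority, u64); 3], global_limit: u64) -> Vec<(Priority, u64)> {
        let total: u64 = demand.iter().map(|(_, d)| *d).sum();
        if total <= global_limit {
            // Under the global limit: everyone runs unrestricted.
            return demand.to_vec();
        }
        // Over the limit: hand out budget from the highest priority down.
        let mut remaining = global_limit;
        demand
            .iter()
            .map(|(p, d)| {
                let granted = (*d).min(remaining);
                remaining -= granted;
                (*p, granted)
            })
            .collect()
    }

    fn main() {
        let demand = [
            (Priority::High, 400),   // e.g. foreground user writes
            (Priority::Medium, 300), // e.g. Raft log and flush traffic
            (Priority::Low, 500),    // e.g. compaction and garbage collection
        ];
        for (p, budget) in assign_budgets(demand, 800) {
            println!("{:?}: {} bytes/sec", p, budget);
        }
    }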