Description
Casey Bodley walks us through RGW Multisite, particularly the replication path.
Welcome everybody, I'm going to be doing a code walkthrough of multisite, and I'm going to be focusing mostly on the replication side, though I'll start by talking through some high-level stuff about how the log-based replication works: what the different multisite logs are, what the threading model of sync is, and how the coroutines work. Then I'll get into some of the code for metadata and data sync.
So, to start out, this is log-based replication. If we're talking about two different clusters in different parts of the world, we're going to put a zone on each one of them and link them together, so that when Zone A makes a change, it writes that change to a log locally, and Zone B will be reading the logs from Zone A and applying each change that it sees.
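
To make that shape concrete, here's a minimal sketch of the pull model with entirely hypothetical names (fetch_log_entries and apply_change are stand-ins, not RGW's API): the source zone appends to its log, and the replicating zone polls it, applies what it sees, and remembers how far it got.

```cpp
// Hypothetical sketch of pull-based log replication; not RGW code.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct LogEntry {
  uint64_t id;         // position in the source zone's log
  std::string change;  // description of the change to apply
};

// Stand-in for a REST call to the source zone: return entries after `marker`.
std::vector<LogEntry> fetch_log_entries(uint64_t marker) {
  static std::vector<LogEntry> log = {
      {1, "create bucket b1"}, {2, "put b1/obj1"}, {3, "put b1/obj2"}};
  std::vector<LogEntry> out;
  for (const auto& e : log)
    if (e.id > marker) out.push_back(e);
  return out;
}

void apply_change(const LogEntry& e) {  // apply locally on Zone B
  std::cout << "applying: " << e.change << '\n';
}

int main() {
  uint64_t marker = 0;  // persisted sync position in a real system
  for (const auto& e : fetch_log_entries(marker)) {
    apply_change(e);
    marker = e.id;  // only advance after the change is applied
  }
  std::cout << "synced through marker " << marker << '\n';
}
```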
There's the bucket index log, which is stored inside the bucket index and records all of the changes to objects, and then there's another log, the data changes log, which records which buckets and their shards have changes in them, because you can't really poll every single shard of every single bucket. So we poll the data changes log to figure out which buckets have changes, so that we can spawn sync on just the buckets that we need to focus on. Each of these multisite logs is sharded across multiple RADOS objects.
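
Since each log is sharded, writers need a deterministic way to pick a shard object. A minimal sketch of that idea (the object-name prefix and the hashing here are illustrative, not RGW's exact scheme):

```cpp
// Illustrative shard selection for a sharded log; not RGW's exact scheme.
#include <functional>
#include <iostream>
#include <string>

// Map a bucket-shard key to one of num_shards log objects by hashing,
// so all writers agree on which RADOS object records a given key.
std::string data_log_oid(const std::string& bucket_shard_key, int num_shards) {
  size_t h = std::hash<std::string>{}(bucket_shard_key);
  return "data_log." + std::to_string(h % num_shards);
}

int main() {
  // Entries for the same bucket shard always land on the same log shard.
  std::cout << data_log_oid("mybucket:shard-3", 128) << '\n';
  std::cout << data_log_oid("otherbucket:shard-0", 128) << '\n';
}
```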
So a stack is just a list of coroutines, where it runs each one to completion before starting on the next one, and then there's a coroutines manager here, which acts as a scheduler of the coroutine stacks. It keeps a list of stacks that are ready to run, loops through them, and calls their operate functions. So the coroutines manager just has a run function, which is synchronous: it just transfers control and runs all of the coroutines until everything completes.
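
Here's a toy model of that structure, just to show the relationships (none of this is the real RGWCoroutinesManager; it only illustrates stacks running their coroutines in order while the manager round-robins across stacks):

```cpp
// Toy model of coroutine stacks and a manager; not the real RGW classes.
#include <deque>
#include <iostream>
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct Coroutine {
  virtual ~Coroutine() = default;
  virtual bool operate() = 0;  // return false when complete
};

// A stack runs each of its coroutines to completion before the next one.
struct Stack {
  std::deque<std::unique_ptr<Coroutine>> crs;
  bool operate() {
    if (crs.empty()) return false;
    if (!crs.front()->operate()) crs.pop_front();  // finished, start next
    return !crs.empty();
  }
};

// The manager loops over ready stacks, calling their operate functions.
struct Manager {
  std::vector<Stack> stacks;
  void run() {  // synchronous: returns when everything has completed
    bool any = true;
    while (any) {
      any = false;
      for (auto& s : stacks) any |= s.operate();
    }
  }
};

struct PrintN : Coroutine {
  std::string name; int n;
  PrintN(std::string name, int n) : name(std::move(name)), n(n) {}
  bool operate() override {  // one "step" per call, yielding in between
    std::cout << name << " step " << n << '\n';
    return --n > 0;
  }
};

int main() {
  Manager m;
  m.stacks.resize(2);
  m.stacks[0].crs.push_back(std::make_unique<PrintN>("a", 2));
  m.stacks[0].crs.push_back(std::make_unique<PrintN>("b", 1));  // after "a"
  m.stacks[1].crs.push_back(std::make_unique<PrintN>("c", 3));  // interleaved
  m.run();
}
```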
And so here we have an example of a coroutine's operate function. This function just keeps getting called by the scheduler, and we have this reenter macro, which is implemented as a switch statement under the hood. Basically it means that each time you yield, the next time you get called you'll resume from where you yielded. That's what the reenter is for.
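
To illustrate the switch trick, here's a stripped-down REENTER/YIELD pair in the style of boost::asio's stackless coroutines, which this framework resembles; these are not RGW's actual macros:

```cpp
// Sketch of a switch-based reenter/yield; not RGW's actual macros.
#include <iostream>

#define REENTER(cr) switch ((cr)->state) { case 0:
// Save the current line as the resume point, suspend, and place a case
// label so the next call to operate() jumps right back here.
#define YIELD(cr) \
  do { (cr)->state = __LINE__; return true; case __LINE__:; } while (0)
#define REENTER_END }

struct Counter {
  int state = 0;  // resume point; 0 means "start from the top"
  int i = 0;      // a member, since locals can't live across yields
  bool operate() {
    REENTER(this)
      for (i = 0; i < 3; ++i) {
        std::cout << "step " << i << '\n';
        YIELD(this);  // suspend after each step
      }
    REENTER_END
    return false;  // fell off the end: complete
  }
};

int main() {
  Counter c;
  while (c.operate()) {}  // the "scheduler": keep calling until done
}
```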
So here we use yield call, which means that this coroutine will suspend and will only be resumed once the call completes, and because it's synchronous we can check its return code here to see whether it succeeded or failed. If it failed, we set the error state, which will tell the scheduler not to call our operate anymore, and it'll unwind. Similarly, setting the state to done means that we're done processing and we won't run anymore.
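
Hand-expanding that pattern gives roughly the following shape (stand-in names throughout; in the real code, call() suspends on an asynchronous child coroutine, and the error/done flags correspond to set_cr_error and set_cr_done):

```cpp
// Hand-expanded sketch of "yield call, then check retcode"; stand-in names.
#include <iostream>

struct LeaseTaker {
  int state = 0;
  int retcode = 0;
  bool error = false, done = false;  // like set_cr_error() / set_cr_done()

  // Stand-in for the child coroutine we "call"; a real one is asynchronous
  // and fills in retcode when it completes.
  static int take_lock() { return 0; /* 0 = success, <0 = errno-style */ }

  // The scheduler stops calling operate() once error or done is set.
  bool operate() {
    switch (state) {
    case 0:
      state = 1;
      retcode = take_lock();  // "yield call(...)"
      return true;            // suspend until the call completes
    case 1:
      if (retcode < 0) {
        error = true;         // unwind: scheduler won't call us again
        return false;
      }
      done = true;            // success: finished processing
      return false;
    }
    return false;
  }
};

int main() {
  LeaseTaker cr;
  while (!cr.error && !cr.done) cr.operate();
  std::cout << (cr.error ? "failed\n" : "completed\n");
}
```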
But the interesting thing about the continuous lease is that it's a while loop. So we'll take the lock for a given interval, which is, I believe, something like 60 seconds maybe, and we just lock, wait for half of that interval, then renew the lock. The idea is that, as long as the continuous lease CR is running, we'll take the lease and keep renewing it, and if it ever fails, then we'll set that we're not locked, and it's up to the calling coroutine to detect that and stop working.
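
A sketch of that renewal loop under simplifying assumptions (hypothetical names; the real continuous lease coroutine does this with timed lock operations on a RADOS object, not a thread and a sleep):

```cpp
// Hypothetical sketch of a continuous lease loop; not RGWContinuousLeaseCR.
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

std::atomic<bool> locked{false};  // the caller polls this before working

// Stand-in for a timed lock on a shared RADOS object: the lock expires
// after the given duration unless renewed. Here it "fails" after a few
// renewals to show the unlock path.
bool take_or_renew_lock(std::chrono::milliseconds /*duration*/) {
  static int renewals_left = 3;
  return renewals_left-- > 0;
}

void continuous_lease(std::chrono::milliseconds interval) {
  while (take_or_renew_lock(interval)) {
    locked = true;
    // Renew at half the interval so it can't expire while we hold it.
    std::this_thread::sleep_for(interval / 2);
  }
  locked = false;  // lost the lease: the caller must detect this and stop
}

int main() {
  std::thread t(continuous_lease, std::chrono::milliseconds(100));
  t.join();
  std::cout << "lease held: " << std::boolalpha << locked.load() << '\n';
}
```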
And the first thing that we do is take the continuous lease, so we allocate the coroutine here and we call spawn to spawn it in a separate stack, and we keep track of that stack for reference counting here. And there's this while loop that basically waits until it succeeds in getting the lock. If it fails, we'll set done here and we'll return the error.
So we have a marker tracker here, which basically keeps track of the entries that we're trying to sync and the entries that have succeeded. The goal is to make sure that we update the sync status marker to reflect the completed progress that we've made. So we get into a do-while loop here; each time we make sure that we still have the lease, otherwise we'll drop out, and then we list some OMAP keys here. We're listing the full sync map, so metadata full sync.
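
The bookkeeping the marker tracker does can be sketched like this (simplified and hypothetical; the real tracker also batches its status writes): entries complete out of order, but the stored marker only advances past positions with nothing pending below them, so a restart can re-sync entries but never skip them.

```cpp
// Simplified sketch of a sync-shard marker tracker; not the real class.
#include <iostream>
#include <set>
#include <string>

class MarkerTracker {
  std::set<std::string> pending;    // spawned, not yet finished
  std::set<std::string> completed;  // finished, not yet reflected in stored
  std::string stored;               // durable position: all <= stored done

public:
  void start(const std::string& m) { pending.insert(m); }

  void finish(const std::string& m) {
    pending.erase(m);
    completed.insert(m);
    // Persist the highest completed marker below the lowest pending one;
    // anything past that might still fail and would be skipped on restart.
    while (!completed.empty() &&
           (pending.empty() || *completed.begin() < *pending.begin())) {
      stored = *completed.begin();  // RGW writes this to the status object
      completed.erase(completed.begin());
    }
    std::cout << "finish " << m << ": stored marker = " << stored << '\n';
  }
};

int main() {
  MarkerTracker t;
  t.start("001"); t.start("002"); t.start("003");
  t.finish("002");  // 001 still pending: can't advance yet
  t.finish("001");  // now safe to advance to 002 (003 still pending)
  t.finish("003");  // everything done: advance to 003
}
```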
So the yield with a block wraps a call to spawn, so it's very similar to just saying yield spawn, except it gives you a scope where you can have local variables. Because the reenter macro is implemented as a switch statement, it's complicated to have local variables: they can't cross cases in a switch statement, so you need a scope like this any time you need local variables.
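
Here's the constraint in miniature, with no RGW machinery at all: jumping over an initialized local in a switch is ill-formed, so the coroutine body wraps its locals in a braced scope that closes before the next resume label.

```cpp
// Why locals can't cross the resume points of a switch-based coroutine.
#include <iostream>
#include <string>

void step(int state) {
  switch (state) {
  case 0: {
    // Locals are fine inside a braced scope that ends before the next
    // case label; this is what the yield-with-a-block gives you.
    std::string msg = "spawning child";  // hypothetical local
    std::cout << msg << '\n';
  }  // msg is destroyed here, before control can ever jump past it
    break;
  case 1:
    // If `msg` were declared un-scoped in case 0, jumping here would
    // bypass its initialization and the compiler would reject it.
    std::cout << "resumed after yield\n";
    break;
  }
}

int main() {
  step(0);
  step(1);
}
```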
I noticed earlier in this that there was a define of a magic number of 100 OMAP keys, which didn't seem like a great number, and that's hard-coded. Could we maybe do a sweep through at some point and lift magic numbers like that? Yeah.
We create a marker tracker, which will update the incremental sync marker as we make progress, and we resume from the marker position that we have stored in our local sync marker variable. So the loop for incremental sync looks fairly similar, except that, instead of reading from OMAP, we are reading from the remote zone's metadata log, and here we use the clone meta log coroutine to read that log listing from the remote and store it locally.
If we didn't store the metadata log, then we wouldn't be able to serve metadata sync to other zones in the event that we're promoted. So we clone a list of metadata log entries, and we'll read from that in a loop and process them similarly to full sync, where we use the meta sync single entry CR.
Okay, and so we're running a lot of these meta sync single-entry coroutines in parallel; we're tracking their markers and looking for their completions. If everything succeeds, then we can update our marker position to reflect how far we got, and we'll just keep looping over the clone metadata log, read metadata log, and meta sync single entry loops.
Similarly to metadata sync, we're still doing a full sync and an incremental sync. Full sync is going to build a list of all of the bucket shards, and it will spawn bucket sync on each one of those, and incremental sync will be watching the data log and spawning bucket sync on bucket shards that have changes.
So for full sync, I think the structure is very similar to metadata sync: the list is stored in OMAP. We read a batch of keys, loop over them, and spawn a data sync single-entry coroutine for each, and here we enforce a spawn window. So if we've spawned more than that many, we'll wait for the next one to complete and collect its result.
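
The spawn-window idea looks roughly like this (a hypothetical sketch using futures as stand-ins for child coroutines): cap how many per-entry syncs are in flight, and collect the oldest before spawning more.

```cpp
// Hypothetical sketch of a spawn window over child operations.
#include <deque>
#include <future>
#include <iostream>

int sync_single_entry(int key) {  // stand-in for the per-entry coroutine
  return key % 7 ? 0 : -5;        // pretend some entries fail
}

int main() {
  const size_t spawn_window = 4;  // max children in flight at once
  std::deque<std::future<int>> inflight;

  for (int key = 0; key < 16; ++key) {
    if (inflight.size() >= spawn_window) {
      // Window full: wait for the oldest child and collect its result.
      int r = inflight.front().get();
      inflight.pop_front();
      if (r < 0) std::cout << "entry failed: r=" << r << '\n';
    }
    inflight.push_back(std::async(std::launch::async, sync_single_entry, key));
  }
  for (auto& f : inflight) f.get();  // drain the remainder at the end
}
```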
There's also this error repo that we have here, where we'll store bucket shards that failed to sync in the past, and we will retry batches of these as we make progress in incremental sync. This is different from metadata sync, because metadata sync will block when it hits an error and keep retrying, but for data sync we want to keep trying all of the buckets in the data log, and if something fails, we don't want to stop progress; we just want to make sure that we try it again later.
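
A sketch of the error-repo idea (hypothetical; the real repo is persisted, e.g. as OMAP keys, rather than an in-memory set): record failures and move on, then feed batches of them back in for retry.

```cpp
// Hypothetical error repo: record failed shards, retry them in batches.
#include <iostream>
#include <set>
#include <string>
#include <vector>

std::set<std::string> error_repo;  // persisted in OMAP in the real system

bool sync_bucket_shard(const std::string& shard) {
  static int calls = 0;
  return ++calls % 3 != 0;  // pretend every third attempt fails
}

void process(const std::vector<std::string>& shards) {
  for (const auto& s : shards) {
    if (sync_bucket_shard(s)) {
      error_repo.erase(s);   // success clears any earlier failure
    } else {
      error_repo.insert(s);  // don't block: note it and keep going
      std::cout << "deferred " << s << " for retry\n";
    }
  }
}

int main() {
  process({"b1:0", "b1:1", "b2:0", "b2:1"});
  // Later, as incremental sync makes progress, retry a batch of failures.
  std::vector<std::string> retry(error_repo.begin(), error_repo.end());
  process(retry);
}
```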
Timestamps: so is there a necessity that the zones should have similar timestamps? Like, for example, what happens if the other zone's clock is not in sync and it's at a more advanced timestamp? Do we resolve that with version numbers, then?
Yeah, so if you upload a multipart upload to Zone A, for instance, Zone B is just going to see a single bucket index log entry for that entire upload, and we'll just use a GET request to fetch the whole thing. So if you GET a multipart object, you'll still just get the entire contents in a single body, so we replicate multipart objects as a single object, and we'll store them as a single object in the target zone.
The bucket entry point uses a RADOS object named after the bucket itself. So if you get a request for a bucket, you look up its entry point first, and the entry point will say what the current instance of that bucket is. That points you to the bucket instance metadata to actually get its attributes.
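
The two-step lookup can be sketched like this (hypothetical in-memory structures and example keys; the real entry points and instances are RADOS objects): the entry point only names the current instance, which is what lets a bucket be deleted and recreated without reusing stale metadata.

```cpp
// Hypothetical sketch of the entry-point -> bucket-instance indirection.
#include <iostream>
#include <map>
#include <string>

struct BucketInstance {  // the bucket's actual attributes
  std::string owner;
  int num_shards;
};

// Entry points are keyed by bucket name and only say which instance is
// current, so a recreated bucket gets a fresh instance.
std::map<std::string, std::string> entry_points = {
    {"photos", "photos:zonegroup1.4133"}};
std::map<std::string, BucketInstance> instances = {
    {"photos:zonegroup1.4133", {"casey", 16}}};

int main() {
  const std::string bucket = "photos";
  auto ep = entry_points.at(bucket);  // step 1: name -> current instance id
  auto info = instances.at(ep);       // step 2: instance -> attributes
  std::cout << bucket << " -> " << ep << " (owner " << info.owner
            << ", " << info.num_shards << " shards)\n";
}
```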
Okay, so on the replication side, there's a config variable called rgw_run_sync_thread, and if you set that to true, then we'll create the threads to run data sync. By default that's on, so every gateway in the zone would be running these threads and trying to get leases to run the processing. The idea is that the leases just help spread the work across all of those gateways. And then on the write side, every write would end up generating a log entry to replicate.
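
Assuming the option in question is rgw_run_sync_thread, a gateway can be excluded from sync work with a ceph.conf entry like:

```ini
[client.rgw.gateway1]
# this gateway serves clients only; other gateways in the zone run sync
rgw_run_sync_thread = false
```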