From YouTube: Migrating Airbnb to Istio
Description
Airbnb has been using an in-house service mesh called SmartStack since 2013. In the past 12 months, they migrated hundreds of production services onto their next generation of service mesh based on Istio. Come learn how they achieved the service mesh migration with minimal service owner involvement and zero downtime. They cover their migration strategy for different kinds of workloads and showcase the migration tool they built.
As introduced, we are engineers at Airbnb, and today we are very excited to talk about how we migrated Airbnb onto Istio. First of all, here is the agenda for today: we will start with a brief introduction, then we will focus on our migration strategies, and at the end we will have a quick recap and a Q&A session.
We finally decided to stop patching our SmartStack and started to search for a modern service mesh solution. After some evaluation, we quickly landed on Istio as the foundation. Internally we use AirMesh as the name for our next-generation service mesh, and we will use this term throughout the presentation. If you are interested in finding out more about why we ended up choosing Istio, feel free to check out our IstioCon talk from earlier this year. As of today, we have migrated almost all of our Kubernetes services onto Istio, along with about half of our inter-service production traffic, and our plan is to fully migrate to Istio next year and sunset our legacy system.
As a result, our mesh users do not directly interact with Istio custom resources. We provide a simple config file that our users edit. For example, here I show the mesh config file for the service banana. The service owner checks this mesh config file in alongside their service code. After code review, at CI time we generate the Istio custom resources from this config file, and the generated resources are managed by our deployment system. All config changes are made by a deploy, which is monitored and can be easily rolled back.
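The CI-time generation step described above can be sketched roughly as follows. The mesh config schema, field names, and service hosts here are invented for illustration; they are not Airbnb's actual format.

```python
# Hypothetical sketch of the CI-time generation step: a simplified mesh config
# (modeled as a dict) is turned into an Istio VirtualService manifest.
# The config schema and host names are illustrative, not Airbnb's real format.

def generate_virtual_service(mesh_config):
    """Build an Istio VirtualService manifest from a minimal mesh config."""
    routes = [
        {"destination": {"host": dest["host"]}, "weight": dest["weight"]}
        for dest in mesh_config["destinations"]
    ]
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": mesh_config["name"]},
        "spec": {
            "hosts": [mesh_config["host"]],
            "http": [{"route": routes}],
        },
    }

config = {
    "name": "banana",
    "host": "banana.prod.svc.cluster.local",
    "destinations": [
        {"host": "banana-production", "weight": 90},
        {"host": "banana-canary", "weight": 10},
    ],
}
manifest = generate_virtual_service(config)
```

Because the generated manifest is just data produced from the checked-in config, the deployment system can diff, apply, and roll it back like any other deploy artifact.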
We provide several different mesh objects in our mesh API. We provide App, which is a workload that only has outgoing traffic, and we also provide Service, which is basically an App with ports. VM App and VM Service are the EC2 versions of App and Service, and External is used for defining external services into the mesh, like our MySQL databases on AWS. We also have VirtualService, which allows users to control routing to a set of real services. We also allow extension and override between mesh objects. In this example, banana-canary and banana-canary-baseline both extend banana-production, so they get the same config as production, and this helps us reduce the verbosity of the mesh config file.
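The extension mechanism can be sketched as a simple inherit-and-override merge. The object names and fields below are illustrative, assuming a flat key-value config per mesh object.

```python
# Illustrative sketch of mesh-object extension: an object that "extends"
# another inherits the base's config and overrides only what it sets itself.
def resolve(objects, name):
    obj = dict(objects[name])
    base_name = obj.pop("extends", None)
    if base_name is None:
        return obj
    merged = resolve(objects, base_name)  # resolve the base first
    merged.update(obj)                    # the child's own fields win
    return merged

mesh_objects = {
    "banana-production": {"ports": [8080], "timeout_ms": 500},
    "banana-canary": {"extends": "banana-production", "replicas": 2},
}
canary = resolve(mesh_objects, "banana-canary")
```

The canary object ends up with production's ports and timeout plus its own replica count, so owners only declare what differs.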
I want to talk a little bit more about the VirtualService that we provide. The most straightforward use case of the VirtualService is static canary, which is how we started doing canary at Airbnb. At all times we route a certain percentage, for example ten percent, of traffic to the canary and the rest to production, and new changes are always deployed to the canary first to verify before proceeding to production. To achieve this kind of traffic routing, a user can simply define this VirtualService in their mesh config. A little bit more complex than static canary is ACA, which means automated canary analysis, and a lot of our Airbnb services have adopted it. For ACA, during non-deployment time all the traffic goes to production, but during deployment time we scale up the canary and canary-baseline pods and route a certain percentage of traffic to them to do a side-by-side comparison, and then we verify the metrics and check that everything looks good.
As shown on the top left, users define the mesh object keys and the percentages they want in their ACA deploy stage, and based on the user input our tooling will generate the Istio custom resource, which is a VirtualService, and deploy it during the apply-traffic-routing stage. After ACA, our tooling will also generate the resource to restore the traffic. So this whole process is completely hidden from the user; all they need to configure is the deployment config on the top left side.
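A rough sketch of the two generated traffic stages is below. The function names and route shapes are hypothetical, and it is an assumption here that canary and baseline each receive the same user-chosen percentage.

```python
# Hypothetical sketch of the ACA traffic stages: during the analysis, a
# weighted route splits traffic across production, canary, and the canary
# baseline; afterwards a second resource restores 100% to production.
def aca_routes(service, canary_percent):
    return [
        {"destination": {"host": f"{service}-production"},
         "weight": 100 - 2 * canary_percent},
        {"destination": {"host": f"{service}-canary"},
         "weight": canary_percent},
        {"destination": {"host": f"{service}-canary-baseline"},
         "weight": canary_percent},
    ]

def restore_routes(service):
    return [{"destination": {"host": f"{service}-production"}, "weight": 100}]

during_aca = aca_routes("banana", 10)   # 80/10/10 split during the analysis
after_aca = restore_routes("banana")    # everything back to production
```

Generating the restore resource up front means the rollback path exists before any traffic moves, matching the "hidden from the user" flow above.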
Besides making sure that the mesh API is user friendly, another top priority for us is safety. Every day there are hundreds and hundreds of changes going on, and we don't want to stop the business to do the migration. Instead, we want to migrate seamlessly while keeping the process safe at the same time.
In order to accomplish that, we provide the following features. We have edge-by-edge migration: we believe that a service should not require a leap of faith when onboarding AirMesh. It should be able to migrate each of its inbound and outbound edges one by one. For those critical edges, we support percentage-based traffic shifting from SmartStack to AirMesh. This allows side-by-side comparison of error rates and latency, and in case anything goes wrong,
we want traffic to be able to roll back quickly, within a second. Here's how traffic shifting works. First, we run the Istio proxy alongside the SmartStack sidecar in shadow mode. We then configure the Istio proxy to intercept traffic going to the reserved CIDR range for the new service mesh, and we also add the traffic shifting capability into Airbnb's standard client frameworks. After that, to shift traffic from SmartStack to AirMesh, a service owner can simply increase the traffic percentage using our dynamic config system. If anything happens, they just change the traffic percentage back to zero, and within seconds traffic will be routed back to SmartStack.
As traffic is being ramped up, service owners have access to this migration dashboard tracking the changes in error rate and latency. For critical services, we normally ramp up traffic gradually to a 50/50 split and leave it overnight for a side-by-side comparison to make sure there is no regression. Also, during gradual rollout, we can monitor the Istio sidecar's resource usage and adjust its CPU and memory during the process to avoid CPU throttling and OOM issues.
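The client-side shifting described above can be sketched as follows. The config key, store, and routing labels are hypothetical stand-ins for Airbnb's dynamic config system.

```python
# Illustrative sketch of percentage-based traffic shifting in a client
# framework: a dynamic config value decides, per request, whether the call
# goes over AirMesh or the legacy SmartStack path. Names are hypothetical.
import random

DYNAMIC_CONFIG = {"banana.mesh_traffic_percent": 10}  # stand-in for a live config store

def route_request(service):
    percent = DYNAMIC_CONFIG.get(f"{service}.mesh_traffic_percent", 0)
    if random.random() * 100 < percent:
        return "airmesh"     # intercepted by the Istio proxy
    return "smartstack"      # legacy SmartStack sidecar

# Rolling back is just flipping the percentage back to zero:
DYNAMIC_CONFIG["banana.mesh_traffic_percent"] = 0
```

Because the decision is a per-request config lookup rather than a redeploy, flipping the value back to zero takes effect within seconds, which matches the rollback behavior described in the talk.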
All these toolings are seamlessly integrated with the existing Airbnb development environment, so that a new service will just onboard AirMesh transparently. As we all know, Rome wasn't built in a day. The transition period of a migration can be very messy, so we provide full compatibility for a service to be in dual mode: that is, while being on AirMesh, the service can still fall back to communicating via legacy SmartStack. So, without stopping service development, we are pushing all the services to onboard AirMesh.
After this step, the service will be considered AirMesh ready: that is, the service is registered in the AirMesh control plane and is able to communicate with another AirMesh-ready service. Once the service is AirMesh ready, we are able to migrate its edges. That means we migrate an edge so that traffic from this service to another service flows over AirMesh. When all the inbound and outbound edges of the service are migrated, we consider the service AirMesh complete.
Now, if we start the migration, naturally we will migrate service A first to make it AirMesh ready. But in order to migrate the edge to service B, we also need to migrate service B, so that they can communicate with each other on AirMesh. Finally, we can migrate the edge between A and B. In order to make service A AirMesh complete, we will need to do the same thing again and again for all of its outbound and inbound edges.
So, as you can see, this process can be very long and entangled if we make the steps depend on one another. Instead, we clearly define each step to be independent, so we can easily pipeline the whole process. As you can see from this graph, we first migrate all services to be AirMesh ready in parallel; this paves the way for migrating any edges for any services. By pipelining, we greatly accelerate the migration speed as well as avoid many process complexities.
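The two-phase pipeline above can be sketched on a toy service graph; the services, edges, and state sets here are made up for illustration.

```python
# Sketch of the two-phase pipeline described above: first make every service
# AirMesh ready independently (parallelizable), then migrate each edge once
# both of its endpoints are ready.
edges = [("A", "B"), ("A", "C"), ("B", "C")]
services = {s for edge in edges for s in edge}

# Phase 1: readiness has no cross-service dependency, so it can run in parallel.
ready = set(services)  # stand-in for the per-service onboarding work

# Phase 2: an edge can migrate as soon as both of its endpoints are ready.
migrated = {(src, dst) for src, dst in edges if src in ready and dst in ready}

# A service is "AirMesh complete" when every edge touching it is migrated.
complete = {s for s in services
            if all(e in migrated for e in edges if s in e)}
```

Separating readiness from edge migration is what removes the entanglement: no edge ever waits on another edge, only on the readiness of its two endpoints.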
This pipelined approach also allows us to use a white-glove approach and better utilize economies of scale, as we make the migration process repeatable. With help from contractors, we achieved a high speed of migrating more than 40 services per week, and in this short quarter we have had tens of edges being migrated in parallel every day. As of today, more than 50 percent of traffic is on AirMesh already. But speed is not enough; we also want to make the migration process transparent.
This helps us make sure the edge is 100 percent accessible before actually shifting the traffic, and if there are errors, it will diagnose them and give suggestions on how to fix them. With all this migration tooling, not only do we make our migration faster and safer, it is also more transparent: service owners barely need to know anything about the AirMesh migration. The behavior stays unchanged; it just works, with a minimum of configuration changes, which are automatically generated.