Description
Service Mesh is becoming a key component in the Cloud Native world. It allows teams to connect, secure and observe complex microservices environments built on containers and container orchestration tools. But most Service Mesh tools are also complex in their own right and demand a lot of engineering overhead to deploy and maintain. In this talk, we will explore the considerations you have to take into account before you commit to a Service Mesh-based stack. I had the opportunity to help customers design around Istio (one of the best-known mesh tools) and learned a lot over the years. This talk is a distillation of those learnings.
Hi everyone, thank you for having me. Usually when I do this presentation outside of a dedicated service mesh community, it takes me much longer, because most people don't actually understand what a service mesh is.
So I usually have to start with some introductions, but I'm going to skip all the introductions for this specific meetup and this specific talk. As for the content, I don't actually have a lot of content; it's more a distilled experience from a couple of years of implementing a certain service mesh in certain projects. But I want to make this as interactive as possible, so if you have any questions, please feel free to ask me in the chat.
So the title of the talk is quite provocative, and the idea here is to make people cautious about implementing a service mesh in general. I also modified it to make it specific to Istio, because, you know, different service meshes do things in different ways, but pretty much all the learnings from this talk apply to almost any implementation, or any cloud product, to be honest. I'm a Cloud Developer Advocate at Google. Before this I spent about five years consulting as part of something called PSO, Professional Services, where we worked directly with external customers of all sorts, from startups to big tech companies to banks and financial institutions, helping them implement things. I did quite a lot of GKE (Google's managed offering for Kubernetes) and service mesh work.
Overall I have quite a lot of infrastructure experience, and usually when we talk to customers, or to people in general, about service mesh, this tends to be the slide we use to sell the technology.
The benefits of a service mesh: it allows you to connect a bunch of microservices in a flexible way; secure the communication between those microservices; control the policies between them, both in terms of who can do what and who can talk to what, and also control the traffic between those microservices; and then observe, by collecting a bunch of metrics and traces from that communication and sending them somewhere.
Think of these as incremental steps. If I want to go from "I don't have a service mesh" to "I have a service mesh", what are the four main things I need to keep in mind? The first of the four is capacity and resources: a service mesh data plane and control plane do consume CPU and memory, so you have to keep that in mind.
I have seen implementations where we started with a few hundred pods in a microservices-based environment, and this was, to be honest, with an earlier version of Istio. Istio has improved drastically in the last few years, I would even say in the last few releases, but back then it consumed lots of resources. It still does, and it's not cheap.
I have seen situations where, by introducing a service mesh, we got to a point where the service mesh sidecars actually consumed more resources than the actual application.
There are also certain things that, once you implement a service mesh with them, you can't just change; most tools are not as flexible as most people think they are. And once you deploy your service mesh and start using it, it will start generating a bunch of data, a bunch of metrics and logs and so on, and you need to deploy extra software to take advantage of those extra things that are available to you. That's what I call the auxiliary infrastructure.
So let's start with capacity and resources. This is kind of straightforward, I guess, for a lot of people: sidecars are containers, and containers need CPU and memory to run. I just pulled some numbers from the latest benchmark that Istio has published, for a thousand services (which means 2,000 sidecars) and 70,000 mesh-wide requests per second. With 1.14, the latest release, the Envoy proxy consumes approximately 0.35 vCPU and 40 megabytes of memory per thousand requests per second.
Now, put this way it might not seem like a lot, but in my opinion 0.35 vCPU for a thousand requests per second is quite high. So if your services are not very chatty, you're probably going to be okay, and you'll probably be fine paying almost half a vCPU and 40 megabytes of memory to have mTLS handled for you by the service mesh.
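As a side note from me rather than from the talk: if the sidecar footprint is the worry, Istio exposes per-workload pod annotations to right-size the injected proxy. A minimal sketch, where the workload name, image and the actual values are placeholders, not recommendations:

```yaml
# Sketch: override the injected sidecar's resource requests/limits
# with Istio pod annotations. All values are illustrative only.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments                     # hypothetical workload
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments
  template:
    metadata:
      labels:
        app: payments
      annotations:
        sidecar.istio.io/proxyCPU: "100m"           # request
        sidecar.istio.io/proxyMemory: "128Mi"       # request
        sidecar.istio.io/proxyCPULimit: "500m"      # limit
        sidecar.istio.io/proxyMemoryLimit: "256Mi"  # limit
    spec:
      containers:
      - name: app
        image: example/payments:1.0  # placeholder image
```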
But once you start scaling up your application, both in terms of the total number of pods you run in your cluster and in terms of the amount of traffic you're handling, that footprint will start increasing, and you have to keep it in mind both from the perspective of cost and from the perspective of capacity planning. As you all probably know, most service mesh implementations are done in the cloud, and cloud is not unlimited, right? Stockouts are very common; people not being able to provision virtual machines is a very common thing in most cloud providers, and different cloud providers are doing different things to try to handle it. But it's something you have to keep in mind. Istiod itself consumes one vCPU and 1.5 gigabytes of memory, which is not a lot, but if you have a big mesh you will probably need to horizontally scale istiod, and that can also cost money, right?
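To sketch what "horizontally scale istiod" can look like in practice: Istio's default install ships a similar autoscaler, but the replica bounds and target below are my illustration, not numbers from the talk:

```yaml
# Sketch: CPU-based autoscaling for the istiod control plane.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: istiod
  namespace: istio-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: istiod
  minReplicas: 2        # illustrative bounds
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
```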
So that's the first thing. The next one is network latency. Some people might argue that 1.7 milliseconds, or 2.7 at the 99th percentile, is not a lot of added latency. It's probably not a lot, but it really depends on the application, and on how responsive the app has to be. If you have a proxy, you will have latency; that's just a fact, and there is no way around it. Now, I have seen some new approaches that are either already released or about to be released.
Here I mean, you know, non-tech-savvy customers, non-tech-savvy companies; I'm not necessarily talking about specific industries, but you get what I mean. Once you tell them that you're going to move from a sidecar to a node-based proxy, the question becomes: well, how do we secure the communication between the application and the proxy on the node? That portion is unencrypted, unlike in a sidecar-based service mesh.
So if you move the proxy away from the actual application pod, how do you secure that portion of the traffic? If you move toward eBPF, the same questions come up: how do we secure it, and who is responsible for the eBPF modules that live in the kernel, and so on. These are all good efforts.
What I'm trying to say is that these are good efforts to mitigate both the resource footprint and the network latency problems, but they come with some drawbacks, and you have to be ready to answer these questions. Next: design and architecture. I took the most extreme case, in my opinion, that I came across, but there are many, many other cases where you have to decide on day zero, before you even implement the service mesh, what your target architecture is going to look like. As much as we want to believe that people who are doing cloud are flexible ("oh, if the architecture on day zero was not the right one, we'll just delete everything and create it again, because it's ephemeral infrastructure and that's what cloud allows you to do"), as much as we want that to be true, it's not necessarily true all the time.
A lot of people, a lot of customers, tend to think about cloud infrastructure as static over time, and for them, deleting everything and rebuilding it from scratch because they made a wrong architecture and design decision on day zero is probably not even an option on the table. So the extreme case I came across was a multi-cluster, multi-region implementation, with clusters spread across multiple regions on the same VPC; this was specifically on Google Cloud, but I think it's much the same for all other cloud providers.
These kinds of things you cannot just choose later. To be fair, we did it with a remote control plane, which was not a supported configuration at the time, and this is exactly the kind of decision you have to make from day one; you can't just switch things around just like that. There are certain decisions that you have to keep in mind and take from day one, and the main one is the multi-cluster, multi-control-plane versus single-cluster, single-control-plane type of decision.
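To make that concrete, here is a minimal sketch of how that decision gets baked in at install time with Istio. This is my illustration rather than anything from the talk; the field names follow Istio's multi-cluster install documentation, and the mesh, cluster and network names are placeholders:

```yaml
# Sketch: IstioOperator values that pin a cluster's identity inside a
# multi-cluster mesh at install time. meshID, clusterName and network
# are placeholders; revisiting them later is not a trivial flip,
# which is why this is a day-zero decision.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  values:
    global:
      meshID: mesh1
      multiCluster:
        clusterName: cluster-us-east   # hypothetical cluster name
      network: network1
```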
It became slightly easier with Istio recently, but it's still not easy, and it becomes even more problematic later on if you want to introduce, you know, hybrid connectivity between on-prem and cloud. And all the other things I talked about before, the resource footprint, the latency and so on, are a factor in this design decision, because the farther you have to ship packets around regions, the more latency you're going to have.
The last one is the auxiliary infrastructure. Well, I have a couple more things I want to talk about, but the next one is the auxiliary infrastructure.
The way I like to think about it is that a service mesh doesn't live in a vacuum. You need to run it, specifically Istio, on top of a cluster, and you need to deploy a bunch of extra software to be able to take advantage of all the things the service mesh gives you: tracing, monitoring and logging, the graph interface (the GUI that gives you the graph), and so on. These are all pieces of software you have to deploy.
I know that if some people on this call are from the Istio community, they might argue that we made it easy in Istio to deploy these things. Yes, we did, but deploying something from the command line and running it in production are two different things. These are all interesting pieces of infrastructure that you will have to deploy, maintain, fix, patch, and so on. So keep that in mind; it's not just "here is a service mesh, deploy it" and you're done.
So, in a nutshell (I told you it was going to be short), the five lessons learned, for me at least, were these. Do not take the simplest approach. I've seen it especially in some security-conscious companies; they just go: "Oh, we need security. Can mTLS give us what we need to meet certain security requirements? Yes. Is mTLS painful to manage by hand? Of course it is, especially at scale. Can a service mesh give us mTLS easily? Yes. Okay, let's use it."
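For reference (my sketch, not a slide from the talk), this is roughly all it takes to turn on strict mTLS mesh-wide in Istio, which is exactly why it is such an easy sell. The resource follows Istio's PeerAuthentication API and applies mesh-wide when placed in the root namespace, istio-system by default:

```yaml
# Sketch: enforce strict mTLS for the whole mesh with one resource.
# Placing it in the root namespace makes it apply mesh-wide.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
```

The ease of that one file is the trap this lesson warns about: the mTLS itself is easy, everything around it is not.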
Specifically, what I actually want to mention in this case is that this problem is exaggerated if you are working in a distributed environment with some sort of central platform team, which is actually a trend we are seeing these days: one team handling all your central infrastructure, all your central Kubernetes clusters, your service mesh control plane deployments, et cetera. And one of the problems I see there is that the platform team will run the clusters and deploy the service mesh for you.
Meanwhile, the application teams may keep implementing in code things the service mesh could do for them. So these are all things to take into consideration. If you end up in a situation where you have to deploy a service mesh in a centralized way, I think you should surface that information to everybody who is involved and let them know that there are things that can be done with a simple YAML file instead of being implemented in code.
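As a hedged illustration of that "simple YAML file instead of code" point: retries are the classic example, since in Istio a VirtualService can add them without touching application code. The service name below is hypothetical:

```yaml
# Sketch: a retry policy declared in mesh configuration instead of
# being hand-written in every client. "reviews" is a placeholder.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews
  http:
  - route:
    - destination:
        host: reviews
    retries:
      attempts: 3              # retry failed requests up to 3 times
      perTryTimeout: 2s        # budget for each attempt
      retryOn: 5xx,connect-failure
```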
I've seen this a couple of times, and I think one of the common use cases you might have come across is a data transformation pipeline. An example that came to mind (not to point out a specific tool, it's just an example) is Argo; not Argo CD, the CI/CD part, but the Argo tool that allows you to run data transformations over data. The way that works is that you have a control plane that you configure to spin up a pod to run a specific piece of custom code on a specific piece of data.
When we introduced a service mesh, the problem became that the application pod itself takes less time to run than the sidecar: the sidecar actually takes more time to come up and be ready than the actual application needs to access the network, run, and finish. So basically, by adding a sidecar, we were extending the runtime of a specific job from a few seconds to maybe a minute, and we were increasing the footprint.
Obviously, this is a problem. For these kinds of fast-running, dynamic environments, whether that's data transformation or CI/CD or something like that, introducing a sidecar-based service mesh can be a problem. The other problem, which I think you're aware of, is the order in which things start. You have to create a dependency between your actual application pod or container and the sidecar container, so that your application container does not try to connect to the network or resolve some DNS before the sidecar is actually available.
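Istio has a mesh-wide knob aimed at exactly this ordering problem. A minimal sketch, assuming an IstioOperator-based install; the field comes from Istio's meshConfig API, though whether it fully solves a given workload's startup race is something to verify:

```yaml
# Sketch: delay the app container until the Envoy sidecar is ready.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    defaultConfig:
      holdApplicationUntilProxyStarts: true
```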
So basically, whether your application can run nicely with the service mesh is a problem you have to figure out before you actually commit to using one. In the example I gave earlier, we just ended up excluding the Argo pipelines from the service mesh because they were not compatible, but that's just one example.
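For completeness, "excluding" a workload like that is itself just an annotation. A sketch, with a hypothetical Job name and image, using Istio's injection opt-out:

```yaml
# Sketch: opt a Job's pods out of sidecar injection entirely,
# which is effectively what excluding the Argo pipelines meant.
apiVersion: batch/v1
kind: Job
metadata:
  name: transform-step              # hypothetical job name
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"   # no sidecar for this pod
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: example/transform:1.0       # placeholder image
```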
As I said, in large organizations one of the problems I've come across is people lacking theoretical and practical experience, especially in debugging. One thing is deploying, labeling, getting sidecar injection working and all those kinds of things, which is nice; but when things break, you have to have enough theoretical and practical knowledge to know how to debug them. I have spent countless hours in front of the terminal trying to figure out why certain things didn't work.
Istioctl, the command line, makes that slightly easier, but when you are trying to debug Istio using istioctl you have to understand what you are looking at, especially when you use proxy-status or whichever subcommand tells you how the mesh is working; you have to understand the terminology.
You have to understand these terms properly to understand why something is broken. And this, again, is another problem that is exaggerated if you are working in distributed environments, like a central multi-tenant environment, where the question becomes: whose responsibility is it to debug and fix the service mesh when it breaks? Implementing a service mesh will increase your technical debt whether you want it to or not, simply because you have to maintain, monitor and manage the service mesh itself, and maintain all the other components.
I have written an article that goes into slightly more detail. It's quite old, it's from January this year; you can find it on my Medium blog, and it goes into more detail about the stuff I talked about, so feel free to read it if you want. And with that, I'd say thank you very much for having me, and now, if you have any questions...