Description
#IstioCon2021
Presented at IstioCon 2021 by Shota Shirayama.
We introduced Istio on our microservices. Istio's logs, metrics, and features are very helpful for investigating failures in detail.
One day we had big trouble due to a node failure, and it was very hard to find out why our application had not recovered automatically. Thanks to Istio, we finally found the root cause in our application logic, and we could reproduce the same failure in the development environment with Istio as well. I'd like to share this story.
Today, I'd like to talk about how Istio has helped in troubleshooting the microservices we operate. I'm Shota Shirayama from Japan, working as a software engineer at Rakuten. Rakuten is a company that runs several businesses, such as e-commerce, mainly for Japan. Okay, let me start my presentation.
First of all, here is a simple system diagram to help you understand our system's outline. The main service and each microservice are combined to form the backend, which runs on GKE with Istio installed. The main service uses GraphQL. The main service receives a user's request and calls each microservice, but there are some places where some microservices call the main service.
Communication between services is performed via these proxies. Istio provides various functions such as authentication, encryption, logging, and monitoring at this layer. No application changes are required to install Istio. From now on, I'll talk about the failure that happened to our system.
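As a side note, the reason no application changes are needed is Istio's automatic sidecar injection. A minimal sketch of enabling it by labeling a namespace looks like this; the namespace name here is just an example, not our actual configuration:

apiVersion: v1
kind: Namespace
metadata:
  name: backend
  labels:
    istio-injection: enabled

With this label in place, pods created in the namespace get the istio-proxy sidecar injected automatically, so the application containers themselves stay unchanged.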
The nodes and pods recovered automatically after a while, thanks to Kubernetes' and GKE's auto-healing mechanisms. After the automatic recovery, both the nodes and the pods were running normally, so we thought our application was working correctly. But unfortunately, it wasn't: one of the main service's endpoints had stopped responding.
This endpoint calls Service A internally. Since Service A wasn't directly affected by the node failure, the situation looked very mysterious to us, and we couldn't find the root cause of the problem. At that point, we decided to restart the Service A pod manually, and after that the problem disappeared.
There were a lot of upstream connection terminations when the Service A pod restarted. In other words, the main service had connected to Service A and had been waiting a long time for a response, and was then disconnected when Service A restarted. This response flag, logged by the istio-proxy, often gives us valuable information about what was happening at the time of the failure.
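For illustration only, an access log line from the main service's istio-proxy for such a request might look roughly like the following; the timestamp, path, port, and elided fields are assumptions based on Envoy's default access log format, and the key part is the UC response flag, which Envoy uses to mark an upstream connection termination:

[2021-02-22T10:15:03.812Z] "POST /query HTTP/1.1" 503 UC ... outbound|8080||service-a.default.svc.cluster.local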
Using Istio's fault injection feature, we set a fixed response delay only for requests from Service A to the main service. Then, as expected, data gradually accumulated in the queue, and Service A's API also ended up waiting. As a result, we were able to reproduce the situation in which the main service also ended up waiting.
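For reference, here is a minimal sketch of the kind of VirtualService used for this experiment; the host names, labels, and delay duration are assumptions for illustration, not our actual configuration:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: main-service-delay
spec:
  hosts:
  - main-service
  http:
  - match:
    - sourceLabels:
        app: service-a
    fault:
      delay:
        percentage:
          value: 100
        fixedDelay: 10s
    route:
    - destination:
        host: main-service

With a rule like this, only calls originating from Service A's pods toward the main service receive the injected delay, which matches the scenario we wanted to reproduce.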