Description
#IstioCon2021
Presented at IstioCon 2021 by Shota Shirayama.
We introduced Istio on our microservices. Istio's logs, metrics, and features are very helpful for investigating failures in detail.
One day we had big trouble due to a node failure, and it was very hard to find out why our application had not recovered automatically. Thanks to Istio, we finally found the root cause in our application logic, and we could reproduce the same failure in the development environment with Istio as well. I'd like to share this story.
Today, I'd like to talk about how Istio has helped in troubleshooting the microservices we operate. I'm Shota Shirayama from Japan, working as a software engineer at Rakuten. Rakuten is a company that runs several businesses, such as e-commerce, mainly for Japan. Okay, let me start my presentation.
First of all, here is a simple system diagram to help you understand our system's outline. The main service and each microservice are combined to form the backend, which runs on GKE with Istio installed. The main service uses GraphQL. The main service receives a user's request and calls each microservice, but there are some places where some microservices call the main service.
Communication between services is performed via these proxies. Istio provides various functions such as authentication, encryption, logging, and monitoring at this layer. No application changes are required to install Istio. From now on, I'll talk about the failure that happened to our system.
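As a side note, the reason no application changes are needed is Istio's automatic sidecar injection. A minimal sketch of enabling it by labeling a namespace looks like this; the namespace name here is just an example, not our actual configuration:

apiVersion: v1
kind: Namespace
metadata:
  name: backend
  labels:
    istio-injection: enabled

With this label in place, pods created in the namespace get the istio-proxy sidecar injected automatically, so the application containers themselves stay unchanged.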
The nodes and pods recovered automatically after a while, thanks to Kubernetes' and GKE's auto-healing mechanisms. After the automatic recovery, both the nodes and the pods were running normally, so we thought our application was working correctly. But unfortunately, it wasn't: one of the main service's endpoints had stopped responding.
This endpoint calls Service A internally. Since Service A wasn't directly affected by the node failure, the situation looked very mysterious to us, and we couldn't find the root cause of the problem. At that point, we decided to restart the Service A pod manually, and after that the problem disappeared.
There were a lot of upstream connection terminations when the Service A pod restarted. In other words, the main service had connected to Service A and had been waiting a long time for a response, and was then disconnected when Service A restarted. This response flag, logged by the istio-proxy, often gives us valuable information about what was happening at the time of the failure.
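For illustration only, an access log line from the main service's istio-proxy for such a request might look roughly like the following; the timestamp, path, port, and elided fields are assumptions based on Envoy's default access log format, and the key part is the UC response flag, which Envoy uses to mark an upstream connection termination:

[2021-02-22T10:15:03.812Z] "POST /query HTTP/1.1" 503 UC ... outbound|8080||service-a.default.svc.cluster.local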
Using Istio's fault injection feature, we set a fixed response delay only for requests from Service A to the main service. Then, as expected, data gradually accumulated in the queue, and Service A's API also ended up waiting. As a result, we were able to reproduce the situation in which the main service also ended up waiting.
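For reference, here is a minimal sketch of the kind of VirtualService used for this experiment; the host names, labels, and delay duration are assumptions for illustration, not our actual configuration:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: main-service-delay
spec:
  hosts:
  - main-service
  http:
  - match:
    - sourceLabels:
        app: service-a
    fault:
      delay:
        percentage:
          value: 100
        fixedDelay: 10s
    route:
    - destination:
        host: main-service

With a rule like this, only calls originating from Service A's pods toward the main service receive the injected delay, which matches the scenario we wanted to reproduce.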