From YouTube: Istio September Meetup / How Istio brings excellent observability to HP, by John Zheng and Lisa Xu
Description
For large platforms that support hundreds of projects, it is vital to gain insight into API response codes, API response times, and API call volumes, with monitoring and precise alerting. Istio can easily meet these requirements without any changes to service code. In addition, troubleshooting in a complex service mesh is always troublesome and time-consuming. However, by integrating Istio and Jaeger, you can correlate client requests, traces, Istio logs, and application logs. In this way, when a problem occurs with an API, it is easy to locate. In this session, HP DevOps engineers share their best practices in cluster observability, hoping to offer useful references for everyone.
Lisa: Okay, here is today's agenda. We're talking about our business platform: what kind of features we have enabled on it by leveraging Istio, and what kind of customizations we have made along with those features. In the end, we'll have a Q&A session.
First we want to talk about how our HP Horizon platform is designed with Istio. As you know, HP is a big company; we have lots of projects deployed on the cloud. Some projects share common features, like single sign-on, user management, payment, etc., and they also have project-specific features. John and I work in a software organization, and we decided to provide a common platform called Horizon, which mainly provides two things.
The first one is that we serve the common features as a service, so that the solutions don't have to develop those services on their own; they can leverage our common services in this area. The second one is that the Horizon platform also aims to provide managed infrastructure for the solutions, so the solutions don't have to care about or operate their own infrastructure; they can run their services on our managed infrastructure.
In this way we hope we can maximize the value for the customers, and also help our solutions deliver to the market at lower cost and with higher quality.
So here is what the platform looks like. On one side you can see a cluster we call the core cluster. Inside the core cluster we host all the core services, like the auth service, the IAM service, the payment service, the notification service, etc., and all the core services are exposed through the ingress gateway. On the other side we have the solution clusters. In our platform, each solution is treated as an individual tenant, and each tenant has its own namespace.
Each tenant's namespace is isolated using network policy, and their services are also exposed through the ingress gateway. If a solution service wants to consume our core services, it goes through mutual TLS; the solution clusters and the core cluster are set up as an Istio multi-cluster deployment using replicated control planes. And if a solution cluster is hosted by the solution team itself, not running on our managed infrastructure, it can still consume our core services directly through the same channel.
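[For reference: Istio can enforce the mutual TLS part of this setup declaratively. A minimal sketch, assuming a per-tenant namespace named tenant-a (the name is hypothetical):]

```yaml
# Require mTLS for all workloads in one tenant namespace
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: tenant-a   # hypothetical tenant namespace
spec:
  mtls:
    mode: STRICT        # plaintext connections to these workloads are rejected
```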
Now, considering that we are building a platform and providing this kind of service to our solutions, there are some requirements we want to fulfill. For example, since we are providing the core services to a large audience of solutions, how can we make sure our services perform well, and how can we get alerted when they don't?
We were hoping not to rely heavily on the application teams to provide hundreds of logs to support troubleshooting. Instead, we hoped that, as the infra team, we could provide this kind of functionality for troubleshooting and monitoring ourselves. So kudos to Istio, which provides excellent observability to support this.
So how does Istio give us this? As we all know, on the Istio side a sidecar proxy is added alongside every application deployment. In this way, all the applications become available for traffic management and also for observability: Istio generates detailed telemetry for all service communication within the mesh, and this telemetry provides good insight into service behaviors.
A
So
in
this
way
they
can
empower
us
through
troubleshooting
and
maintain
and
optimize
our
application
and
the
most
important
things
that
we
are
not
adding
any
burden
to
our
service
developer,
so
istio
provider
following
ways
for
the
telemetries,
the
first
one
is
access
lock.
A
So,
as
all
the
traffic
flow
in
to
the
mesh
issue
is
able
to
generate
all
the
full
records
of
them
api
request,
including
the
source
and
this
destination
method,
and
in
this
way
it
will
able
to
tell
you
that,
like
the
how,
when
what
of
the
logins
and
the
second
way
for
the
telemetry
is
for
the
matrix
matrix,
actually
is
able
to
provide
a
way
for
monitoring
and
and
also
helping
you
to
understand
the
service
behaviors
in
aggregate
and
is
still
able
to
generate
the
metrics
in
a
signal
like
the
error,
traffic
and
letters
and
etc,
and
also
is
you're
able
to
provide
a
default
default
monitoring
dashboard
out
of
box.
A
And
so
the
third
thing
is
for
the
distributed
trace.
So
instill
is
able
to
generate
all
the
distributed
trade
spans
for
each
service
and
in
this
way
or
along,
you
have
the
inside
of
the
all
the
call
flows
and
also
the
service
dependencies
in
the
mesh.
So, as I mentioned, since we run a platform, it's key for us to be able to show data like core service utilization and the performance of our services. For example, if I want to know my core services' API access counts, API error rates, and API response times, how should I do it? I don't want to ask the core service developer teams to implement any code for this; I want my infra to tell me this data directly. So here is the solution we came up with.
Let's take a look; this is our solution. Here you can see our cluster, and inside the cluster we leverage Fluentd to collect all the logs; concretely, we collect the logs from the istio-proxy sidecars and also from the Istio ingress gateway.

After collecting, we also parse these logs for the fields we consider important, then we store all the data in Elasticsearch, and we can view the data and generate dashboards in Kibana. So let's take a deeper look at how we implemented this.
By default you are not able to get the Envoy access log from Istio, so in order to enable it, during the installation you have to use istioctl to turn the Envoy access log on. Once you enable it, you are able to view the logs like this: this is a log from the Istio ingress gateway, and this is a log from the istio-proxy container of one of the pods.
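[For reference: a minimal sketch of enabling this at install time with istioctl and the IstioOperator API; the exact flags shown on the slide may differ:]

```yaml
# istioctl install -f access-log.yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    accessLogFile: /dev/stdout   # every Envoy sidecar/gateway writes access logs to stdout
    accessLogEncoding: TEXT      # plain-text format (JSON is also supported)
```

[The same effect can be had with `istioctl install --set meshConfig.accessLogFile=/dev/stdout`.]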
These logs look very similar and contain a lot of information, so let's dig a little deeper into what is contained in such a log. We highlighted some of the key pieces of information in this access log that we consider very useful. The first one is the start time of the API call. The second one is the method of the API; then there is the API path, the response code of the API (whether it succeeded or not), the duration of the call, the x-request-id (which John will also mention later; it is a very important piece of information), and the host.
After understanding the format of the logs, we can do something based on our requirements. In Fluentd we parse the logs based on the keywords we think are important, configured in the Fluentd ConfigMap. Here is one example we are using in our project: we parse out all the important fields, like the response code I mentioned, from the istio-proxy log, and we store each of the values in new fields.
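[For reference: a rough sketch of such a parser filter. The tag pattern, regular expression, and field names are illustrative, not the exact ConfigMap from the talk; the expression depends on your access log format:]

```
# Parse the Envoy access log line into named fields before shipping to Elasticsearch
<filter kubernetes.var.log.containers.**istio-proxy**.log>
  @type parser
  key_name log          # the raw container log line
  reserve_data true     # keep the original record alongside the parsed fields
  <parse>
    @type regexp
    # Illustrative pattern: start time, method, path, and response code,
    # roughly following Envoy's default text access log format
    expression /^\[(?<start_time>[^\]]+)\] "(?<method>\S+) (?<path>\S+) [^"]*" (?<response_code>\d+)/
  </parse>
</filter>
```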
After we parse them into this format, we are able to store them in Elasticsearch, and then we can view the logs in a much more readable format in Kibana. In the left panel, these are all the new keys we generated based on the logs, and these are all the values we extracted from the logs —
as I said, the duration, the method, the path, the response code, et cetera. And as soon as we are able to store this kind of data in this more readable way, we can group the data and generate reports for our needs.
A
What's
the
top
one
there,
I'm
able
to
generate
this
kind
of
report
as
well,
so
all
of
them
is
based
on
each
api
and
also
I
I'm
able
to
space
on
my
duration,
I'm
able
to
see
the
latest
in
my
api,
so
I
can
provide
this
this
kind
of
data
and
ask
a
developer
to
do
some
job
shooting
based
on
that.
Up to now we have a clear view of the API performance of the platform, but I don't think this is enough, because as operators of the platform we also want to predict ahead, so that we know when an issue may be about to happen in the system and can get alerted in time. So we also put some effort into API monitoring and alerting in the platform, and we did meet some challenges as we went. The major challenges we had previously: some issues were not reported in time, or some issues stayed at the warning level, we didn't pay attention, and later on they really became an incident in production.
A
So
one
example
I
can
show
you
is
we
have
one
service
and
this
service
has
several
api
will
be
supposed
to
resolution
to
consume
and
the
one
of
the
api
actually
is
very
stable
and
better
access.
The
access
rate
is
also
very
high
and
is
very
stable,
so
you
can
consider
if
you
are
monitoring
on
this
way.
It
will
always
risk
most,
but
there
is
also
another
api
inside
of
this
service,
which
access
rate
is
very
low,
but
it's
very
critical.
If
this
api
failed,
it
will
trigger
a
critical
instant
for
a
solution.
In this situation, because our alerting was based on the service error rate, this kind of failure would not get attention in time. We wanted to find a way to resolve this kind of situation, and we kept thinking about what we could do to overcome this challenge. We concluded that our monitoring should be more accurate, down at the API level, and also more predictive, so we started drafting the requirements and rules we could put into our monitoring.
C: Thank you.

Lisa: You're welcome. So this is what we came up with: the requirements should be based on the API level, and then we can do the monitoring. First we looked into what Istio originally provides for metrics. Actually, as I mentioned in the beginning, Istio is able to generate a lot of metrics by default, and you can query these metrics from the out-of-the-box Prometheus that ships with Istio.
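[For reference: a sketch of such a query against Istio's standard `istio_requests_total` metric; the service name is hypothetical. Note that the labels describe services, not URL paths, which is exactly the limitation discussed below:]

```promql
# 5xx error ratio of one core service over the last 5 minutes (service level)
sum(rate(istio_requests_total{destination_service="payment.core.svc.cluster.local",
                              response_code="500"}[5m]))
/
sum(rate(istio_requests_total{destination_service="payment.core.svc.cluster.local"}[5m]))
```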
A
You
can
also
view
this
kind
of
mattress
from
the
graph
now,
but
in
this
way
you
can
see
this
is
the
service
and
I'm
able
to
see
my
latest
in
and
also
the
success
rate,
but
this
success
rate
is
still
based
on
the
service
level.
There
also,
you
can
see,
there's
also
the
tracings
in
the
keali
to
see
the
service
dependencies
and
also
in
here
you
are
able
to
see
all
the
other
for
the
api
call,
but
this
is
also
based
on
the
service
level.
A
So
all
of
them
provided
from
the
institute
by
default
is
awesome.
But
still,
as
I
mentioned,
we
want
to
do
this
kind
of
things
in
the
service
level
in
the
api
level,
instead
of
a
service
level.
So
we
have
to
come
out
as
a
solution
by
ourself,
so
I
will
hand
off
to
john
and
he
will
give
in
detailed
introduction
regarding
our
solutions.
John: So, as Lisa said, she has already introduced how we get the Istio and Envoy logs, how we parse the logs in Fluentd, and how we forward them to Elasticsearch. She also introduced why the current Istio metrics cannot fulfill our requirements: they are at the service level, not the API level. Okay. So this is our final solution. With Fluentd we collect the logs, parse them, and save them in Elasticsearch; with that, we already have all the API-level data, API information like response code, response count, and response latency.
B
Based
on
this,
we
do
the
simple
a
lot
with
the
last
a
lot
or
in
most
cases
we
do
more
advanced
a
lot,
because
we
have
more
complex
rules
so
with
first
for
each
minute,
we're
using
a
clone
job
to
do
the
data,
aggregation
and
save
the
aggregation
data
into
the
database,
and
we
use
another
clone
job
to
queries.
Database
find
out
unnormal
data
and
send
a
lot
with
page
duty.
For example: if an API's response code is 500 and it occurs three times in the last minute, okay, you should raise such an alert. But in most cases we want to cover all the APIs; we don't want to list the APIs one by one in a config file. And we also want to alert on conditions like the failure continuing for two minutes. So we need our more complex, advanced alert solution. One thing to mention: although the rules are complex, our configuration is very simple.
B
So
we
say:
okay,
we
get
this
data
from
aggregate
table
here
and
response
code
is
500
and
more
than
three
times
in
last
two
minutes
and
continue
two
minutes.
That
means
both
of
the
minutes
has
such
case,
so
you
can
find
with
this
it
easily
to
implement
our
requirements
and
reserve
our
challenges
so
based
on
the
api
level,
a
lot
okay,
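[For reference: the rule format on the slide isn't reproduced here, but a rule of that shape could be sketched like this; every field name below is hypothetical, purely to illustrate the idea:]

```yaml
# Hypothetical API-level alert rule, evaluated by the second cron job
- name: core-api-500s
  source: api_minute_aggregates   # table written by the per-minute aggregation job
  where:  response_code == 500    # applies to every API, no per-API listing needed
  threshold: count > 3            # more than 3 occurrences...
  window: 2m                      # ...within the last two minutes...
  for: 2m                         # ...and present in both of those minutes
  action: pagerduty
```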
So, based on the same solution, we then enhanced and tested our alerting by implementing the requirements below.
B
So
all
of
them
is
implemented
and
works.
Well,
like
us,
500
error
rate
is
more
than
five
percentage
in
the
last
five
minutes,
like
the
p90
latency
is
greater
than
500
milliseconds
for
the
last
five
minutes,
or
sometimes
the
average
response.
Time
is
not
high,
but
it
continues
to
increase
right.
So
we
should.
We
will
also
a
lot
if
we
increase
almost
20
percentage
in
last
hour
or
we
can
come.
B
We
also
store
the
p90
average
time
in
the
last
week,
so
we
compare
current
average
response
time
if
it
is
two
times
of
previous
last
week's
p90
average
response
time,
yeah
we,
yes,
we
will
also
a
lot.
So
in
such
case,
maybe
this
api
is
very
very
quicker.
It
is
quicker
like
less
than
100
milliseconds.
With all of that, we can identify issues and alert quickly, and it really does bring our cloud platform downtime down, guarantees our SLA, and improves our user experience. And it definitely reduces our DevOps load and improves our work efficiency. Okay, so let's go to another topic for Istio observability: the Jaeger tracing integration.
These were our challenges before we got to our solution. We are using Jaeger, but the spans from the application and the spans from the istio-proxy could not be connected, because their trace formats, the propagation metadata, are different. And also, it was not easy to match a client request with a Jaeger trace: thousands of requests go into Jaeger, and for one specific request it is not easy to search the corresponding trace out.
This is a diagram to introduce our challenge. The client side sends a request to service foo, and service foo sends another request to service bar. Each hop goes through the Envoy proxy, and both the applications and the Envoy proxies send their trace data to Jaeger.
B
The
envoy
policy
said
I
forwarded
the
silicon
chest
formatted
only
and
I
also
forwarded
x,
request
id.
This
is
a
voyages
id,
but
our
application
said
iphone
was
a
jaeger
for
meta
because
we
are
using
jager.
Besides that, we added one additional standard: all the applications, when they print their logs, use the x-request-id as a prefix. The Istio ingress gateway and the istio-proxy already print the x-request-id in the Envoy access log, so with this we can connect them all together.
B
Is
this
solution
complex,
so
neither
neither
developer
cost
a
lot
of
time.
No!
No!
It
is
very
easy.
As
long
as
you
add
this
green
part
code,
you
can
implement
this
so
previously
for
integrate
with
jager.
So
you
can.
You
need
only
add
this
part
when,
when
you
new,
chaser
and
now
in
order
to
compatible
with
silicon
and
the
x
request
id,
you
need
to
add
this
pattern
with
new
zipper
and
with
the
package
package
request
id
and
both
for
the
inkjet
analytics
character.
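[For reference: the slide's code isn't reproduced here, but with the Jaeger Java client the registration described above typically looks like this minimal sketch; the service name is hypothetical:]

```java
import io.jaegertracing.Configuration;
import io.jaegertracing.internal.propagation.B3TextMapCodec;
import io.opentracing.Tracer;
import io.opentracing.propagation.Format;

public final class TracerFactory {
    public static Tracer create() {
        // Zipkin B3 codec, so the app emits/reads the same x-b3-* headers
        // that the Envoy sidecars propagate.
        B3TextMapCodec b3 = new B3TextMapCodec.Builder().build();

        return Configuration.fromEnv("foo-service")   // hypothetical service name
                .getTracerBuilder()
                // the "green part": register the codec for both inject and extract
                .registerInjector(Format.Builtin.HTTP_HEADERS, b3)
                .registerExtractor(Format.Builtin.HTTP_HEADERS, b3)
                .build();
    }
}
```

[With this in place, the application spans and the Envoy spans share one trace context instead of starting separate traces.]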
So these are the benefits; I'll use a sample to show them. Let's say there is some problem with one API request, and we want to get all the traces and the logs for it. We send this API call with a UUID in an HTTP header, the x-request-id header. With this UUID we can then query our traces. Previously it was not easy to query out the trace, but now you just put the tag here in the Jaeger UI, `guid:x-request-id` with the previous UUID, and click the search button. Okay, that's cool!
So the trace is found easily, and you can go into the details of the trace. You can see it goes through the Istio ingress gateway, then to the foo service istio-proxy, and then to the foo service application, as different spans; likewise, there are the spans of the bar service istio-proxy and the bar service application. Great, so you can see all the spans connected together.
What is even more useful: we want to get all the logs together as well. Since the request goes through the ingress gateway, we can find its logs by querying with that UUID: you can see, okay, the Envoy access log shows up. Then we want to find the foo service istio-proxy log — okay, cool, it also shows up. And then we want to see the foo container, which is our application container.
All the service logs related to this API are queried out directly. One thing I need to mention: the application needs to extract the x-request-id from the HTTP header, because the x-request-id is always forwarded by the Istio sidecar to our applications, so we can take it and print it as a prefix for each log line.
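[For reference: a minimal sketch of that extract-and-prefix step as a servlet filter using SLF4J's MDC; the header name is the real x-request-id, everything else is illustrative:]

```java
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import org.slf4j.MDC;

// Copies Envoy's x-request-id into the logging context, so a log pattern like
// "%X{requestId} %msg" prefixes every log line with the request ID.
public class RequestIdFilter implements Filter {
    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        String requestId = ((HttpServletRequest) req).getHeader("x-request-id");
        MDC.put("requestId", requestId == null ? "-" : requestId);
        try {
            chain.doFilter(req, res);
        } finally {
            MDC.remove("requestId");   // don't leak the ID onto the next request
        }
    }
}
```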
Cool. Okay, so let's go to the final tip: quick debugging with the EnvoyFilter. Here's the challenge: there is some error rate on the production environment for one of our applications; however, the application's logs are not enough. How do we handle it? The general way is that the developer adds some logging, builds a new image, and deploys this new image to the production environment through the release chain. It needs a code change, right, and it takes time, and you even have to change the code back afterwards.
Instead, we use an EnvoyFilter, shown below. With the workloadSelector you can limit which application should print the headers or the body — we don't want every application printing so many headers and logs, right. And there are two Lua functions: one is envoy_on_request and one is envoy_on_response. In envoy_on_request you can print out all the request headers and the body; same for envoy_on_response: you can get all the headers of the response, print all their key-value pairs, and also print out the body.
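[For reference: the exact filter from the slide isn't reproduced here, but a Lua EnvoyFilter of that shape could look like this minimal sketch; the namespace and app label are hypothetical:]

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: debug-dump-headers
  namespace: foo            # hypothetical namespace
spec:
  workloadSelector:
    labels:
      app: foo              # only this workload prints the extra debug logs
  configPatches:
  - applyTo: HTTP_FILTER
    match:
      context: SIDECAR_INBOUND
      listener:
        filterChain:
          filter:
            name: envoy.filters.network.http_connection_manager
            subFilter:
              name: envoy.filters.http.router
    patch:
      operation: INSERT_BEFORE
      value:
        name: envoy.lua
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua
          inlineCode: |
            -- dump all request headers into the proxy log
            function envoy_on_request(request_handle)
              for key, value in pairs(request_handle:headers()) do
                request_handle:logInfo("REQ " .. key .. ": " .. value)
              end
              -- the request body can be dumped similarly via request_handle:body()
            end
            -- dump all response headers into the proxy log
            function envoy_on_response(response_handle)
              for key, value in pairs(response_handle:headers()) do
                response_handle:logInfo("RESP " .. key .. ": " .. value)
              end
            end
```

[Applying and later deleting this resource needs no image rebuild or redeploy, which is the whole point of the tip.]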