From YouTube: 004 Gloo Edge in Action: Operational Tips and Tricks
Description
Gloo Edge uses Envoy Proxy and after running Envoy in production at a variety of organizations, including some of the largest in the world, we've amassed a deep understanding of how to run Envoy at scale. In this session, we cover some common tips and tricks for operationalizing Envoy, the Gloo control plane, and where to look when things go wrong.
A
My name is Jim Barton, and my colleague Baptiste and I are both field engineers with Solo. We spend a good part of our days talking to both new and experienced users of Gloo Edge about the challenges they face and the solutions they come up with to overcome them, and we'll share some of those over the next 45 minutes.
A
Please note that this session does assume some Gloo Edge experience, but if you don't yet fall into that category, don't worry about it. Just enjoy this session, and know that you can also sign up for a hands-on Gloo Edge workshop that we're sponsoring this Thursday at 11 a.m. U.S. Eastern time. If you attend that session, you can spin up on some of the fundamentals of the Gloo Edge product, and of course we'll save time for questions at the end of today's session.
B
In this part, I will speak about the filters in Gloo Edge and, more specifically, how they are organized and how they chain together. First, a quick reminder about Gloo Edge. As you know, Gloo Edge acts as the control plane that pushes Envoy configuration to the data plane, which is of course the Envoy proxy. Client requests go through the Envoy proxy and its filters before reaching any upstream server.
B
Luckily, Gloo Edge makes things easier by letting you split the configuration using custom resources. With the Upstream CRDs, you can define Envoy clusters and some configuration related to those upstream servers. With the VirtualService or RouteTable CRDs, you can define the filters and their options that will apply on a route.
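To make that split concrete, here is a minimal sketch of an Upstream pointing at a Kubernetes service; the names, namespaces, and port are illustrative:

```yaml
# An Upstream describing the Envoy cluster for a Kubernetes service.
apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  name: petstore
  namespace: gloo-system
spec:
  kube:
    serviceName: petstore       # the Kubernetes Service to route to
    serviceNamespace: default
    servicePort: 8080
```

A VirtualService route can then reference this Upstream and attach per-route filter options to it.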
B
Next, the ext auth filter. This filter connects to the Gloo ext auth server, where you can configure different kinds of built-in authentication and authorization policies. You can even build your own policies, and you can mix them all together. I will come back to this in my next talk. Now, the JWT filter.
B
With those three filters, the transformation, JWT, and ext auth filters, you can enable the option called clear route cache. What is it? Basically, when a request comes to Envoy, the HTTP connection manager will first select a route matching the headers. When those filters add new headers, you can ask the HTTP connection manager to clear the route that was picked, so that you can dynamically select new routes in the end. This is a really cool feature.
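As a sketch of what that can look like on a route, here is an early-stage transformation that adds a header and asks Envoy to re-run route selection. The field names follow the Gloo Edge staged-transformations API as I recall it, and the header name and value are made up, so verify against your version's API reference:

```yaml
# Hypothetical route options: add a header early, then clear the route cache
# so the HTTP connection manager re-matches routes against the new header.
options:
  stagedTransformations:
    early:
      requestTransforms:
      - clearRouteCache: true
        requestTransformation:
          transformationTemplate:
            headers:
              x-route-hint:
                text: 'canary'
```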
B
Also, with most of the filters, like rate limiting or CORS, when you set these options both at the virtual service level and at the route level, the policy set at the route level takes precedence. Finally, the RBAC filter requires the JWT filter to run just before it, in order to read the principal. Before moving to the next slide, I would like to add that all of these filters are applied only if you use the associated option; otherwise they are not active.
B
Next, the router filter. This one is in charge of adding or removing request or response headers; it is also in charge of applying prefix rewrites, host rewrites, or regex rewrites, and of everything that is related to the upstream servers, like timeout settings, retry policies, outlier detection, and so on.
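A few of those router-level behaviors can be sketched as per-route options like this; the matcher, upstream name, and values are examples only:

```yaml
# Illustrative route options handled by the router filter.
routes:
- matchers:
  - prefix: /api
  options:
    prefixRewrite: /            # rewrite the matched prefix before proxying
    timeout: 5s                 # overall upstream request timeout
    retries:
      retryOn: connect-failure  # Envoy retry policy condition
      numRetries: 3
      perTryTimeout: 1s
  routeAction:
    single:
      upstream:
        name: petstore
        namespace: gloo-system
```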
B
As a final note regarding the ext auth filter: when you use the OpenID Connect policy and you set the callback path or the logout path, just bear in mind that the request will first go through these different filters before reaching the ext auth filter. So, for instance, if you are using the CORS options, then the CORS settings will first be applied to your request before it reaches the ext auth filter for the callback.
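For reference, an OIDC policy with those paths lives in an AuthConfig that looks roughly like this; the issuer, client, and URLs are placeholders:

```yaml
# Sketch of an OIDC AuthConfig (Gloo Edge Enterprise); values are examples.
apiVersion: enterprise.gloo.solo.io/v1
kind: AuthConfig
metadata:
  name: oidc
  namespace: gloo-system
spec:
  configs:
  - oauth2:
      oidcAuthorizationCode:
        appUrl: https://my-app.example.com
        callbackPath: /callback     # handled by the ext auth filter
        logoutPath: /logout
        clientId: my-client
        clientSecretRef:
          name: oauth
          namespace: gloo-system
        issuerUrl: https://accounts.example.com/
        scopes:
        - email
```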
A
Before I started with Solo, I consulted with a large omni-channel retailer for some time. Their business had been flourishing for decades, and they had an extensive portfolio of legacy applications. But by this point the entire organization was committed to modernizing their application portfolio. In fact, they were heavily invested in a popular Kubernetes platform and had a number of applications deployed in a public cloud. They were even given an Innovator of the Year award by the vendor of their Kubernetes platform.
A
I guess spending a few million dollars will generate some award buzz like that. Anyway, when I visited with their development teams on the ground, who were making all of this innovation happen, they all had one complaint about their environment that they loosely called developer productivity.
A
To make a long story short, the root cause was this: every single build into a shared deployment environment had a dependency on approvals from a central organization, and this had a big negative impact on the velocity of their development process and their ability to respond to the needs of the business.
A
First, I want to warn you that I don't have a silver bullet for untangling dependencies between the platform and development teams within your organization. But Gloo Edge does have a really useful feature that many of our customers use to help streamline operations by decreasing the level of coupling between platform and development teams, and that feature is called route tables. If you haven't used route tables before, you definitely want to take a look at the docs; they're pretty good.
A
So let's illustrate with an example. First, imagine a simple little world where you have two development teams: one that manages your custom version of the httpbin service, and a second that manages the Pet Store app. Of course, you use Gloo Edge to expose these services to the outside world. So initially you build a virtual service that maybe looks something like this.
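A minimal sketch of that combined virtual service might look like the following; the domains, prefixes, and upstream names are illustrative:

```yaml
# One VirtualService carrying routes for both teams (the coupled setup).
apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: demo
  namespace: gloo-system
spec:
  virtualHost:
    domains: ['*']
    routes:
    - matchers:
      - prefix: /api/pets
      routeAction:
        single:
          upstream:
            name: default-petstore-8080
            namespace: gloo-system
    - matchers:
      - prefix: /api/httpbin
      routeAction:
        single:
          upstream:
            name: default-httpbin-8000
            namespace: gloo-system
```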
A
Pretty simple stuff. And if we go in and apply this, not surprisingly, that's going to create a virtual service for us, and we can go exercise these applications. Lo and behold, there's the Pet Store service, and here's the httpbin service that echoes all of our input parameters. All right, so that's really fine. But now let's imagine that somebody needs to make a change to the routing for this httpbin service.
A
We've still got the httpbin matcher, but we've added to httpbin this set of transformations here, which is simply going to add a hello-world header to the response. And I know this would never happen in your world, but let's just say that someone makes a mistake in the configuration and they break the virtual service. Now that it's broken, it's possible that it's broken not only for the httpbin team.
A
The httpbin team made the change, but because all of this is coupled in one virtual service, it's also possible that the other pieces could be broken as well. And that's not something that's going to be good for our operational excellence metrics. So let's think about what we would do to set this up using route tables instead. It's very easy.
A
We create a second route table here for the httpbin team, and it looks a lot like the original virtual service as well; we've also added in our little hello-world transformation. Then finally, to tie it all together, we're going to have a virtual service, but it's going to be a much simpler virtual service than what we had before. Rather than encoding all of the routing policies and transformation bits directly into this virtual service, we're actually delegating this out to the route tables.
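Sketched out, the delegated setup pairs a team-owned RouteTable with a slim parent VirtualService that delegates by label selector; the labels, names, and namespaces here are examples:

```yaml
# A RouteTable owned by the httpbin team...
apiVersion: gateway.solo.io/v1
kind: RouteTable
metadata:
  name: httpbin-routes
  namespace: httpbin
  labels:
    app: httpbin          # the label the parent selects on
spec:
  routes:
  - matchers:
    - prefix: /api/httpbin
    routeAction:
      single:
        upstream:
          name: default-httpbin-8000
          namespace: gloo-system
---
# ...and the slim parent VirtualService owned by the central team.
apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: demo
  namespace: gloo-system
spec:
  virtualHost:
    domains: ['*']
    routes:
    - matchers:
      - prefix: /api/httpbin
      delegateAction:
        selector:
          labels:
            app: httpbin
          namespaces:
          - httpbin
```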
A
So you'll notice that for anything that looks like Pet Store, we're going to delegate out to a route table that has a label called app whose value is petstore, which you can see we've specified right here. And the same thing for httpbin: we delegate off to a similar kind of label here. This makes things much cleaner; we're now able to separate concerns as they need to be separated.
A
The development team can manage things like routing and transformation for their app, and a central team can manage the kinds of things that they care about. Okay, so let's apply this and see what happens. We'll go in here and apply route table number one for the Pet Store, and we can watch these build up here in the enterprise console. There's httpbin, and finally we will apply the modified virtual service that uses the route tables.
A
And now you can see we have everything right here. Lo and behold, if we invoke our services, our Pet Store service still works as before, and httpbin works too, but you're also going to see a hello-world header here in the output.
A
So it's a very simple example, I understand that, but hopefully this sparks your interest a little bit if you haven't used route tables before, and will allow you to perhaps take a little step towards streamlining your own operations and making that just a little bit easier to achieve.
B
Okay, in this part I will explain how you can build your own security workflow. As you know, there is this ext auth filter, which connects to the ext auth server, where the authentication and authorization policies will be applied. Gloo Edge comes with a bunch of built-in features like OpenID Connect, OPA, OAuth2, and a few others.
B
There are three ways of doing this. Either you start from scratch and build your own ext auth server that you will use for these policies; this is what you can do with the open source version of Gloo Edge. Or, with the Enterprise version, there are two other ways. The first is by writing your own Golang plugin that you load into the ext auth server; in this plugin you can implement your business logic. The other way is by building your own service that will communicate with the Envoy filter over the passthrough gRPC mechanism.
B
In this second step, there is an OPA policy which will retrieve the ID token from the previous step and verify that the email claim is equal to this email address. If this assertion is true, then the request will be authorized. And finally, another example of how you can control your authorization flow: in the first step there is, one more time, an OIDC policy, and in the second step you can have your own plugin.
B
That plugin will call your in-house access manager. It can retrieve, one more time, the sub claim from the ID token and send it to your server; the server can respond with the group that the end user is part of, and you can allow the request if this group is part of an allowed list. That's it for now. Thank you.
A
The Envoy administrative interface can be really useful when you're dealing with a lot of operational issues. There's also an excellent episode of the Hoot podcast by Solo's chief architect, Yuval, where he explores this interface in a whole lot more depth than we'll have time for in this session, but we will provide links to all the resources at the end. So, to access the Envoy administrative interface, let's set up a port forward from port 19000 of the gateway proxy. We'll do that here.
A
So,
let's
take
a
look
at
what
we
get
so
so
here
you
can
see
there
a
whole
lot
of
services
that
the
this
proxy
or
that
this
administrative
interface
to
the
proxy
provides
just
to
talk
about
a
couple
of
the
ones
that
that
we
see
being
used
might
say
most
often
one
thing
frequently
when
you're
trying
to
debug
issues
you
may
want
to
be
to
set
the
the
log
levels
to
to
say
debug
in
particular
modules
and
so
envoy
gives
you
some
pretty
fine
grain
controls
for
controlling
that.
A
So
you
can
just
set
all
log
levels
to
to
debug
or
info
or
whatever
you
want,
but
there
are
also
finer
grain
mechanisms
as
well,
so
you
can
see
you
can
see
there
at
the
you
know
at
the
bottom.
If
you,
if
you
want
to
have
just
debugging
of
the
wasa
module,
you
can
get
just
that,
but
just
to
show
you,
you
know
very
quickly:
it's
pretty
simple
how
it
works.
A
There's
a
there's,
an
interface
here
where
you
can
can
post
to
that
and
say:
okay,
let's
change
everything
to
debug
and
you'll
see.
Then
all
of
the
interfaces
are
changed
to
debug
so
that
that's
a
pretty
handy
little
operational
tool
if
you're,
not
if
you're
not
familiar
with
that
already.
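With the port forward from earlier in place, those logging changes can be made with the admin interface's `/logging` endpoint, for example:

```shell
# Set every logger to debug...
curl -s -X POST "http://localhost:19000/logging?level=debug"

# ...or raise just one module, e.g. the wasm logger:
curl -s -X POST "http://localhost:19000/logging?wasm=debug"
```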
There are also ways you can look at, say, your Envoy configuration dump. As you may know, Envoy is best when it's not deployed with static configuration.
A
So
one
of
the
downfalls
of
earlier
generations
of
api
gateways
is
that
they
frequently
required
static
configuration,
and
so
when
you
had
a
problem,
you
know
you
change
your
configuration.
You
have
to
bounce
the
you
have
to
bounce
the
the
gateway
instance
and
that's
really
not
a
good,
not
a
good
way
to
do
things
if
you
can
avoid
it.
So
envoy
is
strongly
oriented
around
this
notion
of
dynamic
configuration,
which
is
what
the
glue
pods
provide.
A
Is
they
provide
dynamic
configuration
to
the
envoy
proxy,
and
so,
if
you
want
to
see
at
any
point
in
time,
what
does
my
envoy
configuration?
Look
like
you?
Can
you
can
dive
in
here
and
see?
For
example,
what
are
the
listeners
that
are
configured?
What
does
the?
What
does
the
actual
filter
chain
look
like
to
envoy
at
this
time?
What
did
the
clusters,
which
basically
the
the
upstreams,
the
endpoints,
that
you're
routing
traffic
to
what
kind
of
clusters
are
available?
Are
they
static?
Are
they
dynamic?
A
What
ports
are
they
looking
at
that
sort
of
thing?
So
this
is
some
pretty
useful
information.
When
you
have
a
configuration
issue
and
you're
trying
to
understand
hey,
you
know
my
my
glue
configuration,
I
think
says
this,
but
yet
envoy
perceives
it
as
something
else.
So
so
that's
pretty
pretty
useful
as
well.
A
There are also a whole lot of statistics that are published, and you can of course export these to your own tools for analysis. But just to give you a sense of the breadth and the depth of the statistics that are available, you can see there are literally hundreds and hundreds of things that Envoy publishes, even down to the level of what's happening with specific clusters and what's happening with specific listeners.
A
What kinds of requests and what kinds of responses am I getting, breaking that down to as fine-grained a level as you want. So that's all really useful stuff, and the Envoy proxy administrative interface is a really useful tool to understand from an operational standpoint.
But if you recall, the original question that we started this session with was around profiling; specifically, our customer had a problem with excessive CPU usage in the gloo pod.
A
While these administrative services that Envoy publishes can be really helpful in a lot of operational scenarios, they won't really help us much in analyzing CPU usage in the gloo pod. So what we're going to do is start by grabbing a CPU profile from the gloo pod, and we'll analyze that using the pprof tool for Go. Again, we're going to first establish a port forward from the metrics port on the gloo pod, which is 9091.
A
So let's set that up. There's the base interface, and let's set up another port forward here as well.
A
So
the
first
thing
we'll
do:
let's
actually
go
and
kick
off
the
profiling
session
that
we're
running
just
against
this
test
pod
here,
which
is
a
pretty
pretty
quiescent
at
this
point.
So,
let's,
let's
dig
into
here-
you
can
see
there
are
a
whole
lot
of
different
kinds
of
profiles
that
are
available.
A
You can have this run for as long as you want; we're just going to accept the default length for profiling, which is 30 seconds. We'll let that run for a little while; meanwhile, we'll go back and take a look at the interface. You can see that the administrative interface for the gloo pod is not as extensive as what you get with Envoy, but there are still some useful things here.
A
For example, if you'd like to change the log levels for the gloo pod, you can set debug, info, etc. to adjust the amount of information you're getting. You can also take a look at the metrics that are published out of here, which is a pretty extensive set. At this point our profile has completed, so let's go and kick off the pprof tool and see what we have. This is what pprof gives us, and this is again for a pod where there's not a whole lot going on right now. It's showing a graph, let's zoom in, of some of the individual components that are running and the length of time spent in them.
A
You can obviously see some of the dependencies between the various elements of the gloo pod. There are other kinds of graphs you can look at as well; this one's a really popular one, the flame graph, which shows you a different view of what's happening in your gloo pod. And again, this is available for other pods in the Gloo Edge product as well.
A
But if we take a look at just this graph view and compare it with what we saw from this customer situation, when they ran their own profile against the problematic gloo pod that was exhibiting really high CPU usage, you'll see a contrast that jumps off the page at you almost immediately. If you compare the two, you can see an enormous difference right in this part of the graph, and if you zoom into that, you can see there's a whole lot of activity.
A
That activity is taking a lot of time around JSON modules that are unmarshaling, and the processing that's associated with that. So the customer produced this profile and sent it to Solo engineers for analysis, and it immediately pointed us toward a solution around a new interface that was being used on the endpoint discovery service.
A
So, if you haven't already explored them, there are some really powerful tools here that are available both through the Envoy administrative interfaces and with Gloo. If you're having some kind of a problem, say with resource consumption under load, whether it's CPU or memory or something else, just know that these tools can help you, particularly if you also engage with us on the Slack channel and get our engineers involved; we can help you isolate and resolve some of these problems pretty quickly. Okay.
B
As you know, Gloo can discover services running on your cluster and across multiple namespaces. By default, it will watch all namespaces and write the discovered Upstream objects to the namespace where Gloo was installed. You can control the namespaces that Gloo will watch with a kind of a whitelist, and there are different kinds of destinations that you can use in your virtual service. The most complete one is the Upstream, because you can apply health checks, connection timeouts, or failover on this object.
B
I will show how it works in a minute. Just before that, I will speak about the function discovery service, the FDS. This is a way for Gloo to discover REST endpoints, gRPC endpoints, or even AWS Lambda endpoints by scraping existing upstreams. You can choose whether you want Gloo to scrape all upstreams or only the ones that you label explicitly, and of course you can disable upstream discovery and function discovery with a global option in your Helm chart. So now, a quick demo.
B
Gloo just comes with these two Upstream objects that are part of the Helm chart. Now, if I deploy the httpbin application and have a look again at the upstreams, you can see one was not automatically created. Here is just a simple virtual service using the Kube destination type, and as you can see, I am able to query the /get endpoint even though I still don't have any Upstream referencing this Kubernetes service.
B
Looks like it's deployed. If I get the upstream created by default for my Pet Store application, it is accepted; it is all good. Now, as you can see, I don't have any explicit label asking the function discovery service to scrape the REST endpoints exposed by this Pet Store application, so I will label the service. Okay.
B
Now, if I modify my virtual service, add a new route on /get, and ask Gloo to route these queries to something that doesn't exist, I am pointing to a non-existent upstream. Let's see what happens; let's try it and check what Gloo thinks.
B
Yeah, Gloo just says that it cannot find this upstream; that's normal. And if I query this new endpoint, it still works, because the invalid configuration was not applied. If I want this to be handled in a nicer way, I can configure Gloo to enable invalid route replacement and show the user a nice message with a 404 HTTP code. Let's query the /get endpoint again, and now I have my 404 error with a nice error message. All right.
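That invalid-route replacement behavior is configured in the Settings resource; a sketch, with an example message:

```yaml
# Hypothetical Settings fragment: replace routes pointing at missing
# upstreams with a friendly 404 instead of rejecting the whole config.
apiVersion: gloo.solo.io/v1
kind: Settings
metadata:
  name: default
  namespace: gloo-system
spec:
  gloo:
    invalidConfigPolicy:
      replaceInvalidRoutes: true
      invalidRouteResponseCode: 404
      invalidRouteResponseBody: "Sorry, this route is currently unavailable."
```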