Description
Multi-cluster management is hard. Technology, teams and culture clash in a race to deliver clusters and applications in a secure and compliant way. Red Hat Advanced Cluster Management for Kubernetes (RHACM) provides the capabilities to address common challenges that administrators and site reliability engineers face as they work across a range of public and private cloud environments. Clusters and applications are all visible and managed from a single console—with security policy built in.
B: Good morning, good afternoon, good evening, hello everyone, and welcome to another episode of Red Hat Advanced Cluster Management Presents. It's not a mouthful, it's a product. Today I'm joined by Scott Behrens and the wonderful team from our RHACM program development group. Scott, do you want to introduce everybody?
A: So, Chris, you know I was on your show a few weeks ago and we talked about the end-to-end story. Today I thought I'd bring in some real muscle. We had our VP and our senior architect last time, and I think we're going to go a step better this time. I've got Randy George (I don't know if that's left or right; he's in a Pink Floyd shirt), an architect who leads the observability pillar within RHACM. I've got Joydeep Banerjee, based out there in sunny Southern California; he's the technical lead who focuses on observability, search, analytics, all the goodness in there. And then Chris Doane, who brings our SRE perspective to the table today. Chris is based in Austin with me and Raymond. Chris really helps us dig into the bowels of the system and understand what's breaking, why, and how we fix it.
B: Yeah, so I mean, I have gray hair. If I had a beard, it would be gray, so there.
A: A few weeks ago we had the integration of Ansible and application management, so today we get to look at the observability piece and kind of dissect what we do there. We're bringing a new architecture into play, and I'm going to stop talking there, because I don't want to ruin it for Randy. So take it away, Randy. What are we doing?
D: Thanks, Scott. Yes, I'll give you a little bit of the problems we're solving first, and then I'll take you into the high-level architecture and how we approached it. So, as you know, and you've talked about this: what does RHACM do? You can break it down, simply, into cluster lifecycle management, application lifecycle management, and governance, risk, and compliance, right?
D: Well, if you think about it, it's very hard to do cluster lifecycle management and application lifecycle management if you don't have insight into the health of your clusters and your applications. Obviously you want to meet your corporate SLOs, et cetera. But why would I want to deploy an app to an unhealthy cluster? I'm definitely not going to meet my SLOs that way.
D: So the key thing we want to do is provide that insight, and we're doing that, as Scott mentioned, by adding observability. And again, this is a step along a roadmap. Observability is quite complex, so don't take what we show you today as the end game by any stretch of the imagination; it's the beginning. We have some key focus areas: initially we're going to focus on clusters, and then we'll get to the app layer, and Scott will take us through that roadmap.
D: I don't want to steal his thunder. So, some of the things I've talked about: first was health, and we'll get into some of that. Another thing, and we've heard this from customers quite a bit: as the various development teams work on their projects and deploy them to the clusters, what do they do? They request resources.
D: So they'll come in and find out that, you know, I'm reserving X amount of CPU or X amount of memory and I'm only using a tenth of it, for example. And that's a cost, a real cost: the compute resources, the cost of the OCP they're running on, or whatever runtime they're running on, et cetera. So they want to get their costs under control as well. It's not just making sure everything's healthy and stable, which is very important, but also making sure everything's optimized.
D: So those are a couple of the things, and then what I call the third facet: we detect issues around governance, risk, et cetera. We monitor the health and we can send alerts and notifications. Well, the SRE gets notified; how does he or she go about problem determination?
D: Well, think about the simple pattern: what does an SRE do? They form in their mind a hypothesis of what caused the problem, and then they have to either go prove or disprove it. If they can prove it, they obviously provide a fix or run a script to compensate; if they disprove it, they come up with a new hypothesis. So they need a way to interact with and introspect the data. It's not just dashboards.
D: Dashboards are great for status; they're horrible for problem solving. So that's kind of the third facet, and Joydeep and Chris will demonstrate some of this. We'll show you our capabilities for introspecting the data and discovering problems. The focus in this drop, and they'll demonstrate this later, is collecting metrics from the clusters. Like I said, we also have the ability to collect logs, but the approach there is slightly different. What I'll do is switch over to a picture.
D: A picture's worth a thousand words. That way you don't have to look at me as well.
D: Okay, so let me share this diagram of how we're going about it. Hopefully you can see it. Okay, there you go. So the other thing we had to think about was not just the capabilities but some of the challenges, and the challenges come with scale: the number of managed clusters we're going after, and also the size of those clusters. These aren't servers, they're clusters. So how many nodes, how many pods, how many namespaces, et cetera?
D: Then you also have to get into the concept of a long-term store, because especially when you're looking at things like SLOs or even compliance, if I notice a point where I became out of compliance, you want to look back over time: how many times did that happen in the last month, the last year?
D: So this is a real high-level diagram of our approach, of how we've gone after it. Starting from the bottom up: those are the clusters that are being managed by ACM, and that dotted diagram above is a small subset of the ACM hub.
A: Let me pause it real quick, because that's a key piece. Chris, what we talked about a few weeks ago was that desired-state model we play with in this Kubernetes world. We set the way we want things to be for an application or for the configuration of a cluster, and then we work up to that. And what we're doing in 2.1: this isn't even something extra you're paying for, it's in the box. You're going to get this observability add-on that brings this next layer of cluster health monitoring into your purview automatically.
D: Yeah, and that's a great point, Scott, thanks. In fact, I want to add one thing to what you mentioned about it being there in the box. When you install 2.1, everything for observability is installed; it's just not enabled out of the box. You have to do one apply, create a CR, to enable it. The reason being, there are customers, and this could even be in test environments, that maybe don't want to be collecting all this data.
D: So one step to enable is really all you need, and then automatically, like Scott said, you define the desired state with that enablement and everything will comply with that desired state. If you decide to change any configuration, again everything will get deployed and synchronized with that desired state. Okay, so we have an add-on, and initially, for OpenShift clusters, there's already a built-in Prometheus that scrapes and collects all types of data, and we collect from that Prometheus.
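For reference, the "one apply, create a CR" enablement described here comes down to a single custom resource. A minimal sketch, assuming the `MultiClusterObservability` API shipped around ACM 2.1 (the API version, field names, and the object-storage secret name vary between releases and are assumptions here):

```yaml
apiVersion: observability.open-cluster-management.io/v1beta1
kind: MultiClusterObservability
metadata:
  name: observability
spec:
  storageConfigObject:
    metricObjectStorage:
      # Secret holding the Thanos object-store configuration (S3 bucket, credentials, etc.)
      name: thanos-object-storage
      key: thanos.yaml
```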
D: We can collect from other sources, but the initial drop will scrape from that Prometheus. And we don't gather everything. There are many, many, many time series, and they can be quite expensive down there. We're collecting a subset of that, what we think are the key and optimal metrics, a general-purpose set across all customer needs. If you want to extend that list, you can: we provide a ConfigMap where you can extend the list.
D: To make up a simple number: say we collect 10 metrics and you wanted 12. You can add the other two to the config and it'll start collecting them. So everything is collected and brought forward.
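The ConfigMap mentioned here can be sketched like this, assuming the `observability-metrics-custom-allowlist` name used in the ACM documentation; the two extra metric names stand in for "the other two" and are illustrative:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: observability-metrics-custom-allowlist
  namespace: open-cluster-management-observability
data:
  metrics_list.yaml: |
    names:
      - node_memory_MemTotal_bytes          # example extra metric
      - kube_pod_container_resource_limits  # example extra metric
```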
D: Technology-wise, you know, in Kubernetes management, Prometheus, and when I mention Prometheus I kind of mean the Prometheus ecosystem, the community if you would, is the de facto standard, so we had to comply. We want to make sure we're compatible with things like PromQL, the query language, because a lot of people have skills in that. We want to make sure we support Grafana, because a lot of people use Grafana and have skills there, and we want to take advantage of the alerting capabilities right there.
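As an illustration of those PromQL skills carrying over, a familiar query like the one below works unchanged against the aggregated store (the `cluster` grouping label, added by the collector to distinguish managed clusters, is an assumption about the label naming):

```promql
# CPU usage rate summed per managed cluster
sum(rate(container_cpu_usage_seconds_total[5m])) by (cluster)
```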
D: We want to take advantage of the OpenMetrics format that Prometheus uses, and the whole ecosystem of exposing metrics through an endpoint. So we're doing all that. However, we're adopting the Thanos technology for our storing of metrics over Prometheus. Thanos is built off of the Prometheus engine and is compatible with it; it just uses object storage towards the back end, which provides the ability to do the long-term storage that we require.
A: Let me pause here for a second, take a breath, and think about how we got here. We had a solution which was a federated Prometheus model, right, about a year ago, and we sat down and talked with Christian Heidenreich, who comes from the CoreOS background and has all the understanding of the monitoring space, and about Observatorium, which is being used internally at Red Hat to gather telemetry from thousands of connected clusters.
A: All that runtime knowledge from the past couple of years of Red Hat cloud is now in this project, now in ACM. So we take that goodness, as Randy was just talking about with Thanos: the data-store scalability, the long-term retrieval, all of that baked-in goodness with production runtime code around it, is now part of the ACM story. So instead of a federated Prometheus model that falls over and doesn't handle scale, we have Thanos and the Prometheus architecture.
F: Yeah, I've been playing in this world of containers since before Kubernetes came into being, and one of my first introductions to containers was using Docker and Ansible.
F: The thing about what we are doing here, as Randy mentioned, is that everything out here is well known to the community: Thanos, Prometheus, PromQL, Alertmanager. Everything is very well known, so once you start using it, you can get to the internals of this very quickly and, you know, ingest all the information. That is key. The other piece is this object store where we are storing the long-term time series.
D: Yeah, so, obviously, Randy George: a long, long time in this industry, probably doing management for the last 15 to 20 years, focused on observability.
D: Well, I'm going to go back to when we first started. I was with IBM before we moved to Red Hat, doing autonomics, so it's been probably a good ten-ish years on the observability side, focused on networks and then clusters. And to what Joydeep said about his experience on the Kube side of managing Kube: I mean, we ran Kube in a SaaS production platform when it was, like, the 0.7 release, I think, way back.
D: There was no EKS back then, or anything like it, so we ran just vanilla Kube and managed it ourselves, and kind of learned from that on the way up. It was a good way to learn, because there weren't a lot of capabilities back then; a lot of brute force was needed to do it. So: a lot of experience managing Kube from that perspective, and a lot of experience in observability and management in general.
C: Right, I mean, I've been on many calls with four or five Chrises, so I don't know.
C: Yes, sir. Hi, my name is Chris Doane and I'm representing the SRE perspective on this call. I've been in the industry for a long time, and in my current role I actually wear multiple hats. Right after this call I'm probably going to jump off and do some DevOps CI/CD work with our Jenkins cluster. But yeah, I use ACM, our product, internally, and I try to evangelize its usage throughout our community.
D: Yeah, and that's actually a really key point: folks like Chris use the product, and if we can't derive value out of it, how can we expect our customers to? So we actually eat our own dog food, or as some people like to say, drink our own champagne, however you want to look at it. That's a huge benefit: using it ourselves gives us real-time, early feedback and helps iterate and mature the product.
D: The only other thing I wanted to call out here: in addition to the object store where metrics and alerts are stored for long-term trends and pattern recognition, and again, Joydeep and Chris will demonstrate some of this, we also collect information about all the resources and their relationships on these managed clusters, as well as some of the key attributes and near-term metrics, the "now" metrics, not the trends that sit under Thanos. We store that in a Redis-based search database.
D: That has a graph DB plugged in, and this becomes very powerful for querying and introspecting the data, especially when you want to understand the relationships. These clusters, as you know, have a lot of relationships that are very dependent on each other, whether it's the flow of the app or where things are deployed and their relationships with other things, and those can be causing issues.
D: So we have that other database there to provide that sort of capability. Okay, another key pillar of observability is not just metrics, alerting, and eventing: logs are critical too, especially for an SRE. How am I going to solve problems if I don't understand what's going on in these logs? As you can tell from the diagram, a lot of people do log collection and log storage; there's a lot of value you can get from that, applying analytics and so on, but it's also very, very expensive: network bandwidth, storage, et cetera.
D: Our current approach, and down the road we may look into centrally storing logs, especially for something like edge, where that may be needed, is more of an on-demand approach. So if you're looking into solving an issue on cluster A, if you would, we can go out and grab the logs in near real time, pull them over to the hub server, and utilize them.
A: We talked about that a few weeks ago. Chris, you remember, with Michael and Dave and Jeff: how we ran into this problem in our own development. In creating our own Kubernetes platform, we said, well, now we've got a bunch of these clusters; how do we introspect them, and how do we dynamically gather information about them?
A: That's where RedisGraph and the search collector started to develop. That kind of became the core central theme of multi-cluster management, really trying to solve that problem internally, and that became the start of the next two years of the project for us.
F: So, since we were talking about search and viewing logs, why don't I jump over and, you know, show you.
F: So, on the screen: we didn't show our launch page. When you go into ACM, I didn't show how that page looks; you'll land on this home page, right, but what I tend to use most is our search page. Almost everything that you can see, you can get centrally from here, and it's very ad hoc in nature.
F: You know, you could click on a pod. So this is in a cluster which is named oregon2; you can click on the pod, and not only can you see the YAML, you can look at the logs. You can look at the logs to see what they are doing: response time, 200 OK, right? This is cool. Now, the real issue that we had, and Chris is aware of it; he and I were in the same boat during development.
F: We had, what do you call it, a blip, let's put it that way. It never happens normally, but, you know, we had to restart the pods. So what do you do? Go here and just delete the pod; the pods are restarted. That's huge. That was huge. And in real-life situations, as you can imagine, developers do not always have access to log in to all of the managed clusters.
D: Before you go there: just on that use case, the way we have this set up is we have our Alertmanager configured to talk to a Slack channel. So if we detect a problem, we notify Slack. So he didn't just come in here cold and say, let me look for a pod with this name; we actually detect a problem by collecting the data, analyzing the data, and doing notifications, and then he came in here.
F: Right, right. And talking about detecting problems: yes, we have configured alerting rules, the same way we do it in good old Prometheus. So let's take a quick look at an example of an alerting rule, which you can obviously go and change.
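A rule of that shape, in standard Prometheus alerting-rule syntax; the expression, metric names, and threshold below are illustrative, not the exact rule shown in the demo:

```yaml
groups:
  - name: cluster-health
    rules:
      - alert: ClusterMemoryHighUsage
        # Fires when a managed cluster's memory usage stays above 90% for 5 minutes
        expr: |
          1 - sum(node_memory_MemAvailable_bytes) by (cluster)
            / sum(node_memory_MemTotal_bytes) by (cluster) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on cluster {{ $labels.cluster }}"
```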
F: If you're working with Prometheus, it's pretty well known: you go and configure the Alertmanager to send to Slack and to a pager. In this example, I think in this cluster I've commented out the pager, but anyway, you're sending it to Slack, and that's how it launches. As he said, you're getting the relevant details in Slack, and you can jump back to ACM by clicking here. So, as Randy said, you're not coming in cold. But you know where I was going with this.
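The Alertmanager wiring described here follows the standard `alertmanager.yml` format; the webhook URL and channel name below are placeholders, not values from the demo:

```yaml
route:
  receiver: slack-notifications
receivers:
  - name: slack-notifications
    slack_configs:
      - api_url: https://hooks.slack.com/services/CHANGE-ME  # Slack incoming webhook
        channel: '#acm-alerts'
        send_resolved: true
  # A pager receiver (e.g. PagerDuty) would sit alongside this one;
  # in the demo cluster it was commented out.
```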
F: Where I was going with this search thing is: given a problem, let's assume Scott and Randy are two brilliant engineers who have been given a problem to solve. They have some history, some background, so they might approach the problem differently. A classic case, which I think you folks can relate to very well:
F: some people, when they are told there's a problem, would like to log in to a container and look at the logs to see what is going on, whereas other folks might want to come in from the inside out. They might want to see, hey, what all has been created in the last hour, and let me filter down from there. What changed? And I want to be a little careful, Randy: as you know, we don't fully do that yet; it's on our roadmap. We are not exactly capturing
F: what has changed. We are capturing, right now, what has been created; there could be things that have changed which are not newly created. We are trying to get there. So there can be different ways of looking at this problem, and we capture all of it, no matter which way you look at it. And talking about eating our own dog food:
F: if you want to explore how multi-cluster observability itself works, Randy mentioned in the beginning that we have a CR. So if you go to kind: MultiClusterObservability and click on the CR, boom, this shows you the relationships around the CR. So imagine, guys:
F: something is pushed on you; you have to take ownership; you have to discover what's in there. This is telling you about the CR. Typically, two things that I would personally look at: first of all, what's the route, who is accessing it, what is this serving? So there's a related route, and it's the Observatorium API. In fact, this is the route through which everybody is looking at the metric data. And then: what are the related services?
F: This is not only a day-one tool; it's a day-zero tool, however you look at it. And the simple fact that you're sitting in one place and you have visibility across all the clusters, that's really awesome.
A: It's her job to keep it up and running without any concept of what's in there, what's baked in, what these routes and services are. And to Joydeep's point, you have the capability to learn, to understand, to use this tool to dissect it and figure out where things are, how they've been deployed and architected, so that the SRE, who typically just gets handed something and has no clue where to go next, now has a starting point.
C: Yeah, let me take the screen and do a little segment on how ACM is able to centralize the management of your fleet of clusters. I really found that in this latest release we were working on, where internally we have set up this hub and we try to manage a number of clusters, up to 50, and we deploy different levels of ACM into it. And like Joydeep was pointing out, sometimes there are bugs in the code, and I was able to use ACM itself, on itself, to help debug some of those issues.
C: So first of all, as an SRE, I also focus on the command-line interface. Another cool thing within our platform is that we provide this visual web terminal.
C: I normally go and look at all the managed clusters that are available in my hub. So I run `oc get managedclusters`, and these are the available clusters that have registered into my hub, and I can start drilling down into them.
C: So specifically, let's take a look at Singapore. I can do a search: kind pod, cluster name singapore. We'll scroll down to Singapore here, and then into a particular namespace. We were debugging a problem where some of the agent pods were not starting up or deploying appropriately.
C: And the cool thing is that even though I'm running commands from the command-line interface, this is like a pseudo command line with UI widget components. So once I get this list of all the pods in this particular namespace on this particular cluster, I can still leverage some of the widget context and do things like filtering.
C: I can also look at a particular pod by clicking on it and inspecting the logs, the same way that we're doing it from the UI, but this is like a combination of UI and command line.
C: Yeah, it generates the command for you; you don't have to type it and then press tab to complete it. That's a lot of efficiency there.
A: The cool thing is that that's a big step. Again, sorry, Chris, but you know, we're talking about an app SRE who gets handed a bunch of stuff and has to keep it running. Here's another tool in your pocket that you come back to every day: this visual interface, this visual web terminal, that allows me to interact with an environment that I don't know a ton about, but I'm learning on the fly.
A: They spent six months building it and I get six minutes to understand it, and now it's my job to keep it running. So I can use this to introspect, make changes, pull logs, get summaries and events. This is all based on an open-source project called KUI, and it's an Electron-based implementation, so it feels very similar to Slack and things like that. I'll drop a tag in the chat.
A: I bring it to the point and say: this open-source code, and these opportunities we have to flex on this multi-cluster problem, is really cool stuff. It's a challenge that we looked to solve from the outside in, bringing in some neat technology to solve these issues, but also to give Chris the tools to do his job. So boom, here he is, able to click on things and inspect and move through it in a way that is simply not possible with just a plain CLI.
C: So I think that's really one of the cool things that may not be highlighted, but I really gained that insight after using the platform for a couple of releases. In this last one, everything came together: the performance came together, the features came together, and I was able to experience a problem, locate the source of the issue, fix it, and then we were able to continue on with our work.
C: But if you have to have a session into the remote target, you can. Here I show a terminal where I actually log in to one of the remote clusters, and we can maintain a session, or a context, to this remote cluster, Singapore. I can run my CLI commands, like `oc cluster-info`, in the context of this remote cluster, and it is the context of Singapore.
C: `oc get pods` gets the pods in the current namespace, and I can do the same thing there. And the cool thing is that the widgets respond; the visual web terminal responds. We still have a widget; we can click on these items and then inspect, like we would do through search, and I can access a lot of the data without having to type a lot of commands.
D: To me, what you just did right there is so powerful: you went right to the logs in the context of a specific pod. You don't have to switch over to some other log-collection mechanism and search for the logs; you're just doing it right in context, and in near real time you have the logs right there.
F: We can do that because, at the end of the day, RHACM is just another workload running on OpenShift, one which uses some sophisticated features of Kubernetes, but that's what it is. And I guess the other important thing, Chris, that we were talking about yesterday, is the events. For example, if your pod is stuck in a state where logs are not yet generated: I think in your cluster, Chris, you do have some containers which are in ContainerCreating mode.
C: That's right. So in the event that we're not able to pull back the relevant data, then of course you can log in to the cluster directly and run the commands to debug the issue, and you can do that from this terminal. We maintain the Kubernetes context for you while you're in this session. I've used that as well before.
C: Right, sorry, that's right. Right here you can `oc rsh` into the remote pod and do any debugging that you need to do.
C: Another thing that was cool, that I recently experienced in KUI, was `oc get nodes`. Sometimes, when you're deploying different products or platforms or applications on an OCP cluster, you might have to actually log in to one of the nodes. So I was able to show the other day that you can actually run `oc debug node`.
C: Slash... I don't know, maybe this is not going to work. Let's see. There you go, which is kind of cool.
C: You can actually log in to the node and start to debug at an even lower level than the application layer. So I thought it was kind of cool that our visual web terminal supported that as well.
F: Okay, and you can't, you know, go and do a bunch of stuff in other clusters if you're not authorized to.
F: Could you just show folks where you click to get onto this wonderful visual web terminal?
C: Sure. If I go to the menu bar at the top, there are two options: you can open the visual web terminal in the current session, or open it as a separate tab.
C: Okay, so yeah. When I was doing SRE for our internal cluster, I used the visual web terminal a lot, because I was mainly focused on doing things through the CLI. But as Joydeep demoed a while ago, the search capability is also available through this component as well.
D: That is the power of the observability we're adding: we're collecting all this data that we talked about; we're storing it in an optimal way to allow you to search and understand relationships; and we're giving you the tools necessary to quickly introspect the data, whether through the search capabilities or KUI, the visual web terminal, which makes your CLI interface more powerful and gives access to logs, events, et cetera in real time, so you can quickly get to the roots of problems, like Chris was saying we do in real time and in development.
F: It's no fun being woken up in the middle of the night, I tell you. Once you've been woken up on two consecutive nights at 2 a.m., then you know all about the best principles and best practices, what you need to do to make your life easier. And I think we are hitting a few of the sweet spots.
D: Yeah, but I think you're spot on, JD. Like I said earlier, this is the initial entry into this space, and we went after some of the major items that are required; you have to. So we still have a ways to go, but you can be very productive, and Chris demoed this as well, with this initial set of capabilities.
C: Here's an example of a simple hello-world application. You can click on this search link, and it will fill in the search parameters for those particular fields for that application, and the same thing applies elsewhere.
C: We were deploying a number of applications in our internal environment, and having this feature, where I could select an application that was under focus and quickly have a link to search for its related components, allowed me as an SRE to focus on the key parts of that application, in terms of whether it was going to have any issues or not. So I think that convenience is quite powerful.
C: Yep. And one of the key indicators for the reliability or availability of your application, always the first thing people ask, is: is the pod deployed, and is it running? So with a couple of clicks I can quickly see whether my pod is running across the expected set of clusters or not. I thought having that reliability in our platform, being able to show this data, was one of the key things this release.
F: Well, and Chris, while you're here: you can use the same search, for example, to see which policies are being violated, back to the pillar point that Randy and Scott were making earlier. Three pillars: the third pillar is security. You can see which policies are being violated across the fleet of clusters here.
F: Yeah, and back when Randy was making the point: the two things that we are stressing inside the metrics-collection world are the health of the cluster and the optimization and capacity features. The health of the cluster, the health of the control plane, as we all know, is reliant on the API server; the API service is front and center.
F
This page is organized around the golden signals, right, the SRE golden signals. So we are showing the latency, the 99th-percentile latency, and we have a threshold; we have put it at one second. And then we are showing the request rate and the error rates, right, so the non-200 error rate, etc., etc.
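As a rough sketch of that latency signal (not the product's actual query or dashboard code), the 99th-percentile-with-threshold check amounts to something like this; the sample latencies are invented:

```python
def p99(samples):
    """Nearest-rank 99th percentile of a list of latencies (seconds)."""
    ordered = sorted(samples)
    rank = max(0, int(round(0.99 * len(ordered))) - 1)
    return ordered[rank]

THRESHOLD_S = 1.0  # the one-second threshold mentioned above

# Invented API-server request latencies, in seconds.
latencies = [0.05] * 96 + [0.4, 0.9, 1.2, 1.8]

worst = p99(latencies)
print(f"p99={worst:.2f}s, breach={worst > THRESHOLD_S}")  # p99=1.20s, breach=True
```

In the real dashboard this percentile comes out of the metrics pipeline; the point is that a single slow outlier (the 1.8 s request) does not trip the signal, but a sustained tail above one second does.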
F
Scroll down and you look at the queue depths and the queue rates for saturation. So.
F
Right, yeah, and then you have the optimization piece, back to what Randy was talking about. In real life we know this happens constantly, and this is.
D
With that, we've had customers talk to us about this and wanting this, right? So it's not just something, like you said, that we knew; it's been validated. You know, there are real customers coming and asking for these quick insights. And the way this is organized is, when you're running many clusters,
D
these dashboards give you a quick little kind of view of the status, but also of which ones I want to drill into, right. So it's not an "i equals one to n, go look, look, look"; instead, let me come up here and see which ones I want to attack. You can see: oh, maybe that's a dev cluster, I don't need to look at it, but this one is one I care about and it's red, let me go drill down, right.
F
Exactly, exactly. And so, you know, here you're seeing that I am utilizing only 16 percent, but because of my requests I've claimed 51 percent. And, you know, trust me, there have been cases where you might be utilizing 16 percent and your requests are adding up to 99 percent or something like that.
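That gap between utilization and requests can be flagged with simple arithmetic once both numbers are in hand. A hedged sketch with made-up figures mirroring the 16-percent-versus-51-percent example (the function name and the 25-point threshold are ours for illustration, not the product's):

```python
def reservation_gap(used_cores, requested_cores, capacity_cores):
    """Return (utilization %, requested %) of cluster CPU capacity."""
    return (100 * used_cores / capacity_cores,
            100 * requested_cores / capacity_cores)

# Invented numbers: ~5 of 32 cores actually busy, ~16 of 32 requested.
used_pct, requested_pct = reservation_gap(5.1, 16.3, 32.0)
if requested_pct - used_pct > 25:
    print(f"over-reserved: using {used_pct:.0f}%, requesting {requested_pct:.0f}%")
    # prints: over-reserved: using 16%, requesting 51%
```

As the discussion below notes, a large gap is not automatically a problem; the point of surfacing the two numbers side by side is to let you decide.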
F
Back to the point, Randy, you were making earlier: I think you cannot schedule an app on that cluster anymore, right? Kube will not schedule it. And talking about drill-downs, you know, you can go all the way down. We are obviously looking at the namespace level here, right, we are looking at the namespace level, and then from the namespace you can go all the way down to the pod level as well. So, you know, that feature is there.
D
They also found, as we drill down, that some of these are justifiable, right? If you're running in an HA-type environment, you are going to reserve more than you are really using, because you need that spare capacity running. But having the data allows you to determine: is this a real problem or not? You can also look at it as, this is how much I'm spending to get HA; is this really worth it, right? Yeah, right.
D
So it seems, kind of, you know, basically anybody you think about. There are so many people, and you've seen the charts, that are now starting to adopt Kubernetes at last, I see about, what, 30-ish percent or something. And there was just so much going on so fast that people really don't have good governance and really understand, are people.
D
One customer told us that their OpenShift was too expensive to run. I'm like, what do you mean, the license? And what we found out was, well, they're running all of these worker nodes and they didn't need to, but they're paying by core, right, right. They have all of them reserved, but they're not using them, right. And so they were able to roll back to more realistic usage and get their expenses in line, right.
A
Yeah, and you know the roadmap: since you've got the data, you can pivot around it, yeah. We didn't talk much about the trifecta, Randy; we didn't talk about metrics, logs, and trace data, but that's in the future, right? Like, we're working towards app monitoring. So today we're delivering cluster health monitoring, that's multi-cluster health monitoring, yes, and we'll start to build in the direction of app monitoring.
E
Open source project, that's.
A
Logs and monitoring together, events and alerts, which are in the box today; and then over time you'll see, you know, Jaeger, that trace aspect of this, start to come into the fold. And, you know, it's a lot of products that basically assemble into this observability platform. But what we're seeing is customers are already doing it. They already have an opinionated point of view on how they work together, but we want to make sure it not only services this.
B
That's right, yes. I was about to say, two weeks from now y'all return with the great Red Hat Advanced Cluster Management Presents show. "Not a specific use case, just curious," that says; Mac responding. Sorry: "Not a use case specifically, just curious, thinking like centralized executive reporting." Well, you could totally do that with what's in the box, right? Like, you can build those dashboards for your execs with Grafana, no problem.
B
Yeah, so look at it that way, right? Like, use the tool that's there to build that dashboard the way you want it, right. And that way it's native; there are no conversions, there's nothing. It won't break when an API changes, right? Like, it's just going to be there, so you can totally do that.
B
Yeah, like, this was awesome, it was really great. A lot of people tuned in, and I'm sure there are some questions. So if you have any, hit me up on Twitter, Chris Short; hit me up on email, cshort at Red Hat, and I'll get them to the team and we'll get you answers if you need them. So, thank you all for joining.
B
Thank you, Joydeep, Randy, Chris, and Scott, and we'll see you all in two weeks. Thanks.