From YouTube: Kubernetes UG VMware 20220303
Description
March 3, 2022 meeting of the Kubernetes VMware User Group with discussion of recent updates and patches, changes for Kubernetes version 1.23, and a user survey and discussion of what logging solutions people are using.
A: Hi, welcome to the March 3rd meeting of the Kubernetes VMware User Group. We've got a bit of a light turnout today, maybe because we didn't have a speaker pre-organized, but we'll start the meeting anyway. We did have an announcement here; let me share the meeting and agenda notes, just a minute.
A: So, just two days ago a new release of the vSphere CSI storage plug-in came out, and in fact the docs just got updated yesterday. There were a few things I wanted to call out. There are some pretty major features: snapshotting is now supported for block volumes, which could potentially be a big deal for people who are using backup tools like the open-source Velero or others, because I think that's a nice enablement feature for supporting disaster recovery scenarios.
A: The docs also seem to have very good coverage of the known issues. If you're on a commercial release that bundles these things as part of the install process, I'd recommend that you not manually try to move up to the new CSI storage plug-in unless your vendor supports you doing this, because that might affect the level of support they're willing to provide.
On the agenda, I had proposed that we also talk about a survey of members and what they're using for a logging solution, but given the light turnout today, maybe we'll postpone this.
The idea I had here is that we often come up with scenarios where people express difficulties troubleshooting when you've got multiple layers of infrastructure going on: you start with an underlying hypervisor layer on the vSphere platform.
A: Often a good tool for establishing causation is to see which layer had the errors occurring first, if at all. But like I said, some of the people who are usually vigorous participants here seem to have other things to do today, so I think I'll try to roll that into the meeting for next month.
I see we have somebody who joined a little late; it's up to you whether you want to call yourself out. Oh okay, we do have people joining late.
A: So you missed it, but you can refer to the agenda notes document: I just called out that there was a new release of the CSI storage plug-in, and it has some pretty major features. Quick recap: support for snapshots is now there on block volumes, so that could be a big deal for backup and DR scenarios.
A: There are also some known issues, one in particular that I think at least one person has noticed, where a partitioning of your vSAN can cause some difficulties up at the Kubernetes layer. So I'd strongly recommend that anybody look at the release notes before moving to it. The docs seem very thorough and were just updated yesterday.
A: So I suspect that might be a pretty big deal to some, too. Another thing that came up, which people might be curious about but which doesn't really relate to this, as it turns out, is that a new release of VMware Tools came out. I went and looked into it, and really it only affects people running VMs with Windows and macOS, because on the Linux platforms, which is what you'd end up using for running Kubernetes on vSphere, the story is still that open-vm-tools is the one you should be using, and the latest changes haven't been reflected over into the open version of the VMware tools. So it's a non-issue for Kubernetes.
B: Actually, the new version does come with big benefits for those running Kubernetes when it comes to open-vm-tools. Relatively new, like within the last year, they added appInfo, which comes out from VM Tools as a guestinfo kind of thing, so you can query from outside any processes that run within a VM; you can get that as a guest property that gets updated every 10 minutes. In this version they added that for containers, so containers running on either containerd or Docker get bubbled up to vSphere.
B: That can be queried from the vSphere UI and the vSphere API. So you could actually tell now, from a vSphere perspective, which containers are running on which nodes, which could be helpful in different automation scenarios around Kubernetes, without needing to access Kubernetes itself to understand where system components are running, things like that.
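To sketch the idea: the appInfo data that VM Tools publishes is a JSON payload readable from outside the guest. Below is a minimal Python sketch of pulling application names out of such a payload; the exact schema and field names used here (`applications`, `a`, `v`) are assumptions for illustration, so check the VMware Tools appInfo documentation for the real ones.

```python
import json

def list_running_apps(appinfo_json: str) -> list:
    """Extract application names from a guestinfo.appInfo-style payload.

    The payload schema here is an assumption for illustration; the real
    field names may differ between VMware Tools versions.
    """
    doc = json.loads(appinfo_json)
    return [app["a"] for app in doc.get("applications", [])]

# Hypothetical sample payload, shaped like what VM Tools might publish.
sample = json.dumps({
    "version": "1",
    "applications": [{"a": "containerd", "v": "1.5.9"},
                     {"a": "kubelet", "v": "1.23.4"}],
})
print(list_running_apps(sample))  # ['containerd', 'kubelet']
```

An automation tool would fetch the raw string via the vSphere API (it shows up alongside other guest properties) and then parse it like this, without ever talking to the Kubernetes API.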
B: Exactly. I think it's just a good thing for automation tools. I don't have many use cases for it necessarily, but getting that information into vSphere without requiring kubectl access, while still allowing some automation, especially around things like maintenance windows, where you can make sure not to mess things up from a vSphere API perspective when dealing with maintenance... it will definitely be interesting to see how people utilize that functionality.
A: ...if the storage isn't thoroughly integrated into vSphere itself. And a frequent thing that comes up during these meetings is a situation where people have a performance or a failure issue, try to troubleshoot it, and need to get down to a root cause, which would be easier if they had in place some kind of consolidated logging that brings things together, accessible through one tool.
B
The
reason
I
say,
unfortunately,
is
because
anyone
that
has
managed
elasticsearch
at
scale
at
any
scale
over
time
understands
that
it
is
a
beast
of
a
tool
and
it
is
a
pain
to
manage
and
no
matter
how
many
resources
you
give
elasticsearch.
It
is
never
happy
and
always
once
more,
I
happen
to
think
that
grafana
loki
is
actually
one
of
the
best
tools
out
there
and
I
think
that
getting
because
loki
integrates
into
grafana,
which
is
very
commonly
used
for
observability.
B: For shipping the logs, Fluent Bit, there's no question, does a much better job than Fluentd in terms of performance. Fluentd is more flexible, but I think Fluent Bit still wins there, in my opinion, from what we've seen with customers, just because of the resource utilization that Fluentd requires; it's so heavy. I'm also a fan of Log Insight, which many VMware customers happen to have.
A: Yeah, I happen to be too. Of course, I'm biased: being a VMware employee, I can get free licenses for my home lab, but I'm still using Log Insight. A lot of it, I think, is inertia; I started using it years and years ago, and it has held up and still works. But then again, I've got a pretty small installation, you know, three cluster nodes, and it could be an entirely different beast if you get up to a real data center.
B: Yeah, it definitely has issues at scale that make it difficult, especially if you're running Elastic on Kubernetes; there are a lot of intricacies with doing that itself, and lots of issues with the Helm charts that exist out there. Unless you're going with the operator-based install from Elastic, which is definitely the suggested way to run it on Kubernetes, I would say, it's not easy to maintain over time in large environments. But yeah, I'm personally not a big fan of Elastic, though we see it a lot out there, just...
A: ...because it's one of the big ones. I'll take your advice on Loki and poke around with that when I get some spare time, just take a look at it. I'm curious, and I think I run into people who are looking for open-source solutions, particularly when they're at a smaller scale, bringing up an eval or a learning-exercise kind of cluster.
B: Definitely. I'm building a Carvel package for that now as well, so yeah.
A: And then, of course, in this arena there are a lot of commercial and cloud-hosted options. I've been hearing for decades that some of these logging tools become such beasts that it's not one-size-fits-all: some of the super-capable ones can end up being so large that the act of deploying and maintaining them is as big as the effort to deploy a small Kubernetes cluster. So they might be overkill for some situations.
B: I completely agree with that. I mean, there's a reason you've got hosted Elastic from like seven different vendors out there, and Log Insight Cloud from VMware; you've got all of these systems because, when you reach large scale, managing it on your own is hard. Just the cost of the storage on its own, and making sure that you actually have the data when it comes to the regulations people have, where you actually have to keep your data for X amount of time...
B: And then you need to make sure that you also have redundancy of your storage, and everything just becomes huge. I have a customer with 150 terabytes of Elasticsearch data, and that covers less than a year of logs. That's not a cheap endeavor, or a fun endeavor, to deal with.
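As a back-of-the-envelope check on that figure: 150 TB covering roughly a year implies an ingest rate on the order of hundreds of gigabytes per day, before any replication. A quick sketch (assuming a full 365-day window, which the quote says is actually an overestimate, so the real daily rate is higher):

```python
# Rough numbers for the "150 TB in under a year" example above.
TOTAL_TB = 150
DAYS = 365  # assume a full year; the real window was shorter, so this is a floor

daily_ingest_gb = TOTAL_TB * 1024 / DAYS
print(f"~{daily_ingest_gb:.0f} GB of logs indexed per day")

# With one replica copy (a common Elasticsearch default), raw disk doubles.
raw_storage_tb = TOTAL_TB * 2
print(f"~{raw_storage_tb} TB of raw disk with replication")
```

Even before redundancy, that's over 400 GB of new index data every day, which is why the storage bill dominates at this scale.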
A: It's pretty wild, yeah. Well, if you look at the vendors that sell logging solutions, like Splunk, you can just look at the financials they report, in the stock-market tracking services, to see how much revenue they're pulling in, and you can tell that it's a pretty big deal, and a pretty costly deal.
B: Yeah, I caught what you said there at the end, and it's completely true. I actually think one of the things people are realizing now is that if you keep your logs for long enough, you can do more with them. I just did a project on this with a customer: they were keeping their logs for a year and a half, and we were able to build very simple AI models around those logs to actually get a lot out of the data.
B: Based on what you see in the logs before an error actually occurs, you can understand when an error is going to occur, and you can do things like the Proactive HA that vSphere has, but at the application level, because you can start to get a lot of data coming out of there.
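A toy sketch of the idea: scan retained logs and count which messages tend to show up just before an error, which is the kind of signal a predictive model would learn from. The log lines here are invented for illustration, and real systems would normalize messages into templates and use much larger windows.

```python
from collections import Counter

def patterns_before_errors(lines, window=3):
    """Count which messages appear in the few lines preceding an ERROR.

    A deliberately tiny sketch of mining retained logs for precursors of
    failures; not a real anomaly-detection pipeline.
    """
    counts = Counter()
    for i, line in enumerate(lines):
        if "ERROR" in line:
            # Tally everything in the window just before this error line.
            counts.update(lines[max(0, i - window):i])
    return counts

log = [
    "INFO checkpoint ok",
    "WARN disk latency high",
    "WARN disk latency high",
    "ERROR write timeout",
    "INFO checkpoint ok",
]
print(patterns_before_errors(log).most_common(1))
# [('WARN disk latency high', 2)]
```

A real model would then flag "disk latency high" as a precursor worth alerting on before the write timeout actually happens.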
A: And I can see the value, particularly if you're doing AI and machine-learning kinds of things. You definitely need a year, maybe two years' worth of data to make that effective, because the training set needs to go back far enough to establish whatever the norm is over a meaningful period of time.
A: A couple of days doesn't cut it. But if you have regulatory and compliance requirements that are going to force you to keep those logs for 5-10 years, I'm not sure there's any value other than checking the box to say you had it, and paying for keeping that part of it on block storage. I don't really know if there are solutions that support tiering, to just kind of roll that off into archival storage that maybe isn't readily available.
A
But
you
know
at
the
cost
of
keeping
that
on
high
performing
block
storage
for
10
year
old
log
files
would
seem
to
be
pretty
dubious
to
me.
That's
just
my
gut
feel
I
haven't.
I
haven't
really
been
charged
with
running
a
gigantic
installation
that
would
keep
that
much
data
for
that
long.
But
that's
just
my
gut
reaction.
B: Yeah, the way that I view it is that so many organizations, especially in the financial sphere, have to keep these logs for such a long period that it's just a waste not to utilize them. So having it in a tool that's easily accessible via APIs and things like that, so you can actually run AI models on it, is huge, and that's really where I think decisions need to come in about what the actual extension points and APIs of a system like this are.
B
That
in
many
organizations
is
just
wasted.
They
just
keep
the
logs
because
they
have
to
yeah
and
that's
unfortunate
because
that's
a
huge
benefit.
You
just
have
data.
Why
not
mine
it
yourself
and
figure
out
what
to
do
with
it?.
A: Yeah, and this isn't just preemptively spotting trends that would clue you in on impending failures or bugs; I can see a potential security tie-in, if somebody were to help people mine this. I think very few end users are going to do it themselves; this is something that should probably be done by a vendor or project, because it's useful to everybody.
A: But, you know, I could see a scenario where this is a security tool. If you look at things like ransomware attacks, where they're attempting to secretly start encrypting things, or somebody who slipped Bitcoin mining somewhere into your stack, you should be able to spot these with these logging tools in the form of unusual patterns in resource consumption. If somebody were able to deliver that kind of analysis based on a big reservoir of logs, I think there are a lot of people out there who could benefit from it.
B: Yeah, I think there definitely is, but I think it becomes a bit more difficult with those things, because usually they're not logging anything, right? So unless your system is logging specific things that can show an anomaly, the malware usually isn't going to be logging anything in your system.
A: Maybe there's anomalous network traffic going on; those kinds of secondary signals might end up showing up.
B: Right, and I think that's going to show up more in something like Prometheus than in logging; that'll be more like metric gathering, where you see weird spikes happening. There are some interesting things happening there in the metrics world as well. I think the biggest difficulty from a vSphere perspective is that vSphere, or most tools that collect metrics from vSphere, doesn't have low enough collection intervals to actually capture a lot of that data. vROps, for example, is five minutes.
B
So
a
five
minute
interval
is
very
easy
to
miss
things
when
they
happen.
Prometheus
is
five
seconds,
so
that
becomes
easier
right.
It's
the
question
of.
If
you
have
real-time
metrics,
it
becomes
a
huge
benefit
and
you
can
definitely.
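The effect of the collection interval can be sketched with a toy simulation: a 30-second CPU spike is caught by a 5-second scrape but can fall entirely between 5-minute collections. (This models instantaneous reads for simplicity; real collectors such as vROps may average over the interval instead, which hides short spikes in a different way.)

```python
def samples(signal, interval):
    """Sample a per-second signal at the given interval (instantaneous reads)."""
    return signal[::interval]

# One hour of per-second CPU values: idle at 5%, with a 30-second spike to 95%.
signal = [5] * 3600
for t in range(601, 631):
    signal[t] = 95

fine = samples(signal, 5)      # Prometheus-style 5 s scrape
coarse = samples(signal, 300)  # vROps-style 5 min collection

print(max(fine), max(coarse))  # 95 5 -- the coarse sampler misses the spike
```

Whether the spike lands on a coarse sample point is luck of timing; with 5-second scrapes, any spike longer than the interval is guaranteed to be seen.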
B: That's why what I usually do in those cases is use Thanos with Prometheus, and then you ship the data up to object storage; the benefit of that is basically long-term retention. Prometheus isn't good at storing data for over two weeks; the time-series database in there is not good at storing data for a long time.
B: Thanos is built to do that, and it actually provides a full Prometheus API. You have Thanos sidecars that you just put within the Prometheus pod, and they send the data to object storage; then you have a centralized Thanos that can query those buckets and compact the data. The benefit of that is that you can actually do things like archival tiers: once you're in object storage, say in AWS, you could create...
B: I think it's really interesting, though. Logs are the easier one, I think; metrics are more difficult. With metrics, a single metric usually isn't enough: you really need the context of the log from the container, from the VM, from the ESXi host, all of them together, before you can correlate something. When it comes to logs, it's much easier to work on just a single object and actually gain true value from it.
A: Okay, well, I wish we might have had a larger turnout, but regarding this discussion of logging solutions, it's been really useful. David dropped a message in chat that he had.
A: Okay, I'll take silence, then, as nothing more to discuss for this time. But thanks; it has been a fascinating discussion. If anybody has ideas for next month, or even something where you might want to ask me to go recruit a speaker, let me know. If you don't have that idea now, you can drop it in the Slack channel for this user group or put it in the agenda notes doc.
A: Okay, so with that said, let's wrap this up. Thanks, everybody, for attending, and we'll see you next month. In the meantime, anything you want to discuss, bring it up in the Slack channel. Bye-bye.