A: I'm Dave Zolotusky, a principal engineer at Spotify, and today I have Rajan and Dave from Fidelity Investments with me as our guests. In these live streams, we bring in end user members to showcase how their organizations navigate the cloud-native ecosystem to build and distribute their services and products. Join us every fourth Thursday at 9 a.m. Pacific time. This is an official live stream of the CNCF, and as such you're subject to the CNCF code of conduct; please don't add anything to the chat or questions that would be in violation of the code of conduct.
B: Yeah, sure, I'll go first. My name is Rajarajan Pudupatti; people call me Rajan. I'm part of an organization called ECC within Fidelity, and within that I'm part of the cloud platform team. I primarily work on the Kubernetes-based projects, and the goal of our team is basically to set up the next-generation application platform for our users. To put it in the right words: if, let's say, a developer within Fidelity wants to do something to achieve a particular business objective, we want to make it as simple and as easy as possible for that developer. So that's me; very happy to represent the cloud platform team.
C: Hi, my name is Dave Batello. I'm in charge of the private platform squad within Fidelity, where we're primarily focused on building out Kubernetes platforms to run our container workloads on premise. We have probably around 40 production Kubernetes clusters on premise running at any one time, supporting a variety of workloads with close to about 10,000 containers.
A: Cool, that sounds like a lot. I'm sure it feels like a lot sometimes. Before diving deeper into the Kubernetes part, I'm curious: can you tell us a little more about the infrastructure setup at Fidelity, and what prompted you to start adopting cloud-native tools and Kubernetes?
B: Yeah, I can start with that. So basically we have a mix of on-prem and cloud, and we are on multiple cloud providers as well. Just to give an idea: instead of just setting up clusters and then opening them up for the users, our goal was to come up with a platform which is more Fidelity-specific, in the sense that we want all the best features available from the CNCF technologies to be available for the users, but at the same time we have some hard constraints from an enterprise standpoint. For example, today, if I'm a developer within Fidelity, it's not very easy to just go and spin up your own Kubernetes cluster, deploy an application, and take it to production. You have to take into account a lot of security aspects that come into the picture, in terms of the image that you are using, the AMIs; there's a long list. And this particular list keeps changing as well. For example, there could be a security event, not just within Fidelity but anywhere outside, that could trigger a different policy or a change to a policy. These constraints are not so easy for a developer to keep up with.

B: So the goal we set out with was to build an application platform where we take all these constraints into account, a platform where, as a developer, you get to experience all the best features from the CNCF technologies, while at the same time you're guaranteed that you are running in a secure Fidelity environment. Security is one aspect of it; there are other aspects like compliance and a lot of other things which we'll get into in detail. But that is the goal: as a developer, we want to make it really easy for you to quickly deliver on the business objectives. So Dave mentioned 40 clusters today; that's on the on-prem side. Totally put together, we have EKS clusters and AKS clusters on AWS and Azure, so the total crosses 300 plus now. We have a dashboard, and sometimes we look at it and it jumps from one number to another, so I'm pretty sure we've crossed 310 or something like that. So that's the high-level setup. Dave, if you want to add.
C: Yeah, for infrastructure from an on-prem perspective: we primarily built out our platform around vCenter, around vSphere infrastructure. For our on-prem services there's also a proprietary API platform which we built on prem that sits in front of our vSphere infrastructure, and we leverage that for a lot of the provisioning of our virtual server instances, which we lay the foundation upon for building Kubernetes. So a large portion of the responsibility for building out the platform entails also understanding how the infrastructure works behind the scenes and tightly coupling our integration deployments, our Kubernetes build-outs, to that infrastructure.
B: Yeah, and I just wanted to add on the CNCF technology standpoint: we use Kubernetes, of course, but we also use Helm, so Helm is our standard for packaging. Most of the time when I mention something like Helm, keep in mind we always have this approach where we don't want to restrict users; we have this famous saying that it's batteries included, but swappable. So as a user you have the option to switch to something else as well, but Helm is one of the most widely used packaging mechanisms. This platform, which we'll talk about in more detail, is kind of a base platform on top of which you can run a lot of things. In that perspective we have Envoy, which is running on top of EKS clusters as part of an API gateway and things like that. So it's a combination of all these technologies.
B: We try to look at where the community is going, and we want to stick with the community. So almost always we look at the landscape and make sure that we pick something from the landscape. For example, we use some, if not all, of the components from Flux CD: the Helm Operator, which is part of Flux, now the helm-controller, and now I think it's the GitOps Toolkit. So we use that extensively; we have actually built an open source project called Kraan on top of it. So basically we took these technologies from the CNCF and built something on top of them, just to extend them a little bit for our use case, and we have open-sourced that as well. So the GitOps Toolkit is one example, and Kraan is the open source project that we have built on top of it. We use containerd as well. These are the major ones, and we're always constantly exploring projects on the telemetry side too. We are looking at Fluent Bit and things like that. Yeah, we use...
C: ...Fluent Bit for a lot of our log collection. We're also pretty heavily invested in OPA from a governance and compliance perspective, so we're using OPA to build policies and constraints around how we govern the platform itself. There are all different types of policies that we've implemented to enforce specific things, like the metadata that is associated with namespaces. We're also looking to make the migration very shortly from the native PSP policies within Kubernetes over to OPA to do that policy enforcement.
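To make the namespace-metadata idea concrete, here is a minimal sketch in Python of the kind of rule such a policy encodes; in practice this would be a Rego policy enforced by OPA/Gatekeeper at admission time, and the required label names here are invented for illustration, not Fidelity's actual schema.

```python
# Illustrative sketch of a namespace-metadata rule of the sort OPA could
# enforce at admission time. The required label names are invented.
REQUIRED_LABELS = {"cost-center", "business-unit", "environment"}

def validate_namespace(manifest: dict) -> list:
    """Return a list of violation messages (an empty list means admit)."""
    labels = manifest.get("metadata", {}).get("labels", {})
    missing = REQUIRED_LABELS - set(labels)
    return ["namespace is missing required label: %s" % l for l in sorted(missing)]

ns = {"metadata": {"name": "team-a", "labels": {"environment": "dev"}}}
violations = validate_namespace(ns)
print(violations)
```

An admission webhook would reject the request whenever the returned list is non-empty; here the sample namespace is missing two of the three invented labels.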
A: That makes a lot of sense, and I think both of you alluded to or touched on security for the environment. I don't think you necessarily said it, but I assume there's a lot of regulation as well, so I'm curious how being in such a regulated space, one required to be highly secure, impacts the way you look at all of this infrastructure and open source tooling.
B: Yeah, so we have very strict regulations, being part of the financial industry, and that's actually important as well; the way we look at it, it's very important to have. So we have built the platform in such a way that, from a user standpoint, if you talk to any of the Fidelity developers today, and let's say they are in AWS, they don't look at it as an EKS cluster or a Kubernetes cluster; they look at it as a Fidelity platform. Internally we call it Fid-EKS, Fid-AKS, and so on, so they always refer to it as, hey, Fid-EKS version one. We have our own Fidelity platform versioning, so they usually say Fid-EKS 1.0, 2.0 and so on.
B: So coming to the security point, what we've done is package all these things as part of the platform. Whenever we make a release, there are all these add-ons; I'll give you an example. The OPA that Dave was mentioning is something which is part of the Fidelity platform. So from a Fidelity user's standpoint, they don't look at it as one standalone add-on which is running in a cluster; he or she is basically exposed to the features that come out of the add-ons.
B: So the way we try to portray it is: we have this platform, and you have all these features that are available to you; don't focus on the add-ons aspect of it, because behind the scenes, between Fid-EKS 1.0 and 2.0, we might actually switch an add-on to something else, or we can combine two add-ons. We can do a lot of things as part of the platform, but from a user standpoint they just look at it as a feature.
B: So from this perspective, we have built the security features into the platform. Let's say we are going from one Kubernetes version to another: we have this rigorous process where we check every single add-on that is part of the platform. It goes through a process to make sure that, between the versions of Kubernetes as well as between the versions of the add-ons,
B: there is nothing that has changed that impacts our security guidelines. It could be as simple as a particular add-on version's base image: let's say the base image that is part of the new add-on version is not compliant with some of the current security policies. That's one good example. Those are things that we will actually validate as part of our rigorous validation process before we release our platform version. So what happens is, whenever the users get a notification that, hey, we have this Fid-AKS 2.0 or Fid-EKS 2.0, most of these things are actually already handled for them, and we do make it part of our internal release notes so that they are aware of what all has changed.
B: Sometimes you have to take a slightly different approach. For example, let's say we want a particular change because something is not in compliance with one of our security policies, but at the same time we are unable to get that change from the open source project immediately. Those are the cases where we will come up with certain workarounds, for a small period of time.
B: We could release the feature as part of the platform version, but at the same time do some workaround for a period of two months or so, until the actual change comes from the open source project. These are the things that we do as a platform team, but from a user standpoint they're unaware of them; from their standpoint, you have a feature that is very stable and working.
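The platform-versioning idea described above can be sketched as a release manifest: each platform version pins a Kubernetes version plus every bundled add-on version, and a release step can diff two platform versions to drive the internal release notes. All names and version numbers below are invented for illustration; this is not Fidelity's actual release format.

```python
# Hypothetical platform-release manifests: one platform version pins the
# Kubernetes version and every bundled add-on. Names/versions are invented.
RELEASES = {
    "fid-eks-1.0": {"kubernetes": "1.19", "addons": {"opa": "0.24", "fluent-bit": "1.6"}},
    "fid-eks-2.0": {"kubernetes": "1.20", "addons": {"opa": "0.26", "fluent-bit": "1.7"}},
}

def diff_addons(old: str, new: str) -> dict:
    """List add-ons whose pinned version changed between two platform releases."""
    a, b = RELEASES[old]["addons"], RELEASES[new]["addons"]
    return {name: (a.get(name), b[name]) for name in b if a.get(name) != b[name]}

print(diff_addons("fid-eks-1.0", "fid-eks-2.0"))
```

The point of pinning everything under one platform version string is exactly what Rajan describes: users track "Fid-EKS 2.0", and which individual add-ons moved underneath is an internal release-notes detail.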
A: Yeah, and so what does that make versioning and upgrading look like from a user standpoint? Like, you release Fid-EKS n plus one...
B: So basically, what happens is, maybe a little bit on the Fidelity structure first: we're not one central team which manages all the clusters. Fidelity is a large organization and we have a lot of sub-organizations. Each business unit is a company by itself, with a lot of developers, and they have their own dedicated DevOps teams and SRE teams. So the way it works is that a business unit will have a DevOps team and an operations team.
B: So when we release, it's actually them taking the platform version and then upgrading the cluster. It's not as if we do it for them; we put the tools in place, we have a UI where they can actually go and do it, but we don't do it for them. They have their own timelines and it's up to them. So basically it goes like this: we release a Fid-EKS version, and we'll have a call.
B: We have a release call, and then the users get to see what new features are coming in and what all the breaking changes are, and then they get to decide when they can actually do it.
B: We do have a time frame where we support something like the n minus three or n minus four versions, so it's not as if a particular business unit can stick with a particular version forever. That window is there, but within it they pick the version and they actually upgrade the clusters themselves. And if there are any issues, that becomes a platform issue; it's an issue happening with platform 2.0, and then we jump in, and we have all this monitoring.

B: We have dashboards and everything set in place, so people know up front: if somebody's doing a cluster upgrade, a platform upgrade, and something is going wrong, we would automatically jump in. So that's how it works. From the standpoint of a DevOps team member who's picking up the version: we might have packaged 20 different add-ons within the platform, and each could be on its own version, but they're not worried much about that.

B: They look at the entirety of it as one single version. So even if one of the add-ons is not working, let's say a particular feature is not working, from their perspective it's basically the platform version that is unstable, and we just release a patch version for it. That's how it goes, actually.
C: Yeah, so on the private cloud side, on prem: a lot of what Rajan was talking about was how we manage things out in the public cloud, and a large portion of that is self-service. They provide the release out to the customer base, the customers consume it, and they pretty much have the ability to roll it out on their own schedule.
C
On-Prem
we're
a
little
bit
more
prescriptive
over
it.
We
take
a
little
bit
more
control
over
it.
It's
probably
more
of
a
managed
offering
more
than
anything,
and
so
what
we've
been
trying
to
do
over
the
last
you
know
year
is
try
to
obviously
keep
up
with
the
kubernetes
versions.
That's
that's
a
challenge
right.
So
this
past
year
we
had
a
target
to
try
to
do
four
upgrades
in
one
year,
and
so
we
did
four
upgrades
this
this
year.
C
What,
by
the
end
of
this
year,
we'll
have
done
four
kubernetes
upgrades
with
a
target
of
one
per
quarter,
so
hopefully
by
the
end
of
this
year,
and
maybe
my
team's
listening
will
get
120
out
the
door
by
the
end
of
this
year,
and
so
that's
a
substantial
amount
of
workload.
That's
just
to
keep
up
with
the
versions
that
doesn't
include.
C
You
know
all
the
work
that
we
do
around
like
add-ons
things
that
you
know,
rajan
was
mentioning
around
maintaining
conversions,
for
you
know
all
the
different
charts
that
we
roll
out
to
support
the
environment
and
provide
other
capabilities.
B: Yeah, I see a question from maywork: how do you keep up with the upgrading? Dave, if you're okay, I just wanted to touch upon that, because it's something where I think some of our learnings could be useful for the users.
B: So there is this problem, especially when you are multi-cloud. Imagine we are trying to have a platform which provides certain features, so from a user standpoint they're looking at this unified platform, and it is supposed to run irrespective of where you are: whether you are on-prem, or natively on Azure. It's a very difficult thing to do, especially when, for example, on the on-prem side, let's say we are using Rancher there.
B: The versions that Rancher would support will be slightly different. I'm referring to the n minus four versus n minus three problem: for example, one vendor might say, my current is 1.20 and I follow the n minus three model, so at any point in time 1.17 is the oldest supported. At the same time another vendor on the cloud, AKS for example, could be in a situation where they are doing 1.19 and n minus four, so their least supported version is 1.15. So how do you
B: deal with this? It's a very tricky problem, and there's no clean solution to it, let me put it that way. That's where we, the platform leads, constantly meet, and sometimes we ask Dave, for example, to slow it down a little bit so that we can catch up, things like that. But one thing that we always put first is the stability of the platform.
B: Even if, let's say, 1.20 (I'm just taking a number) has some really important feature, and one of the teams, or some of the teams, are waiting for it: if we think that we won't be able to provide this uniform experience, say Azure wants to move forward while AWS lags behind, and that is going to be the situation, we really evaluate it and we try not to do that. We try to wait. So stability becomes more and more important than releasing new features, and sometimes we'll actually tell the application teams: hey, this is a feature that you want, but can you live without it for another few months?
B: Is it absolutely essential? Because the point here is that stability always comes first. That is one thing. And it took some time for our internal users too; I think it's a mindset. For example, upgrading: these are big clusters, and a lot of critical applications are running in them. So a year back we went back to them and said, hey, the community moves very fast.
B: If you look at Kubernetes as a project, the developers are amazing; they come up with all these features very quickly, so the versions move very fast, and so we release versions as well. A year back it was very difficult for our internal customers to digest the fact that every two months or so they have to do a major upgrade.
B: Now, looking at how stable the whole thing has been (it's more of a perception thing), we have taken it to a point where it's okay to do the upgrade; it'll be stable. Building that sort of trust is very important. If you can put all your efforts towards making the upgrades really, really stable, then the users build on it.
B: It's this confidence that builds in the user. For example, look at our version-upgrade validation process. We have a mix of add-ons: a lot of community add-ons, plus some of our custom builds, including a lot of operators that we write ourselves. Every single add-on gets a rigorous walkthrough.
B: We have a separate set of smoke tests and integration tests that is very well maintained, and it will almost always catch an issue that's mapped to a particular version or something like that. So there is a rigorous amount of work that goes into validating each of the add-ons that are part of the platform.
B: So we put a lot of effort towards the stability aspect of it, and that in turn increases the confidence of the users, and now it's the new normal. It's not the way it was back when upgrades happened a few times a year for major platforms; that's not the case right now. So building that confidence in your users is very important. I just wanted to add that.
A
Yeah
now
that
makes
sense,
I
think,
of
a
question.
That's
kind
of
building
on
that
exact
kind
of
stability
and
confidence
in
the
user.
Part
they're,
asking
about
how
you
make
sure
that
upgrades
or
updates
to
any
of
these
components
are
safe
to
apply
and
on.
B
That
that's
a
good
point,
so
one
of
the
things
that
we
have
we
do
is
we
sort
of
have
a
structure
where
I
know
it
differs
from
company
to
company.
But
we
have
to
follow
an
approach
where
we
have
certain
engineering
clusters.
We
call
it
as
test
clusters,
platform
engineering
clusters,
so,
for
example,
I'll
give
an
example.
Let's
say
between
a
development
and
the
testing
and
the
production
environment
itself
like
there
are.
B
Usually
there
are
differences
in
terms
of
policies
and
stuff
like
so
we
make
sure
that
our
testing
clusters,
the
platform
engineering
clusters,
are
on
all
these
spaces.
So
when
we
start
out
first
of
all,
before
even
going
to
the
platform,
nearly
everything
starts
from
your
local
right.
B
We
have
a
very
strong
set
of
test
cases
that
are
very
well
maintained
right,
so
it's
it's
it's
it's
based
on,
of
course,
combination
of
cucumber
and
all
different
sort
of
things,
so
we
have
a
very
strong
set
of
integration
test
smoke
test
that
is
very
well
maintained.
I
keep
stressing
on
the
very
well
very
well
maintained
because
it's
easy
to
come
up
with
the
first
set,
but
sometimes
like
over
a
period
of
time.
You
can
easily
like
you
know,
not
maintain
it
very
well.
B
Then
it
loses
its
purpose,
so
we
use
we
sort
of
rely
on
that
which
will
actually
catch
a
lot
of
issues,
and
even
after
that
there
is
a
rigorous
testing
on
an
environment
basis
we
sort
of
tested
in
like
platform,
engineering,
dev
platform,
engineering
production.
B
These
are
efforts,
but
for
for
for
our
scale,
these
are
like
you
know
massive
for
for
supporting
300
clusters.
We
cannot
afford
to
make
mistakes.
We
do.
A
B
Mistakes
here
and
there,
but
we
do
everything
possible,
like
we
sort
of
things
to
do
it,
so
that
is.
That
is
one
thing
right.
We
have
like
strong
integration
test,
the
test
suite
that
is
like
well-maintained,
we
test
through
every
you
know,
environment
type,
and
after
that
also
when
we
release
it
again.
This
is
something
where
we
don't
upgrade
all
the
300
clusters,
as
I
said
before,
it's
more
of
the
users,
picking
it
and
picking
picking
up
the
release.
B
So
we
also
try
to
see
if
we
can
actually
work
with
some
of
the
business
units
who
are
usually
they're,
okay,
to
pick
up
something
first
right,
so
there
we
worked
them
very
closely
to
see
if
there
are
any
issues
in
the
in
the
development
clusters
when
they
upgrade,
they
usually
start
with
development
clusters.
So
we
sort
of
monitor
that
very
closely.
B
We
have
very
strong
logging
in
telemetry,
which
sort
of
helps
us
that
if
somebody
is
picking
up
a
release
and
putting
it
in
their
dev
clusters,
let's
say
that
is
the
first
of
300
clusters
that
is
getting
up
upgraded,
like
all
our
eyes
are
on
this
right.
So
we
we
watch
it
very
carefully
and
if
we
see
an
issue
then
we
sort
of
quickly
you
know
revert
to
it.
B
Sometimes
we
even
you
know
it's
rare,
but
we
can
even
like
pull
out
the
whole
release
and
say
that
you
know
what
like
it.
You
know
we
will
come
up
with
the
patch
fix
and
stuff
like
that.
So
no
straightforward
answer,
but
one
one
good
thing.
If
one
point
one
take
away,
if
you
want,
I
would
say,
maintaining
a
strong
set
of
you
know:
integration
test,
suite.
C
Yeah,
I
could
add
on
to
that
a
little
bit
I
mean
from
from
the
on-prem
perspective.
Like
rajan
said,
you
know,
we
we
definitely
have
spent
a
lot
of
time,
building
out
these
test,
suites
unit,
testing,
functional
testing
and
ensuring
that
we're
not
just
doing
this
testing.
C
You
know
at
the
end
of
a
release
cycle,
but
we're
doing
these
types
of
tests
all
the
time,
and
so
some
of
the
strategy
behind
it
really
is
around
building
that
end-to-end
testing,
something
that
we
can
run
on
a
daily
basis,
something
that
is,
you
know,
bringing
issues
to
our
attention
on
a
daily
basis
versus
you
know,
finding
out
right
before
the
end
of
the
release.
C
I
think
the
second
piece
of
that
for
us
is
really
rolling
out
these
releases
in
a
little
bit
of
of
a
you
know,
canary
fashion,
if
you
want
to
call
it
that,
where
we'll
do
you
know
in
in
our
area,
we
have
multiple
zones
in
multiple
regions,
we'll
do
one
at
a
time.
C
From
a
non-production
perspective,
we
give
our
business
partners
adequate
time
to
cycle
through
that
environment,
ensure
that
they've,
you
know,
maybe
deployed
workloads
multiple
times
become
comfortable
with
it
and
then
subsequent
scheduling
of
of
the
upgrades
to
our
production
clusters
happening.
You
know
during
tech
windows.
You
know
during
times
when
there's
the
least
chance
for
impact
to
our
production
running
workloads.
C
So
that's
a
lot
of
the
method
behind
the
madness.
That's
for
sure.
B: Yeah, there's another question on the chaos engineering stuff, so I want to take that; it's a very good question. We've been doing the chaos engineering stuff since early 2021, but the point I want to stress is that even two years back, and I remember this very clearly, even in 2019, we made sure that if, say, we want to add a feature to the platform and the feature comes from a particular community-maintained add-on, there is testing associated with it. Even when we bring that in, we make sure that before you can plug it into the platform, you have to add your test case to it. We run helm test against all the add-ons, so there is no add-on that can go into the platform without a test case associated with it. We also take it a step further: we have this open source project called Kraan, where we came up with the idea of something called layers.
B: So what happens is, basically, you have a collection of add-ons. Look at our case: we have clusters running in Azure, AWS, and then on-prem. There are certain add-ons that run everywhere, but there are certain add-ons that run only in one cloud, say Azure, and there are certain add-ons which run only on-prem, for example. So we came up with the idea of layers.
B: The reason I bring up this layer concept is that even two years back we were very clear that each add-on should have a test associated with it, and this layer, which is a collection of closely related add-ons, will have an integration test associated with it, which is basically another Helm chart. So imagine a layer which has five add-ons; each is a Helm chart.
B: So each of them has a test, and then there will be the last add-on in the layer, which is a Helm chart that basically does the integration testing of all those add-ons. These were significant efforts, but they paid off really well in the long term. So helm test is extremely important: even if you're picking up a community project which doesn't have it, please add it, add that to your list, and at the same time come up with an integration-test Helm chart.
B
You
know
that
can
actually
validate
like
how
certain
add-ons
how
they
work
together.
I'll
give
an
example,
for
example,
as
a
part
of
our
onboarding
process
right,
so
we
created
an
extension
to
namespace
called
namespace
groups,
so
the
users
typically
they're
not
exposed
to
namespace.
They
always
start
with
something
called
namespace
group.
So
as
soon
as
you
create
a
namespace
group,
there
are
certain
things
that
happen.
So
your
your
reading
groups
are
automatically
you
know
created.
B
There
are
certain
things
that
happen
and
it's
it's
basically
a
work
done
by
a
few
add-ons
together.
So
there
is
like
a
an
integration
test,
helm
chart
which
basically
checks
this
particular
thing
right
so
yeah.
These
are
some
of
the
things
that
you
know
we
have
been
doing
like
even
from
the
beginning.
At
the
same
time
recently
we
have
early
2021
starting
early
to
21.
We
have
started
focusing
a
lot
on
the
gas
engineering
stuff,
so
we
that
that's
part
of
our
suite
as
well
now.
C
Yeah
and
the
chaos
engineering
aspect
of
it
right,
so
you
know
we.
I
think
that
we're
dabbling
in
that
right
now
I
have
you
know
I
have
definitely
looked
at.
You
know
integrating
chaos
mesh
into
some
of
our
pipelines.
That
would
not
only
handle
you
know
building
you
know,
building
these
these
clusters
running
through
unit
tests,
but
also
knocking
things
over
you
know
and
then
ensuring
that
the
platform
continues
to
function
as
as
we
expected
to
so
we're
still.
A
Yeah
that
makes
a
lot
of
sense
and
then
for
the
specific
tests.
I
think
there's
kind
of
the
question
about
case
engineering,
but
also
post
mortems
and
things.
Do
you
do
things?
Do
you
have
ways
of
ensuring
that
times
when
it
does
go
down
or
you
do
run
into
issues
that
doesn't
happen
again
like
how
you
bring
that
back
into
your
testing
frameworks?.
B
Yes,
yes,
that
that
that
actually
happens.
So,
let's
let
me
think
about
it.
So,
basically,
typically
what
happens
is
like
when
we,
when
we
we
sort
of
prioritize
our
it
could
be
an
obvious
thing,
but
I
think
that's
something
I
just
want
to
stress
upon,
because
it
really
works.
Well,
we
we
prioritize
stability
over
features.
That's
that's
that's
that
might
sound
obvious,
but
it's
something
which
is
very,
very
important.
B
If
we
find
there
are
certain
things
that
we
have
not
done
wrong,
it
actually
feeds
back
and
then
we
sort
of
focus
on
that
first
before
the
next,
the
new
features.
So
the
reason
I
say
it's
obvious
thing
is
it
takes
effort
when
you
bring
that
back
when
you
discuss
in
your
you
know,
sprint
meetings
and
stuff
like
that.
This
is
given,
like
you
know,
very
high
importance,
so
I
think
it's
it's
part
of
our.
B
Maybe
I
don't
know
it's
the
same
culture
now
that
we
have
to
focus
more
on
the
stability.
I
think,
if
you
have
like
a
small
team
with
a
few
clusters,
then
it's
a
different
thing,
but
especially
when
you
are
holding
all
these
like
300
plus
clusters
for,
like
you
know,
thousands
of
developers
and
a
big
organization
like
fidelity
stability
comes
first,
so
we
sort
of
immediately
take
that
and
then
put
it
back
to
our.
You
know
sprint
to
make
sure
that
the
changes
are
done
to
the
the
test.
C
Yeah
I
mean
added
on
to
that
version.
I
think
that,
like
it's
pretty
much
ingrained
into
our
fidelity
dna,
that
root
cause
analysis
is
the
de
facto
method
for
us
coming
to
conclusions
of
what
needs
to
be
fixed
right.
So
I
mean
my
team
is
very
well
versed
in
the
fact
that
you
know
when
we
find
problems
that
we
need
to
come
to
that
root,
cause
to
understand
how
we
can
resolve
that.
So
it
doesn't
happen
again
so
that
our
application
partners
don't
run
into
these
types
of
problems
down
the
road
so
yeah.
C
We
spend
a
lot
of
time
tracking
trying
to
ensure
that
we
are
opening
up
stories
and
and
and
understanding
when
we
haven't
figured
things
out.
You
know
that
we
get
back
to
those
and
we
drill
into
those
things.
You
know.
As
a
matter
of
fact,
we
were
talking
about
some
of
those
things
this
morning
before
we
lucky
enough
to
join
your
your
your
broadcast
here
so
yeah
and.
B
And maybe one more important point: we look at things in a slightly different way. For example, let's say something is happening on the customer's side. We have three different environments which are supposed to mimic the customer environment in terms of security profiles and everything. So if there's something they are catching that we are not catching, we try to look at why that difference exists, which means
B
there is a mismatch in terms of how the environment is configured versus ours. We even try to look at the fundamental process that was broken to let this happen, and we go and fix that, so that not only will this problem not happen again, but many similar types of problems will not
B
occur. So we go down to that level where, even if it's a very basic, Fidelity-specific process, we try to
B
push to make changes or automate it. Basically, we try to analyze the base cause: not just at a high level of why this happened, from this one particular problem's perspective, but to the extent of how we prevent not just this problem but similar types of problems from occurring. One example, maybe a couple of years back, was the way our IAM roles were managed in AWS.
B
We took a big step and came up with our own framework based on stack sets and things like that, so we changed the whole process. It was a lot of effort to actually do that, but now, when I look back, that is one of the most important things we did. We had to get a lot of approvals, because that was already a hardened process, but we got the approvals and we changed it.
B
And after that, we've not seen not just that issue: that whole space is now very stable. So you have to go to that level, if that makes sense.
A
Yeah, that does make sense. One more potentially quick thing on testing before we move along. You talked a lot about testing, and it sounded to me like you were mostly talking about testing the Fidelity platform, the pieces you've built. I'm curious whether your automation and your tests also catch potential issues in the tooling, like if there's a change in Kubernetes that breaks something for you. Does that get caught here, or is there a different process for catching things like that?
B
It's baked in, actually. For example, the integration test piece that I talked about: it includes test cases for the community add-ons, not just our stuff. Sometimes we actually go and raise issues upfront, and that benefits the community as well. So it covers all the Kubernetes stuff as well as the community add-ons.
A
Cool, that makes a lot of sense.
A
So then, I just wanted to take a bit of a step back and hear a little more about the overall architecture. You've mentioned multiple clouds and you've mentioned 300 clusters, but that's about as far as we know, so I'm curious to dig a little deeper. How big are these clusters?
B
Absolutely, absolutely. I'll start, and then I think you can also jump in. Most of our clusters are in the medium-size range; medium in the sense that I don't think any of them has more than 75 nodes or so.
B
Not huge, for sure, but at the same time not small; most of them fall in that range. Everything is multi-tenant; that is one of the key things we decided back when we started in 2018. So there are a lot of processes built around it to support the multi-tenancy.
B
One of the examples was the stuff I was mentioning earlier around the extension of namespaces. For example, when a team onboards, instead of just mapping it to a particular namespace, we created an entity around it: every team, when they onboard, gets something called an NS group. That is the Kubernetes aspect, but there is a Fidelity-specific aspect in terms of how it gets integrated; for example, how do the AD groups get created, right?
B
In terms of the number of teams, it varies, but you could easily find 50 to 75 teams working in a cluster, and I'm talking about teams of, say, eight people or so, and they cannot step on each other. There are frameworks built around resource quotas, limit ranges, and things like that; for example, the NS group concept includes a section for the resource quota.
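The resource quotas and limit ranges Rajan describes are standard Kubernetes objects; a minimal sketch of what a platform could stamp into each tenant namespace (names and numbers here are illustrative, not Fidelity's actual values):

```yaml
# ResourceQuota: caps aggregate consumption for one tenant namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a-dev        # hypothetical tenant namespace
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "200"
---
# LimitRange: per-container defaults so one pod cannot grab the whole quota
apiVersion: v1
kind: LimitRange
metadata:
  name: team-a-limits
  namespace: team-a-dev
spec:
  limits:
    - type: Container
      default:                 # applied when a container sets no limits
        cpu: 500m
        memory: 512Mi
      defaultRequest:          # applied when a container sets no requests
        cpu: 100m
        memory: 128Mi
```

Together these mean tenants can deploy freely inside their namespaces while staying inside a bound the cluster admin set once at onboarding.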
B
So as a cluster admin, I can say this team gets this NS group with certain resource quotas and so on, which means the team themselves can go and add and delete namespaces within that NS group, but it is bound to a particular set of constraints. So it is multi-tenant. Most of them follow the typical approach of having a cluster admin, and usually there is little dependency on the cluster admin.
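Fidelity's NS group is an internal construct, but conceptually it resembles a custom resource that a controller reconciles into namespaces, quotas, and directory-group bindings. A purely hypothetical sketch (the `NSGroup` kind, API group, and every field name here are invented for illustration):

```yaml
# Hypothetical custom resource: one per onboarded team.
# A controller would reconcile this into namespaces, RBAC
# bindings to AD groups, and per-namespace resource quotas.
apiVersion: platform.example.com/v1alpha1
kind: NSGroup
metadata:
  name: team-a
spec:
  owners:
    - adGroup: APP-TEAM-A-ADMINS   # maps to a directory group
  namespacePrefix: team-a-         # members may create team-a-* namespaces
  quota:                           # aggregate bound set by the cluster admin
    requests.cpu: "20"
    requests.memory: 40Gi
    pods: "200"
```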
B
What I mean by that is you go to the cluster admin for the initial stuff, where they onboard you and set the constraints; after that, we try our best not to create a process where you have to go to the cluster admin again and again. So mostly teams are independent. We do have a pipeline setup, but a pipeline is something that can become opinionated, so teams can bring their own; most of them use their own deployment pipelines and things like that.
C
Yeah, from the on-prem side of things, again, we've adopted, or prescribed to, the notion of smaller clusters. Rajan mentioned medium-sized clusters; initially the thought behind it was maybe we'd go with larger clusters, but ultimately it comes down to blast radius.
C
We've learned that the automation involved with maintenance around these, like rehydration, gets lengthy: it takes a lot of time to rehydrate a thousand nodes. By breaking it down into smaller pieces, we're really able to create decision points where we can decide whether to move on with other clusters if, say, we ran into a problem, or just continue, or stop dead in our tracks and revert or pause things, so to speak.
C
So: smaller clusters, multi-tenant, mostly business-aligned. A lot of our clusters are specifically business-unit aligned, so that creates separation between our business partners, and then within those clusters they're definitely multi-tenanted, where all the different development groups work within that cluster, separated through namespaces. And in non-production
C
we see a lot of those namespaces delineated by their various development cycles.
C
And for the most part, those non-production clusters tend to be, on average, at least two times larger than our production clusters, just based on the number of workloads in the various environments they cycle through during application development before they get to production. From an architectural perspective on-prem, I mentioned earlier that we're predominantly on vSphere; on top of that, we front all of our clusters with AVI load-balancing services, and that AVI load-balancing service sits
C
pretty much as an L4 proxy down to the nginx ingress controllers, which handle the path-based routing, and then we also allow them to use node port ranges so that they can do direct pod traffic. So yeah.
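The L4-proxy-in-front-of-nginx pattern Dave describes ends at a standard Ingress object for the path-based routing; a minimal illustrative example (host, namespace, and service names are invented):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: team-a-routes
  namespace: team-a-prod          # hypothetical tenant namespace
spec:
  ingressClassName: nginx
  rules:
    - host: apps.example.com      # VIP terminated on the external L4 load balancer
      http:
        paths:
          - path: /orders         # path-based routing to a backing Service
            pathType: Prefix
            backend:
              service:
                name: orders-svc
                port:
                  number: 8080
```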
A
So I guess the big thing here is just around monitoring and observability of your clusters: what tools do you use, how do you collect metrics, logs, traces, all of the normal observability things, and how do you monitor the health of the workers?
B
Yeah, we'll talk about it; I think there's also one question on the ownership of the cluster. Basically, as I said, the business units are our internal clients. We provide this platform, which is basically a framework with a collection of tools to manage it, but the ownership of the cluster is actually with the business units. The platform has defined its own set of roles.
B
There's something called the global admin, the platform admin, and so on. A business unit will have a DevOps or SRE team, and they'll actually be the cluster admins. We do have overall access, but they are the actual cluster admins.
B
So for upgrades, or if, for example, you're having issues with a resource quota and things like that, that's where users actually go. That is the ownership of the cluster, and the ownership is basically divided that way. And whenever I say platform: it's more than a collection of namespaces, though from a Kubernetes standpoint, it is a collection of namespaces.
B
If you open up a Fidelity cluster, you have this set of management namespaces and system namespaces. Anything with "-system" is a system namespace, similar to kube-system, where all the critical add-ons run, and then we have a collection of management namespaces where everything from the cluster autoscaler to the ingress controller runs. That set of things put together is the platform. If any issue happens there,
B
we are responsible for it. The cluster admin doesn't even have to look into it; it comes straight to us, because it's a platform issue: your platform is unstable. Anything other than that, say an issue with a particular resource quota or limit range within a user namespace, that's where the cluster admin will come into the picture. Beyond that, we have another role called the namespace admin.
B
So if I'm the owner of a namespace group, I have a collection of namespaces and admin access to them, which means I can do whatever I want within them. For example, say I'm trying to install some sort of CRD-based operator: there is an automated process where you can submit a request, and a particular add-on within the platform
B
will create the custom resource definition for you; from then on, you're on your own. That's how we've done it. So, coming back to the monitoring strategy: it's actually a mix, but we have a combination of Datadog, Splunk, and tools like that, and we have a very good collection of pre-built dashboards. At any point in time, when you have 300 clusters, look at it
B
this way: each cluster has this collection of namespaces, which I said is the platform. So across these 300 clusters, if the platform is unstable on any of them, we will get to know; that's how we set it up. From our standpoint, we just look at a particular platform version: when we release platform version 1.0, we know which clusters are upgraded and which are not, and at any point we would get to know if platform
B
1.0 has issues in any of the clusters. So we use Datadog in combination with Splunk, and we have all these pre-built dashboards. We use metrics heavily. As for metrics, logs, and traces: I think some of the community projects have tracing, and some of them
B
don't, nor do some of the internal tools we've developed. I think we're still in the process of making the best use of tracing, but we're getting there. In terms of monitoring, again, it's separated: anything platform-related comes to us, anything application-related goes to the namespace admin, and anything else goes to the cluster admin. That's how we've separated it.
B
So if an application team has an issue with their deployment, it doesn't come to us. At this point I just wanted to touch a little bit on something we are actually working on, just so users can maybe think along these lines. Take the problem where you have these deployment pipelines: today, if a deployment is having an issue, we have an SRE team.
B
It comes to us sometimes, but most of the time what usually happens is: if I'm a mid-level developer with four or five years of experience, I usually go to the team lead first, and then the team lead will go to the business unit DevOps teams, right. So what we are trying to do now, as part of the platform, is come up with another sort of system. Imagine this; this is what we are trying:
B
You have a Jenkins pipeline, for example. Imagine we give you a Jenkins plugin where, any time your Jenkins build fails, it prints out a link; you click on the link, and it tells you what the problem is. That is something we are actually working towards, and hopefully we'll open source it. We're trying to build some machine learning models and then do some analysis on top of them to come up with these answers.
B
The reason I mention it is that we are now trying to take it a step further, so that for each developer, who is the actual user of the platform, we focus on the pain points they have and try to solve them. I just wanted to touch on that a little bit, but going back to monitoring again: maybe, Dave, do you want to add something?
C
Sure, yeah. One of the questions on the board was: how do you monitor worker health? We do have some basic things for workers that just ensure the virtual machine exists and is responsive.
C
That's really just basic monitoring; the monitoring itself really comes from Datadog. The Datadog monitors we've set up are specifically looking at the components within the clusters. For instance, if your kubelet is down, then your node is not going to work, right? From that perspective, the node is inoperable; it's not functioning. So that's kind of how we prescribe to it.
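A component-health monitor of the kind Dave describes can be defined against Datadog's monitor API as a service-check monitor; a hedged sketch, where the check name, tags, notification handle, and thresholds are all illustrative rather than Fidelity's actual configuration:

```json
{
  "name": "[platform] kubelet failing on {{host.name}}",
  "type": "service check",
  "query": "\"kubernetes.kubelet.check\".over(\"cluster_name:prod-team-a\").by(\"host\").last(2).count_by_status()",
  "message": "kubelet is failing its health check; the node is effectively inoperable. @slack-platform-oncall",
  "options": {
    "thresholds": { "critical": 1, "warning": 1, "ok": 1 },
    "notify_no_data": true,
    "no_data_timeframe": 10
  }
}
```

The idea is that the alert fires on the in-cluster component (the kubelet check) rather than only on VM liveness, matching the distinction drawn above.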
C
So a lot of the monitors we've written are really around that service-component health status. From a logging perspective, as Rajan mentioned briefly, we do use Splunk. We have kind of a mixed bag of logging: we use Datadog in some areas, we use Splunk, and we also have a team internally that has built out some really interesting architecture around an aggregation tier based on fluentd, publishing into Kafka.
C
Those Kafka topics are then read by ELK, and that's how we're able to use Kafka almost like a traffic manager: where do I send logs for these specific clusters? Because there are different requirements from the business lines around where they want their logs to land. So yeah, I hope that answers some of those.
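A fluentd aggregation tier that fans container logs out to Kafka topics, as Dave sketches, would typically use the `fluent-plugin-kafka` output; a minimal illustrative fragment in which the broker addresses, paths, and topic names are invented:

```
# Tail container logs and ship them to a per-cluster Kafka topic;
# downstream consumers (e.g. Logstash/ELK) subscribe per topic.
<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.pos
  tag kube.*
  <parse>
    @type json
  </parse>
</source>

<match kube.**>
  @type kafka2                      # from fluent-plugin-kafka
  brokers kafka-1.example.com:9092,kafka-2.example.com:9092
  default_topic logs-prod-team-a    # routing key: one topic per cluster/business line
  <format>
    @type json
  </format>
  <buffer>
    @type file
    path /var/log/fluentd-kafka-buffer
    flush_interval 5s
  </buffer>
</match>
```

Routing by topic is what lets Kafka act as the "traffic manager": each business line's ELK (or Splunk forwarder) consumes only the topics it is entitled to.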
B
I think the Datadog and Splunk agents we have include these checks by default, but on top of that, as part of the platform, we have deployed, if I'm not wrong, the node-problem-detector add-on; I think it's part of the Kubernetes project itself. We are using that, and it actually helps as well. But I just wanted to mention one problem which we still have today. Look at it this way:
B
let's say there is an application deployment that failed. If you look at the logs, it will say the Helm release timed out. If you run kubectl get pods, it will say the pods are pending. Then you figure out why the pods are pending, and you'll see your nodes are unstable. Then you figure out why the nodes are unstable: it will be something to do with the cluster autoscaler, or something happening on the network side that is affecting the AWS autoscaling groups.
B
There is a chain of things. So one of the pain points developers have today is that when something like this happens, even with the current solutions, even if you set up alerts, you open the mailbox and you have a flood of them. It's not as if someone tells you, "hey, all these problems are happening; forget about them, just fix this one, this is what you need to focus on, and everything else will resolve automatically."
B
This is the problem that the project I mentioned earlier, based on AI and machine learning and so on, is trying to solve: when a deployment fails and the developer clicks on the link, we want to tell them, "hey, there is this network issue happening, and your autoscaling group is having an issue."
B
These are problems we still have, even though we have a pretty sophisticated monitoring setup today. Sometimes, say there is a network outage going on that is affecting a lot of things: for a developer who's just looking at a Jenkins pipeline, it takes hours to get that information. Sometimes he raises a ticket, someone has gone for lunch, they come back, they look at it,
B
they raise something in the team's chat, and then somewhere along the way they get to know that networking is working on it. That correlation is missing. But the way we look at it is: they are users of our platform, and this is part of the platform experience, so we want to enhance that.
B
So, alongside the existing monitoring, we are investing effort in the area of how we use the latest ML techniques to make this better for them. And there was a question: how does one differentiate between logging and tracing?
B
For most of the add-ons, I'm not sure they do a lot of tracing at this point. Most of our monitoring starts with metrics, and then from metrics we try to correlate to logging. We have seen that whenever you have tracing, that is the best thing you have, right?
B
You start with metrics, which is where you get the alerts; then you go to the trace, and then you get the logs. At this point, though, everything starts with metrics and then goes to the logging. But some of the latest stuff we are trying to do based on machine learning is actually the reverse.
B
You start with the logs instead. It's interesting; it's for the future, but that's one thing we've been doing. And there was a question around which component sends the message that actually reaches the kube API.
B
I didn't fully get that, but basically, the communication between the control plane and the kubelet differs by distribution: for example, EKS is slightly different from, say, Rancher. "How do you do inter-cluster networking?" That's a very good question.
B
That is still one of our pain points, let me tell you. It's not a problem that we've solved; we are working on it. There are solutions we are still looking at, but I can say that's a problem we've not solved yet.
B
Yeah, so the stuff that I mentioned earlier: there are things that are getting built on top of our platform. For example, there's an ML platform that we're trying to build on top of it.
B
Similarly, there's an API gateway that gets built on top of it. The Fidelity platform that we've built, when I look back over the period of two years at whatever we have done, is now like a solid foundation that people can actually build on top of, so internal clients, the business units, can build on it. The layers concept that I talked about earlier is basically a collection of YAMLs.
B
They can now contribute and say, "hey, I have this set of machine learning features available; I'm packaging it as a layer; apply it on top of your platform and that becomes your ML plugin." At the same time, it's an ML platform with all the Fidelity constraints set on it.
B
So the stuff I mentioned was along those lines: there's an API gateway that is actually getting built on top of our platform, and that is where it is actually used.
B
As for cluster pod-to-pod networking: at least on the cloud side (on-prem, maybe, the answer is different), we stick to the native CNI drivers; we don't use an overlay, at least at this point. We don't enforce it, but we do have teams using Calico within our clusters; it's not part of the platform yet, but they can install Calico on top of our platform and then do network policy and things like that.
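The network policies that teams enforce with Calico are expressed through the standard Kubernetes NetworkPolicy API; a minimal illustrative example (namespace name invented) that restricts a tenant namespace to in-namespace traffic:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-cross-namespace-ingress
  namespace: team-a-prod          # hypothetical tenant namespace
spec:
  podSelector: {}                 # applies to every pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}         # allow traffic only from pods in this same namespace
```

This is a common multi-tenancy guardrail: once any NetworkPolicy selects a pod, all other ingress to it is denied by default, so cross-namespace traffic has to be opened up explicitly.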
B
Yeah, and as I said, cluster-to-cluster traffic is still something that goes out and then comes back in through the ingress; we don't have any global service mesh or anything like that at this point.
A
All right, that makes sense. I guess with that we can wrap up. I just want to thank everyone for joining us today for this episode of the Cloud Native End User Lounge. It was great to have both of you, Rajan and Dave, on to talk about Fidelity, and we had some great interaction and great questions from the audience. We bring this end user lounge to you on the fourth Thursday of every month at about 9:00 a.m.
A
Pacific time, so we hope to see you at the next one. Don't forget to join us for KubeCon + CloudNativeCon North America, October 12th through 15th, to hear the latest in the cloud native community. Also, if you'd like to showcase your usage of cloud native tools as an end user, join the end user community; there are a lot more details on cncf.io under end user. Again, thanks everyone for joining us today, and hope to see you next time.