From YouTube: CNCF End User Lounge: Platform Evolution - 5 years of Kubernetes at Sky Betting and Gaming
A: All right, so hello everyone, welcome to the CNCF End User Lounge, where we explore how cloud native technologies are adopted by end user organizations across different industries and sectors. The CNCF end user community is formed of more than 155 vendor-neutral companies that use open source software to deliver their products. I'm Ricardo Rocha, I'm a computing engineer at CERN, and today I have Andy Burgin as a guest speaker.

A: So in this live stream we bring end user members to showcase how their organizations navigate the cloud native ecosystem to build and distribute their services and products. You can join us every fourth Thursday at 9:00 a.m. Pacific. This is an official live stream of the CNCF, so it's subject to the CNCF code of conduct. Please do not add anything to the chat or questions that would be in violation of that code of conduct. Basically, please be respectful of all of your fellow participants and presenters. If you have any questions for us during the stream, we will be monitoring the chat, so make sure you ask them in the live stream chat.

A: So this week we have, as I mentioned, Andy Burgin here. He will talk to us about platform evolution and five years of Kubernetes at Sky Betting and Gaming. Before we dive into the questions, Andy, do you want to briefly introduce yourself?
B: Yeah, hi everyone, thanks for joining the stream or watching the recording, and hi to you, Ricardo, nice to be here today and thanks for inviting me on. So I am a lead platform engineer within the infrastructure and platform engineering squad, inside the infrastructure and platform tribe, at Sky Betting and Gaming.
B: I've been at Sky Betting and Gaming for over seven years now. I originally started as a devops engineer in the Bet Tribe, moved to work with Hadoop in the Data Tribe, and for the last three and a half years I've been working with the Kubernetes team, which has been great. Before that I did lots of things in digital marketing: many different hats, many different skills, from dev, from ops, from production management, finance and all sorts of things, but I'm really enjoying being back in the tech.
B: Now, outside of my day job, I run the local devops meetup in Leeds, in West Yorkshire in the UK, and I'm also part of the organizing team around DevOpsDays London, which is a conference that supposedly happens annually, but obviously over the last year things have been somewhat difficult around that, as we all know. We hope to be back next year, so we're looking forward to that, and looking forward to just going to conferences in general, including KubeCon in that list right now.
A: Awesome, that sounds pretty exciting, lots of things, yeah. I agree about the conferences; it's been pretty good to have the first physical one after all this time, with North America. But I guess we can dive into the questions. I would start: maybe you can tell us a bit more about the infrastructure setup at your company, and specifically maybe you could explain a bit what the specific technical hurdles are that bookmakers have to face.
B: Okay, let me set the scene a little bit on that. So we are an online bookmaker. We've actually been around for over 20 years, initially as the betting arm of Sky television in the UK. We were to be the red button on the remote: you would press that and be able to place a bet, was the idea behind it. But that was a very long time ago, and since then obviously things have evolved. Sky Betting and Gaming started its own technology stack over a decade ago and that's been growing steadily. We offer a range of products and services around sports betting and also gaming as well, so that encompasses poker and sports and slots and all sorts of other entertainment products like that.
B: I think the main thing with our industry, where it perhaps differentiates from a lot of others, is really the nature of the traffic patterns we get, and the technology stack we have to have to deal with that, coupled with the regulatory work we have to do as well to make sure that we are looking after our customers and are compliant with the regulations. It gives us a number of problems which we have to use technology to solve, particularly to do with load.
B: I think many people who work in retail will be familiar with the busy days, the Black Fridays etc. Well, we tend to have at least one of those a week in our industry, and we have to deal with unpredictable demand. Perhaps on the gaming side of things, and this is a sweeping generalization, we know the patterns and can plan around promotions and things like that, but with sports betting we really are at the whim of what happens in the sports game.

B: So typically in the UK the soccer games kick off at 3 p.m. on a Saturday afternoon, and there's a large spike in activity of people placing bets up to that mark. Several years ago it would then have dropped off immediately, and we would have been really quite quiet until the end of the day, when we were settling the results. But now we've got in-play markets etc., so we don't quite know the demand we're going to have on the services, depending upon what events happen in the sports games.
B: So you know, we've got a very spiky traffic pattern which is kind of unpredictable as well, so we have to have systems which can deal with that sort of scale and obviously make sure they're available for our customers. They're the kind of challenges we face. And I suppose, to answer your question, the traditional stack that we had in the pre-Kubernetes days would have been very much VM based, running out of data centers, and building applications with enough capacity to deal with the load. Obviously things are a little different now, but that's kind of how things were when I started.
A: That's super interesting, actually. I guess one of the questions, or one of the points I'll save for later, is understanding how you manage these spikes, and maybe the over-provisioning of resources; I'm actually also interested in whether you're running on premises or on public cloud. But maybe we can start with your transition to Kubernetes. You just mentioned the virtual machines: can you tell us a bit about your transition to Kubernetes and cloud native, and how did you get that going?
B: Okay, yeah, great, good question. So yeah, you've set that one up nicely, haven't you? So we were about to start on a Kubernetes journey, and this is back in 2016.
B: It wasn't meant to be a Kubernetes journey. It was going to be a journey of finding what could provide the next generation of hosting platform for the Bet Tribe: what platform could we put together that really made it easier for developers?

B: I think, as kind of operations engineers, we maybe approach problems by thinking about what technology can solve for us, but really this whole journey started with how do we get our developers to go quickly. We're in a very fast-changing market, with a lot of companies in competition with us. How can we get products to customers? How can we make it easy for our developers? Because unless code is in front of our users, it is kind of worthless.
B: So how can we make that easy, and how can we address some of the problems that we were having with the more traditional infrastructure we had, you know, the quality gates, the bottlenecks? How can we remove those, but still do it in a safe way? The objective really was around creating a platform which had as few human interactions as possible between somebody pushing code to a repo and an automated process getting that onto the servers in front of people.
B: That was the objective, and to do that we set about building out some proofs of concept. First of all, obviously, you've got a technology choice, and back in 2016 Kubernetes was still relatively new and wasn't as mature as some of the other container stacks that were around then, so there was some technical evaluation done on that. Also, at the time, this was one of the first pieces of work which wanted to run in public cloud as well.
B: So we did the initial proof of concept to check out the technologies and settled on Kubernetes. I'm very pleased to say it was before my time in the team, but I'm glad they chose it. After that, it became: how do we build, with the developers, the platform that they need? So we worked with a team that has one of these spiky workloads I referred to earlier, what we call our push team; they handle the updates on a typical sports website.
B: There are lots of events happening, not just football games but things that happen in football games, or netball, or whatever it is, and these events can change the prices of markets that people can gamble on. So there's literally thousands of updates a minute going through, which all need to be reflected on the device the user is using to interact with us.
B: So we worked with the push team to build out an MVP first of all, using Container Linux, as it was called at the time, on AWS, provisioning the cloud storage and cloud load balancers we needed for that. What that allowed us to do, for the updates platform, was that when there weren't many updates we could scale it right down, and when it was busy we could scale it up. That platform was very successful, and it went on to become the Kubernetes platform, which was fairly widely adopted around the business.

B: And here we are now, five years later, with a whole bunch of people around the business, from different departments, using it.
A: Super interesting. I think the early adopters all went through that same process of deciding which orchestrator and container platform they should choose, so that's also very nice to hear. Maybe you can also dig into the stack: you just mentioned that you deploy on AWS. Do you use managed Kubernetes, or is it your own deployment?
B: Yeah, sure. So obviously we're talking 2016, and I think GKE was available back then, but in its very early days too. So we decided to do things the hard way, as was the way to do things back then. Well, we didn't do things completely the hard way. As I mentioned before, we use Container Linux, so CoreOS, as the base for our solution, and we provision that through a bunch of Terraform, which in our case provisions EC2 instances that run CoreOS.
B: They PXE boot into CoreOS and work with some Container Linux technologies called Matchbox and Ignition to pull down pre-rendered configuration for those nodes, and then effectively boot from scratch. They go into the early user space, where they take the Matchbox and Ignition configuration and apply it to the OS before it goes into proper user space.

B: So it's kind of a pre-boot thing inside Container Linux. We use it to provision and set up the node with all the specific settings and configuration, mainly systemd files, and then we boot properly, and that's when the operating system spins up. That means that if we reboot a node, we start from scratch. We do have some persistent storage on there, volumes mounted off file shares to store things like Docker images, because that can act as a cache.
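To make that flow concrete, here is a minimal sketch of the kind of Container Linux Config that Matchbox renders into Ignition JSON for a booting node; the unit and kubelet settings are illustrative assumptions, not Sky Betting and Gaming's actual configuration.

```yaml
# Illustrative Container Linux Config (transpiled to Ignition by Matchbox).
# The unit, flags and addresses below are assumptions for the example.
systemd:
  units:
    - name: kubelet.service
      enabled: true
      contents: |
        [Unit]
        Description=Kubernetes kubelet
        After=docker.service
        [Service]
        ExecStart=/usr/bin/kubelet --config=/etc/kubernetes/kubelet.yaml
        Restart=always
        [Install]
        WantedBy=multi-user.target
storage:
  files:
    - path: /etc/kubernetes/kubelet.yaml
      filesystem: root
      mode: 0644
      contents:
        inline: |
          kind: KubeletConfiguration
          apiVersion: kubelet.config.k8s.io/v1beta1
          clusterDNS: ["10.3.0.10"]
```

Because the node applies this in the pre-boot phase every time it starts, the running OS carries no hand-made state, which is what makes the reboot-from-scratch model he describes work.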
B: We don't have to pull down all the containers every time we start them up, but other than that a lot of stuff is held on a memory disk, so it's a slightly different setup to some other clusters. What that really means is that things like upgrades are a change of a version number in a repo; we then republish all the Matchbox and Ignition stuff through Terraform, and when the nodes reboot they pull in a new image of CoreOS. That works really well.

B: We run the control plane in high availability, across a couple of nodes. The etcd database is backed up very regularly and, yes, we have tested that we can restore it as well. Like you say, we use a lot of Terraform to actually do the provisioning, plus the other system components you would expect, like a monitoring stack based on Prometheus.
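A regularly backed-up etcd can be as simple as a scheduled snapshot job; this is a minimal sketch assuming etcdctl v3 against a local control-plane endpoint, with the image, certificate paths, node label and schedule all being assumptions rather than their actual tooling.

```yaml
# Illustrative etcd snapshot CronJob; paths, image and schedule are assumptions.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 * * * *"              # hourly snapshot
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true          # reach etcd on the node's loopback
          nodeSelector:
            node-role.kubernetes.io/master: ""   # label name varies by setup
          containers:
            - name: snapshot
              image: quay.io/coreos/etcd:v3.5.9
              command:
                - /bin/sh
                - -c
                - >
                  etcdctl --endpoints=https://127.0.0.1:2379
                  --cacert=/etc/etcd/ca.crt
                  --cert=/etc/etcd/client.crt
                  --key=/etc/etcd/client.key
                  snapshot save /backup/etcd-$(date +%Y%m%d%H%M).db
              volumeMounts:
                - name: etcd-certs
                  mountPath: /etc/etcd
                  readOnly: true
                - name: backup
                  mountPath: /backup
          restartPolicy: OnFailure
          volumes:
            - name: etcd-certs
              hostPath:
                path: /etc/etcd
            - name: backup
              hostPath:
                path: /var/backups/etcd
```

The restore test he mentions matters just as much as the snapshot itself: a backup that has never been restored is only a hypothesis.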
B: We use some of the other services which were already running and supplied for developers around the business. So we don't run our own logging stack: that goes into our Elastic stack, which is run by one of the other teams in the business, so we kept that familiar stack of tooling which the developers knew. And now we don't just run on AWS; we also run on-prem, using the same Terraform and Ignition scripts, although they are slightly customized for different provisioners, for things like storage and for the virtual machines as well.
B: So we run those on VMware, but it's essentially the same regardless of which environment you're running in, apart from the nuances of storage and load balancers etc. Essentially we keep the same stuff, and that's allowed us to keep parity between all the environments. We run about five clusters.
B: We don't have a lot of clusters, and we run them independently as well. We don't have a cluster mesh over the top of them, although we do run a service mesh, Istio, but we run that locally on each cluster.
A: Super nice. One question I have, just before we jump into the Kubernetes details: you have a bunch of small socks behind you?
B: So yes, on the wall behind me. We recently moved offices during the pandemic, so I left the office in March 2020 and I've only been back once, to collect stuff, because we moved offices. We're now in a building that's entirely owned by the company, which is really nice, and it's a completely custom setup. But in the office we had a few bits of customized stuff.
B: We had things around the place just to make it feel like home, so the Kubernetes sign behind me, which says platform engineering, used to hang above our desks, and basically, when I went in to collect my stuff, I stole it. I don't think work knows that, so I probably shouldn't have said that out loud. The socks behind me are conference swag; many of those will have been from a KubeCon or two. And at the time we were going through our SOC audit, so there's kind of a pun there.
B: We used to have those in the office with a graph of the number of socks we had versus the number of socks we were going to declare to the auditors, as kind of a joke. So yeah, that was our official "socks audit" for the Kubernetes platform.
A: Beautiful, pretty good. All right, so I guess, digging a bit more into the Kubernetes part: you mentioned how you started, and eventually you had to manage growth as things picked up. So I guess I have two questions here. One is just about the growth in usage once things got popular; the other is related to what you mentioned at the beginning about handling spikes. Do you have some sort of autoscaling, and how do you manage that?
B: Yeah, the growth of the cluster. So, as I said, we built it for one customer and one use case, and it gained popularity really quickly, and with that come some challenges, because you've not only got to scale tech, you've got to scale people, and you've got to scale the way you work as well.
B: So I think when we moved to on-prem there were some changes we had to make around the code base; we made some optimizations at that point to handle some of the growing pains we'd seen in the first iteration of the cluster. For example, on AWS we could use the cluster autoscaler to deal with workloads: as things got busy we could pop up more EC2 instances to run more workloads on, and obviously, as it got quiet, we could scale that down as well.
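For reference, the AWS cluster autoscaler he describes is typically wired up along these lines; a sketch of the relevant container spec assuming auto-discovery via Auto Scaling group tags, since the talk doesn't show their actual flags, and the cluster name is made up.

```yaml
# Illustrative cluster-autoscaler wiring for AWS; cluster name, tags and
# version here are assumptions, not the actual deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
        - name: cluster-autoscaler
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.22.2
          command:
            - ./cluster-autoscaler
            - --cloud-provider=aws
            # Discover node groups via tags on the EC2 Auto Scaling groups.
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
            - --balance-similar-node-groups
```

When pods become unschedulable the autoscaler grows the tagged ASGs, and it shrinks them again when nodes sit underused, which is exactly the busy/quiet cycle he describes; on-prem there is no equivalent elastic pool, hence the over-provisioning mentioned next.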
B: So that was great on AWS, but on-prem that's not something we can do; we have to kind of over-provision for on-prem. The bits we had to swap out were things like the storage provisioners. If you want a slice of storage on the AWS-based clusters, you get an EBS volume; if you're on-prem, you get a slice of NetApp provisioned through our in-house storage arrays. For load balancers, you get an ELB in AWS.
B: So although the provisioners for the cloud were fairly well understood, we had to write some custom stuff to do that on-prem, because we didn't want developers to be slowed down by having to configure F5s or request storage. We kept the same volume storage interface and just changed the provisioner, which makes it sound really simple: a lot of work went into that, and the same with load balancer provisioning.
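That provisioner swap usually comes down to publishing the same storage class name with a different backend per environment; a sketch assuming the stock EBS provisioner in AWS and NetApp Trident on-prem, whereas their on-prem provisioner was custom-built.

```yaml
# Illustrative per-environment StorageClasses behind one common name.
# AWS clusters: dynamic EBS volumes.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
---
# On-prem clusters: NetApp via Trident (an assumption; SB&G wrote custom tooling).
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
provisioner: csi.trident.netapp.io
parameters:
  backendType: ontap-nas
```

Because the class name is identical everywhere, a developer's PersistentVolumeClaim is portable across clusters, which is the parity being described.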
B: Obviously, we didn't necessarily want our developers to be logging into F5s and configuring those when they could just declare the state they wanted their network connections to be in and have the cluster do it for them, and obviously we put that together as well. So those were some of the challenges we faced in keeping parity as we changed environments.
B: In terms of growing, I think when we were working more closely with certain teams we hadn't necessarily anticipated the challenges ahead, particularly multi-tenancy. I think the initial year of the cluster was without RBAC, because it wasn't there yet; RBAC was added shortly before I arrived at the cluster, but that presented some challenges, because how do we manage that for both environments? We've got a solution based off Vault and LDAP groups, which allows teams to authenticate and get access to the cluster.
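The mapping from LDAP groups to cluster permissions he describes typically surfaces as ordinary RBAC bindings once authentication has resolved the groups; a minimal sketch with hypothetical group and namespace names.

```yaml
# Illustrative RBAC: members of an LDAP group (surfaced as a Kubernetes group
# via the Vault-backed auth flow) get edit rights in their own namespace only.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: push-team-edit
  namespace: push-team            # hypothetical tenant namespace
subjects:
  - kind: Group
    name: ldap-push-team          # hypothetical LDAP group name
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: edit                      # built-in aggregated role
  apiGroup: rbac.authorization.k8s.io
```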
B: From that, they're restricted in which namespaces they can work in. We've done a lot of work on putting in sane defaults and least-privilege security when those namespaces are created, so that when you get on the cluster you're kind of locked down to start with, until you unlock the bits you need: you have to set up your network policies, you need to set quotas and so on. By that we've kind of managed the expectations of the customers getting on.
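Those locked-down namespace defaults often look like the following pair of objects stamped out at namespace creation; this is a sketch of the pattern with illustrative values, not their actual policy.

```yaml
# Illustrative namespace defaults: deny all traffic until the team opens it up,
# and cap resource consumption until quotas are consciously raised.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: push-team            # hypothetical tenant namespace
spec:
  podSelector: {}                 # applies to every pod in the namespace
  policyTypes: [Ingress, Egress]
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: default-quota
  namespace: push-team
spec:
  hard:
    requests.cpu: "4"             # illustrative starting limits
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
```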
B: We've got a support channel where people can raise support requests and tickets and ask questions, and we can help them there. But I think the main thing we found, in terms of that growth, was that our users didn't always understand the line between what was a Kubernetes thing and what was our Kubernetes thing. And there was an expectation from ourselves there: we expected our development teams to learn how to build apps for Kubernetes and also how to maintain and manage those. Off the back of that we've put in a lot of training: we've trained over 400 developers on a couple of different courses on how to build and write apps for Kubernetes, so they can get that right.
B: I've heard it said that Kubernetes isn't a developer tool. I'm not sure whether I agree with that definition, but I think there's definitely a barrier to entry there, and whether it's massive or small largely depends on the developers we're working with.
B: As an example, we've got developers who would gladly be given root access on everything and would love to insert records directly into the etcd database in the control plane, given the opportunity to do so. But at the other end of the spectrum, we've got people that just want to put a few lines of YAML together and aren't that interested in what's behind it, because, quite rightly, developers have got a whole lot of other stuff to deal with.
B: You know, the domain knowledge of the actual problems they're trying to solve, the code they're trying to write, the business logic. I think the expectation that they'll go away and learn Kubernetes as an afterthought is something that doesn't really work. I think we got bitten a little bit by that, and hence we had to retrospectively do quite a bit of training to help developers easily understand what they need to do on our clusters.
A: All right, that's very interesting. Maybe I have another question about the management of the clusters: looking back, is there anything you would have done differently?
B: I think, given a time machine, we would have put more developer tooling in place, or encouraged the teams that we worked with initially to do that. If we were starting again from scratch now, we would certainly have some opinionated ways of building apps for Kubernetes and of what was supported on there. But as with all ecosystems that evolve, the Bet Tribe in particular has now put together a standard way of building applications. After a couple of years of people going off and doing their own thing, or being influenced by what other teams have done, a pattern has evolved of how things should be done, and we have a team that is building an application Helm chart, which allows developers to build applications based off a set of base images which are regularly updated.
B: They can take their applications, there are pipelines built to deploy them onto the clusters, they get a set of standard dashboards, and they get a bunch of tooling and references to where they can find the logs etc.
B: As I say, if we could start again we would perhaps have done things a little differently. One of the things we've done over the last three years, and certainly in my day job as well, is lean heavily on this focus on developer experience to solve a lot of the growing pains we had with the cluster.
B: We got a lot of users on there fairly quickly, and I think we suffered the growing pains internally, in how we were working with the clusters. So we've done a heck of a lot over the last three years to smooth that out, starting with just basically talking to more and more of our users about what they want from the cluster and how they're going to use it. Understanding who was actually using our cluster was quite a big undertaking.
B: We tried to understand which workloads belong to which teams, because they can move around as well. So we now tag all the workloads on the clusters with metadata: there are labels which indicate who owns what, and that's allowed us to do a load of really cool stuff. It's allowed us to shard the logging, so rather than just having one logging pipeline we can do that per tribe. There's been a lot of work done on that. As I mentioned, we go out, we speak to teams, we talk about requirements.
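That ownership metadata is usually just a consistent label scheme applied to every workload; a sketch with hypothetical label keys, since the talk doesn't name the actual convention.

```yaml
# Illustrative ownership metadata on a workload; the label keys and values
# are hypothetical, not SB&G's actual convention.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: odds-updater
  namespace: push-team
  labels:
    sbg.example.com/tribe: bet          # owning tribe, used to shard logging
    sbg.example.com/squad: push         # owning squad, for support routing
spec:
  replicas: 2
  selector:
    matchLabels:
      app: odds-updater
  template:
    metadata:
      labels:
        app: odds-updater
        sbg.example.com/tribe: bet      # repeated on pods so collectors see it
    spec:
      containers:
        - name: odds-updater
          image: registry.example.com/push/odds-updater:1.4.2   # hypothetical
```

Once every object carries labels like these, log shippers, cost reports and policy dashboards can all group by owner without any extra bookkeeping.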
B: We take that feedback back so we can act on it and understand the workloads they're running. But it's enabled other stuff as well, around things like best practice and standards. We put together a whole bunch of ideas and goals and best practices, call them standards: the principles of how you build and run an application at Sky Betting and Gaming on a containerized platform. So we've got standards around build, run and deploy now, and that was built with input from everybody who was using our cluster.
B: So we've got a collective mindset on that; it's not just our opinionated version of what looks good. Then we built tooling around that to check on it as well, and to provide dashboards etc. that indicate where things aren't following the rules, along with some possible solutions to fix that.
B: So we've done a lot of work on that, and it's evolved further into things like understanding costs and education on resources, so that we can run things efficiently as well.
A: Yeah, I think you covered a lot of the challenges, and it sounds very good. But one thing, and maybe you already mentioned it: if you had to name the main problem you face today while running your clusters, would you highlight something? You mentioned a bunch of stuff that is tricky to handle.
B: Well, I think from the technical side, you're always going to have, not so much a Kubernetes problem, just a running-computer-systems problem, really a distributed systems one: you're going to face problems with problem workloads and with components of the system not behaving.
B: There's constantly keeping things up to date, the evergreening and the management of that, and then of course probably the big one, whichever system you're running, is going to be capacity, particularly in an on-prem environment. Do you have enough storage? Do you have enough network bandwidth?
B: Is your monitoring able to scale with your workloads? And then it comes down to right-sizing workloads, getting the right requests and limits on them, and trying to support teams to get that right. We find that particularly challenging, because I don't think there's a great range of tooling out there to help with that. We've built some in-house tools and we're building more; we know this is a problem, and in order to get our development teams to understand and to set their requests and limits correctly, we need to help them do that.
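Right-sizing here means matching a container's declared resources to what it actually uses; a minimal sketch of the fields in question, with made-up numbers.

```yaml
# Illustrative right-sizing: requests reflect observed steady-state usage,
# limits leave headroom for the traffic spikes described earlier.
apiVersion: v1
kind: Pod
metadata:
  name: bet-placement
spec:
  containers:
    - name: app
      image: registry.example.com/bet/placement:2.0.1   # hypothetical
      resources:
        requests:
          cpu: 500m        # roughly the observed typical usage; drives scheduling
          memory: 512Mi
        limits:
          cpu: "2"         # spike headroom; too tight and the app gets throttled
          memory: 1Gi      # exceeding this gets the container OOM-killed
```

A memory limit set too low produces exactly the OOM kills mentioned next, and a tight CPU limit shows up as throttling in the metrics.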
B: We can't just produce graphs and then point out inefficiencies, or things getting OOM-killed or CPU-throttled etc.; that's not going to help. So we need to put better tooling around that. There are some other day-to-day challenges, but many of them are just keeping things up to date, making sure we're maintaining uptime, keeping things reliable.
A: Sounds very good. I'm just checking if there's a question; I don't see any. So maybe we can switch the topic slightly, away from the technical or tooling part: can you tell us what your experience has been as an end user in this community? What's your feeling about the interaction with other end users and with the tools you mentioned?
B: Yeah, I think, from the end user community side, being a member of that is a real boon when you're at the conference. I think attending KubeCon is something the team have really enjoyed. I've not actually been to one yet, I've got to be honest about that, but I am hoping to get there. I do like physical conferences: I've been to the virtual ones, but I love the whole hallway track etc.
B: But then again, I am a conference organizer, so I'm a little bit opinionated on that. But yeah, KubeCon is certainly something which the team have been to, and they have come back full of ideas, full of different approaches to doing stuff.
B: I think the main takeaway I get from the team when they've been and come back is that they had a plan of what they were going to see, and obviously, as you'll know, KubeCon is a massive conference with many, many tracks of talks to see, and they always come back waxing lyrical about the things they didn't expect. Almost all of them have said that when they went to the popular talks they couldn't get in, and actually the ones they went to because they were nearby, or looked interesting, were the ones where they picked up these little tidbits, these little interesting bits of knowledge, which have come back and been used.
B: I can't remember if it was Copenhagen, I think that was the one we went to before they went to Barcelona, but they came back from that saying: this is brilliant, we have to use OPA, it's obviously something we can use to help our teams on the cluster. Without actually ending up in that talk, we would obviously have known about it eventually, because it's a huge topic now, but I don't think we'd have had that kind of early visibility of it. I think a lot of our early adoption was based around talks and examples and demos and talking to other people at KubeCon. So yeah, I think it's even more than just the conference, which of course is great and important.
B: I think supporting the CNCF is important, because we rely heavily on the projects which it looks after, so supporting that is super important to us. So yeah, the end user community is really important, and so is KubeCon.
A: That's brilliant, and yeah, we're all hoping that normality will come back.
A: Fingers crossed, it looks like it's happening. So you actually mentioned a lot of the tools that you are relying on: you mentioned of course Kubernetes, you mentioned Prometheus, you mentioned Helm, and OPA just now. I'm kind of curious, because you have a pretty large deployment and, interestingly, you have both on-premises and public cloud deployments, so it's multi-cluster.
A: You mentioned that you don't do any kind of communication between the clusters, which is also kind of common, I think, from what I hear. You also mentioned challenges in costs and things like this. Are there any tools or technologies that you're particularly interested in integrating in the near future, or that you're looking forward to looking at?
B: Yeah, there are a couple I can mention. So, for example, we're heavy on our Prometheus adoption and we have had constant requests for long-term storage of metrics, so VictoriaMetrics is something we're heavily looking into now. Obviously we want to manage that carefully, because we're aware that long-term storage means different things to different users, and we're particularly careful about how we manage our Prometheus instances as it is, based on things like the amount of cardinality the metrics have, start-up times, etc. So VictoriaMetrics is something we've rolled out and are starting to roll out to our customers now. That gives us long-term storage, but we want to do it in a manageable way; that's one of the things we're doing.
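Feeding Prometheus into VictoriaMetrics for long-term storage is normally a remote_write hookup of this shape; a minimal sketch assuming a single-node VictoriaMetrics at a hypothetical in-cluster address.

```yaml
# Illustrative prometheus.yml fragment: keep local retention short and ship
# samples to VictoriaMetrics for long-term storage. The URL is hypothetical.
global:
  scrape_interval: 30s
remote_write:
  - url: http://victoria-metrics.monitoring.svc:8428/api/v1/write
    queue_config:
      max_samples_per_send: 10000   # batch size; worth tuning for high cardinality
```

Keeping Prometheus itself lean while remote-writing matters for exactly the reasons he gives: cardinality and restart times grow with local retention.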
B: We've also used Gatekeeper, which is a tool that allows you to basically report on OPA policy states.
B: We've used that for our standards and best practice dashboard. We wrote an exporter which takes that data out in the format we want, because we've also got a lot of metadata tagging in there which can identify workload ownership and so on, so in the dashboards we can visualize that by ownership as well. So that's been a very useful technology.
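As a flavor of that Gatekeeper reporting, here is the stock required-labels constraint from the upstream gatekeeper-library pointed at the kind of ownership label discussed earlier; enforcementAction: dryrun makes it report violations through audit rather than block, which matches the dashboard use case, and the label key is hypothetical.

```yaml
# Illustrative Gatekeeper constraint: audit (don't block) workloads that are
# missing an ownership label. Assumes the K8sRequiredLabels ConstraintTemplate
# from the upstream gatekeeper-library is installed.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: deployments-must-declare-owner
spec:
  enforcementAction: dryrun        # record violations via audit, never reject
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment"]
  parameters:
    labels:
      - key: sbg.example.com/tribe # hypothetical ownership label key
```

An exporter like the one they wrote can then read the violations Gatekeeper records in each constraint's status and expose them as metrics for the standards dashboards.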
B: There are various updates to the networking stack going on, and updates to Istio at the minute. The 1.22 upgrade is not without its challenges, I don't think.
B: We are working closely with our users, and I suppose the great thing about already having those communication channels with our users in place is that it's actually made that fairly straightforward: we're able to identify workloads and go and talk to the teams that are going to have problems when we do the upgrade. We're hoping to have everything in place in the next couple of weeks so we can go to 1.22. But I suppose the overall thing with our cluster is that, although we're always looking at new bits of technology and replacing existing functionality with newer bits, the thing we care most about is just stability, and updates to things. We've got a lot of operators that we run.
B: We want them to be stable. We want the underlying Kubernetes system to be stable. We want the monitoring stack to be stable. We want all the things that send data from the cluster to all of the other services which ingest our stuff to be stable. So stability is a big thing, and it's a dull answer, not a very exciting one, because it's not the new shiny tech, but we kind of like things just working.
B: I think one of the things we're really pleased about with our cluster is its stability, and we'd like to keep it like that. Nobody likes getting paged, and we want to keep it that way as best we can.
A: I think that's pretty fair, and yeah, I think the interesting bit here is also that you have a pretty large deployment, and it's interesting to see how you scale things like Prometheus and metrics, and that you're looking at these new products to handle that. I think for other end users this feedback is extremely useful.
A: So I don't think we have any questions. One thing maybe I would put here is: do you have something else that you would like to tell other end users or the community that we didn't cover here?
B: Yeah, one thing I haven't really covered, which has been really important to the team, is how we use Kubernetes not just to run workloads: we also use it to provision infrastructure. Now, obviously I mentioned things like load balancer provisioning and storage provisioning, but those are kind of the basic built-in primitives for Kubernetes.
B: So if you want a PV, you will get some storage; if you want a load balancer, I mentioned that's configured for you. But the automation really hasn't stopped there, so I'll give you some examples. The team that did the F5 automation are now trying to automate more things.
B: If you want to configure an F5 now for a virtual machine, you can do that through a code base where you commit YAML definitions for the load balancers you require. So even if they're outside of Kubernetes, we can use our provisioner inside the cluster to actually configure load balancers for things that aren't in the cluster.
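The talk doesn't show the committed format, but a definition for an in-cluster F5 provisioner plausibly looks like a custom resource of this shape; the API group, kind and every field here are invented for illustration.

```yaml
# Entirely hypothetical custom resource for an in-cluster F5 provisioner,
# describing a load balancer for a VM-based service outside Kubernetes.
apiVersion: lb.sbg.example.com/v1alpha1
kind: LoadBalancer
metadata:
  name: legacy-bet-api
  namespace: lb-definitions
spec:
  virtualServer:
    address: 10.20.30.40          # VIP on the F5
    port: 443
  backends:                       # VM targets outside the cluster
    - host: bet-api-01.example.internal
      port: 8443
    - host: bet-api-02.example.internal
      port: 8443
  healthCheck:
    path: /healthz
    intervalSeconds: 5
```

The pull request approval he mentions next is the gate: once a definition like this is merged, a controller reconciles the declared state onto the F5.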
B: Obviously there's a pull request approval on that, but it means that rather than logging into an F5 and configuring it for teams, it's now all done as code, which is obviously a massive benefit. The same with DNS entries: we've done a lot of automation in the cluster, and if you want a DNS entry, you can create a DNS object in the cluster.
B: That's got an operator behind it which will provision a DNS entry in our DNS provider through their API, and we'll handle all that and tear it down when you don't want it, etc. But equally, we've got another repo where people can put those DNS definitions and they'll just get created, even if the DNS record isn't used by something inside the cluster. So we're starting to automate bits of infrastructure through Kubernetes, even if it isn't Kubernetes.
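Likewise, the exact schema isn't given, but the DNS object he describes would be a small custom resource along these lines, with the group, kind and fields invented for illustration.

```yaml
# Entirely hypothetical DNS custom resource; an operator watches these and
# creates matching records in the external DNS provider via its API.
apiVersion: dns.sbg.example.com/v1alpha1
kind: DNSRecord
metadata:
  name: bets-public
  namespace: dns-definitions
spec:
  zone: example.com
  name: bets.example.com
  type: CNAME
  value: lb-frontdoor.example.net
  ttl: 300
# Deleting this object triggers teardown of the record, as described.
```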
B: Two more examples of that. Firewall automation is something we've been working on: every organization wants software-defined networking, and obviously we have over a decade's worth of network configuration in data centers, in offices, etc. We're now starting to build tooling which will configure some of the firewalls from things that are provisioned in Kubernetes.
B: Another example is cert-manager, which people are very familiar with: we're using that with our certificate providers now to manage certificates, and we're hoping to offer that outside of the cluster too. So there's lots of automation that we run inside Kubernetes to manage the resources that teams, developers, and indeed people in infrastructure need, but we can also offer that as the way to manage this stuff in an automated way outside of the cluster.
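For reference, requesting a certificate through cert-manager looks like this; the issuer and DNS names are placeholders, and the talk doesn't specify their actual issuer integration.

```yaml
# Illustrative cert-manager Certificate request; issuer and names are placeholders.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: bets-tls
  namespace: push-team
spec:
  secretName: bets-tls            # cert-manager writes the key pair here
  dnsNames:
    - bets.example.com
  issuerRef:
    name: internal-ca             # hypothetical company issuer
    kind: ClusterIssuer
```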
B: That's something we're building on and building on, and I think a lot of the automation we're going to be doing over the next 18 months for infrastructure is going to be powered by Kubernetes as well, even though it may never have a related workload on the cluster.
B: Yeah, and of course we're also doing that for things in the cluster as well. Operators: things like our in-house MySQL provisioner, which, with a small chunk of YAML, will create in your namespace however many nodes you want of a container-based replicated MySQL cluster, with the right amount of resource, and with the tooling to back it up and restore it built into that operator. So we've done a lot of work on that kind of thing.
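The small chunk of YAML for that in-house MySQL operator might plausibly look like this; the API group, kind and every field are invented for illustration.

```yaml
# Entirely hypothetical custom resource for the in-house MySQL operator.
apiVersion: db.sbg.example.com/v1alpha1
kind: MySQLCluster
metadata:
  name: bets-db
  namespace: push-team
spec:
  replicas: 3                     # replicated, container-based cluster
  version: "5.7"
  resources:
    requests:
      cpu: "1"
      memory: 2Gi
  backup:
    schedule: "0 2 * * *"         # built-in backup tooling, as described
    retentionDays: 14
```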
B: We do offer some operators which we didn't write. Obviously the Prometheus operator is one which we use a lot to manage the Prometheus instances on the cluster, but we've got stuff for Redis and a few other bits and pieces as well. So that means that, obviously, developers don't have to go...
A: Yep, that's brilliant! Actually, we still have a couple of minutes, so I just thought of something else, because you were mentioning managing things that are not in the cluster, and you also mentioned that you have multiple clusters, multi-tenant. Just out of curiosity: how do you handle this setting up of external resources when you have multiple clusters? Are users allocated to a certain cluster, or do they see these resources everywhere?
B: Yeah, so I think the LDAP groups are obviously shared in the organization, so the LDAP groups you're in define which set of permissions you get, and then those are bound to access on certain mount points within Vault. We use Vault on a per-environment basis. I don't think we have a Vault in every environment, but I think the environment configurations are set within the same Vault instance, so they're not shared, but they may be managed on the same one.
A: All right, very interesting. I think this has been fascinating; thanks so much for all the information.
B: I don't know, that would be great. I like to say I spent eight years organizing meetups, then a few years organizing conferences, and I haven't done any of that for coming up on two years now, and I miss it. I'm looking forward to getting back to doing that and, yeah, talking to people about stuff and finding out what they're up to. So that'd be great.
A: Okay, okay, super cool. So then, thanks everyone for joining this episode of the Cloud Native End User Lounge. It was great to have Andy talking about Sky Betting and Gaming and how they use Kubernetes.
A: Don't forget, as we mentioned a couple of times already, to join us at KubeCon + CloudNativeCon EU; it's May 17-20 and we'll have a lot of the latest information from the cloud native community. Also, if you would like to showcase your usage of cloud native tools as an end user, you're welcome to join the end user community, with more details at cncf.io/enduser. So thanks again, everyone, for joining us today, see you next time, and thanks a lot, Andy, for the great session.
B: You're welcome, thank you, thanks for having me. It's been good fun, nice to share with people what we've been up to.