From YouTube: KBE Insider (E14) - Lalith Suresh and Michael Gasch
Description
KBE Insider interviews Lalith Suresh, a researcher at VMware, and Michael Gasch, formerly a staff engineer in the Office of the CTO at VMware. They both contribute to the Sieve open source project, an automatic reliability testing tool for Kubernetes controllers. We'll dive into how the Sieve project got started, what's next for the project, and how you can contribute.
A
Well, hello, everybody. I'm very excited that we got our video back, because, you know, you don't want to go to all that work to make a nice little video and then have it not show up.
A
So today, let's see, we are doing KBE Insider, and we do this show on a monthly basis, on the last Tuesday of the month, and you know we like to... sorry, I'm getting some background noise. I'm not really sure why.
A
Sorry, I'm actually hearing myself on, like, a five-second delay. Hopefully that will be better now. All right, so, like I said a few minutes ago, if we don't have a couple of technical difficulties, we're afraid the show is gonna go terribly, so that works out well, yeah. I think I had a Twitch stream running in the background that was off mute, which is always annoying, but all right.
A
So today, just by way of introduction: I'm Langdon White. I am a faculty member at Boston University, and I used to be a Red Hat employee, focused primarily in my last year or so on actually doing another Twitch show, which will be airing its last episode on Wednesday of this week, and which I will be rejoining to kind of, you know, check back in and say hi and all those kinds of things. That was about containerization and OpenShift, but my primary focus was actually on working with field and developers around serverless, Knative, et cetera, and event-driven architectures. So that's a little bit about me, and my co-host today is Josh Wood, who you've seen in prior shows. Do you want to introduce yourself real quick?
B
Sure, Langdon, thanks. I'm Josh Wood, I'm currently a developer advocate at Red Hat. My focus is on OpenShift and, particularly interesting for today's guests and the project we'll be talking with them about, Kubernetes operators: how they work at the core of OpenShift to deliver auto-updates and management of foundation software on the platform, and where we see a lot of people building cool new stuff in the space and on the scaffolding that the operator pattern, or concept, and the toolkits around it represent.
A
Nice. It's always nice when you can have a tool that does its own installation and kind of its own updating, right? Because then you don't have to figure out how to get it right. We've celebrated this in the Linux world with package managers for a long, long time, and it's so painful to use something like Windows, where you have to go manually update things all the time, although it's getting much, much better.
A
So today we have on the show Lalith Suresh. Did I say your name correctly? I hope. All right, so if you wanted to quickly introduce yourself: I meant to check before the show, but you know, feel free to say it again correctly and I will try to.
C
Sure, yeah. So hi everyone, I'm Lalith Suresh. I'm a senior researcher at VMware, where I primarily work on cluster management, basically looking at new ways to program cluster managers, and the focus of this talk will be some of the work I've been doing on testing cluster managers. So yeah.
A
So the emphasis was on the wrong syllable is basically the takeaway. And Michael Gasch, do you want to introduce yourself?
D
Sure. Hey everyone, thanks for having me on the show. I'm Michael Gasch, I'm based out of Germany. I hope my audio is good now; we had some technical issues, as you said earlier. As of today I'm still with VMware, but on September 1st I'll be rolling into a new adventure as part of my career, which I'll announce on Twitter, I think maybe tomorrow or so. So stay tuned and follow me on Twitter if you're curious where I'm headed to.
A
Cool, yeah. And if you want to follow both these folks on Twitter, you can find their Twitter handles, as well as ours, in the notes on the show on Kube by Example. But why don't we get right into it? The first thing we like to ask our guests is, you know, kind of: what got you into open source?
C
Yeah, I can start. So yeah, this goes back to my undergraduate days, when I was first introduced to things like Linux and FOSS, and I used to attend a lot of these open source conferences back home during that time. And for me, sort of the first time that I started becoming a contributor was with the ns-3 project. It's basically network simulation software, heavily used in the networking research community, and then I joined the Google Summer of Code program with them, and that was sort of my launching pad into open source.
A
Once you're pulled in, it's hard to escape again, yeah. So what brought you to Kubernetes, kind of along that route?
C
Yeah, so that started when I joined VMware, actually. So I work on distributed systems and networking, and at the time I was looking into things like fault tolerance, and then I started looking into things like: how do you... there's a lot of reinventing of the wheel in cluster management, right? Problems like, I don't know, say, rolling upgrades, all of these things. You know, the concepts are very similar, except we keep reinventing that wheel in every new system that we come across. And in one particular project, where I was looking at, you know, how do you write things using declarative programming and stuff, you're like: hey, we have this cool technique, we're sure it makes it easier to build certain things the second and third times.
C
And that's where I was also talking to Michael a lot, because he knew a lot of the Kubernetes internals. At the time I was sort of running into: why is this? Why is the Kubernetes API behaving this way? Michael, help me out here. That's how we got started down this sort of rabbit hole that led to Sieve eventually.
A
That's cool, that's cool, yeah. I mean, as you say, one of the things I think was hilarious: I saw an OSCON talk many years ago that was basically, everything in software had already been invented by 1979, and we're just kind of redoing it better and cleaner and faster since then. You know, in a lot of ways, right? The cloud is mainframe on steroids, right? So it just keeps going and going.
B
Yeah, Michael, tell us, instead of me diverting us down other paths into history, tell us how you got your start in open source, and then especially, you could do a beautiful job like Lalith did of dovetailing it right into the Kubernetes experience. Sure.
D
Yeah, exactly, because I also had, like, two entry points into open source. The first one was 2002, when I was part of a training program at a research institute, and there they were all doing Linux stuff, and I was tasked to migrate the Windows NT domain system and file services towards Samba, some of the file services. At the time, Samba was moving from version two to version three, and it was an alpha, so there were a lot of bugs and issues.
D
I was starting to use it and give feedback to the great Samba community, and this was kind of my first open source adventure, including things like Samba and OpenLDAP. And then, like...
D
I built a lot of systems on Linux, a lot of file services, parallel file systems. And then many years later, in 2015, I joined VMware, and I was going through an onboarding phase at VMware, doing all the hiring stuff, and I saw a talk from some of our colleagues talking about Mesos and Kubernetes, and I got somehow intrigued, even though my role did not require that knowledge at the time. I was just fascinated by the technology and all the stuff they were working on, especially, like, building a distributed operating system or kernel, and so that got me into Kubernetes. And then, because Kubernetes was growing a lot, I focused primarily on the resource management aspect, like scheduling, the Kubernetes scheduler, and how all these resource semantics play into it.
D
Just Google them and you will find a lot of amazing, highly cited papers. And so we got into Sieve, I'm not jumping the gun here, it's just a funny story how we got into Sieve, because if you look at what Sieve does, it's very, you know, very research-driven.
D
So I was working on this blog post about Kubernetes and etcd, and, you know, how the informers work, and the controllers, and all these patterns, and I was writing this article, and Lalith called me and said: you know, Michael, we have a question on etcd, maybe you can help us. And I was like, this is great, because I was just literally typing the answer to your question in, you know, the unreleased blog post. That was back in, I think it was two years ago, maybe, yeah.
A
That's awesome. So one of the questions, you know, I kind of put back to you two is: why is doing work in the open source world better, or more interesting, or whatever, than kind of the alternative places where you might have done development work?
C
Coming from, sort of, the academic half of me, right: I think it's one of the best ways to learn how real-world software works. It's right out there, right? I really credit a lot of my education in computer science to open source. I never had to wonder what the concepts in the books meant when I could just look at real-world software and see how it's actually implemented, whether it's the JVM or Python or whatever, like Hadoop. All of these things are right out there, right? So there's always that aspect of open source that appeals to me: there's all this widely used software where we can actually see exactly how it's coded. You don't have to guess. The other aspect of it is just that it's easy to get people on board and collaborate.
A
Yeah, totally, yeah, it's interesting. My experience in university, though, is that a lot of students are unfamiliar with open source, and I find it kind of mind-blowing, you know, because when I was a college student, I was always looking for things that were cheap or free, right? And if nothing else, you get that advantage with open source. So one of my kind of quote-unquote missions since I joined academia is to try to make open source more available and more prevalent amongst university students. So, let's move on and talk about Sieve. You know, this is one of those words that I learned from reading, so I pronounced it "civ" for a long time. But can you tell us a little bit more about the project? Like, where does it come from? What does it do?
C
Yeah, so the genesis of this project was another project, like I said, where I was trying to build a Kubernetes scheduler, and I was running into all these little quirks with, you know, building just one piece that runs in a larger distributed environment, right? Every time I'd get some kind of stale input from the API server, or there'd be some kind of jitter in the network, or lag, or something like this, it turns out that these very small interactions with the environment would basically break my code. And every time, I remember, I would keep going to Michael, like: why is this happening, right? What am I supposed to do? And then he'd tell me about some pattern I was supposed to use, and I'm like, okay, fine. And so after that project ended, I was like: okay, you know, I don't know if it should take two of us to write this, right? I can't imagine this.
C
So at the time there was also this thing where I would see all these papers: you know, automatically finding bugs in distributed systems was a hot topic, but I always had this frustration that, yes, you can build something for one system, but it's not broadly useful after that, right? And so I then met the rest of the team at a conference I was attending. This is Tianyin Xu, he's a professor at UIUC, so I ran into him, and I pitched this idea that, like, hey, look, I've been programming a bit in Kubernetes and it seems like it's very hard to write correct software there: write these correct controllers that, no matter what kind of external input they get, will still do the right thing and not do anything goofy. And he's an expert in reliability and testing and so on, and that's kind of how we clicked right there, and we teamed up, and his student interned with us the next summer. That's Xudong, he's the lead PhD student behind this project.
C
That was pretty much the start of the show, I would say, and yeah, that eventually led to Sieve, basically. It was a very organic process, and at the time Michael was also looking into things like the list-watch and all the details around that. And, I don't know, that's how the whole team got formed; it was very organic, I think, yeah.
D
Yes. So, because I already knew Lalith, and, you know, almost everything that he did or wrote about got famous in one or the other way, at least for me (fame is relative, but for me it was always very interesting), I kind of followed his work.
D
I also spent a lot of time with him, and we were, like, just having lunch and sharing ideas. And so, whenever there were Kubernetes questions or debates around some of the stuff about how Kubernetes works, as Lalith already said, we just hopped on a Zoom call and we were chatting about it. And, coincidentally, there was this time where Xudong and Lalith were having questions about etcd, some of the semantics that etcd provides to Kubernetes.
D
You know, like the consistency semantics, and also how the informers work in all these controllers. And, as I was writing those blog posts about etcd, I was very familiar with the etcd codebase and consistency semantics, and Lalith had this one question about the linearizability guarantees in etcd, and whether it provides, like, a logically and monotonically growing order of events, or changes, in the system.
D
I was just literally writing that paragraph in the blog post, and so I said: yes, it's there, it's a number, it's just a counter which goes up, the revision in etcd. And he said, this is great, because we need that in order to, you know, make some assumptions about the work that Sieve is doing. And then I was asking: okay, so what is this project all about? And we started working on this, and I gave more input and some of the reviews on the paper.
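The monotonically growing revision Michael describes can be sketched with a toy key-value store. This is just an illustration of the semantics, not etcd's actual implementation (etcd's real counter is the revision field it returns in its response headers):

```python
class ToyStore:
    """A toy key-value store with an etcd-style global revision counter.

    Every successful write bumps a single store-wide revision, so the
    sequence of revisions gives one total order of changes that all
    observers agree on.
    """

    def __init__(self):
        self._data = {}
        self.revision = 0  # grows monotonically, never reused

    def put(self, key, value):
        self.revision += 1
        self._data[key] = (value, self.revision)
        return self.revision

    def get(self, key):
        # returns (value, revision at which it was last written)
        return self._data[key]


store = ToyStore()
r1 = store.put("pods/nginx", "Pending")
r2 = store.put("pods/nginx", "Running")
r3 = store.put("pods/redis", "Pending")

# Revisions grow across ALL keys, so r1 < r2 < r3 is a total order
# of changes, which is the guarantee Lalith needed for Sieve.
print(r1, r2, r3)
```

Because the counter is store-wide rather than per-key, any two observers who compare revisions agree on which change happened first, which is what makes the "trace of database states" reasoning later in this conversation possible.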
C
And Michael was our bridge to the community, I would say; he was more of the insider than we were at the time.
A
Right, and it gives you a different perspective, you know, whatever. So that's pretty cool. So let me just kind of ask a little bit more of a background question. Like, I come across the project: what do I want to use it for? What is it? What does it actually do for me as, kind of, you know, a deployer or an application author?
C
Yeah, so the main target audience is controller or operator developers. Let's say you want to write, say, a Cassandra operator, and you'd go through, I don't know, you would use the Operator SDK, or you might build on things like controller-runtime, and you're going to write some code that maybe monitors some kind of custom resource in Kubernetes, and then takes some action depending on changes that it observes to these resources being created, updated, or deleted, right? Now this code that's running, this reconcile loop that's running somewhere in your cluster, or maybe outside the cluster, is just one entity in a large distributed system. There are other independent actions happening, and at the same time you can have things like failures and network partitions and all these other kinds of hiccups. And the tricky bit with writing this type of code is that it's very hard to test how your system will behave under any failure scenario. For example, most code bases that we looked at aren't testing for what happens if the controller crashes after every interaction with the API server. But Sieve can basically do these types of tests automatically for you.
C
So you give us some tests. We currently have our own little way of writing these tests, but it could very well be the e2e framework or something like this, right? You give us some tests, and what Sieve will do is explore many executions where it introduces failures in those tests, and Sieve will automatically be able to flag when an execution where it introduced a failure actually produced a different result than the failure-free run.
C
This is one of the other innovations in Sieve: basically, it can do that diffing, and then it will tell you, hey, look, I found some executions where I introduced a failure, and it turns out that the cluster looks a bit different than it did before, or it went through some steps that weren't visible in the failure-free run.
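The diffing idea Lalith describes, comparing the cluster state a perturbed run produced against the failure-free reference run, can be sketched in a few lines. This is a schematic illustration, not Sieve's actual implementation, and the object names are made up:

```python
def diff_states(reference, perturbed):
    """Compare two cluster snapshots (object name -> spec) and report
    anything the fault injection changed relative to the reference run."""
    issues = []
    for name in reference.keys() - perturbed.keys():
        issues.append(f"missing after fault: {name}")
    for name in perturbed.keys() - reference.keys():
        issues.append(f"unexpected after fault: {name}")
    for name in reference.keys() & perturbed.keys():
        if reference[name] != perturbed[name]:
            issues.append(f"differs after fault: {name}")
    return issues


# The failure-free run ended with two pods; the perturbed run (say, the
# controller crashed mid-reconcile) left one pod missing and one altered.
reference = {"pod/rabbitmq-0": {"replicas": 1}, "pod/rabbitmq-1": {"replicas": 1}}
perturbed = {"pod/rabbitmq-0": {"replicas": 0}}

for issue in diff_states(reference, perturbed):
    print(issue)
```

The key property is that the developer never writes assertions about failure behavior by hand: the failure-free run itself serves as the oracle that perturbed runs are checked against.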
A
I gotcha. So, in a sense, it could have been a member of the, like, Simian Army, like Netflix's Chaos Monkey, et cetera, kind of in the same vein, right?
B
Yeah, that is actually a question I wanted to at least run through quickly as we get into what Sieve even is. If I were telling someone else about this project at a high level, what kind of testing does it do? Like, would I refer to this as chaos testing? Can I put it into one of those pigeonholes that we kind of use as a shorthand to talk about what surfaces we're testing and what techniques we're using?
C
Yeah, so chaos testing tends to be something that you would run in production, live. That is typically not how you would run Sieve: it's meant to be a development-time testing tool, right? So you can think of Sieve more as another layer of tests that you would run, say, in your CI platform or something like this, on a cluster of machines, where it just runs, I don't know, once a week or something like that.
C
It's more expensive than a typical unit test or integration test, because for each of those tests we're actually running many tests, where we're introducing all kinds of failures. But one key distinction between chaos testing and what Sieve does is reproducibility.
C
If Sieve finds a bug based on a fault it injected, you basically have what's called a test plan. It's a self-contained file that Sieve runs, where it knows that, okay, it's waiting for exactly this event to occur in the API for it to pause the controller and inject a particular type of failure. All of that is encoded in a file, and with just that file you can reproduce exactly that execution over and over again, right? So that's how it really differs from chaos testing.
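The reproducibility described above comes from the test plan pinning the fault to a concrete trigger rather than firing it at random. A minimal sketch of that idea (the real Sieve test-plan format is richer than this, and the event strings here are made up):

```python
def run_with_plan(events, plan):
    """Replay a recorded stream of API events, firing the planned fault
    deterministically the first time the trigger event is seen."""
    log = []
    for event in events:
        if event == plan["trigger"] and plan["action"] not in log:
            log.append(plan["action"])  # e.g. pause controller, crash it
        log.append(event)
    return log


events = ["ADDED pod/a", "MODIFIED pod/a", "DELETED pod/a"]
plan = {"trigger": "MODIFIED pod/a", "action": "crash-controller"}

# Because the plan keys off an exact event rather than a timer or a
# random choice, every replay produces the identical execution,
# unlike randomized chaos injection.
first = run_with_plan(events, plan)
second = run_with_plan(events, plan)
print(first == second)
```

That determinism is what lets a developer re-run a found bug over and over while debugging the fix.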
A
Right, right, so that you can kind of go back and find the bug and fix the bug. So in a lot of ways it sounds like it's a bit more like unit or integration testing, in the sense that it's a well-prescribed set of tests, so you basically know when you have introduced a new problem when you go to the next revision or whatever. Which, you know, I think is a software development component that gets a lot of short shrift: we always like to write the first version, but having to maintain it over time is always much, much harder and nowhere near as much fun. Cool. So have you seen operators using it in the wild yet? Has it been around long enough? Have people been adopting it?
C
We've been getting, like, a trickle of reports from people who are not us trying to run the tool, so that's always promising. There's still more work that we need to do to make it easier to consume as well; ergonomics is one place. Again, we've been talking with Michael about this: what's the right way to make it easy for anyone using, say, the Operator SDK or controller-runtime to just generate a project, batteries included, and start testing with Sieve, right? So there's a bit of legwork there. The least automated part of using Sieve is onboarding Sieve to your project, so there's a bit of manual work that you need to do first, and I think once we iron that out, it will get easier for people. But yeah, we are seeing the first trickle of reports coming in from people.
B
Do you think there's potential for building support for Sieve, or for tools like it, into things like the Operator SDK, or some of the other toolkits that have grown up around building operators? And then a second sidebar, kind of, to both of you, and I hope it's something that'll be useful for the audience: we have so far in this discussion used two words interchangeably, or even kind of neglected the word "controller", and I'd like to kind of draw out: what's a controller?
D
So the good thing is that whatever type of software you're building, whether it's a pure or plain controller or a more advanced operator, Sieve works in both places, because at the end of the day, Sieve instruments the Kubernetes API and client library calls. And since both of these controller patterns, if I want to call them that, all use, you know, client-go or controller-runtime in some form, whether it's through kubebuilder or the Operator SDK,
D
Sieve would apply there as well. The test plans obviously might be more complex in the operator world, because if you have a system that does, like, backups, snapshots, scaling, or, like, storage volumes, et cetera, obviously the test plans, and maybe even the time spent running these tests, would be longer than for a very simple controller which just spins up pods. But yeah, I would not draw any distinction there as to where Sieve would apply.
C
Actually, I think everything we've tested so far is an operator, so it definitely works on that part, I would say, so far.
D
And maybe, as a concrete example, Josh: I was working a lot with the RabbitMQ team over the last two years, mainly because, you know, Knative has this concept of a broker, and there are implementations of these brokers, and RabbitMQ happens to be one. And so we were heavily relying on the RabbitMQ operator, and so, I don't know if it was Lalith or me, or maybe together, we were like: okay, maybe it would be a good one to test, to see how well the operator is written. And one of the bugs we found was an interesting one where, under some circumstances, the RabbitMQ operator would delete the wrong StatefulSet, which would have been attached to a RabbitMQ cluster that someone deploys through a custom resource definition. We found that through Sieve, and the fix was an easy one.
D
It was just using a precondition during a delete, which I've not seen a lot of people using in, you know, a lot of controllers and operators. But, as you may or may not know, when you create or update or delete objects in Kubernetes, you have this options field that you can pass in these calls, and, not always, but often, you can pass preconditions, which, for example, could be: delete that object, but only if it matches this UID. This patch was added to the RabbitMQ operator, and the team, you know, was happy that this was fixed, because it was a critical one.
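In client-go terms, the fix Michael describes passes a `Preconditions` value (which carries a UID and/or resourceVersion) inside the delete options, so the API server refuses the delete if the live object is not the one the controller thinks it is. The semantics can be sketched with a toy Python model; this is not the real Kubernetes API, just the shape of the behavior:

```python
class Conflict(Exception):
    """Raised when a precondition does not match the live object."""


class ToyAPIServer:
    def __init__(self):
        self._objects = {}  # name -> {"uid": ..., "spec": ...}

    def create(self, name, uid, spec):
        self._objects[name] = {"uid": uid, "spec": spec}

    def delete(self, name, precondition_uid=None):
        obj = self._objects[name]
        # With a UID precondition, refuse to delete a *different* object
        # that now happens to live under the same name.
        if precondition_uid is not None and obj["uid"] != precondition_uid:
            raise Conflict(f"uid mismatch for {name}")
        del self._objects[name]


api = ToyAPIServer()
api.create("statefulset/rabbitmq", uid="uid-1", spec="v1")

# The object is replaced behind the controller's back: same name, new UID.
api.create("statefulset/rabbitmq", uid="uid-2", spec="v2")

# A stale controller still holding uid-1 tries to delete. Without the
# precondition it would remove the wrong StatefulSet; with it, it fails.
try:
    api.delete("statefulset/rabbitmq", precondition_uid="uid-1")
    outcome = "deleted wrong object"
except Conflict:
    outcome = "delete refused"
print(outcome)
```

This is exactly the stale-view race Sieve surfaced: the name alone is not a stable identity, so guarding the delete with the UID closes the window.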
C
There was another one with volume management, again with RabbitMQ. That was one of the most fruitful engagements with a team we've had: Sieve found a bug, and then in response they kind of redid how they went about volume management, because there were some bugs around incorrectly managing volumes there as well. And then they actually added a corpus of tests to exercise things the way Sieve would have done it, right? That was pretty cool to see as well, so yeah.
B
And so I wonder, I'm searching, trying to come up with a question that'll kind of draw this out: there are a lot of ways I might test an operator that I built. To directly adopt the example we've been working with so far with RabbitMQ: I'm going to manage some volumes, and there's the bug that you were just bringing up, Lalith. There are a lot of ways I might test that code base, if that's what I'm writing. What does Sieve do that's different? What is the real key advantage here over, say, a different test framework that I might look at? Now, in my understanding, and I played with it a little bit, so I kind of understand, or at least have a clue: I have something that can kind of predictively look at a bunch of points along an execution, and then instrument those points with different things that might happen in the underlying API I'm addressing. I think that's somewhere close to the heart of Sieve, and I'm wondering if, for me and for the audience, you could kind of close that circle and tell me exactly how it's different from using a more general-purpose testing toolkit, or something I might have heard of in the past, or be familiar with.
C
Yeah, sure. So let's take this example of any operator: it has a reconcile loop, right? And when this loop runs, typically what you have, and you know this can be scattered across many files, thousands of lines of code, whatever, is many points in that reconcile loop where it's actually interacting with, say, the Kubernetes API, or something external to the controller, and all actions of that nature are asynchronous. That is, you might create a StatefulSet, but there's no guarantee that the pods corresponding to that StatefulSet will arrive by the time this blob of code finishes, or something, right? So everything related to how the controller interacts with the environment is what Sieve will perturb. And so Sieve perturbs an execution: you give it a test workload to run, Sieve will run the test workload, and then, when it's running the version with the failures, it will perturb the execution in many ways.
C
So one of the patterns we introduce: we're basically checking if your code is making an assumption that your reconcile loop will always run to completion, so it will check for intermediate-state bugs. In this case, what Sieve will do is generate a bunch of plans where, at any point that it can do it, it will inject a fault that crashes the controller exactly after it executes one interaction with the Kubernetes API. So after every get, or every delete, or every update, it will introduce a crash, right?
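That crash-after-every-API-call pattern can be sketched with a tiny harness. The toy reconcile below is deliberately not crash-safe (it decides what to write based on what it has already seen, and only repairs part of its state), so restarting it after a crash can leave the cluster different from the failure-free run. This is illustrative only; Sieve does this by instrumenting real controllers:

```python
class Crash(Exception):
    pass


class Store(dict):
    """Toy API server: name -> value, with an optional injected crash."""

    def __init__(self, crash_after=None):
        super().__init__()
        self.calls = 0
        self.crash_after = crash_after

    def write(self, name, value):
        self.calls += 1
        self[name] = value
        if self.calls == self.crash_after:
            raise Crash()  # controller dies right after this interaction


def reconcile(store):
    # BUG: assumes "config" and "secret" are always created together,
    # so a crash between the two writes is never repaired on restart.
    if "config" not in store:
        store.write("config", "v1")
        store.write("secret", "v1")


def run(crash_after=None):
    store = Store(crash_after)
    try:
        reconcile(store)
    except Crash:
        reconcile(store)  # the controller restarts and reconciles again
    return dict(store)


reference = run()  # failure-free run: both objects exist

# Enumerate crash points: crash right after API call 1, then after call 2.
for point in (1, 2):
    result = run(crash_after=point)
    if result != reference:
        print(f"bug: crash after call {point} leaves state {result}")
```

Crashing after the first write leaves `config` present but `secret` missing forever, and the diff against the reference run flags it automatically; crashing after the second write converges, so nothing is reported.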
C
This is one thing that you won't find any out-of-the-box testing framework doing for you, and in fact I would say most test suites that you can find in open source controller and operator projects will not do this type of crash testing. Now, crashing after every client-go interaction is one thing, but we also do more. For example, we'll also do this thing where we observe, during the fault-free run, which messages the controller sees as changes, right? So we'll actually force out those hidden assumptions from your code, by finding edges in your sequence of events and making sure that the controller doesn't see them. We'll simulate all of those kinds of conditions for you, which is not something that any out-of-the-box testing toolkit will do, and it's certainly not something people do by hand either.
C
It just doesn't scale to anticipate all of those cases by hand. And the hardest thing that we do is the stale-state testing, where we simulate what happens when you get stale messages from the Kubernetes API, which is currently allowed; there are ways around it, not all perfect, but yeah. Sieve will actually look at your execution, and it will do the reasoning for this, rather than just sort of exhaustively trying out every possibility in a very clueless way.
C
It will actually reason and say: okay, it looks like this particular message is only meaningful within this time frame, and that's exactly the time frame where it will give you a stale message, or something like that, right? So we do three patterns right now: the intermediate-state testing, stale states, and the sort of level-triggering versus edge-triggering kind of states.
D
It's a bit comparable to what Jepsen is doing in the database world, like Kyle Kingsbury and the team: they can prove that there are bugs in the system, but they cannot prove that they find all of them.
D
For me, Sieve is a tool that you would run for conformance, or, you know, in a QA kind of context, because it obviously requires more resources and time than just running a unit test, as you already said, Josh. But at the same time, I would also describe it as, like, the co-pilot which looks over your shoulder: well, you're violating rule number five of writing Kubernetes controllers, which is that you're assuming order, or, you know, that you see all the events, all that stuff. It's, say, your code buddy, a little bit. Because distributed systems are so hard, and even understanding Kubernetes and all the semantics, and how the libraries might change over different versions, like, nobody's an expert in all of that.
A
So can I ask a little bit more of a background question? You've put all this effort into Sieve, which has a very narrow use case, quote-unquote. Why is this investment in kind of supporting the development of operators such a good idea? What is it about operators, or controllers in general, that you think is so important? You know, I completely agree with you, Michael: testing distributed systems of any kind is a non-trivial exercise. But obviously this particular space you found particularly compelling, that you wanted to go and support it. So what is it about that space that you think is so interesting?
C
I think, I don't know, I think Kubernetes is a fantastic platform, and that's why the excitement really comes, for me, from Kubernetes the platform itself. But there's this nice thing that Kubernetes enables: most things running on a Kubernetes cluster are probably third party, right? Most operators, controllers, and control plane functions that come into any Kubernetes cluster are third party, and they all work pretty much off of the same core libraries, like client-go, controller-runtime, whatever. And that just means that if you have that nice vantage point to do this type of testing, and that's what we were able to find, that there is such a clean vantage point from which you can do this type of automatic testing, then all these operators and controllers can benefit from one testing tool.
C
So with Sieve we found, you know, that this clean separation of state and computation in a Kubernetes cluster allows you to do this type of testing: you can create this testing tool once, and a large part of the ecosystem can actually benefit from it, and that's why.
B
Yeah, can I ask, to clarify a little bit? In reading material about Sieve before we joined you folks today and had this discussion, I was like, etcd, which I'm really familiar with; my background's at CoreOS, and I wrote the first couple of versions of the etcd documentation, more or less, well, to the extent I understood etcd. You just referenced discovering a point where you could repeatably do this kind of testing. Why is etcd as important as it is to the background discussion of this, if I'm then largely targeting the higher-level Kubernetes API when I'm doing the actual testing?
C
From Sieve's point of view, or from the testing vantage point, the Kubernetes core, the API servers and etcd, is really just a database of objects. So you run some workload; let's say you bring up a Cassandra cluster and you tear it down, whatever. At any point in time, if you look at the Kubernetes API, or this database, you'll see some set of objects, and every little control plane action, where you create, modify, or delete an object, will give you a slightly different database. So there's this history, or this trace, of database states that you can always observe, and the ordering of this thing is important, because it can't be the case that every controller sees a different order. It has to be one order, shared across the entire system, and that's why etcd's guarantees are important there. But all we need to trace is, basically, at any given point in time when the controller did X, what was in the API before or after each event. And this point where the controllers are interacting with the Kubernetes API is also semantically very simple: you read, modify, delete, or create objects. It's literally just a simple key-value store from that point of view, with very simple semantics, so it makes it easy for us to reason about things like the lifetime of objects: I know if a delete went through for a key X.
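The "trace of database states" idea can be sketched in a few lines of plain Go. This is an illustration of the concept only, not Sieve's actual mechanism: an ordered log of create/update/delete operations yields one shared history of states, which makes object-lifetime questions (did a delete go through for key X?) easy to answer.

```go
package main

import "fmt"

// Op is one globally ordered control-plane action. Each create, update,
// or delete yields the next database state, as described above.
// (Illustrative only; the real API server stores typed objects in etcd.)
type Op struct {
	Kind  string // "create", "update", or "delete"
	Key   string
	Value string
}

// replay applies the ordered ops and records every intermediate state,
// producing the "history of database states" a testing tool can inspect.
func replay(ops []Op) []map[string]string {
	states := []map[string]string{{}}
	for _, op := range ops {
		next := map[string]string{}
		for k, v := range states[len(states)-1] {
			next[k] = v
		}
		switch op.Kind {
		case "create", "update":
			next[op.Key] = op.Value
		case "delete":
			delete(next, op.Key)
		}
		states = append(states, next)
	}
	return states
}

func main() {
	trace := replay([]Op{
		{"create", "volume-1", "bound"},
		{"delete", "volume-1", ""},
		{"create", "volume-1", "pending"},
	})
	// Because the order is shared, every observer agrees on the lifetime:
	// volume-1 existed, was deleted, then was recreated.
	for i, s := range trace {
		_, exists := s["volume-1"]
		fmt.Println(i, exists)
	}
}
```

The single shared order is what lets a checker ask "what was in the database before or after event i" without ambiguity.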
A
So I was kind of curious a little bit more about the operation in general. I guess what I'm also hearing you say is that you think a major portion of the ecosystem for Kubernetes is, and should be, operators, and so anything we can do to enable their creation and continued flow is better for the Kubernetes platform, because that's a good delivery model for these systems.
D
Maybe what I would add to this is that if you look at how Kubernetes has grown and been adopted across the industry, where every cloud provider and almost every software company is using or offering Kubernetes, it has somehow become this de facto API that is offered. And so for software vendors, or ISVs, it's a nice platform to use, because you can just assume it's ubiquitous; it's always there. One of the interesting projects that I've also been following is the kcp project from Red Hat. I was having early discussions with Stefan Schimanski, one of the lead engineers there, because kcp takes Kubernetes to the next level, if you will, taking the core API principles and making it more than just a container orchestrator. And there the same semantics apply: you use the same libraries, the same concepts; the same patterns are there, but you might do user management or workspace management.
D
Not everyone, right, but on average. Imagine I would just go out and start building a bridge and call it a day; that's probably not a good way to do things. There are way more regulations, way more descriptions, and stuff that you have to put in place before you can actually start building the bridge. And then in computer science, people like Leslie Lamport advocated for writing specifications first, in something like TLA+: write the specification in a non-code language, use mathematics to write the algorithm, then prove it, and then write the actual code. But often we just start writing the code, and then, if it works and the unit tests pass, it sounds like it's good, right? And then these bugs show up, and some of them are more complex and subtle.
A
So, interestingly enough, to plug our own show: in our first episode, when we interviewed Clayton Coleman, we talked about kcp, and I'll throw that link in the chat like I've been doing with a bunch of others. But I was kind of curious about that control plane idea: having a uniform API for all the things really does have some serious advantages.
A
I was kind of curious, specifically, since you mentioned that Sieve is a dev-time tool: is that what you foresee in the future? Or is this something that should be running in production? I mean, one of the things that I've found...
A
So actually, I've been working on distributed systems for a long, long time. A friend of mine and I once built a Visual Basic remoting tool over HTTP and lied about it actually being DCOM. This was a long time ago, right? And we actually wrote a way to watch for distributed-system communication challenges as the application was running, so we also had a dev-time tool, but we also wanted to run it in production. The hardest part about distributed systems is that you put it in production and all of a sudden you're like...
C
So far, I think we're doubling down on the development-side thing. I think you've hit the nail on the head there: you don't know what path you took to hit a bug in production, and I think our philosophy here, at least, has been to not even give a bug a chance to happen in production if you can avoid it. And so we are even looking at things like verification right now, because Sieve's effectiveness as a testing tool is only as good as the initial happy-path tests that you provide. Sieve will explore all these conditions based on those initial tests, but it's not going to guarantee the absence of bugs, for which you need things like verification.
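The point about amplifying happy-path tests can be illustrated with a toy perturbation generator in Go. This is a sketch of the general idea only (one happy-path trace fans out into many fault schedules), not Sieve's actual test-generation logic:

```go
package main

import "fmt"

// A happy-path test is just an ordered list of controller events.
// A perturbation-based tester derives many fault schedules from that
// one trace, e.g. "crash the controller after event i and restart it".
type Schedule struct {
	CrashAfter int      // index of the event after which to inject a crash
	Events     []string // the original happy-path events
}

// perturb turns one happy-path trace into one schedule per event,
// so a single passing test seeds many corner-case executions.
func perturb(happyPath []string) []Schedule {
	var out []Schedule
	for i := range happyPath {
		out = append(out, Schedule{CrashAfter: i, Events: happyPath})
	}
	return out
}

func main() {
	happy := []string{"create-statefulset", "create-volume", "update-status"}
	for _, s := range perturb(happy) {
		fmt.Printf("run test, crash controller after %q, then restart and check invariants\n",
			s.Events[s.CrashAfter])
	}
}
```

This also shows why coverage is bounded by the seed tests: events that never appear in any happy path are never perturbed, which is why exhaustive guarantees need verification instead.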
C
On the doing-things-live side, I'm involved in some projects where we are looking again at this principle: if all of your state is in one globally ordered database, and you don't keep any state off of it, say all state outside of it is ephemeral, it might help with doing things like debugging. There's this project I'm involved in called DBOS, with collaborators from Stanford and MIT, and there we are looking into some of these types of questions: how can you attach a debugger to a live system and find out exactly what the state was as of a bug happening? If you can trace all of the information up until then, you can do things like record-and-replay debugging live, but it assumes a very different, very restricted programming environment.
A
No,
I
totally
understand
yeah,
I
mean
so
you
know,
but
I
guess
kind
of
what
we
should
be
looking
for
is
kind
of
the
you
know
breath
you
know
increased
breath
of
steve's
coverage
around
kind
of
dealing
with
these
tool
chains
with
you
know
the
hope
that
there
will
be
fewer
bugs
in
production.
I
don't
know-
maybe
maybe
I'm
less
optimistic
than
than
you
are
that
you
know.
A
Certainly
I
I
have
never
had
a
production
system
that
didn't
have
the
occasional
bug,
so
you
know
it's
always
there,
but
we
can
get
closer
right
every
every
time.
C
Yeah
there's
no
such
thing
as
bug-free
software,
even
with
verification.
If
you
ask
me,
but
you
can
so
there's,
there's
going
to
be
this
quality
curve
right,
you
can,
you
can
push
at
it
and
you
can
change
the
you
can
play
with
the
asymptotics
a
little
bit
but
yeah
you're
not
going
to.
B
C
A
Yeah,
I
I
still
remember
in
college
I
had
a
professor
who
told
me
that,
as
I
as
I
became
more
experienced,
I
would
stop
ever
having
syntax
errors
and
I
was
like-
and
you
know
x,
number
of
years
later,
like
25-ish,
I'm
still
waiting
for
that
day.
You
know,
for
example,
so
you
know
I
yeah
bugs
bugs
are
a
way
of
life
and
that
it's
more
about
being
able
to
identify
kind
of
why
they
happened,
how
they
keep
you
know,
keeping
them
from
happening
again.
C
I
think
the
important
bit
here
is
also:
what
is
the
cost
of
a
bug
right
right,
yeah
like
it's,
I
think
tools
like
sieve,
I
think,
are
valuable,
because
the
cost
of
a
bug
in
this
space
is
quite
large,
like
we
you're
finding
bugs
where
volumes
get
accidentally
deleted
and
things
like
this
right,
you're
losing
data.
C
There
are
also
security
holes
that
show
up
where
there
were.
There
were
some
operators
where
we
found
that,
like
a
crash
at
the
wrong
time,
would
cost
them
to
silently
not
configure
dls
things
like
this
right
like
so.
There
is
quite
a
price
for
these
type
of
bugs
and
therefore
it's
okay
to
spend
a
bit
more.
You
know
resources
and
development
time
trying
to
find
them
for
other
kinds
of
bugs
where
you
don't
have
this
much
of
a
price.
Perhaps
it's
not
yeah.
B
So,
if
I'm
somebody
who
wants
to
push
on
that
quality
curve
and
I'm
really
interested
in
sieve
and
I'm
so,
this
is
a
two-part
question
one
if
I
want
to
start
using
it
tomorrow
to
make
my
operators
better.
What's
the
first
thing,
I
should
do
two
if
I
want
to
start
working
on
sieve
tomorrow,
to
make
everyone's
operators
and
controllers
better.
What
is
the
first
thing
I
should
do.
C
So
the
first
part
is:
what
do
you?
What
should
you
do
if
you
want
to
use
it?
You
know
for
your
operator
today.
If
you
go
to
the
sieve
project,
you'll
basically
find
the
sporting
guide
on
how
to
you
know
brings
even
to
your
project,
and
it
also
will
answer
your
second
question,
because
you'll
see
a
lot
of
the
work
that
you're
doing
by
hand
that
could
be.
You
know,
defaults
that
you
get
from,
let's
say
using
an
sdk
directly
right.
C
So
here,
for
example,
you'll
see
that
you're
going
to
specify
a
bunch
of
version
info
like
what
kind
of
what
go
version.
Are
you
using
what
kubernetes
version
are
you
using?
What
commit
id?
Are
you
using
for
your
project
to
test
and
so
on?
A
lot
of
those
defaults,
I
think,
can
come
straight
out
of
like
if
you
integrated
this
with
say
the
operator
sdk,
and
there
were
some
batteries
attached
way
where
you
just
say,
you
know,
generate
this
project
to
be
save
enabled.
C
A
lot
of
those
defaults
could
just
come
off
of
the
project
itself
right
and
then,
for
example,
steve
needs.
Some
information
like
oh,
where
is
the
docker
file
for
this
project
or
where
like
it,
also
needs
some
information
on
the
cr
itself
so
that
it
can
put
in
some
annotations.
That
c
will
use
during
testing
to
find
the
pod
corresponding
to
the
controller.
Like
a
lot
of
these
things
can
just
really
be
it's
mechanical
work
that
I
think
we
can
automate
away.
C
If
we
actually
wrote
this
as
a
plugin
for
say
I
don't
know,
controller
runtime
or
operator
sdk,
or
something
like
this,
so.
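For illustration only, the kind of per-project information just described (Go version, Kubernetes version, commit ID, Dockerfile path, an annotation to locate the controller's pod) might be collected in a manifest roughly like the sketch below. The field names here are made up for this sketch and are not Sieve's actual schema; the real format is in the project's porting guide.

```yaml
# Hypothetical example of the per-project details described above;
# field names are illustrative, not Sieve's real manifest format.
project: my-operator
goVersion: "1.19"
kubernetesVersion: "1.23"
commitID: abc1234              # commit of the operator under test
dockerfilePath: ./Dockerfile
controllerPodLabel: my-operator  # used to find the controller's pod
```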
C
If you go through this process once, especially this crowd, you'll see exactly why there's a strong need for us to integrate this with one of these SDKs: a lot of those defaults and things that you're hard-coding into a manifest file that Sieve will use can just be assumed easily. And if you want to help us with the Sieve project, this is one of the main things that we need. I think Xudong is working on a wish list that will be up soon, but I think ergonomics, around making it easy to consume Sieve in any project, is the big thing for us right now.
A
For it as well. Okay, I linked the project and the porting guide in the chat, so people should be able to find it. So that's cool. Josh, did that answer your question, or did you have another question you wanted to ask about it?
B
No,
I
think
those
are
those
exactly
kind
of
the
the
two
things
I'm
I'm.
I
was
looking
for
and
especially
delighted
by
the
observation
that,
in
a
crowd
full
of
people
looking
for
sre
surfaces
to
automate
as
soon
as
you
encounter
the
first
order,
problems
of
manual
work
you'll
go
looking
to
how
to
how
to
make
them
not
manual.
The
second
time
you
go
through.
B
What's my hook to walk out with, the next great thing about it? Would it be integrations with build toolkits and SDKs, like Kubebuilder and the Operator SDK, and some of those things we've talked about? Is any of that work active? Is there some of that going on that we could take a look at now, or...
C
I
think
the
one
big
thing
that
we're
working
on
now
is
even
not
needing
like
not
needing
the
assumption
that,
for
example,
right
now,
we
assume
controller
on
time.
I
think
we're
working
to
remove
that
assumption
and
work
with
just
base
client
go
and
once
we
do
that
it
gets
easier
to
sort
of
go
upwards
as
needed
for
the
different
other
frameworks
that
show
up
right.
So
this
is
one
big
thing
we're
working
on
now,
I
would
say
going
forward
if
the
community
would
like
to
help
us
out.
C
The long-term thing to get excited about is that we're looking at Sieve as not just something to test controllers or operators, but also asking what part of this can extend all the way to the applications that are being managed by these operators. We don't have any visibility today into, let's say, what the Cassandra cluster itself, which the operator might manage, is doing.
D
Kubernetes-controller-related talks lately, but the ones that were presented were highly attended by a lot of folks out there. And there's not a lot out there, even though there are two good books that I know of (Josh, you wrote one, and Stefan Schimanski and Michael Hausenblas wrote the other one). But I think, more on this, I called it the mode that you had to have in your head, the mindset of writing these controllers. There's a lot of things, like maybe the five or ten golden rules that you have to keep in mind, relating it to best practices in writing these operators and stuff like that. I think sharing the knowledge and the learnings from the Sieve project is also going to be very helpful, and then obviously driving adoption of Sieve, because people are getting aware of Sieve.
A
Cool
yeah,
those
are
those
are
those
are
great
things
to
look
forward
to
and
I'll
I'll
take
that
as
a
little
bit
of
a
segue
in
that
I'll
actually
be
doing
an
interview
or
a
panel
discussion
with
ford
at
kubecon
in
north
america
in
detroit
in
october,
and
we
have
a
bunch
of
other
crazy
ideas
planned
so
as
kb
insider.
A
So
you
should
definitely
join
us
as
well
as
we'll
be
doing
a
kind
of
a
meet
and
greet
event
at
the
openshift
commons.
A
lot
you
know
a
reception
the
day
before
so
definitely
join
us
there
and
watch
for
the
video
output
of
and
I'll
give
you,
the
teaser
of
you
know
maybe
riding
around
in
cars,
interviewing
people
with
motor
city
and
then
next
time
on
the
show
at
the
end
of
september.
A
So
the
last
tuesday
of
september
we're
going
to
be
interviewing
ikus
about
k-native
and
chain
guard,
and
so
talking
about
this.
A
Yeah-
and
you
know
obviously
I'm
I'm
a
little
biased
towards
it
anyway,
because
I
think
serverless
and
event-driven
architectures
are
really
the
way
all
software
development
should
be
done.
Yes,
I'm
a
little
biased
so,
but
thank
you
so
much
for
being
on
the
show.
We
really
appreciate
it
and
yeah.
We
look
forward
to
seeing
you
again
thanks.
Everybody.