From YouTube: Argo Workflows and Events Community Meeting 20 Oct 2021
Description
02:50 Hera Workflows Python SDK demo
32:30 How LitmusChaos uses Argo Workflows
A
Okay, good morning, thank you everybody for coming along today. Just a quick overview of what we're going to be doing today: we're going to have a demo of a new Python SDK for Argo Workflows from Flaviu Vadan shortly. I think that's pretty interesting.
A
We know we've got a lot of Python users. And then we're going to have a presentation and demo on LitmusChaos from Shubham and Karthik, and then an opportunity to ask any questions or discuss any other topics that you would like to discuss today.
A
If you can add yourself to the attendees list, that would be awesome. For people who are new to this meeting: we tend to talk about Argo Workflows and Argo Events, and we tend to have people come along and demonstrate those two products. I won't tell you any more about those because I'm sure you're familiar with what they are. Who are we, who runs these meetings? I'm Alex. I am a principal software engineer on Argo Workflows and Argo Events.
A
I used to work on Argo CD as well. During this, if you want to ask any questions, feel free to ask. Typically we have a Q&A at the end of any demo or presentation; that's a great opportunity to ask them. And if you want to come back and ask any questions afterwards, something that slipped your mind at the time, then obviously you can come and ask those same questions on Slack, that's fine!
A
We are recording this, so if you want to share it with some colleagues, the recording typically goes up on YouTube the next day, sometimes two days, and then you can share that out.
A
Just a quick announcement: three weeks ago we started running a weekly meeting on Tuesdays at the same time, 10 a.m. Pacific Standard Time, which is an opportunity to learn what people are working on right now and learn a bit about what milestones and roadmaps we're doing. That's aimed at code contributors.
A
So
anybody
working
on
actual
co-contributions
at
the
moment,
if
you're,
if
you're
doing
that-
and
we
also
talk
about
things
like
roadmap
and
what
the
future
will
be
for
our
go
workflows,
events
and
also
for
data
flow
as
well.
Okay,
so
flavio,
are
you
ready?
C
Okay, there we go. Hi everyone, my name is Flaviu. I'm a software engineer at Dyno Therapeutics, and today I'm going to be talking to you about what I like to call the missing Argo Workflows Python SDK. It's called Hera, and its main intent is to make it easy for scientists in particular to adopt and use Argo Workflows, by providing them with a very easy-to-use interface for constructing and submitting Argo workflows. I'll talk a bit about Dyno.
C
Our computational group is composed of scientists and engineers, and our scientists work on numerous algorithms that are typically run in workflows, for running things like model training in this realm of machine learning or computational biology: workflows that take data from the laboratory and apply specific processes to that data. For organizing, scheduling, and submitting these workflows,
we've used multiple platforms in the past. The one that was predominantly used when I first joined was Kubeflow, and of course we used it for notebook hosting, but we also used the Kubeflow DSL. For the majority of our scientists, it was very, very challenging to debug the workflows that they were scheduling, mostly because of the way the cluster was set up; observability was pretty challenging for them to understand. In addition, the syntax and vocabulary adopted by the Kubeflow DSL are not very welcoming for numerous people, probably for the academic community, and that made it very challenging to set up things like parallelism, because of the way Kubeflow extracts the variables from the payload that it receives when scheduling a parallel workflow.
C
So once we moved away from Kubeflow, we adopted Argo, and at first we used the Argo Workflows DSL to restructure and schedule some of the workflows that we wanted to submit. It was great at the start; it allowed us to obtain almost instant value from Argo Workflows. But again we faced numerous challenges, and it's not most of the engineering group but our scientists who didn't feel empowered to build these workflows on their own, for numerous reasons.
C
But I think primarily it was because this DSL exposes a lot of the objects that come from the Argo Workflows Python client, which uses the OpenAPI schemas, if I'm correct. And you also have to write this very specific syntax to obtain input parameters and whatnot, and all of a sudden our scientists need to know a lot of elements about Argo in order to construct and submit a workflow.
C
In addition, it makes it challenging to easily skip steps, which is often something that our clients want to do during their experiments.
C
And lastly, it makes it very challenging to request specialized resources such as GPUs, because you need to essentially write your own internal library that wraps the workflow object that comes from this SDK in order to access specific fields on it, set up node selectors, set up whatever NVIDIA card you want, and things like that.
C
The second SDK we tried was Couler, and it solved some of the DSL problems we had before. It is certainly more Pythonic, but inputs are still a problem. I've in the past submitted an issue mentioning this, because we're using environment variables for the container, but again you still need that very specific syntax, and a lot of the workflow setup is quite confusing for a lot of our contributors.
C
In addition, the last time we used it, you couldn't submit to a custom domain. We have an Argo server deployed to Kubernetes, and it sits behind an IAP, so there's a specific domain we need to hit to reach the server, and for running Couler
you specifically need kubectl to access the Kubernetes deployment. This is a big no-go for Dyno, mainly for practical reasons but also from a security standpoint. We primarily don't want our scientists to have to worry about these engineering tools, such as kubectl, in order to port forward and things like that, which conceptually doesn't sound that challenging. But the value for our company is in the scientific pursuit, not in the use of kubectl for submitting workflows.
C
So we thought of building our own internal SDK with simplicity at its core, and we had specific requirements for that. We wanted to easily understand dependencies.
C
We wanted an easy way to set up parallelism over whatever jobs we want to submit. We wanted to submit it easily.
C
We wanted to maintain some high-level Argo vocabulary, to maintain consistency with the Argo UI, for example, because there's a lot of value in using the UI for our scientists for debugging purposes. And we wanted to support Pydantic schemas, because they're easily JSON serializable and that fits nicely with the built-in Python json package.
C
With that, I'll switch my screen and offer some demos. Tasks: a task and a workflow are primary citizens of Hera, and we have this concept of a workflow service. The workflow service is a wrapper around the workflow service API that simply sets the right configuration for you to submit your workload to your own domain, under the assumption that you're able to provide a bearer token to pass authentication for workflow submission. Then you inject your service into a workflow, which currently takes a name.
C
And all we're adding here is a function that's callable. This is a very simple toy example, of course, but you can imagine our scientists building something a bit more complex than this, as this atomic unit of execution that gets submitted to Argo for execution. So our clients, our scientists, now get the opportunity to exclusively focus on the implementation of the scientific code, with a very simple interface for workflow submission.
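A minimal sketch of that pattern, assuming the early Hera API shown in the demo; the import paths and the WorkflowService signature vary between Hera versions, and the host and token are placeholders:

```python
from hera.task import Task
from hera.workflow import Workflow
from hera.workflow_service import WorkflowService

def say_hello():
    print("Hello, Hera!")

# The service wraps the Argo workflow-service API; host and token are placeholders.
ws = WorkflowService(host="https://my-argo-server.my-domain.com", token="my-bearer-token")
w = Workflow("hello-hera", ws)

# A Task takes a name and a plain callable: the atomic unit of execution.
t = Task("say-hello", say_hello)
w.add_task(t)
w.submit()
```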
C
So again: they build the task, they add their tasks to the workflow, and they submit. I'm going to spare you from looking at the UI initialization and things like that; I just took a very simple screenshot for this example. The next one I wanted to show you is the classic diamond that is offered in the SDK examples. Again you set up your workflow service, you make a workflow, and then you get to focus on your tasks. You'll notice that in this case we now have a parameter that we pass in, and the way you perform this mapping is, you pass in this dictionary payload that says: map 'message' to this thing. Then it's just going to get it, it'll execute, it'll have that parameter for you, and it'll execute the print statement. The way you set dependencies is you say task dot next, this other task.
C
So in our case we have a.next(b), a.next(c), b.next(d), c.next(d), and that's going to give us the diamond. You can also use the right-shift operator, and I'll have an example of that as well. Again, you add your tasks, submit your workflow, and you're done, and I have a screenshot of that here. For the dynamic fan-out part, we've introduced this InputFrom concept that takes a task name.
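A sketch of the diamond under the same assumptions (a Task taking a callable plus a list of parameter dictionaries, `.next()` or `>>` for dependencies):

```python
from hera.task import Task
from hera.workflow import Workflow
from hera.workflow_service import WorkflowService

def say(message: str):
    print(message)

ws = WorkflowService(host="https://my-argo-server.my-domain.com", token="my-bearer-token")
w = Workflow("diamond", ws)

# Each task maps the function's `message` parameter to a concrete value.
a = Task("a", say, [{"message": "This is task A!"}])
b = Task("b", say, [{"message": "This is task B!"}])
c = Task("c", say, [{"message": "This is task C!"}])
d = Task("d", say, [{"message": "This is task D!"}])

a.next(b)  # equivalently: a >> b
a.next(c)  # a >> c
b.next(d)  # b >> d
c.next(d)  # c >> d

w.add_tasks(a, b, c, d)
w.submit()
```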
C
So in our case we have a generate and a consume task. The consume task depends on the generate task, and it takes the value parameter that's extracted from the JSON payload that comes from the generate task. We've resorted to a simple json.dumps to standard out for now; perhaps in the future we're going to have something that looks a bit more friendly, but this is sufficiently easy to understand, and we've added a lot of documentation around this.
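A sketch of that generate/consume pattern; the `InputFrom` name and its signature follow the demo and may differ between Hera versions:

```python
import json

from hera.input import InputFrom
from hera.task import Task
from hera.workflow import Workflow
from hera.workflow_service import WorkflowService

def generate():
    # The payload is written to stdout as JSON; Argo captures it and
    # fans it out to the consuming task.
    print(json.dumps([{"value": i} for i in range(5)]))

def consume(value: int):
    print(f"Received value: {value}")

ws = WorkflowService(host="https://my-argo-server.my-domain.com", token="my-bearer-token")
w = Workflow("generate-consume", ws)

g = Task("generate", generate)
c = Task("consume", consume, input_from=InputFrom(name="generate", parameters=["value"]))
g.next(c)

w.add_tasks(g, c)
w.submit()
```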
C
So again: generate.next(consume), add your tasks, submit your workflow, and then it looks like that, essentially. We've also added retries, which take a duration and a max duration, and that's just a simple function that says: generate a number between 0 and 1; if it's less than 0.5, raise an exception; otherwise it's a success. In this case it tried twice when I ran it. And as a last example that's a bit more complex, I have a task A that takes multiple messages now, and this is how you set parallelism.
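A sketch of the retry example; the `Retry` class with `duration`/`max_duration` follows the demo's description, so treat the exact name as an assumption:

```python
import random

from hera.retry import Retry
from hera.task import Task
from hera.workflow import Workflow
from hera.workflow_service import WorkflowService

def flaky():
    # Fails roughly half the time so the retry policy kicks in.
    n = random.uniform(0, 1)
    if n < 0.5:
        raise Exception(f"Unlucky draw: {n}")
    print(f"Success: {n}")

ws = WorkflowService(host="https://my-argo-server.my-domain.com", token="my-bearer-token")
w = Workflow("retry-example", ws)

# Retry with a backoff duration and an overall cap, per the demo.
t = Task("flaky", flaky, retry=Retry(duration=1, max_duration=60))
w.add_task(t)
w.submit()
```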
C
You say: process this input, and this input, and this other input, and then the task will just execute in a parallel fashion. It'll make a task group for you, and it'll execute the function that you submit using each of those specific inputs. I made some linear tasks as well, just for the purposes of illustration, and again you make your dependency chain.
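A sketch of that fan-out: passing a list of parameter dictionaries to a single Task makes it run once per entry, in parallel, again assuming the demo's API:

```python
from hera.task import Task
from hera.workflow import Workflow
from hera.workflow_service import WorkflowService

def say(message: str):
    print(message)

ws = WorkflowService(host="https://my-argo-server.my-domain.com", token="my-bearer-token")
w = Workflow("parallel-example", ws)

# One Task, three inputs: Hera turns this into a task group that runs
# the function once per dictionary, in parallel.
a = Task("a", say, [
    {"message": "first"},
    {"message": "second"},
    {"message": "third"},
])
w.add_task(a)
w.submit()
```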
C
You say a, b, b1, whatever, all the way to d. You add all of your tasks; you could have put them in a list and then star the list, of course, but this is just the simple way. Then you submit your workflow, and you end up with something like this. It's overall quite easy to understand conceptually what a task does, especially when you're using things like default parameters and whatnot. So, currently, it has a lot of options that come with defaults. It uses the python:3.7 image, for instance, so if you don't have internal dependencies you can just use that one. But at Dyno, for instance, we have a lot of custom code, so we inject our own images into this image field, and then we execute whatever code we want to submit to Argo using that specific image.
C
It has a command field for supporting things like running the Argo script in parallel using some product like Horovod, for instance. We use Horovod for distributed GPU training, and we use a horovodrun command to launch the Argo script whenever the pod starts up. You can pass in environment variables, and there's this concept of resources, where you can say: I want whatever min CPU, max CPU, GPUs, a volume, existing volumes, and empty-dir volumes. The reason we introduced Pydantic is because you can build schema validations, which are quite nice, especially if you apply any business logic in your workflow submission process.
C
It also has tolerations and node selectors, and the reason we resorted to node selectors is because of differences between GKE, EKS, and Azure AKS, I think: to provide the opportunity to use Hera irrespective of the cloud provider that runs your Kubernetes cluster, so that you can do things like request specific GPUs using the very particular node selectors that you're using.
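A sketch of requesting specialized resources under these assumptions: the `Resources` fields and the `node_selectors` parameter approximate the demo; the GKE accelerator label is the real one, but treat the exact Hera field names as illustrative:

```python
from hera.resources import Resources
from hera.task import Task
from hera.workflow import Workflow
from hera.workflow_service import WorkflowService

def train():
    print("Training on a GPU node...")

ws = WorkflowService(host="https://my-argo-server.my-domain.com", token="my-bearer-token")
w = Workflow("gpu-training", ws)

t = Task(
    "train",
    train,
    image="my-registry/my-training-image:latest",       # custom image with internal deps
    resources=Resources(min_cpu=4, max_cpu=8, gpus=1),  # illustrative field names
    # On GKE, accelerator nodes carry this label; other clouds differ, which is
    # why Hera exposes raw node selectors rather than a GPU abstraction.
    node_selectors={"cloud.google.com/gke-accelerator": "nvidia-tesla-k80"},
)
w.add_task(t)
w.submit()
```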
C
However, I also wrote a very simple example that showcases how you can write your own wrapper to simplify submission even more, so you don't really need the service, because your company will probably have a single way of generating the token, and your company will probably have a single domain for your Argo server. So you can imagine building your own MyWorkflowService that's a wrapper around the workflow service, with a consistent domain and a simple way to generate your token.
C
I just have a mock here, plus whatever parallelism you want to put in. Similarly, you can have very specific tasks for a specific domain at your company, that use maybe very specific images that are only allowed to take in specific functions, or have default retries and whatnot. In our case, we use Google Kubernetes Engine, so we know that there's this label on the nodes that have NVIDIA K80s on them, for example, and we just have that as the default node selector for accessing K80 GPUs. Or if you have specific commands, specific environment variables that you want to inject, or volumes or something like that, you can write your own, and that just simplifies your workflow submission interface even more. It saves a single line, but at least you're saving your colleagues from the need to write these details themselves.
C
So we currently have a repository under argoproj-labs named hera, and I'm thinking about using hera-workflows as the PyPI name, perhaps. I have a note here to remove the contextual elements, but they have already been removed; what I mean by contextual elements is anything specific to Dyno, because it is ultimately an internal SDK. And again, I'm very open to feedback around the execution model, the scheduling, and things like that.
D
Okay, and could you elaborate a little bit on why and what your data science team needs from kubectl? You mentioned you still need kubectl to work with workflows submitted via Couler, right? Since Couler already provides things like status checking, I wonder what else your team needs.
C
And that's why you need kubectl: you need your Kubernetes configuration to be set in such a way that Couler can use it for submitting the workflow, and that's where the problem comes in, because we do not want our scientists to have to worry about these types of things, about having the right context, the right namespace, the right whatever, for submitting their workflows.
C
So, for observability, I just wanted to clarify that they were definitely going to the UI; it's just the submission part that was problematic for us. Okay.
C
Exactly, yeah. So all of this stuff is inside a script template, and whatever types your parameters have, it'll import, for example, json, and it'll serialize your value. So this import json: there are actually two jsons in this script if you look at it, because there's an import json that's automatically added in order for the script to add something like message, you know, inputs.parameters.x.
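To illustrate what is being described, here is a hypothetical rendering of a script-template body for a task whose function takes `message`; this is not Hera's exact output, just the shape of it:

```python
# Hypothetical generated script body (not Hera's exact output). The first
# `import json` is added automatically to deserialize the input parameter;
# a second would appear if the user's own function also imported json.
# Argo substitutes the {{inputs.parameters.message}} placeholder before
# this script runs inside the container.
import json

message = json.loads('''{{inputs.parameters.message}}''')

print(message)
```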
C
Jesse asked: could the workflow service also work with the k8s API? Probably, yeah, I think so, I don't see why not. I take your question, Jesse, as: can we use the deployment the same way Couler does? And yes, absolutely. Your second question, Jesse, about how the code becomes what's executed inside the container: that's similar to what Terry just asked, and yes, it is via script templates.
C
I
mentioned
that
it
looks
the
hero.
Workflows
repository
looks
empty
right
now.
Yes,
that
is
correct.
I'm
currently
working
with
bala
and
and
alex
to
to
gain
access
to
that
repository,
and
I
do
have
a
community
an
argo
project
like
labs
community
issue,
submitted
to
to
to
gain
access
to
that
repository.
Essentially.
E
Do
you
anticipate
you'll
run
into
like
the
size
limit
of
stuffing
everything
into
the
script
template,
especially
if
you
have
like
a
lot
of
functions
or
long
functions,
and
then
what
do
you
think
the
mitigation
for
that.
C
Yeah, this is certainly a concern, and I can share from personal experience that we have some scripts, specifically for training machine learning models, that are quite long, and if we schedule things that are in the thousands, then we start encountering problems. The easiest way we found around it right now is to submit multiple workflows. It's quite easy to do that, because Hera presents a very accessible interface for programmatic submission, and we just wrap tasks in different workflows.
C
Yeah, the biggest reason we like using it this way is because it helps us iterate very, very quickly. The function that you submit is added to the script template as-is, and as long as your container provides the right dependencies, you don't have to rebuild your image when you change your ML code. By contrast, with the code in a library, you have to rebuild your Docker image, because otherwise you're going to use one that's out of date.
C
Yeah, of course. So there's a part that looks through the arguments that are set; these are just your arguments, and these are structured from, obtained from, parameters. So yes, you're right, there's a lot of inspect going on. I'll show you the script itself: it looks for the start token, it grabs every single line, and then dedents. It does another dedent that I don't remember the reason for, but I can probably look into it.
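The mechanism being described can be sketched with the standard library; this illustrates the technique, not Hera's exact code:

```python
import inspect
import textwrap

def say(message: str):
    print(message)

# Grab the function's source, drop the `def` header (the "start token"),
# and dedent the body so it can be pasted into a script template verbatim.
lines = inspect.getsource(say).splitlines()
body = "\n".join(lines[1:])   # everything after the signature line
script = textwrap.dedent(body)

print(script)  # -> print(message)
```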
D
So usually I don't think image building is a pain point for them, but the more user-friendly, the better.
A
Yeah
I've
been
wondering
if
there's
like
an
intermediate
between
pre-baked
images
and
one
that
kind
of
would
bring
in
your
requirements.
But
I
think
that's
probably
not
the
case,
because
either
you've
got
pre-baked
image
that
you
you're
happy
to
use.
Or
it's
been
a
you
know,
product
relationalized
by
your
ops
team
and
it's
not
really
a
problem.
B
I
have
another
question
regarding
the
secret
because
I'm
trying
to
compare
the
product
with
airflow
a
lot
of
times,
I'm
doing
that
airflow
is
a
big
thing,
the
secret
management
basically
injected.
So
how
we
handle
here
like
do
you
creating
the
secrets
on
the
fly
and
reference
it,
or
do
you
just
like
parsing
as
a
string
as
an
environment,
variable.
C
So
there
are
multiple
ways
we
can
go
about
this
there's
a
base
environment
specification
that
will
make
an
mvar
for
you
and
there's
also
a
secret
and
specification
that
will
take
in
as
a
secret
name
and
a
secret
key
from
a
very
specific
secret.
That's
currently
available
in
a
kubernetes
cluster,
and
I
believe
there
are
different.
C
You
know
other
kinds
of
secrets
we
could
include
like
the
ones
that
pull
metadata
from
the
pod
definition,
but
we
haven't
found
a
a
clear
use
case
for
that
and
it's
you
can
think
of
her
as
the
stripped-down
version
of
argo
that
focuses
on
the
things
that
most
data
science
practitioners
will
need
and
in
time
we're
probably
going
to
be
expanding
it.
B
Oh okay, so basically you don't handle it; the data scientists assume the secret is there, so they can connect to the data store or whatever they need. In this way, I'm kind of trying to connect the dots: how do data scientists in your team do local development, and then transition? Can this run locally without any problems?
C
One of the big advantages of being able to write these functions independently of your task is that now you can write tests for them. You can do whatever you want with that function; I can literally import it and write tests for it. You can write your function in a different file; you can have a file just with tasks and one with workflows, and then you can share these tasks, perhaps, and things like that, so yeah.
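A sketch of that separation: the function lives in a plain module, gets unit-tested directly, and is only wrapped in a Task at submission time; the file names are illustrative:

```python
# steps.py: plain Python, no Hera imports needed.
def double(x: int) -> int:
    return x * 2

# test_steps.py: the same function is unit-testable in isolation.
from steps import double

def test_double():
    assert double(21) == 42

# workflows.py: only here does the function become a workflow task.
from hera.task import Task
from steps import double

t = Task("double", double, [{"x": 21}])
```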
B
I want to ask maybe just a general question; I don't know much about data scientists, that's why I'm asking some dummy questions. What's the common development process? I know data scientists are normally more focused on algorithms, so, not all of them, but they normally don't care about the runtime.
B
So
do
you
see
that
the
data
scientist
will
hand
over
the
jupiter
notebook
or
something
to
you
or
to
your
team,
but
like
then,
you
help
to
reorganize
the
their
script
and
make
it
like,
like
production,
ready
like
turn
this
in
format?
How
do
you
see
they
actually
feel
comfortable
to
do
them?
Do
this
themselves
right
now,.
C
I
see
use
cases
all
over
the
map,
the
idea
of
heroes,
to
empower
scientists
to
run
these
workflows
themselves,
and
you
know
I
have
colleagues
who
build
software
in
a
notebook,
then
use
cara
to
submit
whatever
software
they
created
to
argo,
and
you
know
they
measure
metrics
performance,
whatever
they
need
and
then
based
on
the
experiments
they
will
adjust
their
code
and
ultimately
productionize
themselves
into
a
workflow,
okay.
Okay,
thank
you.
This
is
great.
F
Yes, I hope so.
F
So hi everyone, this is Karthik, and I'm joined by Shubham. We're maintainers of the CNCF sandbox project called LitmusChaos, which is a cloud native chaos engineering project, and we've been using Argo Workflows underneath the Litmus platform, as part of the Litmus platform, to run complex chaos scenarios.
F
We just want to talk about how that came about, what's going on right now in the community, and what we're planning to do going ahead in the next few months, near term. We'll do a very quick background on what chaos engineering is, what LitmusChaos is about, and a bit of history on how and why we embraced Argo Workflows, how Argo Workflows are being used today in the platform, and what changes have been made over standard vanilla Argo Workflows.
F
So
we
will
be
doing
a
very
quick
demonstration
of
our
workflows
being
executed
as
part
of
litmus
and
we'll
then
pick
some
questions.
I'm
sure
all
of
you
have
heard
of
chaos.
Engineering
most
of
you
might
even
be
practicing
it.
It's
basically
popularized
by
netflix
about
a
decade
ago,
and
you
can
see
some
standard
definitions
here.
It's
about
injecting
faults,
controlled
faults
and
identifying
weaknesses
in
your
system
through
fault
injection
and
the
idea
of
chaos.
F
Engineering
is
to
inject
these
faults
in
a
random
way
and
direct
unpredictable
way
and
there's
also
a
lot
of
paradigm
shift
in
the
recent
times
around
how
chaos
engineering
is
being
practiced.
Originally,
it
was
always
being
done
in
production,
but
the
practice
nowadays
is
to
do
it
a
lot
in
deep
broad
environments,
especially
with
the
advent
of
kubernetes.
A
lot
of
people
are
re-architecting,
their
applications
to
microservices
model
and
they're,
not
really
ready
to
go
and
do
chaos,
engineering,
their
fraud,
it's
being
done
as
part
of
staging
environments
or
cicd
pipelines,
etc.
F
So
this
is
a
simpler
definition
of
chaos,
engineering,
something
that
we
can
all
connect
to
in
the
current
times.
Just
like
a
vaccine-
and
we
inject
harm
willfully
inject
failures,
it
could
be
node
failures,
maybe
go
fill
your
disks.
Do
packet
drops
cause
cpu
or
memory
exhaustion
on
your
parts
and
nodes,
etc.
So
this
is
basically
something
that
you
try
to
do
as
part
of
fault,
injection
and
chaos.
F
Engineering
is
a
lot
about
injecting
fault
with
a
hypothesis
around
how
your
system
should
behave
when
the
fault
is
injected,
so
you
might
want
to
know
how
the
deviation
is
same
in
steady
state.
So
you
have
an
idea
of
how
the
system
should
behave,
what
we
call
as
a
steady
state
and
then
you
inject
the
fault,
and
then
you
see
some
deviation
there.
Maybe
you
expect
some
deviation
to
happen
and
that
is
under
limit.
Sometimes
you
don't
expect
any
kind
of
deviation.
F
So chaos engineering is basically like a vaccine. One of the reasons why we got this cloud native tag attached to chaos engineering: chaos engineering has been there for a decade now, and we added this subcategory especially because of the way the community was looking at chaos in the Kubernetes world. This pyramid here basically talks about the way your services are deployed in a typical Kubernetes environment. You have your platform services, you have the Kubernetes control plane services, and you're pulling a lot of tooling from the CNCF landscape: service discovery, storage, observability, your deployment environment. You would really want to check what's happening at your cluster if one of these components is failing, and you might want to repeat those tests over a period of time, regularly. So this is about why chaos engineering is needed and how it is really more important in the cloud native world. And the other aspect,
F
The
practical
aspect
to
cloud
native
chaos.
Engineering
is
a
lot
of
people
today
that
are
dealing
with
communities
are
used
to
a
certain
way
of
describing
their
applications
or
carrying
out
the
regular
tasks.
Everything
is
declarative,
it
is
all
in
a
yaml
file.
It
is
all
stored
in
a
git
repository,
its
guitar
is
controlled.
You
have
resources
and
resource
controllers.
That
is
the
way
you
basically
go
about
doing
things
in
day-to-day
work
day-to-day
communities
world.
So
we
wanted
to
basically
do
chaos.
F
Engineering,
the
same
way,
chaos
engineering
when
I
say
the
same
way
when
you're
describing
the
chaos
intent.
This
is
the
fault
you
want
to
do.
This
is
how
you
want
to
do
it
and
when
you're
trying
to
add
some
steady
state
validation
intent,
you
wanted
to
be
able
to
do
it
in
a
declarative
fashion
and
do
it
in
a
kubernetes
native
way,
keep
it
homogeneous.
F
Let
the
developers
and
salaries
have
the
same
experience
with
resilience,
testing
and
chaos
engineering
than
they
have
with
other
things
that
they
do
on
the
clusters.
That's
when
the
latest
project
was
born,
so,
typically
the
chaos
experimentation
process
has
this
flow.
You
identify
steady
state
conditions.
Reading
for
your
services,
you
introduce
a
fault.
F
You
check
whether
the
slos
continue
to
be
met.
If
they
are,
then
it's
resilient
to
this
fault.
You
go
on
to
the
next
scenario.
If
not,
you
found
a
weakness.
You
go
back
and
fix
either
the
application
business
logic,
or
maybe
you
fix
the
deployment
or
something
on
the
platform.
You
basically
go
adopt
some
better
practices
in
terms
of
deployment
or
tuning
your
infrastructure
to
ensure
that
you
are
going
to
be
more
tolerant
to
this
one
and
for
these
particular
aspects
in
this
flow,
these
blocks
we
tried,
came
up
with
some
custom
resources.
F
So
there's
a
custom
resource
that
describes
the
fault
and
there's
a
custom
resource
that
basically
applies
this
fault
on
some
service
running
on
your
system.
There's
an
operator
that
watches
these
crs
and
launches
some
runners
to
carry
out
the
fault,
business
logic
and
the
result
of
this
experiment
is
stored
in
another
cr.
They
have
been
kept
in
separate
resources
because
there
is
a
lot
of
scope,
rich
scope
for
improvement
in
terms
of
definition
of
these
processes.
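To make the CR split concrete, here is a rough sketch of applying a ChaosEngine (the CR that binds a fault to a target) with the Kubernetes Python client; the field values are illustrative placeholders, and the full schema lives in the Litmus docs:

```python
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

# Minimal illustrative ChaosEngine: the ChaosExperiment CR ("pod-delete")
# describes the fault, this engine applies it to a labeled deployment,
# and the verdict lands in a ChaosResult CR written by the runner.
chaos_engine = {
    "apiVersion": "litmuschaos.io/v1alpha1",
    "kind": "ChaosEngine",
    "metadata": {"name": "demo-chaos", "namespace": "demo"},
    "spec": {
        "appinfo": {"appns": "demo", "applabel": "app=hello-service", "appkind": "deployment"},
        "chaosServiceAccount": "litmus-admin",
        "experiments": [{"name": "pod-delete"}],
    },
}

api.create_namespaced_custom_object(
    group="litmuschaos.io", version="v1alpha1",
    namespace="demo", plural="chaosengines", body=chaos_engine,
)
```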
F
So
that's
about
chaos,
engineering
and
cloud
native
chaos,
engineering.
Why
we
wanted
to
introduce
the
ritmas
project
and
how
we
did
it
using
some
custom
resources
and
operators,
that's
the
core
of
the
project,
and
that
was
what
the
project
was
about
quite
some
time.
These
are
some
details
about
the
resources
and
what
they
define
probably
skip
that
in
the
interest
of
time.
F
Now,
when
we
started
thinking
about
argo
workflows
is
when
we
started
getting
feedback
from
people
using
litmus
in
the
community
for
doing
chaos
engineering.
So
there
are
complex
scenarios
where
you
would
probably
want
to
do
more
than
one
fault,
and
you
want
to
stitch
it
together
in
a
specific
way.
Maybe
multiple
faults
occurring
in
parallel
things
like
your.
F
You
already
have
a
degraded
system,
and
then
you
suddenly
have
a
part
that
got
evicted.
Let's
say
all
your
nodes
are
running
to
capacity.
How
would
you
simulate
that
kind
of
a
scenario
right,
so
you
would
probably
need
to
create
a
chained
failure.
You
had
a
fault
and
then
that
injected
something
into
your
system
and
it
carried
on
for
some
time.
F
You
have
another
fault
happening.
On
top
of
that,
which
probably
has
disastrous
results
like
you
might
want
to
really
check
how
your
tolerant
to
those
kind
of
scenarios
so
creating
complex
scenarios
was
one
need,
and
then
there
was
this
need
to
validate
application.
Behavior
during
the
automated
experiment
runs.
Sometimes
you
are
doing
the
fault
injection
and
you
are
also
peering
into
your
observability
systems.
You
looking
at
the
dashboards
and
you
know
exactly
what's
happening,
but
a
lot
of
times.
The
chaos
runs
as
a
background
service.
F
You
just
let
the
system
go
and
inject
failures
at
random
times
and
random
components,
of
course,
within
a
controlled
blast,
radius
that
you've
predefined,
but
still
in
a
randomized
way,
and
you
want
to
validate.
What's
happened
during
the
fault
injection
you
want
to
factor
in
the
steady
state,
validation
as
part
of
the
experiment
run
so
being
able
to
do
that
so
sometimes
steady,
state
validation
is
not
a
simple
step.
F
It
could
be
a
set
of
tasks
that
you
do
along
with
the
fault
itself,
so
that
necessitates
some
kind
of
a
workflow
logic
to
be
available
and
many
times
fault
injections
on
your
pre-product
systems
need
to
be
done
with
some
kind
of
load
generation,
some
kind
of
stressful
scenarios.
You
want
to
simulate
production
traffic,
you
want
to
run
low
cost,
widget
or
k6
io
or
any
such
load
generation,
just
as
your
fault
is
happening.
F
So
that
also
probably
needs
us
to
run
separate
tasks,
and
there
was
another
requirement
where
sometimes
you
need
some
pre-configuration
before
you
run
an
experiment
of
course
clean
up
after
the
experiment.
So
it's
these
are
all
better
structured
spot
of
workflow,
and
always
you
would
like
observability.
You
want
to
visualize
these
steps
running
in
your
system,
so
these
things
sort
of
came
in
as
the
requirements
and
necessitated
creation
of
workflows
and
instead
of
building
it
out.
We
didn't
reinvent
the
wheel,
we
wanted
to
look
at
existing
options
and
which
was
argo.
F
Some
of
the
reasons
here,
it's
a
kubernetes
native
solution,
which
is
also
the
philosophy
of
litmus,
and
we
had
several
executed
types.
We
settled
on
kts
api
because
we
had
people
running
these
workflows
on
different
clusters
having
different
runtimes
and
the
ability
to
define
source
artifacts
output.
Artifacts.
You
get
lot
of
reports,
the
experiments
that
you
run,
ability
to
visualize
and
a
lot
other
technical
reasons,
and
the
other
important
factor
was.
F
It
is
extremely
well
documented-
has
a
great
community
as
a
cnc
project,
and
there
were
initial
adopters
of
witness
who
were
already
using
the
crs
as
part
of
an
argo
workflow.
So
it
happened
to
be
an
organic
extension.
We
we
decided
to
bring
it
into
the
project
and
make
it
a
formal
part
of
the
religious
project
and,
as
we
started
doing,
that
there
were
some
parallel
requirements
that
were
emerging,
such
as
the
need
for
a
dashboard
or
a
control
center.
F
That
can
manage
chaos
across
a
fleet
of
clusters,
so
we
needed
the
workflows
to
be
created
and
managed
as
part
of
this
control
center
and
some
of
the
practical
challenges
that
we
had
or
some
of
the
additional
things
that
we
needed
to
do
while
adopting
our
workflows
was
the
ease
of
generating
the
workflows
themselves.
F
Sometimes
experiments
can
look
pretty
long
like
this.
You
have
a
chaos
engine
with
all
steady
state
steps
and
details
on
the
services
that
you
are
trying
to
target,
and
there
are
different
steps
that
you
would
do
in
terms
of
dependencies
installing
some
templates
and
running
the
tests.
Cleaning
up
things
like
this,
so
for
users,
just
just
as
fly
view
was
explaining
in
the
previous
talk,
wanted
to
make
it
a
little
simple.
So
how
do
you
simplify
the
creation
of
these
workflows
by
stitching
together
several
faults?
F
Each
of
these
faults
need
some
tuning
to
be
done,
so
we
had
to
come
up
with
a
proper
ux
and
a
proper
series
of
steps
to
be
able
to
do
that
and
make
it
simple
for
users
and
typically
with
argo
workflows.
You
have
containers
that
run
a
task
and
you
you
get
to
see
the
logs
of
the
task
that
has
been
executed,
but
in
case
of
litmus
the
chaos
engine
creation.
F
The
cr
creation
is
one
of
the
tasks
that
would
spawn
a
set
of
other
paths
which
actually
carried
out
the
old
business
logic,
something
that
we
have
referred
to
as
secondary
or
generated
paths,
and
we
would
like
to
see
the
logs
of
those
as
well.
So
that's
something
we
needed
to
achieve
and
because
of
workflow
is
now
like
a
scenario.
A
proper
scenario,
test
scenario,
test
cases
and
test
scenarios
are
maintained
in
some
kind
of
test
management
system.
F
Mostly
today,
people
are
using
github
as
well,
so
you
need
to
be
able
to
pull
the
workflow
from
source,
ensure
that
things
that
are
changed
in
your
source
are
also
reflected
on
your
control
center
of
the
care
center.
So
some
kind
of
sync,
with
the
gate
repositories
for
the
workflows
needed
to
be
maintained
and
the
workflow
status
is
actually
the
scenario
status,
and
that
needs
to
be
factoring
in
the
verdict
of
the
experiment
that
we
perform.
F
And
sometimes
when
you
switch
look
at
the
several
parts,
you
want
to
provide
some
criticality
or
weights
associated
with
experiment
depending
upon
how
mature
your
inferior
applications
are
to
the
workflows
might
want
to
execute
the
experiment.
All
the
same
probably
gave
it
a
lower
priority
than
some
other
experiment.
F
So
how
do
you
define
this
weights
and
those
are
then
used
as
part
of
some
resilience
score
calculation
in
the
platform,
so
we
added
some
extra
labels
and
annotations
into
the
workflow,
and
basically
we
were
going
to
track
metrics
of
experiments
and
we
wanted
to
know
the
parent
workflow
that
actually
ran
the
experiment.
So
there
was
some
level
of
lineage
that
was
added
into
the
workforce
to
track
the
metrics,
and
so
this
was
the
instrumentation
that
we
added
to
standard
x,
argo
workflows
and
that
resulted
in
what
we
call
now.
F
Endlessness
is
the
chaos
workflow
and
the
litmus
chaos.
Workflow
is
essentially
an
arc
workflow,
with
some
images
being
used
as
part
of
the
steps
to
influence
the
result
of
the
workflow
based
on
experiment,
status
and
experiment
result
and
they
are
being
stored
in
gate
repositories
and
picked
or
sourced
during
the
runs.
So
this
is
a
summarization.
F
This
is
an
architectural
overview
of
the
litmus
platform
and
probably
not
spent
too
much
time
on
it.
This
is
basically
the
control
center
which
helps
you
to
construct
workflows,
and
you
can
run
it
in
an
execution
environment.
We
call
it
as
execution.
Here's
execution
plane
which
happens
to
be
the
same
cluster
where
you
have
the
control
center
installed
or
it
could
be
a
different
cluster
or
different
name
space,
and
there
is
a
subscriber
here
which
takes
instructions
from
the
control
center
to
apply
a
chaos
workflow.
F
Then
you
have
the
workflow
controller
that
carries
out
individual
steps
in
the
workflow,
such
as
setting
up
the
dependencies,
creating
the
chaos,
resources
and
cleaning
up,
etc,
and
the
litmus
operator
picks
up
the
chaos
resource
and
carries
out
the
business
logic.
So
this
is
the
structure
that
we
are
using
this,
how
argo
workflows
are
being
used
in
litmus
I'll.
Probably
stop
at
this
point
and
like
show
me
to
the
demo.
F
Some
of
what
we
just
discussed
now
will
probably
be
reinforced
like
much
become
much
more
clear
if,
when
we
see
the
demo,
I
just
wanted
to
set
some
context
here
and
then
take
some
questions.
G
So yeah, before I go into the demo, I'll quickly give an explanation of the setup that I have here. I'm running my local minikube setup here and have installed Litmus: the Litmus portal and all the other accompanying deployments, like the subscribers, the event tracker, and the other operators that are required, including the workflow controller, in the litmus namespace.
G
So
this
is
my
control
thing
for
the
litmus
itself
and
then
I
have
a
demo
application
called
the
power
to
head
service,
that's
running
in
my
demoning
space,
and
this
is
basically
a
http
web
server.
So
that's
all
that
this
is.
It
has
a
service
with
it,
comparing
it,
and
I
have
a
monitoring
setup
running
on
the
my
training
space,
the
trauma
system
and
the
black
box
exporter.
G
So
if
I
open
up
the
litmus
portal
ui
I'll
quickly,
log
in
and
show
the
changes
that
I've
made
or
any
setup
that
I've
gone
through.
G
So
this
is
the
portal
that
the
homepage
that
opens
up
when
you
log
in
for
the
first
time
and
I'll,
give
a
quick
run
throughout
the
options,
and
you
know
menus
that
we
have
here
so,
on
the
left
hand,
side,
we
see
a
navigation
bar
where
we
can
see
the
different
tabs
and
see
the
different.
You
know
options
that
we
have
here.
G
So
this
is
the
home
page
that
comes
in
at
the
very
beginning,
which
gives
you
an
overview
of
the
project
that
you're
currently
in
you
have
the
option
to
switch
projects.
If
you
are
part
of
any
other
project,
but
currently
I'm
just
a
part
of
the
single
project
and
in
the
home
screen,
you
should
see
the
project,
the
workflows
itself,
the
agents
and
other
project
credit
details
itself.
So
this
is
just
an
overview
on
the
summary
of
the
project.
G
Then
comes
the
network's
workflows
or
the
chaos
workflows,
as
as
we
call
it
so
here,
we'll
see
all
the
workflows
that
we
have
run
for
the
project,
the
schedules
that
we
have
in
this
tab,
so
these
are
this-
can
be
crown
workflows
or
single
runways.
G
In
this
case,
I
have
only
one
single
workflow:
that's
you
know
just
a
single,
as
you
can
see
here,
and
I've
run
it
only
once
and
if
I
click
on
this
I'll
get
a
beautiful
visualization
inspired
from
the
workflow
visualization
itself,
so
we
can
click
on
on
these
items
and
see
the
vlogs
and
stuff,
but
we'll
do
it
a
little
bit
later
on.
So,
besides
that,
we
have
the
chaos
agents
tab
where
we
have
the
where
we
get
all
the
agents.
G
So
karthik
was
talking
about
a
subscriber
that
basically
gets
all
the
requests
for
your
you
know.
Workflow
runs
and
sends
all
the
data
back
to
the
main
control
thing.
So
we
see
all
the
agents
that
are
the
subscribers
that
they
have
connected
here.
So
currently
I
have
just
a
single
subscriber
in
my
setup.
You
can
go
ahead
and
connect
more
setup
more
subscribers.
G
If
you
want
through
the
correct
agent
option,
then
we
have
the
hub
where
we
basically
collect
all
different
types
of
experiments
and
three
different
workflows
that
we
have
so
currently
there's
only
a
single
hub.
As
you
can
see,
this
is
the
public
hub.
That's
available
in
the
litmus
repo,
but
if
you
have
your
private
hubs-
and
you
know,
if
you
have
your
custom,
experiments
or
workflows
that
you
want
to
import
directly
into
the
portal,
you
can
go
and
create
a
new
hub
and
go
ahead
with
the
flow.
G
Then
we
have
the
opportunity
tab,
which
is
basically
somewhat
similar
to
grafana
itself.
You
can
add
in
dashboards-
and
you
know,
have
an
integrated
observability
feature
directly
into
the
portal
itself,
but
currently
I
have
not
set
up
I'll,
definitely
use
rafana
for
our
demo,
but
yeah.
Besides
the
dashboards,
you
can
also
see
inbuilt,
you
know,
analysis
or
analytics
for
the
workflows
itself,
so
this
is
per
workflow
basis.
Analysis,
it's
more
useful
for
cron
workflows.
Over
a
period
of
time.
G
You
can
see
how
your
application-
or
you
know,
cluster-
is
performing
for
the
same
workflow
itself.
So
currently,
there's
only
one
run,
so
there's
not
much
to
see
here
and
finally,
we
have
the
settings
tab,
which
gives
you
a
lot
of
options
depending
on
whether
you're
admin
or
not.
But
in
general
you
have
your
own
personal
options
to
change
like
the
details.
You
know
there's
a
name
and
stuff,
then
you
have
team
management
options
for
the
project
itself.
G
If
you're
an
admin
of
the
project
to
get
those
options,
then
you
have
user
management
options
again.
If
you're
an
admin,
then
you
can
add
new
users
and
stuff,
then.
Finally,
the
main
important
things
that
we
are
going
to
be
talking
about
is
the
detoxing.
So
for
each
project
we
allow
a
particular
repository
to
be
configured
as
a
detox
repository.
G
So
in
this
case
I
have
already
configured
my
private
repository,
that's
here
as
the
source
dip
source
and
just
show
how
you
can
do
it
I'll
just
edit
it
so
in
this
to
actually
configure
the
positive.
G
What
we
have
to
do
is
basically
have
to
provide
the
url
and
the
branch
that
you
want
to
use
as
your
source
and
also
add
in
the
access
token
or
ssh
key,
so
that
you
know
the
portal
can
write
to
your
repository,
also
because
in
our
system
we
not
just
sync
from
the
repository
itself,
but
we
also
write
back
to
the
repository.
So
so,
in
cases
where
you
are
creating
a
workflow
directly
from
the
portal,
the
portal
will
automatically
sync
that
workflow
into
your
report.
Configured
get
repository.
G
So
that's
how
we
need
to
access
key,
so
I'll
just
leave
it
as
it
is
for
now.
Also
we
have
the
image
repository
option
where
you
can
specify.
You
know
if
you
have
a
custom
repository
where
you
have
all
those
runner
images,
the
different
helper
images
that
we
use
in
our
workflows,
chaos
workflows.
G
So
you
can
do
that
by
specifying
custom
values
for
this,
but
I
already
I'm
going
to
use
the
open
source
ones
that
are
already
there
publicly
available,
but
you
can
do
that
and
you
can
use
that
in
your
workflows
automatically.
So
that's
also
available
for
you.
So
going
back
to
the
main
thing:
that's
a
work
to
itself.
G
So
in
the
litmus
workflows
tab
you
get
the
option
to
scale
the
move
through
and
when
you
click
on
that,
you
are
brought
to
this
wizard,
where
you
are
taken
through
a
few
steps
where
you
can
choose
which
target
you
want
to
run
the
workflow
on
and
how
you
want
to
tune
your
workflow.
So
I'll
just
go
through
the
steps
and
that'll
probably
be
easier.
G
So,
for
example,
I
have
only
one
agent
or
one
subscriber
here
so
I'll,
select
that
as
a
target
and
move
on
and
then
you
can
create
your
workflow,
so
we
provide
a
few
different
options
to
you
know
generate
your
workflow,
so
the
first
thing
that
we
have
are
the
predefined
workflows
that
are
present
in
the
hub
chaos
hub
that
we
had
configured.
So
in
this
case
I
just
have
the
single
hub.
That
is
a
public
sub.
G
So
if
I
click
on
one
of
these
and
I'll
just
show
you
how
it
looks
so,
if
I
click
on
here
and
continue
to
the
next
focus
settings,
you
get
the
option
to
change
the
name,
and
you
know
the
description
and
basically
a
metadata
about
the
whole
flip
sense,
and
I
click
on
next
and
in
this
page,
it's
basically
where
you
tune
your
workflow
here
we
provide
a
pretty
cool,
visualization
or
kind
of
a
simulation
of
the
workflow,
how
it
will
look
after
its
run.
G
So
here
you
can
see
how
your
workflow
is
going
to
perform
like
performance,
and
you
know
how
the
steps
are
going
to
be
executed
and
stuff.
So
here
we
can
see
that
there's
a
series
of
nodes
that
are
there
series
of
steps
under
there,
but
at
the
end
of
the
workflow
we
have
two
parallel
experiments
or
two
parallel
steps
that
are
being
run.
G
So
if
I
want
to
change
that
sequence,
I
can
click
on
edit
and
you
know
make
some
changes
here
like
I
can
drag
and
drop
in
between
these,
and
you
know
get
this.
You
know
visualization
updated
and
all
these
actual
workflow
manifests
itself
updated,
also
without
actually
having
to
edit
the
code
directly.
So
that's
mostly
it
for
the
edit
part,
but
besides
that
we
also
have
options
to
toggle
or
edit
our
experiments
itself.
So
let's
say
you
have
in
this
case,
I
have
selected
a
part,
delete
experiment.
G
I
can
edit
it,
and
this
will
provide
me
some
tunables
for
the
experiment
specifically.
So
in
this
case,
first
of
all,
I
get
the
metadata
details
like
extreme
name,
the
context
and
stuff,
but
if
I
click
on
next,
I
get
the
options
for
the
target
application.
G
So
here
we'll
here,
the
wizard
provides
you
with
the
settings
to
mention
which
particular
application
or
particular
deployment,
the
particular
resource.
You
are
trying
to
run
the
experiment
on
so,
for
example,
the
first
one
is
the
app
and
is,
if
I
click
on
that,
I
will
get
a
list
of
all
the
namespaces
that
are
there
in
my
current
cluster.
So
the
portal
is
right
now
the
subscript.
The
agent
is
right
now
running
and
cluster
scopes,
for
I
have
access
to
all
the
namespaces.
G
But
if
you're
running
in,
let's
say
namespace
scope,
then
you
will
only
be
getting
the
access
for
that
particular
interface.
But
in
this
case
I
in
cluster
scope.
So
I
have
access
to
all
the
new
spaces
here.
So
I
can,
let's
say:
click
on
demo,
which
is
the
link
is
that
we
were
using.
We
are
using
for
the
demo,
then
I
can
select
on
the
app
kind,
which
is
basically
the
resource
that
I
want
to
target.
G
So
in
this
case
I'll
just
select
deployment,
but
you
have
other
options
like
stateful
sets,
rollouts
and
stuff,
and
once
it's
like
that,
you
will
have
to
choose
which
deployment
so
in
so
how
we
do
that
is
using
the
app
labels
so
clicking
on.
That
will
also
give
you
a
suggested
list
of
app
labels
that
you
can
use.
So
here
what
we
are
doing
is
we
are
fetching
all
the
app
labels
that
are
there
in
the
particular
name
space
for
the
particular
resource
type
that
was
mentioned.
G
Sorry,
the
app
link
right
so
hello
services,
app
label
and
finally,
we
have
the
option
to
do
some
cleanup
stuff.
So
in
this
case,
we'll
just
leave
it
as
retained,
so
retain
is
basically
like.
After
the
experiments
is
run,
it
will
keep
all
the
resources
like
leave,
all
the
resources
on
the
cluster
itself
and
not
clean
up,
but
we
also
have
a
global
cleanup
called
revertkiovs.
That's
talk
that
you
can
toggle
on
the
workflow
itself,
so
it
will
clean
up
all
the
resources
at
the
end
of
the
workflow
itself.
G
So
I'll
just
keep
this
retaining
in
this
case.
For
now,
and
then
click
on
next
will
take
you
to
options
of
probes,
which
is
basically
how
you
define
your
steady
state
hypothesis.
In
this
case,
we
have
a
predefined
http
probe,
which
we
can
add
also,
like
you
can
add
new
probes
if
you
want
and
there's
a
proper
documentation
available
for
the
probes
which
we
can
share
in
the
chat
if
you're
interested,
but
basically
we
have
a
various
like
a
few
different
types
of
probes.
G
So
if
I
click
on
this,
you
see
the
different
types
of
probes
that
are
available.
Http
cmd
k8s
and
from
geographers
so,
for
example,
the
http
pro
would
be
basically
making
http
request
or
to
a
particular
endpoint
that
is
specified
in
this
case.
Let's
say
the
endpoint
is
already
mentioned
as
this
one
and
this
will
allow
and
then
we
can
just
check
on
different
criteria.
So
let's
say
I
have
a
criteria
to
check
whether
I'm
getting
a
response
code
of
200.
G
Then
I
can
just
set
it
as
this
and
and
I
need
to
set
some
other
timeouts
and
we
try
values
also.
But
the
just
is
that
it'll,
the
probe
will
basically
make
sure
that
that
particular
service
is
returning.
This
particular
response
code,
that's
expected
from
it
during
the
experiment
or
after
the
experiment
is
runners.
Whenever
we
want
to
actually
run
the
experiment,
run
this
probe
so
to
select
that
like
when
the
probe
is
actually
executed,
you
can
choose
on
the
probe
mode,
so
in
this
case
set
to
continuous.
G
So
it
will
go
through
throughout
the
experiment
itself,
but
you
can
also
set
on
a
start
of
the
experiment
or
end
of
the
experiment,
and
things
like
that.
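For reference, the probe being described corresponds roughly to this fragment of a ChaosEngine's experiment spec; the field names follow the Litmus probe documentation, while the URL and timings are placeholders. Shown here as a Python dict rather than the YAML the portal generates:

```python
# Illustrative httpProbe definition, as it would appear (in YAML form) under
# spec.experiments[].spec.probe in a ChaosEngine.
http_probe = {
    "name": "check-hello-service",
    "type": "httpProbe",
    "httpProbe/inputs": {
        "url": "http://hello-service.demo.svc.cluster.local:80",  # placeholder endpoint
        "method": {
            "get": {
                "criteria": "==",       # compare the response code...
                "responseCode": "200",  # ...against the expected 200
            }
        },
    },
    "mode": "Continuous",               # run throughout the chaos injection
    "runProperties": {
        "probeTimeout": 5,              # placeholder timings
        "interval": 2,
        "retry": 1,
    },
}
```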
G
So
that
was
mostly
it
for
the
tuning
part,
oh
yeah,
so
I
I
missed
the
tune
experiment
section
where
we
specify
how
long
the
experiment
goes
on
for
so
this
section
is
specific
to
experiments
itself.
G
The
previous
sections
were
generic
for
all
experiments,
but
this
one
will
provide
you
specific
options
that
you
would
be
able
to
tune
for
specific
experiments,
so
we
have
different
options
for
different
types
of
experiments,
so,
like
karthik
was
talking
about
node
level,
experiments
and
the
network
network
experiments
and
stuff
like
so,
each
of
these
experiments
will
have
different
configurations
that
you
would
like
to
tune.
So
you
can
add
those
chainables
here
and
finish
your
work,
work
for
tuning
and
then
continue
to
the
next
step,
and
this
is
where
we
finally
select
our
weightages.
G
This
is
a
instrumentation
that
we
have
done
over
the
workflow.
So
to
get
our
residency
score,
we
need
to
mention,
or
do
we
need
to
specify
what
weights
or
weightages
each
of
the
experiments
that
we're
running
in
the
workflow
hold.
G
So,
for
example,
in
this
case,
I
just
have
a
single
experiment,
but
if
we
had
two
different
experiments
I
could
have
different
weightages
for
each
of
them
and
depending
on
the
success
or
failure
of
each
of
those
experiments,
we
generate
a
resiliency
score
which
mentions
how
resilient
your
complete
your
target,
application
or
this
environment
is
right,
and
that
is
that
depends
on
the
weightages.
G
So
if
I
have
a
weight
for
pretty,
if
I
set
up
at
a
low
weight,
then
if
this
experiment
fails,
that
does
not
affect
your
residency
score
too
much.
But
whereas,
if
I'm
setting
it
to
10
and
this
experiment
fails,
then
the
resiliency
score
of
your
experiment
or
the
workflow
will
reduce
to
zero.
Basically,
because
I
said
there's
only
one
experiment
here
so
moving
on
to
next,
I
can
I
have
the
option
of
scheduling
how
I
want
to
schedule
it.
So
we
have
two
options
here.
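As an illustration of the idea (the exact formula Litmus uses lives in its source; this sketch just shows a weighted pass rate, which matches the behavior described: one failed experiment carrying all the weight drives the score to zero):

```python
def resiliency_score(experiments: list[tuple[int, bool]]) -> float:
    """Weighted percentage of passed experiments.

    `experiments` is a list of (weight, passed) pairs; weights are the
    per-experiment weightages chosen in the portal (e.g. 1-10).
    """
    total = sum(weight for weight, _ in experiments)
    passed = sum(weight for weight, ok in experiments if ok)
    return 100.0 * passed / total if total else 0.0

# One experiment, weight 10, failed -> score drops to 0, as in the demo.
print(resiliency_score([(10, False)]))             # 0.0
# Two experiments: a failing low-weight one barely moves the score.
print(resiliency_score([(10, True), (2, False)]))  # ~83.3
```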
G
We
can
either
go
for
a
schedule
now,
which
will,
which
is
basically
just
going
to
immediately
schedule
the
workflow
as
it
is
on
the
on
the
target
cluster,
but
we
can
also
do
a
recurring
schedule
which
is
basically
going
to
schedule
it
with
the
front
to
crunch
up.
So
we
are
in
this
case
we
are
internally
using
the
current
workflows.
So
for
the
schedule.
G
Every
week,
every
month
and
depending
on
which
option
you
choose,
it
can
also
specify
the
minutes
the
time
if
it's
or
every
day
or
something,
if
it's
every
month,
then
you
can
specify
the
particular
date
of
the
month
and
then
the
time
you
want
to
run
it
on
and
then
finally,
you
can,
let's
see
like
look
at
the
overview
of
the
workflow
and
then
run
it
right,
but
I'll
not
run
this
one.
G
I
already
have
a
workflow
that
I've
constructed
specifically
for
this
demo,
so
I'll
just
quickly
go
and
take
that
up
and
I'll
run
that
and
finally,
we'll
look
at
the
visualization
part,
so
yeah
so
going
back
to
the
creation
part
again,
we
have
the
import
option
besides
the
workflow
creation
from
the
predefined
workflows-
and
I
can
click
on
this
and
create
a
workflow
here
itself,
so
I'll
just
click
on
recent
and
upload
my
m.
G
So
this
is
the
workflow
that
I've
already
generated
for
the
demo
and
this
experiment
is
this:
workflow
is
going
to
basically
be
running
a
quad
lead
experiment
on
my
demo
application,
so
the
application
that
I
had
previously
shown
so
here's
my
application
so
I'll
be
running
a
particular
experiment
on
this
application,
but
at
the
same
time
I'll
also
be
running
a
load
test
parallelly
on
it.
And
finally,
we
clean
up
the
whole
thing.
G
So
I'll
just
go
ahead
and
schedule
the
expand
and
it
takes
some
time
to
schedule
it
because
it's
now
pushing
the
workflow
to
your
git
repository
first
and
then
it's
going
to
execute
on
the
cluster.
So
now
it's
already
executed,
as
you
can
see
it's
running
while
it's
running,
I
can
quickly
show
you
that
there
was
a
decent
commit
to
my
github's
repo
that
I'd
configured.
So
here
we
can
see
that
there's
a
comment
made
20
seconds
ago
from
admin
at
the
latest
chaos.
G
So
if
I
click
on
it
and
I'll
see
that
there
are
a
few
other
offers
also,
these
are
the
workflows
that
I
ran
previously,
but
the
most
recent
one
is
20
seconds
ago.
So
this
is
the
workflow
that
I
executed
just
now
and
we
can
see
the
whole
workflow
itself,
the
manifest
of
the
workflow
and
the
steps
that
are
there
and
just
to
show
that
we
are
what
we
are
running
so
in
the
steps
in
the
template
section.
We
have
the
steps
that
you
have
mentioned,
so
the
workflow
is
gonna.
first install the experiment, then it's going to... did I run... one second, I think... oh yeah, I think I ran the wrong workflow, but it's still fine. So basically the workflow will show up here, but what we needed was a slightly different workflow. So what I'll do is terminate the workflow; we also have the option to terminate workflows directly from the portal itself. So I'll quickly terminate the workflow and run the correct one.
G
So
yeah
demos
don't
go
well
without
glitches.
I
guess
so
I'll
just
quickly
select
the
proper
workflow,
which
I
think
this
is
the
one
yeah.
I
think
this
is
the
one.
G
Yep
yeah
so
now
I'm
running
the
correct
flow
flow,
which
is
actually
going
to
run
the
quadratic
experiment
on
the
application
and
run
the
loop
test
pattern
with
it.
So
again,
going
back
to
the
repository,
I
if
I
refresh
the
page
I'll,
see
another
file
being
created
here
so
which
is
the
correct
one
this
time?
No,
it's
not
yeah
this
one,
so
yeah
yeah.
So
this
one
is
the
correct
one.
As
you
can
see,
the
steps
are
installed.
Chaos
experiment.
G
Then
we
do
a
pod,
build
experiment
and
finally,
we
do
the
load
test,
which
is
running
parallely
actually
with
the
power
build
as
we
can
see.
So
it's
running
fairly
with
the
port
delete
and
then
we
are
just
going
to
revert
the
cameras
and
clean
up
the
whole
cluster,
so
yeah
so
going
back
to
the
visualization
here.
So
I
can
see
the
experiment
running
right
now
and
the
load
test
and
the
experiment
has
both
started.
And
if
I
go
back
here,
I
should
probably
see
yeah.
G
The
pod
has
been
deleted
and
starting
up
again-
and
here
we
can
see
the
custom
yeah,
the
workflow
odds
that
have
been
generated
for
the
experiment
itself
and
clicking
on
the
pod
itself
will
get
a
few
details
of
the
experiment.
So
these
are
the
instrumentations
that
we've
added
apart
from
this
tunic
of
the
tuning
part.
So
we
allow
users
to
not
only
get
the
logs
of
the
experiment,
the
aquapod
that's
generating
the
experiment,
but
also
the
experiment
for
itself.
G
So
in
this
case
we
can
see
the
details
of
the
logs
of
the
arc
report.
But
then,
if
we
scroll
down,
we
will
see
logs
from
the
experiment
itself,
so
it
would
say
that
you
know
started
chaos,
experiment
for
delete
and
all
the
configuration
details
are
also
in
there.
So
that's
there
and
in
the
load
test
one.
If
I
go
here,
we
see
that
the
load
test
is
running
and
currently
it's
failing,
because
there's
only
one
instance
of
the
application
running
so,
okay,
so
notice
has
actually
completed,
I
think
yeah.
G
So
we
can
see
that
the
html
request
failure
is
more
than
25
percent.
So
all
our
requests
that
we're
making
is
pretty
much
going
to
be
vested
because
the
application
is
down
due
to
the
quadratic
experiment,
that's
being
run
here.
So
this
is
a
weak
scenario
where
our
application
isn't
resilient
enough
to
the
experiment.
G
So
we'll
see
that
if
our
hypothesis
is
correct,
so
we
also
have
a
http
probe
here
that
checks
for
the
liveness
of
the
application,
while
the
experiment
is
running.
So,
if
the
hypothesis
is
correct,
then
the
experiment
should
fail
and
we
should
have
a
field
node
here
once
the
experiment
finishes.
G
So
let's
just
wait
for
a
few
seconds
yeah.
So
the
experiment
has
failed.
As
you
can
see,
and
going
to
the
logs.
We
can
also
see
the
you
know
where
why
it
failed.
So
we
can
see
the
probe
has
failed
for
that
particular
experiment
and
that's
why
the
particular
experiment
completely
failed,
and
besides
that
we
also
have
the
kiosk
results.
G
So
this
is
also
being
fetched
from
the
target
where
we
are
running
the
experiment,
and
here
we
can
see
the
final
experiment
results
that
we
are
generating,
though
this
is
the
actual
litmus
experiment
result
resource
that
is
returned.
So
here
also,
we
can
see
the
result.
The
probe
status,
which
is
mentioning
that
the
particular
probe
has
failed
and
the
verdict
has
mentioned,
has
failed
so
yeah.
So
we
have
completed
the
expand
and,
besides
that,
what
we
can
do
now
is.
F
Yeah, sorry, I think we're over time, and I hope the audience has got an understanding of how the chaos workflows are being used in Litmus. So we can probably stop at this point and take questions, if any.
A
I have a question; this has been bugging me for a while, actually. I want to perform a chaos test where a service becomes unavailable. Does Litmus support that?
A
Yeah, that's right, yeah. Basically, my HTTP requests to that service fail, but I've actually got persistent TCP connections to it as well, so I've actually got open sockets which data is coming over, so it's kind of become unavailable midway during that TCP dialogue.
F
Yeah,
I
think
we've
had
scenarios
where
that
has
occurred
as
part
of
these
experiments.
There
are
also
some
chaos,
experiments
that
are
being
built
specifically
to
cause
service
and
availability,
either
by
a
hundred
percent,
god
loss
or
by
probably
corrupting
this
service
object,
taking
a
backup
and
then
just
removing
it
from
the
system
kind.
H
I think there's also another question about historical data. How do you do that?
F
Sorry, Alex, would you please repeat?
A
Say I'm running an experiment regularly, every day: how do you keep the history of that? Is it stored in a database?
F
Right, there is a database stateful set replica that's running as part of the control plane, so all these runs that you're seeing here on the dashboard are actually recorded there, and then there is an ability to run some analytics over it, to see how your runs compare over a period of time.
F
Maybe
those
runs
have
been
executed
against
different
builds
or
lasers,
or
maybe
the
same
experiment
with
the
similar
set
of
tables
have
been
run
across
environments.
Qa
is
staging
one
production,
for
example,
and
you
get
to
compare
them.
It
is
stored
in
a
database.
You're
right
for
the
chaos
results,
as
well
as
the
workflow
details.
A
Okay,
that's
cool,
so
sean
mccarthy.
Thank
you
very
much
for
doing
the
presentation.
F
The Slack is a self-invite workspace, so I'm pretty sure a lot of us are on there.
A
Great
stuff,
okay,
so
thank
you
very
much
for
the
presentation.
A
What
I'll
do
is,
if
you
want
to
ask
me
more
questions
about
this,
I
know
we've
had
a
lot
of
people
had
to
leave
for
their
elect
for
11
o'clock
meetings.
You
can
obviously
ask
about
that
and
I'll
be
putting
a
recording
up
on
to
youtube
today
or
tomorrow,
for
people
to
be
able
to
watch
the
rest
of
this.
A
Thank
you
all
very
much.
I
hope
you
have
a
wonderful
day.