Description
02:00 JupyterFlow Demo with HongKun Yoo
14:30 Argo Dataflow preview with Alex Collins
A
Okay, is everybody ready? Am I ready? I hope so. Good morning, and welcome to the June Argo Workflows and Events community meeting. I hope everybody's very excited at the moment that they're all going to be able to get out of their houses, get back on with their lives and enjoy a bit of the summer sunshine.
A
I know that I am. We're going to have two demos-slash-previews today. One will be from, and correct me if I pronounce this incorrectly, HongKun, who is going to demonstrate a little piece of software he's written called JupyterFlow, which is one of the really interesting bits of software we've discovered inside the Argo Workflows ecosystem. I always really enjoy things that build on top of Argo Workflows; I love to see what people are doing with it.
A
So we can understand the kind of different directions that things like data processing and machine learning are going in. And then I'm going to give a demo of a proof of concept, or preview, alpha piece of software in argoproj-labs called Argo Dataflow. It's oriented around data processing, and I'm hoping it will be quite interesting to anybody using Argo Workflows or Argo Events, although you're actually not going to see very much in the way of Argo Workflows or Argo Events at the moment.
A
Go for it. I'll stop sharing my screen so you can take over.
B
Okay, first of all, thank you for having me. I want to share with you JupyterFlow, a better way to scale your ML job. My name is HongKun Yoo, and my target audience is as follows. First of all, please understand my English, because English is not my first language.
B
Okay, the target audience is as follows: data scientists who are not fully familiar with Kubernetes but want to use the power of Kubernetes, MLOps engineers who want to provide an efficient ML environment, and anyone looking for a better ML tool. So I want to introduce an ML tool for data scientists.
B
My name is HongKun Yoo and I run a blog named Coffee Whale, mainly about machine learning and Kubernetes. I work at LINE, a messenger widely used in Asia, including Japan, Taiwan and Thailand, where I work as a data platform engineer based on Kubernetes. Before I start: this project is not related to my company at all, it is totally my personal project, and I do not speak on behalf of my company.
B
Okay, everyone knows that Kubernetes is great and that it's very useful for ML projects. It handles model management, node management, job scheduling, resource management and monitoring. It's very good and everyone loves it. So a lot of people use Argo Workflows for data pipelines or machine learning pipelines, but I think there is no free lunch.
B
Not everyone is happy with Kubernetes, especially data scientists who are not very familiar with software engineering skills. I came up with two main reasons for this issue: first, containerization, and second, writing manifest files is a little difficult. In more detail: you already know what containerization is, but I want to emphasize that every time I update my machine learning model, I have to rebuild my container image, push it again and run the container.
B
I know how to do it, but it was really tiresome. Every time I had to do it, I thought: is there a better way to do this? The other thing is YAML hell. Not every time, but I have to write YAML to run my ML job, and somebody who is familiar with it might find it easy, but for data scientists it won't be very easy to learn: they need to learn about Kubernetes and very detailed stuff about it.
B
There are also unicorns who are actually good at both data science and software engineering, but not everyone is that unicorn. So I want to introduce JupyterFlow, with which you can run your ML job right away on Kubernetes.
B
So I prepared a live demo, so you can go to this website and try it yourself. I will show you on my computer. Can you see my JupyterHub page? Yep? Okay, I will share.
B
...with my GitHub account. So this is just a regular JupyterHub platform, and first you have to install JupyterFlow. This is not all you need, but this is the only thing that the data scientist needs to do. Oops.
B
You can also run a script. I made a train.py; this is a regular MNIST Keras training file, just a simple training script, and you can run it on your JupyterLab.
B
It works well, but in a lot of cases we want to scale our machine learning job, so we want to run this script on Kubernetes. So how could we do that? We just run JupyterFlow: jupyterflow run, and then write python train.py, and then the actual workflow will appear.
B
The Argo Workflow from JupyterFlow starts, and then, after it finishes, there will be an MNIST file in my JupyterLab. You can also write a more complex DAG workflow.
B
So say you have three commands, a hello-world again, and this is similar to the Airflow dependency expression: you run the first job and then the second, and the first job and then the third. So you run jupyterflow run with the -f option and workflow.yaml, and then you can...
B
You can run more complex workflows.
B
This is it, so you can run this kind of more complex stuff. So that is JupyterFlow. I think JupyterFlow is an interface to Kubernetes for data scientists. Data scientists would normally need to use Kubernetes directly, but if you use JupyterFlow, it will translate your machine learning code into a Kubernetes Argo Workflow YAML file, so I think JupyterFlow is some sort of translator for data scientists.
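(A minimal sketch of that translation idea in Python. The manifest shape follows Argo Workflows' DAG template, but the helper, its parameters and the default image are invented for illustration; this is not JupyterFlow's actual code.)

```python
# Illustrative only: build an Argo Workflow DAG manifest (as a dict) from
# shell commands and Airflow-style "1 >> 2" dependencies.
import yaml

def build_workflow(jobs, dags, image="jupyter/base-notebook"):
    tasks = []
    for i, cmd in enumerate(jobs, start=1):
        tasks.append({
            "name": f"job-{i}",
            "template": "run",
            "arguments": {"parameters": [{"name": "cmd", "value": cmd}]},
            # depend on every job that points at this one in a "(a, b)" pair
            "dependencies": [f"job-{a}" for a, b in dags if b == i],
        })
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": "jupyterflow-"},
        "spec": {
            "entrypoint": "dag",
            "templates": [
                {"name": "dag", "dag": {"tasks": tasks}},
                {"name": "run",
                 "inputs": {"parameters": [{"name": "cmd"}]},
                 "container": {"image": image,
                               "command": ["sh", "-c", "{{inputs.parameters.cmd}}"]}},
            ],
        },
    }

# e.g. three commands, where job 1 fans out to jobs 2 and 3
manifest = build_workflow(
    jobs=["python train.py", "echo hello", "echo world"],
    dags=[(1, 2), (1, 3)],
)
print(yaml.safe_dump(manifest))
```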
B
So, this is why I started this project. I also run machine learning code, and using Kubernetes was a great idea because it handles scaling, node management, scheduling, all kinds of good stuff. But using Kubernetes and Kubeflow directly was a little bit tiresome for me, because every time I changed my code I had to rebuild my image and run my code again. That was a little troublesome, so I wanted a better, more efficient way to run my model.
B
Whenever I write my code on JupyterLab, I want to run it right away. That is why I made JupyterFlow. So, to wrap up: this is my personal open source project and it's at an early stage of development; it has bugs and lacks features, but I think there's still no de facto standard ML tool in this field. Maybe the current one could be Kubeflow, but I think Kubeflow is slightly difficult for me to run in a very lightweight way.
B
So I think JupyterFlow has great strength and opportunity in this area of training machine learning models. These are the blogs that are the sources for my presentation.
A
I certainly have a question. Okay, I want to know: is it written in Python, and does it use a transpiler to convert it into Workflow YAML under the hood?
B
Yes, it reads Python and translates it to a YAML file, and it throws the YAML file at Kubernetes using the Kubernetes Python SDK. Nice, nice.
B
Actually, I have an architecture diagram. It's really simple: there is JupyterHub and there is Argo, and if you fetch the pod spec in JupyterFlow, you can get the image and the storage volume. So JupyterFlow just does this: it fetches this information, builds the YAML file and throws it at Kubernetes, and the Argo Workflow controller runs the rest of it. Cool, okay.
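(A minimal sketch of that flow with the Kubernetes Python SDK, assuming in-cluster access and illustrative names such as the pod name and namespace; it is not JupyterFlow's actual code. It reads the notebook pod's spec, reuses its image and volumes, and submits a Workflow custom resource for the Argo Workflow controller to pick up.)

```python
# Illustrative sketch only: reuse the notebook pod's image/volumes and submit
# an Argo Workflow for the controller to run.
from kubernetes import client, config

config.load_incluster_config()  # or config.load_kube_config() outside the cluster
api = client.ApiClient()

namespace = "jupyterhub"  # hypothetical
pod = client.CoreV1Api().read_namespaced_pod("jupyter-hongkun", namespace)
notebook = pod.spec.containers[0]

workflow = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "metadata": {"generateName": "jupyterflow-"},
    "spec": {
        "entrypoint": "main",
        # serialize to camelCase dicts so the manifest matches the CRD schema
        "volumes": api.sanitize_for_serialization(pod.spec.volumes or []),
        "templates": [{
            "name": "main",
            "container": {
                "image": notebook.image,  # same image as the notebook pod
                "command": ["sh", "-c", "python train.py"],
                "volumeMounts": api.sanitize_for_serialization(notebook.volume_mounts or []),
            },
        }],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="argoproj.io", version="v1alpha1",
    namespace=namespace, plural="workflows", body=workflow,
)
```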
A
And does that say Argo Dataflow? I can't see what it's sharing.
A
Let's try that. I'm going to do that again, because I don't trust it... there we go, there we go, the green line has appeared around it. Can you guys see my slides? Yeah? Excellent, thank you, Barna. So today I'm going to talk a bit about a new project we've been working on, to help build out a solution for some of our internal needs at the company that the core team works for, and the project's called Argo Dataflow.
A
This is quite a complicated example, but in this example there is a container writing data to a NATS Streaming subject. That's then read and run through two filters, one that filters out cats and one that filters out dogs, and written to another NATS Streaming subject, and finally there's some processing, and the processing on the cats and the dogs is different, and that's written to an output topic. That's pretty common: you're reading from a data source and writing into a data sink.
A
Optionally, you can read from more than one source and you can write to more than one sink, so it also allows you to do a kind of fork-join processing on those items of data. The blue icons on this, we don't quite have a name for them, but you can call them a processor, and we have a number of processors out of the box; we'll talk about that in a second. Okay.
A
So what are the currently supported sources? We can have a cron schedule as a source, which produces an item of data every, you know, minute or two minutes; a Kafka topic; a NATS Streaming subject, which is basically the same as a topic, both of which are durable; or an HTTP endpoint, so you can put an HTTP service in front of a step within your pipeline and consume data from that. And then you can go through...
A
We've got several built-in operations that you can use. Filter and map, which I don't think I even need to explain, I'm sure you can guess what they are; and flatten and expand, which turn out to be popular operations: flattening large structured data down to key-value pairs, and expanding key-value pairs back up into structured data, which makes it easy to process data. And grouping, so grouping data as it comes in and then emitting single chunks of data that have been grouped together.
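(As a rough illustration of what flatten and expand do to a message, a small self-contained Python sketch; it only shows the dotted key-value idea, not Argo Dataflow's actual implementation.)

```python
# Illustration of the flatten/expand idea: nested JSON <-> dotted key/value pairs.
def flatten(obj, prefix=""):
    out = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, path))
        else:
            out[path] = value
    return out

def expand(flat):
    out = {}
    for path, value in flat.items():
        node = out
        *parents, leaf = path.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return out

msg = {"pet": {"kind": "cat", "name": "Oreo"}, "count": 2}
flat = flatten(msg)   # {'pet.kind': 'cat', 'pet.name': 'Oreo', 'count': 2}
assert expand(flat) == msg
```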
A
A container just runs a container image that you've specified; maybe that could be called image. And a handler allows you to actually write your code directly into your pipeline, it allows you to code it in the YAML, and it'll compile that code for you and run it for you. There's no need to actually build and publish an image, which I think we all know is a bit of a pain.
A
Oh, and then we sink it to pretty much the same kinds of places as the sources, so Kafka, NATS or an HTTP endpoint, and also a log sink. Anybody who's familiar with Argo Events will know there's a... sorry, a sensor? I don't think I mean sensor, I mean something else.
A
Yeah, a sensor... there's one called log which allows you just to write those messages out, which is intended for debugging, because obviously one of the challenges in a kind of distributed system with a lot of messages is that you need to be able to trace your messages through the system. And then you can scale your processors up by using either HPA, so if you want to scale up and down using CPU or memory, that's one option, or scale them manually, or scale them based on the number of pending messages in the queue.
A
Okay, so you can't really get away from YAML if you're in the cloud native universe. So this is an example of a pipeline specified in YAML.
A
It contains some pretty conventional metadata, somewhat inherited from Argo Workflows, that allows people to describe their pipeline and who owns it, so who's responsible if it goes wrong. A lot of people have trouble determining who owns a particular resource inside Kubernetes, so there's an out-of-the-box annotation for that, and a description as well. And then you can see that this particular specification...
A
STAN is an acronym for NATS Streaming, so it's NATS backwards, and then there's a second step called b, which reads from the subject and writes to an output topic. But nobody likes that, so we've written a nascent Python library that you can use to write your pipelines in Python in a kind of builder format. So this is actually not the same pipeline.
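(A rough sketch of the two-step pipeline just described, written here as a Python dict rather than YAML; the apiVersion, annotation keys and field names such as cat, sources, sinks, cron, stan and kafka are recalled from the early Dataflow API and should be treated as approximate.)

```python
# Rough sketch of a two-step Dataflow Pipeline resource, expressed as a dict.
# Field names are approximate; treat this as illustrative, not authoritative.
pipeline = {
    "apiVersion": "dataflow.argoproj.io/v1alpha1",
    "kind": "Pipeline",
    "metadata": {
        "name": "example",
        "annotations": {  # hypothetical keys for the description/owner metadata
            "dataflow.argoproj.io/description": "Example pipeline from the demo",
            "dataflow.argoproj.io/owner": "argoproj-labs",
        },
    },
    "spec": {
        "steps": [
            {   # step "a": emit a message on a schedule, write it to a STAN subject
                "name": "a",
                "cat": {},
                "sources": [{"cron": {"schedule": "*/3 * * * * *"}}],
                "sinks": [{"stan": {"subject": "a-b"}}],
            },
            {   # step "b": read the subject and write to an output topic
                "name": "b",
                "cat": {},
                "sources": [{"stan": {"subject": "a-b"}}],
                "sinks": [{"kafka": {"topic": "output-topic"}}],
            },
        ],
    },
}
```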
A
This one contains two steps, both reading from a cron schedule and passing to a handler, which is a Python function specified right here in the source code. If you've used Kubeflow you'll probably recognize this way of doing things, of having a pointer to a function. And then there's a second step which does kind of the same thing, and this one also showcases a retry policy, which I'll come back to shortly.
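(A hypothetical, runnable stand-in for that builder style, to show the authoring experience; the PipelineBuilder class and its methods are invented for illustration and are not the actual argo-dataflow Python library.)

```python
# Invented, minimal stand-in for a builder-style DSL; it only collects a dict,
# it is NOT the real library.
class PipelineBuilder:
    def __init__(self, name):
        self.spec = {"metadata": {"name": name}, "spec": {"steps": []}}

    def step(self, name, source, handler, retry="Always"):
        self.spec["spec"]["steps"].append({
            "name": name,
            "sources": [source],
            "handler": {"code": handler.__name__},  # the real DSL ships the code itself
            "retryPolicy": retry,                   # "Always"/"Never" is approximate
        })
        return self

    def dump(self):
        return self.spec

def greet(msg, context=None):
    return b"hi " + msg

p = (PipelineBuilder("example")
     .step("a", {"cron": {"schedule": "*/3 * * * * *"}}, greet)
     .step("b", {"cron": {"schedule": "*/3 * * * * *"}}, greet, retry="Never"))
print(p.dump())
```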
A
Okay, so it's Python, so you can use it in a Jupyter notebook. This again is very new; I'm very interested to get people's feedback on how they might want to author their pipelines. This decision has been based on the fact that we know people don't really like YAML, they want to write things in Python.
A
That came back very strongly in the survey earlier this year, and so here is an example of running a pipeline in a Jupyter notebook. We also ship some Prometheus metrics out of the box, so each step within your workflow, sorry, your Dataflow pipeline, will emit metrics such as the number of messages going through, the rate at which they're going through, the number in flight and the number of replicas, so you can easily monitor your system to make sure it's processing.
A
This is especially useful if your processing pipeline uses downstream systems, so it has to reach out to another service to get some kind of data; this will help you understand that. You can also look at pending messages and other things to see if data is building up in your system, and build very standard Wavefront or Grafana dashboards with very standard alerting on that, so you know when things go wrong.
A
Basically, there's a quick start YAML that you can apply into a namespace called argo-dataflow-system, and that'll create the controller whose responsibility will be to execute the pipelines. By default we've provided basically a namespace-scoped install, so it'll just go into a single namespace and only listen to pipelines and steps created in that namespace, and then there's a user interface that comes with it as well.
A
It may be familiar to some of you, I don't know if you've seen this particular user interface style before. There's basically a new option on the left-hand side, above the events option, for pipelines, and it lists all the pipelines. I'm just going to go through now; the pipelines are all provided as a series of examples, so it's quite easy to go through them, and the examples start at 101 for easy examples and go up to 301 for advanced examples, kind of showing you each of the new features.
A
So you can have a look at the Python that produces this particular pipeline, and this particular one reads from a cron schedule, cats it to the output, so it's an identity map operation, then writes it to a log, and this will then be represented in the user interface here, showing the sources it's reading from and the places it's writing to. Now, cron and log are quite useful for experimentation because, of course, Kafka is typically quite heavyweight, and NATS is just moderately heavyweight.
A
So if you're experimenting, this can be quite useful to help you. If you just click on a particular step within the pipeline, you've got a couple of different tabs containing some useful information. The first one just contains an overview of the status, telling you how many replicas you're running and the last time this particular step was scaled up.
A
It also contains some information for each of the sources and sinks, giving you the total number of messages, the current messages per second, transactions per second, and an example of a recent message, so you can see what's going through a particular step. And it's the same with the sinks here as well, you get a similar kind of information. Typically, with a cat operation you might expect the number of messages to be the same.
A
You can also have a little look at the logs here, so there's a couple of tabs for the different types of logs. You can see this is the main container inside this step, and, like in an Argo Workflow where you have a wait container sidecar, this has a sidecar whose responsibility is reading and writing messages to and from the topics.
A
Let's go back... I'm just going to go back to the examples, as I'm just going to talk a little bit about them. This second one is an example of a pipeline with two nodes, or two steps, in it. I'm just going to make sure that's created.
A
This one reads from a Kafka topic, performs some kind of processing, writes it to a NATS Streaming subject and then writes it to an output topic, and the status on this will just show you... you can see this is currently grey, because it's waiting to actually schedule this particular pipeline, because my cluster probably doesn't have enough space to do it at the moment.
A
We talked a little bit about things like filter, flatten, expand and map. I'm just going to dive into filter, which shows you a filter operation. This particular pipeline will filter to only include messages that contain the word capybara, which is a type of rodent, I think, and it reads them from a Kafka topic and writes them out.
A
This expression syntax is the same expression syntax we now use quite commonly in Argo Workflows, so it should be familiar to a lot of people.
A
Let's just go back here as well. We talked about flatten and expand, so flattening down to dot-separated key-values and expanding back out of them. Then a map operation; I'll dive into this one as well. This one's a map operation which basically prepends the string "hi" to the message that comes through the pipeline.
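(A tiny, illustrative Python equivalent of those two steps, assuming messages arrive as byte arrays as described below; this is just the semantics, not the expression syntax Dataflow actually evaluates.)

```python
# Semantics of the filter and map examples, on byte-array messages.
def keep_capybaras(msg: bytes) -> bool:
    # filter step: only pass messages mentioning "capybara"
    return b"capybara" in msg

def say_hi(msg: bytes) -> bytes:
    # map step: prepend "hi " to every message
    return b"hi " + msg

messages = [b"a capybara appears", b"a cat appears"]
out = [say_hi(m) for m in messages if keep_capybaras(m)]
print(out)  # [b'hi a capybara appears']
```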
A
Now, I haven't talked about the format of messages. The lowest common denominator message format is a byte array: NATS Streaming only uses byte arrays and doesn't have the ability to add metadata in the way that Kafka does, or even HTTP requests, so messages are all currently byte arrays.
A
One open question at the moment is whether we need some kind of internal format for messages, to support something like, for example, the CloudEvents message format, or something else that allows us to add additional metadata to each message, so it'll be interesting to see what people's thoughts are on that. Then we have an auto-scaling pipeline, an example of the auto-scaling here. The way that scaling works out of the box is you can define a replica ratio, i.e. the number of replicas...
A
...we should be running for each N pending messages. So, for example, if you think you're able to process 500 messages a second, then you might have a ratio of 500, and then, when the number of pending messages goes to a thousand, you'll be running two replicas; one thousand five hundred, three replicas; two thousand, four replicas; and so forth, up to a bound. But it also just implements the standard scaling that's used by HPA, so you can also just use HPA-based scaling.
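(The ratio arithmetic as a one-line sketch, matching the numbers just described; the upper bound and the floor of one replica are hypothetical.)

```python
import math

def desired_replicas(pending: int, ratio: int = 500, max_replicas: int = 4) -> int:
    # e.g. 1000 pending / ratio 500 -> 2 replicas; 1500 -> 3; 2000 -> 4 (bounded)
    return min(max_replicas, max(1, math.ceil(pending / ratio)))

assert desired_replicas(1000) == 2
assert desired_replicas(1500) == 3
assert desired_replicas(2000) == 4
```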
A
That's if you want to scale it using something more sophisticated, which is typically more complicated to set up, because you'll probably need to install the metrics server and HPA as well, so that may be more complicated than a lot of people need by default.
A
So this is a Python handler: basically, you define the code that you want to run in terms of a handler function and a runtime, and what that will do is build and compile that code for you. So this is suitable for very simple use cases, where you can probably write the code that you want in 20 or 30 lines and you don't have any significant external dependencies; every dependency you have is kind of out of the box. This will probably be familiar to people who've used AWS Lambda as well.
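(A sketch of what such a handler can look like; the exact signature in the Dataflow runtimes may differ, so treat the context argument as an assumption.)

```python
# Sketch of a Dataflow-style Python handler: 20-or-so lines, no external deps.
# The (msg, context) signature is assumed here; check the runtime docs.
import json

def handler(msg, context=None):
    # msg is a byte array; parse it, enrich it, and return a byte array
    record = json.loads(msg)
    record["greeting"] = "hi"
    return json.dumps(record).encode()

# quick local check
print(handler(b'{"pet": "capybara"}'))  # b'{"pet": "capybara", "greeting": "hi"}'
```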
A
Then we have a Git option, and these are both intended to address the difficulty of having to build your own image. The Git option will actually check your code out of Git and run it with a particular image. So this one here checks out this particular repository, checks out the sub-path examples/git on branch main, because your branch is called main these days, and then will actually run that inside this particular image, and I think it's usually pretty interesting for us to have a look at what this contains.
A
It's slow, and I'm not even on the corporate VPN today, so I'll have to wait for it to load. Here we go, so here's an example of the Git step. Basically, this allows you to provide something very similar to a Dockerfile here, and in this example I'm basically providing an entrypoint, which is the code to run. I need to provide a handler function, as I mentioned before, and then a main function, which is kind of copy-and-paste code.
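(An illustrative layout for such a checked-out entrypoint, assuming the handler/main split described above; the file layout and the main loop are invented for the example, not the actual Dataflow contract.)

```python
# Illustrative entrypoint for a Git step: a handler plus boilerplate "main".
# The run loop below is a stand-in; the real runtime wires the handler to the
# step's sources and sinks for you.
import sys

def handler(msg: bytes, context=None) -> bytes:
    return b"hi " + msg

def main():
    # stand-in loop: read lines on stdin, hand each one to the handler
    for line in sys.stdin.buffer:
        sys.stdout.buffer.write(handler(line.rstrip(b"\n")) + b"\n")

if __name__ == "__main__":
    main()
```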
A
It's possible to run a pipeline to completion, or not, so there are two ways to run a pipeline. One that runs to completion would be one where you're just processing a finite amount of data, and the way that you run to completion is simply by exiting zero in your container to indicate you've finished. In a pipeline that runs to completion you can mark steps as a terminator, so if a particular step within your pipeline runs to completion, the whole pipeline is terminated.
A
So if you've got three or four steps, you might have the steps passing information between one another, or from the first step through to the last step, and if the last step exits then the whole pipeline will be shut down for you automatically. Or you can have them run forever: if you're processing an infinite, unbounded stream of data, then obviously your pipeline would run ad infinitum. And we talked a little bit about containers, so I'll skip over that as well.
A
Simply, you can specify the container that your pipeline runs, and then I've got a couple of more sophisticated examples, including a word count one as well. So again these can be seen in the user interface; here's the Go one, you can see that's running there, processing a small number of messages at the moment, and you can have a look at some of the more complicated ones. This is an error-handling pipeline demonstrating retry policies, so this is actually kind of an interesting example where the steps aren't connected.
A
Typically you'd expect to see the steps connected, but I disconnected them in this one. The top sub-pipeline reads from a cron source, and the handler itself randomly emits an error, so it's running: if random of two equals zero, then raise an exception, otherwise just return the value. The first one has a retry policy, because if you retry randomly you'll ultimately get success; that uses a back-off policy, a retry back-off. And this second one here has a retry policy of never, so it never retries.
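(The randomly failing handler, roughly as described; the raise-or-return logic is the point, and the signature is assumed as before.)

```python
import random

def handler(msg, context=None):
    # roughly the demo's behaviour: a coin toss decides failure vs. success;
    # with a retrying policy this eventually succeeds, with "never" about
    # half the messages fail
    if random.randint(0, 1) == 0:
        raise Exception("deliberate random failure")
    return msg
```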
A
So you can see around half the messages fail, because it's a coin toss as to whether they fail, and that's reported here. That's kind of useful, because it allows you to have a step that's robust to your container being killed by Kubernetes, which can happen at any point. Anybody who's written workflows knows that one of the biggest challenges of writing a workflow is getting your retries set up successfully, because, of course, any particular step within a workflow can randomly fail.
A
It could be killed by Kubernetes because it wants the resources for something else, and Dataflow has that same kind of challenge. So it has the ability to read reliably from a topic and retry if there are any kind of issues with the processor; for example, if the processor is unable to connect to a downstream service, then it can just retry, and that means it can give a very high reliability guarantee: not quite 100%, but pretty close to 100%, you know, within one one-thousandth of a percent.
A
If you want to find out more, you can find Argo Dataflow in argoproj-labs. We're really keen to get people's feedback and to think about the use cases, so we can make sure we build out the right kind of features and capabilities for what people need. We've got some idea from speaking to our own customers, but we know the feedback from the community is really invaluable.
A
Hopefully you'll see that some of the concepts have been borrowed or adapted from both Argo Workflows and Argo Events, you know, some of the best concepts around starting and stopping pods and containers and doing things reliably on Kubernetes, which is an interesting challenge in its own right.
C
Maybe, did you want to explain some of the use cases that we're trying to solve by writing this?
A
Yes, so we're not looking to replace stream processing tools like Apache Beam, and we're not looking to replace tools like Argo Events and Argo Workflows; I think it sits in between those two. The very general use case is processing items of data from some kind of topic or subject, but we're aiming at operational analytics processing.
A
So in our initial case that would be events about application deployments, about requests going through API gateways, that need some kind of pre-processing or adaptation or, you know, enrichment, those kinds of operations, before they are put into some kind of AI tool, or some kind of data processing tool, to extract anomalous events from the list. That would be the initial one, but anything that processes streams of events is targeted.
A
Okay, so I will just go back to our menu for today. I hope you guys enjoyed the demos that we had today. If you want to learn more, we'll include some links in this document, so you can go and read the slides yourselves and also be able to ask more questions. You can obviously come and ask questions on the CNCF Slack, which we've now migrated to, in the argo-workflows channel.
A
I think there's an argo-dataflow channel too, but there's not much going on there yet. If you are interested in presenting at the community meeting, we always love to see people talking about what they're doing. It's always great to see the tools that people have built on top of Argo Workflows and Argo Events, but it's also really good to see people's other use cases and what they're doing with it; we all find that really interesting, and we appreciate it.
A
Okay, thank you very much for joining today. Oh, and the other thing they always ask us: is this being recorded? Yes, it's being recorded, and the video will be available on YouTube later today.