From YouTube: OpenShift Commons ML Briefing: ML/AI Data Pipelines on Kubernetes Daniel Whitenack (Pachyderm)
Description
Daniel Whitenack (Pachyderm) discusses how to enable Machine Learning and AI Data Pipelines on Kubernetes and OpenShift with the Machine Learning on OpenShift SIG of OpenShift Commons.
Learn more at http://docs.pachyderm.io/en/latest/getting_started/getting_started.html
Join OpenShift Commons https://commons.openshift.org#join and join the conversation
A: Alright, perfect. So I'm excited, because Michael and Carol and others have already given part of the motivation for what I'm going to talk about, so I can talk a little bit quicker. Thank you for that, and thanks for the great presentations. I'm going to describe a little bit about Pachyderm today and how we're enabling machine learning, AI, and other pipelines on top of Kubernetes and on top of OpenShift. These are production pipelines that we have going with people in a lot of different spaces, so I'll describe that a little bit more. What I'm going to do is describe, as motivation, what a typical machine learning pipeline looks like for our users. Some of this motivation has already been given, so I can jump over it pretty quickly.
A: Okay, great. So let's start talking about machine learning pipelines. I shamelessly stole David's format from the last meeting to illustrate some of these things, just as a reiteration of some of the things he said and some of the things that were already said by Michael. A lot of emphasis is put on training and inference when people think about machine learning and AI, and a lot of people see the value of AI in their business. But when it comes down to actually integrating machine learning and AI into their infrastructure, and building out pipelines that can be managed over time and scaled, there's a whole lot more that needs to be thought through, and this is really where we see the challenge. As David said in the last meeting, there's a whole lot more than training and inference. There's a whole host of things related to pre-processing and feature engineering; there's model export and optimization; there are data transforms when we do inference, possibly post-processing and visualization. And we might not even be using the same frameworks, tools, and languages for all of these steps.
A: What we would often see is people trying to use file names and their own sort of tooling to figure this out, in a not very successful way. Saying, you know, "this is my feature set for training, from this config, with this timestamp, dot CSV, and that goes into my model training." It's just not sustainable or scalable over time. And in addition to just having those pieces of data, there's an element of the sequence that these things happen in, right? I actually need to move data.
A: We need to manage all of this data, and we need to get all of it to the right place at the right time. This is really what Pachyderm is seeking to do on top of Kubernetes. So now let's transition to thinking about what we actually need to do on top of Kubernetes to enable this in some sort of sane way.
A: We know that one way to get these pieces of processing off of our laptop and running in a dependable, reproducible way somewhere else is by containerizing them, right? But we don't want people spreading these Docker images around, and data scientists logging into a bunch of machines and running docker run. This is really where Kubernetes has come up, right? We're going to have a bunch of Docker images that we need to deploy on a diverse set of resources.
A
We
need
to
deploy
those
in
a
portable
way
in
a
reproducible
way
and
kubernetes
allows
us
to
get
those
things
running
on
a
set
of
nodes
in
it
in
a
very
nice
way,
but
actually,
if
we,
if
we
think
about
this
now
like
in
terms
of
the
pipeline
that
we
just
talked
about
these
individual
stages
of
processing,
aren't
isolated
right
and
they're.
Actually
not
the
only
thing
that
we're
managing
we're,
also
managing
data
all
right.
So
let's
say
that
we
have
all
our
data
in
an
object
store.
So
now
we
have
all
these
pieces.
A: We have our stages that are running as containers, as pods in Kubernetes, and we have data. Somehow we need to solve the problem of getting the right data to the right ones of those pods, and then collect the corresponding output from those stages. And we need to do that in a sustainable way. Like Michael was saying, we need to do this in some sort of version-controlled way, so that we can:
A: Remember what we've done, debug what we've done, maintain what we've done, and have audit trails for compliance. And somehow we need to string these things together in a series of events: I want my pre-processing to run on certain data, I want that to output other data which is used in training, and then maybe that outputs a model which is used in inference.
A: So all of these problems are additional things on top of Kubernetes that we somehow need to enable, similar to what was already mentioned about solving certain things on top of Kubernetes, like service mesh, or secret management with Vault or something like that. So the things we need to do are: get the right data to the right code.
A: All of this is the extra stuff that we're really concerned with. So how do we do this? Well, our solution is Pachyderm. If you're not familiar with Pachyderm, it's an open source data pipelining and data management layer for Kubernetes. Thinking about the other layers that were already mentioned, whether that's service mesh with Istio or secret management with Vault, Pachyderm is providing this layer on top of Kubernetes that does specific tasks to accomplish these sorts of pipelines.
A: Those tasks are both the pipelining piece and the data management piece, and those things are done together in a unified way. The different components of Pachyderm, the core features that enable these sorts of workflows, are: first, data versioning. Like I've already mentioned, we need a way to version our training data sets, our parameters, and our visualizations.
A: That versioning lets us go back in time and run specific processing on specific data for reproducibility, but we also need it as we work on larger teams with data scientists and data engineers. We of course utilize containers for analysis, so that gives us the flexibility to run any languages or frameworks, whether that's TensorFlow or Python or Julia or OpenCV, or whatever it is, or just a bash command. Then we want data scientists to be able to develop these stages of processing in whatever languages and frameworks they want, but also to scale those. So we have this concept of distributed pipelines, where each stage of our data pipelines is individually scalable. You can automatically parallelize each stage of a Pachyderm pipeline, and Pachyderm will take care of the data sharding, getting the right data to those workers, and then gathering all the results on the other end as well.
A: Okay, so because this is a technical audience, I definitely wanted to give you a sense of how we actually enable this, and I'm happy at the end, when we're going through Question and Answer, to give more details on any of these things. But basically, again, we have Kubernetes as the foundation for all of this, and a backing object store.
A: Great, so let me give you a little bit of a sense of what this actually looks like in the real world. So I can go over here. This is the Pachyderm dashboard, one way to interact with a Pachyderm cluster. Under the hood here, again, I have a Kubernetes cluster, I have an object store, I deployed Pachyderm to that Kubernetes cluster, and now I'm interacting with it via our dashboard. There are other ways to interact.
A: You know, via the CLI and the language clients, but this is one way. You can see here that I've deployed a couple of data pipelines to the cluster. I'm going to start on the right here with the machine learning one, and then I'll emphasize the other one at the end, just to illustrate some flexibility. The first thing you can see here is that I have these blue icons.
A: Each of these blue icons represents a versioned collection of data. Here I have a training data set, and in this machine learning pipeline I'm just doing the "hello world" of machine learning, the iris demo. So I have this CSV training data, and if I look at this repo, or actually this one is probably a better example: these are attributes that I'm putting in, that I want to do inference on. I can actually see up here at the top some information about this repo.
A: It tells me that this is a versioned collection of data. I can have branches of this data, and I see multiple commits into this repo, so I can see that in a previous state I actually just had one file here, and then in the most recent state I have two files; I put an additional file in. For any number of ways that you change the data, all of it is automatically versioned. So that's the first core principle, data versioning.
A: The second piece is around data pipelining and those independently scalable pipeline stages. That's what these other icons represent: these are processing stages in my pipeline. This first one, this model stage, is doing model training on that training data set and then outputting a persisted version of the model. You can see that the way this pipeline stage is defined, it's just a Docker image that my code is running in, and then I'm just running Python code.
A: To do this training. I can show you at the end, I don't want to take time now, but this Python code isn't pulling in any sort of special Pachyderm libraries or anything. It's the same type of Python code that you would run locally to do your training, or in a Jupyter notebook or whatever it is. Then that processing stage produces output, and that output is versioned in an output repository on the other side of this pipeline.
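The model stage described here is declared in a pipeline spec. As a rough sketch (the pipeline name, image, command, and glob pattern below are illustrative assumptions, and the exact spec fields vary across Pachyderm versions), a training stage might look like:

```json
{
  "pipeline": { "name": "model" },
  "transform": {
    "image": "example/iris-train:latest",
    "cmd": ["python3", "/code/train.py"]
  },
  "input": {
    "pfs": { "repo": "training", "glob": "/" }
  }
}
```

Pachyderm mounts the input repo inside the container under /pfs/&lt;repo&gt; and versions whatever the command writes to /pfs/out, which becomes the stage's output repository.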
A: So you can see here I've output a pickle file that serializes my model, and then I've chained that into the next pipeline, which is inference. That inference pipeline is running another Python script that pulls in that model and those attributes, and does inference with the model to produce my eventual results, which are the species of those iris flowers. One thing to note here is that I have all this data versioned and I have these processing stages created.
A: Pachyderm is aware of what data is changing, because it's versioning it, right? So if I go over here and put new data in, and now I'm going to do this a second way, via the command line: I'm going to put a file into my attributes repo on the master branch, this third file, which I'll actually see if I go back here. Okay, this is automatically updated, right? My new file is in that repo, and you can see that I've made three commits now.
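The command-line step above goes through Pachyderm's pachctl client against a running cluster. A minimal sketch (the file name is hypothetical, and the exact flag syntax differs between pachctl versions):

```shell
# Add a third attributes file as a new commit on the master branch
pachctl put file attributes@master:/attributes_3.csv -f attributes_3.csv

# Each put creates a commit; the repo's version history is visible with:
pachctl list commit attributes@master
```

These commands require a deployed Pachyderm cluster; the dashboard view reflects the new commit automatically.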
A: All of that is automatically versioned. Not only that, Pachyderm was aware that I added some data that hadn't been processed yet, so it knew that my results weren't up to date with the current state of my input data. It actually went ahead and ran this inference pipeline again, and it updated my results such that now I have my third result. So yeah, this is how we think about our pipelines.
A: We think about them as DAGs of data, where you put data in at the top and Pachyderm triggers all the stages downstream that need to run to update your results. The final core principle of Pachyderm I wanted to emphasize here is the data provenance element. If you remember, data provenance was the idea that we can tie any specific result, or actually any specific version of data, to all the pieces of data and processing that led to that result.
A: So if I look at this particular inference repo, which holds my results, I have four commits. And if I look at one of those, I can see all the data associated with it, but I can also see all of the upstream commits that actually contributed to that result. So there was training data.
A: There were attributes, there was a master model version that was used in that inference, and there were a couple of specs, which are the way that your pipelines are defined, so those represent the processing associated with that particular result. Again, this is definitely something that we view as very important. The final thing I'd like to emphasize is that we try to keep this flexible, in the same vision as Kubernetes.
A: We want you to be able to deploy this anywhere and scale it on any infrastructure. We also want you to be able to use any types of data and any types of framework. You can see here in this pipeline, just to give you a sense, I'm actually processing image data: I'm using OpenCV to do edge detection on that image data. This is just to illustrate that you're not constrained by the type of data or the type of processing.
A: The processing you can use is anything you can run in a container, and the data is anything that can be stored in an object store, which I would say is pretty flexible. To circle things back here before we jump into questions, I just wanted to give a couple of future directions that we're going in, and also give you some resources where you can find out more. So let me go ahead and present here again: future directions.
A: We already have deployments on top of OpenShift, and we're actually working on another production deployment right now that includes some pretty interesting components. So I'm very happy to be part of the SIG and happy to see things moving forward on that front. We're also really excited to further cooperate with OpenShift and others in the future. And one thing I wanted to draw your attention to: just yesterday we submitted a proposal for a little bit more seamless integration between Kubeflow and Pachyderm.
A: We have an example here of running distributed TensorFlow via TFJob as a stage of a Pachyderm pipeline, and we're working to improve that functionality as we go, as well as working with people like NVIDIA to better support things like the DGX and other boxes like that. I'll also draw your attention to a few resources: I'm going to send out the link to the slides, and of course I would recommend maybe watching our KubeCon talk.
A: I did a little bit more of an advanced workflow there that included GPUs and TensorFlow. Of course, you can run all of these machine learning examples locally in Minikube and try them out. We have a public Slack channel and docs where you can get help, and of course, feel free to reach out to me anytime. So I'll wrap up at this point and see if there are any questions.
C: Yeah, thanks for the demo, that's really cool. So you talked a lot about how, when you rev your data, it reruns the DAG, and I just wonder if you have the corresponding case where, let's say, I rev some code in my feature extraction. That also implicitly implies a rerun, right?
A: So, actually, I can show you that here. If I go back here, this is the job specification that I used for my training, and actually I have another version of it which doesn't use an SVM model but uses an LDA model. So when you update your code, let's say I ran something, updated my code, and committed that to Git, all you would have to do is update your pipeline, and that will, like you said, trigger the exact same thing.
A: So if I look at what jobs are running (oops, sorry for the wrapping here), that actually automatically started a new job that retrained my model, and it's rerunning inference with that. Now, of course, that's configurable. Some people don't want to rerun if they update their model or something, and that's totally configurable.
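A sketch of that code-rev flow with pachctl, under the same caveats (the spec file name is hypothetical, and the reprocess flag may differ by pachctl version):

```shell
# After rebuilding the image or editing the code referenced by the spec,
# update the pipeline; this triggers retraining and downstream inference
pachctl update pipeline -f model.json --reprocess

# Inspect the automatically started jobs
pachctl list job
```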
C: [question inaudible]

A: Yeah, you're exactly right. So if I look at the commit structure, each of these commits in a repo has an ID associated with it. If you want the latest from a certain branch, you can just reference that branch name, but you can also reference any commit ID from history to get that particular version of the data. And you can do that either via the dashboard or the CLI, but you could also do it like...
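For illustration, reading by branch head versus pinning to an exact commit might look like this (the commit ID and file path are made up, and the syntax is version-dependent):

```shell
# Latest version of a file on a branch
pachctl get file attributes@master:/attributes_1.csv

# Exact historical version, pinned by commit ID
pachctl get file attributes@b54fbb6e7f8a:/attributes_1.csv
```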
B: As Carol had said in the sidebar here, the visuals in this are just wonderful, and I'm quite interested in the topic of data provenance as well. I think that's something we don't talk about a lot, because it's a really hard thing to do, so kudos for getting that into your story and into your workflow. I think that's going to be very important for a lot of the folks that are trying to utilize this. Does anyone else have any questions, or is there anything further you want to add?
D: [question inaudible]

A: Oftentimes what we see in this sort of scenario, and this actually goes the same for ingressing data from a database or something like that, is that we actually have pipelines that pull data in, rather than necessarily being driven by pushes. So basically, what you could do is trigger a pipeline that would pull in the data from that external source.
A: You know, maybe save a timestamp associated with that, but then on the output side it would version your transformed version of that data set as the output of that pipeline. In a similar way, if you were to pull in from a database, each time you make that query to pull the data in, the results of that query could be versioned in Pachyderm.
E: Do you have any plans, or do you have anything in the pipeline, around serverless? My thinking was that, in the same way that data scientists would use a notebook to essentially do their work, for data engineers and developers it would be really cool to have a serverless integration there. It doesn't really matter which serverless framework, but anything around that?
A: That's a great question, so I'll answer in a couple of ways, I guess. We've added a lot of flexibility in our pipeline spec recently. This is pretty small, so let me pull it up here, but one of the things that's relevant to this is the scale-down threshold that's mentioned here. Let's say that you're running batch jobs every once in a while, or you only want to spin up these pods when they need to do the work and then scale them down.
A: That's what this field is meant to control, in the sense that if you want to spin up a thousand workers to do some batch processing in parallel, but you only want to do that every Friday, you don't want to keep all of those workers up all the time. So you can spin those up immediately and then scale them right back down afterwards.
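In Pachyderm pipeline specs of this era, that behavior was expressed roughly as follows (field names and the duration format here are a best-effort sketch, not authoritative, and may differ by version):

```json
{
  "pipeline": { "name": "batch-processing" },
  "parallelism_spec": { "constant": 1000 },
  "scale_down_threshold": "600s"
}
```

With a spec along these lines, Pachyderm keeps up to 1000 workers while data is being processed and scales the workers down after the configured amount of idle time.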
A: The other thing we have in terms of services is actually fairly recent. If you wanted to serve versioned data, or handle versioned data in some way as a service, that's what this is meant to include. I would say this is an experimental feature at this point, but it's meant to handle some of those use cases like you mentioned.