From YouTube: OpenShift Commons Briefing: Continuous Development and Deployment of AI/ML Models with Kubernetes
Description
OpenShift Commons Briefing
Continuous Development & Deployment of AI/ML Models with Containers and Kubernetes
Guest Speakers:
Will Benton (Red Hat)
Parag Dave (Red Hat)
Peter Brey (Red Hat)
hosted by Diane Mueller (Red Hat)
2020-06-04
A: All right, everybody, welcome back to another OpenShift Commons briefing. This week we seem to have an AI/ML theme going on, so we're really pleased to have Will Benton and Parag Dave from Red Hat with us today to talk about continuous development and deployment of AI models with containers and Kubernetes. I saw a preview of this earlier internally at Red Hat and thought it would be a great thing to bring here and have a discussion around, so I'm going to let the guys introduce themselves, talk about it, and do some demoing, and at the end we'll have live Q&A. If you have questions, you can type them in the chat, either here in BlueJeans or on Twitch, YouTube, Facebook, or wherever you're watching the live stream, and we'll relay them back here. So with that said, Will, please take it away and introduce yourselves, and let's rock and roll.
B: I'm Will Benton. I'm an engineering manager and an engineer at Red Hat in the Office of the CTO, and my focus has been on helping Red Hat's customers build machine learning systems in the cloud with Kubernetes and OpenShift. One of my particular passions lately has been figuring out how to use contemporary infrastructure to make data scientists' lives easier: how can we improve the machine learning workflow as well as run the system in production?
C: Hi, I'm Parag. I'm a member of the product management team in the developer tools BU. My focus has been aligned with what Will is looking at, which is: how can we enable developers to create and deliver applications from dev to test to prod in the fastest way possible, so you increase delivery frequency in the most efficient and optimal way? And then, what changes when it's a specific kind of workload, whether it's an AI/ML application versus an IoT application versus a traditional Java application?
C: So let's start with a few preambles around how businesses really benefit from AI/ML. We've talked with organizations globally, across the board, to understand what AI/ML initiatives they are chasing and what benefits they would like to derive from them. If you put those benefits into categories, you can see that by adopting AI-powered intelligent software, businesses can drive a lot of value in areas like customer satisfaction.
C: For example, you can gain knowledge about customer usage and the trends, through sentiment analysis and trend analysis, and you'll be able to actually increase satisfaction with your policies and services. You can also gain a competitive advantage by creating differentiated digital services that are AI-driven, with AI concepts behind them instead of a rule-driven or process-driven philosophy.
C: Obviously, you can also optimize: you can leverage AI and deep learning to optimize your current business services, and hence increase your revenue because you're optimizing them, and also drive some new revenue streams by offering similar services. Lastly, there's automation: if you're able to automate manual, repetitive, time-consuming business operations, you can reduce your operational costs, which allows you to be more efficient and yet offer higher customer satisfaction.
C: So AI/ML is being leveraged internally in organizations to drive increased value in these four areas, whether through out-of-the-box products or software built in-house. Here are some examples of how companies are leveraging AI and machine learning to achieve positive business outcomes. In financial services, which we're all part of, there are outcomes like reduced fraud: you've heard of credit card fraud detection engines.
C: These are systems that can predict whether a transaction is real versus fake. That's driven with AI and machine learning, and it's being pushed very strongly by the financial services markets. Similarly, in the medical field and in healthcare, we see a lot of medical diagnosis being done with AI today. There's a lot of work happening there, and they can speed up the time it takes to deliver a diagnosis and also increase the accuracy of the diagnosis by augmenting medical professionals with an AI/ML-driven application.
C: When we look at the insurance claims industry, we see a lot of automation happening around the processing of claims: the pool of claims that get approved has increased, and the amount of time the customer has to wait to actually get claims approved has decreased. And obviously there's autonomous driving: self-driving cars are all driven by lots and lots of data being processed on the edge with AI/ML applications.
C: If you look at what's driving all this, the question is really: why now? What is happening now is that the growth of AI is driven by easy access to abundant computing power, faster processing with specialized computing processors, rapid development with rich open source frameworks and technologies that you can actually use for AI and learning models, and widespread awareness and acceptance of AI amongst all of us.
C: We don't see AI interfering in our lifestyles; we see AI actually augmenting our lifestyles, and as that awareness and acceptance has spread, it has led to all of these initiatives being taken up by companies. Meanwhile, the computing power that used to require supercomputers can now be deployed to a cloud environment at a fraction of the cost and time. These two factors have combined to basically make AI/ML real.
C: Next up: now that we understand the benefits of AI/ML for a company, what is the end-to-end flow that is followed in order to create an AI/ML application? Let's take a look at a typical development lifecycle, and you will see that it exhibits many similarities to traditional software development. It starts with the business leaders.
C: The business leaders set the goals, and then the data scientists develop the AI/ML models, collaborating with engineers to make sure that their models are able to leverage the architecture and the systems, with the data they need. Once the AI/ML models are created, the next part is that they need to get deployed, so the data scientists typically collaborate with the app developers to integrate the deployment of these AI/ML models into the entire application development process.
C: The applications consume the models, productionize them, and put them into the application that the end user will be consuming to leverage these models. The app developers also lead the deployment of the applications: you have to deploy them out, and once deployed, the AI/ML models start running and serving their inference capabilities as new data comes in.
C: As new data comes in, they run inference on it and see how accurate it is. These models are monitored and managed together by the data scientists and the application developers, because they want to make sure that the models are delivering the desired outcomes. IT operations is typically continuously engaged across all aspects of this lifecycle, helping with the management, monitoring, and remediation of the entire system, while the models themselves are monitored by the data scientists and app developers to make sure that the correct predictions are being made.
C: That requires you to go back and retrain the models. So it's a big feedback loop: models get retrained as needed. Say, for example, you're making predictions; your goal is to increase the accuracy of the predictions. You deploy the model, new data comes in, you see what the outcomes are, and then you go back and end up retraining the model to make sure that the predictions stay accurate.
C: Similarly, when some new data comes in and it turns out we did not train on that kind of data, our AI/ML model isn't able to handle this new data. So now we have to go back, create the models again, and then do the entire deployment process. This continuous feedback loop is always happening, from training and developing the AI/ML model, to deployment, to actively retraining it, and as we covered, this involves personas from all over the place.
C: Now, if we take a look at this typical lifecycle, you will see that at the top we have the project lifecycle, setting the business goals, and then the data engineers and data scientists need to work together to gather the data, prepare the data, and make it ready. In order to execute all of this, you need what we'll call the machine learning software toolchain. For example, it starts with TensorFlow, Jupyter notebooks, and Python stacks for development.
C: This hybrid cloud platform basically empowers data scientists, data engineers, and developers to be agile and collaborate through the entire process, without depending too much on IT operations for individual tasks. Now, because it's self-service, this hybrid cloud platform also needs to be optimized for the kind of AI/ML application you're building. For example, hardware accelerators such as GPUs or TPUs can help you speed up the development of the ML models and run the inferencing tasks.
C: If it's processing data that's coming from an edge location, then that needs to be covered as well, and it has to be done in such a way that IT operations can manage it from a single place, in a singular, consistent way, rather than trying to adapt to each particular environment and infrastructure where it's landing. So if you take a look at the entire lifecycle and the toolchain, this is where all of it lives.
C: We start at the bottom with the infrastructure, where things are going to get deployed. You have a cloud platform that runs on top of it: it has the compute power, it has the knowledge of what goes onto GPUs and what doesn't, it provides the architecture for the data, and it provides the architecture for the tooling. That includes the CI/CD process for continuous, faster deployment, and all of this feeds into the end-to-end flow of delivering the applications.
C: So now that we know the tooling and the infrastructure that is needed, let's look at the benefits that a container-based architecture and Kubernetes bring to the development and deployment of models through the lifecycle we just covered. With containers, and Kubernetes driving the orchestration of those containers, data scientists and software developers can develop ML models and the associated intelligent applications powered by these models with a very high degree of agility, flexibility, portability, and scalability.
C: Think of leveraging the power of Kubernetes and infrastructure as code: you can now automatically set up your AI/ML environments across a hybrid cloud, whether that's public clouds or on-premise. You can set it up automatically because you're declaring it as code, and you can do on-demand provisioning of your compute resources during the development of the models, during the deployment of the models, and then while the models are running.
C: As your demand grows, you can scale out the serving side, or as your data demands grow and your training sets get bigger and bigger, you can bring more compute to bear to help you develop the models. So among the powers that Kubernetes and containers bring to the table, the biggest is scalability, because you can scale as you need. There's also high availability, because in a real-world environment, many applications are running.
C: If you have downtime or failures, whether network failures or hardware failures, your entire solution can keep on running; it can automatically be provisioned wherever it needs to go to provide uninterrupted service to your customers. When we look at portability, we talked about how the models need to run across a wide range of infrastructure, which means we don't want to create a model that can only run on-premise or only on a particular public cloud.
C: The model has to be easy to refactor and move, and the idea is to use containers and Kubernetes to allow these models to run no matter where the end environment ends up being. So you get just-in-time scaling, high availability, portability, and the ability to quickly deploy changes to very specific pieces of the product, versus updating a monolithic application for one change you had to make, maybe for a new data model, or maybe for some bug that was found.
C: It's much harder to do that when it's delivered as a single application versus a containerized set of applications, a set of microservices that make up the application, because then you can update the respective pieces as you need. It makes you more agile in how you respond to new requirements or bugs, and also to new computing requirements as you scale. So now, Will, why don't we take a look at this in action on OpenShift? Thanks.
B: Thanks, Parag. I wanted to do a deeper dive into that machine learning lifecycle from a practitioner's perspective and talk about how we'd use this to solve a concrete problem. Parag talked about a lot of problems that are actually driving business value. I'm not going to talk about such a problem today; I'm going to talk about a problem that everyone understands but that no one is looking to build a new solution for right now, and that problem is spam classification.
B: We're going to start with a hypothetical data set where we have two kinds of data sources: data on the top, which we're calling legitimate documents, and data on the bottom, which we're calling spam documents. If you look at these and think about them, you could probably say, well, I can see some differences between these things; I can see a way to tell them apart. If you really think about it, you might notice that the excerpt on the top sounds suspiciously like Jane Austen.
B: The excerpt on the bottom sounds suspiciously like it came from a user comment on a recipe site or a review of a food product, and that's in fact what our data sources are. Our legitimate documents are documents that have been generated by a generative model trained on Jane Austen's complete creative output, and our spam documents are documents generated by a model trained on fine food reviews from a large internet retailer.
B: The idea is that since we can tell these things apart by looking at them, we should also be able to write a program to tell them apart. So let's dive into that workflow and see what we do in this specific case to solve that problem. The first task that a data scientist is going to do, again in conjunction with business leaders and stakeholders, is figure out a way to formalize the problem.
B: We need to figure out what it means to succeed at this problem and turn success into a number. That could be metrics that we're already collecting or metrics that we need to invent and record. In the case of document classification or spam filtering, success could mean not missing spam messages: I never want to see a spam message in my inbox. Now, of course, we could guarantee I never see a spam message in my inbox by sending everything to the spam folder.
B: So that's obviously not the whole story. It could also mean that we don't misfile legitimate messages, that we don't have a lot of legitimate messages winding up in someone's spam folder. Those are metrics that we can test when we have a training set, when we know what the truth is. There are also business metrics we might care about, and in this case that could be feedback from our users: how many messages did we send to someone's inbox, for example, that they flagged as spam?
B: How many messages did we send to someone's spam folder that they moved back into the inbox? Obviously these aren't the whole story, because people aren't perfect. Someone is not going to go through every message in their spam folder and ask, did I really mean to read this? And even if they did, they might not give us the signal by moving it back. But these business metrics are an important part of the problem, and responsible data scientists will focus on the whole picture, looking at all of these metrics together.
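To make that concrete, here is a minimal sketch, assuming scikit-learn and hypothetical label arrays, of how the two technical metrics Will describes (not missing spam, not misfiling legitimate mail) can each be turned into a number:

```python
# A minimal sketch, not the presenters' actual code, of turning "success into
# a number" for a spam filter. `y_true` and `y_pred` are hypothetical label
# arrays where 1 means spam and 0 means legitimate.
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 0, 0, 1, 0, 1, 1]   # ground-truth labels from the training set
y_pred = [1, 0, 0, 0, 1, 1, 1, 1]   # what the model predicted

# Recall on the spam class: what fraction of real spam did we catch?
spam_recall = recall_score(y_true, y_pred, pos_label=1)

# Precision on the spam class: of everything we flagged, how much was spam?
# Low precision means legitimate mail is being misfiled into the spam folder.
spam_precision = precision_score(y_true, y_pred, pos_label=1)

print(f"spam recall: {spam_recall:.2f}, spam precision: {spam_precision:.2f}")
```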
B: Once we have those metrics out of the way, our next step is to collect, clean, and label data. In this case, that means going from raw messages, where we have labels, to labeled messages in a regularized format, where we have individual example documents that we've labeled as either spam or legitimate.
B: Feature engineering is basically just summarizing documents as points in space. What I want to do is encode every document that I see as a point in space, in such a way that similar documents correspond to similar points. Then I can say interesting things like: oh, it looks like there are a lot of legitimate documents in this part of the space, so maybe my model can distinguish between things that are in this part of the space and things that are in other parts of the space.
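As an illustration of that encoding step, here is a minimal sketch using a TF-IDF vectorizer; this is a stand-in for whatever featurization the demo notebooks actually use, and the two-document corpus is hypothetical:

```python
# A minimal sketch of the "documents as points in space" idea using TF-IDF.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "It is a truth universally acknowledged ...",     # legitimate (Austen-like)
    "Great decaf K-Cups, fast shipping, five stars",  # spam (food-review-like)
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # each row is one document as a point in space
print(X.shape)                      # (n_documents, n_features)
```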
B: Once we have those features, we can use them as input to a model training algorithm. We take the labeled data, where we know the truth, and the approach that we used to turn that labeled data into feature vectors, and then we allow the model to identify patterns in those vectors that we can use to answer the question we care about, in this case: is this document spam or not? Really, at a high level, all the model training algorithm is doing is identifying good trade-offs in how it summarizes the data.
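Continuing the sketch above, the training step might look like the following; logistic regression is an assumption here, since this part of the talk doesn't name the demo's actual model:

```python
# Feed the feature vectors and labels into a model training algorithm.
# `X` and `vectorizer` come from the vectorization sketch above.
from sklearn.linear_model import LogisticRegression

y = [0, 1]  # 0 = legitimate, 1 = spam, matching the order of `docs`
model = LogisticRegression().fit(X, y)

# The trained model is now just "a function that makes a prediction".
print(model.predict(vectorizer.transform(["Tea and dog biscuits, would buy again"])))
```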
B: Maybe it was ads for online gambling one week, online gray-market pharmaceuticals the next week, and mortgage discounts the third, but there would be various topics that would elude the spam filter. Then someone would identify that these were getting through to the inbox and push out a new version of the spam filter that caught those things, and so the spammers and the spam filters were playing this cat-and-mouse game in the real world.
B: In general, models can start misbehaving, and the interesting thing about models is this: conventional software components, if we're lucky, break in obvious ways. They don't build, they don't deploy, they're obviously slow in production. With models, though, remember we just have a function that makes a prediction, and the way that this can misbehave is more insidious than the way our conventional web apps might misbehave: the model can keep giving you answers; they might just be wrong.
B: It might be wrong far more often than you can accept, so by monitoring the behavior of the model in production, we can identify this before it causes us a business problem. Now, as Parag said, this is not a waterfall; it is actually an iterative process, and at a lot of stages in the workflow we're backtracking and changing decisions we made earlier. Another really interesting thing about this lifecycle is that, because of all these loops, we need to be really careful about the latency between phases.
B: In a lot of organizations, data scientists who need new infrastructure to try a new approach to solving a problem have to file a ticket with IT; they have to get something supported. In a lot of environments, if data scientists want to build a model service that can be incorporated into an application, they're either going to develop that service themselves, using a skill set that's probably not where they'd rather be spending their time, or they're going to have a communication exercise with an app dev team: hey, look at this technique I developed.
B: Can you figure out how to turn it into a production application? Based on our experience of seeing this workflow in person, there are some teams where this works very well, but for some teams this turns into a lot of time spent at a whiteboard, a lot of raised voices, and a lot of eventual apologies.
B: So let's see how we solve this problem from end to end. The first thing I want to show you is the Open Data Hub operator, which is a community project sponsored by Red Hat that provides an end-to-end data science and data engineering discovery environment with a single click. Instead of filing a ticket with IT, if I'm a data scientist who has access to OpenShift, I can install this myself, and if it's already been installed by my organization, I don't even have to install it.
B: I can just go to an endpoint and get an interactive development environment for data science. Now, a lot of data scientists prefer to work in conventional IDEs, but a lot of data scientists also like to work in these so-called interactive notebook environments, and I'll show you what those look like. Here I have a directory of notebooks that I've launched from JupyterHub on the Open Data Hub, and this is basically just a way to do literate programming in a document.
B: So I have some prose here, and I have some code, and then I have the output of that code, and I can change this code as it runs and edit it. This is a really nice way to experiment with techniques interactively: I can say I want 23 rows of this data set instead of 50, and I get a different result. It's also a communication tool. For a lot of data scientists, a lot of the job is communicating results to stakeholders.
B: We want to explain what we're doing, show the code, and let people read the code and reproduce our work. The interesting thing is we can have these sorts of tables, and we can also have plots. So we could ask: is there a clean separation between the points in space for this problem? And we can see that, yeah, there basically is. So I could take this notebook, use it to develop the technique, and then hand it over to a stakeholder and use it as the basis for a presentation.
B: Now, for this concrete problem, we've looked at a couple of different approaches. I've run them already, but I can just restart and run this again. This is a feature-engineering approach where we're basically going to turn documents into vectors so that we can feed them into a machine learning algorithm, and you can see we do some sanity checking on our spam and legitimate documents. The spam document looks spammy: it's talking about K-Cups and decaf coffee.
B: The legitimate document is talking about things that upper-middle-class people in 19th-century England are doing, and this other one is talking about tea and dog biscuits and baby food and so on. So we see that there's some clear distinction between the kinds of things these documents are talking about. If we go on and look at the rest of this, we can see that we were able to transform these into large vectors and then save that pipeline.
B: There's nothing in this notebook that knows about OpenShift. Crucially, this is just a communication tool that a data scientist would work with. Now we're going to train a model. Again, I ran this in advance just before we started, but we can go through and look at it. Here are some metrics on how well our model is doing. This picture basically shows how many of the legitimate messages we actually predicted as legitimate and how many of the spam messages we actually predicted as spam.
B: On the other diagonal: how many spam messages did we call legitimate, and how many legitimate messages did we call spam? Again, this is just a communication tool; it doesn't look a lot like something you could immediately drop into production. In a conventional workflow, a data scientist would take these notebooks, send them to an app dev team, and have that team figure out how to implement them in a service.
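For illustration, here is a minimal sketch of computing that picture with scikit-learn, reusing the hypothetical label arrays from the earlier metrics sketch:

```python
# A confusion matrix: correct predictions on one diagonal, mistakes on the other.
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
# cm[0][0]: legitimate predicted legitimate   cm[0][1]: legitimate called spam
# cm[1][0]: spam called legitimate            cm[1][1]: spam predicted spam
print(cm)
```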
B: But we know that with OpenShift's developer experience, we can do better. We actually have a source-to-image build as part of a Tekton pipeline here that will take these notebooks, extract the code that trains the model, and build a microservice around it after training the model in a build. I've already run this in advance.
B: We've left some room to improve the model later; we're just showing you the first pass through this lifecycle. From there we go through and actually deploy a Knative service, based on the model that we trained in those notebooks, into production. What we've done is extract the code that does the feature engineering and the model training from these notebooks.
B: That's running right here in this pipeline service. We also have a parallel build that just uses regular source-to-image: I've built a version of this that uses a conventional source-to-image build too, so if you haven't adopted Tekton yet, you can still use similar techniques. We like to show the latest and greatest, though. All right, here's how you might interact with this in an actual application. I have a couple of different URLs here, and we're using one of them.
B: This is the one for the Knative service; we also have one for the conventional OpenShift service down here. So I'm defining the endpoint that I want to interact with, and I'm declaring a very simple client library where I just take the text that I pass in and post it to that REST service that I created.
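A minimal sketch of that client pattern, with a placeholder URL and an assumed request/response shape rather than the demo's actual contract:

```python
# Take some text and POST it to the model's REST endpoint.
import requests

ENDPOINT = "http://pipeline-service.example.com/predict"  # placeholder URL

def predict(text: str) -> dict:
    """Send one document to the model service and return its prediction."""
    response = requests.post(ENDPOINT, json={"text": text})
    response.raise_for_status()
    return response.json()

print(predict("It is a truth universally acknowledged ..."))
```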
B: We would hope that this would get predicted as Jane Austen, but again, we left some room for improvement in the model, so these both show up as spam. Let's try this with some more documents: I'm going to load in the training data I had, take a sample of it, and look at how well the model performs on these examples.
B: So we have a lot of examples here and a lot of predictions, and the interesting thing is that we can actually go back and track metrics about the predictions we've made; we can look at this service and see what it's done. Remember, we talked about data drift.
B: We may not know whether a given message is spam or legitimate in real life, but we may know that we expect a certain percentage of the messages we see to be spam. In the real world, maybe 90 or 95 percent of all email traffic is spam, and most of it just never makes it to your inbox. But if that distribution changes over time, we know that the data we're seeing in the real world no longer corresponds to the data that we trained our model on.
B: So if we start with, say, a hundred thousand examples, five percent legitimate and 95 percent spam, we should expect that the distribution of incoming messages stays roughly comparable: 95 percent spam and five percent legitimate. We're tracking these metrics from the model, and we can actually see them in Grafana. As soon as Grafana catches up, we'll see those metrics reflected in this dashboard here, but you can see how they've built up over time.
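One hedged sketch of how a model service could export those counts, assuming the prometheus_client library; the metric names are made up for illustration, not taken from the demo:

```python
# Export prediction counts so a dashboard can watch the spam/legitimate ratio.
from prometheus_client import Counter, start_http_server

PREDICTIONS = Counter(
    "model_predictions_total",
    "Predictions served, by predicted class",
    ["predicted_class"],
)

def record_prediction(label: str) -> None:
    """Call this with "spam" or "legitimate" after each prediction."""
    PREDICTIONS.labels(predicted_class=label).inc()

start_http_server(8000)  # exposes /metrics for Prometheus to scrape
```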
B: I've run some different experiments, and we'll see when Grafana catches up with those experiments. If we say 25 percent of messages are legitimate, we should see different curves in that graph. We're getting a little bit of a tick up here as the metrics system catches up, but we can see that these curves are going to catch up over time. We'll use a shorter time window so it's a little easier to see.
B: As we go on, we see a lot of legitimate messages and spam messages with this latest experiment, and that's not what we'd expect: we'd expect these to be growing at the same rate, because the proportion between them should stay the same. Now, in a real installation, we wouldn't just have a data scientist monitoring this dashboard waiting for something bad to happen; we'd want to let them do something more productive with their time. Instead, we could define an alerting rule.
B: For example: has this distribution changed? Or we could even have another model that detects anomalous behavior in these predictions.
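A minimal sketch of such an automated check, using a chi-squared goodness-of-fit test from SciPy; the counts and threshold here are hypothetical:

```python
# Has the observed spam/legitimate split drifted from the training-time split?
from scipy.stats import chisquare

expected_ratio = [0.95, 0.05]   # training-time split: spam, legitimate
observed = [680, 320]           # recent prediction counts from the service
total = sum(observed)
expected = [r * total for r in expected_ratio]

stat, p_value = chisquare(observed, f_exp=expected)
if p_value < 0.01:
    print("prediction distribution has drifted; consider retraining")
```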
B: So just to recap what we've seen in this demo end to end: we've seen how to use Open Data Hub on OpenShift to provision a self-service discovery environment, and how to use that Open Data Hub to do interactive development and produce a machine learning technique in an interactive notebook.
B: We've seen how we go from that interactive notebook, which is really a communication tool and not what we'd think of as a conventional software artifact, to an actual production service that we can incorporate into our application, using OpenShift's developer experience and Tekton build pipelines. And then we've seen how we can monitor the behavior of that model in production so that we can detect when it misbehaves.
A: That was great; thank you for that explanation. We have a couple of questions here, and I'm going to unmute Pete Brey, who's also from Red Hat and has been answering some of the questions in the chat as we've been going. One of the questions, and I think it's a good conversation, was around storage: Lead had asked about what is trending now in storage for AI and ML data. I wonder if you could address that, Pete.
D: Sure, and I'll paraphrase a little bit of what I wrote in the response. The answer is: it really depends. We are seeing some particular trends, but it really depends upon the types of data. In general, there are really two, actually three, large categories of data. Structured data is what you'd normally think of as things like customer records, things that would go into databases; they fit very nicely into a tabular, columnar type of format. But we know that not all data is nice and neat like that.
D: In fact, there's another category called semi-structured data, which is midway between being very structured and columnar and being very unstructured, which is actually the third category. In the unstructured category you have things like files, and I think Lead, who had asked the question, was specifically asking about unstructured data, basically files; it looks like he's using NFS for that today. I skipped over the middle category, semi-structured data, which is basically a combination of both structured and unstructured data.
D: A lot of what we're doing is helping people get to this new environment, and you might ask, well, why would you want to do that? There are a lot of different reasons. The primary reason is that S3 presents a very flat namespace, which is massively scalable, and when you're building a data lake that could potentially be hundreds of petabytes, that's very, very important. That's actually one of the challenges with traditional file systems like NFS: there are limits to their ability to scale, because it's more of a hierarchical type of namespace.
B: I'll take a crack at it, and Pete, I think you probably have some thoughts here too, so chime in if you'd like. There are a lot of technologies in this space that solve issues of data lineage. I'd really look at it in terms of managing the model lifecycle. A big concern here is reproducibility, and there are so many facets to reproducibility.
B: So I'm going to start by level-setting, and then I'll get to your question. With Jupyter notebooks, you saw how I went back and edited things and ran things in different orders; you can do that in a notebook. If I do that in a notebook, the output in the notebook is not going to be what someone else gets if I send it to a colleague and she tries to run it.
B: If I don't have the same libraries installed that you have installed, you will get different results than I will. If I have a library with soft dependencies, where it behaves one way if an optional package is installed and another way if it isn't, you may get different results than I do when running a model. And then, finally, there are all sorts of other concerns, like making sure that I specify random seeds.
B: I also need to make sure that I use random number generators in a way that's safe for the kind of parallelism I'm exploiting in my application, and make sure that any native libraries my Python or JVM code is calling out to are the same versions and have the same behavior. If you really need bit-level reproducibility of your model, which many people do, then you have a whole host of challenges in the code, and that's what we focused on today. You also have a whole host of challenges with the data.
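A minimal sketch of the seed-pinning part of that hygiene; which libraries need seeding depends on your stack, and numpy plus the standard library are shown only as common examples:

```python
# Pin every random number generator your code touches, and record the seed.
import random
import numpy as np

SEED = 20200604                      # arbitrary fixed seed, stored with the model
random.seed(SEED)
rng = np.random.default_rng(SEED)    # prefer explicit Generators over global state,
                                     # which is also friendlier to parallel workers
sample = rng.integers(0, 10, size=5)
print(sample)
```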
B: Your model is only as good as the data it gets, and your model is only reproducible if you know which data you used to train it and how you got that data. In terms of actual data lineage tracking, there are a lot of great projects in this space that address that component. It's not something we addressed in the demo today, but you can look at technologies like Pachyderm, for example.
B: There are other projects too: I think DVC is another good example, and the Quilt project has a metadata layer for machine learning data sets as well. It's a tricky problem. I think what a lot of people want to have is something that looks like a Git-style interface, where you have a content-addressable set of trees, so you can say: I built this model against the immutable data that I had in this particular hash of a tree.
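A minimal sketch of recording that kind of data identity, hashing a hypothetical dataset file so its digest can be stored with the model's metadata:

```python
# Record a SHA-256 digest of the exact training data alongside the model, so
# you can later say "this model was built against data with this digest".
import hashlib

def dataset_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Return a SHA-256 digest of a dataset file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

print(dataset_digest("training-data.parquet"))  # store with model metadata
```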
B: Ceph, of course, is immutable by default; you're not overwriting things unless you have to. So it's a case where our platforms, the Red Hat platforms, provide a primitive that you could use to support this. But again, there are a lot of community projects that solve this problem really well, and I think those are all worth looking at in this case.
B: The way we've been thinking about this problem is that you don't just want to track your code and your libraries and your hyperparameter settings and your random seeds; you also want to track your data. And in terms of actually thinking about lineage in pipelines: if you have a classical data-lake-to-data-warehouse architecture, where you're going from raw data to incrementally cleaned data in multiple steps, you need a way to replay those pipelines, and you need to track the identities of the data you're dealing with at every stage. How about you, Pete?
D: You've made some really good points, William, at a very high level, about the code piece of the equation as well as the data piece. At a very high level, I think many of us have probably heard the statistic that, as Parag was presenting in the flow that's on the screen right now, the gather-and-prepare-data stage is actually probably one of the most problematic stages right now for data scientists; I think a lot of people cite it.
D: But what we're talking about here, and what William was talking about, is an even more specific case of this problem, because not only do I have the problem of gathering the data, but how do I ensure that there's reproducibility? I was going to answer in exactly the same way: there are lots of different ways to address this. There are obviously commercial packages, but there are also a lot of open source packages that can help you with this particular problem.
D: It is something that I think the industry is focusing on, because it is such a broad problem. With respect to my earlier comments about object storage technology becoming much more prevalent, this is an area where object storage as a technology can also help, because it has built-in versioning capabilities for objects, so you're able to maintain that data as the objects, the files, whatever it is, potentially change.
A: All right. Well, you mentioned a couple of potential open source projects and things like that, and there's another question in here, and maybe we can tease out a little bit about how to get started on OpenShift with all of this. Molly is asking: are there Open Data Hub cookbooks or recipes for all these AI/ML processes and steps that one can refer to? I think we've talked about it and you've demoed it, but how do people get started? Where are the resources and things of that nature?
B: Yeah, absolutely. I think opendatahub.io is a great place to learn about the Open Data Hub. We have a GitHub organization with several projects where we've collected some of these tutorial materials, and I'm happy to follow up offline with anyone who's interested in reproducing or trying some of these things out. As soon as I'm not sharing my screen, I can put a couple of links in the chat.
B: Remember, we talked about how a key aspect of reproducibility is having the right libraries installed. A great way to solve that problem is to have your development environment stored in a container image, because then I don't have to worry about whether the libraries I installed are even still available, which matters surprisingly often, or whether you installed the exact same versions.
B: If I go into one of those JupyterHub notebooks and try to allocate six gigabytes of memory in a way that might crowd out other people whose Jupyter notebooks happen to be running on the same VM or the same physical node, OpenShift, via the Linux kernel, would terminate my notebook kernel and tell me I'm using too much memory.
B: Now the question is: can I still get work done in this environment? The facility for that is that we can set resource limits automatically. These are profiles that you can configure as an administrator when you install the Open Data Hub, and we have basically t-shirt sizing: you can take whatever the default is, which is typically small.
B: Or you can pick small, medium, or large, and by requesting those environments you can get more or fewer resources. The idea is that people who need more to get their work done will request those resources, and ideally you have the sort of cultural mores where people don't take more resources than they need and release them when they're done with them.
B: But there are technical solutions to that problem as well. And while I'm in this launcher, we can talk about some other aspects. We have a way to preload things: we have a persistent volume, backed by Ceph, running in the Open Data Hub, and I can pre-populate that with the contents of a Git repository. We also have integration with Ceph object storage.
B: I didn't use it for the demo today, but if I had an object store, the Open Data Hub would actually fill in my user's credentials as environment variables, so I don't have to have those in a notebook; I just refer to them via environment variables and access them.
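A minimal sketch of that pattern with boto3; the variable names and endpoint are assumptions, so check what your Open Data Hub deployment actually injects:

```python
# Read object store credentials from environment variables instead of
# hard-coding them in the notebook.
import os
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["S3_ENDPOINT_URL"],  # e.g. a Ceph RGW endpoint
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)
s3.download_file("my-bucket", "training-data.parquet", "/tmp/training-data.parquet")
```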
There are a lot of other demos: if you go to opendatahub.io or search for Open Data Hub on YouTube, you can see other demos that show the Ceph integration more in-depth.
A: I think everyone's silent on all of the streams, which is amazing. So anyway, do you have a final slide that links to resources, or anything you can throw up, in case people want to find you or do some interesting research on top of OpenShift and test their theories and practices out?
A: Right, that's all good, we're good, and Rakesh says you answered his questions, so that was great. If there are any final words, we've got about five minutes left, from Peter or Parag or anyone, about what's next for ML on Kubernetes, and maybe specifically OpenShift: anything coming down the pipeline, new operators, new partnerships, or things we should watch for?
C: For the data scientists and the app developers, the personas we saw in the lifecycle: how can we make it easier for them, depending on the kind of AI/ML application being built? So, the tooling and the interactivity part of it, because you're touching a lot of points: you're looking at data and models in a Jupyter notebook, but then you also want to work at the IDE level rather than in a notebook, and you want to preview what you're bringing in. So that's what we're looking at.
C: Now that we have identified that the equipment is there in OpenShift, how do we make it better and easier for developers and data scientists to come in and start creating from scratch? If you're thinking, my company's got something going on, but how do I start, where do I go? How can we make it easier for them? We are focused on those tracks, so you should definitely see some good things coming.
A: So, you see the resources screen here. There's one last short question, and I hope it's a short question because we're almost at the end of the hour: how easy is it to customize the JupyterHub landing page? He's saying he's on-prem and would not need the AWS fields. That's a real question.
B: So those AWS fields actually apply if you're on-prem, because they're also credentials for Ceph. In the Open Data Hub we're deploying OpenShift Container Storage; as I said, it's used to back the persistent volume, so your workspace is basically backed by Ceph in that case, and you can also refer to larger data that is stored in Ceph, hosted on OpenShift as part of that Open Data Hub deployment. So those credentials apply on-prem to the storage backends that the Open Data Hub is provisioning.
A: All right, I'm going to give it a pause for a minute. I'll mention that in the not-too-distant future we are probably going to be hosting a virtual OpenShift Commons Gathering with an ML/AI focus. So if there are topics you want covered or people you want to hear from, reach out and let me know, and I'll try to curate a very interesting day for everybody and reach out to some of the folks that are on the call here today, and others, to make that happen. But I'm not seeing any more questions coming in anywhere.
A: So please do check out opendatahub.io and all of the AI Center of Excellence resources and tools. They're doing awesome work, and lots of end users and customers are doing really interesting things with this, from the Mass Open Cloud to Anthem and others, people doing some really interesting work in ML and AI and data science. Next week we have the folks from howsmyflattening.ca, a bunch of data scientists who are using the Ontario data sets for COVID, so take a look at what they're doing there.
A: They'll be coming in and talking about their work. There's a lot of interest in this use case on OpenShift, and we're learning as we go and hopefully enabling you to do what you need to on top of OpenShift. So Will, Parag, Pete, thank you very much for taking the time to give this talk and the demo today; always insightful and educational. And thanks again to Chris Short for producing it and making these live streams flow so nicely.