Description
MLOps with Jenkins-X: Production-ready Machine Learning - Terry Cox, Bootstrap Ltd
Speakers: Terry Cox
Explore ways to treat Machine Learning assets as first class citizens within a DevOps process as Jenkins-X MLOps Lead, Terry Cox demonstrates how to automate your training and release pipeline in Cloud environments, using the library of ML template projects provided with Jenkins-X.
For more Continuous Delivery Foundation content, check out our blog: https://cd.foundation/blog/
Hello, and thanks for joining me on this session about using MLOps with Jenkins X. We have quite a broad audience today, so what I'd like to do is start with a little bit of a history lesson, back in the dim and distant past of not that long ago.
So everybody would be making incremental changes to bits of code, and then at some point somebody, typically a release manager, would be responsible for trying to work out how to integrate all of those changes into a known release version, which could then be compiled and moved on to a set of machines where it could be tested. And typically, testing would involve a lot of customers using the code, trying it out, finding out what didn't work, and then reporting that feedback back to the project manager.
So a release to production in those days typically involved packaging up an executable and some written instructions on how to deploy it, and then firing the whole lot over the wall to an operations team. Once that stage had been done, there would be a big party, the developers would get very drunk, and then go on holiday or move on to a different project.
So, as you can probably guess, there were lots of risks associated with this way of working. It was a very, very manual process, and so there were very, very many opportunities for human error to occur, and because there were so many knowledge gaps, it was very easy to forget to include things in a build, or forget to document how to use something in the release package, or just forget to mention essential dependencies that were needed in production environments.
Now, of course, this isn't the only way of working, and many people in the audience will be much more familiar with the idea of a DevOps process for managing software releases.
You're validating the assets that are being created and going through a highly automated governance process, where you're making sure that approvals have been put in place before, eventually, your continuous deployment system is allowed to create the environments and populate your production system with the various containers containing your product.
So under these ways of working, you're typically dealing with a situation where there are no changes to anything in a production environment without some sort of audit pathway, and some way to easily undo any changes that might fail for some reason in production. You've eliminated your knowledge gap, because you've shifted all of the aspects of the solution right into the design phase, where they can be taken into account properly as the software is being built and tested.
Typically, what happens when you're dealing with machine learning assets is that you have a data science team, and that data science team will be involved in aggregating particular chunks of data to make training sets and test sets, which have been carefully designed to prove specific aspects of the learning problem that you're trying to solve, and those may be very large data sets.
A
You
know,
potentially
in
the
order
of
petabytes
of
data,
and
then
you
will
be
creating
training
scripts
to
train
your
machine
learning
models
and
those
training
scripts
are
often
written
in
the
form
of
jupyter
notebooks
and
then,
at
the
same
time,
you'll
also
be
doing
a
lot
of
data
analysis,
work
to
try
and
understand
the
nature
of
both
the
data
you're
working
with
and
the
models
that
you're
creating
to
validate,
whether
what's
being
learnt,
is
actually
accurate
and
to
detect
whether
there's
bias
in
your
data
and
in
your
learned
models
if
the
models
are
acting
fairly
to
your
customers.
So there's a fair amount of ad hoc data flying around, and then at some point somebody will build some infrastructure to run a training event on, which usually involves setting up some cloud infrastructure so that you can throw a bunch of CPU or GPU or TPU compute resource at the problem. And then you may run that for a few hours, or days, or weeks, until at some point you spit out a model.
That model will then be evaluated to check its accuracy, and once you're satisfied with the model you've got, the route to deployment typically involves moving that model into a model server to make it available for use in the broader application.
Now, what that actually means in practice is that you've just thrown that model over a wall to the DevOps team, who are responsible for the rest of the product. And to put things in context, the machine learning elements of a given product typically represent about five percent of the overall effort, and so there's a lot of activity that's going on outside of the data science team to build the whole product, which involves integration and, you know, user interfaces and sales channels, etc.
You really need to get a handle on versioning that data, because if you want to be able to have any sort of audit capability, then you need to know which set of data a particular model was trained on, and potentially be able to replicate that training set and test the model against the original data, to see if anything has changed.
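The audit trail described here can be sketched in code. This is an illustrative assumption rather than anything shown in the talk: one simple way to "version" a training set is to record a content hash of it alongside each trained model, so you can later prove which data the model saw and detect whether that data has since changed.

```python
import hashlib
from pathlib import Path

def dataset_version(data_dir: str) -> str:
    """Compute a stable content hash over every file in a dataset directory.

    Storing this hash with each trained model gives a minimal audit
    capability: re-hashing the directory later reveals whether the
    training data has changed since the model was trained.
    """
    digest = hashlib.sha256()
    # Sort paths so the hash is independent of filesystem ordering.
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            digest.update(path.relative_to(data_dir).as_posix().encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()
```

In practice, dedicated tools handle this at petabyte scale, but the principle is the same: the model artifact and the data version are recorded together.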
A
The
data
you're
working
with
is
often
sensitive
personal
data.
So
you
have
the
full
stack
of
challenges
around
security
and
privacy
for
managing
those
types
of
data
sets,
and
you
are
often
dealing
with
dedicated
hardware
and
cross-platform
challenges,
because
you
may
need
a
lot
of
very
expensive,
gpu
or
tpu
resource
to
train
your
model,
and
you
may
want
to
deploy
your
model
onto
an
edge
device
such
as
a
phone
where
you
need
to
be
able
to
execute
that
model
on
a
completely
different
set
of
hardware.
And that means that we need to move into risk mitigation for challenges like bias, ethics and regulatory compliance. As you'll no doubt be aware, there's a lot of focus on regulation of AI at the moment, and the bar to clear will be very high, and so governance for machine learning systems will need to be very tight.
So how do we actually start to address these problems in practical ways? Well, realistically, what we need to do is put the focus on the product, not on the machine learning. What we actually want to do, if we're going to be successful as a product commercialization team, is to optimize the management of all of our assets, rather than optimize for machine learning, or optimize for user interface, or optimize for back-end. We've got to think about this end-to-end.
But initially the model deployment will have no model associated with it, so it can't deploy until you've trained your model. The training will execute within Jenkins X: it will build you the specified training environment, it'll run your training, create an instance of a model, and then it will test that model to see if it passes your acceptance criteria.
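The acceptance-criteria test at the end of that training step might look something like the sketch below. The metric names and thresholds are hypothetical, not part of any Jenkins X API: the point is simply that a pipeline step exits non-zero when the freshly trained model underperforms, which fails the build.

```python
import sys

# Illustrative thresholds; in a real project these would live in the
# training repository's configuration, not be hard-coded here.
ACCEPTANCE_CRITERIA = {"accuracy": 0.92, "recall": 0.85}

def passes_acceptance(metrics: dict) -> bool:
    """True only if every required metric meets or beats its threshold.

    A metric missing from the evaluation results counts as a failure.
    """
    return all(metrics.get(name, 0.0) >= threshold
               for name, threshold in ACCEPTANCE_CRITERIA.items())

def gate(metrics: dict) -> int:
    # Exit code 0 lets the pipeline continue to the promotion step;
    # exit code 1 fails the build so you can retune and retrain.
    return 0 if passes_acceptance(metrics) else 1

if __name__ == "__main__":
    sys.exit(gate({"accuracy": 0.95, "recall": 0.88}))
```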
Now, if it doesn't, then it will fail out, and you can run another training instance with some different tunings, and keep repeating that process until you get a pass. But if it passes, then it will actually automatically create a pull request onto the model deployment repository and move a version of that trained model into that repository, triggering the release process for the second repository. And so you get an automated end-to-end process, which allows you to run trainings and then deploy them into test environments, where you can evaluate them as part of the broader application.
Now, all the governance processes around Jenkins X fit neatly into your standard release processes, so you can just add in whatever ML-specific governance checks you want to have, like automated bias checking, for example, and just integrate those into the release criteria that need to be passed before you can put something into a production environment.
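To make the automated-bias-checking idea concrete, here is a minimal sketch of one such release criterion. The metric chosen (demographic parity gap) and the tolerance are assumptions for illustration; the talk doesn't prescribe a specific bias measure, and real deployments would use richer fairness tooling.

```python
def demographic_parity_gap(predictions, groups):
    """Largest difference in positive-prediction rate between any two groups.

    A gap near 0 suggests the model grants positive outcomes at a
    similar rate across groups, on this deliberately simple measure.
    """
    rates = {}
    for pred, group in zip(predictions, groups):
        hits, total = rates.get(group, (0, 0))
        rates[group] = (hits + (1 if pred == 1 else 0), total + 1)
    positive_rates = [hits / total for hits, total in rates.values()]
    return max(positive_rates) - min(positive_rates)

def bias_check(predictions, groups, max_gap=0.1):
    """Release criterion: pass only if the parity gap is within tolerance."""
    return demographic_parity_gap(predictions, groups) <= max_gap
```

Wired into the release pipeline like any other check, a failing `bias_check` blocks promotion to production just as a failing unit test would.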