From YouTube: The MLOps Roadmap - Terry Cox, Boostrap Ltd.

Description: This talk is an introduction to the CDF MLOps Roadmap.
I want to start by actually asking the question: what is MLOps? Because it's clear from the work we've been doing today that there is a lot of confusion about what that means and what the methodology really encompasses.
For many people, MLOps is focused on the idea of taking machine learning assets and putting them into production environments, but we would argue that the real challenge is in allowing people to manage machine learning assets as part of a wider product or solution, which is the thing that they really care about in terms of controlling the lifecycle and the release process.
So we clearly have a significant problem with our general ability to convert our machine learning experiments into practical products.
Now, we are also operating in an intrinsically high-risk environment. Machine learning and AI are almost always working with large, aggregated, potentially sensitive data sets, so we are always having to contend with scenarios where the challenges of the data involved are significant and the penalties for failure are potentially very high. In fact, it's clear that we need to anticipate very high levels of regulation being introduced into this environment.
If you step back and take the bigger view, it is clear that Jupyter notebooks are not production-ready assets. They're very easy to write and modify off the cuff, but they're also very difficult to manage effectively and to put a good governance process around, and extremely challenging to scale and secure, along with all of the other non-functional aspects that we need to manage when we're dealing with technology in production environments.
So the roadmap came about as an attempt to capture in one place all the requirements for what MLOps and MLOps tooling need to be able to do in the future: to get us past this phase, where we're still in the early infancy of a new discipline, and to move us into a situation where we can actually work efficiently and safely with these types of assets.
The document has three primary sections. The challenges chapter takes us through each individual, fundamental problem or issue in the machine learning space and tries to spell out what those challenges look like in terms of the impact that they have on a team or an organization.
So, for example, a typical challenge within the roadmap would be the idea that we need to be quite platform-agnostic in the way that we work. Historically, machine learning has been very much a Python-based activity, with lots and lots of tooling and libraries developed within Python, but that brings with it certain challenges.
For example, because Python is an interpreted scripting language, the source code of a Python system is always available in a production environment, and if you have access to that production environment, you are effectively able to modify the source code and change the behavior of the running system. Clearly that can be a very high risk from a security perspective: it would be very easy to inject malicious code into Python environments.
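As a rough illustration of one way to reduce that risk, a release process could record a digest of every deployed Python file and check it later to detect tampering. This is only a sketch under stated assumptions; the `hash_tree` and `verify` helpers are hypothetical, not part of the roadmap or any particular tool:

```python
import hashlib
from pathlib import Path


def hash_tree(root: str) -> dict:
    """Compute a SHA-256 digest for every .py file under root."""
    digests = {}
    for path in sorted(Path(root).rglob("*.py")):
        digests[str(path)] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def verify(manifest: dict, current: dict) -> list:
    """Return the files whose contents no longer match the release manifest."""
    return [f for f in manifest if manifest.get(f) != current.get(f)]
```

At release time you would store the output of `hash_tree` alongside the artifact, and re-run `verify` in production to spot modified source files.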
The other challenge in this space is that we are looking at a situation where you need to be able to build your code in order to train your systems.
Now, the reality is that the technology that is most convenient for writing training scripts is often different from the technology that's most efficient for running the training, and again the technology which is optimal for operating those models may be different again. So we fully expect that customers need to be able to easily define their models so that they can train them, which may imply the use of, say, GPU resources in a cloud environment to give you short-term access to a large amount of compute resource. But then the model you've trained is quite likely to be some sort of decision-making system that needs to operate in near real time in human-facing environments.
So then you have the challenge that the deployment of your model needs to be able to operate in the real world, possibly on an edge device, and so the inferencing that you're trying to do needs to be able to operate at very low latencies, close to the source of the data that it's collecting.
Clearly, a significant challenge exists in how we provide an overarching CI/CD process, if you like, which allows you to define models using one level of abstraction, train them using a different level of technology, and then translate those models so they can be deployed onto a third level of technology.
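To make those three levels concrete, here is a toy sketch of such a pipeline in Python. The stage names, the dict-based "model", and the stand-in training arithmetic are all illustrative assumptions, not part of any real MLOps toolchain:

```python
def define_model(spec: dict) -> dict:
    """Level 1: a declarative, framework-neutral model definition."""
    return {"spec": spec, "status": "defined"}


def train(model: dict, data: list) -> dict:
    """Level 2: 'training' on whatever backend is cheapest (stand-in arithmetic)."""
    return dict(model, weights=sum(data) / len(data), status="trained")


def export_for_target(model: dict, target: str) -> dict:
    """Level 3: translate the trained model for the deployment runtime."""
    return {"target": target, "artifact": model["weights"], "status": "deployed"}


# One pass through all three levels of abstraction:
pipeline = export_for_target(
    train(define_model({"layers": 2}), [1.0, 2.0, 3.0]), "edge"
)
```

The point of the sketch is the separation of concerns: each stage only consumes the previous stage's output, so each can run on a different technology.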
Clearly, this is a really big challenge; it's not something that we're going to fix overnight, and it's not something that we can do on our own. So how do you find the roadmap, and how do you contribute to it? Well, the roadmap document itself is publicly available, and you can see it here.
There are many ways in which you can contribute to the work that we're doing. We're more than happy to accept pull requests on the document itself, so if you feel there's a challenge that's missing or a technology requirement that we should detail, then you're welcome to edit the draft version of the document, and we'll review and incorporate any feedback. You're also more than welcome to join the MLOps SIG; we have regular meetings where we discuss the progress of the roadmap, and we also have a mailing list and a Slack group where you can contribute and contact the various members of the team.
Right, I hope that was helpful. I'd be really interested to know who's involved in doing MLOps-related activities at the moment and who's planning to do so shortly, so feel free to let me know in the chat if you're already on this journey and if you've got any specific questions that I can help with.
So the plan really is to try and evolve this incrementally over time. We've done quite a bit of work this year to put the first draft together, and we hope that we've got a starter for ten in terms of highlighting what a lot of the common problems are, and some of the bigger challenges.
Clearly, what we really need to do here is to extend DevOps so that it takes into account the needs of machine learning and AI projects. So really, the challenge here, James, is to make sure that the focus is always on reducing lead time. What the practice is doing is setting you up to minimize the time it takes you to get from a decision about a feature to implementing that feature in a production environment.
There are some significant challenges in some areas, and clearly one of the big differences is that, when you're dealing with MLOps, you have to manage your data sets as well as managing your code base.
So there are some significant gaps in our current tooling and capability in terms of being able to treat a large set of data as a coherent asset that we need to manage over time. To give you a feel for that: a large data set in machine learning terms probably starts at about 10 terabytes of data and may go up to tens of petabytes.
So it's not a case of being able to take a snapshot of something and just throw it around in the data center. You're talking about very large amounts of data that take considerable amounts of time even to move from place to place in some circumstances.
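One common way to treat data at that scale as a versionable asset, without ever copying the bytes, is to version a small manifest of content digests instead of the data itself (tools such as DVC take broadly this approach). A minimal sketch, with hypothetical helper names:

```python
import hashlib
import json


def manifest_entry(name: str, size_bytes: int, digest: str) -> dict:
    """Record a file by name, size, and content digest, not by copying it."""
    return {"name": name, "bytes": size_bytes, "sha256": digest}


def dataset_version(entries: list) -> str:
    """A dataset version is the digest of its sorted manifest.

    The manifest is canonicalized (sorted entries, sorted keys) so the same
    files always yield the same version, regardless of listing order.
    """
    canonical = json.dumps(sorted(entries, key=lambda e: e["name"]), sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()
```

The manifest is a few kilobytes even for a petabyte-scale data set, so it can live in ordinary source control while the data stays put.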
These systems need to be surrounding you in your smart cities, and in many cases they need to operate at very low latency. You need the system to react to what's going on and respond in milliseconds, because you may be dealing with a safety-related system like an automated braking system on a vehicle, or a network of systems that's interpreting traffic and trying to optimize the operation of a set of traffic lights.
So one of the challenges for MLOps as a process is that you need to gather a large amount of training data from the edge, and then you need to move that somewhere where you can process it to actually train a reliable model.
You may well need to take your model and convert it into a hardware description language like Verilog, and then take that and use it to make a physical product, such as an FPGA, which you can then embed in another piece of technology. You then have a high-speed model that can inference very quickly but is encapsulated in a piece of hardware.
So what are people engaging with at the moment? What problems have people come across today? Serverless? Yeah, that's an interesting question.
An individual GPU card might have, you know, 40 gig of RAM available, so to be able to train a model in a reasonable timescale, say less than a week, you typically need to have a lot of GPUs and then a very high-speed network, so that you can continuously stream the data in 40-gig chunks through the network, performing a lot of calculations on it and then synthesizing a model on the back of that.
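That chunked streaming can be sketched as a simple generator. The 40 GB chunk size comes from the talk; everything else here is an illustrative assumption:

```python
def stream_chunks(dataset_size_gb: float, chunk_gb: float = 40.0):
    """Yield (offset_gb, size_gb) pieces sized to fit one GPU card's memory."""
    offset = 0.0
    while offset < dataset_size_gb:
        size = min(chunk_gb, dataset_size_gb - offset)
        yield (offset, size)
        offset += size
```

In a real training loop each chunk would be fetched over the high-speed network and fed to a GPU while the next chunk is already in flight.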
Serverless is an abstraction that we can use at one level. From a data scientist's perspective, you can say, yes, this looks like it's serverless, in that I don't have to think too much about the infrastructure; I just tell it what I want it to do. But under the hood, the MLOps implementation actually needs to physically allocate GPU hardware to nodes, and then individual containers have to mount individual GPU instances, and we have to help to distribute the workload across those.
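A toy sketch of that hidden step: behind the serverless facade, something still has to bind each piece of work to a concrete GPU. The round-robin allocator and the job and GPU names below are purely illustrative, not how any particular scheduler works:

```python
def allocate(jobs: list, gpus: list) -> dict:
    """Bind each job to a concrete GPU instance, round-robin.

    The serverless abstraction hides this mapping from the data scientist,
    but it still has to happen somewhere in the platform.
    """
    if not gpus:
        raise RuntimeError("no GPU hardware available to back the abstraction")
    return {job: gpus[i % len(gpus)] for i, job in enumerate(jobs)}
```

Real platforms layer much more on top (bin-packing by memory, affinity, preemption), but the point stands: the abstraction leaks as soon as the physical hardware runs out.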
So it's certainly a domain in which the tooling itself needs to be much more aware of the hardware and the physical constraints than with conventional CI/CD environments.
I know we've had some interesting problems just managing the governance processes around these sorts of issues. I think, probably, if anyone's got any further questions, it might be best to take them to the MLOps Birds of a Feather event that's coming up next, and then we can start to get our heads together and dig into some of the detail of this stuff. But if there are no further questions here, then thanks, everyone, for your time and attention, and yeah, please reach out, get involved with the roadmap, and I look forward to collaborating with you in the future.