From YouTube: Lockheed Martin Customer Case Study: AI from Training to Edge with MicroShift (OpenShift Commons 2022)
Description
Customer Case Study: AI from Training to Edge with MicroShift at Lockheed Martin
Red Hat OpenShift Commons 2022 @ KubeCon NA
Detroit, Michigan
October 25, 2022
Speakers:
Ian Miller & Matt Wittstock (Lockheed Martin)
https://commons.openshift.org/gatherings/kubecon-22-oct-25/
Matt Wittstock: Good afternoon, Detroit, and thanks for having us. My name is Matt Wittstock, and this is Ian Miller. We are from a team at Lockheed Martin called AI Factory, and we're going to talk a little bit about AI from training to the edge.

If you jump forward here: the AI Factory, as you can see up here, is not actually a factory. We are a team that's focused on the end-to-end machine learning lifecycle. We're doing everything from data ingest, data cataloging, and versioning everything on the data side, so you have clean, reproducible work there; moving into the training side; deploying models, which we're going to show you a little bit about here today; and then actually maintaining those models: sustainment, monitoring, and retraining. And then, as you can see here, all of the hardware that underlies all of the work we do for AI Factory.
Ian Miller: Yeah, so a lot of AI Factory's use cases are built around the concept of MLOps, a reasonably new field, but really it's just taking DevOps principles and bringing in machine learning, because the lifecycle for machine learning is a little different from DevOps. You have a lot more complexity in your pipelines. Your pipelines might not actually put out the same thing every single time; they're not necessarily completely reproducible. There's inherent stochasticity in the middle of it.

So there's a new discipline forming around that called MLOps, and really the need, from Lockheed Martin or any other company that's employing ML, is to be able to trust what you're putting in the field with machine learning, which becomes really challenging once you can't see inside of why the decisions are being made all the time. So a lot of MLOps is really about keeping track of every little step of that pipeline.

How can we trust this thing that is new, when we don't really know why it's making decisions the way that it is? There is a field of explainable AI, which is the study of specifically why a model makes decisions, but what we found is that, in addition to that, you really just need a robust process to get your models to production, and then you can actually trust them based on all these other inputs as well. And of course, all the stuff I'm describing really needs a lot of elasticity and scaling, and GPUs are super expensive. So you might do that in the cloud, but you also might be doing that on-prem, and from the Lockheed perspective...
Matt Wittstock: You go; nice. So, talking a little bit about the platform and why we are here: we believe very much in open source first, and we use a lot of open source tools. One of the big things that we are all about is making sure we don't go anywhere near a monolith. We're really a lot of small, composable modules, pieces of software that we put together into what really makes up AI Factory. That also allows us, when we work with the different programs and different teams that we have across Lockheed Martin, to let them pick the components they need for their business, versus taking some very large piece of software, bringing it all in, and getting that through a number of different processes that happen to exist. So we really focus on that: it's all open source, we can bring in what we need, we can see how it was built, and then, like I said, we modularize it. And of course, everything runs on top of Kubernetes.
B
So,
of
course,
the
platform
is
a
little
bit
of
a
loaded
term,
but
I'm
going
to
talk
about
on
this.
Both
the
platform,
as
in
the
the
software
side,
but
also
the
platform,
and
where
do
we
actually
run
this
it's
a
little
bit
of
both
as
Ian
said,
we
have
to
be
massively
scalable,
which
means
both
scale
to
zero,
sometimes
because
the
cost
is
crazy
if
we're
doing
anything
outside
of
a
cloud
world
to
scaling
to
whatever
our
need
is
for
a
very
large
data,
set
very
large
training
jobs
that
we
might
have
out
there.
We also have to always be changing. Especially in the AI field, new tools come in all the time; there are always new things coming in. So we're always taking a look at what is out there, what is available, and what we can bring into our stack that meets a business need for us. Is there maybe a new tool that meets a need we had, where we already have something out there and there's a better version of it now? The same thing goes for all of the underlying hardware infrastructure and the Kubernetes layer.
As far as where we deploy, we're all over the place. We've got environments that we run out on our public clouds, as well as our government clouds. Of course, we have very large central infrastructure; in fact, the one that we tend to use the most is one that we've built.
We have an NVIDIA SuperPOD that we do the vast majority of our work on, 160 A100 GPUs, and that's just one of the many GPU HPC environments that we have. And then, of course, we've got a couple of different things for edge work. One that we use quite often is an HPE Edgeline; we'll talk about it more, and it's actually what we're running a little bit of our demo on. We didn't bring it with us. It's very loud.
B
You
wouldn't
be
able
to
hear
us
on
stage
but
fantastic
piece
of
hardware,
and
then
we
do
a
lot
of
work
on
embedded
now.
So
this
is
where
we've
really
brought
in
microchip.
Moving
from
you
know,
a
large
scale,
openshift,
that's
running
on
the
rest
of
our
environments,
to
a
very
light
edge,
kubernetes
that
we're
running
on
devices
like
these
right
here,
so
we'll
jump
into
that
a
little
bit
more
as
we
go
into
a
demo.
So
you
can
jump
forward.
Ian Miller: ...and how that kind of process goes. We do a lot of our training, just because of the resources necessary, on the large clusters; the NVIDIA SuperPOD is a great example there, where we have all these GPUs to bring to bear. But then in most cases, at least in the embedded or edge world that we live in, your model is not going to be able to run on those cloud resources. You need it deployed elsewhere.

Maybe it's to the Edgeline, which is a four-node server box that you can take with you to places, or even to embedded spaces like the NVIDIA devices here. So really, we need to be able to move those models that we train to these different environments, and they might be optionally connected or not. The way that we've been able to make that the smoothest is to standardize our platform layer
across those different form factors. So we're running OpenShift or Kubernetes in the data centers, and also on these edge boxes that we can take to different places with us, and now we're also running it on our embedded devices as well. Because you have that standard stack and that standard underlying platform, it gives you a lot of flexibility to move between those environments, and that's the success we're seeing.
Let's talk a little bit about the actual underlying tech stack that we're using for those model serving environments. We lean heavily on a lot of the projects that we'll see here at KubeCon; I think in the last week or two there have been announcements of several of these projects even getting donated to the CNCF, which is really cool.

So really, our stack right now is some AI Factory inner-source pieces, Red Hat OpenShift on our large devices, and now MicroShift running on our edge devices. A lot of the training and a lot of the AI work in the cloud is done with Kubeflow, and then KServe, a project that came out of that, which we use for model serving both in the cloud and at the edge. We'll be running KServe on some ARM devices, which, I don't know if it's the first time it's been done, but certainly we had to build a lot of it for ARM.
Matt Wittstock: Yeah, so let's actually talk a little bit about an edge use case. In fact, I don't have any links or anything in front of me, but there was quite a bit of press that came out recently, just this morning: we did fly MicroShift on top of one of our Lockheed Martin Stalkers. So that put MicroShift out in the air, very exciting, running a bunch of containerized AI workloads.
Talking a little bit about the use case for deploying this kind of AI (we're going to show off the actual AI inference): some of our different use cases often involve a lot of large data collects. In this case, you can see we've got a little Stalker flying, shown here taking a lot of video footage. We collect a lot of this footage, bring it somewhere central, and then we train a bunch of models to build a better
B
Do
you
know
understanding
of
what
is
actually
being
seen
in
that
video,
and
then
we
deploy
that
out
into
the
field
on
these
stalkers
and
now
we're
actually
getting
live
immigrants.
In
this
case
of
what
I'm
kind
of
showing
here,
so
this
is
sort
of
a
computer
vision
use
case
now.
What
we
really
focus
on
there
as
well
is
not
only
then
just
deploying
that
out
very
important,
but
now,
as
you're
out
there
doing
inference
you're,
actually
both
learning.
How
is
this
running?
B
Is
it
actually
performing
very
well
and
you're
capturing
a
lot
of
new
data?
So
every
time
we
do
some
sort
of
flight,
we're
capturing
a
lot
of
new
data,
and
so
then
we
have
all
of
this
to
bring
back
again
back
into
our
Central
environment.
With
all
of
this
new
data
we've
collected.
All
of
that
new
data
then
has
to
be
labeled
stored,
cataloged
versioned,
slightly
out
of
order
there,
but
all
of
these
different
things
to
the
week
and
then
take
and
train
the
model
better
again,
you
know
use
all
this
new
data.
We might have some new things that we saw in the field that we're able to train the model to better understand, deploy it back out again, and continue that loop over and over and over again. That's one of the big things we really focus on: all the pipelines, all the training, everything involved in making this as seamless as possible, so we can just collect data, train, send it back out again, and complete that loop. I don't know if there's anything...
Ian Miller: ...this kind of space, to be fair, has only recently really become possible, with some of the more capable compute you can now deploy to those embedded devices. But really, if you think about it, a lot of our use cases need to be able to change their mission mid-flight.

That goes for the software side, but it's even greater on the AI side, because you might need to deploy a new model, or several models, at a moment's notice, and be able to swap between them depending on what your mission is or what that end device is doing. In dynamic environments like that, the AI needs to be just as dynamic. And what we've already found from running these in the use cases
C
That
Matt
is
talking
about,
is,
is
we're
able
to
like
have
seamless
transition
of
models
with
no
downtime.
But
switching.
You
know
how
the
AI
is
interpreting
the
data
that's
coming
in,
or
maybe
we
can
deploy
a
new
capability
from
the
AI
perspective
all
seamlessly
without
without
taking
down
the
asset
to
go
like
re-flash
it
and
then
send
it
back
up,
and
so
you
know
it's
certainly
possible
to
do
without
the
container
orchestration.
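The no-downtime swap described here maps naturally onto a Kubernetes rolling update. A minimal sketch, assuming a hypothetical Deployment named vision-model and an invented image tag (this is not Lockheed's actual setup):

```python
# Hypothetical sketch: swap a containerized model with no downtime by
# patching the serving Deployment's image and letting Kubernetes roll it out.
# Requires: pip install kubernetes
from kubernetes import client, config

config.load_kube_config()  # load_incluster_config() when run on the device

patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "model-server",  # invented container name
                        "image": "registry.example.com/vision-model:v2",
                    }
                ]
            }
        }
    }
}

# Kubernetes starts v2 pods, waits for them to pass readiness checks, and
# only then retires the v1 pods, so inference traffic never sees an outage.
client.AppsV1Api().patch_namespaced_deployment(
    name="vision-model", namespace="edge-ai", body=patch
)
```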
Matt Wittstock: So this is leading up to the potentially fun part here. We'll talk a little bit about the demo that we're going to be doing; I had to decide if I'm going to talk very slowly to preload the demo.
We've got a couple pieces of hardware that we're doing some of this prototyping on and trying to show off here. Back in Denver, we've got one of our Edgelines that is running some image generation and, I guess, a little bit of NLP, all that.
Ian Miller: ...from OpenAI; it takes speech to text. And then we have another model called Stable Diffusion. We've pulled that from Hugging Face, another open source model (I think it's Stability AI who trained that one), and that one takes text to image.
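The transcript cuts off the name of the speech-to-text model, but OpenAI's open source Whisper matches the description. A rough sketch of this kind of speech-to-image pipeline, assuming Whisper plus Stable Diffusion through Hugging Face's diffusers library (model IDs and file names are placeholders, not the demo's actual code):

```python
# Hypothetical speech-to-image pipeline: a speech-to-text model transcribes
# a spoken prompt, then Stable Diffusion renders it.
# Requires: pip install openai-whisper diffusers transformers accelerate
import torch
import whisper
from diffusers import StableDiffusionPipeline

# Speech to text (the "base" Whisper checkpoint keeps memory use modest).
stt = whisper.load_model("base")
prompt = stt.transcribe("spoken_prompt.wav")["text"]

# Text to image, pulled from the Hugging Face hub; assumes a CUDA GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
image = pipe(prompt).images[0]
image.save("generated.png")
```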
Matt Wittstock: So we'll see how that goes here in just a second when we switch to that tab, because we might be preloading a lot of data. But before we jump there: as I mentioned earlier, up here we are running two NVIDIA AGX dev kits, the new Orin devices. We've got these running up here with MicroShift on them. We were initially going to run a lot of this demo on them, but switched to just showing how to do a lighter inference on these devices as we go along.
Ian Miller: It's hard to see, but the text is what it's interpreting from what we say, and then the picture will be whatever it is that we're talking about. We've been talking about tech stuff, so that makes sense.
A little. To be fair, these models are meant to run in huge data centers, so it takes a little bit of processing and time to pick them up. "Cats in space." We'll give it a few more, and then we can turn our attention to the edge boxes.
Ian Miller: Again, we're not taking credit for the models; the ability to run them on some edge boxes is the part we'd claim. But yeah, so that's one of them, and then let's take a look at... I've got to drop off the...
So really, this is what's running on these devices right now. We have MicroShift running on there; you can see a lot of the control plane pods. If you can't see, I'll narrate it for you: there are maybe six or seven containers, I think, running the core MicroShift piece of it, and then we have a full cert-manager running on there, and an Istiod.
The reason for that is that it's really a dependency for the type of KServe deployment that we did. KServe is pretty cool if you haven't checked it out. It basically gives you custom resources for deploying models, and so there's a KServe controller on here, and it will actually interact with Istio for you, it'll interact with cert-manager, and it'll build you all the stuff you need. So, in the base case, all you have to do is say "here's my model,"
and it'll actually go spin that up for you. So, pretty cool. We have all that control plane running here, and then our data plane, which is our CIFAR-10 predictor. I did make the namespace, if somebody can see that.
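A minimal sketch of what "here's my model" can look like with KServe's InferenceService custom resource, created via the official Kubernetes Python client; the namespace, model name, and storage URI are invented placeholders, not the demo's actual manifest:

```python
# Hypothetical sketch: deploying a model through KServe's InferenceService
# custom resource. Requires: pip install kubernetes
from kubernetes import client, config

config.load_kube_config()

isvc = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "cifar10-predictor", "namespace": "models"},
    "spec": {
        "predictor": {
            "model": {
                "modelFormat": {"name": "pytorch"},
                # Placeholder location; KServe pulls the model from here.
                "storageUri": "s3://example-bucket/models/cifar10",
            }
        }
    },
}

# The KServe controller reconciles this resource, wiring up Istio routing
# and certificates (via cert-manager) so you don't have to.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="models",
    plural="inferenceservices",
    body=isvc,
)
```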
And then, as an example of inference: a model, at the end of the day, is software, right? So really, what you're doing is deploying it with a REST endpoint. KServe also supports things like a Kafka ingest or gRPC, but the base case is just to deploy a REST endpoint.
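For that REST base case, KServe's v1 data-plane protocol is a plain JSON POST to a :predict path. A hedged example; the hostname and model name are placeholders:

```python
# Querying a KServe v1 REST endpoint. Requires: pip install requests
import requests

url = ("http://cifar10-predictor.models.example.com"
       "/v1/models/cifar10-predictor:predict")

# A CIFAR-10 predictor expects a 32x32 RGB image; zeros stand in here.
payload = {"instances": [[[[0.0] * 3] * 32] * 32]}

resp = requests.post(url, json=payload, timeout=30)
print(resp.json())  # plain JSON back, e.g. {"predictions": [...]}
```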
Again, the result is just JSON, because it's an endpoint, but you can see we're running the CIFAR-10 model on the ARM boxes. And what that entails: I think, as we start to see more ARM, both on...
...at the Westin. So yeah, that's just an example of the same technology that we can run in our data center, running on MicroShift right in front of you. One other thing I'll say, since I think we both talk fast and we've got a little bit of time to vamp here: one thing I think is ripe in this space is that, even though we can run these control planes on the edge devices,
there are processes that you don't have to have on the device. A lot of these frameworks are built to run in cloud environments, and one place that I think is ripe for innovation is being able to have the control plane running even on an Edgeline-type device, maybe not in the data center at all, but able to manage a bunch of these embedded, deployed models that are out in these optionally disconnected environments. So I'm going to put that out there into the world.
Most of the collaboration for the data scientists and machine learning engineers happens on our training platforms, so we deploy multi-tenant Kubeflow on there, and I think it's just the ability to get an environment that has all of their dependencies and easy access to GPUs that has really accelerated their ability to train a bunch of models. And then, speaking of other things that we have deployed in that training environment, we have a number of distributed training tools.
For instance, Kubeflow comes with one called Katib. The Ray community is awesome and has built some really cool distributed training pieces, and then there's also Determined AI, another open source tool that we use there. That gives them, again, easy access to distributed training, and they can collaborate on building these larger training jobs. So far they've been sharing our GPU clusters (they're scarce resources), and we've been able to have them share those clusters to pretty good effect.
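To give a flavor of the Ray piece of that, here is a minimal sketch of fanning trials out across a shared cluster; the objective function and values are invented for illustration:

```python
# Minimal Ray sketch: run hyperparameter trials in parallel on a shared
# cluster. Requires: pip install ray
import ray

ray.init()  # use ray.init(address="auto") when attached to a real cluster

@ray.remote  # on a GPU cluster you might request @ray.remote(num_gpus=1)
def train_trial(learning_rate: float) -> float:
    # Stand-in for a real training loop; returns a fake validation score.
    return 1.0 / (1.0 + abs(learning_rate - 3e-4))

# Launch all trials at once; ray.get blocks until every result is back.
scores = ray.get([train_trial.remote(lr) for lr in (1e-4, 3e-4, 1e-3)])
print(max(scores))
```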
Ian Miller: But it's really just the ability: what a data scientist or a machine learning engineer needs is quick, easy access to large-scale GPU and distributed training environments, and that's not necessarily easy to build. So, from our perspective, we've built a platform team at Lockheed Martin to deliver that to our array of engineers. I know that's not the reality for every company, to have the resources to do that, but it's worked very well for us.