Debugging Machine Learning on the Edge with MLExray - Michelle Nguyen, Stanford
All right, awesome. Hi everyone, nice to meet you all. I hope you all had a good lunch. So today I'm going to be telling you about MLExray, and MLExray is essentially an end-to-end debugging platform for your models that are deployed on the edge.
So just a little bit about myself: I'm Michelle, and as I said before, I'm a principal engineer at New Relic working on the Pixie open source project. Pixie is a CNCF sandbox project, an observability tool for Kubernetes, and before my time at New Relic I was Pixie Labs' first engineer; Pixie Labs is where the Pixie project was born out of.
So why should we even talk about debugging deployments of machine learning on the edge? We see today that a lot of products and software are moving their machine learning models to the edge. For example, we have Cruise, which is self-driving; that's a very popular hot topic these days. Essentially your car is going around and picking up a bunch of sensor information: it's using a camera to figure out, am I driving correctly in the lane, and it's using lidar for object detection, to see if there's an obstacle in the way so you don't accidentally hit somebody.

Or we've had the Amazon Echo around for a while, and that listens to you. You're going about your day, just talking normally, and it listens and picks up cues whenever you say "Alexa." All of that is done on the edge: it's picking up sensor information and then basically running a model and figuring out some inference, and what action to take, based on the information it's gathered.

Another example is that you want to deploy your applications onto different phones. On these phones you're running machine learning models to do different things, such as image classification, or, in the case of the Pixel 6, one of the recent things they came out with is the Magic Eraser. All these models are running inside your phone itself. To kind of expand on that:
We have this idea of the traditional model, which is on your left. In this case the sensor is on a separate device, and it is picking up a ton of input data. Let's say, for a Nest thermostat, it's figuring out what the temperature is at this time in this house, and it might want to do something with that data, figure out what it should do with it, and run some inferences on it. So it sends the data to the cloud, where the model is running; the model basically goes and does some inference and then returns a result.

When you move your computation to the edge, what actually happens is that you now have these models running directly on the devices. For the Amazon Echo example from before, you're having the model run directly on the Echo itself, rather than going and running the model in the cloud. And so here, what actually happens is that now you have a bunch of different environments: you can deploy to many different edge devices that are built on different hardware and have different memory and compute resource requirements and whatnot.
So what are the benefits of actually doing this? You can see here from this picture that you are no longer egressing any data out to the cloud. Before, you had a constant stream of data coming in, and then you were sending it out to ask: okay, what should I do with this information, what is the inference that I want to make? But when you move to edge compute, all the information stays within the device itself, and a lot of the time it's just stored in memory. That helps a lot, because now you're not sending something out and waiting on the latency of that network request to come back and tell you, okay, this is what I should do. So that helps a lot with latency and with egress overall. And then you also have security and privacy benefits.
You feel more comfortable when this device is in your home and everything is kind of stored in memory, and then probably at some point eventually expired out, because the device no longer needs that information to make an inference. So there are a lot of security and privacy benefits to moving to the edge. And then lastly, you have scalability. Let's say you have millions of connected devices: in the traditional model you send all that data to your cloud, whereas here, in this case, you're actually handling it per device.
So you run your training data sets, and your model looks great: you're able to accurately detect dogs, from the example earlier today. Then you go and deploy these models to your edge devices. In this image I've labeled these boxes with different colors, because I want to make it very clear that these may not be the same architecture, they may not have the same environment, they may have completely different hardware. You're just deploying these models to these heterogeneous environments, and what can go wrong?
You had this thing running on the cloud and running inferences, and it was able to classify this dog correctly, but now you've deployed it to your iPhone, for example, and it's starting to have some problems. Or in the other case, the bottom one, you deploy it to your Android, and oh man, this model that ran really quickly when you were training it in the cloud now takes 10 seconds, and you have no idea what's going on. So you're running into all these issues, and that wasn't the case when you were running in your single cloud environment.
You're wondering: what exactly is going wrong with my model, which usually works well in the other places I've deployed it? What MLExray essentially does is give you an API that you can use to instrument your models. At the top you see an example of the Python API, and all you really need to do is tell the MLExray library to start on inference start, run your interpreter, and then mark on inference end.
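To make that concrete, here is a minimal sketch of what that kind of instrumentation could look like in Python. The mlxray module, the MLMonitor class, its constructor, and the on_inference_start/on_inference_end hooks are assumptions made for illustration, not MLExray's confirmed API; run_inference stands in for whatever model call you already have.

```python
import numpy as np
from mlxray import MLMonitor  # assumed import; real module/class names may differ

def run_inference(x):
    # stand-in for your real model call, e.g. a TFLite interpreter invocation
    return x.mean(axis=(1, 2, 3))

monitor = MLMonitor(log_path="/tmp/mlxray_log.txt")   # assumed constructor

input_tensor = np.random.rand(1, 224, 224, 3).astype(np.float32)
monitor.on_inference_start()                          # assumed hook: begin timing/logging
output = run_inference(input_tensor)
monitor.on_inference_end(input_tensor, output)        # assumed hook: record input, output, latency
```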
Some of the information that it collects is the original input of the model, the output of the model (you know, is the result correct?), and also, per layer, the input and output. It collects the end-to-end latency, so you actually know how long the whole inference took, and also, within the layers themselves, how long those individual layers took, plus things like memory, which, especially as you're moving to an edge device with lower memory and compute resources, is something you might want to hone in on. And then, in the case of the Android example, the Android API also collects other information, such as peripheral sensor information, like the orientation of the phone and the lighting detected in the room, which just helps provide more context around the model that is being run.
So now you have all this data coming in: you've instrumented your model, and all this data is coming out as you're running it, but you don't really know what to do with this information. It's like, okay, cool, this layer takes this many milliseconds and this one takes this many milliseconds; how do I actually use this information to figure out what is going wrong with the model that I've deployed on this edge device? The idea behind MLExray is that there's a set of reference pipelines.
The reference pipeline gives you the logs: this is how long each layer took, this is the approximate output and input of each layer. Then you run the same thing on your development pipeline, which gives you the same information, and then you basically do a diff between those to create a debug report, to help you figure out: okay, this is what's going on with my system, this is what's different when I've deployed to this environment versus the other one.
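As a rough illustration of that diffing step (not MLExray's actual report format), suppose each pipeline logged one record per layer with a name, a latency in milliseconds, and an output tensor; the comparison could then look something like this:

```python
import numpy as np

# One record per layer, as both pipelines might log them (assumed structure).
reference = [
    {"layer": "conv2d_1", "latency_ms": 2.1, "output": np.array([0.10, 0.90])},
    {"layer": "conv2d_2", "latency_ms": 3.0, "output": np.array([0.40, 0.60])},
]
edge = [
    {"layer": "conv2d_1", "latency_ms": 2.3, "output": np.array([0.11, 0.89])},
    {"layer": "conv2d_2", "latency_ms": 41.7, "output": np.array([0.75, 0.25])},
]

def debug_report(ref_layers, edge_layers, mse_threshold=0.01, slowdown=2.0):
    """Flag layers whose output or latency diverges from the reference run."""
    report = []
    for ref, dev in zip(ref_layers, edge_layers):
        mse = float(np.mean((ref["output"] - dev["output"]) ** 2))
        ratio = dev["latency_ms"] / ref["latency_ms"]
        if mse > mse_threshold:
            report.append(f"{ref['layer']}: output diverges (MSE={mse:.4f})")
        if ratio > slowdown:
            report.append(f"{ref['layer']}: {ratio:.1f}x slower on the edge device")
    return report

print("\n".join(debug_report(reference, edge)))
```

In this toy example the second convolution layer would be flagged for both a diverging output and a large slowdown.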
You're looking at the output and asking: is there a layer where the output is very, very different from the output received from my reference pipeline? And in the case where things are slow, you want to compare latency: well, this layer took a lot longer than the same layer in my development pipeline. You run through that, and it helps you hone in on which layer is having problems. And then finally, like I mentioned before, there are some assertion checks.
You can specify custom assertions in your code that check that inputs and outputs are what you expect. Let's say you have the self-driving case that I mentioned before, and you know that when you're running your camera, whenever you make an inference, the width of the street should always be the same. Then this assertion check would be: check that the width in the input of the model is always five feet or something, or check whatever is detected at the end.
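A custom assertion along those lines can be as simple as the following sketch; the five-foot lane width comes from the example above, while estimate_lane_width and the way the check gets wired into the pipeline are placeholders for your own code.

```python
LANE_WIDTH_FEET = 5.0    # expected lane width from the talk's example
TOLERANCE_FEET = 0.1

def estimate_lane_width(model_input):
    # stand-in: derive the lane width from the input frame however your pipeline does
    return 5.02

def assert_lane_width(model_input):
    """Fail loudly if the measured lane width drifts from what the model expects."""
    width = estimate_lane_width(model_input)
    assert abs(width - LANE_WIDTH_FEET) <= TOLERANCE_FEET, (
        f"lane width {width:.2f} ft differs from expected {LANE_WIDTH_FEET} ft; "
        "check camera calibration or the pre-processing step"
    )

assert_lane_width(model_input=None)  # passes with the stub value above
```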
So what kinds of issues can this pipeline actually help you debug? There are three that I'm going to step into in a little more detail. The first one is pre-processing errors, the next one is quantization inaccuracies, and then there are kernel optimization differences amongst heterogeneous environments; that's the case I mentioned before, where you have a bunch of different hardware and just completely different environments that your models are running on.
So the first is pre-processing errors, and I think even in a case where you're not deploying to an edge device you're going to run into this: you have something collecting information that you're using to structure the input to your model, and that's going to be different from whatever the model is expecting. This happens even more in the edge device case because, since these are all running on different environments and different hardware, your sensor might be picking up information in different ways. Or, you know, in the case where you have...
This goes back to the assertions that I mentioned before: essentially, whenever MLExray is running on your pipeline, it's going to go and run these assertions to make sure each check passes. So here in this example, this is using the Python API, and it's checking that your input is in RGB format, as expected.
If the input is accidentally coming in as BGR format, it's going to let you know: hey, your deployment pipeline is broken, you're going to need to go and add a pre-processing step to convert to RGB format. Just stepping through exactly what this code is doing: it's taking in the input from your deployment, which is called edge out, and the input from your reference pipeline, and asking, do these look the same? And if they do, okay, that's great.
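The slide's code isn't reproduced here, but a check in that spirit might compare the edge pipeline's input against the reference pipeline's input and flag a channel-order swap. The edge_out name follows the description above; everything else is an assumed, self-contained sketch:

```python
import numpy as np

def channels_match(edge_out, reference_in, tol=1e-3):
    """Return True if the edge input matches the reference input's channel order."""
    if np.allclose(edge_out, reference_in, atol=tol):
        return True
    # If reversing the channel axis makes them match, the image is likely BGR, not RGB.
    if np.allclose(edge_out[..., ::-1], reference_in, atol=tol):
        print("Input appears to be BGR; add a pre-processing step to convert to RGB.")
    return False

reference_in = np.random.rand(224, 224, 3).astype(np.float32)  # RGB reference input
edge_out = reference_in[..., ::-1]                             # simulated BGR input
print(channels_match(edge_out, reference_in))                  # False, with a BGR hint
```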
So you can see how your development pipeline is doing with respect to the reference one. Here we have two examples. The orange line is a model where we know all the weights and all the biases have been quantized and they work, and we compare that to the baseline, which is the perfect baseline model that has been trained in the cloud. We can see that the mean squared error, right there at the bottom, is pretty low, and that's doing great. And then you have this other model that you've trained and quantized, and you see that, comparing it to the baseline, the error is much higher, and so therefore I should go in and try to figure out what I need to do: do I need better training data to fix this, or what?
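That comparison boils down to a mean squared error between the quantized model's outputs and the full-precision baseline's outputs over the same inputs. A minimal sketch with stand-in model functions:

```python
import numpy as np

def float_baseline(x):
    # stand-in for the full-precision model trained in the cloud
    return x @ np.array([[0.25], [0.75]])

def quantized_edge(x):
    # stand-in for the quantized model deployed on the edge device
    return np.round(float_baseline(x) * 128) / 128  # crude simulated quantization

inputs = np.random.rand(100, 2)
mse = np.mean((float_baseline(inputs) - quantized_edge(inputs)) ** 2)
print(f"quantization MSE vs. baseline: {mse:.6f}")  # higher values suggest retraining or calibration
```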
And then the last one is very unique to edge compute, because now you're deploying to a bunch of different devices. These have different hardware requirements, and at the core of it, the kernels optimize different operations in different ways, and so this can lead to a huge latency or performance difference between devices.
You look at how long it takes to run each layer, and some of the results are pretty surprising. The quantized version of the pipeline that we used before is actually pretty slow in that second convolution step, and MLExray helps you figure out: okay, in this layer there's something wrong, and that's why it's slow, and maybe I need to deploy a special model to this particular hardware.
So I'm going to walk through a little bit of what using MLExray actually looks like. First, this is a nifty Colab that we have that just shows an example model using MLExray. The first thing you need to do is install the MLExray library, and then you want to go and create your model runner class. This is just using TensorFlow Lite, and the important thing to pick up on here is essentially this ML monitor object: you're initializing MLExray to go ahead and start logging information from each output layer, and the inputs and outputs, and then finally you go and invoke the model.
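For context, a bare-bones model runner in that shape might look like the following. The tf.lite.Interpreter calls are standard TensorFlow Lite APIs; the monitor hooks are assumed stand-ins for wherever MLExray plugs in, and the model path in the usage comment is just an example.

```python
import tensorflow as tf

class ModelRunner:
    """Minimal TFLite runner; the monitor hooks are assumed stand-ins for MLExray."""

    def __init__(self, model_path, monitor=None):
        self.interpreter = tf.lite.Interpreter(model_path=model_path)
        self.interpreter.allocate_tensors()
        self.input_details = self.interpreter.get_input_details()
        self.output_details = self.interpreter.get_output_details()
        self.monitor = monitor  # e.g. the MLExray monitor initialized earlier

    def run(self, image):
        if self.monitor:
            self.monitor.on_inference_start()             # assumed hook
        self.interpreter.set_tensor(self.input_details[0]["index"], image)
        self.interpreter.invoke()
        output = self.interpreter.get_tensor(self.output_details[0]["index"])
        if self.monitor:
            self.monitor.on_inference_end(image, output)  # assumed hook
        return output

# Usage: runner = ModelRunner("mobilenet_v2.tflite"); runner.run(preprocessed_image)
```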
The model essentially runs, and in the background MLExray has picked up a bunch of logs about how each layer is running, about the latency of each layer, all of that information. So what does that actually look like? Here's an example of an MLExray log, and there's a ton of information in here: you have the start time, you have the overall latency of how long your inference took, you have the memory usage, and you have, for each layer, all the outputs. I'm not going to keep scrolling.
This first function here goes and reads the logs in and parses them: you can see it's reading the logs, getting the keys and the values, and then, in the end, it can plot the results.
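If the log were, say, newline-delimited key/value pairs, that parse-and-plot flow could be approximated like this; the log format and key naming here are assumptions for illustration, not MLExray's actual schema or the Colab's code.

```python
import matplotlib.pyplot as plt

# Assumed log format: one "key=value" pair per line, e.g. "conv2d_1_latency_ms=2.3"
def parse_log(path):
    entries = {}
    with open(path) as f:
        for line in f:
            if "=" in line:
                key, value = line.strip().split("=", 1)
                entries[key] = float(value)
    return entries

def plot_layer_latencies(entries):
    layers = {k.replace("_latency_ms", ""): v
              for k, v in entries.items() if k.endswith("_latency_ms")}
    plt.bar(list(layers.keys()), list(layers.values()))
    plt.ylabel("latency (ms)")
    plt.title("Per-layer latency from MLExray log")
    plt.show()

plot_layer_latencies(parse_log("/tmp/mlxray_log.txt"))
```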
We used the code in the Colab to plot the results I showed earlier, back on the slide where I was comparing the differences between the output layers. So you can essentially get started with MLExray very quickly. Okay, and then jumping back to my slides... oops.
You can see that MLExray has some limitations. The first one is that you need code changes to go and enable instrumentation on your debug pipeline, and that can be annoying, because you might go deploy and then realize: oh, I forgot to add that line to go and invoke MLExray, and you have to go back in and do that. Generally, when we're doing observability, we like low-touch instrumentation. There's also a slight performance impact when you're using MLExray; obviously it's more noticeable on GPU. You're writing tons of things to logs, so that also has a memory impact, because you're storing all this data somewhere. And then, I think, we could kind of see towards the end that it was like:
Okay, I have all this data; now I need to use this Python API to go and parse it, and I can use that API to create a graph, but it kind of limits how you can actually go and visualize this information. What if you want to do more interesting things with it? Because it's not in some standard output format that you can stick into any tool you want, it's kind of hard to go and build more interesting visualizations with it.
So, how I got involved in MLExray: I worked on Pixie, as I mentioned before, and there were a lot of correlations between how we do things in Pixie that I thought could help the MLExray project. Just as a brief summary again, Pixie is an open source CNCF sandbox project for observability on Kubernetes, and there are three pillars that I think can help in the MLExray case.
The first is auto-telemetry. Pixie picks up information using tools like eBPF, without you having to go and instrument things in your application, so it just automatically starts collecting information as soon as it's deployed. That really helps in the MLExray case, where right now you have to go and add that line to say: I want to invoke MLExray and start seeing information.
This also helps in the case where you don't want this thing running on your pipeline all the time. Maybe you want it when you're debugging, but in the future, when you know it's running well, you don't want it anymore, so you're going to have to go and take that line that invokes MLExray out of your code.
We're actually going to go into this in more detail tomorrow at Kubernetes on Edge Day, so if you'd like to come by and learn some more, it would be great to see you all again. But here are some resources for MLExray. First, of course, all of this is open source: MLExray is open source, Pixie is open source. Check out the repo, check out the code, try running stuff yourself.
Audience member: Hi, thank you for the presentation, really great work. I have a question: why was the decision made to use logs to diff the layer outputs between the cloud and the edge model? For example, why not probe the actual layers, because I'm assuming you own both the edge model and the cloud model, right? Logs can run into issues, for example with formatting, and also with being really large: if your model is large, you're going to be storing large text files, and also the parsing is pretty expensive and can be error-prone.
Audience member: So I understand the Pixie telemetry model in general for, like, service monitoring. I was curious about ML model performance, and whether those data might also be interesting to aggregate and look at in a place where people usually look at ML performance comparisons, like in Weights & Biases. Do you have a picture of where those data could somehow intersect, or how you could bring them together?
Michelle: Yeah, so I guess, relating to Pixie: we use eBPF, like I said, and you can use eBPF to hook onto certain uprobes, on certain user-defined functions, and that can collect a bunch of information. You can get the arguments of that function, you can get the outputs of that function, and you can send all that data to Pixie to visualize it. I hope that answers your question; I'm not sure I got it right, but we'll be talking more about it tomorrow.