From YouTube: How to Build Data Science Pipelines with OpenShift using Ceph, Kafka and Knative Guillaume Moutier
Description
How to Build Data Science Pipelines with OpenShift using Ceph, Kafka and Knative
Guillaume Moutier (Red Hat)
OpenShift Commons Gathering on Data Science
January 28, 2021
https://commons.openshift.org/gatherings/OpenShift_Commons_Gathering_on_Data_Science.html
To find out more about OpenShift Commons, visit: https://commons.openshift.org
The most important thing that I want you to retain from this presentation is to embrace the cloud native way of doing things. I'm not saying that you must run everything in the cloud, no, not at all. But the architectural patterns that we have seen emerge from cloud services can apply anywhere, including on-prem, and this is exactly what helps a lot with data pipelines.
But first things first: how do we define this cloud native approach? Here's my totally, very opinionated short list of characteristics that we must aim for in a cloud native data platform. First is agility and elasticity: tools, frameworks, and data sets evolve constantly and very rapidly, so you must be able to adapt your infrastructure accordingly. Then, cloud standards: it's important to avoid any vendor lock-in with proprietary tools and formats, and we must embrace widely recognized open source protocols and standards. And finally, hybrid cloud architecture.
What you are designing in terms of architecture must run anywhere without any change, or maybe with some small configurations that you can adapt, but without changing the architecture itself.
What I don't like about this is the coupling that you have at different points: the user who has to mount, let's say, a P: drive on their computer, and the same thing on the server side, the application, which relies on a very specific configuration of the application server.
The use case in this demo is pneumonia detection from chest x-rays using an automated data pipeline. So imagine the problem is this one: we have some x-ray images to review, some from people with pneumonia and some from people with normal chest x-rays, and we want to automate this process.
So, of course, we think that an AI/ML model can help, and we can use tools that are provided by Open Data Hub, for example Jupyter notebooks and TensorFlow. We can train a model to do some inferencing on those images and determine whether the new images that we want to process come from people with pneumonia or not.
So we have this model, but it has to scale, so we have to automate it. The question now is: how can we analyze those images as they come in, in a continuous flow of thousands of images? And if we want to retrain the model and redeploy it seamlessly at various locations simultaneously, how can we do that efficiently?
So here is our demo environment. Let's say we are at a hospital and we are generating new x-ray images. What we will do is send all those images into a Ceph bucket that has been instantiated by OpenShift Container Storage, and this bucket has been configured to send notifications whenever a new image comes in.
Those notifications will be sent to a Kafka topic that is linked to a Knative Eventing and Serving function, and the container that is spawned when a new message comes in will do a risk assessment on this new image, basically using the model that we have trained to try to infer whether there is a risk of pneumonia or not. In a standard production scenario, all results would be sent to a doctor, but here I added a special step because, of course, not all models are totally perfect.
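Ceph bucket notifications follow the S3 event record format, so the first job of the function behind the Kafka topic is typically to pull the bucket name and object key out of each message. A minimal sketch of that parsing step, assuming the standard S3 `Records` layout (the sample payload and function name are mine, not from the demo):

```python
import json

def extract_object_refs(message_value: str):
    """Parse an S3-style bucket-notification payload and return
    (bucket, key) pairs for every record it contains."""
    payload = json.loads(message_value)
    refs = []
    for record in payload.get("Records", []):
        s3 = record["s3"]
        refs.append((s3["bucket"]["name"], s3["object"]["key"]))
    return refs

# Hypothetical notification, shaped like a Ceph RGW / S3 event
sample = json.dumps({
    "Records": [{
        "eventName": "ObjectCreated:Put",
        "s3": {
            "bucket": {"name": "incoming-images"},
            "object": {"key": "xray-0001.jpg"},
        },
    }]
})

print(extract_object_refs(sample))  # [('incoming-images', 'xray-0001.jpg')]
```

With the key in hand, the function can fetch the image from the bucket and run inference on it.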
What the process will do is anonymize the image, so that it can be further processed in a central data science lab, for example. And normally, again, in a standard production environment, you would have a doctor, a specialist, doing a manual assessment and classification of the images for which the model was not able to infer a result.
The image would be classified as a risk of pneumonia or as normal, and this would trigger a retraining of the model, which could then be re-injected back to our hospital of origin through a standard OpenShift CI/CD. This second part, with the model retraining and everything, we won't see in the demo. It's not implemented because training a model like this takes a certain time, but I have a way to simulate a new model being used to do those inferences, to make the link with the scenario that we described before.
We can imagine that in multiple hospitals the same model is being used to do inferencing on images, and again, the images the model is not so sure about would be anonymized and sent here for further processing. Okay, let's see that live. Let me walk you through the environment I have prepared. Here is my OpenShift cluster, where I have a few things that I have created for this demo.
First, there is this deployment config of what I call the image generator. This is a container that, well, you know, in fact, it won't really generate x-ray images. It will just randomly pick from source x-ray images that I have in a bucket and copy them to an incoming bucket. We can see here that the image generator is deployed. There is one pod running: the blue circle here indicates that this deployment configuration is up and running, but at this moment it doesn't do anything.
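The generator's loop is simple enough to sketch. Here plain dicts stand in for the two buckets, and `copy_random_image` is a hypothetical name; the real container presumably talks to Ceph over the S3 API, but the pick-and-copy logic would look much the same:

```python
import random

def copy_random_image(source_bucket: dict, incoming_bucket: dict) -> str:
    """Pick a random object from the source bucket and copy it,
    under a fresh unique name, into the incoming bucket."""
    key = random.choice(list(source_bucket))
    new_key = f"img-{len(incoming_bucket):06d}-{key}"
    incoming_bucket[new_key] = source_bucket[key]
    return new_key

source = {"normal-01.jpg": b"...", "pneumonia-01.jpg": b"..."}
incoming = {}
for _ in range(5):  # in the demo this runs once per second
    copy_random_image(source, incoming)
print(len(incoming))  # 5
```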
So what this container is doing is just listening to a Kafka topic and waiting for some messages to come in, and we can see here that this Kafka source is connected, is linked, to this serverless Knative service, you can see the logo here, the Knative service that is called risk-assessment.
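The wiring between the topic and the service is done with a KafkaSource object. A sketch of what it might look like for this demo (the names, topic, and bootstrap address are assumptions, not taken from the actual cluster):

```yaml
apiVersion: sources.knative.dev/v1beta1
kind: KafkaSource
metadata:
  name: xray-images-source
spec:
  bootstrapServers:
    - my-cluster-kafka-bootstrap:9092
  topics:
    - xray-images
  sink:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: risk-assessment
```

Every message on the topic is then delivered as an event to the risk-assessment service, which Knative scales up from zero on demand.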
So this is a full serverless deployment, meaning that right now, as there is nothing to process, it's just sitting idle. We can see here there's no blue circle around the container, meaning that it is scaled down to zero: there is no instance of the risk-assessment container running.
I have a few other things that are deployed. First is my Kafka cluster, deployed through the AMQ Streams operator. It's very basic here, only one instance each of Kafka and ZooKeeper, a totally ephemeral Kafka cluster. Please don't do this at home: normally you don't want to run with only one instance of each, but for resource purposes here it's enough for what we want to do.
So all the notifications from the Ceph bucket that receives the images will be sent to a topic in this Kafka cluster, and it is this cluster that our Kafka source subscriber is listening to, on the specific topic we want. We also have a deployment of Grafana, with its own operator. That's a dashboard that will allow us to see what's going on. And I have also a few helpers here.
I have a small database, a very basic MariaDB database, where I will record the names and timestamps of the images as they come in, are processed, or are anonymized, and this is what we will also display in the Grafana dashboard. And finally, I have a small image server: as you will see on the dashboard, we will display the images directly as they come in. So here is everything that I have deployed, and we are now ready to launch the image generation.
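The helper database just keeps one row per pipeline event. A sketch of that recording step, using the standard library's `sqlite3` as a stand-in for the demo's MariaDB (the table and column names are my assumptions):

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")  # MariaDB in the real demo
conn.execute(
    "CREATE TABLE image_events ("
    " image_name TEXT, stage TEXT, event_time TEXT)"
)

def record_event(conn, image_name: str, stage: str) -> None:
    """Insert one row per stage: 'uploaded', 'processed', or 'anonymized'.
    The dashboard counters are then simple COUNT(*) queries per stage."""
    conn.execute(
        "INSERT INTO image_events VALUES (?, ?, ?)",
        (image_name, stage, datetime.now(timezone.utc).isoformat()),
    )

record_event(conn, "xray-0001.jpg", "uploaded")
record_event(conn, "xray-0001.jpg", "processed")
count = conn.execute("SELECT COUNT(*) FROM image_events").fetchone()[0]
print(count)  # 2
```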
What I will do now is launch the demo, and to do that I will patch the image generator. Remember, its rate parameter is set to 0 to idle it; I will now set it to 1, which means that a new image will be generated, will be copied, into our incoming bucket every second. Let's launch that. So I've launched the command, and the image generator will be patched with this new version. We can see here that it has already deployed; it went very fast.
So here is the Grafana dashboard that represents, in real time, what's happening in our pipeline. On the top left here we have a summarized schema of this pipeline, so we can see the images being sent to an incoming bucket here, and we have the counter of the number of images that have been uploaded so far.
Then, as notifications are sent to the Kafka topic and the risk-assessment container is launched, we have the number of images processed. And again, if the certainty of our model is less than 80%, then we will have another function that will anonymize those images. Okay, so we can see that the pipeline is running. On the right side we have the list of the last 10 uploaded images. And don't worry, those are totally randomly generated names, birth dates, and other personal information; those are not real patients.
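The 80% rule is easy to state as code. A sketch of the routing decision (the function name and return labels are mine, not from the demo):

```python
def route_assessment(label: str, confidence: float,
                     threshold: float = 0.80) -> str:
    """Decide what happens to an image after inference.

    Below the confidence threshold the model is 'unsure', so the
    image goes to anonymization for review in the central data
    science lab; otherwise the model's own label stands."""
    if confidence < threshold:
        return "anonymize"
    return label

print(route_assessment("pneumonia", 1.00))  # pneumonia
print(route_assessment("normal", 0.91))     # normal
print(route_assessment("pneumonia", 0.55))  # anonymize
```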
Here we have also, again, the list of the last 10 uploaded images, then the last 10 processed images, and we'll see in a few seconds what is happening to those images, and then the last 10 anonymized images. We have some counters on the left side: the CPU and RAM usage, which you can see has increased, because now we have some processing to do, and the number of risk-assessment containers which have been launched so far to be able to handle the load.
Again, this is something that is automatically scaled by OpenShift Serverless. Then we have here the risk distribution: so far, among all the images that have been uploaded, we have the distribution between the ones that have been assessed as normal, as a risk of pneumonia, or as unsure. We have here, in this small graph, the number of images that have been processed by model version, and we'll see in a few seconds what happens when you change the model. And we have here a counter of the number of deployments of the risk-assessment pods.
Okay, while I explain to you what is happening to the images, I will do two things. First is to increase the rate at which the images are sent; so far, it's only one per second. And I will also change a parameter here, simulating that we will have a model v2 now that will be used to do this processing. So I will do the first patch here, and then the second patch.
So, while my containers are being updated to reflect those changes, let's have a look at our images. Here I have another special dashboard with a bigger version of the displayed images, and maybe I will wait for another one to refresh so that we can see better.
It refreshes every five seconds. Okay, let's stop here. So what happens? This is a base image, one of the images that I have prepared beforehand. I have about 800 of those images, which are chest x-rays with some personal information printed on them; those are, as I said, randomly generated information. When a risk assessment is made by the model, what my processing container does is write, on top of the image, the assessment that has been made: here, a risk of pneumonia, with the confidence level at which the model made this assessment, so risk 100%. So the model is pretty sure that there is a risk of pneumonia for this specific x-ray image. But when the model is not sure and the risk confidence is less than 80%, what we do, and you can also see it here, is blur the personal information that was on this specific x-ray.
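The blurring itself operates on the image pixels. The real container presumably uses an imaging library, but as a stdlib-only illustration, here is a sketch that redacts a rectangular region of a grayscale image represented as a list of rows (function name and coordinates are mine):

```python
def redact_region(image, top, left, height, width, fill=0):
    """Overwrite a rectangular region of a 2D pixel grid with a
    constant value -- a crude stand-in for blurring the printed
    personal information on the x-ray."""
    for r in range(top, min(top + height, len(image))):
        for c in range(left, min(left + width, len(image[r]))):
            image[r][c] = fill
    return image

img = [[255] * 6 for _ in range(4)]   # tiny all-white "x-ray"
redact_region(img, 0, 0, 2, 3)        # blank the top-left label area
print(img[0][:4])  # [0, 0, 0, 255]
print(img[3][0])   # 255 (region outside the rectangle is untouched)
```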
First, the usage of CPU and RAM has further increased because, if you remember, I increased the rate at which the images are processed a lot; now it's 10 per second. So here those counters are growing much faster, and we can see that OpenShift has done its magic and automatically scaled the number of containers, the number of pods, it needs to be able to handle the load. Of course, we can see here that many more images have been assessed.
And at the same time, we can see here that I am now using the v2 model to make the risk assessment. So here, with this model change, I am simulating that, following image anonymization and manual classification in the central data science lab, a model has been retrained and pushed back here to our hospital, so that it can be used from now on.