Description
Bidirectional Encoder Representations from Transformers (BERT) is currently one of the most widely used NLP models. The combination of OpenDataHub, the Intel® oneAPI AI Analytics Toolkit (AI Kit), and the OpenVINO Toolkit helps operationalize models like BERT following MLOps best practices. As a starting point, OpenDataHub provides a notebook-as-a-service environment through its JupyterHub implementation. We will show how data scientists, using custom resources, can initiate training of BERT models using AI Kit images with Intel-optimized deep learning frameworks like PyTorch and TensorFlow. OpenVINO integrations with OpenDataHub augment its image catalog with pre-validated notebook images for optimizing models like BERT and, optionally, fine-tuning them for lower-precision inference. Finally, we detail how to operationalize optimized and scalable inference on a multi-node Xeon CPU cluster using OpenVINO Model Server and the Istio service mesh.
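As a minimal sketch of the inference step, the snippet below builds a predict request in the TensorFlow-Serving-compatible REST format that OpenVINO Model Server exposes. The service URL, model name `bert`, and input tensor names (`input_ids`, `attention_mask`) are illustrative assumptions, not values from this article; a real deployment would use the names defined in the served model.

```python
import json

# Hypothetical endpoint: OVMS serves a TF-Serving-compatible REST API at
# /v1/models/<name>:predict. The model name "bert" is an assumption.
OVMS_URL = "http://ovms-service:8080/v1/models/bert:predict"


def build_predict_request(input_ids, attention_mask):
    """Build a TF-Serving-style JSON predict payload for a BERT model.

    Tensor names are assumptions for illustration; match them to the
    inputs of the model actually deployed on the model server.
    """
    return json.dumps({
        "inputs": {
            "input_ids": [input_ids],
            "attention_mask": [attention_mask],
        }
    })


# Example payload for a short tokenized sequence ([CLS] hello [SEP]).
payload = build_predict_request([101, 7592, 102], [1, 1, 1])

# With a live model server, this payload would be POSTed, e.g.:
#   import requests
#   response = requests.post(OVMS_URL, data=payload)
print(payload)
```

Behind the Istio service mesh described above, the same request would be sent to the mesh ingress, which load-balances across model server replicas on the CPU cluster.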