From YouTube: Building Ignite AI Platform on OpenShift Hongfei Cao & Kevin Martelli (KPMG) OpenShift Commons 2022
Description
Building the Ignite AI Platform using PostgreSQL and Kafka on OpenShift
Hongfei Cao & Kevin Martelli (KPMG)
OpenShift Commons Gathering on Databases held on 02/23/2022
Slides: https://bit.ly/3MeH4tV
Join OpenShift Commons: https://commons.openshift.org/index.html#join
Full Agenda here:
https://commons.openshift.org/gatherings/OpenShift_Commons_Gathering_on_Databases.html
A: Good morning and good afternoon, everyone. This is Hongfei from KPMG. I'm a cloud engineering director, co-presenting with Kevin Martelli; he's a principal in our cloud industry area. Our topic is how we leverage OpenShift data storage to deploy our Ignite machine learning platform.
B: Yes, as Hongfei was alluding to, as part of this database session we're having here at Commons, we plan on showing our KPMG client platform, Ignite. It's our data science, AI, and ML platform used for business, and it takes advantage of a lot of these technologies.
B: Some of the more robust models we were storing required object storage, or PVCs with a MinIO layer on top of them. That's not exactly a database, but it's a data-storage type of platform and application we thought would be interesting to share as well.
B: If you go down one slide, Hongfei. As I was mentioning, just to set the background on what KPMG Ignite is: many years ago at KPMG we built what we call our data science, AI, and ML platform, powered on top of OpenShift. It's a platform built in a very modular way to allow the use of the best pieces, no matter whether open-source, proprietary, or commercial software, that can plug in to build your use case or application.
B
Initially,
it
was
built
for
data
scientists
and
engineers.
However,
there's
a
hook
there
in
for
the
business
to
be
able
to
engage
with
and
interact
with
the
data
sets
that
are
coming
out
as
well
as
to
keep
that
human
loop
through
the
then
process.
B
And
finally,
it
was
built.
You
know
mainly
around
unlocking
the
value
of
unstructured
data.
It
since
has
changed
to
do
structure
data
as
well
as
semi-structured
data,
but
really
build
off
of
all
the
the
rich
text
that
needed
to
be
taken
out
of
these
unstructured
documents.
And
what
I
just
wanted
to
quickly
show
here
before
we
dive
into
the
details,
is
how
a
use
case
and
methodology
is
built
which
aligns
to
some
of
the
ways
that
we're
using
different
database
technologies,
so
use
cases
is
put
together
by
a
component.
B: A component can be something open source like an OCR engine, or a classification or data-extraction component, so there are many components that get strung together into a workflow to produce an output. As these components communicate back and forth, Kafka is the messaging channel, if you will, that allows them to talk to each other, and there are interfaces in that human loop so users can see the output and help retrain and re-update the models.
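The component-to-component flow described above can be sketched as a chain of named topics, with each component consuming the previous one's output. A minimal illustration in Python, with stdlib queues standing in for Kafka topics and entirely hypothetical component names:

```python
from queue import Queue

# Stand-ins for Kafka topics: one queue per component's input channel.
topics = {name: Queue() for name in ("ocr", "classify", "extract", "done")}

# Hypothetical components; each consumes a message and publishes downstream.
def ocr(doc):      return doc + "->text"
def classify(doc): return doc + "->invoice"
def extract(doc):  return doc + "->fields"

# (input topic, component, output topic) for each hop in the workflow.
pipeline = [("ocr", ocr, "classify"),
            ("classify", classify, "extract"),
            ("extract", extract, "done")]

def run(document):
    topics["ocr"].put(document)                 # seed the first topic
    for topic, component, next_topic in pipeline:
        msg = topics[topic].get()               # consume
        topics[next_topic].put(component(msg))  # publish result downstream
    return topics["done"].get()

print(run("doc1"))  # doc1->text->invoice->fields
```

In the real platform each component is a separate pod and each hop is a real Kafka topic; the point here is only the topology: components never call each other directly, they only consume and produce messages.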
B
If
you
get
that
one
slide-on
thing
and
then
finally,
this
is
the
last
slide
before
we
we
drive
into
the
content.
If
we
think
about
ignite,
we
think
about
it
as
a
layered
cake.
If
you
want
there's
sort
of
always
the
top
in
the
user
experience
part
of
it.
There's
interfaces
and
annotation
uis
and
management
consoles
of
how
people
can
interact
with
the
data
coming
out
of
the
platform.
We
have
what
we
call
in
that
middle
layer,
the
ignite
ai
platform
that
that
these
are
kind
of
like
the
ai
tooling.
B
That
enables
you
to
build
and
execute
pipelines.
As
I
was
mentioning
earlier
things
that
could
be
proprietary,
that
kpmg
has
built
where
we
call
it.
You
know
custom
type
of
capabilities,
things
that
may
be
open
sourced
in
the
market
like
atlantic
tester
act
or
things
that
we've
kind
of
built
as
part
of
our
overall,
like
drivers
of
certain
types
of
you
know
more
tactical
data
extractions
and
which
we
call
our
intelligent
domain
engine.
B
And
if
you
look
to
the
left,
it
talks
a
lot
about
some
of
the
core
fundamental
things
about
the
platform,
so
loom
is
a
way
that
we
store
data.
So
there's
a
consistency
of
where
you
put
something
into
a
particular.
You
know
component
and
how
something
come
comes
out
of
that
component
and
then
finally,
you
know,
as
one
would
expect.
B
We
have
the
the
the
orchestration
layer
which
is
really
powered
by
openshift,
and
we
have
some
workflow
engines
in
there,
but
I
wanted
to
highlight
this
core
infrastructure,
so
the
core
infrastructure
is
where
we're
going
to
focus
most
of
our
talk
on
today,
and
these
are
around
the
different,
I
would
say,
database
like
applications
that
we're
using
so
we're
using
kafka
we're
using
postgres.
You
know
we're
also
using
min
io
as
we
talked
about,
and
then
we
are
also
using
elasticsearch,
but
we
won't
go
into
that
for
timing,
but
we'll
go
through
the
types.
B
The
ways
that
we're
using
you
know
kafka
how
kafka
is
set
up
in
the
platform
pros
and
cons
and
then
we'll
also
talk
through.
You
know
how
progress
is
being
used
as
well.
A: Right, thank you, Kevin. For the rest of the presentation, let me introduce how we set up and leverage the OpenShift data store to deploy the databases for the Ignite platform, and also share some of our lessons learned, our best practices, and the benefits of deploying on top of OpenShift. The first component I'm going to introduce is Kafka. We leverage Kafka as a message broker to stream Ignite workflow metadata and job results to the multiple worker containers. To simplify, shown here is a three-node Kafka cluster with a high-availability setup, and each broker container pod has multiple persistent volume claims mounted to it. We have a customized storage class for these persistent volumes, which uses encrypted OpenShift Container Storage (OCS).
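The talk doesn't show the manifests, but a customized encrypted storage class and a broker volume claim of the kind described might look roughly like the following, expressed as Python dicts for brevity. The class name, provisioner, and sizes are assumptions, not KPMG's actual values:

```python
# Hypothetical encrypted OCS (Ceph RBD CSI) storage class; names illustrative.
storage_class = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "StorageClass",
    "metadata": {"name": "ocs-encrypted"},
    "provisioner": "openshift-storage.rbd.csi.ceph.com",
    "parameters": {"encrypted": "true"},  # per-volume encryption
    "reclaimPolicy": "Delete",
}

# PVC of the kind a Kafka broker pod would mount for its log directories.
broker_pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "data-kafka-0"},
    "spec": {
        "accessModes": ["ReadWriteOnce"],
        "storageClassName": "ocs-encrypted",
        "resources": {"requests": {"storage": "100Gi"}},
    },
}
```

Each broker gets its own `ReadWriteOnce` claim; the shared `ReadWriteMany` pattern discussed later in the talk is a different case, used for model storage.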
A: The storage setup here is mainly for the distributed Kafka message data. Also, our current version of Kafka requires ZooKeeper to store the cluster information, so we also set up a high-availability ZooKeeper cluster. As one simple example, we have three ZooKeeper nodes as a minimum quorum cluster, and each ZooKeeper node, similar to Kafka, has multiple persistent volume claims mounted to it.
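A sketch of how the three-broker, three-ZooKeeper layout could be declared, assuming a Strimzi-style `Kafka` custom resource; the talk doesn't name the operator actually used, and the cluster name, sizes, and storage class are hypothetical:

```python
# Strimzi-style Kafka custom resource (assumption: the talk doesn't name
# the operator); values here are illustrative, not KPMG's actual config.
kafka_cluster = {
    "apiVersion": "kafka.strimzi.io/v1beta2",
    "kind": "Kafka",
    "metadata": {"name": "ignite-kafka"},
    "spec": {
        "kafka": {
            "replicas": 3,  # three brokers for high availability
            "storage": {"type": "persistent-claim", "size": "100Gi",
                        "class": "ocs-encrypted", "deleteClaim": False},
        },
        "zookeeper": {
            "replicas": 3,  # minimum quorum; tolerates one node failure
            "storage": {"type": "persistent-claim", "size": "10Gi",
                        "class": "ocs-encrypted", "deleteClaim": False},
        },
    },
}
```

With three ZooKeeper nodes, the ensemble keeps a majority (two of three) through any single-node failure, which is why three is the usual minimum for a quorum cluster.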
A: An advantage is that we can build a hybrid-cloud strategy with a cloud-agnostic approach using OpenShift. Out of the box, OpenShift gives us orchestration and failover by default, through StatefulSet deployments and built-in replica management.
A: Also, the whole deployment uses an automated CI/CD workflow, which helped us significantly with Kafka restores, rolling updates, patching, etc. Last but not least, with OpenShift we can easily scale our Kafka and ZooKeeper clusters up and down based on the workload needs.
B: A component could be something like a heuristic rule that's getting information out of a document, and if you're going across hundreds of thousands of documents, you have thousands of instances of these components spun up to operate on them. There's a lot of communication and traffic going back and forth through Kafka: one component's done, the next component takes it, that component's done, and so on. All that interchange in the process of executing component one, component two, component three, component four produces some type of output.
B: That put heavy throughput demands on how Kafka needed to be deployed and configured on the platform, both to meet the SLAs that needed to be in place and to keep the tooling resilient. There were a couple of things the team worked through; I think Hongfei will talk through them, but the initial challenge was how many messages were going back and forth because of the spin-up of the pods executing those individual components for selected workloads.
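One common way to reason about this kind of throughput pressure is to size topic partitions against the expected message rate, since within a consumer group each partition is read by at most one consumer. A back-of-the-envelope helper; this is a general rule of thumb, not KPMG's actual sizing, and the numbers below are made up:

```python
import math

def partitions_needed(target_msgs_per_sec, per_consumer_msgs_per_sec):
    """Rule of thumb: at least one partition per consumer needed to keep up."""
    return math.ceil(target_msgs_per_sec / per_consumer_msgs_per_sec)

# e.g. 50k msgs/s across component pods, each consumer handling ~4k msgs/s:
print(partitions_needed(50_000, 4_000))  # 13
```

Under-partitioned topics cap how many component pods can usefully consume in parallel, which is exactly the situation that shows up when thousands of component instances spin up at once.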
A
Yep
see
you
kevin
right,
the
next
component
I'm
going
to
talk
about
this
postgres,
so
we
as
a
united
platform,
we
use
a
postquest
to
store
our
internet
workflow
metadata
as
a
traditional
data
store
similar
to
kafka.
We
also
want
to
deploy
postgres
high
availability
in
cluster
setup,
and
what
we
found
out
is
openshift
offers
a
postgres
operator.
You
know
through
the
vendor,
you
know
the
implementation,
so
it
significantly
reduce
the
complexity
of
deploying
the
high
variability
postgres
cluster.
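The vendor isn't named in the talk; one Postgres operator commonly available on OpenShift is Crunchy Data's, whose `PostgresCluster` resource might be sketched roughly like this (cluster name, version, and sizes are illustrative, not KPMG's configuration):

```python
# Crunchy-style PostgresCluster resource (assumption: the talk only says
# the operator comes "through the vendor"); all values are illustrative.
postgres_cluster = {
    "apiVersion": "postgres-operator.crunchydata.com/v1beta1",
    "kind": "PostgresCluster",
    "metadata": {"name": "ignite-metadata"},
    "spec": {
        "postgresVersion": 14,
        "instances": [{
            "name": "instance1",
            "replicas": 2,  # primary plus replica for HA failover
            "dataVolumeClaimSpec": {
                "accessModes": ["ReadWriteOnce"],
                "storageClassName": "ocs-encrypted",  # hypothetical class
                "resources": {"requests": {"storage": "20Gi"}},
            },
        }],
    },
}
```

The operator's value here is exactly what the talk describes: declaring the desired replica count and storage, and letting the operator handle provisioning, failover, and upgrades.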
A: Also, we have built some customized solutions for backing up the Postgres data, which leverage MinIO object storage as a landing zone. We dump the Postgres data there, and once a Postgres cluster is backed up or restored, we can share the data across clusters.
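The dump-and-land pattern described might be sketched as follows; the database, bucket, and alias names are hypothetical, `pg_dump` is the standard PostgreSQL dump tool, and `mc` is the MinIO client:

```python
from datetime import date

def backup_commands(db, bucket, alias="minio"):
    """Build the pg_dump + MinIO upload commands for one database.

    Names are illustrative; in practice these would run inside a backup
    job pod with credentials mounted from a secret.
    """
    dump_file = f"{db}-{date.today().isoformat()}.dump"
    dump = ["pg_dump", "--format=custom", f"--dbname={db}",
            f"--file={dump_file}"]
    upload = ["mc", "cp", dump_file, f"{alias}/{bucket}/{dump_file}"]
    return dump, upload

dump, upload = backup_commands("ignite", "pg-backups")
print(dump[0], upload[0])  # pg_dump mc
```

Restore runs the same path in reverse: `mc cp` the dump back down from the landing zone, then `pg_restore` it into the target cluster.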
A
You
know
the
to
the
different
cluster
or
you
know
you
know
backup,
restores
the
data
to
the
new
postgres
cluster
when
we
deploy
the
postgres
on
openshift,
we
found
you
know
below
advantage
right
benefits,
including
the
easy
deployment
through
the
operator,
and
it
has
a
you
know,
a
very
good
integration
with
storage
support.
A
Also,
the
building
enterprise
grade
level
high
availability
in
orchestration
failover
help
significantly
on
the
database
deployment
it
also
similar
to
kafka.
It
provides
a
cloud
agnostic,
hybrid
cloud
approach
of
the
deployment,
and
you
know
easy
migration
csd
high
integrated
using
the
existing
ccd
like
jenkins
ansible
paper
gorilla
tikton,
so
it
can
even
reduce
riser
deployment
time
last
but
not
least,
the
building
security
module
to
support
the
policy
and
hardening
our
deployment.
A
Next,
I'm
going
to
quickly
talk
about.
Another
type
of
you
know:
storage,
we're
duty
library
for
ignite
machine
learning
model
different
from
postgres
kafka.
Here
we
directly
leverage
the
standalone
precision
volume
claim
running
on
top
of
the
openshift
cluster
storage.
Like
many
other
machine
learning
platform,
ecosystems
ignite
also
has
a
model
database
or
model
inventory
to
store
the
trained
model,
and
sometimes
the
model
could
be
a
very
large
scale.
A: If it involves, for example, a deep learning or neural-network model, it can be several gigabytes in size. To speed up the model prediction or classification process when we serve the model, and to reduce downloading the model from the model database or model inventory each time, we set up a centralized, shared ReadWriteMany persistent volume claim to store those large objects and models, which is then shared across multiple machine learning worker containers or pods. This minimizes data download time, since there is only a one-time data load, and it significantly reduces network traffic between the model database and the OpenShift cluster. And given that the model itself is relatively static compared to the other data we store in Kafka or Postgres, we can do a separate deployment that loads the model at the beginning of the model-serving job, and it only requires infrequent updates.
A: We have a separate deployment job for those model updates. The right-hand side shows that before the deployment, we mount the ReadWriteMany persistent volume claim to our deployment pod, and it downloads the model from MLflow, which serves as our model inventory. Once the model is persisted there, any model-serving pod or worker job can mount this persistent volume claim as ReadWriteMany and avoid the repeated network traffic.
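The one-time-download pattern can be sketched like this; the claim name is hypothetical, and the `fetch` callable stands in for the MLflow download, which the talk doesn't detail:

```python
from pathlib import Path
import tempfile

# Shared ReadWriteMany claim that every model-serving pod mounts at one path.
model_pvc = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": "shared-models"},  # hypothetical name
    "spec": {"accessModes": ["ReadWriteMany"],
             "resources": {"requests": {"storage": "50Gi"}}},
}

def load_model(name, cache_dir, fetch):
    """Return the model path, downloading only if the shared volume lacks it."""
    path = Path(cache_dir) / name
    if not path.exists():              # first pod pays the download cost ...
        path.write_bytes(fetch(name))
    return path                        # ... later pods just read the volume

# Usage sketch with a fake fetcher counting how often a download happens.
calls = []
def fake_fetch(name):
    calls.append(name)
    return b"weights"

with tempfile.TemporaryDirectory() as d:
    load_model("ner.bin", d, fake_fetch)
    load_model("ner.bin", d, fake_fetch)
print(len(calls))  # 1
```

Because the volume is `ReadWriteMany`, the deployment job writes once and every serving pod reads the same files, which is what keeps the multi-gigabyte models off the network after the first load.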
A
Last
but
not
least,
I'm
going
to
quickly
touch
on
the
object,
storage
setup
inside
ignite,
so
we
also
leveraged
the
miao
as
a
hardware
file
system
on
top
of
the
openshift
ocs
storage
container.
Here
the
miao
is
occupied
as
a
state
force
assad
and
each
male
safer
side
has
a
multiple
versus
volume
claim,
with
a
customized
storage
class
to
benefit
us
ignite.
The
miao
is
support,
supports
rerun
money
and
has
the
api
with
a
secured
access
key
to
allow.
A
You
know
different
worker
container
to
access
the
mail
data.
For
example,
we
can
store
the
runtime
log,
organize
job
input,
you
know
the
documentation
list,
etc.
On
top
of
the
mail
as
our
shared
object,
storage,
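A rough sketch of the MinIO StatefulSet shape described, again as a Python dict; the replica count, secret name, and storage class are assumptions, not KPMG's actual values:

```python
# Hypothetical distributed MinIO StatefulSet; replica count, secret name,
# and storage class are illustrative only.
minio_statefulset = {
    "apiVersion": "apps/v1",
    "kind": "StatefulSet",
    "metadata": {"name": "minio"},
    "spec": {
        "serviceName": "minio",
        "replicas": 4,
        "selector": {"matchLabels": {"app": "minio"}},
        "template": {
            "metadata": {"labels": {"app": "minio"}},
            "spec": {"containers": [{
                "name": "minio",
                "image": "minio/minio",
                # Distributed mode spanning the four pods' volumes.
                "args": ["server", "http://minio-{0...3}.minio/data"],
                # Access/secret keys come from a secret, not the manifest.
                "envFrom": [{"secretRef": {"name": "minio-access-keys"}}],
                "volumeMounts": [{"name": "data", "mountPath": "/data"}],
            }]},
        },
        "volumeClaimTemplates": [{
            "metadata": {"name": "data"},
            "spec": {"accessModes": ["ReadWriteOnce"],
                     "storageClassName": "ocs-encrypted",
                     "resources": {"requests": {"storage": "500Gi"}}},
        }],
    },
}
```

Each pod's claim is `ReadWriteOnce`; the shared, many-clients access is provided by MinIO's S3-style API on top, which is how the worker containers reach the same data concurrently.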
A: Okay, so finally, to conclude our Ignite deployment-on-OpenShift work: we found that leveraging OpenShift, especially its operators, is key for our enterprise-grade deployment of the Postgres and machine-learning platform storage. We found OpenShift offers a lot of out-of-the-box functionality to support high availability, failover, and CI/CD pipelines. Also, to have better high-availability support, we prefer to deploy our platform to multiple clusters in different regions and data centers. Enabling the ReadWriteMany persistent volume claim is the key to reducing our network traffic for large-scale machine learning.
A: That applies to pre-trained models, like deep learning models for NLP, shared across multiple NLP or model-serving jobs. Customized persistent-volume-claim backup utilities are also key to helping us quickly rotate or update our existing databases like Postgres or Kafka.
A: Last but not least, migrating from the old storage class to the OCS encrypted storage class gave us better throughput and encryption from the OpenShift storage perspective.