From YouTube: All Things Data: JupyterHub on-demand & other tools w/ Red Hat's Guillaume Moutier & Landon LaSmith
Description
As part of the “All Things Data” series of briefings, Red Hat’s Guillaume Moutier and Landon LaSmith will be demoing how to easily integrate Open Data Hub and OpenShift Container Storage to build your own data science platform. When working on data science projects, it’s a guarantee that you will need different kinds of storage for your data: block, file, and object.
Open Data Hub (ODH) is an open source project that provides open source AI tools for running large and distributed AI workloads on OpenShift Container Platform.
OpenShift Container Storage (OCS) is software-defined storage for containers that provides you with every type of storage you need, from a simple, single source.
The Open Data Hub operator is a meta-operator, because it is an operator that deploys other operators or components, which can include the different products that are available in Open Data Hub. Jupyter will be one of the main ones featured during this talk. It's also a project that we're running internally at Red Hat for our internal data science and AI platform.
This is to make it easier to control the entire lifecycle, from data ingestion to data transformation to modeling, and to make it easy for data engineers and data scientists to just run their workflows on OpenShift, so it's an entire end-to-end process. We have support for object storage through Ceph object storage; we've hosted many workshops and workflows, and that's our storage of choice. Spark, and JupyterHub with TensorFlow support, are the main projects being used by the data scientists.
This is just an overview of the main features of Open Data Hub, and a breakdown of the different components that satisfy the needs of our target audience. You'll see that for storage we support all the different types: Ceph object storage, and Postgres and MySQL database access. You can interact with those using our data catalog components, which are super central, and for the data scientists we have support for a lot of the major libraries that are being used in their workflows.
One of the problems we solve is for our teams of data scientists, developers, and engineers working together: a common platform that takes care of all the deployment headaches and upgrades that come with running these different products. We want to make sure it's easy and intuitive to use, while eliminating a lot of the maintenance needs.
So here is a list of the major components that are available in Open Data Hub. The current version is 0.5.1, and it's available in the OpenShift OperatorHub, so you can deploy it now and test it out. We have Prometheus and Grafana for monitoring, Seldon for serving your models, Spark for data engineering and data analytics, JupyterHub as one of our core components to support multi-tenant Jupyter notebooks, Ceph for our object storage, Kafka for streaming events, and Argo for our pipelines.
Open Data Hub has support for GPU workloads, so on opendatahub.io we have docs for enabling your GPU nodes in your OpenShift cluster. We're actively working to bring Open Data Hub in line with Kubeflow, so you can stay tuned for more information about that. And then there are upstream components: what we're calling upstream are things that originate outside of Red Hat that we want to work with, so we don't customize them.
A common use case: Jupyter as a service. This is the main entry point into Open Data Hub; Jupyter notebooks are what a lot of the teams of data scientists and engineers are using to interact with their data. They have data coming in from multiple sources, from events to object notifications, and they want to be able to do some model training on that using GPUs.
So it's self-service: with our teams of data scientists and data engineers, we can just point them to a link to access a Jupyter notebook. It's completely customized, so if we have a team that wants to use TensorFlow notebooks, or the same notebooks with GPU enablement or GPU access, they can. They can go through the whole development lifecycle of modeling, testing, and iterating, with full access to the resources that are available in the cluster.
And there's support for multi-tenancy. Open Data Hub runs within a namespace, so if you have different needs, restrictions, or capabilities for different teams, you can run these in independent namespaces within your cluster, pooling and sharing your resources in the OpenShift model: you can request the resources available in the cluster that are specific to your needs. JupyterHub and Open Data Hub have full support for that, and it's multi-tenant. So that's a quick intro to Open Data Hub; if you want more information, please go to opendatahub.io.
We have a lot of getting-started guides and tutorials on GPU enablement, spinning up a Ceph object storage cluster, and using the different components of Open Data Hub. Our Open Data Hub group is currently available on GitLab, so you can follow that to get more information about development of the operator. We have an ODH-Kubeflow GitHub project that we're currently transitioning to; you can find more information about that initiative and join our mailing lists.
If you want to get updates whenever we release new versions of the operator, or on the project in general, feel free to sign up for those. We also give a lot of talks and workshops at conferences, so you can find most of our videos from different conferences on the AI/ML playlist on the OpenShift Commons YouTube channel.
Thank you, Landon. Hi everyone. So as Landon was explaining, Open Data Hub is a fantastic tool for your data science platform. But of course, if you want to play with data science, you have to get data, and that means you have to store it somewhere. That's where OpenShift Container Storage can come into play and help you. OpenShift Container Storage was released a few weeks ago, and basically what it does is bring you all the different types of storage.
What we have to retain here for this demo is that with OCS deployed, you can directly have block, file, and object storage from within OpenShift, using the same tools as you usually do with Kubernetes: making claims using standard YAML files to provision all the different types of storage that you need. So what we'll do here is leverage OCS to provide different types of storage directly inside Open Data Hub, and especially inside JupyterHub.
Of course, you could do it manually, that is, provisioning the object storage for each user, but what I wanted to do in this demo is push it a little bit further, to demonstrate how everything can be fully automated in such a platform. And before someone asks: all the code is available at this repo, which means you will be able to reproduce it and adapt bits and pieces of what I did to suit your own case.
What I want to do here is have my Jupyter environment provide two types of storage. For my standard files that we see here, the notebooks and data files I want to use, let's call it standard storage: I will make a persistent volume claim that will use a storage class provided directly by OCS. Here it will be block storage that will be automatically provisioned for a new user.
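The per-user block-storage claim described above can be sketched as a small manifest builder. This is a minimal sketch, assuming a hypothetical PVC naming scheme and the usual OCS RBD storage-class name; your repo may use different names:

```python
# Hypothetical sketch: the PersistentVolumeClaim that gets made for each
# new user's workspace. The storage-class name and PVC name pattern are
# assumptions for illustration, not necessarily what the demo code uses.

def user_pvc_manifest(username, size="2Gi",
                      storage_class="ocs-storagecluster-ceph-rbd"):
    """Return a PVC manifest provisioning block storage for one user."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": f"jupyterhub-nb-{username}-pvc"},
        "spec": {
            "accessModes": ["ReadWriteOnce"],   # block storage: one pod at a time
            "storageClassName": storage_class,  # backed by OCS Ceph RBD
            "resources": {"requests": {"storage": size}},
        },
    }

# Build the claim for a new user; submitting it (oc apply / the kubernetes
# client) is what actually triggers the automatic provisioning.
pvc = user_pvc_manifest("nicole")
```

The 2Gi default matches the size we'll see provisioned for Nicole later in the demo.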
C
This
is
what
we're
gonna
do
so
for
that
only
two
four
requisites
to
have
a
CH
installed,
of
course,
and
yes
to
endpoint
that
you
that
you
will
use,
we
have
to
do
to
take
note
of
it
and
a
project
where
open
data
is
deployed,
which
is
quite
easy
to
do
first
week
twice
a
week.
Try
right
now
because
of
the
operator.
That
is,
that
is
so
great.
It's
a
very
easy
way
to
deploy
your
data
science
platform
once
you
have
both
of
this,
what
we
will
do
is
use
a
custom
Jupiter
up
config.
This is some configuration that will be appended at the end of the JupyterHub deployment. What this code will do is, each time a user logs in and launches their notebooks, create a new object bucket claim if there is none present, retrieve the configuration, the access and secret keys for that specific user, and inject everything as environment variables into the user's pod when it is spawned.
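The per-user flow just described can be sketched as two pieces: the ObjectBucketClaim manifest, and the environment built from the resulting secret. This is a hedged sketch, not the repo's actual code; `secret_lookup` stands in for a real Kubernetes API call, and all names (OBC name pattern, env-var names, storage class) are assumptions:

```python
# Hypothetical sketch of the custom JupyterHub hook: ensure an
# ObjectBucketClaim exists for the user, then inject the bucket's
# credentials as environment variables into the spawned pod.

def obc_manifest(username):
    """ObjectBucketClaim requesting a NooBaa-provisioned bucket for one user."""
    return {
        "apiVersion": "objectbucket.io/v1alpha1",
        "kind": "ObjectBucketClaim",
        "metadata": {"name": f"odh-bucket-{username}"},
        "spec": {"generateBucketName": f"odh-bucket-{username}",
                 "storageClassName": "openshift-storage.noobaa.io"},
    }

def s3_env_for_user(username, secret_lookup):
    """Build the env vars to inject; secret_lookup reads the OBC's Secret."""
    creds = secret_lookup(f"odh-bucket-{username}")
    return {
        "AWS_ACCESS_KEY_ID": creds["AWS_ACCESS_KEY_ID"],
        "AWS_SECRET_ACCESS_KEY": creds["AWS_SECRET_ACCESS_KEY"],
        "S3_ENDPOINT_URL": "http://s3.openshift-storage.svc",
        "JUPYTERHUB_USER_BUCKET": f"odh-bucket-{username}",
    }

# Example with a fake secret store standing in for the cluster:
fake_secrets = {"odh-bucket-nicole": {"AWS_ACCESS_KEY_ID": "AKIA...",
                                      "AWS_SECRET_ACCESS_KEY": "abc123"}}
env = s3_env_for_user("nicole", fake_secrets.get)
```

In the real deployment this logic runs inside the spawner configuration, so the user never sees any of it: the variables simply appear in their notebook environment.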
Then we will deploy Open Data Hub itself with some specific configuration. Here we will use the custom config map that we created before, which will do all the things I explained. We will enter the S3 endpoint URL; if you have deployed with the standard defaults, it's as simple as s3.openshift-storage, which points to the namespace in which OCS is deployed. And we will indicate the storage class to use when creating the standard PVs for users.
Of course, here is the command line for the code that is available in the repo. We will also have to create some roles and role bindings, because our special code will create new config maps and will need access to some secrets, where the access keys and secret keys for the users will be stored.
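The RBAC just mentioned can be sketched as a Role plus RoleBinding granting the hub's service account those permissions. A minimal sketch with illustrative names; the actual role names, verbs, and service account are defined in the repo:

```python
# Hypothetical sketch: RBAC so the JupyterHub service account can create
# ConfigMaps/ObjectBucketClaims and read the generated Secrets.
# All names here are assumptions for illustration.

def hub_role_manifest(namespace):
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": "jupyterhub-obc-manager", "namespace": namespace},
        "rules": [
            {"apiGroups": [""], "resources": ["configmaps", "secrets"],
             "verbs": ["get", "list", "create"]},
            {"apiGroups": ["objectbucket.io"],
             "resources": ["objectbucketclaims"],
             "verbs": ["get", "list", "create"]},
        ],
    }

def hub_role_binding_manifest(namespace, service_account="jupyterhub"):
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": "jupyterhub-obc-manager", "namespace": namespace},
        "roleRef": {"apiGroup": "rbac.authorization.k8s.io",
                    "kind": "Role", "name": "jupyterhub-obc-manager"},
        "subjects": [{"kind": "ServiceAccount", "name": service_account,
                      "namespace": namespace}],
    }
```

Keeping this a namespaced Role (rather than a ClusterRole) fits the multi-tenant model Landon described: each Open Data Hub namespace carries its own permissions.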
So let's see it in action. Here I have my Open Data Hub project in OpenShift, and we can see that the operator is already deployed, along with the JupyterHub instance, so it's ready for us to use. If I go to the JupyterHub route, the route that was created when JupyterHub was deployed, I can log in with OpenShift, and here I created a bunch of fake users.
We'll start with a new one, Nicole, who has never connected to Open Data Hub, so let's log in as her. Here you can see the different notebook images that we can choose from, and we will choose the custom one that we provisioned before, which I called minimal-s3, because we added some automatic connection to S3 here.
I don't have to enter anything, because it will be automatically provisioned and injected inside the environment, and then I will spawn my notebook, which will take a few seconds. If we go back here, we can see that there is a new container creating; that's the notebook environment for the user.
We'll have to wait a few seconds. Okay, so now it's running, and there we have it: that's the environment that was just created for Nicole. We can see that there are no files yet, because it's a brand-new PVC, but there is already a connection to an object bucket here. The name of the bucket is not very fancy (I should definitely change the code to have a better display), but this is the object storage, and of course you can go into it.
If we take a look at what happened behind the scenes, we can see in the storage that a new PVC has been created for Nicole, with the default of 2 gigabytes provisioned. That's the standard claim that was made to the storage server to provision Nicole with a new storage space. What also happened is that an object bucket was created, and we can see it here in the config maps: we have odh-bucket-nicole.
That was the claim that was made with the config map, and we can see that there is a bucket that was created and also a secret. The secret is all the information required to connect to this specific bucket, and those are exactly the environment variables that have been injected inside the notebook, so that we can have them there: the access key and the secret key. That's what allows the notebook to directly connect to the object storage and retrieve the information.
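From inside the notebook, those injected variables are all that's needed to reach the bucket. A minimal sketch, assuming the hypothetical variable names used above (a real deployment may name them differently):

```python
import os

# Collect the S3 credentials and endpoint that the hook injected into the
# notebook pod's environment. Variable names are assumptions.
def s3_connection_params():
    return {
        "endpoint_url": os.environ["S3_ENDPOINT_URL"],
        "aws_access_key_id": os.environ["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": os.environ["AWS_SECRET_ACCESS_KEY"],
    }

# These params can be handed straight to an S3 client library, e.g.:
#   import boto3
#   s3 = boto3.client("s3", **s3_connection_params())
#   s3.list_objects_v2(Bucket=os.environ["JUPYTERHUB_USER_BUCKET"])
```

Because the credentials never touch the notebook code itself, a user can share or commit their notebooks without leaking keys.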
For our next user, his storage has already been provisioned, both the PVC and the object bucket claim, so we just wait a few seconds. Here it's launching, the container is created, and then we have access to his workspace. We can see that Frank has already been working on some notebooks, doing some model training and things like that, and of course we have reconnected him directly to his PV, but we have also reconnected him to this data lake environment.
Okay, so here is the folder with the bucket that has been created especially for him. What I did also is a little trick, because I wanted to create some object storage space that could be shared between each and every user. So what I did in NooBaa is create a bucket which I called shared-data, and of course you can do everything like this programmatically.
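The shared bucket was created by hand in the NooBaa console here, but doing it programmatically could look like one extra ObjectBucketClaim whose bucket name is pinned instead of generated per user. A hedged sketch with illustrative names:

```python
# Hypothetical sketch: a single shared bucket, claimed once and then
# exposed to every user's environment. Using spec.bucketName (instead of
# generateBucketName) pins the exact bucket name so all users point at
# the same bucket. Names are assumptions for illustration.

def shared_bucket_obc(bucket_name="shared-data",
                      storage_class="openshift-storage.noobaa.io"):
    return {
        "apiVersion": "objectbucket.io/v1alpha1",
        "kind": "ObjectBucketClaim",
        "metadata": {"name": f"odh-{bucket_name}"},
        "spec": {"bucketName": bucket_name,
                 "storageClassName": storage_class},
    }
```

The spawner hook could then inject this bucket's name alongside each user's private bucket, which is what makes the shared-data folder show up for everyone.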
So that means that Frank can go to the shared-data folder and see that there are already some files that he can use. Those are images to train a model for pneumonia detection, and there is a credit card CSV file. That's a great way to have a central point where all the users can share data sets, so you don't have people copying the same data sets over and over for training and everything.
There are some standard tools and files that you want to be able to share between people, and that's a great way to do it. So here I'm going to log out of this environment and go back to Nicole, because, as you remember, I have now allowed her to access the shared object store. So if I launch her environment again...
Okay, it's running, so she'll be connected very soon. We see that now she has her own PVs, with no files yet in there or in her bucket, but she also now has access to the shared-data folder. And that's again a pretty neat way to set up your data science platform so that everyone can collaborate.
So in this quick demo, which of course you can reproduce (as I said, the code is available, and I will show you all the resources again here), I just showed you that it's quite easy to set up a full data science platform with fully automated storage provisioning for your users, with both standard block storage and object storage. We could do the same with shared file systems by leveraging CephFS, and everything can be totally automated using standard Kubernetes and `oc` commands.