From YouTube: Machine Learning on OpenShift SIG Using Ceph for ML Workloads on OpenShift - Kyle Bader (Red Hat)
Host: Anything that might have a Spark operator component to it, but I'm not sure, so I'll let Kyle share his screen and walk us through that now.
Kyle: So one of the things I've been working on recently is putting together something like a tutorial for experimenting with Ceph object storage, and learning how you can use Ceph object storage with some of the tools coming out of radanalytics.io. If you're interested, you can try this at any time by just cloning this repo I have here, which I really only pulled together in this last week, so it's improving!
It runs on OpenShift, and the idea is that there's a micro Ceph called ceph-nano. I have configuration here that will create a set of credentials, use OpenShift secrets to store those credentials, and then create a StatefulSet running what is effectively a single-pod Ceph cluster.
That's just described here: during the bootstrap process, the ceph-nano StatefulSet will use the credentials from the secrets to create an additional set of users. I already have, in Minishift running on my desktop here, a Ceph cluster consisting of one pod. In a more robust environment there's Rook, a Ceph operator, and you'd probably want to use that in a real environment.
I was using the S3A connector and the pre-canned images when I was initially playing around, but the OpenShift Spark images from radanalytics were using an older version of Spark with an older version of Hadoop, and that was kind of problematic. One of the new things from the radanalytics community, though, is that they're publishing these OpenShift Spark builder images that you can run a build with and provide a tarball.
B
Basically,
a
link
to
spark
caramel
and
it'll
create
a
custom
open
ship
spark
image
for
you
with
that
particular
version
of
spark,
so
I
kind
of
put
together
a
spark.
The
latest
spark
and
a
new
2.8,
not
five
caramel
and
then
I'm
using
that
to
create
create
a
build
here.
A
built-in
chef
spark
and
then
I
can
go
ahead
and
use
that
that
resulting
image
as
the
base
for
for
a
notebook
notebook
here.
B
Though
I
did
that
earlier
here,
cluttered
it
up,
I
have
have
notebook
here:
I
threw
the
environmental
variable,
I
included
kind
of
the
notebook
in
in
this
particular
repository,
and
it's
it's
getting
the
rgw
end
point
kind
of
from
this.
This
Zelda
command
here
and
the
most
straightforward.
If
you
are
you're,
you
know
just
working
from
within
a
notebook,
and
you
want
to
interact
with
the
object,
store,
photo,
is
kind
of
the
AWS
library
and
choice
in
the
case
of
Python.
So if you want to be able to interact with an object store, it's really as easy as installing boto and then creating a boto object to interact with the object store. For this particular image I'm just using the base notebook image, so it doesn't include boto3.
You can always install conda packages from within the notebook, but if you do this a lot, you might want to build your own base notebook image.
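If you go the install-at-runtime route, a minimal sketch of pulling boto3 into a running notebook session (assuming the notebook image has conda or pip available) is just:

    # Install boto3 into the running notebook kernel's environment.
    # Fine as a one-off; bake it into a custom notebook image if you do this often.
    %conda install -y boto3      # or: %pip install boto3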
The secret key and the user key, which are the credentials for the Ceph cluster, and the endpoint are specified as I'm creating the boto object here, and those are being sourced from the environment. They make their way into the environment by being passed on the new-app command line, and the user key and secret key come from the OpenShift secret and are exported as environment variables inside the pod.
It would be very bad hygiene to have your S3 secret key and user key statically coded into the notebook, so this is kind of the current best approach that I've found, at least in connection with a Ceph object store.
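As a rough sketch of that pattern in a notebook cell (the environment variable names here are illustrative, not necessarily the ones the repo uses), the client is built entirely from values the secret injects into the pod:

    import os
    import boto3

    # Endpoint and credentials arrive as environment variables (populated from
    # the OpenShift secret), so nothing sensitive is hard-coded in the notebook.
    # The variable names below are assumptions for illustration.
    s3 = boto3.client(
        "s3",
        endpoint_url=os.environ["S3_ENDPOINT_URL"],          # ceph-nano RGW endpoint
        aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
        aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    )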
So we have this boto object now, and we can use it to create a bucket in the ceph-nano object store. We have the Ceph object store running in OpenShift here, and then I created a bucket, wrote a dummy object into that bucket, and listed the contents of the bucket, so we can see that it's there.
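A minimal sketch of those three steps with boto3 (the bucket and object names are made up for illustration; the client is created the same way as above):

    import os
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url=os.environ["S3_ENDPOINT_URL"],
        aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
        aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    )

    s3.create_bucket(Bucket="demo-bucket")                       # create a bucket in RGW
    s3.put_object(Bucket="demo-bucket", Key="hello.txt",
                  Body=b"dummy object")                          # write a dummy object
    resp = s3.list_objects_v2(Bucket="demo-bucket")              # list the bucket contents
    for obj in resp.get("Contents", []):
        print(obj["Key"])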
Now, that's all fine and well, but if you're using boto just within the confines of the notebook, that's obviously not a particularly scalable approach, depending on what you're doing. So if you have to do some heavier lifting to interact with the object store, that's where Spark comes in. Running this notebook, I can create a Spark context here. Of course, it's just running Spark locally within the pod that's running the notebook; this could, you know, be a real cluster
B
He
provision
with
the
Machine
Co
or
for
like
a
spark
operator,
I
suppose
I'm,
not
particularly
familiar
with
that.
Yet
so,
oh
that's!
It
yeah
learn
more
about
that
and
then
with
s3a
you
have
to
set-
and
you
know,
there's
a
number
of
things
you
need
to
set
similar
to,
as
you
had
to
set
a
photo
again,
we're
setting
the
end
point
and
then
the
credentials
here,
but
we're
also
telling
it
to
use
a
path
style
access
kind
of
by
default.
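In PySpark, that configuration looks roughly like the following; the property names are the standard Hadoop S3A ones, while the environment variable names are again just illustrative:

    import os
    from pyspark.sql import SparkSession

    # Point the S3A connector at the Ceph RGW endpoint instead of AWS, and use
    # path-style URLs (bucket in the path, not the hostname), which is what RGW
    # typically expects.
    spark = (
        SparkSession.builder.appName("ceph-s3a-demo")
        .config("spark.hadoop.fs.s3a.endpoint", os.environ["S3_ENDPOINT_URL"])
        .config("spark.hadoop.fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"])
        .config("spark.hadoop.fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"])
        .config("spark.hadoop.fs.s3a.path.style.access", "true")
        .getOrCreate()
    )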
Now, one of the big reasons I wanted to use Hadoop 2.8, and why I built the custom OpenShift Spark image, was that it allows per-bucket configuration, where I can have a different set of credentials or a different endpoint for different buckets. Right here I'm drawing from one of the radanalytics.io community tutorials.
They have some data that they've made available in an Amazon S3 bucket, showing how you can interact from the same context with data both in the public cloud and in the private cloud, a public object store or a private object store. Here I'm saying that the radanalytics data bucket has a different endpoint than the default.
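Per-bucket configuration in the S3A connector (the feature Hadoop 2.8 brings) scopes a property to a single bucket by embedding the bucket name in the key; a sketch, with "radanalytics-data" standing in for whatever the tutorial bucket is actually called:

    # The defaults set above point at the local Ceph RGW; this override says that
    # one specific bucket lives at the public AWS endpoint instead.
    hconf = spark.sparkContext._jsc.hadoopConfiguration()
    hconf.set("fs.s3a.bucket.radanalytics-data.endpoint", "s3.amazonaws.com")
    # Credentials can be overridden per bucket the same way, e.g.
    # fs.s3a.bucket.<name>.access.key / fs.s3a.bucket.<name>.secret.key.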
So you very much have these same operational modalities whether you're using Ceph as an object store or Amazon S3. If developers are used to the experience in the public cloud and you want to replicate that in a private-cloud environment, it's really relatively seamless from a developer-experience standpoint. Here there's another bucket of mine, called b-dist, and I'm doing the same thing I did up here with the radanalytics bucket.
It's like a sanitized version of the reports that customers provide after a trip, and one of the things they do with it, which this notebook is programmed to do, is a sentiment analysis: building a model, training it, and then, you know, validating that model. So this is the data that we're sourcing there. What this command does is read the sample data out of the bucket that's in Amazon, so that I can work with it.
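A sketch of that read; the bucket name, object key, and file format here are placeholders, since the tutorial's actual layout may differ:

    # Read the sample trip-report data straight out of the public S3 bucket
    # into a Spark DataFrame, via the per-bucket endpoint configured above.
    df = spark.read.json("s3a://radanalytics-data/sample-trip-reports.json")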
You can show the schema, and going back to the data frame, you can take your data here and count it, or you can register it as a table. A lot of the data folks who are analyzing data are familiar with using SQL, so if they just want to use raw SQL and are less familiar with the Python methods for manipulating data, they can certainly do that.
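Roughly, those operations look like this on the data frame read above:

    df.printSchema()            # show the inferred schema
    print(df.count())           # count the rows

    # Register the data frame as a temporary view so folks who prefer SQL can
    # query it directly instead of using the Python DataFrame API.
    df.createOrReplaceTempView("trip_reports")
    spark.sql("SELECT count(*) FROM trip_reports").show()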
One of the things the Data Hub team inside our own company is doing is analyzing data that was stored in their local Ceph object store. So this is a Ceph bucket in ceph-nano: I can load the sample data set into a data frame (luckily I already installed the needed packages in the kernel), and it loads into the data frame. This is the data set, then, that it's coming from.
At the end of this notebook (I'm probably not going to get all the way through it and will probably stop here in a second) it saves the resulting model, tokenizer, and feature dimensions back into S3, and then it persists. You don't have to worry about reattaching a PV to something; it's in an object store that's available to anything that has the appropriate credentials, and this is really kind of a neat way of sharing data across multiple workloads.
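For a Spark ML model, persisting back to the object store is just a save to an s3a:// path; a sketch with placeholder names, where `model` stands for the fitted model produced earlier in the notebook:

    from pyspark.ml import PipelineModel

    # Write the trained model into the Ceph object store rather than onto a PV.
    # The bucket and key prefix are placeholders.
    model.write().overwrite().save("s3a://models/trip-report-sentiment")

    # Later, any pod with the right credentials can load it back.
    restored = PipelineModel.load("s3a://models/trip-report-sentiment")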
I didn't write this; this is from the folks on that team. But basically they train a machine learning model using this data, eventually generate some charts, and then save it all back into the Ceph cluster's object store. So they're showing the sentiment of these trip reports, based on whether the trip was successful versus unsuccessful. These are all sanitized, made-up people's names, and then there's a breakdown based on the persona, the audience or the role: customer, engineering, etc.