Description
Data science, machine learning (ML), and artificial intelligence have exploded in popularity in the last few years, with companies building out dedicated ML teams. Kubeflow is an open-source ML toolkit for Kubernetes that provides useful components for solving problems in multiple areas. In this hands-on demo we will discuss how you can build a scalable architecture for ML training and inference at scale, using distributed storage as a backend (Amazon EFS) and Kubeflow (on Amazon EKS).
Hello everyone, welcome to this session on machine learning at scale using Kubeflow with Amazon EKS and Amazon EFS. My name is Suman Debnath, I'm a developer advocate on the EFS product team, and I'm super excited to share a few insights about how you can run your machine learning workflow using an open-source tool called Kubeflow.
So what are we going to do in the next 30 minutes or so? We are going to talk about why machine learning on containers, because that might sound a little odd to many of you if you are new to the container space. We will then dive a little bit into Kubeflow, and then we will jump straight into the demo, which is more interesting than boring slides. So first things first: why machine learning with containers?
If you think about it, it's not only about machine learning: we get the flexibility and all the benefits that any application gets with containers. If you look at the whole machine learning stack, there are different tools like TensorFlow and PyTorch, and we need different kinds of infrastructure to run training. Things get very complicated when you think about all the different packages, dependencies, and configurations.
What containers help us do is package our training code, along with all its dependencies, in a much more modular way. That way our ML environment becomes lightweight and very portable, so you can run your machine learning training jobs or other tasks independent of the platform. And one of the reasons it is even better to run it on Kubernetes is composability.
You define your training jobs, or your machine learning tasks, in a granular way, so that you can run them in different places; and if you want to make changes, it doesn't affect the other jobs in your pipeline. The other thing is, you can start today on-prem in your own Kubernetes environment, and you don't have to change anything later if you want to migrate to AWS and run the same training job in the cloud.
Since the training runs as a container on Kubernetes, it is very portable, as we just discussed, and you don't have to think about scale, because Kubernetes takes care of the infrastructure. Whether you need two instances to run the training job, or three, or ten, since the training job runs as a container you don't have to worry about the backend infrastructure, which is very valuable for any machine learning engineer, or even any infrastructure engineer. And the best part about this
is that you don't have to worry about managing the Kubernetes cluster yourself. You may like to use Amazon EKS, our managed Kubernetes service on AWS, which gives you the control plane to which you can attach your data plane, or compute nodes. You natively get the upstream Kubernetes experience, and you can decide which version of Kubernetes you want.
It will feel exactly the same as if you had installed and configured Kubernetes on-prem or on your own EC2 instances, and it integrates with a lot of other AWS services, which we are going to see in a while, when we run a machine learning job using Kubeflow on EKS and save our training data sets on EFS. And one other thing, irrespective of Kubernetes,
is that you don't have to build all those training container images from scratch. We offer a lot of pre-packaged Docker container images which are fully configured, validated, and rigorously tested, so you always get the best configuration and image to make use of.
So what you can do is create your own training script, or use the training scripts that we provide as templates and make the relevant changes based on your needs and requirements. You can always customize those container images, and we support different frameworks like TensorFlow, MXNet, PyTorch, and so on. And the best part is that you can use these deep learning containers not only with EKS, but also with ECS, Amazon SageMaker, and EC2 instances.
So let's talk a little bit about Kubeflow before we jump into the demo. You can think of Kubeflow as a machine learning toolkit for Kubernetes. It comprises various projects like Jupyter notebooks, pipelines, training services, and inference (or serving). Basically, if you have ever seen Amazon SageMaker, it's a similar kind of platform.
It may not have all the fancy features that SageMaker offers, but if you want to run your machine learning workflow on Kubernetes, and you want control over your workflow at a more granular level, then you can make use of Kubeflow. It's an open-source project, so you can always contribute to it. And we're going to see in the demo how you can create a Jupyter notebook and how you can start a training job on Kubeflow.
Now, one important advantage of running Kubeflow on AWS is that you get a lot of flexibility and can leverage the integrations that AWS has with other services. When you are running Kubeflow on EKS, you get all the goodness of the service integrations that we have with EKS. In this case we are going to use EFS with Kubeflow.
Since we are going to use EFS in our workflow, let's talk about EFS a little bit; we learned a little about EKS, which is our managed service for Kubernetes, so let's spend a couple of minutes on EFS. EFS is a simple, serverless, set-and-forget kind of file system. You don't have to specify the size of the file system, and you can use it from almost anywhere: you can access the file system from your on-premises
machines, from an EC2 instance, from a Lambda function, or from a Kubernetes cluster; and we are going to use EFS to save the training data set for our machine learning workflow. It's also very elastic and performant: we recently announced sub-millisecond read latency, which means in general you will get a read latency of around 600 microseconds.
As we discussed, EFS integrates with various compute services. One is on-prem or EC2, but it doesn't stop there: you can always use EFS with any of your containers running on ECS, our managed container service, or on EKS, which we are going to use in our demo. So how do you get started with EFS on Kubernetes?
We are not going to talk about Kubeflow in general here, because we are going to do that in the demo, but I just want to give you an overview of how you can get started with EFS on Kubernetes. The first thing you need is a Kubernetes cluster. In this case we are creating an Amazon EKS cluster, but this could very well be your own installation of Kubernetes on a bunch of EC2 instances where you are managing the cluster yourself.
We are just taking this example where we are creating a managed EKS cluster. What that means is that you don't have to manage the cluster yourself: all the patching, version control, and updates are taken care of by Amazon.
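Creating such a managed cluster is typically a single eksctl command; a sketch, where the cluster name, region, node count, and instance type are illustrative choices, not values from this demo:

```shell
# Create a managed EKS cluster; eksctl provisions the control plane and a node group
eksctl create cluster \
  --name my-eks-cluster \
  --region us-east-1 \
  --nodes 5 \
  --node-type m5.large
```

eksctl also writes the kubeconfig entry for you, so `kubectl get nodes` works as soon as the cluster is up.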
So first you need to create the EKS cluster. Second, you need to create a security group so that the EFS file system can be accessed from the EKS cluster, and then you have to create the EFS file system itself. Now, we are going to do this through code in a while, but this is what the workflow looks like. And the most important thing is that you need to install the EFS CSI driver; this storage driver doesn't come out of the box.
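The security-group and file-system steps above could be done with the AWS CLI along these lines (the VPC ID, CIDR range, and names here are hypothetical):

```shell
# Security group allowing NFS (port 2049) from the cluster VPC
SG_ID=$(aws ec2 create-security-group --group-name efs-sg \
  --description "EFS access from EKS" --vpc-id vpc-0123456789abcdef0 \
  --query GroupId --output text)
aws ec2 authorize-security-group-ingress --group-id "$SG_ID" \
  --protocol tcp --port 2049 --cidr 192.168.0.0/16

# The EFS file system itself; note that no size is specified anywhere
aws efs create-file-system --tags Key=Name,Value=my-efs1
```

You would then create mount targets in the cluster's subnets with `aws efs create-mount-target`, attaching the security group above.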
So once you have your EKS cluster, if you want to attach EFS storage, you need to install this CSI driver. This is also an open-source project, if you would like to contribute; we have made a lot of improvements to the CSI driver in the recent past, so please have a look. So: you have created the EKS cluster, you have created the file system, and you have installed the CSI driver on the EKS cluster.
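The driver install itself is usually a short Helm sequence; a sketch based on the driver's public chart repository:

```shell
# Install the Amazon EFS CSI driver from its Helm chart
helm repo add aws-efs-csi-driver https://kubernetes-sigs.github.io/aws-efs-csi-driver/
helm repo update
helm upgrade --install aws-efs-csi-driver \
  aws-efs-csi-driver/aws-efs-csi-driver \
  --namespace kube-system
```

After this, `efs.csi.aws.com` is available as a provisioner for storage classes on the cluster.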
What you can do next is define a storage class. In the storage class definition you provide the file system ID which we just created, and that's all. After this you can run your application by creating a persistent volume claim that references that same storage class.
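A minimal storage class for the driver might look like the following; the file system ID is a hypothetical placeholder, while the name `efs-sc` and the dynamic (access-point) provisioning mode match what this demo uses:

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
parameters:
  provisioningMode: efs-ap            # dynamic provisioning via EFS access points
  fileSystemId: fs-0123456789abcdef0  # hypothetical: your file system ID
  directoryPerms: "700"
```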
Here, the storage class name we have given is efs-sc; you just refer to the same storage class in your PVC definition, and once that is done, you can mount that PVC in your application, or in your pod. In this case we are just using a persistent volume claim named efs-claim, and this is the same claim name that we used to create the PVC. So that's all!
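Putting the PVC and pod together, a sketch (the pod image and mount path are illustrative; `efs-claim` and `efs-sc` are the names from the slides):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
spec:
  accessModes:
    - ReadWriteMany          # EFS supports many pods reading and writing at once
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi           # required by Kubernetes; EFS itself ignores the size
---
apiVersion: v1
kind: Pod
metadata:
  name: efs-app
spec:
  containers:
    - name: app
      image: busybox
      command: ["sh", "-c", "sleep 3600"]
      volumeMounts:
        - name: data
          mountPath: /data   # the EFS file system appears here inside the pod
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: efs-claim
```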
Next time, if you want to run another application, you just create another PVC and use it in your pod; you don't have to go back to the storage side over and over again, because we are using dynamic provisioning, and the CSI driver takes care of creating the access points, which is the technology behind provisioning these PVCs. So that's a little bit about how you can make use of EFS with Kubernetes. Now, before we jump into the demo,
this is the architecture of our demo. We are going to use EFS to store our training data set. In our demo we are going to download the training data set, and we are going to run a training job on Kubeflow; that training job is going to access the storage from EFS using the CSI driver. And for our training job,
we are going to build an image and push it to ECR, our container registry, which is kind of like Docker Hub, but within AWS. Then we will start the training job on our EKS cluster using Kubeflow, and it is going to pull the image from ECR and run the training job on the training data set saved in EFS. Okay, so let's jump into the demo and see it all in action.
Okay, I have already opened my AWS console, and you can see I am inside Cloud9. We are going to use Cloud9, which is an IDE for writing your code on AWS. It gives you a nice IDE-style environment that runs on an EC2 instance, and it's very easy for you to write code as you go along.
You don't have the dependency of carrying your laptop or workstation; you can write your code from anywhere, as long as you have internet connectivity. If you come here, you can see that I have two environments; I have already opened this one. You can always create your own environment by clicking on Create environment, and it will just ask you a few questions: what type of EC2 instance you need, what operating system, and you are all set.
This is the IDE environment that I have; if you come here and click on Open IDE, you will land on this page. I have a lot of code here; I have cloned the GitHub repo, which I'm going to share in a while, but you will get this kind of interface. Okay, so first things first: I already have a Kubernetes cluster up and running, and we can see that if we run kubectl get nodes; we have an EKS cluster with five nodes.
I have also installed Kubeflow, and if you want to verify that, we can see all the pods running as part of Kubeflow; Kubeflow, as we learned a while back, is a collection of different services, so we have all those pods running which are actually running Kubeflow. Now, the first thing is that we need to create an EFS file system; at this point we don't have any EFS file system created. And if you run kubectl get storageclass, or sc,
you see the default storage class, which is an EBS volume; that is the default when you create a Kubernetes cluster on AWS. Now, the first thing we need to do is create the EFS file system, and then install the driver and create a storage class. But we don't have to do it all manually.
We have created a script, which is located inside this directory, and it has some dependencies. This auto-efs-setup script uses some external libraries, so for those we have a requirements.txt file. First, let's install all the packages; I already have them installed, so it will just skip the installation. Next, we need to run the script.
So let me run the script, and then we can go over what it is doing. Before I hit enter, I just want to show you a few parameters that we are passing along. One is the region, meaning the region in which we are creating this file system, which is obviously the same region where I have my EKS cluster. Then I'm giving the cluster name, which I've saved in an environment variable, and then the file system name. So, before I hit enter:
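A sketch of installing the dependencies and invoking the setup script; the script filename, flag names, and variable names here are assumptions based on how the demo describes its repo, not verbatim from it:

```shell
# Install the script's dependencies
pip install -r requirements.txt

# Hypothetical invocation: region, cluster name, and file system name
export CLUSTER_NAME=my-eks-cluster
python auto-efs-setup.py \
  --region "$AWS_REGION" \
  --cluster "$CLUSTER_NAME" \
  --efs_file_system_name my-efs1
```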
If you come to the EFS console and hit refresh, we see that there is no file system yet. Hopefully, after this script gets executed, we will have a new file system. So while this runs, let me quickly show you what the script is doing.
The script is doing just three basic things. First, it checks a few of the prerequisites and what is needed. Then it creates the IAM roles so that the cluster can access the EFS file system. Then it installs the CSI driver, which we just talked about
in the presentation: you need to install the CSI driver so that the Kubernetes cluster can talk to EFS. Then it creates the file system, and after that it creates a storage class. And this is the storage class which we are going to use in Kubeflow for training our job, and even for the notebooks, to create different data stores for keeping our training data sets.
Okay, so creating a file system and creating a storage class is something that you can do repeatedly, as and when you need. But setting up the CSI driver and all of that is just a one-time activity. It will take a couple of minutes, so let's wait for it to complete.
Okay, as we can see, it has created the file system, created the mount targets, and provisioned a storage class. To see that, we can just run kubectl get sc, and we can see the storage class which has been created. It is still not the default one, so if we create anything on Kubeflow, say a Jupyter notebook, it will use the default storage; but if you explicitly mention this storage class, it will carve out storage from here. Okay.
So now, before we go ahead: if we go to EFS and click on refresh, you will see that the file system got created, my-efs1, and this is exactly the name that we passed as the EFS file system name in the script.
So we are all good now. And if you come inside this file system and go to Access points, which are the entry points for an application into EFS, you see that there is no access point created yet, because we have just created the storage class; no PVC has been claimed or created.
So this is the dashboard service, basically the istio service, and now I just go to preview and open the app in a separate tab. I'm just closing this off and going back here. Now the dashboard is up and running, and we can see it here: we have notebooks, TensorBoard, volumes, pipelines, everything.
If you have seen SageMaker on AWS, this is kind of the same, although SageMaker has a lot of additional flexibility and features. But this is a nice environment for you to manage everything inside out, so you have more granularity, and it's all running on Kubernetes, which is great. Now, if you see here, we don't have any volume, so let's create one; this is the volume which we are going to use for keeping our training dataset.
Let's give it some size, say 100 GB, and here we can select the storage class, and we can set the access mode to ReadWriteMany, which EFS supports; that means you can access this volume from multiple pods.
So let's create it. It is going to be in a pending state, because we have not yet attached this volume to any Jupyter notebook or any other training job. So let's go ahead and create a Jupyter notebook.
Let's give it a name, say notebook-1. Here we can select the image, but let's keep the default. It is going to create a volume for its own use, basically the home directory for this notebook, and since our default storage class is EBS, it is going to create that PVC from the gp2 storage class. But we are also going to attach an external volume, the EFS dataset volume which we have just created.
It's already created, and if I come to Volumes now, you will see that the dataset volume's status is now Bound, and we also have another volume for the home directory, which is coming from gp2. And now, if I come to EFS and click on refresh here, you will see one access point which got created dynamically by the CSI driver.
So let's go back to our notebook and connect to it. What we are going to do now is run a training job, but before that we need some data. So we are going to use the Jupyter notebook to download a data set. I already have the location of the data set; it is basically a simple
data set which contains images of different flowers, and we will run a CNN job, basically a deep learning training job, which will identify the type of flower given an image of one. This is a very tiny training data set, and the focus is not on the machine learning part; the idea is really to show you how you can make use of Kubeflow to run your training jobs.
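Downloading a flowers data set into the mounted EFS volume might look like this from the notebook terminal; the mount path is a hypothetical placeholder, and the URL is TensorFlow's public flowers example data set, used here as a stand-in for whatever location the demo pulls from:

```shell
# cd into the directory where the EFS dataset PVC is mounted (path is hypothetical)
cd /home/jovyan/dataset
curl -LO https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz
tar -xzf flower_photos.tgz   # extracts per-class folders: roses/, sunflowers/, tulips/, ...
```

Because the directory is EFS-backed, everything extracted here is immediately visible to any other pod that mounts the same PVC.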
If you get inside this data set, which is coming from the EFS storage, you will see different types of flowers: roses, sunflowers, and so on. So let's wait for this to get downloaded, and once it is done, we will go to Kubeflow and start a training job.
As you can see, the training data set has been downloaded, and we have the images saved inside this EFS data set share, so we are all good to start the training job. Let me close this Jupyter notebook, because we don't need it anymore; the only thing we wanted to do was download the dataset, which is now stored in this EFS file system via the access point. So now let's go back to our console and start the training job.
If we open this architecture: we have saved the training data set on EFS, and all we need to do is run the training job. For this I already created a deep learning image which contains the code for running the training job; we built it locally on the Cloud9 instance, and then I pushed it to an ECR repository, which I'm going to show you. Then we can simply go ahead and run a training job on Kubeflow, where I have specified the training data set location as the dataset we created, and the image to use as the same image which is in ECR.
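The build-and-push step follows the standard ECR workflow; a sketch where the account ID, region, and repository name are hypothetical:

```shell
# Hypothetical account ID, region, and repository name
ACCOUNT=123456789012
REGION=us-east-1
REPO=flower-training

# Authenticate Docker against the private ECR registry
aws ecr get-login-password --region "$REGION" | \
  docker login --username AWS --password-stdin "$ACCOUNT.dkr.ecr.$REGION.amazonaws.com"

# Build locally on Cloud9, tag with the registry path, and push
docker build -t "$REPO" .
docker tag "$REPO:latest" "$ACCOUNT.dkr.ecr.$REGION.amazonaws.com/$REPO:latest"
docker push "$ACCOUNT.dkr.ecr.$REGION.amazonaws.com/$REPO:latest"
```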
So let me show you the image first. If we run docker image ls, you will see that we have one image, or a repository with this image saved. All of this code, including the Dockerfile, is in the GitHub repo, which I'm going to share with you towards the end. But if you want a quick look at the Dockerfile: it simply pulls a TensorFlow base image,
copies the training script, which is located here, and sets it as the entry point. That's all, nothing fancy; inside this training script we are running the machine learning training.
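The Dockerfile described above might look roughly like this; the base image tag and the script's in-container path are assumptions, while `training.py` is the script name used in the demo:

```dockerfile
# Hypothetical TensorFlow base image tag
FROM tensorflow/tensorflow:2.9.1

# Copy the training script from the repo into the image
COPY training.py /opt/training.py

# Run the training as the container's entry point
ENTRYPOINT ["python", "/opt/training.py"]
```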
So let's run it. The training job we are going to run on Kubeflow is also defined as a YAML file; it's inside this training-samples directory, in tf-job.yaml. If I open this up, you will see it is a TensorFlow job (TFJob); this is the name of the job, and we are going to create two replicas.
That means when we execute this, you will see two pods getting created for this training job. Here you can see the image we are using, and the most important part, the training data set: this training will run on some data, and this is the same data set which we downloaded a while back into the dataset PVC. So if we run kubectl get pvc, let me just grab it:
If you see this, this is the dataset PVC, and it's the same PVC which we are mounting on this training job. The training job is going to create the pods, and those pods will have our EFS storage attached and mounted inside this train directory; and in our training script we have said: go and read this directory for your training data.
Let me open this training.py, and you can see here we specify where our training data set is located. Okay, so let me go back to the CLI and run this job. We are inside this ml folder, and the training job is inside the training-samples directory, so we can simply run kubectl apply with the location of our definition file.
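A minimal sketch of a TFJob like the one described; the image URI, mount path, and PVC name are hypothetical placeholders, while the job name and replica count follow the demo:

```yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: image-classification-pvc
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2                      # two worker pods, as in the demo
      template:
        spec:
          containers:
            - name: tensorflow
              # hypothetical ECR image URI
              image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/flower-training:latest
              volumeMounts:
                - name: dataset
                  mountPath: /train    # the training script reads its data here
          volumes:
            - name: dataset
              persistentVolumeClaim:
                claimName: dataset     # the EFS-backed PVC created earlier
```

Applying this with `kubectl apply -f tf-job.yaml` creates the worker pods, each mounting the same EFS-backed data set.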
So now, if we look at the training job, it is in the pod-creation phase; it is yet to start the training, but we see that there are two pods which it created. We can even run kubectl get pods
with the namespace, and you will see these are the two pods which are running, and they match the name that we gave, image-classification-pvc: worker-0 and worker-1, because the replica count is 2.
So let's run this, and you can see the training job has already completed, because it was a tiny data set and we only ran it for two epochs. And you can see the accuracy is not at all great, but that's okay; the idea is basically to show you how you can make use of Kubeflow to run your training job without any hindrance. So our training job is done.
And now, if you look at the pods, you see they are in a not-ready state, meaning they are not running anymore; the job is already over, and what we can do is delete this whole deployment: to delete the job, we can just copy this and run kubectl delete.
So what Kubeflow allows us to do is scale our machine learning workflow dynamically, so we don't have to worry about the infrastructure that is needed for ML training; and with EFS you get the flexibility to attach or share the storage across your team, for different data scientists, or maybe for different users,
saving your training data set in one central location which can be accessed by different people. If you look here in EFS, this is the place where I have my training data set. You can access this not only from your Kubeflow users; you may also, say for troubleshooting, want to attach this to an EC2 instance and explore something.
Maybe you want to look at the training data set for some ad hoc task; you can always click on Attach, copy the command, and mount this file system as NFS storage onto your EC2 instance, provided you have all the permissions granted.
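The mount command the console hands you looks roughly like this (the file system ID and region are hypothetical; the NFS options are the ones EFS documents as its recommended defaults):

```shell
# Mount the file system over NFS 4.1 from an EC2 instance
sudo mkdir -p /mnt/efs
sudo mount -t nfs4 \
  -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 \
  fs-0123456789abcdef0.efs.us-east-1.amazonaws.com:/ /mnt/efs
```

Once mounted, the same files your Kubeflow pods write appear under /mnt/efs on the instance.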
So the idea here is that when you use EFS, you can access the same storage from containers, EC2 instances, Lambda functions, and so on, which gives you a lot of flexibility; and you don't have to provision it beforehand. If you look, nowhere did we mention the size of the file system; it will scale up and scale down automatically. Also, when we created the volume,
we mentioned a size just to make Kubernetes happy, because for Kubeflow, or Kubernetes in general, you need to specify the size of a PVC; but from the EFS standpoint,
it is simply ignored, because there is no size requirement. So that's all about it. If you want to go through this whole demo, this is the location: you can go to the Amazon EFS Developer Zone, and inside that we have a Machine Learning with Kubeflow on EKS with EFS section. This tutorial will guide you through setting up the whole environment on Cloud9, as well as the training job and a few other things.
So if you want to try it out, feel free to go over there and give it a try. To get to that page, you can go to the landing page, Amazon EFS Developer Zone; if you scroll down, you will get some information about EFS, what it is and how it works in a little bit of detail, and further down you will see a section on the different integrations. This is Amazon EFS with containers, and here you can see Machine learning at scale using Kubeflow.
You can always click there and you will go to that page which we have just seen, and try it out in your own account. Okay, so that's about it; let's go back to our slides. All right, so now you have learned a little bit about how you can make use of EFS with EKS for Kubeflow.
There are plenty of other Kubernetes- or container-specific tutorials available on the Amazon EFS Developer Zone, which we saw a while back during the demo. Feel free to access that page and share your experience, and if you would like to contribute, you can send a PR with your demo and we will add it to the repository.
So thank you so much for your time. I hope you learned a little bit about Kubeflow, EKS, and EFS, and I look forward to hearing your feedback once you try this in your own account and share your experience. Thank you so much once again, and have a wonderful day ahead.