Red Hat OpenShift San Francisco 2019 | OpenShift Commons Gathering, 28 Oct 2019

Previous Meeting

Next Meeting

⏯

youtube image

►

From YouTube: AI/ML and Operators Case Study with Azadeh Khojandi and Jordan Knight

Description

AI/ML and Operators Case Study: Databricks, Azure and Kubernetes with Azadeh Khojandi and Jordan Knight from Microsoft.

Filmed October 28th, 2019 in San Francisco.

A

Yeah so, as Diane said, I'm my name's Jordan Knight and with me I, have ants and we're both software engineers working for Microsoft out of Australia, and we are here today because we built an operator when I say we I mean as and turns out that when you build and post and operated at github, that's like sending up a bat signal for Diane, because I think within minutes of us actually making that project. Public Diane was in contact with that.

A

So so and be Diane's been super supportive in helping us actually get that operator up to the stage where we're nearly ready to pop it up onto the operator hubs. So we're here today we're going to run, run you through some of the reasons why we indeed built an operator and it's for reasons. Far more than just they're shiny and everyone was busting to try and build an operator. We actually had to find a real business reason to do that.

A

So what I'm, actually going to do is just run through a bit of background, basically I'm a designer on a project, a customer project that we had a need for an operator and basically, as is the lead developer on that operator, so together we're in the same team. But together we actually I've got the background on how we how we go about using that operator and as it's got some background on how you use the operator itself.

A

They're. Just a little bit of background about our team record, you get emotionally commercial software engineering, okay, I think Diane's still on the mic out there somewhere we're called commercial software engineering where, in the past, Microsoft engineers have been locked behind closed doors up in Redmond and other cities around the world, working on mainly Microsoft products, and so what Microsoft decided to do was to go and build a software development team, that's free and available to go and work on work on projects other than Microsoft software.

A

So, for example, we've got a customer that had a need for an operator, so we came in and actually made an open-source project to go and build that operator and and so we're working with these customers, and this particular customer along and I, had a use case to go and build highly cohesive, loosely coupled multi-directional, complex multi configuration multi-component, multi technology, multi-platform, high scale, high availability, low latency, big data system or, in other words a stream processing pipeline. In this case, it's a flexible stream processing platform. Actually that can be reused for many different scenarios.

A

The scenario that we have for this particular customer was that we're actually going through and collecting a lot of data from a river system: water quality data, nitrate levels. Basically, we want to be able to measure the quality of water. That's coming down this river to see if the farming operations upriver are having an impact on various ecological downstream areas such as the Great Barrier Reef. So the problem we had and it's a bit of a joke- that big blob of text there on the pipeline, but in reality pipelines are actually fairly complex systems.

A

They contain many many components, they're like streams and our particular pipeline has a bunch of storage data bricks transforms. We've got a range of environments to worry, about dev tests and prod. We've got to think about security as well as external dependencies, and indeed a lot of the time.

A

The dependencies when you think about kubernetes are within kubernetes, but when you actually start going to work on work with pod services or other cloud native services in as your AWS, you start thinking about how can I really nicely manage those without having to have a whole second concept or a whole second paradigm, in your flows to go and actually deliver and bring those systems up?

A

So if you break down pipelines into their I, guess semantic parts, or at least conceptualize them a little bit. The reality is they're they're, a highly cohesive system, but they're loosely coupled so the components don't really know about each other, but they know sort of the interface between the components they know. What to expect coming in. The other thing was pipelines is that relationship is king, and so, if we don't, we bring relationship up as into a first class consideration as part of our software delivery.

A

Then we can actually have a really hard time as these pipelines get more and more complex down the track. Just.

B

A

Do that, especially when it none of the relationship or maybe none of the none of those none of the things that we're being asked is actually available at design-time of the system. The particular platform we've been building is is reusable. We don't actually know what the pipeline is going to look like when we build that code and deliver it through DevOps into the cluster. It's up to the customers that then go and use that this pipeline platform that we've built to define those pipelines and to go and deliver those.

A

So we we got asked by the customer to basically build a reusable set of components that they can then pull off the shelf and go and string them together in various relationship ordering to go and put together a pipeline using a UI or using a simple configuration language. The day that we've come up with after the fact, so we've we're long gone from this customer and then they can come along and decide to implement scenario, a B or C using this pipeline system.

A

So it's not just one pipeline, we get to build, which would be still difficult, but we could at least hard code a lot of that stuff. So we had to build a pipeline system that had unknown component configurations and relationships until later on highly flexible and reusable has one to end or one to many custom transforms. So there might be a whole bunch of spark jobs or Python scripts or anything that run part of this pipeline, and there could be Forks in the data stream. Some could go off to modern data warehouse.

A

Some could go to storage. Some can continue on a hot path through the system we don't know at design-time, so the idea we had was to actually take tackle as complexity in and tuck it away and not worry about it at the start, which is certain which sounds like an easy thing to do. So what we did obviously compartmentalize the whole problem, and basically we came up with a single point of configuration for these pipelines. So basically you can design the pipeline.

A

Have a look at it all your relationships, components and everything are laid out in a single file which, when you think about a pipeline in many ways, is just a directed acyclic or graph or a dag. So basically we created this configuration system that then the rest of the DevOps component tree actually goes through and makes real that actually goes and turns that into the real pipeline.

A

So we start with this dag or directed a sick of cyclical graph. You can visualize it. We can actually visualize it automatically when it gets PR din to the to the production branches, and then the DevOps makes it real. As I said, and in the background the dag is essentially created to a series of helm, charts but then get deployed into the cluster or it could be using. We could be using ansible, we could be using terraform or even just manually, build scripts. It doesn't really matter.

A

The point is, is that the config up front has no idea about what's happening behind it, so that way, we've got this first-class view of what the pipeline will look like, and so we had these configurable first-class objects, these concepts that we can take and actually put into these pipelines, but we had. We had a problem in that. When you have these first-class objects, we need a way to go and make them real, and we thought oh, we could have all these scripts to do it.

A

We could have all these other things, but it turns out we're using kubernetes and kubernetes is so much more powerful and just a system to actually hold pods to do work. In fact, we've got a lot of scenarios now where we can actually use kubernetes and have no pods in and other than operators. It's just such a powerful system for managing configuration and for our desired state configuration.

A

You know create, read, update, delete resources, you updater manifest, and it replaces what's there or updated or deletes it, and then, where we have considerations or objects that don't yet exist as a concept in kubernetes, we can actually go and extend the kubernetes api, but creating a custom resource definition or a CID.

A

That basically means that we can then say hey we want to. We want a spark notebook job and it's got these parameters and it will go and fire that up in data breaks as if it was firing up a pot in kubernetes and so as a designer of one of those configurations. They can actually see my whole catalog of things that I can do so. We don't just have to use operators to create things in kubernetes.

A

We can use them, create things outside kubernetes and that's what we're going to show you guys today so yeah when it comes to custom components like data bricks or event, hubs which is like stream processing in in Microsoft Azure. It's it's operators to the rescue, and so.

A

Operators we were unsure before we started using operators, it's a feeling, I guess new pattern, but I can tell you right now that since we've been using it for the last using the operators pattern for the last six months, we wouldn't do it any different.

A

It's because it's just such a nice way to package up and compartmentalize this piece of complexity that you can reuse over and over again in many different ways. So we can create these reusable modules and once they're good tested, we can publish them and folks can pull them off the shelf and use them in any other project. So they're completely black boxed they're. Also a first-class object. It's a concept you can get around hey. Do you know how to use the data breaks operator, where's the documentation for the data breaks operator so think?

A

In those terms, it actually becomes a thing. It's not like just some bash scripts sitting in a DevOps pipeline. It really creates even an internal community around it. You can very easily represent them as a line in a in a directed, acyclic or graph are in a config file or anything like that. They're also easy to represent in things like helm or other well-known, I guess: kubernetes delivery packages were there be ansible or any of those other style projects.

A

You could easily represent an operator and they're, obviously extremely easy deploy, deploy, update, remove because the operators themselves just work using kubernetes manifests. So people know how to do that, but I think one of the main things and one of the main reasons that we became I- guess very accepting of operators and even promoting them and internally at Microsoft, is that they're becoming well-known and so their skills being built in the industry around operators.

A

You can actually say to someone who's coming on board your project, or you can ask for people that know how to work with the operators pattern and if they don't know the pattern directly, they might know a lot of the concepts that go towards using operators pattern because essentially it's just it's mainly just kubernetes, really it's just being used in a certain way, and so that's the the background on my side from the customers perspective of why we're using operators and what they've asked us to do and so I'll hand over to as to take it away with some more some more in depth.

A

B

To your donations, thanks Jordan for explaining why we needed operators and why operators are awesome, so none less looking to what exactly as your data breaks operator is and what it does so for those who are not familiar with Azure data breaks. Azure data breaks is a spark base, analytical platform that.

B

Database is created by original creators of spark and, as name suggests, Azure data breaks is optimized for Asia. You can create a spark cluster in Azure in few minutes. It's designed for large scale data processing and it's ideal for ETL, extreme processing and machine learning spark by nature, performs all of operation on in an in-memory objects.

B

That's why it's really fast and spark on data breaks, decouples query engine and compute from the data storage, and it gives us the huge advantage what it means that you can provisions cluster and you can connect to the data where the data lives. So you don't need to copy over your data on the cluster and then you, after finishing running your script, you can shut down the cluster without worrying about losing your data.

B

Data breaks is secure. It's integrated with Azure Active Directory, so you can get granular permissions and also it provides interactive workspace where data scientists and data engineers they can write their spark code in Python, Scala, R and sequel. It also supports Java and machine learning frameworks like pi torch, tensorflow and socket layer.

B

So this is the very basic hello world notebook data breaks notebook as you can see. It's just shows hello and the name of the parameters, the name of user, but to run this at the moment. So you can go to the portal or data breaks dashboard and you can create a job or submit a job and then run it. But from the ops perspective you can ask you know that you know your SR Engineers, you're up or you're up still just go to the data books and you know use the UI.

B

They will be really angry, so what they really like, and so there are two different approach. Currently you can use, you can either use the portal or you can kind of call the data bits API or you can use a data breaks you a command line, but what we see that we see that there is a space for the operator to extend kubernetes functionality. So what, if similar to submitting the Yammer file for deployment, you submit a Yama file to run the SPARC notebook from, or data breaks notebook from the kubernetes.

B

So what you can do you can submit your yamaha kubernetes api server, creates a record and then notifies the operator. These are the spec and then this is the responsibility of the operator to call the data bricks API to provision spark or run this run the script and then update the status back. I.

B

Normally like to do a live demo, but this session is really short, so I record tree recorded the demos and I comment over it and I change the speed to two weeks, so it's a little bit faster. So basically, so, for example, this is a data bricks run and, as you can see here in a spec, you can create a spark cluster with three nodes. You know you, you specify the location of your data, bricks, notebook and then you pass the parameters and all you need to do is submitting your you notebook your manifest.

B

So, as you can see, I can get a database run. There's no runs running, but if I apply my manifest, so it starts running it so, the first time it tries to provision the cluster after provisioning, the cluster. You can see the provision this cluster is there and then, after the clusters, provisioning finishes. It runs the spray, it runs the scripts and then it up. It shows the update comment.

B

So to recap: I just applied my manifest, as you can see, the first time is pending after cluster finishes provisioning. It runs the script and then it shut down the cluster. So it's very fast and it does its job, but you might ask what, if I want to run a job with the intervals so do I need to you know provision cluster, shut it down and reprovision the cluster again or what? If you I, want to have multiple workload on the same cluster and the answer is yes possible with the operator you can do so.

B

Data breaks has a functionality of interactive cluster, so you can create a cluster and in the cluster, is keeps running, and then you can attach your data bricks, your notebook to that. So, for that, as you can see, I have a manifest for data, a database cluster. You can set the auto scaling for minimum workers and the maximum workers. You can specify the environment variables of your spark and after that you need to create a data bricks job. So you can, as you can see in this sample, I run my HelloWorld scripts.

B

Every one minutes and I pass the parameters and then I I know I specified the location of my notebook path.

B

So again, if I check the cluster, there is no cluster on my and then I can apply my manifest. So after applying my manifest I will have my cluster, so you can see that is a start. Provisioning, Interactive, cluster and data breaks gives me the ID. So after getting the cluster ID I can update my data breaks job manifest so I copy over the cluster ID and I update my data breaks job and then, uh if I apply it yep.

B

So you can see there is no a new cluster, it uses the data current a cluster and then I can use cube, ctlt's and describe to see the status of my job and I can see my page one ID URL. So with that I can actually see the output of my job.

B

So the good thing about using the operator is up seems they don't need to be learn new stuff, so they can use all the tools and monitoring's that they were and they were familiar it. So to recap: I applied and create a data break strap and then, as you can see, I can go to the portal. I can see output of every single job that it runs every one minute.

B

Now, let's look into the something similar and closer to the real were so imagine that you have a pipeline to analyze tweets so, for example, the first step is interesting to it. So you want to get a injustice which, based on the hashtag of a certain keyboard and then what you need to do even before the first step is you need to connect to the Twitter to get the tweets and then put it into stream can be. Even hubs can be kept, Co doesn't matter, but what is important is here.

B

Is you need to manage secret? So you need to connect to the third-party services. So there is a concept in databases called secret scope that you can provide keeper values for your password. But what we did here we said that what if, because up seems they want to manage all of the secrets and all of the password as a kubernetes secrets? What if in a secret scoping, you can create your key keeper values, but what? If we read this secrets from the kubernetes secrets? So that's what we did here. So you have your see.

B

We have secret scope here for the connecting to the event hubs and Twitter, and after that you need to run data wix1 again. We are attaching to the existing cluster or you can provision new cluster and then potentially for this ascribe you need some third-party libraries, it can be maven or it can be Python libraries.

B

So what you see here, you would say that I run these libraries on the cluster before running the scripts, so the data breaks automatically runs these libraries and pull that libraries for you and after that it runs the script word again, a short video. So, as you can see, I have my community secrets and then, if I apply a my, so if I apply my secret scope, it creates the secret scope in data breaks and in coverages, for you.

B

So I can get my secret scope as you can see that the objects out there and now I'm going to get my running cluster. So if I get my data brisk cluster to get the idea of the cluster running.

B

And now I can apply and run my my data. It's not okay, so.

B

The first, the first time that I run it so it says that it's running and then, if I use a cube city or describe and provide the name of my run similar to how you work with the kubernetes other objects. So you can see it. It comes to my run. I can see output of my run and you can see that it passed the parameters installs the dependencies, and it shows the tweets that it's extracted. I have another notebook to test.

B

My my my Twitter ingestion and it's called even hub, ingest and I- can attach my data breaks to the current cluster that is created by the operator. I can read this secret scopes that is created by operator and then I can run, and it's just for monitoring to see that it actually reads the tweets. So I can read all of the messages that I have in my even hub and if you're patient you can see, you can see they're the exact same to it, that we ingest it yep.

B

So to recap: I have my secret scope. I have my runs, so you can see, and after that, I ran and I ingested tweets and then with the even hub ingest. That was my data breaks, notebook to just check that stream. So I can see the value of my stream yeah, so plus I like to share that Harvey built this operator and then share some of the lesson learned that we learned along the way for building the operator. We use queue builder. There are so many tools and frameworks that we could.

B

You could use a new skill builder behind the scene to use this kubernetes api, machineries and customize, and it's really easy for creating custom resource definition for those who are not familiar with custom resource and custom controller and custom resources endpoint in kubernetes. That allows you to save a structure data, and but with that, it's not enough. So if you want to be the operator, you need a custom controller. You need the logic to set up.

B

Ok, this is my desired state check and the logic will check the current estate and keep the desired state and current state in loop. So that's the whole functionality of the operator so that with the operator you need a custom resource and then you need a custom controller.

B

So CR DS has been used for a tea party extension and we are using crts and now you can see more recently, even in kubernetes they are adapting seared sea artists for their built-in functionality and Tim Hawkins, one of the co-founder of kubernetes. He shared his vision recently that everything is going to be CRT soon and there should be there shouldn't be anything that can you can't do that they can do so another tool that a lot to share with you and it helped us a lot.

B

It's using kind and kind allows you to create a local cluster and what I really like about kind that it uses darker and it works on different operating systems. The our team was distributed and they were using different operating system. So we use kind for creating our local cluster and the benefit of using kind is. You can create a local, create a cluster and theory done in two or three minutes, and if you build the image of your operator on your local machine, you can load that image on the master node of your local cluster.

B

So it saves you a lot of time and compute because you don't need to push the image into the image repository and dock your hub or you know other image, repository on cloud and then pull it down when you are testing, especially when you are writing web oops. So it's everything is your look on your local computer and especially it's really good when you are in your test pipeline, so you can actually test test your operator.

B

Another field that we found it very useful in terms of the onboarding is using the deaf container, so deaf container runs the source code inside the container. So you don't before using the F container or onboarding process was taking about half a day, but after using deaf container, it is reduced on to three minutes. So you just clone the repository and then with Visual Studio code and deaf container. It runs everything inside the container.

B

It's all the setup for the core, all the setup for using kubernetes kind, and it also gives you the ability to debug and run so it's it's very powerful and it's really easy. If you have a distributed team or if you have a team with the different setups, I highly recommend to check it out and we we got a lot out of using deaf container by that. This is our github repo. Please check it out and if you have any use case or anything that you like to chat with us, Jordan I will be.

B

We will be here today and you have our contact details. We like to have a chat and see that how you are using our operator. Thank you.

A

B

Looking forward.

B