Description
OpenShift at Royal Bank of Canada
Raj Channa, Royal Bank of Canada (RBC)
Dhwanil Raval, Royal Bank of Canada (RBC)
OpenShift Commons Gathering 2019, Red Hat Summit
Case Study: OpenShift @ Royal Bank of Canada (RBC)
A: I'm Raj Channa. I'm part of RBC's tech infrastructure group, where I run the technology strategy and research function, and with me is my colleague Dhwanil Raval, who is going to help us with a demo at the end of my talk. The talk today is about containerizing Spark on Kubernetes and OpenShift. We're going to go through RBC's journey of containerizing it: what options were on the table, what we chose, and why we chose it.
A: But before we go on, let's take a quick look at what the traditional big data ecosystem is. At the very bottom we have the storage layer, which consists of a bunch of commodity servers hooked up together. It's usually a Hadoop HDFS file system that takes care of the whole storage subsystem, and while it says storage there, generally the compute runs there as well, because the whole Hadoop philosophy is to run the compute where the data is.
A: That's one of those things, but we really wanted customizability. In a big enterprise like ours, where you have multi-tenancy and a lot of people running their applications on the platform, you always have one or two application teams who want to go to a newer version of Spark, for example, or a newer version of some other runtime, and you generally can't do that easily in a traditional Spark, a traditional big data system.
A: We also wanted to make on-demand provisioning of Spark clusters available to our users, so they could provision them on demand and autoscale as well. Predictability for SLAs was a big issue for us, because some of the jobs were taking a long time to run, they were not meeting SLAs, the real-time jobs were interrupting the batch jobs, and so on.
A: Now, in the last picture we showed, we saw there were a lot of other open source frameworks on top of it, and one of the things you generally can't do very well in a traditional big data system is scan those open source packages efficiently and on a regular basis.
A: So we also wanted to be able to scan these things at each runtime and provide a better security posture for ourselves. The last thing is infrastructure optimization: what we've seen is that running the same Spark job on a traditional system versus running it on OpenShift and Kubernetes needed about 33% less hardware, so there were a lot of efficiencies there as well. Now I just want to go over a high-level overview of what we ended up doing. This is the general picture; this particular piece is the Spark core engine.
A: These are the few orchestrators that allow you to orchestrate workloads underneath Spark. We spoke about YARN and Mesos; the standalone Spark scheduler is the default scheduler that comes with the Spark distribution; and then you have Kubernetes. Now, in the previous slides I showed you, we went from YARN to Kubernetes, but in reality we didn't end up doing exactly that. As we went through it, we actually used Kubernetes in conjunction with the standalone scheduler, and this is how we ended up running our Spark jobs.
A: So you have Kubernetes at the bottom, we put the standalone scheduler on top, and then the Spark jobs run through that standalone scheduler. This might look a bit odd to you, but we'll go over why we did it this way, and some of the shortcomings of running everything directly on Kubernetes as well. Before we dive into it, there's one more technical thing I need to go through: the difference between the two different Spark deployment modes.
A: This is the standard OpenShift architecture; I'm not going to go over it. But let's look at how cluster mode actually worked for us. In cluster mode, the client submits the spark-submit job to OpenShift and it hits the API server. If we blow this up, the spark-submit job is actually going directly to the Kubernetes API server host, and the deploy mode is cluster.
A: This becomes important later, because we couldn't actually use this cluster mode in the end. From the API authentication server, the request goes to the scheduler, the driver gets started on a node, and then the driver starts all the executors. That's the general overview of cluster mode. We could not use it, for a couple of reasons.
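(For illustration, a cluster-mode submission against the Kubernetes API server looks roughly like the sketch below. The API server URL, namespace, and container image are placeholders, not RBC's actual values.)

    # Cluster mode: spark-submit talks to the Kubernetes API server, which
    # schedules the driver pod; the driver then requests executor pods.
    spark-submit \
      --master k8s://https://api.openshift.example.com:6443 \
      --deploy-mode cluster \
      --name spark-pi \
      --class org.apache.spark.examples.SparkPi \
      --conf spark.executor.instances=2 \
      --conf spark.kubernetes.namespace=spark-demo \
      --conf spark.kubernetes.container.image=example/spark:2.3.0 \
      local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar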
A: One, we were using Spark 2.3. 2.4 is the latest version right now, but we had to use 2.3 because our apps were not ready to go to 2.4 yet. The 2.3 Spark distribution did not support client mode, which was very important to us. It did not have PySpark support. And the most important thing: it did not have Kerberos support for the Hadoop file system itself.
A: Now, when it starts the driver and the worker nodes, it's really just pods that start up there. There is no StatefulSet or controller, so we also did not have a liveness probe or a readiness probe along with it. So we went ahead and looked at what other options existed out there in the industry and in the market, and one of the things out there was Oshinko, which gave us the standalone client mode.
A: So we tested this out as well. Here, with the Oshinko CLI, we deploy the Spark cluster. We start off with zero workers initially and then scale it up: the master runs on one of the nodes, then we scale up the worker nodes, and once we do that the worker pods come up, and then we do the spark-submit.
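(A minimal sketch of that scale-then-submit flow, assuming the worker controller is a StatefulSet named spark-worker with an app=spark-worker label; both names are hypothetical, and the exact Oshinko CLI invocation is not shown.)

    # Cluster was deployed with zero workers; scale the workers up before submitting.
    oc scale statefulset/spark-worker --replicas=3
    oc get pods -l app=spark-worker    # wait until the worker pods are Running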
A: So the standalone mode fixed two of the issues we had: one was client mode support, and it did have PySpark support as well. But what it still did not have was support for Kerberos and the Hadoop file system itself. So this is where we got stuck.
We spent probably over a month trying to figure out how to fix this problem, and during this time we were doing a lot of research and contacting industry experts, trying to see what was out there in the market that could help us solve this problem, because without Kerberos support there was no moving forward for us.
So the next two slides will talk about what we did to resolve this and what we had to develop. After trying to find something from the industry and not finding anything, what we did was start looking at what YARN was actually doing to solve this problem. These are the different components you have: the Kerberos domain controller (KDC) here, and the client.
A: So what we did was mimic that on the worker nodes. There was a bootstrapping script, or startup script, where we did a similar kinit as well, and once we did that, as the next step the Spark workers convert the Kerberos tickets into delegation tokens with some supporting scripts, after which the Spark client submits the job and it gets access to the HDFS file system. And on OpenShift, this is how it looks.
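(A minimal sketch of what such a pod startup script might do, assuming the keytab is mounted into the pod at /etc/secrets/spark.keytab and the principal is spark-user@EXAMPLE.COM; both are placeholders.)

    #!/bin/bash
    # Pod bootstrap script: obtain a Kerberos ticket before the Spark worker starts,
    # so it can later be exchanged for an HDFS delegation token.
    kinit -kt /etc/secrets/spark.keytab spark-user@EXAMPLE.COM
    klist    # sanity check: confirm the ticket cache was populated

    # ... start the Spark worker process here; per the talk, the workers then
    # convert the Kerberos ticket into an HDFS delegation token for the job.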
A: We have a customized Spark template. We deploy that with zero worker nodes initially, and the master gets deployed, after which we scale up, similar to the previous slide, and the worker nodes get deployed. And now the client submits the job. The client, again, is not submitting it to the k8s (Kubernetes) API server; it's submitting it to the Spark master. The deploy mode here is client.
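(For illustration, a client-mode submission against the standalone master looks roughly like this; the spark-master service name, port, resource values, script, and HDFS path are placeholders.)

    # Client mode: the driver runs where spark-submit runs, and the job goes to
    # the Spark standalone master rather than the Kubernetes API server.
    spark-submit \
      --master spark://spark-master:7077 \
      --deploy-mode client \
      --executor-memory 2G \
      --total-executor-cores 4 \
      wordcount.py hdfs:///user/spark/input.txt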
A: So we got PySpark and client mode support. The top three things there come from the standalone distribution of Spark itself, and the bottom few things actually come from the Kubernetes StatefulSet. I would say this is not ideal, but it works for us right now. We would love to see this completely supported and integrated with Kubernetes, especially the Kerberos part of it. So, a little bit on logging and monitoring and what we did.
A: Essentially, we have Filebeat installed, which goes and collects the logs from the different nodes, and from there it goes to Elasticsearch and then Kibana. For monitoring we use Prometheus and Grafana. All the pods expose metrics on a particular port, and from there it goes into Grafana.
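(As a rough sketch, each Spark pod exposes a metrics endpoint that Prometheus scrapes; the pod name and port below are placeholders, not the values used at RBC.)

    # Spot-check the metrics endpoint a worker pod exposes for Prometheus.
    oc exec spark-worker-0 -- curl -s http://localhost:7777/metrics | head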
B: So now we're going to see some real action: Spark running on OpenShift. Essentially, I'm going to divide my demo into three parts for simplicity. First is going to be the environment setup; second, I'm going to talk about the template we have put together and the components it builds on OpenShift; and finally, the spark-submit job.
B: So on your screen you're looking at the OpenShift environment we have put together for this demo, and along with that we've got an EMR cluster, which stands in for the Hadoop service, with its own KDC to simulate a secure environment, a secure HDFS environment. Now, as you can see, I'm doing a kinit to get a fresh ticket so that I can go and talk to HDFS, and I'm querying the HDFS data. The input file, input.txt, is the data set against which I'm going to execute my job.
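(The commands behind that step look roughly like this; the principal, keytab path, and HDFS paths are placeholders.)

    # Get a fresh Kerberos ticket, then query the secured HDFS data set.
    kinit -kt ./user.keytab demo-user@EXAMPLE.COM
    hdfs dfs -ls /user/demo
    hdfs dfs -cat /user/demo/input.txt | head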
B: This is the template we have put together for the StatefulSet, along with the Elastic and monitoring pieces. So now let's go ahead and deploy this template with oc apply. We have the Spark template up and running, and now we can go and deploy it.
B: Essentially there are a whole bunch of parameters you can tune when deploying this, but I'm tuning just a few of them: a PVC, where I'm asking for 10 gig of PVC to store my logs for the master and workers; the compute requirements for my master pod, the Spark master, which is two cores and two gig of memory; and I'm asking for zero workers. I'm starting with zero so I can scale up later, when I'm ready to start my job. And finally I'm asking for the worker compute, which is four cores of CPU and four gig of memory. At the last, I'm passing some of the HDFS parameters: the user principal and the keytab. Here I'm getting this keytab into the worker pods using a config map, but you could use an OpenShift secret or a Vault secret.
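(A sketch of that deployment, assuming a template file named spark-template.yaml; the parameter names below are hypothetical, not the exact names in RBC's template.)

    # Register the custom Spark template, then instantiate it with tuned parameters.
    oc apply -f spark-template.yaml
    oc process spark-template \
        -p PVC_SIZE=10Gi \
        -p MASTER_CPU=2 -p MASTER_MEMORY=2Gi \
        -p WORKER_REPLICAS=0 \
        -p WORKER_CPU=4 -p WORKER_MEMORY=4Gi \
        -p HDFS_PRINCIPAL=spark-user@EXAMPLE.COM \
        -p KEYTAB_CONFIGMAP=spark-keytab \
      | oc apply -f -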
B: As you can see, it has created a whole bunch of objects in my project. For better visualization, let's move over to the OpenShift console to look at what's going on there. Essentially it has created three applications: one is for logging, Prometheus for monitoring, and the Spark environment itself, that's a master and zero workers for now, because I'm going to scale up later.
B: So it's at zero replicas, that's fine; I've got the master up and running. Now let's look at the Spark master environment: zero workers, no running application, no compute, that's fine, but the master is up and running. Now let's look at Prometheus. I'm using OAuth to log in, passing my OpenShift credential to log into Prometheus.
B: We've also got Elasticsearch up and running as well. So, as you can see, in just a few seconds we got a fully operational Spark environment up and running with its own logging and monitoring solution, and we are good to go to point our Spark job against this Spark cluster we have just spun up. Essentially, this is the simple shell script we put together. It has three parts.
B: First, it's going to scale the workers: I'm asking for two replicas of the workers. The second step is going to make sure that all the workers I'm asking for, based on the number of replicas, are up and running. And the next step actually executes the Spark job; I'm just looping it, because it's a very small job. I'm pointing it to the Spark master I just created, with deploy mode client and some of the executor memory and compute requirements. And if you see here, that's the Spark token: I'm passing the HDFS delegation token with the spark-submit, and I'm going to run my wordcount.py. Again, it's the HDFS data; I don't need to pass the whole URL because I'm using the defaultFS in core-site.xml. And finally it's going to tear down the workers to zero, and you can even choose to tear down the whole cluster and spin it up again when you're good to go, so both options are available.
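(A minimal sketch of that script, assuming a worker StatefulSet named spark-worker and a master service named spark-master; names, paths, and resource values are placeholders, and the exact way the HDFS delegation token is attached is omitted.)

    #!/bin/bash
    # Part 1: scale the workers up to two replicas.
    oc scale statefulset/spark-worker --replicas=2

    # Part 2: loop until all requested worker replicas are ready.
    until [ "$(oc get statefulset/spark-worker -o jsonpath='{.status.readyReplicas}')" == "2" ]; do
      sleep 2
    done

    # Part 3: run the job in client mode against the Spark master, then tear down.
    spark-submit \
      --master spark://spark-master:7077 \
      --deploy-mode client \
      --executor-memory 2G \
      --total-executor-cores 4 \
      wordcount.py /user/demo/input.txt   # defaultFS comes from core-site.xml, so no full URL

    oc scale statefulset/spark-worker --replicas=0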
B: So now let's go in and execute the script. The StatefulSet was patched; as you can see, it's moving from zero to two, and it's not going to take more than five seconds. Now let's look at the logs. Here you can see it now has the HDFS token, so all the workers can go and talk to HDFS, and it should be the same for worker one.
B: So if you see, the Spark job has started already, the Spark master now has two workers and all the compute required to run the Spark job, and one job is already running. Now let's look at the monitoring. Prometheus has already detected that the workers are up and running, and it has even started scraping them. So now let's look at the metrics being generated: the workers are generating a whole bunch of metrics, and Prometheus is pulling a whole bunch of data from the workers.
B: Let's look at the heap usage. If you concentrate on the graph on the right side, you can see some data points generated for the worker, and as you scale up you can see more data points here. Similarly, it has some metrics for the master as well, like the number of alive workers, which went from 0 to 2, and you can put some kind of alerting on that before you start your job, and similarly for the application.
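(For example, the master's alive-worker count can be queried from Prometheus and alerted on before a job starts; the Prometheus URL and the metric name below are hypothetical and depend on how the Spark metrics sink names it.)

    # Ask Prometheus for the Spark master's alive-worker gauge (hypothetical metric name).
    curl -s 'http://prometheus-demo.example.com/api/v1/query' \
      --data-urlencode 'query=spark_master_aliveWorkers'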
B: So with the Prometheus and Spark integration, we got full observability into the Spark environment's state, including the JVM, like how my job is doing, and so on. That was more about the monitoring aspects of it. Now let's look at the logging. Like Raj mentioned, we are using Filebeat to ship our logs from the master and workers to Elasticsearch, and now I'm using the Kibana console to see those logs.
B: So now I'm streaming logs live, all the way from all the workers and masters to here, so that a big data developer can troubleshoot their job in real time. They don't need to wait for the logs to be made available at some central location, and so on. So this was more about the logging aspects of this whole operationalization.