From YouTube: Automate & Scale Data Pipelines the Cloud Native Way | Guillaume Moutier OpenShift Commons Briefing
Description
Automate & Scale Data Pipelines the Cloud Native Way
Guillaume Moutier (Red Hat)
OpenShift Commons Briefing
March 12, 2020
A: Hello everybody, and welcome to yet another OpenShift Commons briefing. Today we're going to tell you how to automate and scale your data pipelines the cloud native way. Guillaume Moutier from Red Hat will be reintroducing himself, telling you a little bit about himself, and then giving us a bit of a deep dive into our data pipelines initiative. So Guillaume, please take it away. There will be live Q&A at the end of this, and I will get the slides from him and we will post it all on blog.openshift.com and on YouTube, as usual.
B: So, let's get started. First, cloud native: as you said, to set the stage I want to go back a little bit over what the characteristics of a cloud native platform can be. Here I'm listing the things that are most important to me, but what you must never forget is why you are doing these things: the business outcomes. What are we trying to achieve when we implement this kind of architecture? For me, the most important things are speed, efficiency and, foremost, adaptability.
B: We know now that technology is moving fast, really fast, and we have to adapt our businesses and organizations to be able to handle that kind of change. Adaptability is and was my main concern all through my career, and now we have the tools and the technology to be able to achieve these business goals. So let's take a look at what I would call a legacy data pipeline architecture.
B: I call it legacy, but I know for sure that for most organizations it is still the standard way to do things. We're looking at architectures that are very tightly coupled and not easily scalable. For example, take a very basic application where a user saves a file to some storage so that it can be processed by an application. The way this works for most applications is that there is some storage mounted on a computer; it can be a shared folder or something like that. The file is sent to the storage, which again has to be mounted on the server, over CIFS or iSCSI, but in any case some kind of hard connection between the storage and the application server, and then it's consumed by an application, let's say some Java application. The problem with this architecture is, first, that I have to keep things very close to one another because of this mounting requirement.
B
I
cannot
Manta
CIFS
over
thousands
of
kilometers.
Doesn't
work
well
also
the
scalability
problem
when
I
am
using
this
type
of
connection.
That
means
that
if
I
want
to
put
up
another
application
server,
for
example,
because
I
want
to
scale
my
application
capabilities
well,
it
has
to
have
exactly
the
same
configuration
as
my
first
server
exactly
the
same
storage
connection,
exactly
the
same
man
point
and
behavior.
So
that's!
B: That's okay if you have one or two servers, but if you have tens or hundreds of them, that's a burden you have to take care of. Now let's look at a more cloud native way to do this kind of thing. We can think of an application, again where a user just sends a file, but this time to an object storage, and here it's a fully disconnected mode.
B: When you use object storage, consuming the storage is only an HTTP connection: it's just a PUT or a GET, and then that's it, I'm finished. I have no remaining connection between my user application and the object storage. The same goes for the data processing functions: they can consume the storage directly, as they need it. That means they can be wherever you want and they can scale; it will be much easier to do. And now we have what I would call intelligent storage: in the latest release of Ceph we now have bucket notifications.
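To make the "only HTTP" point concrete, here is a minimal sketch of that interaction using Python and boto3 against an S3-compatible endpoint. The endpoint URL, credentials and bucket name are placeholders, not the ones used in the demo.

```python
import boto3

# Any S3-compatible object store works here (Ceph RGW, NooBaa, AWS S3, ...).
s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.com:8080",   # placeholder endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# The "upload" side: a single HTTP PUT, no mount, no persistent connection.
with open("payment-batch-001.txt", "rb") as f:
    s3.put_object(Bucket="uploads", Key="payment-batch-001.txt", Body=f)

# The "processing" side: a single HTTP GET, from wherever the function runs.
obj = s3.get_object(Bucket="uploads", Key="payment-batch-001.txt")
data = obj["Body"].read()
```

Because both sides only speak HTTP, the uploader and the processing function can live in completely different places and scale independently.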
B: That means that whenever something happens in the object storage, it can send a notification, let's say to a Kafka bus, which will itself trigger some data processing function. I like to put Kafka in the middle of this kind of architecture because it can act in two different ways. First, as a buffer: let's say my data processing function is not ready, or not ready yet; the notifications keep coming in on the Kafka bus, and when the function is ready, the notifications on the topic are consumed and the function can run its process. But Kafka can also act as a hub for all those notifications. We can imagine that we have different processing functions, maybe in different places, different data centers, performing different operations, but every one of them feeding on the same topic. So we'll try to do it for real and build an application that works like this. For this demo I took the example of ACH payments.
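As a sketch of how a processing function might drain those buffered notifications, here is a small example using the kafka-python client; the topic name, bootstrap address and group id are illustrative, not the demo's. Because Kafka retains the messages, the function can start late and still catch up (the buffer behaviour), and a second function subscribing with a different group_id would receive the same notifications independently (the hub behaviour).

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

# Notifications buffered on the topic wait here until a consumer is ready.
consumer = KafkaConsumer(
    "bucket-notifications",                            # illustrative topic name
    bootstrap_servers="my-cluster-kafka-bootstrap:9092",
    group_id="ach-split",                              # a different group_id gets its own copy of the stream
    auto_offset_reset="earliest",                      # pick up anything that arrived while we were down
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    event = message.value
    print("object event received:", event)
    # ... fetch the referenced object from storage and process it ...
```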
B
For
for
you,
people
who
are
not
in
the
in
the
United
States
ACH
can
be
seen
as
electronic
check
electronic
payments,
so
it
can
be
used
by
a
customer,
paying
a
service
provider
and
employer
depositing
money,
and
your
your
checking
account
for
payrolls.
All
those
kind
of
things
happening
electronically
for
my
demo
here
I
will
try
to
implement
this
very
basic,
very
basic
pipeline,
where
someone
buys
something
from
merchants
and
there
is
an
electronic
payment
happening.
B: The way it works is that the transaction has to be sent to the bank of the merchant, and this bank will produce what is called an ACH file. It's a standard file, we'll come to it in a minute, that is sent to the Federal Reserve. There it is processed and made available to the receiving bank. The receiving bank is the customer's bank, so it is the one that processes the transaction and debits the account of the customer.
B
Ok,
so
that's
the
the
the
basic
process
of
ACH
and
as
a
reference
here
is
the
the
ACH
file
itself
and
which
works.
You
know
very
unfashionable
transaction
with
the
first
line,
giving
information
about
the
the
bank,
the
bank
itself
and
some
basic
information
about
the
company.
Second
line
more
details
about
the
company,
and
then
you
have
all
those
transaction
fields
with
the
different
customers
here:
the
amount
of
money
that
they
have
that
they
have
spent
and
which
bank,
which
receiving
bank
this
transaction
should
be
sent.
Okay.
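As an illustration of how a processing function can read those records, here is a small Python sketch that sums the entry-detail lines of a NACHA-style ACH file. The field offsets follow the standard 94-character entry detail layout (record type '6', receiving bank routing number, then the amount in cents); the files generated in the demo may not match this exactly.

```python
def summarize_ach(path):
    """Count entry-detail records and total their amounts, grouped by receiving bank."""
    totals = {}
    with open(path) as f:
        for line in f:
            if not line.startswith("6"):       # '6' marks an entry detail record
                continue
            receiving_dfi = line[3:12]         # receiving bank routing number + check digit
            amount = int(line[29:39]) / 100    # amount field is 10 digits, in cents
            totals[receiving_dfi] = totals.get(receiving_dfi, 0) + amount
    return totals

# Example result: {'021000021': 1234.56, '071000013': 987.65, ...}
```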
B
This
is
how
I
have
implemented
it
inside
openshift,
so
I
have
here
some
kind
of
generator,
we'll
come
to
it
that
generates
fake
transactions
and
send
send
those
files,
enzymes
inside
an
object,
storage
bucket.
Then
this
one
will
trigger
a
notification
that
will
be
sent
to
kefka
bus,
and
here
I
will
be
using
kenneth,
eventing
and
carrot
of
serving
that's
a
way
in
kubernetes
and
in
up
and
shift
to
create
on
demand
paths
on
demand
function.
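With Knative Eventing, the Kafka source forwards each notification to the service as an HTTP POST, and Knative Serving scales its pods up from zero to handle it. Here is a minimal sketch of what such a function could look like, assuming a small Flask app and an S3-style notification payload; the payload shape, environment variables and names are assumptions for illustration, not the demo's actual code.

```python
import json
import os
import boto3
from flask import Flask, request

app = Flask(__name__)
s3 = boto3.client("s3",
                  endpoint_url=os.environ["S3_ENDPOINT"],
                  aws_access_key_id=os.environ["S3_ACCESS_KEY"],
                  aws_secret_access_key=os.environ["S3_SECRET_KEY"])

@app.route("/", methods=["POST"])
def handle_notification():
    # The event source delivers the bucket notification in the request body.
    event = json.loads(request.get_data())
    for record in event.get("Records", []):            # S3-style event structure (assumed)
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        # ... split or process the ACH file here ...
        print(f"processed {key} ({len(body)} bytes) from {bucket}")
    return "", 200

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
```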
B
So
I
have
a
service
that
will
be
listening
for
Kefka
events
and
then
spinning
up
a
deployment
of
the
container
that
will
process
the
file.
If
we
process
the
transaction,
so
here,
what
you
will
do
is
create
an
ACH
file
for
for
the
transactions
and
send
them
to
to
the
bank.
So
the
bank
of
the
merchant,
so
here
I,
will
have
a
few
packets
I.
Do
my
Gmail
with
seven
different
banks
or
seven
different
buckets
to
which
the
different
files
will
be
sent
depending
and
the
merchant
sending
sending
the
file
at
the
origin
bank.
B
Those
files
will
be
processed.
Basically,
what
it
will
do
is
look
at
all
the
transactions
and
recreate
new
ACH
files,
this
time,
sending
it
to
to
a
destination
to
the
destination
bank
to
the
receiving
bank.
All
those
files
will
be
created
and
burst
into
the
into
different
buckets.
So
this
was
this
time
buckets
billing
belonging
to
the
the
receiving
banks
where
they
will
be
processed.
So
here
the
the
standard
process
will
be
to
look
at
the
transaction
and
they
beat
gee
I
can't
after
after
customer.
B
What
we
will
do
in
this
demo
is
that
will
only
look
unjam
unprocessed
and
we
will
just
sum
them
up
in
some
wine
in
some
some
big,
some
big
bucket,
just
to
see
how
many
transactions
were
processed
in
how
many,
how
much
amounts
of
money
was
was
processed
all
throughout
all
throughout
this
pipeline.
So
to
implement
these
few
things
that
I
need
some
calf
care
topics
to
be
able
to
send
my
notifications.
So
here
you
can
see
at
the
bottom.
B
They
have
the
American
upload
topic
and
and
the
ODF
a
topic
where
I
will
send
a
file.
Then
I
have
some
buckets
that
I
have
created
in
my
in
my
storage.
Here
are
all
the
buckets
that
I
have
and
don't
worry.
You
will
have
access
to
the
two
to
the
code
and
everything
to
be
able
to
to
reproduce
the
demo.
B
So
I
wasn't
going
to
into
too
many
details
on
this
right
now
and
then
we
will
program
the
back
identifications
themselves,
how
it
how
its
done
in
in
in
surf
and
RHCs
in
the
reddit
surf
storage.
You
can
do
what
you
do
is
to
create
a
topic
that
will
point
to
your
Kafka
to
your
calf
cap.
End
point
okay,
so
here
I
will
create
a
topic
with
the
name.
Rg
Fi
and
I
will
point
it
to
my
craft,
like
a
calf,
calf,
cluster,
okay
and
then
for
each
bucket.
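Ceph RGW exposes this through its S3-compatible and SNS-like APIs, so it can be scripted with boto3. Here is a sketch of the two steps, creating a topic that pushes to Kafka and then attaching a notification configuration to a bucket, with placeholder endpoint, credentials, topic and bucket names; the exact attributes supported depend on your Ceph/RHCS version, so treat this as an outline rather than the demo's configuration.

```python
import boto3

endpoint = "http://rgw.example.com:8080"   # RGW endpoint (placeholder)
creds = dict(aws_access_key_id="ACCESS_KEY", aws_secret_access_key="SECRET_KEY")

# 1. Create a topic that pushes events to the Kafka cluster.
sns = boto3.client("sns", endpoint_url=endpoint, region_name="default", **creds)
topic = sns.create_topic(
    Name="achfile",
    Attributes={"push-endpoint": "kafka://my-cluster-kafka-bootstrap:9092"},
)

# 2. Attach a notification to the bucket: every object creation raises an event on the topic.
s3 = boto3.client("s3", endpoint_url=endpoint, **creds)
s3.put_bucket_notification_configuration(
    Bucket="merchant-uploads",
    NotificationConfiguration={
        "TopicConfigurations": [
            {"Id": "achfile-notif",
             "TopicArn": topic["TopicArn"],
             "Events": ["s3:ObjectCreated:*"]}
        ]
    },
)
```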
B: Finally, before we go on to the live demo, this is the transaction job. The way it works is that it launches a container that generates our transactions, and it runs 60 times with a parallelism of five. That means I will be able to create five files at a time inside my OpenShift cluster. So let's go, let's do this. Here I am in my project.
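For reference, the generator container's work can be imagined roughly like this: each run builds one file of random transactions and PUTs it into the upload bucket, and the Kubernetes Job (completions 60, parallelism 5) simply runs the container 60 times, five at a time. This is an illustrative sketch, not the demo's actual generator; the file format, bucket name and environment variables are placeholders.

```python
import os
import random
import uuid
import boto3

s3 = boto3.client("s3",
                  endpoint_url=os.environ["S3_ENDPOINT"],
                  aws_access_key_id=os.environ["S3_ACCESS_KEY"],
                  aws_secret_access_key=os.environ["S3_SECRET_KEY"])

def generate_file(bucket="merchant-uploads"):
    """One Job run: write one file of fake transactions into the upload bucket."""
    lines = []
    for _ in range(random.randint(300, 500)):           # transactions per file
        amount_cents = random.randint(100, 200_000)     # $1.00 to $2,000.00
        merchant = f"merchant-{random.randint(1, 7)}"   # seven merchant banks in the demo
        lines.append(f"{uuid.uuid4()},{merchant},{amount_cents}")
    key = f"transactions-{uuid.uuid4()}.csv"
    s3.put_object(Bucket=bucket, Key=key, Body="\n".join(lines).encode())
    print(f"uploaded {key} with {len(lines)} transactions")

if __name__ == "__main__":
    generate_file()
```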
B: I can see that I have three pods, which are the Knative pods, the OpenShift Serverless pods that are listening for events. I also have, in OpenShift Serverless, three different services which will split the ACH files or process them. You can see those services are ready, but there are no pods running, so right now we are scaled to zero.
B
So
let's
create
these
just
transactions
here,
I
will
use
that
the
exact
same
file
just
showed
you
and
now
it's
being
put
into
motion.
So
here
we
can
see
that
we
have
five
containers
creating
based
on
the
the
transaction
transaction
image
transaction
container,
that
I
that
I
have
designed
and
they
will
begin
to
create
new
transactions
and
as
new
transaction
files
are
created.
Well,
it
triggers
containers,
it
triggers
the
orifice
plate.
That
means
looking
at
the
ACH
file
and
splitting
them
and
putting
them
inside
the
red
bucket.
B
It
also
triggers
our
GFI
split,
that's
what's
happening
when
it
looks
inside
the
ACH
file
and
splitting
them
together
to
send
it
to
the
receiving
banks
and
then
the
our
GFI
process.
So
here
I'm
processing
the
transactions
themselves.
It
will
be
better
with
a
live
view
like
this.
Here,
it's
a
graph
on
a
dashboard
where
I
have
my
pipeline.
We
can
see
that
we
have
already
generated
15
15,
different
transaction
fires.
16
now
so
for
16
have
been
processed
and
dispatched
to
the
different
Bank
of
origin,
and
so
far
we
have
treated
8.
B
We
have
processed
8
8
of
them.
Those
fires
are
are
splitted
in
for
the
different
specific
banks
and
they
are
sent
here
to
to
the
receiving
banks
buckets
where
they
are
processed
and
so
far
we
have
processed
75
of
them.
Of
course,
we
have
many
more
files
in
this
process
because
we
take
H
originating
file
and
split
them
split
each
transaction
towards
its
its
own
receiving
bank.
We
can
see,
as
the
process
is
going
on,
that
the
the
CPU
usage
is
increasing.
Of
course,
we
are
spinning
more
clouds
as
we
as
we
need
them.
B
We
have
also
the
RAM
usage
going
on
and
I
have
some
lags
here
on
the
deployments,
but
it
should
keep
up
in
a
few
seconds
and
we
can
see
here
the
the
value
of
the
transactions
that
have
been
processed
so
far,
so
we
can
see
it's
going
up.
We
are
now
at
about
9
million
dollars
what
I
Jen
right
here
for
transactions?
It's
it's
a
random
number
of
transactions
between
300
and
500
of
them
for
each
guy
and
the
amount
itself
is
between
$1
and
$2,000.
Okay.
B
So
that's
the
kind
of
transactions
and
generating-
and
here
we
can
see
the
different
deployments
now
that
we
have.
We
are
now
up
to
15
parts.
We
can
see
that
we
have
five
deployments
of
the
create
transaction
part.
That's
the
maximum
parallelism
that
I
authorized
for
this.
We
have,
of
course,
my
listeners
for
the
Kefka
events,
but
the
treatment
themselves.
B
The
processing
itself
is
how
do
you
have
a
GFI
split
is
what's
happening
here
at
this
point,
so
here
it
doesn't
consume
much
resources,
because
it's
only
looking
at
the
files
and
depending
on
of
the
the
American
banks,
sending
it
to
the
different
buckets
here.
So
not
many
resources
involved.
So
there's
only
one
deployment
of
these
this
process,
but
here,
if
I
look
at
our
deifies
plate
here,
that's
what's
happening
and
in
this
box.
B: That's why the serverless function has automatically been scaled to two deployments, because that's what it needs to be able to handle the traffic coming in. Same for the ACH file process: it looks at the files and processes them, adding up the amount of money that all those transactions represent, and it also needs two of those pods to do the processing. And here is what's happening now: we can see that we have reached the maximum number of files that we wanted to generate, 60, so our create-transaction pods have scaled down to zero.
B: Of course, that's what we wanted to do. And here we have also reached 60 for the first step of processing, so the ACH file split pods should come down to zero in a few seconds; we can see that we are already consuming a little bit less memory. So this is a neat way to demonstrate that, using only bucket notifications and serverless functions, you can fully automate your data pipelines.
B
It
doesn't
require,
you
know
some
kind
of
application
that
will
orchestrate
everything
and
will
take
care
of
everything
here.
It's
only
a
few,
a
few
files,
a
few
configuration
files
that
you
put
into
motion
that
allows
you
to
to
create
very
simply
this
kind
of
pipelines.
So
speaking
of
files
and
I
will
go
back
here.
B
Speaking
of
files,
you
will
have
all
the
code
and
all
the
all
the
different
configuration
files
and
containers
images,
and
things
like
this
in
this
repo
I-
will
also
put
it
in
a
few
days,
a
full
full
world,
true
to
be
able
to
reproduce
this
kind
of
demo
and,
of
course,
feel
free
to
to
reach
out
for
some
more
information
or
if
you
have
questions
or
problems
implementing
this
kind
of
things
it
will
be.
It
will
be
a
pleasure
to
to
reply
to
this.
B: Everything is there: the container code to build the pods that create or process the transactions, the Kafka topic creation; there is everything you need to go from scratch, starting from a brand-new OpenShift installation and installing everything that you need.

A: Awesome.
A
So
look
forward
to
other
people
taking
this
for
a
test
run
and
drying,
and
Emily
and
I
really
appreciate
you
taking
the
time
today,
Jim
and
look
forward
to
having
you
back
for
update
new
updates
on
this
topic.
So
thanks
again
and
hey,
everybody
would
like
to
re-watch
this.
There
will
be,
it
will
be
uploaded
on
the
YouTube
channel
later
today
and
I'll
steal
the
slides
from
Guillaume
shortly
and
also
link
them
up
there
as
well
and
put
a
blog
post
with
some
other
resources
up
on
blog
that
openshift
com.
A
So
look
for
that
in
the
next
coming
days
and
we
will
continue
to
provide
you
with
entertaining
and
educational
briefings
over
the
coming
weeks
to
take
place
of
some
of
the
conference's
that
have
been
cancelled.
So
look
for
that
on
the
events
page
at
open,
Commons,
openshift,
org
so
take
care
everybody,
and
thank
you
very
much.