From YouTube: Migrating unicorn company to real-time stream processing
Description
Every company is a data company nowadays. Ruslan will present how he migrated Bolt from batch data workloads and synchronous processing to real-time data streaming and asynchronous processing.
He will talk about the problems he faced along the way and the lessons the team learned from this migration. Ruslan will also focus on the value real-time data unleashed for the company.
All of us live in the era of big data, meaning that the volumes of data we have to process, store, and handle only grow over time. But users don't actually care about that. They want access both to the current, fresh data and to all the historical data, and, last but not least, there is the speed of data delivery. Users usually don't want to wait for the data to propagate through the systems or to get delivered to the final system. They want access to it: the faster, the better.
Now that we know which requirements people have towards the data, let's discuss why any company would want to adopt stream processing. Remember, we have talked briefly about the importance of being efficient in today's business, and that's exactly the point of this slide: any company would prefer to adopt stream processing because it puts data into motion. It allows data to flow throughout the systems, and it means the company can develop so-called reactive microservices.
Those are services which react, or respond, to some kind of change in some state, or to some kind of event. Imagine you are buying something on Amazon: when you click the buy button, many actions have to happen, like an invoice has to be generated and sent to your email, and the inventory in the warehouse has to be reserved.
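The buy-button example can be sketched as a simple event fan-out, where independent handlers react to the same event. This is a minimal in-process illustration of the pattern; the handler names and event shape are hypothetical:

```python
from typing import Callable

# A minimal in-process event bus: handlers subscribe to an event type
# and each one reacts independently when an event is published.
class EventBus:
    def __init__(self):
        self.handlers: dict = {}

    def subscribe(self, event_type: str, handler: Callable) -> None:
        self.handlers.setdefault(event_type, []).append(handler)

    def publish(self, event_type: str, event: dict) -> None:
        for handler in self.handlers.get(event_type, []):
            handler(event)  # each reactive service handles the event on its own

bus = EventBus()
actions = []

# Two "reactive microservices" interested in the same purchase event.
bus.subscribe("order_placed", lambda e: actions.append(f"invoice emailed to {e['email']}"))
bus.subscribe("order_placed", lambda e: actions.append(f"reserved item {e['sku']}"))

bus.publish("order_placed", {"email": "user@example.com", "sku": "B00X"})
```

In a real system the "bus" would be a broker like Kafka, and the handlers would be separate deployed services rather than in-process callbacks.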
We all live in the era of microservices, and they have plenty of different benefits, like, for example, ease of deployment, or being decoupled from one another. But they also come at some cost, and usually that's how it looks in your production environment: want it or not, at some point you end up in a zoo where you have plenty of different microservices talking to each other, and good luck deriving any meaning from this mess.
Unfortunately, right now there is no engineer-friendly or easy way to guarantee that some communication between the services has happened. Those of you who have worked with similar setups know that people usually solve it with some kind of distributed transactions or state machines, and you also know how painful that is. So when we at Bolt thought about it: all right, how would we tackle this problem? How can we migrate towards stream processing?
Where do we get events from? We started thinking from the other direction. We use MySQL as the relational database for all our microservices, and those of you who have worked with databases might know that most modern databases have a so-called commit log under the hood. Basically, it is the log of all the changes which the database has applied.
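As a rough mental model (not MySQL's actual binlog format, just an illustration of the idea), a commit log is an ordered, append-only record of every change, and replaying it from the start reconstructs the database state:

```python
# A toy commit log: every write to the "database" is first appended
# to an ordered log of change events, which downstream systems can tail.
commit_log = []
table = {}

def apply_change(op: str, key: str, value=None):
    commit_log.append({"op": op, "key": key, "value": value})  # append-only
    if op == "upsert":
        table[key] = value
    elif op == "delete":
        table.pop(key, None)

apply_change("upsert", "ride:1", {"status": "requested"})
apply_change("upsert", "ride:1", {"status": "accepted"})
apply_change("delete", "ride:1")

# Replaying the log from the start reconstructs the exact same state,
# which is what makes it a natural source of events for streaming.
replayed = {}
for change in commit_log:
    if change["op"] == "upsert":
        replayed[change["key"]] = change["value"]
    else:
        replayed.pop(change["key"], None)
```

Change-data-capture tools work by tailing exactly this kind of log and turning each entry into an event.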
One of the crucial requirements we agreed on when we started developing this whole pipeline was that it has to be scalable, because the company grows year over year. We launch new cities and new countries, we provide rides to more and more customers, and all of us agreed that it has to be scalable.
But any time someone says anything to me about scalability, there is one piece of software, one system, which comes to my mind straight away, and that's Apache Kafka. For those of you who might not be familiar with Apache Kafka: it is a messaging system which was developed at LinkedIn and later open-sourced.
One of the features Apache Kafka provides is write-once, read-many semantics. It means that you can send, or write, one event to it, which can then be processed by any number of independent consumers, at their own pace, at different points in time in the future. There is no need to duplicate the same event so that it can be processed by many consumers: you can write it once and then process it as many times as you want.
now
that
we
understand
from
where
can
we
get?
A
Can
we
get
those
events
from
and
that
we
persist
them
to
apache
kafka
now
comes
the
part
of
stream
processing,
let's
deep
dive
into
it
right
now
on
the
market,
there
are
plenty
of
different
frameworks
and
libraries
which
are
actually
doing
stream
processing.
So
it's
up
to
you
to
decide
whichever
one
of
those
you
would
like
to
to
use,
but
before
maybe
discussing
them,
let's
define
what
actually
a
stream
is,
and
stream
has
two
important
characteristics.
First
of
all,
it
is
unbounded
flow
of
data.
It basically means that you are getting the events right now, you will keep getting them in the future, and there will be no end to that. You cannot say that at some point they will stop: it is unbounded, and you will simply keep getting them continuously. The second important characteristic is that it is happening in real time. You don't have to process it once an hour, once a day, or once a week; it is happening at any given point in time.
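In code, an unbounded stream is naturally modeled as something you iterate over without ever reaching the end. A Python generator makes a convenient stand-in (bounded artificially here with `islice` just so the example terminates):

```python
import itertools

# An "unbounded" stream of events: the generator never ends on its own.
def event_stream():
    for i in itertools.count():
        yield {"ride_id": i, "distance_km": (i % 5) + 1}

# A processor consumes the stream as events arrive; it never sees
# "all the data", only the events so far. We take 4 just to stop the demo.
processed = [e["distance_km"] * 1.5 for e in itertools.islice(event_stream(), 4)]
```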
Whenever we are talking about stream processing as a framework, there are two types of stream processors which you have to very clearly differentiate between. The first ones are so-called stateless processors: they process every single event independently of all the previous events, and they don't have any history whatsoever.
Examples of such stream processors can be, for example, some transformations, or maybe extracting some field from an event, or repacking the event into a different form: something which you can do for every event separately. The second ones are so-called stateful stream processors: whenever they process any new event, they also keep all the history of the previous events which they have processed. Examples here can be, for example, calculating some aggregations.
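A minimal sketch of the distinction (plain Python, not any particular framework): a stateless processor maps each event on its own, while a stateful one carries an accumulator across events:

```python
# Stateless: each event is transformed independently; no history kept.
def extract_distance(event: dict) -> float:
    return event["distance_km"]

# Stateful: a running aggregate depends on every event seen so far.
class RunningTotal:
    def __init__(self):
        self.total = 0.0  # the processor's state

    def process(self, event: dict) -> float:
        self.total += event["distance_km"]
        return self.total

events = [{"distance_km": 2.0}, {"distance_km": 3.5}, {"distance_km": 1.5}]

stateless_out = [extract_distance(e) for e in events]   # each value independent
agg = RunningTotal()
stateful_out = [agg.process(e) for e in events]          # each value depends on history
```

The practical consequence is that stateful processors need their state stored somewhere durable, which is what makes them harder to scale and to restart.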
The second one, KSQL, is a good tool for, let's call it, lightweight stream processing. But sometimes you might want to do some very sophisticated logic; it might be, for example, something related to your business logic, when you need to do some kind of very specific aggregations or transformations, and so on. In this case the Kafka ecosystem provides you with the Kafka Streams library. It allows you to define stream processing tasks as Java applications.
A
So
whatever
you
can
express
in
your
java
code,
you
can
do
with
your
stream
processing.
Very
good
news
is
that
both
of
the
frameworks
are
can
be
actually
extended.
If
you
need
to
implement
something
which
is
very,
very
much
tailored
towards
your
business,
you
can
write
the
so-called
user-defined
functions.
Basically, you define a Java function, and it can be whatever, like an aggregation, or it can do some business logic, and then you can call this function both from KSQL, like any SQL function, or from your Kafka Streams application. And it can be anything: as I already said, something very much related to your business logic, or it can also be some kind of fancy machine learning, used for fraud prevention, anomaly detection, anything.
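To illustrate the shape such a function takes (a hypothetical per-event check sketched in Python, not the actual Java UDF API): a user-defined function is essentially a function the engine calls for each event, optionally keeping some state of its own:

```python
# A hypothetical user-defined function: flag a ride price as anomalous
# if it exceeds the running mean by more than `threshold` times.
class PriceAnomalyUdf:
    def __init__(self, threshold: float = 3.0):
        self.count = 0
        self.mean = 0.0
        self.threshold = threshold

    def __call__(self, price: float) -> bool:
        is_anomaly = self.count > 0 and price > self.mean * self.threshold
        # Update the running mean incrementally, one event at a time.
        self.count += 1
        self.mean += (price - self.mean) / self.count
        return is_anomaly

udf = PriceAnomalyUdf()
flags = [udf(p) for p in [10.0, 12.0, 11.0, 90.0]]  # last price is an outlier
```

A registered UDF of this shape could then be invoked from a SQL statement or a streams topology just like any built-in function.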
So that's the setup we have adopted at Bolt. We ingest data from all the source databases, we persist it into the Kafka brokers, we do the stream processing with the libraries I have mentioned and also persist the results to Kafka, and then those results are consumed by a number of different consumers. We store some events in our data lake, and we also allow backend microservices to consume those events and do the business logic and decision making on top of them. Which problems did we encounter along the way?
That was not actually a problem, but it was one relatively big challenge and obstacle which we had to overcome.
The second problem: whenever you're working with stream processing, you can be getting hundreds of thousands of events per second, and if someone comes to you and says, hey, you know, this number is not actually correct: good luck debugging it, because it's really, really hard to find where this error is actually happening.
Let's say you have released some new logic and, a few days after that, you have realized that there was a mistake there. Now you can replay the history, replay those events, and make some adjustments to your business logic. But you should never think that you have to replay only some specific piece of data, because, like I said, a stream is an unbounded flow of data, so you should always think of it in the following way.
A
And
other
problems
which
we
have
seen
along
the
way
are
the
let's
say:
for
example,
data
deduplication
network
is
not
reliable,
different
services
can
fail,
and
eventually
it
would
lead
to
duplicating
some
of
the
events,
and
so
it
is
important,
so
your
consumers
are
prepared
for
that.
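The usual preparation is to make consumers idempotent, for example by remembering the IDs of events already processed and skipping repeats. A minimal sketch of that pattern (real systems typically keep the seen-set in a database, or lean on the broker's exactly-once features instead of an in-memory set):

```python
# An idempotent consumer: duplicate deliveries of the same event id
# are detected and skipped, so the effect happens only once.
class IdempotentConsumer:
    def __init__(self):
        self.seen_ids = set()
        self.applied = []

    def handle(self, event: dict) -> bool:
        if event["id"] in self.seen_ids:
            return False  # duplicate: already processed, skip
        self.seen_ids.add(event["id"])
        self.applied.append(event["payload"])
        return True

consumer = IdempotentConsumer()
deliveries = [
    {"id": "e1", "payload": "charge ride 1"},
    {"id": "e1", "payload": "charge ride 1"},  # redelivered after a retry
    {"id": "e2", "payload": "charge ride 2"},
]
results = [consumer.handle(e) for e in deliveries]
```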
A
Next,
one
is
types
in
compatibility
whenever
you
integrate
many
different
systems
and
make
them
talk
to
each
other,
you
have
to
be
very,
very
careful
so
that
they
process
all
the
data
and
types
the
same
way
and
last,
but
not
the
least
like
I
said
if
you
want
to
start
sourcing
the
events
from
your
databases.
You
should
also
think
about
how
do
you
handle
database
schema
migrations
like,
for
example,
adding
some
fields
to
the
table
or
maybe
changing
the
type
or
creating
new
tables.
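One common way consumers survive such migrations is the tolerant-reader pattern: read only the fields you need and supply defaults for fields added later. A sketch of the idea (in practice this is usually enforced with a schema registry and Avro or Protobuf compatibility rules rather than hand-written dict handling):

```python
# A tolerant reader: old and new versions of a "ride" record are both
# accepted; fields missing from older records get defaults.
def read_ride(record: dict) -> dict:
    return {
        "ride_id": record["ride_id"],                  # required in all versions
        "distance_km": record.get("distance_km", 0.0),
        "tip": record.get("tip", 0.0),                 # field added in a migration
    }

old_record = {"ride_id": 1, "distance_km": 4.2}              # pre-migration
new_record = {"ride_id": 2, "distance_km": 3.0, "tip": 1.5}  # post-migration

rides = [read_ride(old_record), read_ride(new_record)]
```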