From YouTube: The Evolution of TiKV
Hello everyone, thanks for joining us today. I am Charles, a software engineer at PingCAP. Today I'm going to talk about the evolution of TiKV. The talk will be divided into four sections. First, we will briefly introduce the history of TiKV and summarize the major requirements for a cloud native key-value store.
OK, let's get started. First, why did we want to build TiKV in the first place? There are so many key-value stores out there, so why did none of them fit our case, and what are the requirements for a cloud native key-value store? To answer that, we have to talk a little bit about the history of TiDB, a cloud native database that supports both OLTP and OLAP workloads. Back in 2015, when we built the first version of TiDB, we built it on top of HBase and HDFS.
If you have ever dealt with the Hadoop software stack, you may have the same feeling: it is not easy to work with. In addition, since we wanted to develop a database supporting distributed transactions, we had to add distributed transaction support on top of HBase, which caused poor performance and high operation costs.
After suffering enough, we finally decided to build our own storage engine with the following requirements. First, it must be able to leverage modern cloud infrastructure to easily scale out like other NoSQL systems. Second, it should support high-performance I/O. Third, it should support distributed transactions inherently. Fourth, it should use a modern data replication protocol, and we chose the Raft protocol. Fifth, we want to make sure it is easy to maintain, so it must use a clean architecture. And last but not least, it should be easy to use.
Anyone who has used another key-value store should be able to pick it up very quickly. So from day one, we decided to develop a storage engine that can be a solid building block for other distributed systems and really help people out, and this principle applies to all software from the PingCAP community. In general, we can divide the aforementioned requirements into four categories, which is to say the cloud native storage engine we wanted to build should first be highly scalable.
Second, it should ensure data consistency. As we mentioned before, we want to build a storage engine that supports distributed transactions inherently and uses a modern data consensus protocol. Third, it should support high-performance I/O: if we have hundreds of clients trying to read or write the storage at the same time, it is critical that they can get a response within a short time. And of course, it should be extremely reliable, since a key-value store usually serves as the bedrock of the whole system.
Next, I will try to elaborate on how TiKV meets all these requirements. But before that, let me briefly introduce the overall system architecture of TiKV at a high level, and hopefully it can help you better understand the rest of this talk.
A TiKV cluster looks like this. The TiKV nodes are at the bottom; you can have as many as you like, but we usually suggest having at least three. We split the data into regions. I know the name is confusing, but you can just treat a region as a subset of the data. As shown in the graph, we split the complete data set into five subsets, which are five regions. Each region contains multiple replicas, which are spread across different TiKV nodes. In most cases, three replicas should be enough. On the top we have the TiKV client, communicating with the TiKV cluster through the gRPC protocol, and on the top right we have the placement driver, which is the scheduler of the TiKV cluster.
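To make that client path a little more concrete, here is a minimal sketch assuming the community tikv-client Rust crate and a placement driver listening on 127.0.0.1:2379 (both are my assumptions for illustration, not something stated in the talk): the client asks PD where each key lives and then talks to the TiKV nodes over gRPC.

    // Minimal sketch; assumes the `tikv-client` and `tokio` crates as dependencies
    // and a PD endpoint at 127.0.0.1:2379 (assumptions, not from the talk).
    use tikv_client::RawClient;

    #[tokio::main]
    async fn main() -> Result<(), Box<dyn std::error::Error>> {
        // The client contacts the placement driver first, which tells it
        // which TiKV node holds the region leader for each key.
        let client = RawClient::new(vec!["127.0.0.1:2379"]).await?;

        // Reads and writes then go over gRPC to the relevant TiKV nodes.
        client.put("hello".to_owned(), "world".to_owned()).await?;
        let value = client.get("hello".to_owned()).await?;
        println!("value = {:?}", value);
        Ok(())
    }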
One level up, we have the consensus layer. This layer presents an abstraction over the multiple replicas, which allows the upper layers to feel like they are dealing with one piece of data instead of multiple replicas. Like etcd, we use the Raft consensus protocol to ensure data consistency between the replicas across multiple TiKV nodes, but unlike etcd, we use multiple Raft groups (Multi-Raft) instead of a single Raft group to improve scalability and I/O throughput, which we will cover in the following slides. On top of the consensus layer,
we have the transaction layer. We use multiversion concurrency control (MVCC) to implement the Google Percolator protocol, which commits transactions using a two-phase commit algorithm. A lot of database terms, right? If you feel like it is a little bit too much, that's normal; it is not easy for database experts either. In my opinion, distributed transactions are one of the hardest parts of a database management system to understand, but fortunately TiKV has already implemented them and we can simply rely on it.
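To make those terms a little more concrete, here is a minimal sketch of what the transaction layer looks like from the client's side, again assuming the tikv-client crate's transactional API and a local PD endpoint (my assumptions); the Percolator-style two-phase commit and the MVCC bookkeeping happen inside commit().

    // Minimal sketch; assumes the `tikv-client` and `tokio` crates and a PD
    // endpoint at 127.0.0.1:2379. Illustrative only.
    use tikv_client::TransactionClient;

    #[tokio::main]
    async fn main() -> Result<(), Box<dyn std::error::Error>> {
        let client = TransactionClient::new(vec!["127.0.0.1:2379"]).await?;

        // Begin a transaction: it gets a start timestamp and, with it,
        // its own consistent snapshot of the database.
        let mut txn = client.begin_optimistic().await?;

        // Read-modify-write inside one transaction.
        let existing = txn.get("counter".to_owned()).await?;
        let new_value = if existing.is_some() { "updated" } else { "initial" };
        txn.put("counter".to_owned(), new_value.to_owned()).await?;

        // commit() runs the Percolator-style two-phase commit (prewrite, then
        // commit) under the hood; we only see success or a conflict error.
        txn.commit().await?;
        Ok(())
    }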
First, let's try to find out: is TiKV highly scalable? In TiKV, we divide large data sets into small regions, with each region containing several replicas that make up their own Raft group. The number of replicas may change depending on how the TiKV nodes are spread geographically, but in most cases each single TiKV node does not need to store a full copy of the data. That is to say, TiKV is horizontally scalable.
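As a toy illustration of that idea (my sketch, not TiKV's actual code), the snippet below splits a key space into a handful of regions and spreads three replicas of each region across a set of nodes, so no single node holds the full data set.

    // Toy sketch of range-based regions with replicas spread across nodes.
    // Illustrative only, not TiKV's real data structures.

    #[derive(Debug)]
    struct Region {
        start_key: String,    // inclusive
        end_key: String,      // exclusive ("" means unbounded)
        replicas: Vec<usize>, // node ids holding a replica of this region
    }

    fn place_replicas(num_regions: usize, num_nodes: usize, replication: usize) -> Vec<Region> {
        (0..num_regions)
            .map(|i| Region {
                start_key: format!("k{:04}", i * 1000),
                end_key: if i + 1 == num_regions {
                    String::new()
                } else {
                    format!("k{:04}", (i + 1) * 1000)
                },
                // Spread replicas round-robin so no node holds a full copy.
                replicas: (0..replication).map(|r| (i + r) % num_nodes).collect(),
            })
            .collect()
    }

    fn main() {
        for region in place_replicas(5, 4, 3) {
            println!("{:?}", region);
        }
    }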
If you remember, we have a central component called the placement driver which acts as the scheduler of the TiKV cluster. As the replicas of each region make up their own Raft group, the leader of each group holds the complete information about the region, like the number of active replicas, the topology of the replicas, and so on, and the leader of each group sends a heartbeat to the placement driver periodically carrying all this information.

Then, when the TiKV cluster receives a request from a client, it will first query the placement driver to get the target node and then route the request to that node. The placement driver also allows users to create highly customized configurations, like the maximum number of replicas in each region, how to rebalance the leaders across nodes, and the scheduling throughput, because too much data balancing and migration may affect the performance of online services.
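Here is a toy sketch of that routing step (illustrative only; the real placement driver protocol is gRPC-based and much richer): find the region whose key range covers the key, look up that region's leader node, and send the request there.

    // Toy sketch of client-side routing via a placement-driver-like lookup.
    use std::collections::BTreeMap;

    struct RegionInfo {
        leader_node: String,
    }

    // Map from region start key -> region info, as a client-side cache of what
    // the placement driver hands back.
    fn route<'a>(regions: &'a BTreeMap<String, RegionInfo>, key: &str) -> Option<&'a RegionInfo> {
        // The region whose start key is the greatest one <= key covers the key.
        regions
            .range(..=key.to_string())
            .next_back()
            .map(|(_, info)| info)
    }

    fn main() {
        let mut regions = BTreeMap::new();
        regions.insert("a".to_string(), RegionInfo { leader_node: "tikv-1".into() });
        regions.insert("m".to_string(), RegionInfo { leader_node: "tikv-2".into() });

        if let Some(info) = route(&regions, "hello") {
            println!("send request to {}", info.leader_node); // tikv-1
        }
    }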
To sum up: by breaking the data into partitions and spreading the replicas of each partition across nodes, a TiKV cluster can be easily scaled out to hundreds of nodes. Here are two large real-world TiKV clusters. The left one belongs to one of our users: it is a TiDB cluster backed by a large TiKV cluster consisting of 168 nodes, with 1,820 billion rows and 318 terabytes of data, which needs to support 100 million reads and 87,000 writes per second. It used to be the largest one, but soon we had a larger one, which consists of 212 nodes and can hold up to 827 terabytes of data.
We have not tried to push further yet, but we did not experience any pressure with this cluster, so I believe we can actually grow larger. OK, now we have met the scalability requirement. Next, we need to ensure data consistency. When we are talking about the data consistency of a distributed database or data store, we usually need to achieve two goals: one is isolating transactions from each other, and the other is keeping the data consistent between replicas.
These are two complicated topics, each worth a whole session to discuss, and we do not have time for that today, so let's try to keep it simple. In short, a transaction is an operation unit containing multiple operations. For example, you may want to read an entry and update it depending on the existing value; this involves one read operation and one write operation. Isolation between transactions means two transactions will not interfere with each other. Meanwhile, we have multiple replicas for each data partition,
so we need to make sure the contents of the replicas are consistent with each other. As we can see in the graph, the left subtree contains the different transaction isolation levels and the right subtree contains the local object consistency models. If you still remember, TiKV internally contains four layers, and the two layers in the middle are the transaction layer and the consensus layer. The transaction layer ensures two things. One is snapshot isolation, which means we create an independent, consistent snapshot of the database for each transaction.
Its changes are visible only to that transaction until commit time. This allows us to handle transactions concurrently if they do not interfere with each other. The other is repeatable read, which means any data that has been read cannot change: if the transaction reads the same data again, it will find the previously read data in place, unchanged. As for the consensus layer, by using the Raft consensus protocol TiKV ensures strong consistency, also known as linearizability, among the replicas of each object.
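To give a feel for how MVCC makes those guarantees possible, here is a toy versioned store (my illustration, not TiKV's actual MVCC layout): every committed write adds a new version at a timestamp, and a read at a snapshot timestamp only ever sees versions committed at or before it, so repeated reads return the same data.

    // Toy MVCC store: each key maps to a list of (commit_ts, value) versions.
    // A read at snapshot `ts` returns the newest version with commit_ts <= ts.
    use std::collections::{BTreeMap, HashMap};

    #[derive(Default)]
    struct MvccStore {
        versions: HashMap<String, BTreeMap<u64, String>>,
    }

    impl MvccStore {
        fn write(&mut self, key: &str, commit_ts: u64, value: &str) {
            self.versions
                .entry(key.to_string())
                .or_default()
                .insert(commit_ts, value.to_string());
        }

        fn read(&self, key: &str, snapshot_ts: u64) -> Option<&String> {
            self.versions
                .get(key)?
                .range(..=snapshot_ts)
                .next_back()
                .map(|(_, v)| v)
        }
    }

    fn main() {
        let mut store = MvccStore::default();
        store.write("k", 10, "v1");
        store.write("k", 20, "v2");

        // A transaction whose snapshot is at ts=15 keeps seeing "v1", even
        // though a later transaction committed "v2" at ts=20.
        assert_eq!(store.read("k", 15), Some(&"v1".to_string()));
        assert_eq!(store.read("k", 25), Some(&"v2".to_string()));
        println!("snapshot reads are repeatable");
    }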
The next requirement we need to meet is high-performance I/O. On the read side, we use multiversion concurrency control to implement two-phase commit, which means we keep a version with each key-value pair and create an isolated snapshot for each single read and write. This prevents a read request from being blocked if there is an ongoing write on the same entry. We also apply many other optimization approaches.
For example, we support follower reads, which allow us to spread the read workload across all replicas, and we prioritize small reads to prevent the overall throughput from being affected by a few large reads. As for the writes: since we divide data into partitions, with the replicas of each partition making up their own Raft group, write requests on key-value pairs belonging to different partitions can be handled concurrently.
Large values would otherwise amplify this write work; Titan avoids that by storing large values outside of RocksDB. As for scans: since we do range-based sharding, which divides the data into continuous ranges determined by the sorted keys, scanning keys that share the same prefix can be very efficient, since they are usually located in a handful of regions.
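Here is a toy example of why prefix scans are cheap over range-sharded, sorted keys (illustrative only): keys that share a prefix are adjacent in sort order, so the scan walks one contiguous range instead of the whole data set.

    // Toy prefix scan over sorted keys. Because keys sharing a prefix are
    // adjacent in sorted order, the scan touches one contiguous range, which
    // in TiKV terms usually means a handful of regions.
    use std::collections::BTreeMap;

    fn scan_prefix<'a>(data: &'a BTreeMap<String, String>, prefix: &str) -> Vec<(&'a String, &'a String)> {
        data.range(prefix.to_string()..)
            .take_while(|(k, _)| k.starts_with(prefix))
            .collect()
    }

    fn main() {
        let mut data = BTreeMap::new();
        data.insert("user:1:name".to_string(), "alice".to_string());
        data.insert("user:1:email".to_string(), "a@example.com".to_string());
        data.insert("user:2:name".to_string(), "bob".to_string());
        data.insert("order:9".to_string(), "pending".to_string());

        // Only the two "user:1:" entries are visited.
        for (k, v) in scan_prefix(&data, "user:1:") {
            println!("{} = {}", k, v);
        }
    }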
Last but not least, we need to make sure that TiKV is reliable; otherwise, all the aforementioned advantages would be meaningless. We usually recommend users to run at least three TiKV nodes and keep three replicas for each partition. By default, we will spread the replicas across nodes; therefore, we can tolerate the failure of some of the nodes. If you plan to deploy TiKV on a public cloud, you can also spread the TiKV nodes across multiple availability zones, and the placement driver will place a complete copy of the data in each zone.
In addition, we also make sure that TiKV is well tested. We run regression tests frequently to ensure that new features are backward compatible and will not break historical versions. We also run large multi-day performance tests that use real-world data to ensure the performance is predictable, since many users adopt TiKV as the backend storage engine to build their own software-as-a-service products.
Predictable performance is critical for making service level agreements for higher-level services. If you are familiar with Jepsen, they run tests on various distributed systems to validate their consistency claims. We also run a range of Jepsen tests on TiKV to ensure we deliver the transactional promises we have made.
As we can see, TiKV is a highly scalable, distributed key-value store with high-performance I/O and high reliability. It works pretty well in an on-premise environment; however, we are seeing an ever-growing demand for deploying TiKV on public clouds, and public cloud infrastructure is very different from on-premise infrastructure.
Nowadays, most public cloud vendors provide virtualized disks. Those disks can be mounted to a local file system and they appear like a local disk as well, but internally they forward I/O to multiple remote disks that are potentially shared by multiple users. AWS EBS, for example, will replicate every write I/O to three different locations.
Ideally, we want a large TiKV cluster to behave similarly to a traditional relational database system, but unfortunately that is actually hard to accomplish on cloud storage. First, we want to build a scalable service, but scale means more failures. To be more specific, we are worried that as the system scales, its storage hardware performance is very likely to degrade.
Cost is a problem too, because every storage operation is charged by the cloud vendors. Users now have more reason to care about exactly how and why our system is using those resources, and by that I mean read and write amplification, which is the amount of I/O the system needs to issue before finishing one user request.
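Put another way (my phrasing, not a formula from the talk), write amplification can be expressed roughly as:

    write amplification ~= bytes physically written to storage (user writes + compaction + GC + replication)
                           / bytes written by the user

and read amplification is defined analogously for reads.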
Here is a simple graph demonstrating our system's runtime usage of I/O resources over time. The user writes are very stable, as you can see from the yellow bar here, but they are amplified multiple times because of background writes, which include compaction (since TiKV uses RocksDB internally) and garbage collection. In addition to that, large events incur extra I/O that is usually not predictable from this graph.
Now, let's talk about how we accomplish the primary goal. Raft Engine maintains an in-memory index of all log entries. The reason we do that is not to improve read performance; it is actually about reducing the background work. In RocksDB, compaction is needed to keep all data sorted and to clean up deleted data, but in Raft Engine we don't need to sort anything, and garbage collection doesn't need to read out obsolete data, because we have a map of all active data in memory. So for Raft Engine, no additional I/O tuning is required and there is no extra overhead.
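A toy version of that idea (my sketch, not Raft Engine's actual design): keep log entries in append-only files plus a small in-memory map from (Raft group, log index) to the entry's file location, so garbage collection just forgets index entries and deletes whole files once nothing in them is live, without ever reading the obsolete data back.

    // Toy sketch of an append-only log with an in-memory index, in the spirit
    // of what the talk describes for Raft Engine. Illustrative only.
    use std::collections::HashMap;

    #[derive(Clone, Copy, Debug)]
    struct Location {
        file_id: u64,
        offset: u64,
    }

    #[derive(Default)]
    struct ToyLogEngine {
        // (raft group id, log index) -> where the entry lives on disk.
        index: HashMap<(u64, u64), Location>,
        // How many live entries each append-only file still holds.
        live_per_file: HashMap<u64, u64>,
    }

    impl ToyLogEngine {
        fn append(&mut self, group: u64, log_index: u64, loc: Location) {
            self.index.insert((group, log_index), loc);
            *self.live_per_file.entry(loc.file_id).or_insert(0) += 1;
        }

        // Garbage collection: forget entries up to `up_to` for a group. We never
        // read obsolete data; files with no live entries can simply be deleted.
        fn compact(&mut self, group: u64, up_to: u64) -> Vec<u64> {
            let stale: Vec<(u64, u64)> = self
                .index
                .keys()
                .filter(|(g, idx)| *g == group && *idx <= up_to)
                .copied()
                .collect();
            for key in stale {
                if let Some(loc) = self.index.remove(&key) {
                    let live = self.live_per_file.entry(loc.file_id).or_insert(0);
                    *live = live.saturating_sub(1);
                }
            }
            // Files that are now fully obsolete can be removed as a whole.
            self.live_per_file
                .iter()
                .filter(|(_, live)| **live == 0)
                .map(|(file, _)| *file)
                .collect()
        }
    }

    fn main() {
        let mut engine = ToyLogEngine::default();
        engine.append(1, 1, Location { file_id: 1, offset: 0 });
        engine.append(1, 2, Location { file_id: 1, offset: 64 });
        engine.append(1, 3, Location { file_id: 2, offset: 0 });
        let deletable = engine.compact(1, 2);
        println!("files safe to delete: {:?}", deletable); // [1]
    }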
We categorize the writes of the system I/O into three different priorities; let's call them priority A, priority B, and priority C here. During execution, we periodically assign individual I/O limits to those priorities. At the beginning, the I/O limits are high and all I/O runs with no restrictions; eventually, the I/O usage approaches the predefined global limit.
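A toy sketch of priority-based I/O budgeting along those lines (my illustration, not TiKV's actual rate limiter): as long as total demand stays under the global limit nothing is throttled, and once it goes over, the remaining budget is handed out from the highest priority down.

    // Toy sketch of priority-based I/O budgeting: high-priority traffic is
    // served first, and lower priorities only get whatever budget is left once
    // usage exceeds the global limit. Illustrative only.

    #[derive(Debug, Clone, Copy)]
    enum Priority {
        High,
        Medium,
        Low,
    }

    // Given recent demand (bytes/sec) per priority, highest priority first,
    // and a global limit, return the per-priority budgets for the next interval.
    fn assign_budgets(demand: [(Priority, u64); 3], global_limit: u64) -> Vec<(Priority, u64)> {
        let total: u64 = demand.iter().map(|(_, d)| *d).sum();
        if total <= global_limit {
            // Under the global limit: everyone runs unrestricted.
            return demand.to_vec();
        }
        // Over the limit: hand out budget from the highest priority down.
        let mut remaining = global_limit;
        demand
            .iter()
            .map(|(p, d)| {
                let granted = (*d).min(remaining);
                remaining -= granted;
                (*p, granted)
            })
            .collect()
    }

    fn main() {
        let demand = [
            (Priority::High, 400),   // e.g. foreground user writes
            (Priority::Medium, 300), // e.g. Raft log and flush traffic
            (Priority::Low, 500),    // e.g. compaction and garbage collection
        ];
        for (p, budget) in assign_budgets(demand, 800) {
            println!("{:?}: {} bytes/sec", p, budget);
        }
    }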