From YouTube: Introducing YugabyteDB
Description
View this talk presented by Karthik Ranganathan, Yugabyte Co-Founder and CTO, to get an introduction to YugabyteDB.
This talk was presented at Microservices and Cloud Native Apps SF Bay Area Meetup on March 29, 2018.
Learn more: https://www.yugabyte.com/
We worked on the database side of Facebook during the period when they went from a pre-cloud-native architecture to a cloud-native, microservices-type architecture, one that really held up under a lot of requests. That's where we learned about planet scale, what the challenges are with the data tier, and so on. We're about 25 people now, a lot of us from modern web companies: Facebook, Nutanix, Google, LinkedIn, and so forth.
We've built a whole bunch of apps on top of NoSQL platforms. I worked on Cassandra before it was called Cassandra, before it was open source, way back before NoSQL was even a thing, and a whole bunch of us worked on HBase. We put it in production for some very large use cases, including Facebook Messages, Facebook Messages search, and our operational data tier, which did alerting, metrics, spam detection, and a whole bunch of others.
So our current company is a distillation of all of that. But before we go into what we do now, I wanted to quickly talk about how people build planet-scale apps. The Docker plus Kubernetes ecosystem is great for the stateless tier: it's very intent-based, really easy to deploy, there are a lot of great tools, and that's pretty much where everyone is headed. But how do you deal with the data tier? How does that look? And of course, we're talking about a world where public clouds have become the mainstay.
A
There's
like
70
80
90
data
centers
available
to
you,
I
just
have
two
data:
centers
is
no
longer
an
excuse
and
your
users
don't
really
care.
They
want
their
answers
quickly.
If
not
they're
going
to
the
next
app
right
or
Amazon
will
come,
get
you
depending
on
your
field.
So,
okay,
how
did
people?
How
do
people
do
this?
Well?
Typically,
there's
a
shortage,
sequel,
master/slave
for
the
most
critical
data
and
then
there's
one
or
more,
no
sequel
tiers
for,
depending
on
your
velocity
of
data
density
of
data,
type
of
access
pattern,
etc,
etc.
So you now put your data across these tiers, and because you have your data in so many tiers, you need to bring whatever the user wants into something that can serve it really fast. So now begins the fun dance: figure out which pieces of data go where, figure out which pieces of data go into the cache because the user cares about them, and the user's behavior constantly changes.
So your cache has to adapt. Now figure out how to replicate each of those tiers, and finally, of course, there are going to be failures in this sort of setup: figure out what went wrong and how, and do it quickly, all while you keep changing stuff, changing machine types, changing data centers. You get a generous three to six months to do all of that, but it's a full-time effort from the app team, the ops team, the database team, everybody, and we've seen this story play out over and over again.
You have Aurora or RDS taking care of your SQL tier, but it's essentially the same architecture underneath, and you have DynamoDB, which replaces the NoSQL tier. So you've taken care of the machines and the patching, but the app complexity and the complexity of running this system with data integrity in production are still the same. So what did we want to do at Yugabyte? Well, we just started with "solve world hunger" and put it all in a single database.
A
Well,
of
course,
I'm
joking
right,
so
there
are
certain
trade-offs
that
you
have
to
make
when
you're
trying
to
build
scale
out
applications
in
a
cloud
right.
So
we've
made
those
trade-offs
and
we've
seen
what
it
takes
to
put
a
database
for
a
variety
of
different
workloads
and
what
are
the
common
asks
right,
and
so
that's
what
we're
going
to
be
looking
at
okay.
So
what
are
your
choices
really
right
and
where
is
the
world
evolving?
How
do
you
make
sense
of
this
space
if
you
want
a
strongly
consistent
database
to
build
global
applications?
What really are your choices? You have the traditional SQL databases, but they don't really scale out and they don't really do planet scale. You wouldn't be able to deploy your data across different data centers, or scale out when you need to add a whole bunch of machines because you need to serve more queries or store more data, and so on.
So the NoSQL guys came up with a solution: we'll take care of the scale-out. They gave up on transactions; they said we'll be highly performant and we'll do scale-out. But there's still that 20 to 30% of your workload that needs transactional guarantees, and what do you have to do for it? You need another database, which then brings in a cache, and you're back where you started. So databases like Spanner took another tack.
A
They
said
we're
going
to
push
sequel
to
its
limit,
we're
going
to
take
it
to
the
bleeding
edge
right,
so
they
give
you
transactions,
they
give
you
scale
out,
but
they're,
not
the
database.
You
would
go
to
if
you
want
to
write
a
lot
of
data
or
serve
them
really
really
quickly.
So
performance
is
what
they
gave
up
on.
So
what
are
we
doing
here
at
Yoga
byte
we're
trying
to
bring
in
the
best
of
both
worlds
into
a
single
database?
A
So
if
you
want
to
distribute
your
data
across
the
planet,
if
you
want
to
do
it
with
serve
your
data
with
high
performance,
so
if
you
just
want
a
unified,
integrated
serving
tier
and
if
you
want
the
ability
to
do
transactions,
whether
it's
single
row,
acid
or
multi,
row
acid,
like
distributed
transactions,
both
we
wanted
to
make
a
single
data
platform,
because
we
think
that
the
apps
of
the
future
are
going
to
increasingly
look
like
this
okay.
So
we
started
to
build
this
database.
What were our design goals? Transactional, high-performance, and planet-scale at the core database layer, plus cloud native and open source. On the transactional side, we built a data platform that understands single-row versus multi-row distributed operations, and it keeps the single-row path extremely fast and the multi-row path extremely consistent. On the high-performance side, we didn't go with Java like a lot of others; we went with C++, because C++ is still the language to go to if you want extremely high performance, so it's written ground-up in C++. For planet scale, we have the ability to distribute data across various regions, far and near, and we can support reads from nearby data centers so that you can achieve low-latency reads while being able to write from anywhere and maintain consistency. And of course, at the core of the database is automatic sharding and rebalancing, so you don't have to think about sharding or rebalancing.
A
The
database
does
that
for
you
on
the
cloud
native
side,
it's
all
built
for
the
cloud
and
the
container
era
so
makes
it
very
easy
to
deploy,
run
and
manage,
and
it's
self-healing
and
fault
tolerance.
So
if
there
are
any
failures
and
there's
a
lot
of
failures
in
the
cloud,
you
don't
have
to
wake
up
at
3
a.m.
to
go
figure
out
what
was
wrong
with
your
database
and
believe
me:
we've
done
that
as
a
team.
A
Many
times
you
don't
want
to
do
it,
or
some
of
you
may
already
know
that
so
and
the
self-healing
nature
is
that
you
will
still
respect
the
users
promise,
if
possible,
of
how
many
faults
you
can
tolerate
right.
As
long
as
it
is
possible
we're
an
Apache
to
ATO
project,
you
can
check
us
out
on
github
and
we
started
out
not
only
being
open
source
but
also
with
open
api's,
which
means
you
don't
have
to
learn
a
new
language
right,
we're
a
multi
model
database
and
we're
going
to
look
at
what
api
is.
We support the Cassandra Query Language and Redis, and we're working on Postgres, so these are well-known languages. So what were our core database goals? We wanted it to be very similar to Google Spanner at the core. In terms of the CAP theorem, we're a consistent and partition-tolerant database, a CP database, but we can offer HA. Isn't that like bending the laws of physics? Well, no, not really, because what we give up is just a few seconds of availability.
On a failure, we are able to re-elect a master and a leader in just a matter of a few seconds, and so you will still get your high availability. There's another way to look at it, which is the PACELC theorem, but the long and the short of it is: you can get all three elements of the CAP theorem, consistency, availability and partition tolerance,
as long as you don't hit failures. The trouble really starts when you hit a failure, and when you do, you trade a little bit of latency for consistency, because it takes a little bit of time for a new leader to get elected. At the core of the database is a document model. It can support SQL-like APIs on top; it has distributed transactions but is single-row transactional by default; and it also has the ability to support a native JSON data type, like a lot of NoSQL, MongoDB-type databases.
Okay, on the API side, we started with well-known APIs, because some of these APIs are really good for modeling distributed systems. We started with the Cassandra Query Language, which knows how to partition data across nodes, and knows, when you add a new node, how to go and read the subset of data that the new node now contains with very low latency. Or a Redis-like API, which is really good at modeling data structures, except this Redis is a full-blown database:
you don't have to worry about persisting your data elsewhere. And we're working on Postgres now. For Postgres, we have to teach it how to split data across nodes, and how to make the driver aware of multiple nodes to get performance out of a multi-node system; those are things we're working on. The core of the database, irrespective of which API you use, contains the same set of features.
You can do tunable read consistency from all of these APIs, which means you can read from your nearest data center and get a cache-like experience, or go read from the actual leader and use it like a database. It's write-optimized for large data sets, so we do streaming ingest really well, and batch reads and writes are pretty good. Data expiry with TTL (time to live) is supported by the database; most SQL databases do not have this.
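As a rough illustration of what that tunable read consistency could look like from an application, here is a minimal sketch using the standard Python cassandra-driver against the Cassandra-compatible API. The host, keyspace, and table names are made up, and mapping consistency level ONE to a nearby-replica read is an assumption for illustration, not a statement of how YugabyteDB implements follower reads.

```python
# Minimal sketch only: the same query issued at two consistency levels
# through the open-source Python cassandra-driver. Host, keyspace and
# table names are illustrative, not from the talk.
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement
from cassandra import ConsistencyLevel

session = Cluster(['127.0.0.1']).connect('yugastore')  # hypothetical keyspace

# Strongly consistent read: goes to the tablet leader, database-like behavior.
leader_read = SimpleStatement(
    "SELECT * FROM products WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM)
print(session.execute(leader_read, [1]).one())

# Relaxed read: may be served by a nearby replica, cache-like behavior.
nearby_read = SimpleStatement(
    "SELECT * FROM products WHERE id = %s",
    consistency_level=ConsistencyLevel.ONE)
print(session.execute(nearby_read, [1]).one())
```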
Okay, a very high-level, quick look at the architecture; I'm hoping you guys will ask me questions if something is not clear. At the core, it's a scale-out system: you just take a whole bunch of commodity machines or containers and put them together, and the database recognizes them and presents itself as a coherent whole. Each of these nodes or pods contains a document store, which is a heavily extended and modified version of RocksDB that works well with a Raft-based replication engine.
But we have additional things: we have extended Cassandra to add begin-transaction/end-transaction type semantics to do distributed transactions, which Cassandra does not support, as well as a JSON data type and so on. For those you will need to use our drivers, but they are in the open source as well.
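As a hedged sketch of what those extensions look like in practice, the DDL below uses the transactions table property and the jsonb column type as they appear in YugabyteDB's YCQL documentation; the exact spelling may have differed in the version shown in this talk, and the keyspace and table names are invented.

```python
# Illustrative DDL sent through the regular Python cassandra-driver; the
# 'transactions' table property and the 'jsonb' type are Yugabyte CQL extensions.
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect()
session.execute("CREATE KEYSPACE IF NOT EXISTS demo")

# Opt this table in to multi-row distributed transactions and give it a
# native JSON column.
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.orders (
        order_id text PRIMARY KEY,
        details  jsonb
    ) WITH transactions = { 'enabled' : true }
""")
```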
A tablet is a uniform partitioning of the key space: you take the entire key space and we chunk it up into a lot of small tablets. Each node hosts some set of tablets. The tablets have replicas that go across multiple nodes, depending on the replication factor. Each tablet, with its peers, will elect a leader, and they are a completely independent unit. There is no external dependency, so no ZooKeeper, nothing; each tablet does its leader election by itself, internally.
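A purely illustrative sketch of that idea, not YugabyteDB code: the hash space is pre-split into a fixed number of tablets, every key lands deterministically in one tablet, and tablets are spread across nodes. In the real system each tablet's Raft group elects its own leader; the round-robin placement below is just for the sketch.

```python
# Conceptual sketch only: uniform partitioning of a hash space into tablets.
import hashlib

NUM_TABLETS = 16          # arbitrary for the sketch
HASH_SPACE = 2 ** 16      # pretend 16-bit hash space

def tablet_for_key(key: str) -> int:
    """Hash the key and map it into one of the uniform tablet ranges."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16) % HASH_SPACE
    return h * NUM_TABLETS // HASH_SPACE

nodes = ["node-1", "node-2", "node-3"]
# Spread tablets across nodes round-robin, just to illustrate placement.
placement = {t: nodes[t % len(nodes)] for t in range(NUM_TABLETS)}

key = "user:1234"
t = tablet_for_key(key)
print(f"key {key!r} -> tablet {t} hosted on {placement[t]}")
```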
[Audience question]

Yeah, great question. The question was: if two sets of data have a relationship, do you put them into the same tablet even if they're in different tables? No, we do not; they're in different sets of tablets. However, we allow these tablets to be co-located on the same machine. This is a feature that we are working on, it's on our roadmap; we call them co-partitioned tables, so these tables are co-located and split the same way.
Okay, so all of this runs on top of anything: no external dependencies, it just needs memory, CPU and disk, and it can go off and do its thing. And because of the API compatibility, it's very well integrated with the ecosystem. Spark works on top of the database, you can use KairosDB for time-series data, or JanusGraph as a graph database, or a Spring Boot type ecosystem to build apps on top. So it's really easy. Yes, please, question.
[Audience question]

Great question. By the way, I think you guys are asking great questions; I'm supposed to be giving you stickers and I've forgotten, but anyway, you know who you are. So yes, it follows the laws of physics, you're right: if you're trying to achieve consistency across faraway geographies, it is going to take a longer time; there's just nothing you can do to get around that. But one of the things that we have is this:
we have extended Raft to have an observer node, like an async replica. So you can have a read-only replica that's placed far away, or an async replication pipeline, and that will not incur write latency. You'll be able to easily design it so that one set of machines that are geographically relatively close offers you strong consistency,
while you place a whole bunch of replicas far away, which will still get the writes in a timeline-consistent fashion, not eventually consistent but timeline consistent, and you'd be able to serve data from there. You'll be able to write data there as well; it will redirect the write to the right place and get the data written. So I think there was another question, or maybe not.
[Audience question]

Yeah, StatefulSets, I'm getting to that. The question was how it works with StatefulSets, and we'll get to that. So, all right, I'll quickly go through where we are with the project: a 0.0.9 beta version is publicly available on GitHub. You guys can try it out on your laptop, you can try the various scenarios, and it's pretty easy to do that.
We've stress tested up to about 50 nodes. We could keep going higher; we're very comfortable as a team running hundreds of nodes in a cluster, because we did that at Facebook, and the same design goes in here. What you'll notice is that even at 50 nodes we don't give up our low read latencies: read latencies are in the hundreds of microseconds for a simple key-value type read. In this case we're doing like 2.6 million reads at 200 microseconds, and 1.2 million writes at just 3 milliseconds, in a 50-node cluster.
Well, it turns out that if you architect it right and you go all the way down to the bottom, all the way to how the data is being replicated or how it's being written, you can actually get very good performance out of the system. These are our YCSB benchmark results; we've written a blog post about how it is that we can make our system have high performance along with consistency, and you can check it out on the blog at yugabyte.com. We support distributed ACID.
A
So
on
top
is
a
classical
bank
accounts
kind
of
table.
There
is
a
row
which
is
an
account
name.
Another
row,
which
is
an
account
type
and
a
third
row,
which
is
a
balance
which
is
the
total.
The
interesting
thing
with
Cassandra
query
language
is:
it
will
tell
you
how
you
can
partition
your
data?
It
looks
very
sequel
like
right,
but
you
can
partition
your
data
across
nodes
right.
We
want
to
keep
all
the
data
for
a
single
account
together
right.
So
that's
what
this
thing
is
doing.
A
It
is
charting
the
data
by
the
account
ID
and
keeping
all
the
account
types
very
close
on
disk.
So
it's
really
really
fast
right
and
the
transaction
that's
being
done
below
is
for
a
user
John
to
transfer
200
dollars
from
his
checking
account
to
a
user
Smith
who
could
like
live
on,
and
these
guys
could
live
on
different
nodes.
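A sketch of that example as it could be written against the Cassandra-compatible API with the Python driver: the table layout mirrors the description above, and the BEGIN/END TRANSACTION block follows YCQL documentation, so treat the exact column names and syntax as illustrative rather than a copy of the slide.

```python
# Illustrative sketch of the bank-transfer example described above.
from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect()
session.execute("CREATE KEYSPACE IF NOT EXISTS bank")
session.execute("""
    CREATE TABLE IF NOT EXISTS bank.accounts (
        account_name text,
        account_type text,
        balance      bigint,
        PRIMARY KEY ((account_name), account_type)  -- partitioned by account, clustered by type
    ) WITH transactions = { 'enabled' : true }
""")

# Move $200 from John's checking account to Smith's checking account.
# The two rows may live on different nodes; the block commits atomically.
session.execute("""
    BEGIN TRANSACTION
      UPDATE bank.accounts SET balance = balance - 200
        WHERE account_name = 'John'  AND account_type = 'checking';
      UPDATE bank.accounts SET balance = balance + 200
        WHERE account_name = 'Smith' AND account_type = 'checking';
    END TRANSACTION;
""")
```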
Okay, so we're deployed at a whole bunch of customers, but this is one of the deployments that people really like, and I think this goes to the question you asked. This setup is geo-distributed: two copies of the data in US West, two copies in US East, and one copy in Tokyo, so completely planet-scale. This is an example of users logging in and changing their password, with the application running on all the data centers across the globe.
What we find is around 200-microsecond latencies for people trying to log in, and about 200-millisecond latency on average for people trying to change their password, because for consistency your data really has to travel; and in this particular example, it can survive a whole data center failure. If you think about users logging in and changing their password, a slightly higher latency on the order of 200 milliseconds is okay for changing a password, but you have to be absolutely quick to log in, and at 200 microseconds
that means you don't need a cache in front. It's multi-cloud, so it works with Amazon AWS, Google, on-premises, and we're working on Azure. It works with StatefulSets, which is what this whole talk is about, but anyway, the point is it has no dependencies and is easy to deploy anywhere. All right, so for the demo, we're going to look at YugaStore, which is a React.js and Node.js Express app, and the app is a shopping cart app.
It's going to be a bookstore type of app, and the app itself is on GitHub, so you guys can download it and try it out. You'd need to set up Yugabyte as a StatefulSet; there's a whole bunch of instructions there that you can check out. Because it's not very interesting to watch all these things get set up, I have already done these steps and kept the system ready, but effectively that's the one line you need to run in order to bring up Yugabyte in a StatefulSet. The second thing I did was containerize the application as well; it's the standard stack, a pretty popular stack, so I containerized that and ran the command up there. Again, you can find these commands on the GitHub repo. That brought the app up and running.
So let's go take a look at the app. Before that, this is my deployment on localhost, just showing the Kubernetes dashboard. What you see is a three-node cluster with three tablet servers, so it's got three nodes, or slaves, that do the actual work. Its replication factor is three, and that's why there are three masters: it's replicated three ways and can survive the failure of any one node. And finally, this guy here is the service
that's running the stateless application, and you can have a whole bunch of them; you can scale them out pretty easily, and so on. For the actual storage we use a persistent volume claim, so we're mounting the disks into each of the pods of the database and it uses that to store data. Yugabyte can also work fine with ephemeral local storage; just local disk is fine.
Okay, so I'm going to go into a deployment that I've done on Google Cloud. This is a GCP GKE install, a Google Kubernetes Engine based install. I ran the same set of commands after installing the gcloud tooling. What this shows is a four-node cluster replicated three ways, and these are the masters that form a highly available Raft group and keep track of the metadata in the system. There is one table in the system:
it's a products table in the yugastore keyspace. If you're familiar with SQL, a keyspace is like a namespace and a table is just a table. Okay, so if we go look at the tablet servers, or the slaves that are doing the work, you see that it's all a single cloud, single data center (which is like a rack or a region), single availability zone install.
So this is not a multi-cloud installation, but it can very easily become one; it's easy enough to deploy Yugabyte that way. Okay, so here's the app itself. The app is a bookstore app, and what it shows is a whole bunch of books with some data.
The books are split into categories: there are some static categories like business books, cookbooks, mystery books and so on, and there are some dynamic categories like "show me the books with the highest rating" or "show me the books that have the most reviews", typically what you would expect from a store. Now, how do we build this? Let me go back here. The multi-API strategy turns out to really simplify this, because you have Redis as a database that's still highly performant, and Cassandra as a highly consistent and performant database.
It's easy to put the less dynamic pieces of data, like the title of the book or the description and so on, into the Cassandra database, which just looks like a SQL table. On the Redis side, you can put the very dynamic pieces of data into Redis, for example the number of reviews or the total rating.
All of those go into Redis. So querying for the books in a category just looks like a SQL query, select star from something where category equals something, while you can use a Redis sorted set to keep track of your books sorted by the number of reviews or the ratings. That makes it really easy to model the app, because you have the right data structures at your fingertips, and you don't have to worry about what happens now that you've put your data in Redis.
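A rough sketch of that modeling, with invented keyspace, table and key names, using the Python cassandra-driver for the Cassandra-compatible API and redis-py for the Redis-compatible API; the WHERE clause assumes the category column is part of the key or indexed.

```python
# Illustrative only: static attributes in a Cassandra-style table, fast-moving
# rankings in a Redis sorted set, both served by the same cluster.
from cassandra.cluster import Cluster
import redis

cql = Cluster(['127.0.0.1']).connect('yugastore')   # hypothetical keyspace
r = redis.Redis(host='127.0.0.1', port=6379)        # Redis-compatible endpoint

# "Give me the books in the business category" reads like a SQL query.
books = cql.execute(
    "SELECT * FROM products WHERE category = %s", ['business'])

# Each new review bumps the book's score in a sorted set.
r.zincrby('books:by_num_reviews', 1, 'book:1234')

# "Top 10 books by number of reviews" is a single sorted-set read.
top10 = r.zrevrange('books:by_num_reviews', 0, 9, withscores=True)
print(list(books), top10)
```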
I'm actually going to show you this with respect to the queries, but the data is split into small shards. The shards are automatically moved across different nodes, so as you add nodes and remove nodes, the shards of data automatically rebalance themselves, and you don't have to worry about it.
Yes,
so
question
is:
if
you
add
nodes,
would
you
need
to
pay
the
balancing
cost
so
in
yoga
bite?
Because
it's
strongly
consistent?
We
actually
copy
compressed
file
so
as
the
equivalent
of
that,
so
the
database
automatically
takes
compressed
files
and
copies
them
over
exactly
the
subset
of
files
that
are
moving
to
the
new
node.
So,
yes,
there's
data
copy
cost,
but
the
data
is
very
efficiently
copied
and
this
does
not
affect
your
foreground
app
right.
So
you
because
you're
not
reading
through
the
database
you're
just
taking
the
file.
A
So
a
database
has
to
read
some
subset
of
data
from
the
disk
uncompress
it
and
then
do
a
whole
bunch
of
reads
and
resolves
on
them
in
order
to
serve
a
user
query
right
and
most
typical
databases
today
like
if
you
take
a
Cassandra
or
a
MongoDB
or
most
of
these
databases.
If
you
add
a
new
node,
will
read
almost
through
the
database
to
resolve
the
data
and
then
get
the
data
out
and
copied
into
the
new
node
while
with
yoga
bite.
Besides that, you don't have to worry about any of that stuff. Typically the expectation is that if you enable distributed transactions, you will be subject to clock skew. One of the things that Yugabyte can do, and we're working on it, is to integrate with an external atomic clock service, something like Amazon's Time Sync Service, at which point your distributed transactions are reasonably performant as well.
But if your query pattern becomes single-row, it automatically starts getting the benefit of high performance, and we do the work to compute at what point a single-row ACID operation conflicts with, say, a provisional write for a multi-row transaction; we do all of that tracking. All right, so what I was going to show is that you can easily run a query like that, and that's the type of query you would run to say, hey, give me the books in the business category.
A
Give
me
3
books
in
the
business
category
or
you
could
say,
give
me
3
books
in
the
give
me
all
the
books
in
the
business
category
that
are
hardcover
books
right,
because
your
app
keeps
evolving.
You
have
to
show
different
things
right
and
that's
what
we
just
did
with
a
Kassandra
query
language,
but
connecting
to
the
same
cluster
right,
no
difference
right,
I'm
now
connecting
to
the
same
cluster
and
as
a
database
I
can
interact
with
it
using
the
Redis
CLI,
which
tells
me
give
me
the
top
10
books
sorted
by
the
number
of
stars.
So the question was: is this table the same products table that was there on the Cassandra side? And the answer is no, it's not. These are two different tables internally, but they're both persisted with strong consistency, so you can write to Redis and to Cassandra. What you would have traditionally done is write to Cassandra, or SQL, for one type of data; the other type
A
You
would
keep
it
in
memory
in
Redis
and
you
would
have
figured
out
another
place
to
persist
their
underlying
data
for
Redis.
So
if
reddy's
dies,
you
can
go,
read
it
from
there
repopulated
right.
What
here,
what
it
happened.
Redis
is
a
different
DB
Cassandra,
our
sequel
is
a
different
DB
and
you'll
have
a
third
DB.
When
you
do
a
backup,
you'll
have
to
figure
out
all
of
this
stuff.
In
this
case,
you
just
write
into
one
database.
A
[Audience question]

They use a different format, but the files are in a similar format that the core database understands. The core database is the same; it has different API layers on top. Think of it as Yugabyte tables at the core: there's one Yugabyte table for the Cassandra products and a different Yugabyte table for Redis, and they both write data into the underlying DocDB in a common format, but each will lay data out the way it wants, in the way that is optimal for Redis or Cassandra.
[Audience question]

The question is: in case a node fails, what is the reconstruction time? Let's split that into two parts. A node fails: how will my queries be affected? If a node fails, only the tablets that have leaders on that node will get affected, but in a matter of seconds, which is a configurable value, currently a few seconds, two to three seconds,
you will have new guys serving the values. Let's take a three-node example: each piece of data is replicated three ways, and each piece of data is in a Raft group. So the minute a node dies, within a few seconds, two to three seconds, the next guy will take over as the leader and start serving, and the client is smart enough to know this guy died.
[Audience question]

Yeah, we'll get to that, that's a great question. If you don't have a fourth node, you can't do much. But let's say you had four nodes with a replication factor of three. If a node fails, there's a configurable time where we wait for the node to come back and don't do anything, because you don't want to simply do unnecessary work. That configurable timeout is five minutes at this point; after five minutes we will automatically re-replicate its data to an existing node, if there is one.
The re-replication will go on in the background; your foreground queries are not affected, and your app is still running. He's asked me to speed up, but I will take your question, just give me one second. The other thing I wanted to show you guys is the load: users, or the Bangladeshi click farm for those of you who watch Silicon Valley. So there you go, farm activated. Now we can go and see what's going on in the database. The minute I did a refresh,
what you see is that the reads and writes start getting scattered across the different nodes, and depending on the app and the type of keys, you will see that the load distribution will vary. One of the things I would also like to show is that we can actually scale the number of replicas of the StatefulSet on the fly, so we'll be able to just add a new guy.
What you notice is that a new node just got added to the system. At Yugabyte we're all about running in production, so we won't rush into copying all the data; you can throttle how slowly you want the new node to start taking data. It's going to happen in the background in a way that your app should not get affected; that's the most important thing. So what you'll notice is that,
A
If
you
look
at
the
number
of
shards
right,
you
have
a
certain
number
of
shards
and
you're,
seeing
that
the
load
is
slowly
getting
moved
to
the
new
node
while
the
load
is
like
the
data
is
being
served
by
all
the
other
guys.
So
if
I
refresh
you'll
see
that
more
shards
slowly
start
moving
and
the
data
gets
copied,
the
node
now
comes
into
the
system.
It
starts
serving
data,
and
all
of
this
is
online
right.
I mean, we can wait for it to finish, but I think you'll probably kill me, so let's move on. Our database is Apache 2.0, it's in the open source, and we'd love for you guys to get involved. It's C++, but hey, it's still a good language, and it's highly performant. Anyway, download it, try it; you can play with a lot of failure scenarios on your laptop.
[Audience question]

It's a great question. We offer tunable consistency, so if you want to treat it as a database, the read has to go to the leader node, because that's just the way it works. But if you want to read from a follower, we allow that option, and you can also optionally read from the follower in the data center nearest to the app.
[Audience question]

Okay, so it depends on a lot, but I'll answer in a slightly different way. We do prefix compression for the data in memory, and then on top of the prefix compression we do Snappy compression and store it on disk. So it's pretty well compressed, but that's a hard question to answer exactly. Yes, please: it's a modified RocksDB, in the sense that we have taken out RocksDB's write-ahead log, and we took out RocksDB's MVCC.
Yeah,
so
his
question
is:
is
it
a
Roxie
B
at
the
bottom
and
and
the
second
follow-up
question?
Was
it's
still
a
level
DB
fork,
so
it
has
a
very
large
write
amplification.
We
have
changed
the
way
the
compactions
work
based
on
our
HBase
learnings,
so
we
actually
do
a
size
tiered
compaction.
So
we
try
to
strike
a
strike,
a
trade-off
between
size,
amplification
and
number
of
rewrites
right
so
and
today's
systems,
IO,
is
the
bigger
bottleneck,
then
draw
disk
size
right,
but
but
there's
I
mean
the
answer.
A
is also not that straightforward, because it's not a single tablet on a machine. Suppose you have 500 gigs of data on a machine: if it's a very simple single-shard system, you would need another 500 gigs of space to do compaction, so you can only use 50%. But we do sharding inside, so it's really 500 gigs divided by the number of shards, and there's a whole bunch of computation which decreases the space needed.
[Audience question]

So, it's on our roadmap, but here's our main thing with joins: as you add nodes, your join performance could get worse for the wrong kind of join. This is the one sticking point. A lot of companies, Facebook for example, have an entire SQL tier, but they don't do joins, they don't do cross-node joins; they try to place data in such a way that the data you want to join comes together, which was the co-partitioned table kind of example.
[Audience question]

Yeah, so the question was: what kind of assumptions does Yugabyte make about storage, and how do we qualify it for performance, for transactions? We assume we run on top of a file system; we typically recommend XFS because we have a lot of experience with it, but any file system will do. We assume that the file system has fsync support, so we are able to store data durably, and so on. At the core,
A
We
assume
that
you
have
a
SSD
type
device,
because
SSD
is
the
norm
now
we
are
not
like
you
know,
still
designing
for
the
hard
disk
age.
That
said,
however,
we
have
enough
architectural
depth
in
the
inside
gigabyte
to
enable
tearing
of
data
to
hard
disk
like
tears
automatically
like
in
the
background,
so
you
will
still
be
able
to
move
older,
colder
data
to
cheaper
tiers,
but
keep
your
heart
data
in
the
in
the
serving
tier
right.
A
So
and
now,
as
far
as
transactions
go,
it's
a
split
like
transactions
depend
on
like
distributed,
transactions
depend
on
clock,
skew
single
row.
Transactions
depend
on
your
network
as
well,
so
it's
not
just
storage.
So
from
a
storage
perspective
we
just
require
SSDs
the
faster.
It
is
the
faster
you
get,
but
at
a
certain
point
it
will
be
diminishing
returns
because
these
other
guys
will
take
over
sure.
Oh.
Yeah, one point that Kannan was mentioning is that we don't depend on software RAID, so we can just use JBOD, just a bunch of disks. Any more questions, guys? All right, I've got stickers; I forgot to hand them out to all the people who asked good questions, but please do come get them, and apparently there are more in the back as well. So please do stop by and say hi. Thank you. Thanks.