From YouTube: GMT20230530 170422 Recording 1652x992
A: Sorry, Zoom just threw up a notification.
Alrighty, so the CEP is titled "Reading and Writing Cassandra Data with Spark Bulk Analytics." The historical context here is that Cassandra, as a database, does great when it comes to point reads and point writes. But it struggles when it comes to scooping data out of Cassandra into a system like Spark in order to do any sort of analytical workload.
You can't issue a query that scans the entire Cassandra cluster to read all the data out of Cassandra. And similarly for writes: Cassandra is great at point writes, it's horizontally scalable, but sometimes we just want to bulk load a lot of data into Cassandra, and any sort of heavy read or heavy write activity does impact the database's read and write latencies.
So with this CEP we are trying to address some of these issues, which generally impact the database when we are doing a lot of point read and point write queries. That's the basic motivation: to be able to read and write a lot of data in Cassandra. Now, when it comes to the actual CEP, there are two major contributions here.
One of the contributions is the Cassandra Spark Analytics library. This library allows you to run the major functionality of the data import and export on Spark, and the way we import or export data is through the APIs that are implemented in the Cassandra Sidecar.
For those who don't know a whole lot about the Sidecar, please look at CEP-1, which was the very first CEP that we proposed: the Cassandra sidecar management process, which in today's world we call the Cassandra Sidecar.
So with that historical context, I'd like to dive deeper into how this functionality is implemented for Cassandra users. For that, let's look at the actual API.
So let's consider a use case where you want to ingest a lot of data into Cassandra; we call that a bulk write. In order to do that, you have some data frame in Spark.
The underlying assumption is that we are using Spark as an engine that has some amount of data loaded in a data frame, and you want to write that data frame into Cassandra. A data frame maps nicely onto a table within Cassandra: it is a row-oriented structure with rows and columns, very much like what Cassandra has.
In order to do that, here is all the code that you need to write.
All we are doing is taking the input data frame, calling write on it, and passing in a bunch of options, which allow the bulk write functionality to be invoked within the Cassandra Sidecar; the Sidecar then goes ahead and writes the data into Cassandra and makes it available. Similarly, on the flip side, when we want to do a bulk read, we do the exact opposite: we create a context using Spark and then load the table that we want from Cassandra into a data frame.
One thing to remember here is that when we do any sort of bulk read, we are going to create a snapshot on Cassandra. A snapshot is a common operation that we do in order to get a view of the data within Cassandra, and this is what the bulk reader does as well: it creates a snapshot on each of the nodes, each of the instances, that the Cassandra cluster has, and then we do the bulk read.
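The snapshot step can be pictured with a small sketch. Cassandra snapshots are cheap because they hard-link the immutable SSTable files into a snapshot directory instead of copying them; the code below is a minimal stand-in for that idea, with file names and directory layout that are illustrative rather than Cassandra's actual on-disk scheme:

```python
import os

def take_snapshot(data_dir: str, name: str) -> str:
    """Freeze the current set of SSTable files by hard-linking each one
    into snapshots/<name>/. No bytes are copied, and because SSTables are
    immutable, the linked files stay valid even if the originals are
    later compacted away (deleted)."""
    snap_dir = os.path.join(data_dir, "snapshots", name)
    os.makedirs(snap_dir)
    for fname in os.listdir(data_dir):
        src = os.path.join(data_dir, fname)
        if os.path.isfile(src):  # skip the snapshots/ subdirectory itself
            os.link(src, os.path.join(snap_dir, fname))
    return snap_dir
```

A bulk read then streams files out of the snapshot directory, leaving the live data files untouched.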
We read the data through the Sidecar into a data frame that is made available in Spark. And if, let's say, you wanted to do some sort of aggregation, here is a simple example: say we want to count the entire data set on this column c. We can create that aggregation through Spark, so anything that you can do with Spark, you can do on this particular data frame, using all the data that exists within Cassandra.
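As a toy stand-in for that aggregation (plain Python rows instead of a real Spark data frame, with made-up column names), the count on column c looks like this:

```python
# Hypothetical rows, as if loaded from a Cassandra table via the bulk reader.
rows = [
    {"a": 1, "b": "x", "c": 10},
    {"a": 2, "b": "y", "c": None},  # null c: excluded from count("c")
    {"a": 3, "b": "z", "c": 7},
]

# The equivalent of Spark's df.agg(count("c")): count non-null values.
count_c = sum(1 for row in rows if row["c"] is not None)
```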
Now, a lot of you must be wondering: this is functionality that can be implemented using the Cassandra driver, so what is the need for this different mechanism for doing pretty much the same thing? Skipping over that for a moment, I want to show you the architecture and the data flow.
So let's look at the bulk read functionality first. I hope this is big enough.
Let me move this off to the side. Okay, so what happens here, the big difference between what you can achieve with, excuse me, the Cassandra driver versus the bulk reader functionality of this library and the Sidecar, is what happens when your job starts up in Spark.
The job gets distributed to all of these tasks, and for the bulk reading functionality, the driver is going to invoke the snapshot functionality within the Sidecar. Each of these Sidecars then goes and creates a snapshot on the individual nodes of the Cassandra cluster. Once the snapshot is created, the individual tasks will scoop up all the SSTables that come in from the Sidecar into Spark.
So that is a big difference: what we are doing here is avoiding Cassandra's CQL protocol, and I'll dive a little deeper into why we want to avoid the CQL protocol. The expectation here is that you want to work on the entire data set that exists in the Cassandra cluster. So what's happening is, let's say you have a Cassandra cluster with 10 terabytes or 100 terabytes of data.
Your Sidecars are able to stream all of the SSTables at a binary level, at a block level, without really interpreting any of the data in memory. So we don't serialize or deserialize any data that exists in Cassandra; we just directly ship the SSTables to the individual Spark tasks, and this means that we are not going to create any garbage.
We are not going to incur any penalty in terms of CPU or memory pressure, because all we are doing is zero-copy streaming the data out of the Cassandra Sidecar and into the tasks that exist on Spark. So we can go as fast as the network will allow us to go.
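The zero-copy idea, in miniature: the Sidecar treats an SSTable as an opaque byte stream and forwards fixed-size blocks, never parsing rows. Here is a hedged sketch of that shape (the real transfer goes over the Sidecar's HTTP API and may use kernel-level mechanisms; this just shows the principle):

```python
import shutil

BLOCK = 64 * 1024  # stream in 64 KiB blocks

def stream_file(src_path: str, dst_path: str) -> None:
    """Copy a file block-by-block as raw bytes. Nothing is decoded or
    deserialized, so there is no per-row CPU cost and no garbage created:
    the transfer is bounded only by disk and network speed."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, BLOCK)
```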
Once the data gets into the individual tasks, the library goes and maintains the quorum. As everybody here probably knows, Cassandra is a quorum-based service.
It's a database that replicates the data set across multiple nodes in the cluster in order to maintain availability and durability of the data, so in case one of the nodes dies, there are two other nodes in an RF=3 configuration. And this is the dilemma we have: let's say we scoop up the data from one Cassandra node.
Now the task has to read the SSTables from the other two replicas in order to make sure that all three replicas are on the same page. We have implemented that in the library, and the way we implement it is, again, using Cassandra's own code: the cassandra-all jar is packaged as part of the library.
It reads data from those individual SSTables and ensures that RF=3, or whatever your replication factor is, is satisfied in Spark. From that point onwards, we actually deserialize individual rows and columns in the Spark task and provide a consistent view of the data that was snapshotted on Cassandra. This allows us to achieve throughputs that are typically not possible with Cassandra, because we skip all the serialization and deserialization logic on the individual Cassandra nodes.
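The replica reconciliation step can be sketched as a last-write-wins merge. This is a simplification: real Cassandra reconciliation works per cell and has to handle tombstones, while this toy keys whole rows by partition key with a single timestamped value.

```python
def reconcile(replicas):
    """Merge the row copies read from each replica's SSTables into one
    consistent view: for every key, keep the value with the newest write
    timestamp. Each replica is a dict of key -> (value, timestamp)."""
    merged = {}
    for replica in replicas:
        for key, (value, ts) in replica.items():
            if key not in merged or ts > merged[key][1]:
                merged[key] = (value, ts)
    return {key: value for key, (value, _) in merged.items()}
```

Note that a stale replica is outvoted by timestamp, not by majority: one copy of `("new", 5)` beats two copies of `("old", 1)`.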
The actual daemon process serving the read and write path of the database is not impacted as a consequence of this. The only thing is, yes, we are consuming a lot of network bandwidth, but network is typically cheap, and so we don't see any meaningful changes in the latencies or the throughput of the database. That is what the bulk reader achieves.
Once you have the data here, you can run whatever analytical workloads you have: you can train your machine learning models, you can create analytical views of the data set, and once you're done you can just discard this data.
In the traditional ETL architecture, people do something very similar, except that they read directly from Cassandra, scanning one token range at a time, and dump all of that data into something like HDFS, or Hadoop, or S3 these days. They then scan that data, or, if they have converted it into Parquet or some other format, run analytical workloads on that.
While that is a great architecture, it also incurs the additional cost of putting the data on an external system like S3 or HDFS, which itself replicates the data many ways, and that increases your time and cost. With this approach you're directly reading from the Cassandra cluster without impacting the cluster's performance. Spark and Cassandra have been paired together in the past, and this continues that, so your existing code will pretty much work as-is on Spark.
The only difference is that you're now adopting a different library in order to read and write the data. So that's the Spark bulk read functionality. Similar to the read functionality, there's write functionality as well. Again, all of this has been implemented in terms of the nodetool import capability that already exists in Cassandra, and what the writer does is basically the exact inverse of what the reader does.
In this case, let's say you have some sort of job that is creating a data set: you have, I don't know, CSV files, or XML files, or JSON files, and you are creating a view of that data set, and you would like to write a large quantity of the data into Cassandra all at once.
You could do one record at a time, but then you end up in the exact same scenario where, for the batch load of that data, you are going to dominate the write path of Cassandra, and while Cassandra's write path scales well, we will see added pressure on all of these individual instances. So instead of generating the SSTables on the individual Cassandra nodes, the bulk writer goes and uses the same library that we have.
It uses that library to take the data set, sort it into the rows that go to the individual Cassandra nodes based on their tokens, and then create the SSTables on the Spark worker nodes themselves. Once the SSTables are created, it uses the Sidecar APIs to push the data onto the disks of the individual Cassandra nodes, and once all of the data is in, all we do is import the data into Cassandra. The LSM-tree architecture of Cassandra makes it possible for us to just create new SSTables, put them inside Cassandra, and call nodetool import, and Cassandra makes them available as part of the live serving view of the data.
So it makes it very easy for us to work with new SSTables, and that's the capability we are using in this model as well.
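The sort-by-token step can be sketched like this. It is hypothetical code (Cassandra's real partitioner is Murmur3 and the token ranges come from cluster metadata, neither of which is reproduced here), but it shows how rows get grouped by the node that owns their token before any SSTables are written:

```python
import bisect
import hashlib

def token_of(partition_key: str) -> int:
    # Stand-in partitioner: any stable hash onto a fixed ring illustrates
    # the routing (Cassandra actually uses Murmur3).
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def route_rows(rows, ring):
    """Group rows by owning node. `ring` is a sorted list of
    (range_end_token, node) pairs; a row belongs to the first range whose
    end token is >= its token, wrapping around to the first entry."""
    ends = [end for end, _ in ring]
    buckets = {node: [] for _, node in ring}
    for row in rows:
        idx = bisect.bisect_left(ends, token_of(row["key"]))
        buckets[ring[idx % len(ring)][1]].append(row)
    return buckets
```

Each bucket is then written out as SSTables on the Spark workers and shipped to its target node through the Sidecar.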
With the bulk writer and bulk reader, there have been some benchmarks that we've done, and essentially what they reveal is that since we are generating the SSTables on Spark, and doing the interpretation of the data on the Spark side, we are not impacting any of the nodes in the Cassandra cluster itself. So you don't need to create a separate Cassandra cluster. Historically, people have created two Cassandra clusters: one for reading and writing data, and the other purely for running analytics.
Now, with this model, you can get away from creating that extra cluster: just run your analytical workloads on the same set of nodes you're using for reads and writes. There is enough redundancy built into the bulk read and bulk write capabilities to ensure that we don't overwhelm any of the nodes. So, for example, we have throttling.
If you are not sure whether you want to saturate the network bandwidth, you can set some throttling on the Sidecar; the Sidecar has those throttling capabilities, and you can limit the throughput if that is an area of concern. But overall there is sufficient retry logic and there are sufficient guards in place that we will not overwhelm the Cassandra cluster.
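A throughput throttle of the kind described can be sketched as a token bucket. This is a generic sketch of the technique, not the Sidecar's actual code or configuration:

```python
import time

class ByteThrottle:
    """Token bucket: credit accrues at `rate` bytes per second up to
    `burst`; each admitted transfer spends its size in credit. When
    admit() returns False the caller backs off and retries, which caps
    sustained throughput at roughly `rate` bytes per second."""
    def __init__(self, rate: float, burst: float):
        self.rate = rate
        self.burst = burst
        self.credit = burst
        self.last = time.monotonic()

    def admit(self, nbytes: int) -> bool:
        now = time.monotonic()
        self.credit = min(self.burst, self.credit + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.credit:
            self.credit -= nbytes
            return True
        return False
```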
One of the things that the bulk writer does is, once it's done with the import, or if the import fails partway, there may be data lying around on these disks that hasn't been imported or has been only partly imported, and there is logic that will clean up the data that exists but has not been imported. So all of these capabilities already exist.
Let's see... yeah, so all of these Sidecar endpoints are basically services which give us metadata that is required for the bulk read or bulk write functionality to work, but all of them are composable REST endpoints.
So if somebody would like to use these endpoints in order to replicate some of this functionality in other systems, or to write a standalone tool (Cassandra has the sstableloader tool that lives inside the Cassandra repository), these APIs can augment or even replace that functionality, and you could reuse them in your own tooling or in any of your scripts if you would like. So the Sidecar opens up the ability for a lot of innovation on the side, and it doesn't really impact the database in any significant way when you invoke these APIs.
Apart from that: for more information, please read the CEP. There have been a lot of questions as part of the DISCUSS thread; this thread, which covers CEP-28, was started by Doug, and there has been some interesting back and forth.
I can cover some of the questions if we have time, but in order to address some of the concerns and alternatives, we have already covered them in the CEP document.
This is open to anybody who is interested in contributing to the Cassandra Sidecar project, outside of the analytics work that has happened as part of the CEP, but right now I think the main contribution we are looking for is people to test this out and give us feedback. There might be some rough edges, and we would love to have folks pitch in and try it out. There are other examples in this repository as well.
A README exists which will walk you through setting it up and running your first analytics job. It does require a little bit of coordination if you'd like to deploy this and try it out. Currently the Sidecar doesn't have an authenticator, which is an area for contribution as well; once we have some sort of authentication, it might be easier to deploy this as a cluster and run it.
But if you would like to run it on your local machine, you can certainly do that and try it out as well. What I would highly recommend is giving us feedback from the start. If you find this useful, do give us some feedback; and if you find some rough edges, or things don't work, please go ahead and file JIRAs in the Cassandra project, and we can take a look.
For those who might be interested in getting an idea of what CEP-1 was all about, it exists here, and it is, I would say, still a work in progress, but CEP-28 adds a significant amount of functionality to the Sidecar. So I think we have a few minutes; I will stop sharing here, and if anybody has questions, I'm happy to take those.
B: I'll start. Is there anything that needs to be done, or could be done, on the Spark side to make this work better? Mainly because I think there are the two deployment methods in use now, one with the Spark executor and one without. So, are there some things where maybe we can work together with the Spark project to make this easier or better?
A: That's a great question. I need to think about it a little bit, but this is basically just a library that you would bundle into your Spark application.
When you build your Spark jar, you're just going to pull this in as a library, like you do with the Cassandra driver or any other dependency, really, and there isn't anything special we need to do for this to work with Spark. So it's fairly decoupled from Spark; there are bits that make it easy, like data frames, which don't really exist in other systems, but do exist in Spark.
One of the questions that came up during the DISCUSS thread was: why not include this functionality in the Cassandra daemon itself? Why put it in the Sidecar?
There are some interesting answers to that, but the basic core answer is: we want to make sure that we isolate resources between the main Cassandra daemon and the Sidecar.
This particular functionality has the potential to generate garbage; not that it does, but it can in some situations, and in other scenarios it can dominate the network. So what's best is to isolate it in a separate process: you can use cgroups, or other mechanisms that exist in the industry, in order to limit the resources that are used by this particular process.
If you're running this in Kubernetes, you can run it as a separate container and limit the amount of CPU, memory, and other resources that particular container gets, which wouldn't be possible to do within a single JVM; we would have to write a lot of code to create that isolation, and it wouldn't be as strong as what cgroups allows you to do.
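As a concrete illustration of that isolation, a Kubernetes pod can run the Sidecar as a second container with its own resource caps. This fragment is hypothetical: the container names, image names, and limits below are made up for the sketch, not taken from the CEP.

```yaml
# Pod spec fragment: two containers, with the sidecar capped so bulk
# analytics traffic cannot starve the Cassandra daemon of CPU or memory.
containers:
  - name: cassandra
    image: cassandra:4.1
  - name: cassandra-sidecar
    image: example.org/cassandra-sidecar:latest   # placeholder image
    resources:
      requests:
        cpu: "1"
        memory: "2Gi"
      limits:
        cpu: "2"
        memory: "4Gi"
```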
So those are the big reasons to keep this separate. It does add a little bit of operational overhead to run a companion process, but this functionality is purely optional for those who actually need it.
C: Hopefully people are here to learn and see what it's all about. Thanks, Patrick, I saw your note. Well, we appreciate you joining. Thank you, Dinesh. Hopefully we see some people contributing to this. I put a note in the chat, but just so folks know, this is great for first-time contributors; as Dinesh has said, it doesn't require deep expertise in Cassandra or Spark.
So this is fantastic if you're wanting to get involved in Cassandra and contribute. And Dinesh is on the ASF Slack; it's a great place to connect with folks in the community if you're not already there.
A: I think trying it out, and reading the CEPs as you try it out, is something that I would encourage everybody to do.
C: Okay, excellent! Well, we appreciate it. We have the next contributor meeting on the last Tuesday of the month, so the next one is June 27th. If folks can join, we'll be going through all the different CEPs that are anticipated features for 5.0, so it's a great place to learn about them and figure out how to contribute and test.
Thank you again, Dinesh, for joining, and thanks, everybody, for joining and learning. We hope to see you next time. Thank you. Thank you. All right, take care, everybody. Bye.