Description
Speaker: Boris Wolf, Lead Engineer CMB Project at the Comcast Silicon Valley Innovation Center
Slides: http://www.slideshare.net/planetcassandra/c-summit-2013-cmb-an-open-message-bus-for-the-cloud-by-boris-wolf
The Comcast Silicon Valley Innovation Center has developed a general purpose message bus for the cloud. The service is API compatible with Amazon's SQS/SNS and is built on Cassandra and Redis with the goal of linear horizontal scalability. This presentation offers an in-depth look at the architecture of the system and how they employ Cassandra as a central component to meet key requirements. Latest feature enhancements and performance data will also be covered.
It's a general-purpose message bus infrastructure built on top of Cassandra and Redis. Now, if you look more closely, you will find that CMB really consists of two distinct web services. One is a queuing service called CQS that allows you to create queues and enqueue and dequeue arbitrary messages, and then there is a second, complementary service, CNS, which is a topic-based publish-subscribe service. The noteworthy thing about these two services is that they are fully API compatible with their Amazon counterparts, SQS and SNS.
So if you're familiar with those Amazon web services: there is a simple queuing service called SQS, and there is another pub/sub service called SNS, the Simple Notification Service. Our implementation is fully API compatible with those Amazon services, and that compatibility goes so far that you can actually use the Amazon SDK, the Java SDK for instance, to interact with our implementation of those services.
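A minimal sketch of what that compatibility means in practice: plain AWS SDK for Java (v1) client code with only the endpoint swapped out. The endpoint URL and credentials below are placeholders, not values from the talk.

```java
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.sqs.AmazonSQSClient;
import com.amazonaws.services.sqs.model.CreateQueueRequest;
import com.amazonaws.services.sqs.model.SendMessageRequest;

public class CqsHello {
    public static void main(String[] args) {
        // A standard AWS SDK SQS client, pointed at a CQS deployment
        // instead of Amazon. The endpoint URL here is a placeholder.
        AmazonSQSClient sqs = new AmazonSQSClient(
                new BasicAWSCredentials("accessKey", "secretKey"));
        sqs.setEndpoint("http://cqs.example.com:6059");

        // From here on, this is ordinary SQS client code.
        String queueUrl = sqs.createQueue(
                new CreateQueueRequest("testQueue")).getQueueUrl();
        sqs.sendMessage(new SendMessageRequest(queueUrl, "hello CMB"));
    }
}
```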
But first of all, the question: why would we even copy this API? Why would we even go through the effort of building our own version of it? To explain that, I first want to highlight some of our key requirements. One of the requirements was that we wanted a message bus that replicates messages across data center boundaries. So imagine you have clients in different data centers.
They are interacting with the same queue, writing messages to it and reading messages from it. In other words, we want messages to be able to flow freely among data centers; that is what we would call operating a queue in active-active mode. Then, in addition to that, scalability: obviously we want the system to scale. Ideally, we are looking for linear or near-linear horizontal scalability. And what do we want to scale? Well, first of all, we want to scale the message throughput.
That is, the number of messages we can pump through an individual queue. You could imagine, for instance, a very busy back-end service that runs across a single queue or just a handful of queues, so we would need to scale to thousands of messages per second of throughput and beyond. But there is also a second aspect of scale. You could imagine a completely different use case: at Comcast we care a lot about the set-top boxes that we ship to customers' living rooms.
Essentially, you could imagine a back-end service designed around a one-to-one relationship: one message queue assigned to one set-top box. In such a scenario, those queues probably would not see a whole lot of traffic, because each is limited to a single user or a single household, but we would need to scale to millions of queues. So we want to scale in both of these dimensions. And then, finally, there are the latency requirements, which we set somewhat arbitrarily at the beginning of this project: 95th percentile response times of 10 milliseconds.
So what is so beneficial about this API? This slide gives you CQS; I'm starting with SQS, and later I will go into SNS. This is the SQS queuing service in 60 seconds, if you will. First of all, the service focuses on guaranteed delivery; everything is centered around it.
Message loss is not an option, which, by the way, is also one of our key requirements for the service. Whereas orderly delivery, meaning that messages come out of the queue in the same order you put them in, and also uniqueness of messages, meaning that you don't see any duplicates as you pop messages off your queues, those things are secondary: the service makes a best effort there, but there are no guarantees. Then there are a few simple core APIs.
There is send message to put a message on a queue and receive message to take one off, and then there's a third one, delete message. What do you need delete message for? Imagine a scenario where you have a single message queue and two clients competing for messages from that queue. Those two clients, identical redundant clients, might call receive message every hundred milliseconds or so to check for messages. As a message arrives, one of those two clients will get it, first come, first served. So what happens then? In the happy day scenario, the client processes the message and then calls delete message to permanently remove it. But suppose the client crashes right after receiving the message.
That also means the client will never come back to call delete message, and for that case there is a defined timeout period that we call the visibility timeout, by default 30 seconds. After 30 seconds, that message, which otherwise would have gotten lost, will magically reappear on the queue for the second client to pick up and process adequately. So, as you can see, the service basically does not trust the message recipients.
Not only does the service guarantee that it doesn't itself lose any messages, it even tries to guarantee that the recipients, the clients of the service, do not lose any messages. This understanding of guaranteed delivery is built into the core API; we thought that was very elegant, and we adopted it.
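Here is a minimal sketch of the consumer contract that falls out of this: receive, process, then delete, relying on the visibility timeout for crash recovery. This is illustrative code written against the AWS SDK v1 interface, which CQS is compatible with.

```java
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

public class QueueConsumer {
    // Poll the queue and delete each message only after it has been
    // processed successfully. If process() throws, or this JVM dies,
    // the message is never deleted, so it reappears after the
    // visibility timeout for a redundant consumer to pick up.
    static void consume(AmazonSQS sqs, String queueUrl) {
        while (true) {
            for (Message m : sqs.receiveMessage(
                    new ReceiveMessageRequest(queueUrl)).getMessages()) {
                process(m.getBody());
                sqs.deleteMessage(queueUrl, m.getReceiptHandle());
            }
        }
    }

    static void process(String body) {
        // Application logic goes here.
        System.out.println(body);
    }
}
```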
And here I just want to spend 60 seconds talking about the advantages of adopting an existing API versus building or designing one of your own.
Typically, when you set out to do something like that, you have a specific first use case in mind, right? Otherwise you wouldn't do it. And what most often happens is that somehow, as much as you try to come up with a generic, multi-purpose API design, the requirements of that first use case find their way into your API design; it will be somewhat biased towards that first use case. And even if you manage to steer clear of that issue, it's almost guaranteed you won't get it right the first time.
So overall, it's really all the advantages of implementing an existing standard over trying to set one on your own. In hindsight, following this pattern has been very, very successful for us, and we felt it was a very good approach. But back to the original question: why didn't we just go with SQS? Obviously, we like the aspect of guaranteed delivery, we like the simplicity and robustness of the API, and obviously Amazon scales pretty well. You could have some business concerns, right?
The costs associated with running these Amazon services, or losing control over your data. But those were not really our concerns; our concerns were really technical ones. One of the major concerns was latency, not so much throughput but latency. By definition, if you interact with an Amazon service, it's a cross data center transaction. So what about what I said earlier, that we want a 95th percentile 10 millisecond response time?
You will probably never get a 10 millisecond response time if you interact with an Amazon web service. The only alternative would be to host your entire ecosystem, everything you own, in EC2 in the same networking zone as SQS, and at least right now that's not an option for us. So latency was a big issue. Then there are also a number of limitations put on users of these Amazon services.
There is a maximum message size, for instance, and there are a number of timeout periods that have certain maximums that you cannot get around. Of course, we reimplemented all those same limitations, except that to us those limitations simply become properties in a configuration file. So if you deploy the service, you can adjust those numbers and, in a way, deploy your own flavor of this service, tailored to your own needs.
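For illustration, such a configuration might look like the following. These property names and values are hypothetical stand-ins, not the actual keys from the CMB configuration file.

```properties
# Hypothetical configuration knobs; see the CMB repository for the real keys.
cmb.cqs.maxMessageSizeBytes=65536
cmb.cqs.maxVisibilityTimeoutSeconds=43200
cmb.cqs.maxMessageRetentionPeriodSeconds=1209600
```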
So, to summarize: we set out to build a horizontally scalable queuing service on top of Cassandra and Redis that is API compatible with Amazon's SQS API. Going into the architecture and the details of the design: we chose Cassandra, of course. The main reason was its cross data center persistence and replication ability, and we know it scales horizontally very well. In fact, we had an original implementation that was purely based on Cassandra, and then we realized we couldn't meet the latency requirements we had set for ourselves, so we added Redis into the picture, at first as a cache to improve latency. But then we realized that it's also very useful in helping out with the best-effort ordering of messages, and we are using Redis to manage that visibility timeout mechanism I was describing earlier, that whole business of hiding messages and making them reappear if you don't delete them. All of that we handle in Redis, not in Cassandra.
So first I want to talk a little bit about data modeling. How did we model the column families in Cassandra that sit behind the CQS queuing service? Whenever you do a project in Cassandra, the recommendation usually is: figure out what your key data structures are and what the most frequently asked questions are. As we did that, we went through a number of iterations, and I just want to highlight three of those. The first approach was somewhat naive. We said: okay, how about this:
We represent the messages for a single queue in a single column family, one column family per queue. In this scenario we have a column family with a single column, which has the static column name "message body". The row keys would be the timestamps when the messages were received, and the column values would be the actual payloads, the actual messages. At that point in time we didn't really know what we were doing; we came from a relational database background and didn't know much about Cassandra. The problem here is this:
What is the query that you typically ask of a queue? You probably want to know what's at the head of the queue or the tail of the queue, right? In Cassandra terms, with this design, that means you would want to do a range slice query, and you can only do that if you use what's called the order preserving partitioner. Again, at the time we didn't even know what that really meant, but we consulted people who did, and everybody told us: basically, stay away from the order preserving partitioner at all costs, if you can at all. So we said okay, that doesn't work, and we quickly rejected this approach; in fact, we never implemented it. That was our first iteration. So we said: if that doesn't work, how about we do it the other way around?
Let us store the messages sitting in a queue in a row instead of a column. The first advantage you see here is that now we can store messages for multiple queues in a single column family. The row key now is an identifier for the queue, a queue ID if you will; the column names are the timestamps when the messages were received; and the column values are the actual message payloads.
So now we're taking advantage of a nice standard Cassandra feature, which is that it automatically sorts columns by their column name for us. If we configure this right, messages will be sorted in chronological order, in this visual from left to right, because the column names are timestamps.
This is good, because now we can actually do column slice queries, and we can ask for the head of the queue or the tail of the queue or anything in between, really. But there's still a problem with this approach: as you can see, everything for a queue goes into a single row, and everything in a single row will end up on a single node in your Cassandra ring. So for a very busy queue, you could bottleneck on that node.
We have an equal number of write, read, and delete operations, so we are producing a ton of tombstones as we do this. Again, by putting all of this in a single row, we have the potential for bottlenecking on a single node in the Cassandra ring and facing some issues. In fact, I believe in his keynote yesterday Jonathan Ellis said something like (I'm paraphrasing): don't do this; basically, don't try to build a queuing system on top of Cassandra.
Well, we did anyway, so now I'll show you how we're doing it. We didn't go with that design; we went with a modified version of the design I just showed you. This is essentially the same, except that now we are sharding the messages for a single queue across a number of rows; in this configuration, across up to 100 rows.
So you see, now we have a composite row key consisting of the queue ID concatenated with an index number that ranges from 0 through 99, and we write messages to random shards, to random rows out of these 100. Within each row, messages will still be sorted chronologically from left to right, but they are written to random shards, and when we read, we read from a random shard. That works pretty well, because now we are actually distributing the messages for a single queue across up to 100 nodes in the Cassandra ring.
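A minimal sketch of that sharding scheme; the names and the key format here are illustrative, not the actual CMB code.

```java
import java.util.concurrent.ThreadLocalRandom;

public class ShardedRowKeys {
    static final int NUM_SHARDS = 100;

    // Compose a row key of the form "<queueId>_<shardIndex>" with the
    // shard index in 0..99, so a single queue's messages are spread
    // across up to 100 Cassandra rows (and therefore up to 100 nodes).
    static String randomShardRowKey(String queueId) {
        int shard = ThreadLocalRandom.current().nextInt(NUM_SHARDS);
        return queueId + "_" + shard;
    }
    // On send: write the payload under column name = receive timestamp
    // into the row randomShardRowKey(queueId). On receive: pick a random
    // shard row and column-slice its head; Cassandra keeps each row's
    // columns sorted by timestamp.
}
```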
If your ring even is that large. The only problem is that we are now missing out a little bit on orderly delivery, because we are reading from random shards; it's all random. It could happen, for instance, that the third message sits over in shard 99: if we happen to read from shard 99 first, we would get message 3 out before message 1. That would be out of order.
However, here is something that would never happen. Say the first message and the last message happen to sit in the same shard, shard number zero. You would never read message six before message one, because message one is essentially in front of it in the same shard. So what this really means is: globally, if you look at a high-traffic queue, you have orderly delivery, but locally, any two consecutive receive message calls may be out of order. We are using Redis to straighten that problem out, and I'll show you how that works.
Let's look at a simple data flow example involving a single queue. We send three messages, messages 1, 2, and 3, in that order; then we make one call to receive message and hopefully get the first message we sent, message one; and then, assuming the happy day scenario, we come back to delete that message. In this visual, what you have in the upper left is the web service endpoint.
In fact, that is the only component of the system that you, as a user of this service, will ever see and interact with. This one here is the Cassandra column family, the architecture I just described, and on the right side we have our Redis caching layer. You can already see that Redis is not just a simple key-value store like memcached, for instance; it actually offers rich data structures. You can create lists, hash tables, sorted sets, all kinds of things, and we're making heavy use of that.
So, if I call send message for the first time, the call comes in on the service endpoint, the message is written to a random shard in our Cassandra column family, and the message ID goes into a Redis list. Notice (I hope you can see it) that the ID we store in that list is again a composite key, this time consisting of the row key from the Cassandra column family concatenated with the column name. So this is still very little, very dense information, but it's enough to pinpoint the message payload in the Cassandra column family.
Then I send the second and the third message; as you can see, they trickle into random shards in the Cassandra column family, but they line up perfectly in that Redis list. So we have one Redis list per queue. Then we call receive message.
What that does is essentially pop the message ID off the Redis list, read the payload out of Cassandra, and also put that message ID into a hash table in Redis, along with a future-dated timestamp 30 seconds into the future. A background process checks this hash table periodically, and if we don't delete the message within that 30-second window, it is placed back on the list; that is what constitutes making the message visible again. But here we are assuming the happy day scenario.
We are calling delete, which means that ID will be removed from the hash table, and finally we delete the message from Cassandra as well. I should step back one slide, because you should notice what happens in Cassandra when I actually receive the message: nothing, really. By handling this entire visibility timeout in Redis, we're saving ourselves some churn in Cassandra.
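A rough sketch of that Redis bookkeeping using the Jedis client; the key names and the shape of the background check are illustrative assumptions, not the actual CMB implementation.

```java
import java.util.Map;
import redis.clients.jedis.Jedis;

public class VisibilityTimeout {
    static final long TIMEOUT_MS = 30_000;

    // receive: pop the head message ID off the queue's list and park it
    // in an "in flight" hash with a redelivery deadline. The caller then
    // reads the payload from Cassandra using this composite ID.
    static String receive(Jedis redis, String queueId) {
        String msgId = redis.lpop(queueId + "-Q");
        if (msgId != null) {
            long deadline = System.currentTimeMillis() + TIMEOUT_MS;
            redis.hset(queueId + "-H", msgId, Long.toString(deadline));
        }
        return msgId;
    }

    // delete: the happy day case; the ID leaves the hash and is never
    // redelivered. (The payload is also deleted from Cassandra.)
    static void delete(Jedis redis, String queueId, String msgId) {
        redis.hdel(queueId + "-H", msgId);
    }

    // Background task: any in-flight ID past its deadline goes back on
    // the list, which is what makes the message visible again.
    static void revisibilityCheck(Jedis redis, String queueId) {
        long now = System.currentTimeMillis();
        for (Map.Entry<String, String> e : redis.hgetAll(queueId + "-H").entrySet()) {
            if (Long.parseLong(e.getValue()) < now) {
                redis.hdel(queueId + "-H", e.getKey());
                redis.lpush(queueId + "-Q", e.getKey());
            }
        }
    }
}
```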
I will take questions after the talk. Okay, so, architecture recap: we have a Cassandra persistence layer, and messages are sharded across 100 rows per queue. We are mainly doing that to avoid wide rows, by which I mean rows wider than 500,000 entries, to minimize churn and the creation of tombstones and, most of all, to distribute queues among Cassandra nodes.
As for the key Cassandra features we are using, I just want to highlight two things on this slide. The main reason to choose Cassandra in the first place was its ability to replicate across data centers. We are using local quorum reads and writes, which we find strike a very nice balance: we still get the data-center-local response time characteristics we want, while making sure that eventually the data will arrive in all data centers.
So that was the main reason to choose Cassandra. But as you get into the nitty-gritty implementation details, there are also a lot of nice little features that help us out. I just want to point out one of those here: TTL. You can set a time-to-live on column values, and that fits very nicely with a feature of the queuing service, the SQS service, called the message retention period.
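Here is a hedged sketch of what those two settings, local quorum and TTL, look like from a client. This uses the DataStax CQL driver with an illustrative table, which is not necessarily the client or schema CMB itself used.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class LocalQuorumTtlExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("cqs");

        // LOCAL_QUORUM keeps the write inside the local data center while
        // replication ships it to the other data centers eventually, and
        // USING TTL expires the column after the retention period
        // (4 days = 345600 seconds here) with no explicit cleanup.
        SimpleStatement write = new SimpleStatement(
                "INSERT INTO messages (row_key, column1, value) " +
                "VALUES (?, ?, ?) USING TTL 345600",
                "queue42_17", System.currentTimeMillis(), "message payload");
        write.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
        session.execute(write);

        cluster.close();
    }
}
```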
So that's just one example. Scalability: we implemented the core APIs, send, receive, and delete, as constant time operations, so they should scale linearly, and they scale with the size of your Cassandra ring, the number of service endpoints (API servers, which are stateless) you have, and the number of Redis shards you're using; we're using sharded Redis, by the way.
The only thing I should point out here is that an individual queue cannot be sharded across Redis servers; the whole point of using a Redis list is to get perfect ordering.
Availability: we're banking, of course, on the availability of our Cassandra ring, but I should point out that the service does function without Redis around. If Redis crashes or is unavailable, the service will still be there. You will see out-of-order delivery, and you will probably see a lot of duplicates because of that hiding mechanism we use, but all of that is okay.
The service guarantees at-least-once delivery; duplicates and out-of-order delivery are allowed, so we still fulfill that contract even if Redis is not available. And here I have a slide that visualizes how we do data center failover right now.
What you see here are two data centers, data center one and data center two; both of them have the CQS service fully deployed. In this example there are three service endpoints behind a local load balancer, three Redis shards, and then, of course, a Cassandra ring that spans those two data centers.
All clients (clients are at the top here) go through a global load balancer that routes all traffic to the currently active data center, in this case data center one. Should that data center go down for any reason, from one moment to the next the global load balancer will fail over and route all traffic to the second data center. That's okay, because all the data stored in the queues is already available in the Cassandra ring.
The only problem is the Redis shards over there: they are completely unaware of anything that has happened so far; they are blank, basically. So on a cache miss we have, again, a background process that kicks in and heats up those caches, bootstrapping them with data read from the Cassandra ring, and depending on how much traffic you have, after a few seconds or a couple of minutes the performance characteristics of the service return to normal.
So this is not quite the active-active mode I was proposing earlier on my requirements slide, but you can see that by building this service on Cassandra, we've laid a great foundation for moving in that direction. It's really not Cassandra that stands in the way; in fact, the only reason we cannot do active-active right now is the Redis component, and we do have a roadmap and some ideas on how to get around that and improve it.
So we can actually do active-active in the future, but for now this is the recommended deployment and the way to use failover with this service. Okay, with that, that was my talk about CQS. Now I'm going to move to the topic-based publish-subscribe service: SNS, and CNS, our implementation of it. SNS is the Simple Notification Service, a topic-based pub/sub service.
Again we have a set of simple core APIs: you create topics, you subscribe endpoints to those topics, and then really the only API you use once you have all that set up is publish. You publish a message on a topic, and that message will be fanned out to all the subscribers for that topic at that point in time. Again, the service does not really trust message recipients, so there is a redelivery policy built into the service; this applies mainly to HTTP endpoints.
So if your endpoints are temporarily unavailable, respond with a 404, or are completely off the network, the service will attempt redelivery. I think by default it's 3 retries over the course of 60 seconds; that's the default policy, but you can adjust it to whatever you see fit.
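A sketch of what such a redelivery loop might look like; the fixed-interval schedule and the HTTP client shown here are illustrative assumptions, not CNS internals.

```java
import java.io.IOException;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class Redelivery {
    // A fixed-interval redelivery loop for one HTTP endpoint, mirroring
    // the described default of 3 retries spread over 60 seconds.
    static boolean deliver(HttpClient http, HttpRequest notification)
            throws InterruptedException {
        int retries = 3;
        long intervalMs = 60_000 / retries;
        for (int attempt = 0; attempt <= retries; attempt++) {
            try {
                HttpResponse<Void> r = http.send(
                        notification, HttpResponse.BodyHandlers.discarding());
                if (r.statusCode() / 100 == 2) {
                    return true; // delivered
                }
            } catch (IOException e) {
                // Endpoint off the network: fall through and retry.
            }
            if (attempt < retries) {
                Thread.sleep(intervalMs);
            }
        }
        return false; // policy exhausted; delivery gives up
    }
}
```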
Now, I don't think we have the time to go into the same level of detail as before, but I want to give you a dataflow example here as well, to explain how our implementation of this service functions.
I can't go into the same level of detail as I did with CQS, so don't worry if not every detail makes sense to you immediately; I just want to give you a rough idea of how we implemented the pub/sub service. Here is an example: the idea is that you have a single topic, and that topic has four subscribers, S1, S2, S5, and S6. The first two are HTTP subscribers, and the second two are CQS queues as subscribers. Again, I have a visualization of that: on the left is your service endpoint, where you make API calls.
On the far right you see those four subscribers for the topic: S1 and S2, the HTTP endpoints, and S5 and S6, the queues. In the middle, this blue box is what you, as a user of the service, never get to see: our internal implementation. The main point I want to make here is that CNS is implemented on top of CQS; we are using CQS queues internally to distribute the work of fanning out messages to subscribers.
So I call publish to publish a message on my topic T, and all the API server does is place a message on the first internal queue, the publish job queue. This message contains the payload we actually want to publish, as well as an identifier for the topic we want to publish it on.
From there, the message travels to the second queue, the endpoint publish queue, and as it does, we look up all the subscribers for that topic and replace the topic identifier with the list of all subscribers. Now, this list can be very long, and if it is, we actually split it up into multiple messages, as sketched below. Publish workers then pick these messages up from the endpoint publish queue and deliver them to the individual endpoints.
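A minimal sketch of that splitting step; the batch size and message shape are illustrative assumptions, not the actual CNS wire format.

```java
import java.util.ArrayList;
import java.util.List;

public class FanOut {
    static final int BATCH_SIZE = 100;

    // One endpoint-publish message: the payload plus a batch of
    // subscriber endpoints. The record shape is illustrative.
    record EndpointPublishJob(String payload, List<String> endpoints) {}

    // Split one publish job into batches so that no single message on
    // the endpoint publish queue carries an unbounded subscriber list.
    static List<EndpointPublishJob> split(String payload, List<String> subscribers) {
        List<EndpointPublishJob> jobs = new ArrayList<>();
        for (int i = 0; i < subscribers.size(); i += BATCH_SIZE) {
            int end = Math.min(i + BATCH_SIZE, subscribers.size());
            jobs.add(new EndpointPublishJob(
                    payload, List.copyOf(subscribers.subList(i, end))));
        }
        return jobs; // each job is enqueued on the endpoint publish queue
    }
}
```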
So, an architecture recap on CNS: we are using CQS queues internally. Those queues preserve messages when the publish workers I just mentioned are down or unavailable. Another aspect is that CQS has the visibility timeout I talked about, and that helps if publish workers crash. Say, on my last slide, the publish worker that picks up this message to publish it to endpoints five and six takes the message off the queue and then crashes, taking the message down with it.
If that happens, after 30 seconds the message will reappear on the queue (that's a feature of CQS), and another publish worker, a redundant one that has not crashed, can pick it up and redeliver. That may produce some duplicate deliveries, but again, that's okay, because the service only guarantees at-least-once delivery. And then, finally, we have the retry policy that helps with delivery when the receiving endpoints are temporarily unavailable.
So, comparing Amazon's SQS and SNS with our implementation, CQS and CNS: the goal, of course, is full API compatibility, and we're pretty much there. We have all APIs implemented and most parameters implemented; there are really only a few exceptions. For instance, we do not implement the latest version of Amazon's digital signatures: versions one and two are okay, but version 4 we don't support. But really, the things we don't implement are very few.
We do, however, have a few enhancements. We offer a number of APIs for management and monitoring of these services; I'm sure Amazon has these too, except they don't make them available to normal users. And all the limitations I mentioned earlier, like maximum message size, maximum message retention period, and maximum visibility timeout, all of these you can configure if you deploy CQS and CNS. Let's see where we are on time right now.
Well, first of all, we have made the source code available as open source, so it's up on GitHub for you to check out. If you do, you will notice that the CMB core code, basically all the architecture I just described, has really not seen a lot of change recently, not a lot of feature enhancements or bug fixes for that matter. We've done an extensive amount of testing, including scalability testing, and we're using it at Comcast for an increasing number of production systems.
I want to spend the rest of this talk telling you a little bit about the testing we did and the performance metrics we gathered, and then, finally, give you a single use case as an example. Testing: we've done functional testing, by which I mean unit tests, so we have pretty good code coverage. We've done stress testing, by which I mainly mean error case and corner case testing: we have simulated Redis outages and data center failovers.
We've done endurance tests where we ran the service for days or weeks or even months at a time. So we've done all that, but what was most important to us was actually proving that the system has that property of linear horizontal scalability. It took us a long time to get around to doing that, and there were several reasons why; I'll just give you one. If you think about it, you need a bunch of machines for this deployment.
You need API servers behind a load balancer, you need Redis shards, you need a Cassandra ring, and you probably also need extra machines to generate the load for testing. So even for a small deployment, you end up requiring probably ten machines or so, and then you want to scale up, probably to 50 or 100 machines. You can't really do that with a handful of boxes anymore; you need to go to the data center.
Well, I don't know if you have any experience with requesting a sizable chunk of capacity from a data center in a large organization; for me it was a first, and it turns out there is some process and overhead involved. From the time you actually request the capacity to the time you get access to it, some time can pass. So while we were waiting for that, I guess we got somewhat impatient.
At some point we were even contemplating doing this scalability testing in EC2, but then we figured it would be just too silly to first clone an Amazon service and then host it on EC2. We could have done that, but we didn't. What we actually ended up doing was testing it in an OpenStack cloud; it turns out that at Comcast we operate a number of data centers based on OpenStack.
Some of them already run production systems, but there are also a few that are still fairly experimental and only have a few tenants. The project group that operates these data centers actually approached us and said: hey, don't you want to test your service in our OpenStack data center, so we can see at the same time whether our infrastructure works? It was a win-win for both of us, so that's what we ended up doing, and I'll share some of the results we found.
The way we went about this: we wanted to make sure that all components of the system were essentially oversized except for the Cassandra ring, so that as we ran our tests, everything else would be underutilized, not exactly idle but not stressed, while we put the load on the Cassandra ring.
We would take whatever throughput number we measured, then increase the ring size and run the same test again. We played around with Cassandra ring sizes from four to twenty-four nodes and had the necessary capacity of API servers and Redis shards around it. This bar chart shows a typical test run; we did many, many of those. Each bar represents one test minute, and what you see here are absolute numbers: how many API calls we sent to the service.
The color coding tells you which calls they were: the blue ones are send message calls, putting messages on queues; the green ones are receive message calls; and the brown ones, I believe, are delete message calls. One interesting thing you already see here is that apparently sends are faster than receives and deletes, and this is actually a characteristic we would expect from a system backed by Cassandra: writes are cheaper than reads and deletes. By the way, these tests are shown from the perspective of a single API server.
We have ten of those, so what you're seeing here is ten percent of the load. Those tests pushed two million messages each, in roughly 15 minutes or so. Two million messages translates to 6 million API calls, because each message takes a send, a receive, and a delete, and a single API server sees ten percent of that load, which translates to 200,000 messages. For that very same test we also captured the response time percentiles.
The first chart here shows the response time percentiles on the API server: the end-to-end time each API call took, for that mix of send, receive, and delete calls. The resolution is 10 milliseconds, so each change in color means an increment of 10 milliseconds in response time.
The greener it is, the faster it was; the more red or purple, the longer it took. The second and third diagrams show the response time percentiles for those same calls, but broken down into how much of the time was spent in Redis (the second diagram) and how much in Cassandra (the third), and the resolution for those two is one millisecond, not ten.
Without going into too much detail, what you can see is that the API servers as well as the Redis shards were still pretty happy, whereas Cassandra was stressed: during the middle of this test, more than fifty percent of the calls came back purple, which in this diagram means 10 milliseconds or more, potentially way more; if one of those calls took a second or two, it would also fall into this purple category.
For each and every test we ran, you see this starts out perfectly linear, then at 12 nodes, as we add nodes, there is a slight drop-off, and then it continues linearly. If you're asking why that slight drop-off is there: we're still investigating. This is brand-new data, by the way; we took these numbers over the past couple of weeks, so you are actually the first audience to see it. I would say, first of all, that we are not really testing just a Cassandra ring under perfect lab conditions.
We are actually putting a realistic load on a relatively complex system that contains a Cassandra ring, and as much as we tried to make all the other factors disappear, we may not have completely succeeded. From the numbers we looked at, it still looked like there was some room in Cassandra for more, so this line could have been more linear. To give you one example of those factors: we didn't have a real load balancer at our disposal.
We were actually resorting to a software load balancer, HAProxy, and as we increased the load we saw some issues with it, so we had to tweak it a little; there may very well be other such factors in those tests. But overall, we wanted near-linear scalability, and it doesn't look like we are entirely off here; we were actually pretty happy with these results.
Then we did the same thing for CNS, the pub/sub service. We wanted to see throughput scalability here as well, and here the most important metric is end-to-end latency: how long does it take from the moment you publish a message on a topic until all the subscribers have received that message?
In this case, we wanted to make sure that the underlying CQS service was underutilized and not stressed; we wanted to stress the publish workers instead and see how throughput increased as we increased the number of publish workers handling the fan-out. I'll spare you the numbers and just show you the visuals; the most interesting thing here is circled in green.
We could double the throughput to eight thousand messages per second by doubling the number of publish workers, while retaining an end-to-end latency of roughly 200 milliseconds; so, again, near-linear scalability. That was the testing and performance data I wanted to share with you. We are almost at the end; I'll conclude with a use case: the sports app.
This runs on the X1 set-top box, which we now deliver to all new Xfinity customers, and one of the nice things about the X1 box is that you can run apps on it. This is a screenshot from one of our most popular apps, the sports app. On the left-hand side, where it says Travel HD, is where you would continue to watch your television program as you bring up an app, and on the right-hand side you see the current scores.
You can drill into a detail screen for a live game and see some more statistics about it. Behind the scenes, we get all this data from a third-party data service called stats.com, and in our back end we use a CNS topic to fan out game event notifications to all active instances of the sports app. I have another screenshot that shows you what kind of traffic went over that topic; this one was taken over a period of about one month last year.
You see these little red clusters here: those are evenings or afternoons with game activity, and typically we have maybe 10 and up to 20 messages per second when there is game activity. So it's not really a very high throughput application; over that entire month I think we collected roughly a million publish calls, a million messages. There were some evenings with more games and therefore higher loads, but the interesting point is this spike.
It went up to 130 messages per second for a while. Now, this was not a production system; it was actually a QA system that was not terribly redundant. In fact, it had a single publish worker, and that publish worker failed at that point in time. As you can see, when that happens, when you have no publish workers around, no message gets lost, because the messages are all parked in an internal CQS queue.
So when we fired up the publish worker again, it basically just caught up and published all the messages it had missed, simply reading them from the internal CQS queue. I thought that was interesting. So, this is my last slide.
Moving forward, our CQS and CNS services are still under active development. We will continue to follow the SNS and SQS APIs as they evolve. We still have plans for more load and stress testing, and one recent focus has been ease of deployment and scaling.
We don't have those Puppet scripts up on our GitHub repository yet, but we will make them available in the near future. And then, finally, we are looking at more in-house production deployments, which are currently all isolated by application, and eventually maybe we can even offer CQS and CNS as a service. And yeah, that's all I've got for you. I don't know if we have time for a couple of questions. We do? Okay, great. If there are any questions, could you come up to the mic? Thanks.
[In answer to an audience question about message sizes:] Those were one-kilobyte messages, and we actually found that in that range, anything from a few bytes up to about two kilobytes, we get pretty much the same numbers. You would start to see latencies shift and throughput go down with message sizes north of two kilobytes, basically. Okay, other questions? Yes, go ahead.
That's an excellent question. I was simplifying the story a little bit: between our topic that publishes and the set-top box, there is an app server in between. The set-top boxes are entirely cloud-based, so they are pretty simple in their design, and the actual app runs on an app server in a data center.
We have a number of these app servers as we scale up, but each of those app servers takes care of a whole bunch of set-top boxes. So we are actually publishing to the app servers, and they then, in a second layer, fan out to the actual set-top boxes. So the HTTP endpoint is on the app server, not on the set-top box.