Description
Speaker: Roy Bailey, Director of Neo Platform Services at UBS Securities
In this talk, Roy will discuss how their large scale client-facing application initiatives at UBS Securities utilize Apache Cassandra. This talk dives into their search for a scalable solution which allows them to serve their investment bank's equity time series data across the globe.
Good afternoon everybody, my name is Roy Bailey and I work for the UBS investment bank. Not a bad crowd, considering the competition I've got in this particular slot; I was expecting a few fewer. This presentation is one I gave earlier in the year at a meet-up to explain to people the story, if you will, behind the database choice that UBS made some years ago with Cassandra, and it's quite a high-level story. Could I just ask how many people in the audience have got Cassandra live in production? Okay, and the rest of you are perhaps investigating and thinking about it. I'd say this is maybe useful to those thinking of adopting Cassandra; to those of you already in production, probably not much.
UBS Neo is where I work; it falls under the e-commerce department, but UBS Neo is the platform that I work on. There's a little URL there: if you want to know a bit more about it, you can punch that in and have a look, it's got a microsite. This platform was a very bold initiative, nearly five years ago now, to rebuild a client-facing application across the entire investment bank, and it's the Neo platform that introduced, if you will, the new, innovative database technologies onto the stack.
So where did we begin? As I say, UBS Neo was a very large initiative nearly five years ago to consolidate the many, many client-facing applications that the bank already had. Those applications were in themselves very good; individually, they could win awards for certain functionality in their particular field. But what was clear is that there was no unified identity for UBS.
We would bring those departmental systems down in tests, simply because we had so many users hitting them, trying to pull back large volumes of time-series data. So this really presented the problem that we faced, as I say, nearly five years ago when this initiative was started: we needed to find a way of serving time-series data that was very scalable.
This gives a pictorial view of what I've just said. We have seven data centers across the world, but a lot of the departmental systems that held time-series data had been built in a particular region, for a particular vertical of the business, and they simply weren't up to the idea of global users, of the volume that we were looking to service, coming in and hitting those departmental systems directly.
Cassandra held, if you will, the time-series data and provided very fast reads to the front-end system wherever we were in the globe, so that it was extremely quick, it didn't present any load, and it protected the underlying time-series systems. It pulled the data up from those systems on a daily basis to refresh the Cassandra distributed cache, if you will, and that was why Cassandra was chosen. That was the problem five years ago; that was the problem that Cassandra was brought in to solve.
This was something that we were picking up to solve real problems, and that's a good credit to it. Documentation at the time was very hard to come by and limited; again, this was the 0.6 version, so the only documentation you had was very much the forums and the like, and a lot of that was blocked by internal internet policies.
So there was a lot of working around that, circumventing that, and it took several months of effort, obviously, to put that together, because there simply wasn't the expertise available, either in the bank or outside, there wasn't readily available documentation, and things were moving quite fast at that time as well. So it did present a number of challenges.
So what did we store? Like I say, the initial problem was our time-series data, but what became apparent very quickly, once we had Cassandra in our mix as an available data store, was that a number of other use cases quickly came on board off the back of it simply being available.
The second purpose was instrument data itself, where we had to load an enormous amount of financial instrument data into our search engine. It was the same kind of problem really, although not to the degree of the time series: we had a lot of departmental systems that would feed the search engine, and always going back to those departmental systems put an enormous strain on them if we ever had to rebuild the search index. So very, very quickly Cassandra was used almost like a document store.
It was used, if you will, to again bring data from departmental systems, make it readily available, keep it refreshed, and allow the other parts of our platform to access it very, very quickly and refresh from it very, very quickly. And then there's the sort of human side of it: we no longer just had this relational database, this managed service; we now had this Cassandra store.
There was a whole bunch of different use cases where people started to put data into Cassandra. You know, developers tend to get a little trigger-happy when they get a new technology, so in a few edge cases they didn't necessarily pick the right store, but generally, if there was a need to store some data down, then people were looking to Cassandra, because it was very quick and very straightforward to get data in and get data out.
We made a few choices around how to store this data. The first one is natural keys. In Cassandra you have rows, or partitions I think they're called now, and columns. If you look at our use case, it was really a distributed cache, so natural keys made sense, because we're not mastering data into Cassandra.
The only consideration we had was: is this data atomic? It doesn't involve lots of relationships and joins and things. And so data is stored, again, in rows that naturally collect related data together, but in columns that keep it separate, so that you have the atomic nature of the pieces of data that go to make up the collection.
There were a lot of use cases here: very skinny tables of key-values, binary blobs for images. Cassandra really just didn't have a problem with it. The key question was: do you have relationships, or do you just need to store chunks of data that you want to be able to query efficiently and get out quickly?
This diagram just illustrates that natural fit of time-series data going into Cassandra. We have the feeds from our departmental systems coming up and loading the data into Cassandra, using natural keys for the rows, and then obviously a variable width per instrument, depending on how much time-series data we have. It may go all the way back to 1900, or it may only go back a few years or a few months, depending on the age of that particular instrument, but using the time in the column name means the data is automatically sorted.
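That wide-row pattern can be sketched outside Cassandra. This is only a toy illustration (the `TimeSeriesStore` class, the RIC-style instrument key, and the prices are all invented for the example), but it shows why keying columns by date, kept in comparator order, makes per-instrument range reads cheap:

```python
import bisect
from collections import defaultdict

class TimeSeriesStore:
    """Toy model of the wide-row pattern: one partition per instrument
    (a natural key), with columns keyed by date and kept sorted, the way
    Cassandra's comparator-ordered column names would keep them."""

    def __init__(self):
        # natural key -> sorted list of (date, value) "columns"
        self._rows = defaultdict(list)

    def insert(self, instrument, date, value):
        # insort keeps the row ordered by date on every write
        bisect.insort(self._rows[instrument], (date, value))

    def slice(self, instrument, start, end):
        """Range query: the equivalent of a column slice between two dates."""
        row = self._rows[instrument]
        lo = bisect.bisect_left(row, (start,))
        hi = bisect.bisect_right(row, (end, float("inf")))
        return row[lo:hi]

store = TimeSeriesStore()
store.insert("UBSN.VX", "2013-06-12", 16.41)
store.insert("UBSN.VX", "2013-06-10", 16.02)
store.insert("UBSN.VX", "2013-06-11", 16.20)
print(store.slice("UBSN.VX", "2013-06-10", "2013-06-11"))
```

Inserts can arrive in any order; the slice still comes back date-sorted, which is the "automatically sorted" property the talk relies on.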
Upgrades are not always something that you really stop to think about when you start to introduce something new. Certainly I've worked in several places where there's been a push for a new technology, and people are very enthusiastic because it solves a problem, but they haven't really thought through the whole lifecycle, the total cost of introducing that technology. So, to touch on the upgrades we've been through: as I say, we went live with version 0.7, with only a couple of data centers, five nodes each.
We then had to upgrade that to the 1.0 release. That was a little bit more work for us, because at the time the tooling wasn't around to sort of reconcile the data, so we were kind of nervous about going from a 0.7 release to a 1.0 release in case any of the data was lost. So we spent a little bit of extra effort there to make sure that was all sound, and that we had the checks in place to be able to check some of the data before and afterwards.
But that went through very, very smoothly; there were no surprises out of that. And then more recently this year we've upgraded to Cassandra 2.0 through the DataStax product, and that was incredibly smooth as well. Those kinds of lessons on a journey like this are quite important, because, you know, as a big organization we don't like to go through these upgrades very often. It's actually quite hard to get the time, to tell the business:
"You know, we're going to have to divert some of your precious resources onto upgrading a database technology." "And what am I going to get for it?" "There's performance, maybe something else." It doesn't go down very well, so you want it to be painless, you want it to be reasonably efficient, and in our case we've not had any difficulties there.
We are using the Thrift driver, which is why that last point is there: because it was five years ago, and obviously CQL is very new. We're currently in the process of looking at migrating everything over to CQL, and I can only say that's another step in a positive direction. It looks very, very promising indeed.
Okay, so what did we learn through this process? We did go through several pain points, you know, around the testing, and I'll share those with you. Really, really wide columns: we built our own indexes, because the local indexes that Cassandra supports were not suitable for the kind of lookups that we wanted to do, but in our urgency we kind of put too much into one row. To its credit, Cassandra simply complained and said, in effect, "you've really quite badly hurt me, my memory's not doing so good", and it was very, very easy to see how we should have chunked that down and broken it out into different partitions in order to spread that load across the different Cassandra nodes. That was quite a big one for us, but again it was nice to see how Cassandra dealt with it: yes, it complained, as it should, but it didn't fall on its knees and give up.
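The usual way to chunk a too-wide row down, as described above, is to fold a coarse time bucket into the partition key. A minimal sketch of the idea; the `bucketed_key` helper and the one-bucket-per-year choice are illustrative assumptions, not what UBS actually used:

```python
def bucketed_key(instrument: str, date: str) -> str:
    """Fold a coarse time bucket (here, the year) into the partition key,
    so a century of daily points spreads over ~100 partitions instead of
    one enormous row that strains a single node's memory."""
    year = date[:4]  # dates as ISO strings, e.g. "1987-10-19"
    return f"{instrument}:{year}"

def keys_for_range(instrument: str, start_year: int, end_year: int):
    """A date-range read now fans out over the bucketed partitions."""
    return [f"{instrument}:{y}" for y in range(start_year, end_year + 1)]

print(bucketed_key("UBSN.VX", "1987-10-19"))   # UBSN.VX:1987
print(keys_for_range("UBSN.VX", 1986, 1988))
```

The trade-off is that a multi-year read touches several partitions instead of one, but each stays small enough to live comfortably on a node, which is exactly the load-spreading the speaker says they should have done.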
If you use the default compaction strategy and you have a lot of these tombstones building up, then you will get timeouts, because Cassandra is effectively having to seek its way past lots of tombstones in order to get to the real data; but again, changing to leveled compaction has leveled that out and removed that problem. Performance tuning: this is something that, when a product works, you tend not to spend a lot of time on, but we got a lot of value from actually looking at the settings that we had in Cassandra, the memory settings in particular, and making sure that things were set up correctly. Using the natural keys for the actual data itself, which is the primary use case, worked very well; but if you iterate through partition keys, it's quite a slow process, so look at creating your own indexes so that you can do fast query lookups.
So natural keys made sense; but if I were mastering data, then I probably wouldn't want to go with a natural key. Think about your queries, and understand how the table model of rows and ordered column names works to your advantage for the different use cases that you have. And your time-to-live, so how long something is going to stick around before it gets deleted, is something else you might want to think about; being able to expire data without having to write your own code is extremely handy.
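In Cassandra the TTL is attached per write, so expiry needs no application code at all. The semantics can be simulated in a few lines; the `TTLStore` class and the fake clock below are invented purely for illustration:

```python
import time

class TTLStore:
    """Toy simulation of per-write TTL semantics: each value carries its
    own expiry time, and reads treat expired values as absent."""

    def __init__(self, clock=time.time):
        self._data = {}
        self._clock = clock  # injectable clock so the demo is deterministic

    def put(self, key, value, ttl_seconds=None):
        expires = self._clock() + ttl_seconds if ttl_seconds else None
        self._data[key] = (value, expires)

    def get(self, key):
        value, expires = self._data.get(key, (None, None))
        if expires is not None and self._clock() >= expires:
            return None  # expired: behaves as if it had been deleted
        return value

# Fake clock so the example runs the same every time.
now = [1000.0]
store = TTLStore(clock=lambda: now[0])
store.put("session", "abc123", ttl_seconds=60)
print(store.get("session"))  # abc123
now[0] += 61
print(store.get("session"))  # None
```

In real Cassandra the expired cell later becomes a tombstone and is purged by compaction, which is what connects this feature back to the compaction tuning discussed earlier.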
Cassandra usually does a pretty good job of just handling it, but it does make a lot of sense to understand how your rows and your columns are ordered, how your queries are going to pull out that data, and to tweak some of those database settings just to make sure that you get the best performance possible, because you can make quite a bit of difference there.
So, just to round off: overall, Cassandra did just work. I could say that even five years ago, on a 0.6 or 0.7 version, it worked, and it worked extremely well. It has that time-series sweet spot, because of the column ordering, if you use that standard pattern, and it's extremely good at taking a lot of writes, and even a lot of reads, across a globally distributed cluster.
So overall, I think our production support department, who obviously have to maintain and look after issues around a number of data technologies as well as other infrastructure, really just don't have a lot of problems with the Cassandra store. I think it's probably their favorite out of all of them, because we've not had production issues that have caused them any pain. So that's another good endorsement. That pretty much rounds it up for me. Thank you very much for listening; happy to take questions now, or outside afterwards if there's anything else.
It was different: instead of having time across the columns, we effectively had, across the columns, the documents that made up that entire instrument's reference data. So for a bond you would have the issue, and you'd have schedules, and you'd have ratings, and so you'd have all of these pieces that made up that entire financial instrument. They would come in from different sources, and, you know, Cassandra's very good for that atomic storing of data and retrieving of data. You don't want to be trying to merge at the Cassandra level.
You want to keep those blobs, if you will, separate but together in the row for that instrument. So the same kind of conceptual model is there: an instrument's data is in one row, so that it's all together, it's on the same nodes, and when you want to grab it all, you've got it all there to hand. But the columns are then used slightly differently, because instead of time series you just have different chunks of data that go to make up that full picture.
But the pain points I was alluding to are more the fact that, you know, it was in beta; there wasn't a lot of documentation. Now we have a lot of good material out there for the different versions; back then there wasn't a lot, so you had to really just experiment, and that's where a lot of the time was taken.
But we have thought about using Cassandra for, if you will, the time series of reference data as well, and there are actually a number of, I call them spin-offs, they're not really spin-offs, but there's a number of departments that are now using Cassandra. They're either about to go into production or they're close to going into production, and that's one of the use cases, but they're a lot bigger than this one. This is just the oldest, if you will.
That's where you get into the classification and building your indexes for that particular query. So that's where we created various indexes for different sorts of lookups, where we wanted to classify instruments together and then be able to come in and grab them. It just creates that fast response, to be able to say "give me all the instruments of this classification", and then I can go away and pull down the data that I need, you know, very, very quickly, on each of those.
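A hand-maintained index of that kind is essentially an inverted index: one row per classification whose columns are the matching instrument keys, so the whole classification comes back in a single partition read. A hedged sketch; the `ClassificationIndex` class and the classification values are made up for the example:

```python
from collections import defaultdict

class ClassificationIndex:
    """Hand-rolled inverted index: one 'row' per classification whose
    'columns' are the instrument keys, so 'give me all the instruments
    of this classification' is one cheap lookup."""

    def __init__(self):
        self._index = defaultdict(set)

    def add(self, classification, instrument_key):
        # Writers maintain the index alongside the data itself.
        self._index[classification].add(instrument_key)

    def lookup(self, classification):
        # Sorted for deterministic output; in Cassandra the column
        # comparator would impose the ordering.
        return sorted(self._index[classification])

idx = ClassificationIndex()
idx.add("equity/emea", "UBSN.VX")
idx.add("equity/emea", "CSGN.VX")
idx.add("bond/govt", "DE0001102309")
print(idx.lookup("equity/emea"))  # ['CSGN.VX', 'UBSN.VX']
```

The cost of the pattern is that the application, not the database, is responsible for keeping the index consistent with the data, which is the trade the speaker accepted when the built-in secondary indexes didn't fit their lookups.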
Yes, and it's got very good settings in terms of being able to tune throughput and things as well, because when we went out to Sydney, for example, there's a lot more latency there, and we didn't want to bring down the pipe. So being able to set a throughput cap on the replication was a good benefit, because we could tell the network guys that we're not going to take up all their bandwidth just because we're bulk-loading lots of data.
Unless we build the tools ourselves to pull out our data and to show it, you don't really get a lot out of the box, and with the introduction of CQL you can use CQL to query our Thrift tables, but you get a lot of blobs, which isn't very helpful. So it's partly being able to query: to just be able to run some routine maintenance if there is an issue with an instrument, and not have to develop a specific tool to pull that data out.
To just be able to do a select statement and have a look at the data is one motivation. The other motivation is the fact, as you may have seen from some of the other talks, from the keynote speech for example, that CQL, as the performance-tuned driver that is being taken forward, is going to outstrip the Thrift driver more and more as time goes on. So I see the Thrift implementation as something that's going to go end-of-life at some point, and that's the secondary reason.
There is a learning curve in that process, but what we've been able to do so far is recreate, if you will, the same sorts of models that we've got in Thrift in CQL. So it's really more a case of choosing the timing: whenever we're going to go in and do a little bit of maintenance or enhancements for the business, then at that point we'll probably flick it over.
I mean, one of the tricks I blogged about was to create your table in CQL and then go into the Cassandra CLI and have a look at exactly how your data is being stored. You'll very quickly be able to see how the CQL partition key and the column keys end up actually laid out in the storage engine. In some ways it was a shame when they deprecated that, because it's very comforting, for me anyway, having built it once, to be able to create that CQL and see into it.
Yeah, in all honesty, the cluster runs 24/7; we haven't had any outages, we haven't had any problems. But in reality our customers aren't around at the weekend, and there's all sorts of maintenance that goes on in the infrastructure world, so if we ever do need a chunk of downtime, we do have that luxury of being able to take a few hours out at the weekends.