Apache Cassandra Financial NoSQL Apache Cassandra Use Cases, 27 Dec 2014

From YouTube: Noble Group: Normalised, Non Tick Time Series Wearing a DSL Cap

Description

Speakers: David Haines, Head of Front Office Development & Aleksa Vukotic, Head of Platform Development at Noble Group

Noble Group, a market-leading global supply chain manager of energy products, metals, and minerals uses Cassandra to power a decision-support system to assist the traders and analysts in ever-changing market conditions. In this talk, Noble will explain data modeling and querying techniques they employ to ensure high throughput and high performance data access using Cassandra.

A

I'm David Haines, head of front office development at Noble Group, and I'm joined by my colleague Aleksa Vukotic, who's the head of platform engineering. We're here today to talk to you about how we deal with time series data at Noble. Noble Group is a commodity trading outfit; it's primarily a global supply chain management firm. We've had explosive growth since 1987 with a bunch of diversified businesses. We've had a general historic IT strategy of buying off-the-shelf third-party products and then figuring out how to integrate them within our estate.

A

As a result, we didn't focus terribly much on building out any in-house developed software. That whole strategy has changed in the last two years.

A

We'll talk about that in a moment. One of the things that we've had to deal with at Noble is that we've had a number of different trading systems across all of the different businesses in which we operate.

A

We've got 10 trading systems in play, and we've got 20 different systems around risk, credit, market, trade flows, routing, etc. Each system is responsible for managing its own data, so we've got a lot of duplication and a very difficult time managing all of that data on behalf of the business.

A

So the main vision that we had was really to consolidate all of this data and put together a strategy around it: to find some common ground around what that data is, what it looks like, and what its nature is. So we had a massive group-wide, or firm-wide, effort around how we store data, how we manage it and how we query it. As a result, we started to evolve this global data platform. Internally we call that Stratus, and it's built on top of a microservices architecture.

A

It deals with what we refer to as polyglot persistence, where we look, as mentioned earlier, at the nature of the data. So we look at time series, at objects and at reference data, and we have different data channels and underlying technologies to support those.

A

For time series, which we'll be talking about during today's session, we have Cassandra as the backing data store. For objects, we have a self-describing, transactional object store, and that's built on top of Elasticsearch. Lastly, reference data is managed through a graph DB, Neo4j.

A

So today, as mentioned, we'll be talking about our time series DB. We deal with a lot of different types of time series data: prices, flows, weather data, fundamental data, supply and demand. This leads us into the requirements that we had in this channel.

A

So we wanted to start by simplifying the data model. To that end, we wanted a single, simple model, and we didn't want to deal with sharding across different column families.

A

We had to include observations and forecasts, again across all the different classes of data. We needed to ensure that we had responses within a maximum of 200 milliseconds, and that we were able to write an awful lot of data through this. At the moment we're writing one million points per day, and that's growing. Lastly, we needed the ability to version data that was of interest for various business needs.

A

So not all data is versioned. With that, I'll hand over to Aleksa.

B

Okay. As David said, the key thing we wanted to achieve with storing time series data is to have a simple and unified data model. With that in mind, we started to think: what is the minimum data model we need to store any time-based information? This is what we came up with. This is the CQL schema of our column family, and it is very simple and self-describing. We have a symbol and a curve, which define the row key.

B

We have a timestamp as the column key, and then we have a double value, which is the number that we observed for that particular time series. The row key has two components because we realized that a lot of our data contains an instrument, or target, of what we are observing or recording, and also a variable, which can differ but needs to be linked somehow to the original target.
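
A minimal sketch of the model just described, assuming a compound primary key with (symbol, curve) as the partition key and the timestamp as the clustering column. The table name, column names, sample values and the in-memory `TimeSeriesStore` class are all hypothetical stand-ins; the schema shown on the slide is not reproduced in the transcript.

```python
from datetime import datetime

# Assumed CQL shape of the single column family (all names hypothetical):
#   CREATE TABLE timeseries (
#       symbol text, curve text, ts timestamp, value double,
#       PRIMARY KEY ((symbol, curve), ts)
#   );


class TimeSeriesStore:
    """In-memory stand-in: one wide row per (symbol, curve) row key."""

    def __init__(self):
        self._rows = {}  # (symbol, curve) -> {timestamp: value}

    def write(self, symbol, curve, ts, value):
        # Writing the same timestamp twice overwrites the earlier value.
        self._rows.setdefault((symbol, curve), {})[ts] = float(value)

    def read(self, symbol, curve, start, end):
        # Return the columns of one row, in timestamp order, within [start, end].
        columns = self._rows.get((symbol, curve), {})
        return [(ts, columns[ts]) for ts in sorted(columns) if start <= ts <= end]


# Made-up observations: one symbol (the station), two curves (the variables).
store = TimeSeriesStore()
store.write("LONDON", "TEMP", datetime(2014, 12, 1), 7.5)
store.write("LONDON", "TEMP", datetime(2014, 12, 2), 6.0)
store.write("LONDON", "PRECIP", datetime(2014, 12, 1), 1.2)

london_temps = store.read(
    "LONDON", "TEMP", datetime(2014, 12, 1), datetime(2014, 12, 31)
)
```

Reading a series only needs the symbol, the curve and a time range, which is the property the rest of the talk relies on.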

B

As an example, think of a weather station, let's say London. You can measure different things in London: temperature, max temperature, mean temperature, precipitation, wind, wind direction, anything like that. In that sense, London would be the symbol, or the target of observation, and the temperature, or whatever variable we store, would go in the curve element of this model. This can apply to anything else as well.

B

So, if you think about prices, you can have, I don't know, Brent crude as an instrument, and then you observe different kinds of prices: open, close, ask, bid, settlement and so on. It also applies to any other type of data that we have. So, very simple, very easy. Taking weather as an example, our keys are quite encoded, because that's how our users understand them. So if you see here, this is our station.

B

EGLL is a common weather code for Heathrow, for example, and our users would know that. That's why we include that in the key rather than just saying Heathrow. And the data looks like this. It's very simple time series data: you have a timestamp and you have a temperature. These are observed minimum temperatures at Heathrow during this period of this year, and we can do whatever we want with that data.

B

We can plot it, graph it and show it to the user in this format. All nice and easy: a very simple data model, but also very simple data. This is just an observation, but we also have another big class of data that we have to cater for, which is forecasts.

B

When you talk about forecasts, an additional time component comes into play: the forecast timestamp. So you have a forecast timestamp, and then you have the actual value timestamps, which are what the forecasts are for. So, for example, today you'll make a forecast on a six-hourly basis for the next two weeks, then you'll do the same thing this evening, and again the same thing tomorrow, all for the same stations.

B

So the question was: what do we do with this data? How do we store it? This is how the data looks. You see the forecast dates here, the actual forecast times in the column headers, and then these are the temperatures. I think these are also forecasts for Heathrow, but for a somewhat more summery, warm September time earlier this year.

B

So how do we fit this into the model that we just described? Do we actually fit it? Do we have another model? We wanted to try not to have another model; we wanted a single, unified model to describe everything. So what could we do? We could have done something like this.

B

In addition to the symbol and curve, which would be Heathrow and temperature, we can add the forecast timestamp to the row key, then have the value timestamps as columns, and all the rest would be the same. This would work; it's basically a pretty natural description. So it looks something like this. This is a bit simplified, but you have your station, then you have a forecast date as part of your key, and then you have timestamps up there for all the values.

B

The problem we had with this is that if you want to fetch any piece of information from Cassandra, you have to know the row key, and in this case you would have to know the timestamp of the forecast, which is not something we always know. That means we would have to manage it somewhere else, to map all the forecasts for this particular station. Forecasts are not uniform; they don't always come at the same times. Sometimes they run at ten past the hour, sometimes at half past the hour.

B

So it would be quite cumbersome to manage all that. So we said: okay, let's try not to do that; there's something simpler we can do. The idea came from looking at this table, which is how you would see forecasts. What if you just pivot it? If you replace columns with rows and rows with columns, you get something like this.

B

It's the same data, the same table. So how about looking at the Cassandra table like this: have the forecast date as the column key, and then encode the actual value dates, but encode them in such a way that you can easily reconstruct and understand them semantically? And when we talked to the users of this data, that's actually how they see it: as a set of offsets from a certain timestamp.

B

So if you say forecast for the 4th of December 2014, you are going to see forecasts for six hours ahead, 12 hours ahead, 18 hours ahead, and all the way up to two weeks.

B

So we could do this. In that case, station and variable stay as our symbol and curve, and what we add to the row key is the forecast offset, so h6 for six hours in this example, and we use the timestamp of when the forecast was actually made as the column key.

B

So the keys now look like this. This is still Heathrow, understandable to anyone who understands the weather data, and it has the addition of h6, which means six hours ahead. You can have h12, and you have a very well-defined semantic model whose meaning you can easily read. You can also have d1, which is a day ahead, or w1 for a week ahead, or any kind of offset or horizon format you want.
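
A small sketch of how such horizon codes could be decoded, assuming the format is a single unit letter (h, d or w) followed by a whole number, as in the examples h6, h12, d1 and w1. The function names and the tuple-shaped row key are hypothetical; the talk does not show how the key is actually serialized.

```python
import re
from datetime import timedelta

# Unit letter -> timedelta keyword argument.
_UNITS = {"h": "hours", "d": "days", "w": "weeks"}


def parse_horizon(code):
    """Turn a horizon code such as 'h6', 'd1' or 'w1' into a timedelta."""
    match = re.fullmatch(r"([hdw])(\d+)", code.lower())
    if match is None:
        raise ValueError(f"unrecognised horizon code: {code!r}")
    unit, count = match.groups()
    return timedelta(**{_UNITS[unit]: int(count)})


def forecast_row_key(symbol, curve, horizon):
    """Build the (symbol, curve, offset) row key, rejecting malformed offsets."""
    parse_horizon(horizon)  # raises ValueError if the code is malformed
    return (symbol, curve, horizon.lower())
```

Because the offset is readable on its own, a caller can build every row key for a station without first looking up when each forecast was made.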

B

And this is how it would look in Cassandra. So it says here h60, so this is the temperature 60 hours ahead from that point in time: a 60-hour-ahead forecast, and so on.

B

So you can easily get data like this, which is basically how, in this example, the six-hour-ahead, twelve-hour-ahead and ten-hour-ahead forecasts move through time. But what we lost in this model is the visibility of the actual forecast on a given date. This is something that's used, I guess, a bit less often than the data as we showed it here, but nevertheless we also needed to be able to easily answer a user's request such as: give me the forecast for the 6th of September.

B

So give me this.

B

In Cassandra this means a slice query against a range of row keys, which are semantically understandable, so you can easily construct them, and that's exactly what we did. We built something so that the user can easily get this kind of information as well, and it looks like this. These are screenshots from our actual tool, which we will talk about briefly a bit later, but this is what the user has to know.

B

They have to know their station code, the variable they want, and the date for which they want the forecast, and they will get a graph, a table, or whatever they want from it.

B

Behind the scenes, this function is part of a small DSL that we built on top of it. We actually do a slice query against the range of rows and reconstruct the actual forecast for that date.
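
The reconstruction step can be sketched as follows, assuming each (symbol, curve, offset) row maps the time the forecast was made to the forecast value. The dictionary stands in for Cassandra, and the function name, station key and temperatures are made up for illustration.

```python
from datetime import datetime, timedelta


def reconstruct_forecast(rows, symbol, curve, made_at, horizon_hours):
    """Rebuild one forecast run by reading the same column from several rows."""
    points = []
    for hours in sorted(horizon_hours):
        # One row per offset; the column key is when the forecast was made.
        columns = rows.get((symbol, curve, f"h{hours}"), {})
        if made_at in columns:
            valid_at = made_at + timedelta(hours=hours)
            points.append((valid_at, columns[made_at]))
    return points


made_at = datetime(2014, 9, 6, 0, 0)
rows = {
    ("HEATHROW", "TEMP", "h6"): {made_at: 14.0},
    ("HEATHROW", "TEMP", "h12"): {made_at: 19.5},
    ("HEATHROW", "TEMP", "h18"): {made_at: 16.0},
}
forecast = reconstruct_forecast(rows, "HEATHROW", "TEMP", made_at, [6, 12, 18])
```

Each offset costs one row read, which is the multi-row trade-off discussed later in the talk.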

B

What is the benefit of this approach? For the time series data that we deal with, it's quite universal. You can apply it to anything you want. This is a weather forecast.

B

We also saw weather observations previously, but we can do it for fundamental data, for example gas pipe flows or gas pipe flow forecasts, or power plant production, actual values or forecasts of production.

B

Anything of that sort fits into this model. What's also nice is that price data, which is another big class of data that we deal with, also fits into this model. If any of you come from finance, you might recognize this. These are Brent crude futures contract closing prices on different days in November.

B

These are monthly contracts, so these are the codes used in finance for the delivery month. This is January, this is February, this is March 2015. So for Brent for delivery in January 2015 the price is 106 dollars, and so on. But what people typically like to get when they're doing financial analysis of pricing is what they call the forward curve of a Brent contract.

B

A forward curve for a Brent contract is this one vertical slice of the data, which is very similar to the vertical slice we saw when we showed the forecast before. So, to get this information, you would use exactly the same function and the same code we used to pull the forecast. You would use a different symbol, obviously, because this is the symbol for Brent. This is the variable, the close price, and this is the forward curve of Brent as of two days ago.
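
A sketch of the forward-curve read under the same layout, assuming the contract code (for example F15) takes the place of the forecast offset in the row key. The month letters are the standard futures month codes; the function names, the key layout, the as-of date and the prices are hypothetical.

```python
from datetime import date

# Standard futures month codes: F = January, G = February, ... Z = December.
MONTH_CODES = "FGHJKMNQUVXZ"


def delivery_month(contract):
    """Decode e.g. 'F15' into (2015, 1). Two-digit years are assumed to be 20xx."""
    code, year = contract[0].upper(), int(contract[1:])
    return (2000 + year, MONTH_CODES.index(code) + 1)


def forward_curve(rows, symbol, curve, as_of, contracts):
    """Read one column (the as-of date) from each contract row."""
    points = []
    for contract in contracts:
        columns = rows.get((symbol, curve, contract), {})
        if as_of in columns:
            points.append((delivery_month(contract), columns[as_of]))
    return sorted(points)  # ordered by delivery month


as_of = date(2014, 11, 25)
rows = {  # made-up closing prices, for illustration only
    ("BRENT", "CLOSE", "G15"): {as_of: 80.5},
    ("BRENT", "CLOSE", "F15"): {as_of: 80.0},
    ("BRENT", "CLOSE", "H15"): {as_of: 81.0},
}
curve_points = forward_curve(rows, "BRENT", "CLOSE", as_of, ["F15", "G15", "H15"])
```

Structurally this is the same operation as rebuilding a forecast run: one column key, many rows.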

B

So what we got with this model is a very simple and, most importantly, unified model that we can rely on. It's one single table, one single column family, a simple row key, a simple column key, and just double values, and that applies to everything we want to store. So we can handle every user's needs for time series data very easily on this basis. It's also very performant; we'll touch on that in a second. There are a few drawbacks.

B

Obviously nothing is without them, and I'll just mention two here. One is that if you need a forecast for a particular date, or this forward function, you have to read multiple rows, which is obviously slower than if we did it the original way, where you read only a single row.

B

Another potential drawback is that you have now limited the number of rows you have for a forecast, but the rows become wider and wider. The more forecasts we store, the wider the rows will be, which is generally not a problem. Let's say you take a forecast on an hourly basis: that's 24 a day times 365 days a year. That's still not too much.

B

When we started this, we said the goal was to store at most one-minute ticks of prices. That's 60 ticks an hour, eight hours in a typical working day, 250 trading days, which comes to about a hundred thousand points a year. That means it would take about ten years for us to get to a million columns in a row, which is still all right, but it is something we are thinking about, and we may consider applying some sort of sharding of data for that sort of volume.
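
The row-width estimates work out as follows; the figures are the ones quoted in the talk, and the variable names are only for illustration. The exact products are slightly different from the rounded numbers given from the stage.

```python
# One-minute price ticks: 60 per hour, 8 hours per trading day, 250 trading days.
TICKS_PER_HOUR = 60
HOURS_PER_TRADING_DAY = 8
TRADING_DAYS_PER_YEAR = 250

points_per_year = TICKS_PER_HOUR * HOURS_PER_TRADING_DAY * TRADING_DAYS_PER_YEAR
years_to_million_columns = 1_000_000 / points_per_year

# Hourly forecasts: one new column per hour in each offset row.
hourly_columns_per_year = 24 * 365

print(points_per_year)                     # 120000, "about a hundred thousand"
print(round(years_to_million_columns, 1))  # 8.3, "about ten years"
print(hourly_columns_per_year)             # 8760
```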

B

So that's our data model. I'll now hand back to David to show you a few nice, cool bits we built on top of it, namely this DSL, which allows users to easily query data without knowing the complexity of the model behind it at all.

A

Right. So, as Aleksa mentioned, we built this DSL, and the reason we built it was to give the traders and the analysts a very simple way to operate on this data.

A

So we chose ANTLR, and this will give you an idea of what the grammar looks like. You can see more or less a line here for each of the functions that we make available through this grammar, and what we've done with this is build a Java service.

A

We built the implementation, and when a user provides a formula or a set of expressions, that gets fed through a two-pass parsing process. The first pass determines the actual symbols for which we'll need to go back to Cassandra and fetch data, to then assign values to those variables or symbols for the second and final pass. So we've created this service.
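
The two-pass idea can be sketched in a few lines. This is not the ANTLR grammar or the Java service from the talk: it swaps in Python's built-in `ast` module as the parser, supports only the four arithmetic operators, and takes a hypothetical `fetch` callable in place of the Cassandra lookup.

```python
import ast
import operator

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def collect_symbols(formula):
    """Pass 1: parse the formula and list every symbol it refers to."""
    tree = ast.parse(formula, mode="eval")
    return sorted({node.id for node in ast.walk(tree) if isinstance(node, ast.Name)})


def evaluate(formula, values):
    """Pass 2: walk the parsed formula again with the fetched values bound."""

    def visit(node):
        if isinstance(node, ast.Expression):
            return visit(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](visit(node.left), visit(node.right))
        if isinstance(node, ast.Name):
            return values[node.id]
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")

    return visit(ast.parse(formula, mode="eval"))


def run(formula, fetch):
    symbols = collect_symbols(formula)                   # pass 1: what do we need?
    values = {symbol: fetch(symbol) for symbol in symbols}  # one fetch per symbol
    return evaluate(formula, values)                     # pass 2: compute the result
```

Separating the passes means every symbol is fetched once, however many times it appears in the formula, before any arithmetic runs.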

A

This is the DSL-fronted service that we make available through a number of different channels: through web apps, through Excel, and obviously through various programming languages. This gives you an example of the sorts of things that our analysts and traders can quickly build. They can create these dashboards, where they can store formulas for a given graph and then quickly assemble them through a very simple tabbed interface.

A

This is fairly rudimentary, but it gives you an idea of what they come in and look at. I'm sorry, this is a bit hard to see at the moment.

B

A

But they can quickly and easily enter a formula and see how it evaluates on the right, and it allows them, as mentioned, to apply all of those functions defined in the grammar and operate naturally on things. So, for instance, if we're looking at an instrument such as heating oil, HO happens to be its root symbol.

A

When you talk about futures, you would have something like HOc1, which is the first nearby contract for heating oil, and HOc2, which is the second nearby. If you want a spread, it's really quite simple: it's HOc1 minus HOc2, a very normal, natural way for them to express that.
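
For whole series rather than single numbers, a spread such as first nearby minus second nearby could be computed point by point, as in this sketch. It assumes each series is a mapping from timestamp to value and that only timestamps present in both series are kept; the talk does not say how the service aligns series, and the prices are made up.

```python
from datetime import date


def spread(near, far):
    """Subtract two series point by point on the timestamps they share."""
    common = sorted(near.keys() & far.keys())
    return [(ts, near[ts] - far[ts]) for ts in common]


first_nearby = {date(2014, 12, 1): 3.0, date(2014, 12, 2): 3.25}
second_nearby = {date(2014, 12, 1): 2.5, date(2014, 12, 2): 2.75, date(2014, 12, 3): 2.5}
nearby_spread = spread(first_nearby, second_nearby)
```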

A

In closing, what we have at the moment is 12 nodes spread across three data centers, including EMEA and NOAM. We are currently ingesting around 1 million points per day, and we're seeing around 200 milliseconds of query time per formula.

A

So, are there any questions?

B

Yes. What we have is roughly 40 years' worth of data, but that's end-of-day prices, which in the end isn't too much: at most 365 points a year. For weather data we also have, I think, data from roughly 2007, most of it with hourly forecasts. Obviously, the more data you want, the bigger the performance hit you'll pay.

B

But what tends to happen for analysis purposes is that if you want all of the data, to run a model through it, then you accept the wait for it. But because we have a natural ordering of column keys, if you want the most recent data to do some quick pricing on the day, there is no penalty in performance.

B

The row can be as wide as you want, and if you want the last month or the last year, something like that, performance will be nearly constant in that sense. But yes, the more data we have, the slower it will be. That's just the nature of it, isn't it?
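
The point about ordered column keys can be illustrated with a binary search over sorted timestamps: finding where a recent window starts does not require scanning the older columns, so the cost depends mainly on how much data is returned. This is a plain-Python analogy with made-up data and a hypothetical function name, not a description of Cassandra's internals.

```python
from bisect import bisect_left
from datetime import date, timedelta


def recent_slice(column_keys, values, start):
    """Return the columns at or after `start` from a row sorted by column key."""
    first = bisect_left(column_keys, start)  # O(log n) seek, no full scan
    return list(zip(column_keys[first:], values[first:]))


# A row with ten years of daily columns, already in timestamp order.
column_keys = [date(2005, 1, 1) + timedelta(days=i) for i in range(3650)]
values = [float(i) for i in range(3650)]

last_week = recent_slice(column_keys, values, column_keys[-7])
```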

C

D

Yep, when you're, when you're compiling that do you keep the components of all the offsets that they may.

B

So, because we have different types of offsets, we treat the offset as reference data. We keep it somewhere else, but we can easily translate from a particular curve to the sort of offset it expects.

B

So we know, for example, that the Heathrow forecasts supplied by this WSI provider are made four times a day and are six-hourly. So we can easily get that information and then build the forward bit.

B

Yes. One of the reasons we're doing it this way is that we can easily know what the row key should be. With every data set, we work with the end users to see how they see the data, so we can store it in basically the same way and they can know the key, because that's more important than whether we know it or not. But yes, we do.

B

We have to be able to build all the constituent offsets based on just a single symbol, but that data is kept somewhere else, yes.

B

B

That is production. For dev and UAT, we have two-node clusters in each of the regions, which are separate, so developers can run against them. But obviously we don't run all those 12.

B

You want to answer: what's the question.

A

Yeah. As mentioned, at the outset we went through a number of different data models. We had a whole bunch of different ones along the way, and ultimately we felt that it was simplest just to stick with one. Because the DSL fronted the query path, that simplified the whole game for us. But that's not to dismiss the alternatives.

A

Obviously you could do this in many different ways, but for us it just simplified the whole thing.

E

B

E

You can say that you can tell.

B

Yeah. So for observations, it's as simple as a date range: you just get the data for that date range. If you want forecasts between two dates, then again it is just a range, only across a number of rows that are defined by the offsets that we have.

B

Yes, that's exactly all we do, and then we just return it as one curve, basically a synthetic one that we built.

E

Measuring units.

B

So, if you talk about indices, they are just observational data. There is no forecast element to them, so there's a single time component.

B

Yeah, so there will be different rows. The row would be, I don't know, the FTSE index, let's say, and then you would have different variables on it: is it the close of the day, the ask, the opening price, the closing price, whatever it is. They will be completely different rows, completely separate from that perspective.

B

Okay, I think oh.

E

B

B

To be honest, we haven't evaluated it. No. We're aware of its existence, but we haven't tried it out. At the time when we started, it felt that this was the most mature solution we could pick at that point. Obviously nothing is set in stone, but no, we haven't.

E

Yep. Did you use the byte-ordered partitioner?

A

E

A

No, it's random.

B

So again, we started using Cassandra because Cassandra is nice for time series, but we wanted to keep it as simple as possible. So we didn't go very deep into Cassandra. We knew what the model should look like and we wanted to use that, but we haven't really tried to optimize it any further. We wanted to see how it would work with what we have at the moment. It might be that we will evaluate whether we can get something better with a different partitioner, or with some other features of Cassandra which we are not really using at the moment.

B

But what we want is to constantly see how our model affects all that. So from that perspective, we started with something very simple, the random partitioner and all the standard settings, and we kept using it, and it's still working for us, although we do have quite a lot of data now. So no, the answer is no.

B

So we have roughly a million per day. Obviously they don't come all at once; they come in bursts.

B

Most of the data we get is, for example, end-of-day prices from all the exchanges, so that will come, yes. To be honest, we don't have metrics; we don't keep them. But we get everything as soon as we can, and we don't have any complaints about latency. Weather data we receive throughout the day. So for a forecast made at 6:00 a.m., we get the data from the providers probably by 6:10, and by the next minute or so it's all available to our traders.

B

So, to be honest, we don't have metrics, but we don't have any problems. Yes, which is probably why we didn't check.

C

B

An interesting question, probably one from our weather analysts: uh do you want to try.

A

Yeah I mean it wouldn't be through the dsl that we've provided very.

C

Very very readily.

B

Okay, that's about it! Thank you very much. Thank you.