Apache Cassandra Cassandra Summit Europe 2014, 27 Dec 2014

From YouTube: i2O Water: How Cassandra Helps i2OWater Save Over 235 Million Litres of Water Everyday

Description

Speakers: Mike Williams, Software and IT Director at i2O Water

In this presentation, I will give an overview of the SaaS platform and overall system that we have built at i2O Water to migrate our customers and assist i2O to scale its business. I will discuss its merits and especially the benefits that technologies such as Cassandra bring to overcome technical challenges that we faced with a more traditional architecture and tooling. I will discuss some of the challenges we have faced using leading-edge open source software tools and how we have tried to overcome them.

A

I'm Mike Williams and I work for a company called i2O Water. I'm grateful to have been invited here to speak today about how we make use of Cassandra, and indeed other technologies, to help the world save water. I'm sure you would all agree that that's a very laudable thing to do.

A

Water is an extremely precious resource and, unfortunately, probably in the not-too-distant future, it will become a very expensive commodity. I'm going to cover who we are, what we primarily do at i2O Water, and a little bit about how we go about doing this task of saving water.

A

I'll look at some of the third-party technologies we use within our solutions alongside Cassandra, then look at some of the main use cases that we have for Cassandra, and briefly talk about some of the challenges we face. Then, if I have time at the end, I'd like to finish with some of the future things that we will be doing. So who are i2O Water? Well, i2O Water have been in business since about 2005, and we currently have solutions that operate in over 25 countries in the world.

A

We were a small start-up in 2005 and we've grown to about 60 people, so we're not a huge company by the standard of some of the people attending this conference. Today, we work with over 70 water utilities around the world to help them save water. Just as an example, water can be incredibly cheap to produce and deliver. Here in the UK, for example, those of you from the UK or Northern Europe know that we have a lot of rain.

A

Therefore, it costs less than a few cents per litre to produce water. However, there are certain parts of the world, the Middle East and Far East, where water is extremely scarce and potable drinking water is also very expensive to produce. By comparison, it can cost in excess of two to three dollars per litre.

A

Around the world, we currently have about 2000 systems installed (I'll explain what a system is a little later), and this currently accounts for around two and a half terabytes' worth of data that we manage within our platform. Just to pick up on something from the keynote this morning, Billy was talking about how we might be doing things that can affect people's lives.

A

Obviously water is something that's very important to people's lives. Just as an example, recently in Saudi Arabia there was a large festival where people go to perform their Hajj at Mecca. As you may know, quite a lot of people do that: around 5 to 6 million people go to the specially set-up camps. i2O worked with the customers in Saudi Arabia to deliver clean drinking water and air conditioning, and to control all of that water delivery with 750 of our systems, guaranteeing literally one weekend's worth of water supply.

A

In previous years there have, unfortunately, been interruptions to that service, which have resulted in people dying. So what we do is really important, and I'm really happy to be here to tell you a little bit about it. As my initial slide said, our total daily savings of water for our customers around the globe is currently in excess of 235 million litres. Do you have any idea how much water that is?

A

Does anybody have anything that they could compare that to? A lake? Okay, possibly. It is the equivalent of more than 100 Olympic swimming pools' worth of water. If you think about that, that's a lot of water that's being wasted every day, and we can help save that. In fact, the amount that's being wasted is far in excess of that.

A

So what do we do? I apologize to those at the back; you may not be able to read this. Can you or not? Yeah? Okay. So this is a very simplified diagram of a water distribution network.

A

Water is fed into this distribution network, either by being pumped and abstracted from the ground or from reservoirs. Usually, water utilities divide this network up into a series of zones or areas that serve certain numbers of properties or factories, or combinations thereof.

A

i2O uses a combination of intelligent hardware devices, which we design, manufacture and program, together with a software-as-a-service platform with intelligent machine learning algorithms and optimizers. These control the pressures within this network and optimize the energy that the water utilities use for pumping water, for example from pumping stations or from reservoirs.

A

As we add more of our solutions into their network, the customers derive much higher value from our technology, up to the point where they're confident with our technology and can in turn deliver great value and service to what is important to them, which is their customers.

A

So how do we do it? Apologies, I'm going to talk a little bit about water networks and water pressure. I know this is not necessarily what we're all here for, but it gives you a bit of background as to what we're doing. Within a water network, the top graph shows how the pressure varies over time, and we have two days' worth of data shown there. When there's no active control of the network, the pressure into this zone is constant.

A

So the blue line at the top represents the pressure of the water entering the zone. Further downstream in the zone, at a point which is known as the critical point (usually the point where the pressure is lowest or the service is worst), the pressure varies. It follows this fairly well-understood diurnal pattern.

A

The gap between the two points in the network, where the pressures are being measured, is caused by frictional losses or leakage or other factors.

A

The red line represents the minimum pressure that the water utility is committed to deliver so that their customers within that zone all have water throughout the day, whatever they are going to do with it. The excess in pressure between the wiggly green line and the red line is the pressure we look to remove from the network.

A

This excess pressure leads to high leakage in the network and lots of bursts. Most of the networks are old. They don't have very good infrastructure, they're leaking constantly, and they are delivering excess pressure. As you can imagine with a hose or a pair of coupled pipes, the more pressure you push through it, the more likely it is to blow and burst, and major bursts are not what customers want.

A

So when we apply an i2O system (this is what I meant by that earlier), we actually put two of our intelligent devices into that network, one at the inlet to the zone and one at this lowest point. By gathering the data from those points and crunching it through our machine learning algorithms, we learn the characteristics of that zone. In doing so, after a short period of time, usually only two weeks, we can reduce that wiggly green pressure by actively controlling the blue pressure in the network, and it ends up looking like this.

A

So instead of a constant pressure being forced into the network, we vary the pressure according to the demand. As the demand increases at the peak times, for example in the morning when people get up to have showers, we ensure that there is sufficient pressure downstream for the customer. The tighter we can bring this green line to the red line, the more water we save, because there is a direct relationship between the pressure and the loss through leaks and bursts, as I said earlier.
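The idea of tracking the minimum pressure rather than holding the inlet constant can be sketched in a few lines. This is only an illustration: the quadratic head-loss model, the function names and the numbers are assumptions made for the example, not the learned models described in the talk.

```python
def head_loss(flow, loss_coeff):
    """Pressure lost between the inlet and the critical point.

    Assumption: loss grows with the square of the flow (a common simplification).
    """
    return loss_coeff * flow ** 2


def excess_pressure(inlet_pressure, flow, min_pressure, loss_coeff):
    """Pressure at the critical point above the committed minimum (green minus red)."""
    critical_pressure = inlet_pressure - head_loss(flow, loss_coeff)
    return critical_pressure - min_pressure


def inlet_setpoint(flow, min_pressure, loss_coeff, margin=0.0):
    """Inlet pressure that just meets the minimum (plus a margin) at this demand."""
    return min_pressure + margin + head_loss(flow, loss_coeff)


# A fixed inlet pressure at low, night-time demand leaves a large excess...
fixed_excess = excess_pressure(inlet_pressure=50.0, flow=2.0,
                               min_pressure=20.0, loss_coeff=0.5)
# ...while a demand-driven setpoint removes it.
controlled = inlet_setpoint(flow=2.0, min_pressure=20.0, loss_coeff=0.5)
controlled_excess = excess_pressure(controlled, 2.0, 20.0, 0.5)
```

With these made-up numbers the fixed inlet leaves 28 units of excess pressure at the critical point, whereas the demand-driven setpoint leaves none; at higher demand the setpoint rises again to cover the larger head loss.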

A

Just to give another concrete example: one of our customers is Syabas, who control the water in Kuala Lumpur in Malaysia. Kuala Lumpur is very, very short of water. They often have complete lockdowns on water, and sometimes they have outages where they can't deliver any water, so water to them is extremely precious. i2O currently covers about seventy percent of their water network. We produced up to a forty-eight percent reduction in bursts in that network for them over a period of a year.

A

We save them about a hundred million litres of water a day, which to them is incredibly valuable, and that returns to them about 7 million pounds' worth of cost savings, which they can reinvest in their infrastructure.

A

So this is really valuable. So how do we go about doing some of this stuff? Previously, when we were a start-up (we were formed, as I said, way back), what did we do? We had this very, very simplistic architecture, what you might consider to be a standard n-tier architecture, based at that time around a Microsoft .NET stack, because that was the background of the developers that were first with the company, and it was built on top of a clustered relational database.

A

Now that was a prototype which we had to take to production as a start-up, because we were trying to prove ourselves in the marketplace and demonstrate the value that we can deliver with our technologies, which I hope you've seen from some of the slides.

A

Unfortunately, as we all know, putting prototypes into production is never a great thing, and it continues to be there today, alongside the desperate need that we had to address the challenges of not only the architecture but also the growth of our business. As our business grows, we deploy more devices, we have more data, we have more customers, we have more users, etc.

A

So the problems of scale and maintenance, and basically of trying to be a true SaaS platform, this architecture did not address. So how does it look currently? I don't know if any of you were in the previous talk, but there are some similarities between our architecture and the architecture that was demonstrated by the previous speakers. We re-architected our platform in 2011, and it uses a completely event-driven architecture. So are people familiar with event-driven architecture?

A

It's hard up here for me with the bright lights; I can't really see you guys. But do people know what event-driven architectures are? A few nods and a few hands. Okay, thank you. So, in a very simplified manner, we have a set of very loosely coupled collaborating services arranged into what we call our ecosystem. These services communicate very indirectly, very loosely, via a distributed set of brokers, and they raise and consume important events.

A

We also have, as we'll come to see a little bit later, various flavors of data stores that we use within our architecture to hold the data that we require to help us run our algorithms, to help the customers save their water, and to allow them to remotely control their network, which is also very important. It saves them a lot of manpower if they can control their water network from a web application rather than having to send men in vehicles to sites and locations, which quite a lot of them still do.

A

This architecture addressed most, if not all, of the challenges we had with our prior architecture.

A

So we talked about services. Lots of people talk about services, so what are services to us? Services to us are logically a group of what we term single-responsibility handlers in software code, plus some infrastructure code that's present in all of our services, and some data stores. These data stores can be architected so that they are physically or logically separated between read stores and write stores, depending on the domain that the service is modeling. The services themselves can be scaled out across servers.

A

If we wish, they can be scaled out by having multiple instances, which can work together completely independently. They can also be scaled up: we can have multiple instances of handlers running and performing the same tasks, and we can group handlers into thread pools to enable us to allocate work to groups of handlers. Should we find that some of the services are under heavy load, then through analysis we can dynamically spin up new handlers or groups of handlers with more threads.
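As a rough sketch of grouping handlers into a thread pool and sizing that pool from load, the following uses Python's standard `concurrent.futures`. The sizing rule and all names here are hypothetical, chosen only to illustrate the idea; the talk does not say how i2O decides when to add threads.

```python
from concurrent.futures import ThreadPoolExecutor


def run_handler_group(handler, events, workers):
    """Run one single-responsibility handler over events using `workers` threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input events.
        return list(pool.map(handler, events))


def choose_workers(queue_depth, per_worker=100, max_workers=8):
    """Hypothetical sizing rule: one thread per `per_worker` queued events."""
    needed = -(-queue_depth // per_worker)  # ceiling division
    return max(1, min(max_workers, needed))
```

A service under heavy load would call `choose_workers` with its current backlog and pass the result to `run_handler_group`, so the same handler code runs with more or fewer threads as demand changes.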

A

This enables us to grow and shrink our ecosystem according to the demand of what's happening within it. Some examples of the domains that we use are: integrating data coming from the assets within the water network; allowing those assets to be configured, so that the customer can choose the settings that control the physical pieces of hardware they have in their network; and, the key one for us, pressure optimization, which is how we optimize the pressures in the network.

A

This event-driven architecture promotes great autonomy of these services, and that's why we think of it as an ecosystem. We have a set of governance rules in our code base that protects us against changes that are going to break our ecosystem.

A

We made some technology choices, which we'll come to see a little bit later, that have allowed us to develop these services in a language-agnostic fashion. These services, let us say, loosely collaborate with each other via events. Events are raised when a service determines that something important has happened, and services consume events that may have been raised by one or more other services. They know nothing of the existence of the other services. Our services are not allowed to communicate with each other in request-response patterns, for example.
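The raise-and-consume pattern can be illustrated with a tiny in-memory broker. This is a sketch under stated assumptions: the `Broker` class stands in for the real distributed brokers, and the event names, payload fields and `AlarmService` are invented for the example.

```python
from collections import defaultdict


class Broker:
    """In-memory stand-in for a message broker (not a real broker client)."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        # The publisher never learns who, if anyone, consumes the event.
        for handler in list(self._handlers.get(event_type, [])):
            handler(payload)


class AlarmService:
    """Consumes measurement events and raises an alarm event when pressure is low."""

    def __init__(self, broker, min_pressure):
        self._broker = broker
        self._min_pressure = min_pressure
        broker.subscribe("measurement.received", self.on_measurement)

    def on_measurement(self, payload):
        if payload["pressure"] < self._min_pressure:
            self._broker.publish("alarm.raised", {"location": payload["location"]})


broker = Broker()
AlarmService(broker, min_pressure=20.0)
alarms = []
broker.subscribe("alarm.raised", alarms.append)
broker.publish("measurement.received", {"location": "zone-1", "pressure": 18.5})
broker.publish("measurement.received", {"location": "zone-2", "pressure": 25.0})
```

Neither the code that publishes measurements nor the list that collects alarms refers to `AlarmService` directly; each side knows only the event types it raises or consumes, and nothing is ever requested from another service.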

A

So, just a little bit now on the technologies we use at i2O. At the front-facing side, we use a series of web-based technologies for providing our web presence, but we also use these for our communications with our devices. We also have a distributed cache that we use quite heavily. Even though some of our data technologies are pretty fast, some things are just not fast enough, so we use Redis as a cache.
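One common way to put a cache in front of a slower store is the cache-aside pattern, sketched below. The talk does not say which caching pattern i2O uses, so treat this as an assumption; a plain dictionary stands in for Redis and `load_from_store` stands in for a database read.

```python
class CachedReader:
    """Cache-aside reads: try the cache first, fall back to the store on a miss."""

    def __init__(self, load_from_store):
        self._load = load_from_store  # hypothetical slow read, e.g. a database query
        self._cache = {}              # stand-in for Redis
        self.store_reads = 0          # counts how often the slow store was hit

    def get(self, key):
        if key in self._cache:
            return self._cache[key]
        self.store_reads += 1
        value = self._load(key)
        self._cache[key] = value
        return value

    def invalidate(self, key):
        """Drop a cached entry, for example after new data arrives for that key."""
        self._cache.pop(key, None)
```

Repeated reads of the same key hit the slow store only once until the entry is invalidated, which is the effect that matters for user-facing pages.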

A

At the back, as I mentioned, we have a multitude of data stores, Cassandra being the principal one, where most of our data lives. We also still use a relational database: we use Postgres coupled with PostGIS, the extension for geographic information. We hold the geography and the topology of these networks, so it's quite important that we can navigate that and that the customers can see it. We also use Elasticsearch within our architecture for auditing and for free-format text searching against our events.

A

So, unlike the previous speakers, who were talking about allowing those events to be searched out of their event store, we denormalize our events into our auditable store and we let Elasticsearch index those to allow us to provide free-form searching.

A

This is all glued together with some middleware based around RabbitMQ and distributed brokering in a pub-sub mechanism. There are two logos in the middle, which you may or may not recognize. The one on the right is AMQP, the Advanced Message Queuing Protocol, which is a kind of language-agnostic messaging protocol. The one on the left, MQTT, is an open standard produced initially by IBM for the Internet of Things, and our devices use MQTT to communicate with our platform.

A

The reason we became language agnostic is that all of the events that fly through our system are encoded using Google's Protocol Buffers, which in turn is itself somewhat language agnostic. So any of our services can be written in any language, as long as it can understand AMQP, for which most technologies have client libraries that connect to RabbitMQ, and Google Protocol Buffers, which are fairly ubiquitous across pretty much most programming languages.

A

Within i2O, we still have an awful lot of .NET experience, so a lot of our services are still written in .NET and C#. That's where a lot of our business logic and some of our algorithmic work is written. But we also use Node.js in our ecosystem, largely around the web side, the web backend, and the integration with the services.

A

So how does Cassandra help us? We've been using Cassandra in production since 2011, starting with version 1, and we're now up to version 2.6. It gives us great write performance, but we're not using SSDs. The previous speaker also said that. Because of the nature of this data and the nature of the customers we work with, we have to be in a very secure, high-availability data center. We can't use the cloud; they won't allow us to keep their data in the cloud.

A

So it's not cheap or easy for us to switch hardware and infrastructure, and we've not yet moved over to SSDs, but we think we will soon. We get good read performance. We don't get lightning read performance (we're using spinning disks, by the way), but we use the cache very heavily to help with that from a customer user experience perspective. Cassandra itself has a superb scaling model, which I'm sure we all know about, and those who don't are presumably here at this conference to find out much more about it.

A

As we migrate more and more of our customers, or as we acquire more customers, we can scale Cassandra quite easily by adding more nodes.

A

So we're going to talk a little bit about our use cases. The predominant use case we have is time-varying data. I don't think of it just as time series data, although that is quite a large part of what we do. We also have to track spot events that occur in the water network; they occur at different points in time and we have to correlate them together, so we also take care of those in Cassandra. Tree evolution is something I'll spend a little more time on later, but that's something else we use Cassandra for, and it's extremely helpful in solving a problem that we have.

A

For our algorithm development, we stream data out of Cassandra. We were doing that before they had Spark, so we wrote a lot of our own code to do it, and we've been examining whether Spark is an alternative. We also have our machine learning and optimization algorithms, which pull their data from Cassandra and store their data as needed in Cassandra, to enable them to very quickly recompute and re-optimize the water network.

A

The last one seemed to us at the time a fairly unique use case. I'm not sure if it is today, and I may find out more from others. Within our ecosystem we have a very key feature, which is an auditor service, similar to the event service that was mentioned in the previous talk, which enables us to perform historic replay. I'm going to talk a little bit about that later.

A

So I chose to show some of these by hand rather than using a graphics package. As I say, we've been using Cassandra for some time, so our data models have had to evolve as Cassandra has evolved and, like many people using Cassandra, we've made many mistakes. We're a small group. We don't necessarily always know what the best practice is, and the best practice often changes too. So we start with some of the simpler events that occur within our network, where we are just using combinations of fields as primary keys and clustering keys.

A

These are things such as when channels of data go high or go low, or when a device resets, switches power source, or its battery is getting low. Our devices are low powered. They run off batteries, they're under the ground, and they live for up to five years without being maintained by a human. So it's very important that we know what's going on with them in terms of their power.

A

There are alarms which are raised in the network, and so we have layouts for alarms. As you can see, this was the flavor of the time for these tables in Cassandra. In fact, they weren't even called tables when we first started using it.

A

Time-varying data: this is more recently the work that we've been doing on the measurements that we hold. We hold measurements at different levels in the network, at location level and at area level. We now have the ability to shard the data by time a lot more easily than we were able to previously, and we're a lot better at clustering and partitioning the data in such a way that we can pull it back for our algorithmic work in a much more efficient manner.
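A minimal sketch of sharding measurements by time follows. The table layout, column names and the choice of one partition per location, channel and day are assumptions for illustration; the actual i2O schema is not given in the talk. The CQL is held in a string only to show the shape of such a table.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical table: one partition per (location, channel, day), rows ordered by time.
SCHEMA_CQL = """
CREATE TABLE measurements_by_location (
    location_id text,
    channel     text,
    day         text,
    ts          timestamp,
    value       double,
    PRIMARY KEY ((location_id, channel, day), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
"""


def partition_key(location_id, channel, ts):
    """Return the partition a reading belongs to. `ts` must be timezone-aware."""
    day = ts.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return (location_id, channel, day)


def day_buckets(start, end):
    """List the day buckets a query over [start, end] has to read."""
    day = start.astimezone(timezone.utc).date()
    last = end.astimezone(timezone.utc).date()
    buckets = []
    while day <= last:
        buckets.append(day.isoformat())
        day += timedelta(days=1)
    return buckets
```

Bucketing by day keeps each partition bounded in size, and an algorithm that needs a window of data only has to read the handful of partitions returned by `day_buckets`.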

A

Examples of things that we might record for areas and locations are flows and pressures, the sorts of things you've seen on the charts, but we also have to look at things like the voltages of our devices and GSM signal strength. I didn't mention it, but our devices communicate with our platform over GPRS and GSM mobile phone networks, which are highly unreliable for one thing, and the signal strength grossly affects the energy usage of the device.

A

So it's quite important for us to use some of our modeling to give us predicted views of how the life of the device is going to play out, given the current environmental conditions it finds itself in. Whilst we design them for five years at a particular usage, if a device is in an area of very weak signal strength, it will draw much more energy from its batteries in order to communicate.

A

Temperature also has a dramatic effect, not only on our devices but also on the consumption of water, as I'm sure those of you fortunate enough to live in a country where the weather occasionally gets hot will know. We tend to use a lot more water during those times, and so we use temperature data recorded by our devices to spot correlations and patterns in usage and consumption that would enable us to change our control models and re-optimize.

A

So when we use our algorithms: of course, with Cassandra you are encouraged to denormalize data, and within our ecosystem we also encourage our domain developers to duplicate data. There's no real downside to duplicating data, so our services design the data models that best fit their purpose. When we're doing some of our analysis or optimization work, we have to time-sync our data. What I mean by that is that we have to take in data from various locations within the network.

A

As I say, the minimum is two, and they have to be time-synchronized correctly. Otherwise, the algorithms won't produce the optimal output in terms of pressure savings. We have also found, as time has gone on and as we've worked with larger and larger water utilities, that we have to integrate a lot more tightly with their in-house SCADA systems. Those SCADA systems have to have elements of data quality.

A

So we've had to evolve our data models to include some of these other fields, which are things like the normal values and normal ranges that they would expect to see on that data. This is one of the examples where we use Cassandra indexes, for outlier detection and also for checking the validity of data. This is probably one of the only areas where we use Cassandra secondary indexes. We haven't found them to be of great benefit, I have to confess.
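The kind of validity check that a normal range makes possible can be sketched as below. This is an application-side illustration only, not the secondary-index approach mentioned in the talk; the flag names and the function are assumptions.

```python
def quality_flag(value, normal_low, normal_high):
    """Classify a reading against the normal range stored alongside it."""
    if value is None:
        return "missing"
    if normal_low <= value <= normal_high:
        return "good"
    return "suspect"


def outliers(readings, normal_low, normal_high):
    """Return the (timestamp, value) pairs that fall outside the normal range."""
    return [(ts, v) for ts, v in readings
            if quality_flag(v, normal_low, normal_high) == "suspect"]
```

Storing such a flag with each row at write time is one way to avoid searching for out-of-range values later.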

A

So I talked earlier about replaying, and about having what I thought was a unique feature. It certainly was when we thought of it at the time, but it probably isn't now. Within our system, every event in our ecosystem is audited since epoch. That sounds like a silly thing, but in actual fact, even though we deployed our new architecture in 2011, we still have to deal with data that occurred before 2011, because we get data from sources other than our own devices. Historic data is very important, because for constructing machine learning algorithms you need training data, you need historic data, and therefore data that occurred before your system existed has to be dealt with within our event-driven architecture.

A

We use this feature to enable us to develop new services to go into that ecosystem whilst it is running.

A

This new service appears in the ecosystem, the little newbie service there that we created. When it joins the ecosystem, it announces itself, and in announcing its presence it states which events it is interested in historically and which events it's going to be interested in for the future. Our auditor service then quickly pulls back from Cassandra all of the relevant events that the new service is interested in and historically replays them to it.
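The join-and-replay behaviour can be sketched with an in-memory auditor. Everything here is an assumption for illustration: a Python list stands in for the Cassandra audit store, and the `Auditor` class, its method names and the handler signature are hypothetical.

```python
from collections import defaultdict


class Auditor:
    """Records every event and replays history to services that join later."""

    def __init__(self):
        self._log = []                   # stand-in for the audit store in Cassandra
        self._live = defaultdict(list)   # event type -> live subscribers

    def record(self, event_type, timestamp, payload):
        self._log.append((timestamp, event_type, payload))
        for handler in list(self._live.get(event_type, [])):
            handler(event_type, timestamp, payload)

    def join(self, handler, historic_types, future_types):
        """Replay matching history in event-time order, then subscribe for the future."""
        for timestamp, event_type, payload in sorted(self._log, key=lambda e: e[0]):
            if event_type in historic_types:
                handler(event_type, timestamp, payload)
        for event_type in future_types:
            self._live[event_type].append(handler)
```

A new service calls `join` once with a callback; it first receives the historic events it asked for, ordered by event time rather than arrival time, and from then on receives new events as they are recorded.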

A

This allows the service to catch up and behave as though it had been in the ecosystem since epoch. An example of this would be deploying a new learning algorithm. We can deploy the algorithm into the ecosystem, request historic data of various types from the past, and construct the learning that we require to then test out that service. Other services are completely unaffected by the existence of the new one. The other use case we have for Cassandra is evolving trees.

A

What do I mean by that? Well, within a water distribution network (I showed you the diagram earlier), the customers quite often represent it as tree-like directory structures. When they start off using our system, they might try it on a few areas, a few zones, and so they'll effectively have just a very simple tree of their water utility network, but over time it grows.

A

It grows for lots of reasons. They buy more systems from us. They change things. Networks change: the customer's network gets re-zoned as they take on new customers of their own. Within our solution, data arrives constantly, but it's often late and it has gaps in it, and the reason for that is the mobile GSM communication, which is not very reliable, as I said earlier. So we have to expect data to turn up at any given point in time, and the packets of data that arrive relate to a time that was some way before.

A

So we have to track how the network looked at the point in time that the arriving data relates to, and we have to keep this historic evolution of the network trees over the whole course of time since epoch. That's quite important, because we have to perform certain types of aggregation on our data, not just time aggregation in terms of looking at data every minute, 15 minutes, hour or whatever.

A

We also have to carry out calculations associated with how the water moves in and out of the zones of that network, and that enables us to analyze the network to look for problems within it. If we don't track these trees properly, we run into issues where we have misbalances of water within the network.
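Resolving late data against the tree as it looked at the time can be sketched with a per-location history of parent changes. The class, its method names and the use of plain numbers as timestamps are assumptions made for the example; the talk does not describe how the trees are actually stored in Cassandra.

```python
import bisect


class ZoneHistory:
    """Tracks which parent zone each location belonged to over time."""

    def __init__(self):
        # location -> list of (valid_from, parent), kept sorted by valid_from
        self._changes = {}

    def set_parent(self, location, valid_from, parent):
        """Record that `location` sits under `parent` from `valid_from` onwards."""
        bisect.insort(self._changes.setdefault(location, []), (valid_from, parent))

    def parent_at(self, location, ts):
        """Return the parent zone in force at time `ts`, or None if not yet known."""
        changes = self._changes.get(location, [])
        times = [valid_from for valid_from, _ in changes]
        index = bisect.bisect_right(times, ts)
        return changes[index - 1][1] if index else None
```

A packet that arrives after a re-zoning but carries an earlier timestamp is aggregated into the zone returned by `parent_at` for that timestamp, so the water balance of each zone is computed against the network as it was when the reading was taken.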

A

I'm going to talk a little bit about migration now. We've had to deal with effectively two forms of migration within Cassandra since we started using it. The first one is fairly straightforward and one that most people will no doubt face: schema changes. Sometimes those schema changes are because we adapt our models, we've had changing requirements, we discover things that we didn't know about before, or we improve our knowledge.

A

Or indeed Cassandra versions change, which sometimes requires us to change schema; not very often, fortunately. Our choices of how we do migrations are driven largely by the volume of data that exists in the old schema and how much has to be transferred to the new schema.

A

We can do it via standard extract, transform and load via files or code or tools or some other mechanism, or we can use our architecture. We can use our event-driven architecture and the replay mechanism I explained earlier to enable us to migrate data from one column family to another.

A

We can replay the data, explicitly marked with metadata, which enables that column family data to be migrated and, where appropriate, lets us learn anything new about that data that we're now capturing in the new column family and that perhaps we hadn't thought of when we had the old column family.

A

The second type of migration we have had to deal with, as I'm sure you've seen, is that we have to migrate customers from our old architecture to our new architecture. For this we use either specially written tools or, in certain spots, our event-driven architecture again to pull data from our old system. We actually mimic the devices sending the data to our new platform as though they had been doing that originally, even though they were previously always sending it to our old platform.

A

So we effectively use the same mechanism to drive that data through our new architecture, rather than doing extract, transform and load from Microsoft SQL Server into Cassandra. That's a big advantage for us, because those ETL-type processes are not particularly speedy.

A

The final part, the thing at the bottom, the device switcheroo, is that we have these devices, as I'm sure you're now aware, communicating with our platform. When they're communicating with our legacy platform, we have to remotely tell them to start communicating with the new one, and we call that the switcheroo. So we have a very, very nervous point in time when that happens, when the devices get some new instructions as to which platform to start talking to.

A

So what challenges did we have with Cassandra? The biggest challenge is that we are a small team. We're only eight developers. In fact, we weren't eight to start with; we were only three and we've grown to eight, and we have limited Cassandra knowledge. Maybe some of that has come across today.

A

There's a smallish talent pool. We're based on the south coast of England, in Southampton, and so it's very hard to find people with these skills. Upgrading versions in Cassandra has been a challenge. With minor version upgrades we've generally had no issues, but with major version upgrades we've had some challenges there.

A

Data modeling: it was mentioned numerous times yesterday in the training sessions, and I'm sure it gets mentioned lots in the talks. It's quite a hard subject with Cassandra. It takes a different mindset for people to work with. There are many choices, and those choices evolve over time, as you've seen even in some of our examples. We didn't always get the patterns right. We thought we had, we put them in place, we ran the data through by some of the techniques which I've explained, and it didn't behave the way we expected.

A

We didn't get the performance we were expecting, and so we had to go back and rework, and that's quite expensive for a small team. We are trying to develop this system not for technology's sake, but to save water for the world.

A

So it's quite important that we try to get this right first time as much as we can. I've got less than five minutes left. I was going to talk about some of the things we're going to do in the future, but I think I'd prefer to stop there and ask if anybody has any questions; otherwise we'll get cut off.

A

Yes, sure. Yes.

A

Yes, we did. It was a reasonably large coding challenge, not so much a technical challenge. Actually, Cassandra worked very well. As we switched over, we didn't have any Cassandra issues. It was just busy work really. Yes?

A

So that's a good question. We typically see somewhere in the order of 1 to 2 orders of magnitude. We also use Protocol Buffers between our devices and our platform, and there's a huge amount of compression that we want to gain there, because the more data we send across the mobile phone network, the longer the modem is on, the more energy you use, and the faster the battery runs down. So those are the typical orders of magnitude.

A

Any other questions? Yes, please.

A

Yes, they're deployed as services, so they sit there and react to events. Events include things like data packets that have arrived. They will analyze that data, relearn any characteristics, and reproduce their outputs, which might be control models for the devices. It could be anomaly detection for asset condition monitoring, looking at assets in the water network: are they going to go bad? We heard this morning about health monitoring on people. We also have health monitoring on physical hardware assets that the water companies hold.

A

Oh sorry, I'm being cut off. Okay. Thank you.