Description
OpenShift Commons Gathering in Seattle on Nov 7, 2016 talk by Red Hat's Will Benton on Big Data, Apache Spark, OpenShift and Kubernetes.
Good morning. My name is Will Benton and I'm a software engineer at Red Hat. We've seen this morning how OpenShift makes it possible to develop and deploy applications. I'm going to be talking about a particular kind of application: data-driven applications. I'm going to talk about my team's experience developing analytic applications and using data science at scale, about how our infrastructure requirements changed as we went from doing analytics as a workload to developing, deploying, and maintaining analytic applications, and about how OpenShift makes that possible.
For a little bit of background: for a little more than two years now, I've been leading a team focused on data science in Red Hat's emerging technology group, and we really wanted to figure out what the data-driven applications of tomorrow are going to look like on open-source infrastructure. So we wanted to look for problems inside Red Hat and help people take things that they may have prototyped on a single machine and bring them to production scale.
We wanted to go from models that were sort of black boxes for predictions to user-interpretable models that told you something about the real world and didn't just give you a yes or no. And ultimately we wanted to do all this while having at least as good predictive performance as the existing solutions for the things we were looking at.
So when we started off on this effort, we had a lot of experience with Apache Spark, which is a framework for data processing that I'll talk more about in a minute. We had a bunch of Spark machines in a dedicated compute cluster, we had a clustered, networked POSIX file system in essentially the same rack as our compute nodes, and we were orchestrating everything with Apache Mesos. This worked pretty well when we were just a small development team: we could sort of coordinate with one another, say "hey, I need this many machines for this model I'm going to train," and informally allocate things as we needed to. Where it sort of broke down is when we wanted to take the things we'd built, put them into production, and share our applications and our data with our collaborators. Ultimately, what we ran into is that this was a great way to run analytics workloads, but the problem is that analytics isn't really a separate workload anymore. We really put analytics into production today as part of contemporary data-driven applications.
So what we wanted going forward was a way to use a shared Spark cluster both for development work and for production work, so both interactively and at scale. We wanted an easy way to share our cluster and our work with collaborators, and we wanted a really nice continuous deployment workflow so that we didn't have to do a lot of extra work to update our applications. And we wanted to do all of this without becoming expert system administrators or scheduler-policy ninjas.
So in the rest of the talk, I'm going to introduce Apache Spark for those of you who are just becoming familiar with it, and talk about why it's a great fit for contemporary microservice architectures. I'll talk about some architectures for contemporary data-driven applications. I'll talk about two of the particular issues we had to solve to bring Spark to OpenShift, scheduling tasks and dealing with persistent storage, and then I'll sort of show you how you can get involved and use this stuff yourself.
So we'll start with some background. How many people in here have heard of Apache Spark? Great. How many people have used it before? Great, great. So all this will be review for some of you, but we'll go through it just so everyone has the same context. The tagline for Spark is that it's a fast and general framework for distributed data processing, but that's really only part of the story, right? Fast and general:
that's great, but I think the really compelling thing about Spark is that it's actually easy to use, unlike a lot of other frameworks for distributed data processing. If you think of MPI or Hadoop MapReduce, these other frameworks are based on an execution model that's easy to execute in parallel.
Imagine a conventional sequential collection in a programming language. It contains some values, probably all of the same type, and if we want to distribute it, we have a couple of ways we can do that. We can divide it up into partitions and put each partition on a separate machine. We can divide it into partitions in a bunch of different ways, but some of the most common are taking ranges of contiguous elements and putting each of those in its own partition, or hashing each element and putting everything that lands in the same hash bucket in its own partition.
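These two partitioning schemes can be sketched in a few lines of plain Python. This is an illustrative sketch, not how Spark itself partitions data; the function names and the choice of three partitions are just for the example.

```python
# Illustrative sketch (not Spark code): two common ways to split a
# sequential collection into partitions, as described above.

def range_partition(xs, n):
    """Put ranges of contiguous elements each in their own partition."""
    size = (len(xs) + n - 1) // n  # ceiling division: elements per partition
    return [xs[i * size:(i + 1) * size] for i in range(n)]

def hash_partition(xs, n):
    """Put everything that lands in the same hash bucket in one partition."""
    parts = [[] for _ in range(n)]
    for x in xs:
        parts[hash(x) % n].append(x)
    return parts

data = list(range(10))
range_partition(data, 3)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Either way, every element ends up in exactly one partition; the schemes differ only in which machine a given element lands on.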
Failures, right? We're never going to have failures. So these partitions can go away when the machines that are holding them (or maybe some other machine) go away, but we have a way to reconstruct them, and this is where immutability and laziness really start to pay off. When you have one of these RDDs (resilient distributed datasets), you have some values, either in an immutable collection in the heap of your application program or on stable storage somewhere, that form the basis for the RDD, and you build RDDs up by doing operations on them.
You have two kinds of operations. You have transformations; this is where immutability comes in. These don't actually do anything to the underlying collection: they create a new collection. And because we're lazy, they don't actually create a new materialized collection; they create sort of a recipe for a new collection.
So in this case, we have a very small collection and we want to apply an operation to it, filtering out all of the even numbers. We haven't actually created a new collection here; we've just created a way to say, given this underlying collection, how would we get to a new one? We can keep applying operations on this RDD, say multiplying every element by three, or expanding the collection so that we have twice as many elements, including each element and its successor, and we still haven't computed anything here.
We just have a recipe for how to get the collection we ultimately want. If we do a different kind of RDD operation, called an action, we'll actually get a value out of this. In this case, we'll do an action called collect, which schedules these computations on our cluster and materializes the collection as a sort of array in our application program.
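The sequence above (filter the evens out, multiply by three, include each element and its successor, then collect) can be sketched in a few lines. This is a pure-Python illustration of the lazy "recipe" idea, not the actual Spark API; the class and method names are made up for the example.

```python
# Pure-Python sketch of lazy transformations (an assumed illustration,
# not Spark itself): each transformation wraps the previous recipe
# without computing anything; the collect() action materializes values.

class Recipe:
    def __init__(self, gen_fn):
        self._gen_fn = gen_fn          # zero-argument function yielding values

    def filter(self, pred):
        return Recipe(lambda: (x for x in self._gen_fn() if pred(x)))

    def map(self, f):
        return Recipe(lambda: (f(x) for x in self._gen_fn()))

    def flat_map(self, f):
        return Recipe(lambda: (y for x in self._gen_fn() for y in f(x)))

    def collect(self):                 # the "action": compute everything now
        return list(self._gen_fn())

base = Recipe(lambda: iter([1, 2, 3, 4]))
pipeline = (base.filter(lambda x: x % 2 != 0)    # drop the even numbers
                .map(lambda x: x * 3)            # multiply each by three
                .flat_map(lambda x: [x, x + 1])) # each element and successor
# Nothing has been computed yet; collect() finally materializes the result.
pipeline.collect()  # [3, 4, 9, 10]
```

Note that `base` is untouched by all of this: every transformation returned a new recipe, which is the immutability half of the story.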
Obviously, if we want to use a collection more than once, this is sort of wasteful, so Spark gives us a way to suggest that something is going to be used again and to cache these intermediate results. We can see what this looks like operationally by looking at the high-level architecture of a Spark application. The main application is called the driver, and it interacts with a cluster manager, which schedules executors: basically just microservices that compute partitions of these RDDs. So the driver will serialize a function out to these executors.
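The caching idea can be illustrated with a standalone sketch (again plain Python, not the Spark API): the first action materializes the recipe once, and later actions reuse the stored result instead of recomputing.

```python
# Standalone sketch (assumed illustration, not Spark): caching an
# intermediate result so that reusing it doesn't rerun the whole recipe.

class CachedRecipe:
    def __init__(self, compute_fn):
        self._compute_fn = compute_fn
        self._cached = None

    def collect(self):
        if self._cached is None:          # first action: compute and keep
            self._cached = list(self._compute_fn())
        return self._cached               # later actions: reuse the cache

calls = []
def expensive():
    calls.append(1)                       # track how often we recompute
    return (x * x for x in range(5))

r = CachedRecipe(expensive)
r.collect()
r.collect()
len(calls)  # the pipeline ran only once despite two actions
```

In real Spark the cache is a hint rather than a guarantee (cached partitions can be evicted and rebuilt from the recipe), which is why it stays purely an optimization.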
So this RDD is pretty pleasant to use, but it's really a lot like an assembly language or an intermediate representation for distributed computation. It's general, it's usable, it provides all the primitives you need, but a lot of programmers might want to work at a higher level. And the Spark ecosystem, which is built on this RDD and this scheduler concept, includes a lot of higher-level libraries for special-purpose tasks, since being able to cache these intermediate results is a big benefit for things that we want to iterate over.
You can imagine that a lot of these libraries are things where you have to do iteration: think of graph traversals, think of processing structured queries like with a database query planner, think about optimizing machine learning models. Spark even provides a way to treat a stream of values as many small RDDs, one for each window over the stream, and thus use the same abstraction to program streaming and batch workloads. And out of the box, Spark provides a few different ways to deploy this. You can use a sort of self-managed standalone cluster manager.
You can use Apache Mesos, or you can use Hadoop YARN if you have an existing Hadoop installation. What I want to tell you today is that we can actually run these standalone Spark clusters on top of OpenShift, and there's work going on in the Kubernetes community, and with some people on my team, to actually have native support for scheduling Spark on Kubernetes as well.
Anyone delighted by microservices? Everyone else somewhere in between? Is that fair to say? Yeah, I assume most people would be in between. Microservices aren't a panacea, right: we do get some extra complexity in having to define these interfaces and orchestrate these components, but we get a lot of compelling benefits in return. And one really nice benefit is that Spark is a really natural fit for these kinds of architectures. So just to review why this makes sense: if we have stateless microservices, our services are commodities, and if one of them goes away, we can simply replace it.
But it's a lot easier to say "I want to examine the surface area of this well-defined API" than it is to say "I want to explore some hidden state inside a stateful service and produce a test fixture that will reproduce a bug." This can hold for performance problems too. Those of you who've done large-scale systems performance work know that it's a lot easier to look at an individual component with an SLA
that's not meeting that SLA and say "something needs to change here" than it is to look at a big system and figure out where things are going wrong. There are a lot of social benefits of microservices too, but I'm not going to be the fourth person to talk about those this morning, so we can leave that for later. Another nice thing about Spark is that it's really a natural fit for these architectures. As I mentioned, these executors are really microservices that compute partitions of distributed collections.
They are essentially stateless if we ignore caching, and caching is just an optimization. So they do what they're told, and when one of them goes away, we know how to replace it, because we have the underlying collection and we have the operations to apply to it to get back to that value.
So the good news is that if we want to make a contemporary data-driven application on OpenShift, Spark is a great place to start. But if we really want to take advantage of contemporary application architectures, we need to think about analytics as a component of an application and not just as a separate workload. So let's start by taking a high-level look at what a data-driven application actually has to do.
A data-driven application is really a lot like any other application, except that typically you're transforming and aggregating data from different sources, using that to train predictive models, and then using those predictive models to transform and make predictions from your data. You're saving predictions and raw data to archival storage, and you're supporting a few different kinds of user interface. Maybe you're letting developers or data scientists install new models or modify how models are trained. You have, of course, a typical end-user web or mobile UI.
You have a reporting UI for the business side, and you also have a management interface so that you can tune how your application is deployed and make sure it's performing well. So we'll start our review of architectures by looking really quickly at a couple of good analytics-as-a-workload architectures. The first one is probably going to be familiar to a lot of people: this is the conventional data warehouse architecture.
We think about a stream of events arriving; we're going to transform those somehow, apply some business logic rules, and put them in a database that we've optimized for fast concurrent updates. We're going to call this the transaction processing layer. Stop me if you've heard this before. Now, this is great, because these fast databases can deal with a high volume of events, but they don't have any analytic capabilities. To extend this with analytic capabilities, we're going to periodically send the contents of our transaction processing database to another database
that's optimized for a lot of concurrent reads and complex queries. We can then do some analysis on that: typically, folding multi-dimensional data into a spreadsheet so that it's comprehensible and you can make a report out of it. We can also feed those analysis results back to the business logic for use in the transaction processing half.
Finally, we may want to allow analysts to engage in sort of interactive exploration of our data as well. We call this the analytic processing layer. So this architecture has worked really well for a long time for a lot of applications. The downside is that it's really hard to do this with stateless services, and it's really hard to scale out the transaction processing side.
One approach to actually scaling out compute and storage is the data lake approach that was popularized by the Apache Hadoop project. The idea here is that you have a uniform abstraction for federating all the data you care about: a distributed file system. If you run out of space, you can add more space, and those nodes are going to stick around for a while. When you get application events that you care about, you append them to files in the distributed file system.
Those events aren't going to be in the format you care about for training a predictive model or running your data-driven application, so you're going to schedule some jobs to process them, and the way the scheduling works is that jobs that operate on particular parts of the data will migrate to the machines that store that data. So you get scale-out compute on top of your scale-out storage.
This was a great approach in that it let people really exploit commodity hardware and get scale-out, and the really compelling benefit of being able to store a lot of data was huge for a lot of organizations. The downside is that the programming model is pretty low-level, and it's not really a great fit for the kind of elastic architecture we want to see in a cloud-native application: you can't scale your storage and compute independently
if this is what you're doing. So the legacy architectures we've just seen solved a lot of problems, but they're not the best way to have these kinds of applications in production now, in 2016. We're going to look at some architectures now that are more suitable for contemporary containerized data-driven applications.
The lambda architecture is sort of an interesting approach to modernizing that classic data warehouse. Instead of having two databases, you have sort of two analysis layers: you get a stream of events and you multiplex those both to a stream processing layer and to a distributed file system.
The kappa architecture, in turn, reflects the fact that the way we design streaming algorithms and stream processing systems has changed a lot, and the idea there is that everything is on the queue.
So we have events, and we put them on a message queue with a raw-data topic. We transform those and write them to a preprocessed-data topic. Our analysis jobs simply read from the preprocessed-data topic and write to an analysis-results topic, and then our UI components just subscribe to the analysis-results topic and present those results to users in different ways.
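The topic pipeline described above can be sketched with in-memory lists standing in for broker topics. This is an assumed illustration: the topic names match the talk, but the parsing step and the toy odd/even "analysis" are made up for the example, and a real deployment would use an actual message broker rather than Python lists.

```python
# Minimal sketch (assumed illustration, not a real broker) of the
# everything-is-a-queue pipeline: raw -> preprocessed -> results,
# with each stage reading one topic and writing to the next.
from collections import defaultdict

topics = defaultdict(list)          # topic name -> list of pending messages

def publish(topic, msg):
    topics[topic].append(msg)

def consume(topic):
    while topics[topic]:
        yield topics[topic].pop(0)

# Events arrive on the raw-data topic.
for event in ["3", "4", "oops", "7"]:
    publish("raw-data", event)

# Transform stage: parse/clean raw events onto the preprocessed topic.
for msg in consume("raw-data"):
    if msg.isdigit():
        publish("preprocessed-data", int(msg))

# Analysis stage: read preprocessed data, write analysis results.
for value in consume("preprocessed-data"):
    publish("analysis-results", {"value": value, "odd": value % 2 == 1})

# UI components would subscribe to analysis-results and render these.
results = list(consume("analysis-results"))
```

The point of the design is that each stage only knows topic names, not the other stages, so stages can be scaled, replaced, or re-run independently.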
So this is really a nicer way to do analysis, where everything is a stream, but it does assume that you have a sophisticated stream processing system and that the analyses you want to do are actually expressible as streaming algorithms. Now, we can do a lot of things with streaming algorithms; we can't necessarily do everything, so those are good assumptions to keep in mind.
We know that we can run this in a contemporary containerized environment, but Spark is also general enough to implement the analytic processing side of a conventional data warehouse, the stream or batch processing sides of the lambda architecture, or the stream processing side of the kappa architecture as well. So Spark is really, really general and gets us a lot of benefits there. Now, is it enough to just have Spark as the basis for our contemporary data-driven applications? Well, you can still go wrong.
You can still say "I have Spark and I'm going to deploy it like something that's just analytics as a workload." If you do that, you have one resource manager for your applications and one resource manager for your compute cluster, and you get into this situation where you have to make scheduling decisions in two different places, and you're probably not using your resources as efficiently as you can.
A better way to do this is to schedule one Spark cluster, either logical or actual, per application, so that your unit of scheduling granularity is actually the application.
Then your resource manager is able to make the right decisions for each application, making sure that application components are co-scheduled with the compute they depend on. The compute requirements of our applications can change over time, though. In this example, we have an application with a huge task backlog but only a single Spark executor. Now, in a conventional standalone Spark deployment,
we can use Spark's support for dynamically allocating resources to temporarily give it more executors. We can do something similar here, but since we're wired into OpenShift, we actually have a service that reads the metrics from our running Spark application and says "give me some more nodes to run on." So we can extend our elasticity a layer below the Spark cluster, into the control plane.
So, in my team's original deployment, we had a Gluster file system living in the same rack as some of our compute nodes. POSIX file systems are really great, right? You can run grep on them; you can do all of these kinds of things that you're familiar with. But this was a problem for us from a management perspective, because we wanted to share our data with other users.
Managing access control became sort of a pain, and it also meant that we were depending on a path rather than a service. Another option that people use for storage is to actually have HDFS as a peer to their application scheduler. This can work pretty well, but you don't get that advantage of co-locating your storage and compute, which is something that HDFS depends on to some extent for performance.
So I want to talk about a different solution for storage, which is just using object stores. There are a lot of different implementations of the S3 API: Amazon's, obviously, the Swift project from OpenStack, and Ceph. You get fine-grained access control with these, and you can access them from a range of libraries.
Now, the idea that you need to co-locate your storage and compute was sort of inspired by the popularity of HDFS, and it's an interesting assumption that I want to take a minute to address. You can scan these QR codes and read these papers in full, and I'd encourage you to do that. But one of the things that you see in contemporary workloads is that a lot of analytic processing workloads actually do fit in memory once you've preprocessed your data.
Finally, frameworks that are designed to depend on having local storage may make some assumptions about how expensive storage is. Now, we know that writing to physical disks isn't actually cheap, but people assume that it is, and so you get cases like this one, where you have a distributed file system used to glue jobs together and you're spending almost a third of your time just moving temporary files around.
So really, the other thing to consider about storage is that you're reading that big data from your storage once, and then you're preprocessing it and operating on medium data that's going to be cached as you train your model. We're spending most of our time operating on data that fits in memory. In this case, we could optimize the part where we're reading the big raw data off of storage, but that's going to get us diminishing returns compared to making the in-memory part faster.
So you may not need to optimize by co-locating compute and storage; there are both practical and philosophical reasons why you might not need to do this. But if you do, it's possible to get rack or data-center locality pretty easily without sacrificing a lot of flexibility, and there are smart people working on making it possible to get even same-machine colocation without sacrificing a lot of flexibility either.