From YouTube: OpenShift Storage Orchestration with NetApp Trident
Description
Many functions in Kubernetes are implemented using controllers that watch events on API objects and take appropriate actions as a result. This talk explores the use of external controllers for storage orchestration, particularly for capabilities that are experimental, vendor-specific, or not yet supported by the existing Volume Plugin API. As a case study, this talk presents Trident (https://github.com/netapp/trident) and demonstrates how it enabled dynamic storage provisioning for heterogeneous backends and cloning of persistent volumes before Kubernetes could support such capabilities. The talk also includes some of the lessons learned from deploying Trident in production.
Hi everyone, my name is Garrett Mueller. I'm a technical director at NetApp, and today I'm going to be talking about Trident, our storage orchestrator for OpenShift. Trident is an open source, fully supported integration with OpenShift that we've had for a couple of years now. I'm going to get fairly deep into how it works, so that you have a better understanding of how that's done, and I'm going to talk a little bit about what we've learned just from running an application in OpenShift ourselves, which we've had to do. That's how I know it works.
So what is Trident? It is, like I said, an open source integration with OpenShift. It also integrates with Kubernetes itself, and with other orchestrators as well; it's our one-stop shop for any kind of container orchestration. Its intent is to enable on-demand, self-service volume provisioning and storage and data management. If you're familiar with the Kubernetes paradigm, you create PVCs (we'll talk a bit about that in a second), and Trident will go ahead and manage the entire process for you after that.
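As a rough sketch of that paradigm, a PersistentVolumeClaim looks something like this (the names, size, and class are illustrative, not from the talk):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: gold
```

The user submits this claim and never deals with the storage infrastructure behind it.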
So it's all about automation, and it's about getting out of the way of the engineers, the developers, the users that want to take advantage of these platforms, because they don't care about the infrastructure underneath, and they shouldn't. We should be able to provide the capabilities they need at the speed that they need them. And the second important bit is: how do we provide those capabilities in such a way that we can also protect the infrastructure from those users, so that they don't overrun what's available?
So you can actually call our 1-800 number and get support for Trident, even though it's an open-source integration. Some interesting facts: it shipped in 2016, so it's been around for quite a while, especially in this space, and it's based on technology that we built in 2014. We started off with some Docker integrations very early on. It still supports Docker, but more and more our customers are moving in the direction of OpenShift, and so we have to make sure we support that as well.
So it has a really comprehensive understanding of what the cluster is doing and what people want from it, a lot more than you would get out of a standard plugin, and it supports all of our storage platforms; in fact, you can support them all at the same time. So I'm actually going to go pretty deep here into how it works.
This is Trident and how it's deployed: it deploys as a Deployment in OpenShift, and it runs as a pod inside the environment. It has its own backing etcd store to keep track of the metadata it needs to keep track of: which volumes it has provisioned, how they're going to be managed, and so on and so forth. You can see we have frontends there: the Kubernetes plugin, a Docker plugin, and a prototype CSI plugin (CSI is the Container Storage Interface, if you're familiar with that). A big part of what my team does at NetApp is work in open source, in open source communities.
We work upstream in Kubernetes, and we work upstream with the CSI community as well, helping evolve these paradigms. On the south side you can see all of our different storage platforms. We also now support Cloud Volumes. You may not be aware of this, but if you're familiar with NetApp, you know that we're known for the data center: all those big iron boxes.
We still do that really well, but we also have virtualized versions of all of our storage, and now we have Cloud Volumes as well. We are the first-party NFS service in Azure: if you actually go to Azure and get NFS storage, you're getting it from NetApp, because we're running the service for them. It's not a Marketplace add-on or something like that; it's a native service from Microsoft. We also just announced this morning that we're doing the same thing with Google.
A
Ask
Google
cloud
about
that.
We're
providing
NFS
service
for
Google
as
well,
so
tridon
can
actually
orchestrate
the
provisioning
and
management
of
all
of
the
different
kinds
of
storage,
and
do
it
all
at
the
same
time.
So
you
can
imagine
that
over
time
we
start
to
understand
a
lot
about
what
the
storage
is,
where
it
is,
what
you're
trying
to
do
with
it
in
all
of
these
different
environments.
So
we
start
adding
capabilities
like
we're,
the
first
ones
that
were
able
to
unlock
the
ability
to
use
native
cloning
in
open
ships.
A
So
what
you
can
do
is
actually
in
your
PVC
when
you're,
creating
a
volume
request,
you
can
say
I
want
a
PVC
from
an
existing
PVC
under
the
covers
for
the
platforms
that
support
it.
We
can
do
an
instant
clone
of
that
data
set,
no
matter
how
big
it
is.
So
it's
very
powerful,
especially
in
like
CI
CD
scenarios,
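A clone request of that kind was expressed as an annotation on the new PVC; this sketch uses the `trident.netapp.io/cloneFromPVC` annotation from Trident's documentation of that era, with illustrative names:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: db-data-clone
  annotations:
    trident.netapp.io/cloneFromPVC: db-data  # source PVC to clone from
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: gold
```

The new claim is satisfied by a backend-native clone of the source volume rather than an empty volume.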
So, one level deeper, just as I was saying: the Trident controller runs as a pod in the OpenShift cluster, in a Deployment, which is how it gets its HA.
A
It'll
mount
the
storage
when
necessary
with
the
individual
pods
that
require
that
storage
and
and
then
we
can
go
ahead
and
manage
the
entire
lifecycle
of
that
storage
on
the
other
side.
So
it's
very
powerful,
but
also
completely
invisible
to
the
end-user.
All
they
see
is
I
asked
for
a
PVC
and
I
got
a
volume
right
away,
just
like
they
would
expect
if
they
were
running
in
a
cloud
platform.
A
So
the
way
that
storage
classes
work
is
you
can
have
a
gold,
a
silver
and
a
bronze
class.
For
example,
you
can
name
them
whatever
you
want
and
the
way
this
works
is
you'll
actually
configure
Trident
as
a
tacuba,
Nettie's
or
OpenShift
administrator
inside
of
OpenShift
itself,
and
then
tell
us
what
these
pools
means.
So
gold
might
mean
it's
on
SSD.
It
might
mean
that
it
has
an
eye
ops
level
of
let's
say:
15,000
IAP,
20,000
opps,
or
something
like
that.
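A "gold" class along those lines might be modeled like this; the provisioner name matches Trident's pre-CSI provisioner, and the parameter names are illustrative of the attributes Trident matched on, not an exact schema:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gold
provisioner: netapp.io/trident
parameters:
  media: ssd        # back the class with SSD pools
  IOPS: "20000"     # expected performance level
```

Users only ever reference the class by name; the parameters stay invisible to them.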
There are going to be properties, metadata inside those storage classes, that the users don't see. All they see is the name. We then figure out which pools of storage on the other side match those requirements: it could be cloud storage, it could be on-premises storage, it could be anything that matches. Trident will provision one of those pools for the user on demand and handle all of the automation of that.
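The pool-selection idea just described can be sketched in a few lines; this is a toy illustration of matching a class's required attributes against discovered pools, not Trident's actual code, and the attribute names are made up for the example:

```python
# Toy sketch: match a storage class's requirements against discovered pools.
# Attribute names ("media", "iops") are illustrative, not Trident's schema.

def matching_pools(pools, requirements):
    """Return names of pools whose attributes satisfy every requirement."""
    matches = []
    for pool in pools:
        attrs = pool["attributes"]
        if all(attrs.get(key) == value for key, value in requirements.items()):
            matches.append(pool["name"])
    return matches

pools = [
    {"name": "ontap-ssd-pool", "attributes": {"media": "ssd", "iops": 20000}},
    {"name": "ontap-hdd-pool", "attributes": {"media": "hdd", "iops": 2000}},
    {"name": "solidfire-pool", "attributes": {"media": "ssd", "iops": 20000}},
]

gold = {"media": "ssd", "iops": 20000}
print(matching_pools(pools, gold))  # both SSD pools qualify
```

The real orchestrator does the same kind of filtering across every configured backend, whether on-premises or in the cloud, and then picks one matching pool to provision from.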
So this is how the volume creation works. There's a user here on the left-hand side, Trident running in the middle, and on the backend I've got an ONTAP and a SolidFire system, which are two different storage platforms that NetApp provides. The administrator configures storage backends in Trident: we add those backends to it through Trident's REST interface, basically just POSTing those configurations to Trident, and it will ingest them and auto-discover the capabilities of those boxes.
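A backend configuration of that sort is a small JSON document; this sketch uses the ontap-nas driver's field names from Trident's documentation of that period, with placeholder values rather than anything from the talk:

```json
{
  "version": 1,
  "storageDriverName": "ontap-nas",
  "managementLIF": "10.0.0.1",
  "dataLIF": "10.0.0.2",
  "svm": "svm0",
  "username": "admin",
  "password": "secret"
}
```

Once POSTed to Trident, the backend's pools and their capabilities become candidates for the storage classes defined on top of them.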
Then they define storage classes in OpenShift that consume those backends, again very generically. Think of a storage class as a generic way to model storage, while the backends capture the specific requirements of individual backends, individual storage platforms. We then detect that the user created a PVC, automatically, because we're listening to the API server, and we find the storage pools that match the class they asked for. So in the PVC, they're going to say: I want a gold class.
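The controller pattern described here (watch the API server, react to PVC events) can be sketched in miniature. This is just the general shape, with a plain list of fake events standing in for the real Kubernetes watch stream and a stub in place of actual provisioning:

```python
# Minimal sketch of the watch-and-react controller loop.
# A real controller streams events from the Kubernetes API server;
# here a plain list stands in for that stream.

def handle_events(events, provision):
    """Dispatch ADDED PersistentVolumeClaim events to a provisioning callback."""
    provisioned = []
    for event in events:
        if event["type"] == "ADDED" and event["kind"] == "PersistentVolumeClaim":
            pv_name = provision(event["name"], event["storage_class"])
            provisioned.append(pv_name)
    return provisioned

def fake_provision(pvc_name, storage_class):
    # Stand-in for pool selection and backend volume creation.
    return f"pv-{storage_class}-{pvc_name}"

events = [
    {"type": "ADDED", "kind": "PersistentVolumeClaim",
     "name": "db-data", "storage_class": "gold"},
    {"type": "MODIFIED", "kind": "Pod",
     "name": "web", "storage_class": None},
]

print(handle_events(events, fake_provision))  # only the PVC event is handled
```

The point of the pattern is that the user never calls the provisioner directly; creating the PVC object is the entire interface.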
They have no idea that it's Trident actually providing this, and in fact you can have multiple provisioners running at the same time. Trident will go, "okay, it's a gold class; that means this kind of storage," and it will create a volume. In this case we picked the SolidFire array over there, and they asked for 10 gigabytes, so we gave them a 10-gigabyte volume.
The PV we gave them happens to be iSCSI, but again, the user has no idea that it's iSCSI underneath. All they know is that it looks like local storage inside the container in the pod that's running, and that it's the gold class of service, so there's a certain expectation of performance there. There might also be an expectation of backup and recovery times, like the RPO (the recovery point objective) and things like that; you can model different concepts in those classes.
Then we hand it off to Kubernetes or OpenShift, and it goes ahead and mounts it. So that's a high-level understanding of how it works under the covers. It goes pretty deep, and if you want to go deeper than that, you can come talk to me over at our booth. But I want to talk about one thing we learned while we were going through this process, which is that simplicity in this kind of environment turns out to be really hard.
We started with this interesting problem with Trident, because Trident itself has that etcd store underneath it that you saw, and we actually needed a provisioned volume to store our own metadata. We thought, okay, we're going to have to document how you basically do, by hand, all the automation that Trident does, the first time you install Trident, so that from then on we can do it all automatically. That seemed really stupid, because why would we want to document the whole process that we're automating in the first place?
So we came up with what looks now like a fairly complicated process: we created an install script in bash that creates multiple pods and hands off everything. It launches a launcher pod, which then launches an ephemeral pod. This ephemeral pod is actually Trident without the etcd requirement, and it creates a PV that we then run the real Trident pod on top of, landing the running Trident instance on the PV that we created. Now, the problem with this is, I mean, it's cool; it's cool when it works.
The problem is that all OpenShift environments are different. A lot of people are standing one up for the first time and don't know what a broken cluster looks like, let alone some of the interactions that are required here. So what you end up with is a bunch of logs for each one of those pods, plus the logs for OpenShift as well, and the debugging process when that goes wrong is really onerous.
We tried to create easy ways for you to gather all these logs and parse them, but there just was really no good way to make it easy to diagnose what went wrong. Usually it's just one setting somewhere, or something wasn't quite working right in the cluster in the first place.
So we recently reinvented this whole thing; we've rewritten it a couple of times, actually, and this is our third attempt. The installer is now written in Go, and it starts Trident itself and runs the whole process all by itself, so it doesn't try to run it all in the cluster and then do all this complicated handoff. It just runs it itself: it brings up Trident in this ephemeral mode.
A
It
provisions
a
volume
for
it
all
the
logs
are
then
centralized
and
it
all
it
kind
of
reads
out
the
way
you
would
expect
through
one.
You
know
interaction
with
the
console
and
when
something
goes
wrong,
it
happens
right
away
and
you
notice
right
away,
and
you
can
kind
of
see
it
right
in
front
of
you.
We just released that about a month ago, and it's been a lot better, a lot easier to diagnose. But these are the kinds of problems that we're dealing with on a regular basis, just keeping things simple, because in these desired-state environments you get a lot of different processes running in different places, and just being able to understand where one thing ends and another thing begins, and where failures might be occurring in an application that you don't know, can be a real challenge.
So we do everything we can to try to make this easier. We've also done Helm charts for this; we can do Helm and templates and whatnot with it. But provisioning our own storage inside of that environment is still a challenge, especially with this kind of complicated workflow that we need at the very beginning. So we're still working with the community to try to make it easier for us to do that, because we would like a one-click-install version of it.
So that's pretty much all I had for this overview presentation. This is our website, netapp.io; that's where you can go to find out all about our open source technologies and our integrations with things like OpenStack, Ansible, and OpenShift. We actually have a Slack channel, so if you want to communicate with us directly, we're all out there and you can chat with us; we'd love to talk to you some more. That's all I have. Thank you for your time. If you have any questions, I'll be up here.