From YouTube: OpenShift Case Study: Rackspace - Greg Swift, Rackspace
Description
OpenShift Commons Gathering, December 5th, 2017, Austin, Texas
Greg Swift, Rackspace
Greg Swift: So, my name is Greg Swift. I've been at Rackspace for about five years. I actually used to work with Joel at the US courts before coming to Rackspace, and I'm excited to hear that group is doing such good work with OpenShift now. For anybody that doesn't know Rackspace well, an overview. These are the things I'm going to talk about: a quick overview of Rackspace, what we do, what we need, how we're going to get there, and then the kinds of things we learned along the way with OpenShift.
So first off, Rackspace is really about managed services and providing fanatical support. It seems like we're selling a lot of individual products, but really, at the end of the day, that's what we're trying to provide, and we do that over a huge breadth of products. Pretty much, if you want it, we try and provide it.
This can lead to some problems for us, because it means that we are a collection of hundreds of IT departments, all highly skilled, highly intelligent people trying to make things happen as fast as possible inside their own domain. So you've got the AWS guys over here, the other guys over there, and they're all just going their own way.
We end up following a lot of good practices, a lot of best practices. If you want an example of a best practice, come to us and we will find one of our thousands of people that knows that practice really well. But what that means for us as an enterprise is that it makes the internal use of products a little bit more difficult.
It can feel like you're changing companies if you switch from one operations team to another. One of the groups that I work with supports several hundred apps, and they are switching companies every 15 minutes sometimes, if there's something big going on. And then compliance time can be a mad rush, because of that: 200 different variances to accomplish the same thing.
So what we needed was to be able to come back and say: okay, for the internal things, for the things that are not our bread-and-butter support, the services that we're providing out to our customers, how do we solve those problems? What we needed to do was realize that best practices needed to find their way up to a standard practice: here's the commonality that we need to be following.
We need to get to where all the people that don't need to be managing the entire stack have a good option for somebody else to do it for them, and then realize that not everybody's going to get their problem solved. You're still going to have that 10% that's running off to the side, and that's not necessarily a bad thing. That's where innovation can happen. That's, you know, sometimes just the cost of doing business.
So we can go further as a company together. Because when you're working for a company, that's who it's about: technically it's about that company, making sure the product is good for them, and not hurting your co-workers, who are part of that company. So we can go further together instead of faster apart.
So, our goals: developers are the SMEs, so let them be the SMEs. Let them know about prod; let them know how prod runs. We were trying to get to a point where we can just say developers have access to prod, even in compliant environments. We had to implement some significant controls to make that acceptable, but it is possible. Get operations out of that path; make it so that the dev team, your product team, at that point just doesn't have to worry about a standalone operations team.
A really fancy way to say: whenever PCI comes around, we can give them that report a lot quicker, with a lot less resources, and then actually move faster. Because the trick about going further together is that once you get past a certain point in that race, you're actually going faster as well. You've got to find that point, but you will get there as long as you follow through. So how are we getting there? For IaaS, we decided to utilize one of our largest IaaS products.
Rackspace does several, as I mentioned earlier, or as I had on the slide earlier. What we went with was Rackspace Private Cloud powered by VMware. It's one of our larger products, and so it was an easy win: we have a lot of internal support for it, we've got a lot of experts on it, and we've been providing that product for probably almost the life of the company. So then the only real problem became how to stay ahead of demand, because everybody needs a place to put their stuff.
So then, our first pass. I thought I had updated the top of the slide for a nice little pun, but apparently not: our first "casa de PaaS" was actually started about two years ago. It was an in-house app written in Ruby called Maestro, and it was built on top of Marathon and Mesos. It was intended to be very Heroku-like: buildpacks, curls to the APIs, those kinds of things. It worked for the most part, but when you have developer churn, then you have a hard time.
We didn't have a team supporting it after a year, and once we did start getting more resources into it, it was still like, well, maybe going to OpenShift is a better idea. So we went on our second pass, and we started building out an OpenShift environment. We're working on our third region right now. We started off with 1.4 and upgraded to 1.5. That was, unfortunately, a painful upgrade for us, primarily because of logging and some custom changes that we had internally.
So we haven't gone to 1.6 yet; we're about to try that out. Storage was a little bit of a hiccup for us as well. We started with GlusterFS, but Elasticsearch did not like it for the aggregated logging. I didn't see anybody else complain about that, so I don't know if it was still something we were doing, but we moved that to Ceph. We still occasionally run into issues with that, and we're going to just move Elasticsearch outside of the cluster.
Jenkins did very much become a top consumer, both in number of instances and actual resources. I think we had a couple that had a minimum memory footprint of four gigabytes for their app. But the successes: we had a new ticketing API, at a demo stage right now, that was able to get all the way out to production within a couple of months with minimal operations involvement, which has been great. And several months ago our QE team migrated over their testing for our internal identity system.
They put 15 million requests from that testing suite through within a couple of days. The guy who implemented it was very happy and impressed with that; he's sitting over there somewhere. So right now we're at a couple hundred projects. Half of them are sandbox playgrounds, and about 15% are CI/CD projects. We've only got one customer-facing production system on it right now. We've got several production services that technically aren't production as far as I'm concerned.
So, some lessons learned. These were just some points that I thought it'd be nice to share, especially if you haven't done this before, as we ran through the things we ran into. It took a while to fully learn this lesson: in the Ansible inventory you've got the LB nodes, and because the routers are similar to the LB, and because they both run HAProxy, it's real easy to just kind of sit in your head and go, oh, they're the same thing. And they're not at all.
By default, the routers are pods running HAProxy that run on any nodes that are inside your router selector, which defaults to the infra region; basically, by default there's an infra region. If you don't have schedulable nodes in that infra region (say you just have your three masters and they're all set to unschedulable, because that's what the instructions tell you to do), you're never going to get anything running.
It took me like a week to figure out that that's why those pods weren't coming up. Once you add additional nodes into that infra region that can be scheduled on, those will come up. Where I actually ran into the problem was: I had two nodes, but the default for the router replicas is five, I think, and so with only two nodes it just was never coming up.
Once we went in there and shifted that down to two, everything was fine. So in our hosts inventory there's a nice big comment section now that says, you know, make sure that router replicas is no more than the number of nodes in the router region. We set aside a separate region for the routers.
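For anyone setting this up, the pieces involved look roughly like this in an openshift-ansible 3.x inventory. This is a minimal sketch with hypothetical hostnames and counts, not Rackspace's actual inventory:

    # Sketch of the relevant openshift-ansible (3.x) inventory settings.
    [OSEv3:vars]
    # Router pods only schedule onto nodes matching this selector.
    openshift_hosted_router_selector='region=infra'
    # Keep this <= the number of schedulable nodes in that region,
    # or the router deployment never finishes rolling out.
    openshift_hosted_router_replicas=2

    [nodes]
    master[1:3].example.com openshift_schedulable=false
    infra[1:2].example.com openshift_node_labels="{'region': 'infra'}"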
So right now our two primary LBs, which run the un-containerized HAProxy, are also running the router. I'd like to change that at some point and just keep them completely separate; I think it would be easier to manage over time, because you have that distinction of what they are. Quotas: one of the things that I'm happy we did was start off with quotas from the get-go. Every project that you create gets a very default, kind of minimal, quota.
We don't really put a high barrier to entry on requesting a higher quota, except that we prefer to only give them to you if you are following our conventions for naming and such, to prove that it's not just your personal playground. But even if it is your personal playground, if you want a higher quota, we're likely to give it to you. We just kind of want to keep a lid on things; we're not trying to be overly restrictive.
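An object-count quota of the kind described, limiting how many things a project can hold without yet touching CPU or memory, looks roughly like this. The numbers are illustrative, not the actual defaults from the talk:

    # Illustrative default project quota: object counts only, no CPU/memory.
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: default-quota
    spec:
      hard:
        pods: "10"
        services: "5"
        replicationcontrollers: "10"
        persistentvolumeclaims: "4"

Applied per project, for example with oc create -f quota.yaml -n some-project.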
The one thing that we didn't include from the get-go, or rather tried and didn't end up implementing, was resource limiting, like CPU and memory; you can add those into the quotas. Instead, we just restricted the number of items that you could have: the number of pods, the number of storage containers, things like that. When we added the resource limiting, anybody that went to go load a new app failed, because the QuickStart templates don't have any default resource requests.
And if your template doesn't request the resource, then it fails. So, with laziness and time and all of what was going on at the time, we were like, okay, well, we'll revisit that later, because it means we have to go edit all of the templates that came with OpenShift to include those requests. We've got a story on our backlog to go implement that everywhere, and it does work; we did play with it a little bit on something that I'll be getting to in a second.
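One common workaround, sketched here rather than claimed as what Rackspace shipped, is a per-project LimitRange that injects default requests, so templates that omit them can still schedule under a compute quota. Values are hypothetical:

    # Hypothetical LimitRange: fills in requests/limits that templates omit.
    apiVersion: v1
    kind: LimitRange
    metadata:
      name: default-limits
    spec:
      limits:
      - type: Container
        defaultRequest:        # used when a container omits resources.requests
          cpu: 100m
          memory: 128Mi
        default:               # used when a container omits resources.limits
          cpu: "1"
          memory: 512Mi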
I don't remember why we hard-coded it in here, but that's based on the documentation, the size of the nodes, and the number of pods that they can handle; our nodes are pretty small, intentionally. So then, the garbage collection thresholds, high and low. What this is: the local image repository on each of the nodes takes up a certain amount of disk. The high threshold is where the garbage collection kicks in, and then it tries to clear out until it's lower than the low threshold. Fairly easy.
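On OpenShift 3.x those thresholds are kubelet arguments in the node config; a sketch with example percentages:

    # /etc/origin/node/node-config.yaml (excerpt) - thresholds are examples
    kubeletArguments:
      image-gc-high-threshold:
      - '85'   # disk usage % at which image garbage collection starts
      image-gc-low-threshold:
      - '80'   # GC deletes images until usage drops below this %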
We've actually still seen that error once or twice, but it's very rare now, so it's definitely something to keep an eye on. Then the other one: our first major incident came from nodes starting to OOM-kill on us. It was decidedly not fun. We didn't have system-reserved defined, or the secret driver; I'm not a hundred percent sure the secret driver actually has to be in there, reading through the docs.
We thought all three of those bottom ones needed to be there, but the bottom two actually break origin-node when they're there. It's worth it, though; I left them on the slide so you can see: don't add those, because they will stop origin-node from working. Basically, the goal there is to reserve an amount of memory on the system so that OpenShift doesn't kill itself, which is what happened.
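The reservation itself is another kubelet argument in the node config; a minimal sketch, with a made-up size that you would tune to your nodes:

    # /etc/origin/node/node-config.yaml (excerpt) - reservation size is hypothetical
    kubeletArguments:
      system-reserved:
      - 'cpu=200m,memory=1Gi'   # held back for the OS and system daemons,
                                # so pod pressure can't OOM the node itself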
This is where we actually played with the resource limits inside our quotas. Red Hat has put together this awesome set of resources, and they've gone around the country, probably the globe, giving free workshops where you can come in. I totally thought it was going to be a sales pitch, and I went in and we got to sit down and do all-day labs. It was amazing; it was so much fun. Excuse me. The content is all out on a public GitHub, and it's fairly easy to kick off your own version of it.
I run this internally, and it's just up and running now, and we have a special quota set aside that has resource limits. The roadshow that they did in San Antonio, we blew out their reservations, and halfway through the day we overloaded their system. So I had that in mind when I went to go do the big version of it internally at Rackspace.
We made sure we had pretty good resource quotas in place before we let people on it, and we were able to handle a good hundred-plus people, which was about what was in the roadshow, without it affecting any of our production workloads or anything. We ran it on our main system, and it was pretty good. So internally, we've got several teams that are working on using Helm to manage things, basically trying to provide a little bit more composable templates for reuse.
This is not fully embraced for everything yet, mainly because Helm is single-tenant at this point, but there is work upstream to change that. It does appear that this is eventually going to be a little bit more of a thing, and we've seen a lot of success with the teams that have been using it. If anybody wants to talk to one of the people that uses it, come find me and I'll introduce you.
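As a concrete picture of what "composable templates" means here, this is a toy Helm chart excerpt: one shared template, reused by pointing different values files at it. Names and values are hypothetical, and on older clusters the Deployment apiVersion may differ:

    # values.yaml (hypothetical, overridden per team)
    replicaCount: 2
    image:
      repository: registry.example.com/myapp
      tag: "1.0"

    # templates/deployment.yaml (renders per-team values into one shared shape)
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: {{ .Release.Name }}-myapp
    spec:
      replicas: {{ .Values.replicaCount }}
      selector:
        matchLabels:
          app: {{ .Release.Name }}-myapp
      template:
        metadata:
          labels:
            app: {{ .Release.Name }}-myapp
        spec:
          containers:
          - name: myapp
            image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"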
We had deployed using a 10-dot network. Well, we already use 10-dot everywhere inside; we're a big hosting provider with big private networks, and we're using pretty much all of 10.0.0.0/8. I had deployed all of our POCs using 172, and then I went to go do prod and we deployed it with 10-dot, and the night before, my coworker goes...
That could lead to some weird wonkiness here and there.
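Those ranges are fixed at install time in the inventory, which is why this is painful to discover late. The variable names below are from openshift-ansible 3.x; the values are examples, not the ones from the talk:

    [OSEv3:vars]
    # Pod overlay network - pick something clear of the data center's 10.0.0.0/8
    osm_cluster_network_cidr=172.16.0.0/14
    # Service (portal) network
    openshift_portal_net=172.30.0.0/16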
Another thing that bit us in the long run: we deployed using openshift-ansible, obviously, and when we went to go extend it, we had no idea what hash we had deployed from. We went to go deploy again using whatever was current, and things acted weird, because the cluster was not exactly at the right state. For whatever reason, we eventually got to a point where we don't know what hash we're working with. So definitely keep track of that; it's also helpful.
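One low-tech way to do that, purely illustrative:

    # Before each deploy, record which openshift-ansible commit you ran.
    cd openshift-ansible
    git rev-parse HEAD | tee -a ~/deploy-log/openshift-ansible.hashes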
We've also been working to start handling a lot of our post-deployment changes, like adding quotas to things, using the Ansible oc module, so it's a lot more programmatic. Instead of somebody just doing oc create over a bunch of files that are in a repo, it's at least a little bit more automatic, even though it's the same result.
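A sketch of what that looks like as a task. This uses Ansible's k8s module, whose state/definition interface I can vouch for; the oc module mentioned above has a similar declarative shape, and the project name here is hypothetical:

    # Playbook task: declaratively ensure a quota, instead of 'oc create -f'.
    - name: Ensure the default quota exists in the project
      k8s:
        state: present
        definition:
          apiVersion: v1
          kind: ResourceQuota
          metadata:
            name: default-quota
            namespace: my-project   # hypothetical project name
          spec:
            hard:
              pods: "10"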
And then I left that last line in the wrong spot, so: I'm Greg Swift, and those are the ways to get a hold of me. So, thank you.
Moderator: I loved the shout-out in this to the Roadshow stuff, because that's been one of the things that we've used to get people started really quickly, and it came out of the Evangelist team. So it's great that you're taking advantage of it, and I hope other people will too. Does anyone else have any questions for Greg while he's still standing? There's one over here.
Greg Swift: So, Rackspace historically has been about building up the internal knowledge space. We explore a lot of options and avenues over time, and we'd been having conversations with Red Hat.
A big part of it was just getting started and ramped up, knowing that we were going to be building out expertise on it internally and hoping to be a contributor back to the community.
[Several short untranscribed exchanges with the audience.]
Greg Swift: We want to look at having the Federation layer going on, but basically we're taking the approach of, kind of, if you go to use AWS: they don't sync your products between regions; you still go deploy your stuff to them. So basically, we don't sync. We will worry about making sure templates are there and all those other things are there, but we're not helping anybody make sure that their application is deployed across multiple regions.
Moderator: Any final questions for Greg? He will be here this afternoon and through all of KubeCon too, so please reach out. I will set up my laptop in the reception this evening, while we're all drinking beer, and anyone who wants to get on the Slack channel, I will sign them up, so come and find me. All right, thank you very much, Greg.