Case Study: OpenShift @ UPS
Freddy Montero (Red Hat)
Kevin Chiang (UPS)
at OpenShift Commons Gathering 2019 at Red Hat Summit
Freddy Montero: Hello, my name is Freddy Montero. I'm an architect with the container practice at Red Hat. How many of you buy stuff online, especially from Best Buy? Anyone? And how many of you check the tracking of your packages at ups.com? Of course, that tracking comes through the OpenShift clusters that we have at UPS. What we're going to be talking about today is our journey of doing OpenShift upgrades without any downtime, going from OpenShift 3.4 to OpenShift 3.11 and beyond.
Kevin Chiang: Let's go over the agenda. First, UPS: who we are and what we do. Then the background and, like Freddy mentioned, our journey with OpenShift: how we started and why we picked OpenShift. Then the monthly operating system patching, the upgrade path from 3.4 to 3.9 and now from 3.9 to 3.11, and some of the lessons we learned along the way. And finally, the accomplishments and roadmap.
UPS: what do we do? What can Brown do for you? As you can see on this slide, most of you think that at UPS we just ship packages, but we do a lot more than shipping packages. On the left-hand side we have the global small package service; that's what most people are familiar with. We have the domestic package services and we have the global package services.
Some of the interesting facts that I found along the way: how many package cars do we have? Today we have over 123,000 package cars all around the world. We have about 150,000 storefronts all around the world, and over 200 aircraft.
So these are mission-critical applications: no outage, no downtime for OpenShift. How did we accomplish this? We used some of the Red Hat tools that can migrate the existing infrastructure without any incident, and we also won an Innovation Award in 2018.
Okay, the journey: how we started and why we went down this path. Traditionally, when an application needs some work done, the application teams' developers take their code and develop it on their laptops, and then eventually they hand that code off to the operations teams. The operations teams will then schedule jobs...
I'm sorry: make a change request, go through change controls, and schedule jobs to put the code into a development environment, then eventually into the UAT environment and the stress environment, and eventually into production. During the whole process, if something was to go wrong, operations would go back to the developers, and with all that back and forth a lot of time was wasted. So that's when we started looking at, hey...
The difference is that once their code is done being developed, they ship it off to a Git repository, and with the right configuration, when the code is updated it automatically gets pushed into the development environment and they can start doing their testing. If all is good, with a push of a button that code essentially gets pushed to the next environment, which is stress, and then with another push of a button it gets pushed to production, with testing happening in the middle of the process.
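As an illustration of that flow, here is a minimal sketch of a Git-triggered build in OpenShift 3.x; the application name, namespace, repository URL, and webhook secret are hypothetical, not UPS's actual configuration:

```yaml
# Minimal sketch: a webhook-triggered build in OpenShift 3.x.
# Name, namespace, repo URL, and secret are hypothetical.
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: tracking-app            # hypothetical application
  namespace: dev                # the development environment
spec:
  source:
    git:
      uri: https://git.example.com/team/tracking-app.git
  strategy:
    sourceStrategy:
      from:
        kind: ImageStreamTag
        name: redhat-openjdk18-openshift:1.4
        namespace: openshift
  output:
    to:
      kind: ImageStreamTag
      name: tracking-app:latest
  triggers:
    - type: GitHub              # a push to the repository starts a build
      github:
        secret: changeme        # hypothetical webhook secret
```

A DeploymentConfig with an ImageChange trigger on tracking-app:latest would then roll the freshly built image into dev automatically, while promotion to stress and production stays a deliberate, push-button step (for example, tagging the image into the next project).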
So why did we pick Red Hat OpenShift? Prior to Red Hat OpenShift (actually, we still have this today) we had many different environments. For the .NET applications we have the Windows platform; for the Java environment we have BEA WebLogic or WebSphere. So all of these are separate environments that the operations teams need to support, along with the other applications sitting on those platforms.

The next part is that the application team controls the operations tasks, so essentially the application teams control their own destiny. They can scale up pods when they have more traffic coming into their environment. They can monitor their own CPU and say: hey, do I need more pods for my environment? They can also self-serve persistent storage requests. And lastly, application portability: whatever they develop on their laptop can be ported over to any environment that they want.
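That self-service scaling maps naturally onto a HorizontalPodAutoscaler; a minimal sketch, with hypothetical names and thresholds rather than UPS's actual values:

```yaml
# Hypothetical example: an application team scales its own pods on CPU.
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: tracking-app            # hypothetical app name
  namespace: tracking           # hypothetical team project
spec:
  scaleTargetRef:
    apiVersion: apps.openshift.io/v1
    kind: DeploymentConfig      # OpenShift 3.x deployment object
    name: tracking-app
  minReplicas: 2
  maxReplicas: 10               # grow to ten pods under load
  targetCPUUtilizationPercentage: 75
```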
Okay, monthly OS patching. One of our mandates from our security team is that we have to patch every single month. The patches cover kernel patches, security configuration, and some of the custom configuration that comes from the OS team. So, in our environment:
We have over four hundred servers in the OpenShift environment and 14 OpenShift clusters, and patching requires no outage. How do we do it? What we did was create our own Ansible scripts, and once the Ansible scripts are created, we execute them through Ansible Tower. I'll just give you a high-level view of what the Ansible script and the patch flow look like.
From a single-server perspective, we have pre-tasks. Prior to starting any patch, the first thing you have to do is check to ensure that the environment is fully operational. The script will determine whether the environment is operational, and once it is, we can start. The first task is we take whatever server we're going to be working on and cordon it.
In the middle, and all the way on the right-hand side, is where the patch actually starts. It will do all the configuration changes and the kernel updates, and then finally it will do a server reboot. The last piece is it will check: did the OS patch cause any problems? If all is successful, it will put everything back in the mix.
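A minimal sketch of that single-server patch flow as an Ansible play; the group names, package handling, and health check are assumptions for illustration, not UPS's actual playbook:

```yaml
# Hypothetical sketch: patch one OpenShift node with no cluster outage.
- hosts: "{{ target_node }}"    # one server at a time
  serial: 1
  pre_tasks:
    - name: Ensure the node is Ready before touching it
      command: oc get node {{ inventory_hostname }} -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
      delegate_to: "{{ groups['masters'][0] }}"
      register: ready
      changed_when: false
      failed_when: ready.stdout != "True"
  tasks:
    - name: Cordon and drain the node so workloads move elsewhere
      command: oc adm drain {{ inventory_hostname }} --ignore-daemonsets --delete-local-data
      delegate_to: "{{ groups['masters'][0] }}"
    - name: Apply kernel and security errata
      yum:
        name: "*"
        state: latest
        security: yes
    - name: Reboot and wait for the server to come back
      reboot:
        reboot_timeout: 600
    - name: Put the node back in the mix
      command: oc adm uncordon {{ inventory_hostname }}
      delegate_to: "{{ groups['masters'][0] }}"
```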
So that's the single-server patch view. How does Ansible Tower play into this whole picture? What Ansible Tower does for us is the scheduling. As you may know, with OpenShift everything is broken down into master nodes, infra nodes, and application nodes. So, using Tower's capabilities, we patch the master nodes one server at a time; if anything fails, it stops, and once the issue is fixed it continues from the point where we stopped. It goes the same way for the infra nodes, because they're the front door to OpenShift: we also do them one server at a time, and in the event that there are any failures, manual intervention has to come in, fix the issue, and then move on. Where it gets better is the application nodes: we have multiple application nodes, so there's some concurrency that can happen there.
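In play form, that phasing looks roughly like this; the group names and the 25% batch size are illustrative, and monthly_patch stands in for whatever role does the actual patch work:

```yaml
# Hypothetical sketch: strict serialization for the control plane and
# front door, limited concurrency for the app tier.
- hosts: masters
  serial: 1                     # masters strictly one at a time
  roles: [monthly_patch]

- hosts: infra
  serial: 1                     # infra nodes carry the routers, one at a time
  roles: [monthly_patch]

- hosts: app_nodes
  serial: "25%"                 # patch a quarter of the app nodes at once
  roles: [monthly_patch]
```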
So how did we upgrade from 3.4 to 3.9? With the upgrade from 3.4 to 3.9 there's an etcd upgrade, which means essentially we were forced to have an outage: we have to bring down the etcd database and upgrade it. In order to avoid an outage, what we did was a blue-green deployment, which means:
We have two clusters: one on 3.4, servicing existing customers, and the 3.9 environment, which is the new build. Then we have two pipelines: one is the 3.4 pipeline that can deploy code out to the existing environment, and the other pipeline is for 3.9. Once all the testing is completed for the 3.9 environment, the application team just does a DNS failover. We have multiple application teams, and they each have full control of when they cut over.
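As a sketch of what such a DNS cutover could look like with Ansible's nsupdate module, assuming a DNS server that accepts dynamic updates; every name and key here is hypothetical:

```yaml
# Hypothetical sketch: flip the app's CNAME from the 3.4 cluster's
# router to the 3.9 cluster's router once testing passes.
- name: Point the application alias at the new (3.9) cluster
  community.general.nsupdate:
    server: ns1.example.com         # hypothetical authoritative DNS server
    zone: apps.example.com
    record: tracking                # tracking.apps.example.com
    type: CNAME
    value: router.ocp39.example.com.
    ttl: 60                         # short TTL so the failover takes effect quickly
    key_name: ddns-key              # hypothetical TSIG key
    key_secret: "{{ ddns_secret }}"
```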
From 3.9 to 3.11: we are currently doing that today; over the past weekend we just completed one of our production environments. So how do we do it? Again, we had to create our Ansible scripts, and there's a lot of customization that we have to do to ensure that everything happens correctly.
So what are some of the customizations that we do? For example, the hostname has to be in the right format that OpenShift requires; the proxy settings that we have in the configuration; the log locations. From 3.9 to 3.11 there's a major shift: everything becomes a pod, even the infrastructure. What we had before was RPM-based, and everything becomes a pod format. So to ensure that everything is copacetic, we had to create scripts to make that happen.
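A sketch of what such pre-upgrade validations can look like; the specific checks, proxy host, and log path are assumptions, not UPS's actual list:

```yaml
# Hypothetical pre-upgrade checks before running the 3.11 playbooks.
- hosts: nodes
  tasks:
    - name: Hostname must be the fully qualified name OpenShift expects
      assert:
        that:
          - ansible_fqdn == inventory_hostname
        fail_msg: "{{ inventory_hostname }} does not match its FQDN"

    - name: Proxy settings must be present for image pulls
      lineinfile:
        path: /etc/environment
        regexp: '^HTTPS_PROXY='
        line: "HTTPS_PROXY=http://proxy.example.com:8080"   # hypothetical proxy

    - name: Custom log location must exist
      file:
        path: /var/log/openshift      # hypothetical custom log location
        state: directory
        mode: "0755"
```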
Once that's done, we start using the Red Hat out-of-the-box playbooks. Again, everything is divided up: we have the control plane, which has the master nodes, and then it's also broken down into the infra nodes and the app nodes. Master nodes we do one at a time, infra nodes one at a time, and app nodes we can do concurrently to save a lot of time.
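That phasing matches the openshift-ansible upgrade playbooks, which split the work into upgrade_control_plane.yml and upgrade_nodes.yml. A minimal sketch of the inventory variables that drive the node phase; the label and batch size are illustrative:

```yaml
# Hypothetical group_vars for the 3.11 node-upgrade phase.
# Run playbooks/byo/openshift-cluster/upgrades/v3_11/upgrade_control_plane.yml
# first, then upgrade_nodes.yml with variables like these.
openshift_upgrade_nodes_label: "node-role.kubernetes.io/compute=true"
openshift_upgrade_nodes_serial: "20%"    # upgrade app nodes in batches
```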
And lastly, once the environment is upgraded to 3.11, we have the post-upgrade Ansible playbook that we created, which we use to ensure that everything is running in the exact same manner across the different clusters. A lot of customization again comes into play there: performance customizations, router customizations, application logging, and time sync, for example.
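A minimal sketch of what a post-upgrade verification play can look like; which checks UPS actually runs isn't specified in the talk, so these two are assumptions:

```yaml
# Hypothetical post-upgrade verification run from a master.
- hosts: masters[0]
  tasks:
    - name: Every node should report Ready
      shell: oc get nodes --no-headers | awk '$2 != "Ready"'
      register: not_ready
      changed_when: false
      failed_when: not_ready.stdout != ""

    - name: Router and registry pods should be Running in the default project
      shell: oc get pods -n default --no-headers | grep -v Running
      register: default_pods
      changed_when: false
      failed_when: default_pods.stdout != ""
```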
What are some of the lessons we learned from upgrading from 3.4, or actually from OpenShift in general? The top two are more about the learning curve and skill sets, so I'll just read it off: OpenShift skill gaps to design, implement, and support the infrastructure in a short period of time. As you may know, with OpenShift it's not just, hey, you learn OpenShift and that's it. With OpenShift there are so many different components: you have the networking, you have...
You have Ansible, you have the RHEL OS; there are so many different components, and in order to be successful with deploying OpenShift you have to have expertise in so many different areas. That's one of the challenges we encountered: you have one person that's good in one area but not good in another. So we had to learn a lot; in every different area we kind of had to start learning how everything works, in order that...
...when we encounter problems, we can resolve them in a quick manner. The third one is that resource planning is critical. Today we have one full-time person (myself) and two consultants, and we're looking at increasing our resources to be able to support our existing infrastructure. On the technical side: the HAProxy router. One of the big things we ran into was that, after we deployed the 3.4 environment, we had all...
We had all the application teams on one cluster, and what we got was that periodically an application team would come to us and go: hey, we're getting some drops, we're getting some timeouts, or some requests are taking longer. Later on, we realized that HAProxy is a single-process, single-threaded process. What that means is that, even though we had created multiple physical servers for the HAProxy pods to sit on...
However, being that it's a single process, it will only use one out of the 56 cores that we have allocated. The resolution for that was that we started creating multiple pods on each single server, so that the cores on that particular box can be utilized more efficiently.
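A sketch of that fix; the replica count is illustrative, and running more than one router pod per node assumes the router isn't bound to the host network's ports 80/443, which is a deployment-specific choice:

```yaml
# Hypothetical sketch: scale the default router out to several HAProxy pods
# so more than one core does useful work. Exact counts depend on the cluster.
- hosts: masters[0]
  tasks:
    - name: Scale the router DeploymentConfig out
      command: oc scale dc/router -n default --replicas=6
```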
And custom roles: we set permissions, and with the upgrade to 3.9 or 3.11 there are some custom roles that we had to go back and revisit, working with the app teams to make sure the custom roles fulfill their requirements. Also, the OpenShift uninstall job did not clean up directories. One of the things that we do at UPS, for example when we upgrade to a different version, is reuse the application nodes, and the uninstall...
What we learned was that the uninstall doesn't always clean everything up, so we had to manually go into each server and make sure that everything is cleaned up correctly.
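A sketch of automating that cleanup; which leftover directories actually need removing is installation-specific, so these are just common OpenShift 3.x locations:

```yaml
# Hypothetical sketch: remove directories the uninstall playbook can leave
# behind before reusing a node for a new version.
- hosts: app_nodes
  tasks:
    - name: Remove leftover OpenShift state directories
      file:
        path: "{{ item }}"
        state: absent
      loop:
        - /etc/origin             # node/master configuration
        - /var/lib/origin         # pod volumes and local state
        - /var/lib/etcd           # etcd data (masters only)
```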
And lastly, performance save mode. Essentially it's a flag that you turn on and off; if it's turned on, it means that if the server is not being utilized heavily, it will start shutting things down, and that caused problems for us.
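If that flag corresponds to host power-management tuning, which is an assumption on my part since the talk doesn't name the exact setting, a minimal sketch of pinning nodes to a performance-oriented tuned profile looks like this:

```yaml
# Hypothetical sketch: keep hosts from powering down resources when idle.
- hosts: nodes
  tasks:
    - name: Pin the host to a throughput-oriented tuned profile
      command: tuned-adm profile throughput-performance
```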
Okay: accomplishments and roadmap. In 2017 we started with OpenShift and built out the 3.4 infrastructure. In 2018 we started onboarding a lot more applications, built out more infrastructure, and upgraded from 3.4 to 3.9. Through 2019 into 2020 we're in the process of upgrading from 3.9 to 3.11, again onboarding additional applications coming onto OpenShift, and towards the middle or end of this year...