From YouTube: OpenShift at Cisco
Description
Michael White & Michael Duarte from Cisco Systems discuss their production deployment of OpenShift at the OpenShift Commons Gathering Boston on May 1, 2017.
Learn more and see the slides here: https://blog.openshift.com/openshift-commons-gathering-at-red-hat-summit-2017-video-recap-with-slides/
A: Thank you all for being here. We're excited to be here and to share with you some of the work we've done around OpenShift at Cisco. Our goal, as an infrastructure provider, is really to provide cloud infrastructure for all of Cisco IT's developers. Mike and I — Mike and Mike — both work in Global Infrastructure Services at Cisco, and we've been there a long time.
A: A key thing for us is really the scale that we deploy at. We're currently deployed across three different data centers in two different geographies, and our goal is to support thousands of internally developed applications. So we're scaling up to support 2,000-plus applications and looking to run tens of thousands of containers, and these support all the different lines of business: our internal HR apps, our supply-chain applications, and even our commerce and go-to-market applications.
B: So when we set out to build this new iteration of OpenShift for Cisco, we had a couple of guiding principles going into it. The first, and one of the biggest drivers for us in adopting OpenShift 3, was our desire for a multi-tenant environment. A lot of the other current container-scheduler solutions, even around Kubernetes, are not multi-tenant, and one of the big drivers for us is to utilize the efficiencies that containers bring to the ecosystem: to take multiple different use cases, pack them all onto bare-metal servers, and drive those servers' utilization up while cutting out the hypervisor tax — and we've been seeing good utilization from that. The second guiding principle was that we didn't want to be a restrictive environment. For those of you who work for larger organizations, you probably know that, typically, the larger the organization, the more policy, process, and red tape is involved.
B: We'll talk a little bit more about that concept, tying in with the multi-tenant environment: we wanted to promote efficient resource utilization. We didn't want to lose roughly 15% of our compute processing to hypervisors, and we didn't want to stand up and eat the management overhead of multiple different schedulers.
A: Okay, I'll take this one. All our developers, like everyone in IT today, are enamored with all the great things you can do at Amazon or Google; they really want that freedom to program against their infrastructure, right?
A: To treat infrastructure as code and do all these DevOps-y processes — they really, really want that, and if they've got enough budget and an influential enough director, they'll go out and start at Amazon. We always ask them, "Why did you go there?" and they come back with this litany of services they're going to use from Amazon: EC2 and S3 and all the ELB networking components. They come with this laundry list, and up until recently we said, "Yeah, we can't do that."
A: We didn't have programmable infrastructure they could use, and many of them went out there. But we talked to them and really tried to understand what they were looking for: Amazon EC2 is really the ability to dynamically provision compute capacity, to scale, to deploy in multiple regions. Once we understood that requirement, we could start making some progress.
A: With S3, they're looking for all different kinds of storage for all different kinds of use cases. For a simple web app just serving some content, you don't need high-powered storage; but if you're going to run a big-data service, you need high-throughput, low-latency storage.
A: As far as the network goes, what the app developers were most looking for was the ability to control request routing: can I send some of my users to one geography and another set to another? Can I use that in my deployment process so I can do A/B deployments and those sorts of things? They're really looking for control over the request routing.
A: Additionally, we've got lots of apps that want to use a web application firewall, and that was a capability we wanted to provide. As you look across all the capabilities they were asking for at Amazon, we found that at Cisco we have all of these technologies — all of the different compute, storage, and networking solutions you might want; it's just that they weren't exposed programmatically. So OpenShift became a facade, or a layer of abstraction if you will, that helped us translate the requirements they came with into the services we were providing. We still don't offer EC2-style programmable instances — we have self-service versions, but not programmable versions — but we use OpenShift to be that common layer, that API our developers can program against in order to get compute, storage, and networking.
A: So if we're going to provide programmable infrastructure to our development communities, we've got to handle the whole life cycle, starting with provisioning. Much like the Swiss Railways example, we have a whole system built for self-service provisioning that's API-enabled and programmable. Right from scratch, a development team can come and say, "Hey, I need a project; I'm developing this application," and we'll let them get started, create that project, and establish their tenancy.
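A minimal sketch of what such a self-service provisioning call might look like against an OpenShift 3.x master — the host, token, endpoint path, and project/owner names here are assumptions for illustration, not Cisco's actual API:

```python
# Hypothetical sketch only: self-service tenant provisioning against an
# OpenShift 3.x master. Host, token, and names are placeholders.
import requests

API = "https://openshift-master.example.com:8443"  # placeholder master URL
TOKEN = "..."  # bearer token for a provisioning service account

def create_project(name: str, owner: str) -> None:
    """Create an OpenShift project (the tenant boundary) for a dev team."""
    body = {
        "kind": "ProjectRequest",
        "apiVersion": "v1",
        "metadata": {"name": name},
        "displayName": name,
        "description": f"Tenant owned by {owner}",
    }
    resp = requests.post(
        f"{API}/oapi/v1/projectrequests",  # legacy 3.x-era endpoint (assumed)
        json=body,
        headers={"Authorization": f"Bearer {TOKEN}"},
    )
    resp.raise_for_status()

create_project("supply-chain-dev", "supply-chain-team")
```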
A: Once you've got your infrastructure provisioned, your next step is building your application. We've got an enterprise container hub and, like everyone else, we've got Git and Jenkins: those build your application's war and tar files, check them into a version repository, and then merge them with an image. We'll do that build process, stick the image in our enterprise container hub, and it's from there that we'll pull it into the container runtime environment.
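As a rough illustration of that flow — not Cisco's actual pipeline; the registry host, image names, and artifact paths are made up — the image-build-and-push step might look something like this:

```python
# Illustrative sketch: layer a Jenkins-built artifact onto a base image
# and push it to an internal registry. All names/paths are placeholders.
import subprocess

REGISTRY = "containerhub.example.com"  # placeholder enterprise hub

def build_and_push(app: str, version: str, artifact: str) -> str:
    """Build an app image from a freshly built .war/.tar and push it."""
    tag = f"{REGISTRY}/{app}:{version}"
    # Assumes a Dockerfile in the cwd that COPYs ${ARTIFACT} onto the base.
    subprocess.run(
        ["docker", "build", "--build-arg", f"ARTIFACT={artifact}", "-t", tag, "."],
        check=True,
    )
    subprocess.run(["docker", "push", tag], check=True)
    return tag  # OpenShift deployments then pull this tag at runtime

build_and_push("commerce-api", "1.4.2", "target/commerce-api.war")
```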
A: I wanted to say one more thing on the previous one: we've built cloud APIs within Cisco that allow teams to manage all those different life cycles, as well as manage our federated cluster. The container platform, OpenShift, is obviously the core, and underneath it all we've got our infrastructure services. That cloud API is really important: for technologies that don't have a multi-tenant, cloud-based API, we've built one in front of them. So think about the load balancers, and even some of the... what else did we put one in front of?
A: And one more thing: in order to provide an end-to-end stack, you've got to orchestrate among various services. Just running and deploying your application in OpenShift is great, but what if you need to create a firewall rule or a presence in the DMZ? How do you get that done? So we're trapping events as they come out of OpenShift and passing them back to these cloud APIs to orchestrate things like certificate creation, DNS registration, registration with our enterprise monitoring system — a whole laundry list.
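A minimal sketch of that event-trapping pattern, using the standard Kubernetes Python client — the helper functions stand in for Cisco's internal certificate/DNS/monitoring APIs and are purely hypothetical, as are the event reason/kind filters:

```python
# Hypothetical sketch: watch cluster events and hand interesting ones to
# internal orchestration APIs. The helper bodies are placeholders.
from kubernetes import client, config, watch

def request_certificate(host: str) -> None:
    print(f"would call internal cert API for {host}")   # placeholder

def register_dns(host: str) -> None:
    print(f"would register {host} in enterprise DNS")   # placeholder

def enroll_monitoring(ns: str, name: str) -> None:
    print(f"would enroll {ns}/{name} in monitoring")    # placeholder

config.load_kube_config()  # or load_incluster_config() inside a pod
v1 = client.CoreV1Api()

# Stream events cluster-wide; the reason/kind checks are illustrative.
for ev in watch.Watch().stream(v1.list_event_for_all_namespaces):
    obj = ev["object"]
    ref = obj.involved_object
    if ref.kind == "Route" and obj.reason == "Created":
        host = f"{ref.name}.apps.example.com"  # placeholder; read the Route
        request_certificate(host)
        register_dns(host)
        enroll_monitoring(ref.namespace, ref.name)
```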
B: I was very sad to hear that Nintendo was canceling the Nintendo Classic — hence my slightly adjusted Mario figurine, so we won't get sued for copyright infringement: he's wearing blue, not red, and his pants are green. Anyway, that's enough of that. As many of us have been talking about, this is an evolving technology stack and solution, and a lot has been done. In fact, I'm shocked at how quickly this community has developed and put together feature sets that rival pretty much anything else in such a short time. But there's still a long roadmap of features that we need, and so I want to talk about — super cool animation, right? — some of the feature gaps that we've hit. In fact, do you guys want to see it again? It took me a while to figure out how to do pathing in PowerPoint.
B: In addition to that, there's Kubernetes service discovery. It's fantastic — it works very well within the cluster — but there's no way today to do service discovery across multiple clusters. One of our first big clients is working on an identity federation warehouse, and they have many database clusters that need to see when new nodes come up in other clusters. There's just no way we could do that through Kubernetes, so what we ended up having to do was enable certain specific use cases.
B: One moment... there we go — yes, I am definitely sure. There we go. The other thing — and I'm very interested in hearing about all the work being done in this area today — is ingress and egress control. Again, when we're talking about multiple clusters, not just northbound of OpenShift but even between clusters, we quickly realized that we needed some way of controlling what traffic could be exposed where, and what traffic could be taken in from where, and there's really no good way of doing that today.
B: Fortunately Ben, who was over here earlier — a huge callout to him — helped us out a lot, and we did a lot of work on getting the ingress controller set up so that we could make certain services available outside the OpenShift cluster with their own public IP address, so they could be reached from other clusters. But there's still a lot of work that needs to be done here.
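The pattern described is roughly a Service exposed on an IP of its own. A minimal sketch with the Python Kubernetes client — the namespace, selector, port, and IP are placeholders, and `externalIPs` is just one way to get this effect:

```python
# Hypothetical sketch: expose one service outside the cluster on its own
# IP via spec.externalIPs so peers in other clusters can reach it.
from kubernetes import client, config

config.load_kube_config()
svc = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {"name": "identity-db"},
    "spec": {
        "selector": {"app": "identity-db"},            # placeholder selector
        "ports": [{"port": 5432, "targetPort": 5432}],
        "externalIPs": ["192.0.2.10"],                 # placeholder public IP
    },
}
client.CoreV1Api().create_namespaced_service("identity-federation", svc)
```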
B: Network policies — again, we're excited to hear what's coming out. We're also working with our own internal teams on this, really taking network policies beyond just the host layer, which is kind of where we are now: we're putting each host as an endpoint within an EPG — think of it like a policy group — and dictating, essentially, ACLs or filters on what can and can't access those hosts. We then node-label those hosts to dictate certain workload behaviors — dev, stage, prod — all running in the same cluster while only having access to certain things.
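A small sketch of that node-labeling scheme — the node name, label key, and tiers are invented here for illustration: label the bare-metal host to match its EPG, then let workloads select the tier.

```python
# Hypothetical sketch: tag hosts with a lifecycle tier and pin workloads
# to them with a nodeSelector. Names and label keys are placeholders.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# The host sits in a "prod" EPG, so label it accordingly.
v1.patch_node("bm-host-17", {"metadata": {"labels": {"tier": "prod"}}})

# Pod templates then carry a matching selector, so dev/stage/prod share
# the cluster while landing only on hosts whose network policy fits.
pod_placement = {"nodeSelector": {"tier": "prod"}}
```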
B: Ultimately, we're really hoping that the container will eventually become truly a first-class citizen on the network, really expanding on and fully taking advantage of everything containers have to offer.
A: Let me jump in here, Mike. To me, this is the holy grail. I've been waiting so long to be able to do a multi-tenant shared PaaS platform and be able to specify application-specific network policy, and it's not an OpenShift problem. We've been running Java JVMs on a consolidated platform for 15 years, and it's the same problem there: we can only define policies at the host layer, or a group of hosts, or the cluster layer. So if any of you have solved this one, let me know.
B: Something else we're very much looking forward to: a requirement right from the get-go for us was very low-latency, high-I/O, high-throughput storage for some of our clustered-database use cases, and while StatefulSets solve part of this problem, as you probably know, hostPath is really only enabled for single-node deployments.
B: There's a plugin for that, and we've modified it to allow some other feature sets, but we're really looking for a better solution there. And then security: right now we have great visibility, which is something an IT organization has historically struggled with. When we were pitching this to our own internal InfoSec team, that's something we kept exposing over and over and over again, and they were very impressed by it. Looking into the future, you'll see Kube Federation all over the place here; we really see it as the way forward for a lot of the problems we're facing today. Things like a unified experience across clusters — we see the Kube Federation model really helping out there. Service discovery: Kube Federation helping out there. Cluster failover: Kube Federation helping out there. Again, we're still looking for a good, scalable ingress/egress solution — we're working with our own internal teams on that, as well as with Red Hat — and we're really looking forward to some of the stuff coming down the pipeline.
B: As early as 2002 we deployed our first PaaS solution, so we've been in the PaaS game for a while. In that era the data center ran pretty much any and every workload: you could install almost anything on bare metal and have it go. When we introduced our PaaS solution, we segmented some of that away for web apps, streamlined it, and made that experience better. In 2008 we, like everyone else, rode in on the VM revolution, and with that introduced infrastructure as a service.
B: At that point the data center and bare metal were relegated mostly to databases and solutions that couldn't really survive or handle the hypervisor cost. We built our PaaS solution on top of that and continued serving out web apps; everything else, for the most part, ran within VMs. In 2010 we started our journey toward what we call MVDC, or multiple virtual data centers. Basically, it's just a fancy internal marketing word for high availability across a data-center failure; we have two sites in Texas where we typically do this.
B: It allows us to take an entire data-center outage and still keep everything up and running without any problems, and a lot of that was handled through hypervisor technology and VMs. In 2015 — you'll notice we switched from "PaaS" to "CaaS" — it's something Mike White was talking about, this notion of offering more than just web apps and now also enabling microservices.
B: Actually, we're also talking about using OpenShift, or our CaaS infrastructure, as an abstraction layer there, and then bridging — hopefully using Kube Federation, as well as our own cloud API and services — to serve out all those use cases. Taking a step forward after that, our hope is...
B: All these things are important, and again, this is fundamental to the way IT has evolved to better serve our own clients. So why hybrid? Well, Cisco is a very large organization — I think we have a hundred and fifty thousand employees, somewhere around that; I don't know how many countries we're in, but it's a lot — and most of our data-center footprint today is in the United States. As we look at servicing our own internal clients in EMEA — Europe, Asia-Pac, Africa, Australia —
B: — all these different areas, having the data and the services located in the US just doesn't make sense from a latency and performance point of view. Oh, that's the wrong button — there we go. There are also notions of data sovereignty. The countries labeled there in light blue — not including the US, which the map doesn't really differentiate — are the ones we know of today that are looking at or implementing data sovereignty rules; if anyone knows of any more, let me know. So in places like Germany we're looking at hosting on a public cloud service: we don't really have enough workload to warrant building out a whole data center there, but we need to keep that data in Europe, and so we're starting to look at public cloud as a way to solve some of these use cases for us.
A: So Mike mentioned that Cisco is a big company — a lot of employees globally and a lot of customers globally. Beyond data sovereignty, which is an urgent, pressing issue, we've got just basic usability issues. Can we put the applications and the data out closer to our customers, giving them better global presence, performance, and availability? All of these -ilities can be improved by deploying the app closer to the customer, or the user, of the application.
B: Then, obviously, the biggest use case you often hear about for hybrid cloud solutions is elastic capacity: when workloads increase, you can burst into the cloud without having to maintain a large foundational footprint, and a private infrastructure lets us save money that way. I think that's it — does anyone have any questions?
D: I don't know how many users you have developing on OpenShift, or how you train them to use it the best way. My question is about, for instance, memory and CPU requests and limits: do you expect them to set the right requests, or do you do it yourself by monitoring the data underneath?
A: We've made that programmable. Much like in the previous presentation, we set up what we think is a reasonable minimum default, so if you don't specify, the developer gets a small quota. We expect people to overrun it, and then we have a request-quota API that we've also made available so they can keep increasing it. All of this is tied back to our costing and budgeting systems, so they can increase it up to the amount they can afford.
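For illustration, a "small default quota" could be stamped onto each new project like this — the numbers are invented, not Cisco's actual defaults, and the request-quota API would later patch them upward:

```python
# Hypothetical sketch: a minimal default ResourceQuota for a new tenant.
from kubernetes import client, config

config.load_kube_config()
quota = {
    "apiVersion": "v1",
    "kind": "ResourceQuota",
    "metadata": {"name": "tenant-default"},
    "spec": {
        "hard": {
            "requests.cpu": "2",       # placeholder defaults
            "requests.memory": "4Gi",
            "limits.cpu": "4",
            "limits.memory": "8Gi",
            "pods": "20",
        }
    },
}
client.CoreV1Api().create_namespaced_resource_quota("supply-chain-dev", quota)
```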
B: Frankly, it's a learning situation for us as well. A lot of these workloads' performance ratios are new to us. We can look back and say, okay, these are our JVM workloads — we can model that from the past year and try to get an idea of what ratios and how much they're going to use — and we can take a look at even the Jenkins build farm to get an idea. But at the end of the day, there are going to be a lot of new workloads.
B: In fact, there already are a lot of new workloads with very different performance factors than we've ever seen before, so on the overcommit-ratio part of it we're constantly having to monitor, learn on our feet, and adjust as needed. The benefit here is that all of this is programmatic, so it's very easy for us to continually fine-tune the environment to match the needs of the actual running infrastructure.
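One programmatic piece of that tuning loop might be as simple as tracking the cluster-wide overcommit ratio — a sketch, assuming the standard Python client and CPU-only accounting:

```python
# Hypothetical sketch: compute the CPU overcommit ratio (sum of container
# requests vs. total node allocatable) as an input for re-tuning defaults.
from kubernetes import client, config

def cores(qty: str) -> float:
    """Convert a Kubernetes CPU quantity ('500m' or '2') to cores."""
    return float(qty[:-1]) / 1000 if qty.endswith("m") else float(qty)

config.load_kube_config()
v1 = client.CoreV1Api()

allocatable = sum(cores(n.status.allocatable["cpu"]) for n in v1.list_node().items)
requested = sum(
    cores(c.resources.requests["cpu"])
    for pod in v1.list_pod_for_all_namespaces().items
    for c in pod.spec.containers
    if c.resources and c.resources.requests and "cpu" in c.resources.requests
)
print(f"CPU requested/allocatable: {requested / allocatable:.2f}")
```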
Along with that, training is something we're working really hard on: building it out and helping our developers learn about microservices.
C: Just real quick — you've been on this journey for a few years now, and you talked quite a bit at the start about the consolidation and rationalization of your infrastructure and how containers took away that hypervisor tax, so to speak. I'm just curious what sort of efficiency you think you've gained over the last few years by using containers — by percentage, or dollars, or, I don't know, price/performance?
B: To get into the specifics I'll probably need to grab your email and pull up those docs — it's been a while — but the journey is really just starting for us. When we initially built out our container platform in 2015, it was all on VMs, mostly because that's what the available infrastructure was for us to consume.
B: We're a very big customer — we have tons of VMware; please don't revoke our license, because we'd be hosed — but their End User License Agreement doesn't allow publishing performance numbers. So it's actually hard to go online and figure out exactly what the cost of the hypervisor is, and I'm not going to publish what our numbers are.
B: But from talking with colleagues outside of Cisco, what I've heard is roughly around a fifteen percent capacity loss, plus anywhere from three percent to twelve percent performance loss on a per-VM basis. So it's not just capacity on a per-compute-host basis; it's also the performance of the processes running within the VM that you have to take into account.
A: Two other things I'll add to that. One thing that we're concerned about, and why we're moving to bare metal, is the double-scheduler layer. Our VM environment is large and general-purpose, and there are policies set at the VMware layer around vMotion and DRS; we don't want our nodes in this cluster bouncing up and down.
A: It's not that VMware can't do that; it's that our policy at Cisco on this large general-purpose cluster doesn't allow it. The second thing is diversification: by allowing different application workloads to come onto the platform, we're seeing a diversification benefit. Our previous platforms as a service were all Java web apps — very memory-intensive — and our utilization on the CPU really just stunk; I mean, we might as well not have even bought CPUs.
B: And to the point Michael was making about VMware and this double-scheduler issue: we also run something called Turbonomic, or VMTurbo, which helps us manage and schedule VM workloads and right-size them to increase our efficiency at that layer. We've had a lot of issues with a kind of split brain, where on one hand OpenShift is doing a very good job of handling its own resource utilization, and underneath it our hypervisor layer is constantly changing things out from under it — "no, no, this needs to be..."