OpenShift Architecture Evolution at Elisa
Antti Seppälä, Service Manager (Elisa)
OpenShift Commons Gathering Helsinki 2018
So today, ladies and gentlemen, my name is Antti, and it's really good to be here. I'm here to talk to you a little bit about the evolution of our OpenShift architecture at Elisa.
Well, it was nice to hear Augusta mention that the company is over 100 years old. Elisa is even older than that: it was founded in 1882. So, talking about legacy... Anyway, a couple of words about myself first.
I work as a service manager at Elisa. Like I said, that means I lead the DevOps team. We are responsible for the cloud offering that we provide to our developers and the infrastructure behind it, and we also maintain some tools and technologies and participate in architecture discussions with the devs. In general, it would be fair to say that we try to make the lives of the developers easier. I also get to give input on strategy, take part in partnership negotiations, do budgeting, and handle the various other duties that managers generally get to do. But that's only my day job.
During the night, I turn into a Linux kernel hacker, so I really do enjoy turning the bugs, or oopses, that you can see on the right side of the screen into upstream commits that look something like the left side of the screen. But that's just the way I'm weird.
You know, coming from a kernel background and being this simple-minded, low-level guy, I'm not ashamed to admit that containers sometimes feel a bit overwhelming to me. The technology is rather new and it's evolving rather rapidly, so it may feel like it's hard to keep up. So in the next part of the presentation I will go through a little bit of the history: how the container stack that we currently run at Elisa became what it is today.
Well, I'm not going to go into the very early beginning; I'm just going to start by mentioning that it all basically started when we introduced Ansible for software deployments. It worked rather well. We had our existing virtualization environment, and we offered Ansible as the tool for the devs to deploy their stuff into production. They were really happy about it, and eventually we got good adoption. But with good adoption came an increasing number of requests for the ability to control the infrastructure with Ansible as well.
Well, at the time we couldn't really do that, so we went and created another set of technologies that could. This one was, and actually still is, based on OpenStack, so you get API availability and you don't need to file tickets and so on and so forth. The idea was to run Heat templates, which described the infrastructure, and then run Ansible on top of that to provision the software, so you could control the whole thing.
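Just to make that flow concrete, here is a minimal sketch of the two steps, assuming the OpenStack and Ansible command-line clients are installed; the stack name, Heat template, playbook, and inventory file names are hypothetical.

```python
#!/usr/bin/env python3
"""Minimal sketch of the Heat-plus-Ansible flow described above.
The stack name, template, playbook and inventory are hypothetical."""
import subprocess

STACK_NAME = "demo-app"        # hypothetical stack name
TEMPLATE = "infra.yaml"        # Heat template describing the infrastructure
INVENTORY = "inventory.ini"    # inventory pointing at the servers Heat created
PLAYBOOK = "site.yml"          # Ansible playbook that deploys the software

# Step 1: create the infrastructure from the Heat template and wait for it.
subprocess.run(
    ["openstack", "stack", "create", "--wait", "--template", TEMPLATE, STACK_NAME],
    check=True,
)

# Step 2: run Ansible on top of the new servers to provision the software.
subprocess.run(
    ["ansible-playbook", "-i", INVENTORY, PLAYBOOK],
    check=True,
)
```

The point is simply that the infrastructure description and the software provisioning live in the same pipeline, so the devs can control both ends of it.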
Well, this became immensely popular; the devs loved it and we were happy about that. But during that time the container revolution also happened, and after a little while we started to investigate what people were actually deploying into production with this beautiful stack. It turns out they were deploying Kubernetes. They were using the stack to deploy Kubernetes and then running the rest of their stuff on top of that, and actually this was fine in our opinion; we liked it.
We even endorsed it to some extent, so we created some installation scripts to help the devs install Kubernetes, because we saw that it would probably become the mainstream container orchestrator at some point in the future. Well, boy did they install it.

Eventually we had multiple Kubernetes clusters sprouting up, with different configurations, different sizes, different features, and different operations. And if you, like us, are in an organization where there is a separate team or department responsible for operations, 24/7 support, and on-call duty, well, this turned out to be more difficult than we would have hoped. At the time, Kubernetes basically didn't do multi-tenancy at all. I don't know if it does now, but at the time, at least, it was really, really hard. So we set our sights on another container orchestration solution, one that would be based on Kubernetes and would be compatible with the multi-tenancy requirements of our multiple teams, because we didn't want teams sharing the environment.
Well, it's pretty easy to tell that OpenShift fit the bill quite nicely: it does multi-tenancy and meets all the requirements, and it's based on Kubernetes, which the devs already loved. So eventually the evolution of the software stack that we offer looks like this. In the next part of this presentation, I'm going to talk to you a little bit about how the OpenShift installation itself has evolved over the years. Our first attempt at installing it looked a little bit like this.
Then we also added a router node, which was used to direct the traffic into the cluster. Well, looking at the picture afterwards, it's pretty easy to spot the single point of failure: we pretty soon discovered that you can't take the router down for maintenance without bringing the whole cluster down. So another version of the cluster was set up where we added a backup router, configured with keepalived to switch over to it.
Still, the adoption rate wasn't at the level we were really hoping for, and again, looking into it and asking the developers what was going on, or why they weren't using this, it became clear that most of the teams actually have some external dependency that they would like to access: an external database, an external API, or an external whatever. They also stated that they did not want to share the access to that external service with the other users of the cluster, and at the time OpenShift couldn't really separate the external traffic between the projects.
We found a tech preview feature that could do that for us. Now, I don't recommend utilizing tech preview features in production, because there were some pains in setting it up, but luckily, with upstream collaboration and collaboration with Red Hat support, we were able to make the feature stable enough for our use. Actually stable enough that, with the usage increasing, we ended up sharding the routers once more to really have the bandwidth available for the increasing traffic. And this is pretty much the architecture of OpenShift as it stands today in our data centers. But when I say data centers, I really do mean it.
We ended up discovering that it's rather nice to have multiple data centers set up, for HA reasons. If you want to do upgrades in one of them, it's pretty nice to have data center number two available for operations at the same time. So we ended up setting up another cluster in a second data center.
We also did it so that the developers had the chance to choose which one they would like to use: if they wanted to use data center one, they could, and if they wanted to use number two, that was fine too. We obviously gave strong recommendations to use both at the same time, and that was kind of what we wanted them to do, because that would have allowed us to handle maintenance rather nicely. But evidently they did not, and when I asked why they didn't use both of them at the same time, it turned out that it's really hard for them to do load balancing between the clusters in a way that exposes the service from both of them at the same time and keeps traffic flowing even if one of them is taken down for maintenance. DNS round-robin didn't really work for that: most browsers, for example, don't really cope with that sort of mechanism for balancing the load.
So we built a small piece of software to solve this. It listens to OpenShift API events related to route creation and updates, and it can provision an external load balancer when needed. If you set up a route in data center one, it will create it in the load balancer, and traffic will start flowing from there. If you then add the same route to data center number two, it will be added to the load balancer pool as another destination for the traffic.
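To give a feel for how a controller like that works, here is a minimal sketch in Python that uses the Kubernetes client to watch OpenShift Route objects. It is only an illustration of the idea, not the open-sourced project itself; the `update_external_load_balancer` function and the `dc1` name are hypothetical placeholders.

```python
"""Minimal sketch of a controller that watches OpenShift Route events.
Illustration only; the load-balancer call is a hypothetical stub."""
from kubernetes import client, config, watch


def update_external_load_balancer(host, datacenter):
    """Hypothetical stub: register `host` from this datacenter as a
    backend in the external load balancer pool."""
    print(f"Would register {host} from {datacenter} in the LB pool")


def main(datacenter="dc1"):
    # Load credentials from the local kubeconfig (inside a cluster you
    # would use config.load_incluster_config() instead).
    config.load_kube_config()
    api = client.CustomObjectsApi()

    # Routes are an OpenShift custom resource in the route.openshift.io group.
    stream = watch.Watch().stream(
        api.list_cluster_custom_object,
        group="route.openshift.io",
        version="v1",
        plural="routes",
    )
    for event in stream:
        route = event["object"]
        host = route.get("spec", {}).get("host")
        if event["type"] in ("ADDED", "MODIFIED") and host:
            update_external_load_balancer(host, datacenter)


if __name__ == "__main__":
    main()
```

Running one watcher like this against each cluster would let the external load balancer end up with the same route registered from both data centers, which is the behaviour described above.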
This piece of software was actually open sourced, or announced to be open sourced, at the last OpenShift Commons Gathering in San Francisco last May. If you want to take a look, there's the GitHub URL where you can find it. Okay, well, this setup is pretty much the one we run nowadays. But going even further, we discovered that there are different kinds of workloads that people want to run.
Basically, the difference we ended up discovering is that there is the production workload, which you expect to be rather highly available and highly responsive, and which is generally quite predictable. But then there is the development workload, where people run compilations, load tests, unit tests, or AI training cycles.
We hope that it will enable automation with smaller iterations, which in turn enables faster learning for the dev teams; that's quite a valuable thing to get. We also aim to shift the responsibility for the end-user experience to the teams themselves with the cloud technologies.
A good example of a team that has been following these guidelines is actually the Elisa Viihde entertainment service's video-on-demand store, or Vuokraamo in Finnish, which at the moment runs on top of OpenShift and is a sizeable operation as it is.
It already merits having several data centers in use to keep that kind of store up and running all the time. But that's not, in my opinion, the most interesting thing that we do on these clusters. So I'm going to talk a little bit about the coolest stuff that we are currently working on, which is the self-optimizing network.
Being a telecom operator, we have these base stations, which may have multiple cells: let's say 2G, 3G, 4G, and 5G is coming up. Each of these cells may have multiple antennas directed in different directions, and each of those may have somewhere from hundreds to thousands of parameters that you can fine-tune to create optimal coverage for your cell phones.
Now, when you change one of those parameters, the neighboring cells need to be adjusted, because their usage changes as a result, and when you adjust those, the next ones need to be adjusted. So you end up with a cascading ripple effect that spreads across your entire network from a small parameter change in one of the base stations, and we figured that this is no longer a task for humans to do.
So, if you are actually interested in learning some more about this, well, in either order: if you want to learn more about the product that we are trying to build around this self-optimizing network work, visit elisaautomate.com, and if you are truly interested in becoming an expert in the field and working with these technologies, go check out elisa.fi/jobs. We have several openings available there. But with that, I think it's time to say thank you all, and it was very nice that you had me.