Description
Don’t miss out! Join us at our next event: KubeCon + CloudNativeCon Europe 2022 in Valencia, Spain from May 17-20. Learn more at https://kubecon.io. The conference features presentations from developers and end users of Kubernetes, Prometheus, Envoy, and all of the other CNCF-hosted projects.
eBPF & Cilium at Sky - Sebastian Duff & Anthony Comtois, Sky
Sebastian: Hey everyone, welcome, and thanks for joining our talk about eBPF and Cilium at Sky. For some context, first of all: who is Sky? Sky started out as a satellite broadcast company headquartered in London, UK, and has expanded into a much larger multinational company with a presence in many other countries.
Likewise, Sky has expanded into many areas beyond satellite broadcast, and one of those areas is OTT, over-the-top, which is online video streaming. OTT is the part of Sky which we are part of. There are going to be two of us involved in the presentation today; I'll quickly introduce us and then get on with the content, because 20 minutes goes by quite quickly. So, first of all, I'm Sebastian Duff, or Seb.
I've been with Sky for just over six years, originally as a software engineer before moving into delivery, and I'm now responsible for the Core Engineering department, which we'll cover a bit more in the introduction. Presenting with me is Anthony Comtois, who is a principal engineer in Core Engineering at Sky. Anthony has a strong background in both SRE and software engineering and joined Sky in 2016.
I'd also like to mention CCG, a consultancy who have played an important role in the journey to build a mature platform-as-a-service offering and in our work with Cilium and eBPF. Joseph Samuel from CCG was originally going to be part of the presentation with us, but due to timing he was only able to be involved in the preparation.
So this talk might be a bit different from the others: rather than going into the deep technical details of how we're leveraging eBPF, we'll be focusing on the delivery aspect and how we leverage the technology to gain a higher level of confidence, mitigating risk to the platform and to the business. In the first section of the presentation I'll give a brief overview of what we in Core Engineering do, and then I'll hand over to Anthony to talk first about why we chose Cilium and eBPF, and then about pipelining as a form of risk mitigation.
So: in Core Engineering we build a multi-tenanted, Kubernetes-based platform-as-a-service offering which hosts about 90% of the application workload. The platform is built as a white-label product, so that it can be built once and deployed many times to support the different organizations and propositions. As the underlying platform for high-profile propositions, we have very large and complex requirements, including being highly available, hybrid cloud, multi-region and active-active, and all of these at high scale and low latency.
To be able to operate efficiently at scale, we have a number of important engineering principles which we follow for everything we do. I won't go into all of them in this presentation, but I will mention two of the golden rules we follow, as they have a very large impact on the way we do things and on how things have been designed. The two golden rules are: tenant A cannot negatively impact tenant B, and no tenant can negatively affect the platform.
And although we are a platform team, we measure our success by the success of the tenants, who are the teams using the platform. Our view is that one might have the best, most perfect platform in the world, but if people are struggling to adopt it, then it really isn't a successful platform. That really comes through in how we actually implement a lot of the capabilities we have. On this slide we have some interesting stats which give a brief view of the scale we're working at.
The multi-tenanted platform currently supports just over 13 departments, with over 90 teams, which is about 1,000 engineers using the platform. These teams are using a wide variety of technologies, so our goal is to provide a consistent interface for everyone. We largely achieve this through Kubernetes, but we also build custom tooling and libraries for teams to leverage. On this slide we have a bit of a snapshot of some of the interesting technical stats.
We have over 300 unique applications deployed to the platform, with more than 60,000 replicas running across all environments, and to support the required scale we have performance-tested our central services, such as ingress, to 1 million TPS. And that's enough from me as an introduction to the platform; I'll hand over to Anthony to talk about why we chose Cilium and eBPF.
Anthony: Hi everyone. I'm going to talk about why we chose Cilium at Sky and how we've been mitigating the risk, with the help of CCG, the consultancy working alongside Core Engineering.
So, first of all, for context: there are a lot of applications running on top of the platform, on top of Kubernetes, with a multi-tenanted architecture. By default on Kubernetes you've got a flat network where every single pod can talk to every other pod. So, with the help of Kubernetes network policies and Cilium network policies, we want to restrict network communication within the cluster, from pod to pod. We also want to allow and block our tenants' access to external endpoints, for example databases; a minimal policy sketch follows below.
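As an illustration only (not Sky's actual policy; the namespace, labels, and address here are hypothetical), a CiliumNetworkPolicy of roughly this shape restricts a tenant's workload to in-namespace traffic plus a single external database endpoint:

```yaml
# Hypothetical sketch: pods labelled app=checkout in namespace tenant-a
# may receive traffic only from their own namespace, and may reach only
# one external database address.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: checkout-least-privilege
  namespace: tenant-a
spec:
  endpointSelector:
    matchLabels:
      app: checkout
  ingress:
    - fromEndpoints:
        - matchLabels:
            "k8s:io.kubernetes.pod.namespace": tenant-a
  egress:
    - toCIDR:
        - 203.0.113.10/32   # placeholder external database address
```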
So a specific tenant is going to be able to talk to a specific database, and so on. We also want to block malicious IPs defined by the security team; that's defined at the cluster level, and we want to make sure a tenant cannot override it. And we want to move towards a least-privilege access model, where every single tenant defines their full network flow, because of the high scale and the requirements at Sky.
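For the cluster-level blocking, Cilium's deny rules are the relevant mechanism, since deny rules take precedence over any allow rule a tenant might write. A minimal cluster-wide sketch (the CIDR here is a placeholder, not a real blocklist entry):

```yaml
# Hypothetical sketch: deny egress from every endpoint in the cluster
# to a security-team-supplied CIDR; namespace policies cannot override it.
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: deny-malicious-ips
spec:
  endpointSelector: {}      # selects all endpoints in the cluster
  egressDeny:
    - toCIDR:
        - 192.0.2.0/24      # placeholder blocklist CIDR (TEST-NET-1)
```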
So Cilium is essentially going to inject some eBPF programs inside the kernel to interact with the network stack, and it's Kubernetes-aware: it has the full topology and the IPs, which it can inject into the BPF maps, sharing that data between the eBPF programs and the Kubernetes context. That allows us to have more efficient load balancing and network-policy propagation, and we heavily rely on the deny functionality of Cilium to block at the cluster level.
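The eBPF-based load balancing is something you opt into when installing Cilium; as a sketch only (option names as of roughly the Cilium 1.11 era of this talk, and the API server endpoint is a placeholder), the Helm values look something like:

```yaml
# Hypothetical Helm values fragment: enable Cilium's eBPF kube-proxy
# replacement so service load balancing happens in BPF maps.
kubeProxyReplacement: strict
k8sServiceHost: api.cluster.example.com   # placeholder API server host
k8sServicePort: 6443
```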
So, in order to embrace this kind of new technology, we want to mitigate the risk before going to production, and we're going to show how we've been mitigating that risk by automating the tests. You've got a git repository; you push, and then you've got a build which is going to be deployed.
So you commit, the build is started on our CI agent, and we run all the localized tests, for example linting and vulnerability scanning, and when everything has passed, the Docker image is built and published to the test repository; a sketch of that pipeline shape follows.
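The talk doesn't name the CI system, so purely as an illustration of the flow just described (GitLab-CI-style syntax; the stage and job names are ours, not Sky's):

```yaml
# Hypothetical pipeline sketch: lint and scan on commit, then build
# and publish the Docker image to the test repository.
stages: [lint, scan, build, publish]

lint:
  stage: lint
  script: [make lint]

vulnerability-scan:
  stage: scan
  script: [make scan]   # placeholder for an image/vulnerability scanner

build-image:
  stage: build
  script: [docker build -t registry.example.com/platform-cilium:$CI_COMMIT_SHA .]

publish-to-test-repo:
  stage: publish
  script: [docker push registry.example.com/platform-cilium:$CI_COMMIT_SHA]
```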
So, as part of those tests, we've got two main kinds of test. The first is functional testing, where we rely heavily on the Cilium connectivity test suite, which is provided by Cilium. It's essentially a bunch of pods which are deployed into the cluster, doing DNS checks, HTTP checks, and connectivity checks, and also making sure the Cilium network policies and Kubernetes network policies behave as expected on a running cluster.
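The connectivity suite can be driven from the Cilium CLI; a minimal CI step might look like this (the job and stage names are ours; `cilium connectivity test` is the upstream cilium-cli command):

```yaml
# Hypothetical CI fragment: run the upstream Cilium connectivity suite,
# which deploys its own test pods and performs DNS/HTTP/policy checks.
functional-test:
  stage: test
  script:
    - cilium connectivity test
```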
We've also been adding some additional functional testing, for example making sure namespace network policies cannot override the cluster-wide deny policies. We also want to make sure the identities are limited to some specific labels, to limit the number of identities inside the cluster. For us, we only key on the namespace label, which means every single namespace has a one-to-one mapping with a Cilium identity.
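Limiting identity-relevant labels is a documented Cilium agent option; a sketch of the idea via the cilium-config ConfigMap (key name per Cilium's docs on limiting identity-relevant labels; the value shown is only illustrative of keying identities on the namespace):

```yaml
# Hypothetical cilium-config fragment: compute security identities from
# the namespace label alone, giving a one-to-one namespace/identity mapping.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cilium-config
  namespace: kube-system
data:
  labels: "k8s:io\\.kubernetes\\.pod\\.namespace"
```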
So that's one of the main kinds of test, making sure we've got the functional requirements covered. Then we've got the non-functional testing, where we're trying to have a 30-minute fast feedback loop. What we want to exercise is the full network path and make sure everything works, because Cilium is interacting with the network stack. Exercising the full network path means having a load injector sending load through to a backend, with multiple kinds of communication happening: pod to pod using the service IP, but also cross-cluster communication using the internal and external ingresses, and obviously we rely on Kubernetes hostnames to target these services.
So we're trying to target the worst-case scenario: we have some numbers, like the maximum number of identities we want to have on a cluster, and then we try to reproduce that in our test environment.
Obviously it's very hard to test every single use case. That's why we automate all the tests, making sure we can scale, and every time we get an issue reported, we add regression testing to make sure the issue is not going to happen again. As part of the non-functional testing, we've got four different tests.
The first one simulates identity churn: it's essentially creating and deleting pods, which produces identity churn, and we watch the identities being injected into the BPF policy maps. We simulate, or create, 5,000 identities, and we noticed a small edge case doing this, which I'll come back to. Running all four of those tests obviously exercises the whole Cilium stack: the Cilium agents, the Cilium operator, and the cilium-etcd members acting as the backend.
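For the churn check itself, identity and policy-map state can be read straight off a Cilium agent; a hypothetical verification step (the job wrapper is ours; `cilium identity list` and `cilium bpf policy get` are the agent's own CLI) could be:

```yaml
# Hypothetical CI fragment: exec into a Cilium agent pod and inspect the
# identity count and BPF policy maps after the churn phase.
identity-churn-check:
  stage: nft
  script:
    - kubectl -n kube-system exec ds/cilium -- cilium identity list | wc -l
    - kubectl -n kube-system exec ds/cilium -- cilium bpf policy get --all
```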
When we started running this, we noticed a very small edge-case scenario with Cilium agent restarts: when you restart the agent, you get a small increase in drops in the metrics, but it doesn't affect clients, thanks to TCP retries. We've been working closely with Cilium and Isovalent to get a fix merged upstream. Then you've got the second test, which is exactly the same but without the Cilium agent restart, and there we tolerate zero drops.
B
The
second
test
might
go
away
and
we're
just
gonna
have
the
first
one
but
totally
zero
drop
when,
when
we're
gonna
release
a
new
new
version,
the
third
test
is
simulating,
like
the
cm
network
policies,
recreation
which
is
so
he's
gonna
exercise
like
flushing
the
bbf
map
and
all
the
information
inside
when
you
delete
the
cm
network
policies
but
and
when
you
create
it,
it's
just
gonna.
Splitting the scenarios like this gives us a chance to isolate which scenario is having an impact. To give you a bit more insight: we rely heavily on metrics while the load is running, we've been creating monitors on top of them, and we fail the test if there is any alert generated.
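As an illustration of failing the run on any alert, a Prometheus-style rule over Cilium's drop counter could look like this (the metric name is Cilium's; the threshold, window, and labels are placeholders):

```yaml
# Hypothetical alerting rule: flag the NFT run as failed if Cilium
# reports packet drops while the load test is in flight.
groups:
  - name: cilium-nft
    rules:
      - alert: CiliumPacketDrops
        expr: sum(rate(cilium_drop_count_total[5m])) > 0
        for: 2m
        labels:
          severity: test-failure
        annotations:
          summary: "Cilium dropped packets during the NFT load run"
```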
So, as you can see on the left, you've got the pod creation: we create pods to generate identities, and at the peak you've got about 10 pods created per second. You can see the pod count, which is roughly around 1,500, and which matches the identity delta, that is, the number of identities. We also delete some of them, so you've got identity churn; you can see it go up and down.
You've got the BPF map operations, showing all the operations happening on the BPF maps, and the Cilium drops, and you can also see the four tests with the load injector at 2,000 kTPS. We monitor the load-injection latency, making sure there is no increase, and the CPU and memory figures gave us the ability to properly size the worst-case requests and limits on the DaemonSet.
So when both tests have passed, we promote the artifact, the Docker image, to the external test repository. Then every single day at 6:00 PM we run what we call extended NFT, or extended non-functional testing, which is essentially the same non-functional tests that normally run for 30 minutes, but this time over a longer period, up to a maximum of 16 hours.
When everything has passed, the artifact is promoted to the different organizations, one for NBCU and one for Sky, and they get deployed on different clusters; that's why we've got different organizations. We deploy to what we call predev, which is one specific environment.
Obviously we gather all the metrics and define some alerts, for example latency increases, HTTP errors, and packet drops, and that's how we gate promotion from one environment to the next, through the alerts. At 8:30 we've got the promotion mechanism, which promotes from predev to dev, and so on.
If there is no alert firing, then, because we've got multiple regions, we can stagger the deployment across multiple clusters: at 10 AM you've got the first region, and then the second one at 12, for the same environment. And if there's anything happening on the first one, then we can obviously stop the deployment on the second one. A sketch of that schedule follows.
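The talk doesn't show the promotion configuration itself, so purely as a sketch of the cadence described (the times come from the talk; all key names are hypothetical):

```yaml
# Hypothetical promotion schedule: nightly extended NFT, then alert-gated,
# time-staggered rollouts through the environments and regions.
promotion:
  extended-nft:   { at: "18:00", max-duration: 16h }
  predev-to-dev:  { at: "08:30", gate: no-active-alerts }
  dev-region-1:   { at: "10:00", gate: no-active-alerts }
  dev-region-2:   { at: "12:00", gate: no-active-alerts, halt-if: region-1-alerted }
```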