Cloud Native Computing Foundation Online Programs, 1 Sep 2021

Previous Meeting Next Meeting

⏯

youtube image

►

From YouTube: Your own Kubernetes castle: Building the production ready Kubernetes cluster with open source bricks

Description

No description was provided for this meeting.
If this is YOUR meeting, an easy way to fix this is to add a description to your video, wherever mtngs.io found it (probably YouTube).

A

Hello: everyone, my name, is adam kozowski and I'm here to present you your own kubernetes castle. This is the presentation which tackles a very important topic of building production-ready kubernetes clusters using the open source components.

A

So shortly about me, uh my name is adam kowalski. I work for greypop now over five years on the position of technical leader and cloud solution, architect, and I help big enterprises with their kubernetes adoption and promoting their clusters to production.

A

So, let's think about the topic I put there why a castle, so I imagine that the production cluster is very similar in properties with medieval castle, and why is that? I think they shared the common requirements. uh When someone was building a castle, it was built for years. It wasn't just a temporary project which was meant to work only a few weeks or a few months and then be destroyed.

A

It's a solution that has to be reliable and resilient and work for years, and this is also important topic for production clusters.

A

uh It shouldn't be built with a few weeks in mind because in general this is not how production works, um so it cannot be temporary solution. The other important topic uh when we think about the caster is that it has to be secure.

A

For example, it can be built on a mountain, and this also applies to to bernat's clusters. The cluster has to be secure when it's production, um it depends. If the cluster is development, the development doesn't have to be as secure as production, but production cluster.

A

I think that security is one of or maybe the most important topics when you think about that.

A

The other thing is that there should be a supporting infrastructure both for a medieval castle when you need to a clean water, supply or blacksmiths of carpenter services and the same thing maybe not same explicitly, but the similar thing for the kubernetes clusters.

A

The kubernetes is just an orchestrator, it's a tool which has a specific use case for orchestrating your workloads, orchestrating your containers to make it production ready and production viable.

A

You need to install the supporting components which help with the topics that are not handled uh directly by the kubernetes like storing and displaying clocks, storing and displaying metrics or, for example, just the automation for cicd and the last topic, which is rather important and also shared between the castle and kubernetes production cluster. Is that accents. The access to the castle depended uh in the medieval era on your role in the society.

A

So king have a different access, knight has different access and the blacksmith has a different access to the inside of the castle, and the same thing is for the kubernetes cluster, your user and the group you're in.

A

Is the dependency uh which says what you can do in the cluster? So if you, uh if you are an admin, you can create any kind of resources like you can can do anything in the castle. But if you are just a developer, maybe you are not able to configure, for example, ingresses or storage classes, so it really depends on your group or your role.

A

So that's first part of the topic, but there is also a second part uh why the open source and why the open source components, so the open source is a rather broad term. It may refer to the distribution model, to the license or just the open source movement and to describe why open source I will use the definition, which is the open source model, is decentralized and focused on open collaboration, because this is why I picked the open source when you use the open source components.

A

It's based on the open collaboration. Anyone can make a change. Anyone can propose the change, create a pull request. Not all of the pull requests are accepted, but this is not just the end of the word. You always can create a fork uh either for your own use or share it with the community and use that solution in your product. So it's not closed source. You can anytime make any change and just alter it for your needs.

A

There is a community collaboration. uh People are trying to solve the issues. It's not just a single person trying to to make it all work. um There are even companies sharing uh the source code and helping with the development of open source tools and the affordable pricing. uh It sounds a little bit funny because in most cases we think about the open source as a free tool and in most cases it is free.

A

But it's also important to consider that the open source does not explicitly mean it's free. It only means it's publicly available. So if it's free in your use scenario for your needs, it's better to check the license. There are different licenses, and most of them in general are very permissive.

A

But if you think about changing the code, you need to make sure that the license that does not, for example, require you to publish. It sometimes does uh required. Publisher changes.

A

So what are the bricks? The last part of the topic?

A

I would divide them into two main aspects: the tools and the configuration in this presentation we'll focus on the tools which are also multiple topics like observability automation or security.

A

But the important part which is not part of the present in this presentation is the configuration like role-based access control configuration in your cluster uh cubelet, fine tuning authentication, authorization like implementing uh open id or ldap to access the cluster itself and making sure the underlying infrastructure is resilient, safe and backed up.

A

This is very important and to make sure the solution is really production ready. You need to combine both the good tools, good components and a great configuration.

A

So in this presentation I will show you a multiple different, open source tools and components which I would recommend for using in your kubernetes production environments. um But how did I select them? So all of them uh were tested by greypop and by me in our projects, and they are proven to work great in the production clusters.

A

They are all open, source licensed and have active community support. This is very important because you can always find an open source tool which is working, but the last comment was like a year ago and the last answer for for the case or a bug is three months old. So that's not really useful.

A

uh The active community support is really important for this kind of tools that if there is a security issue, uh even if you can fix it, someone has to accept the pull request right um and the other important thing is that all of them are part on the of the cncf uh cloud native foundation landscape.

A

So you can always find them there and check how they compare with the other tools which are not present on this presentation.

A

Okay, so let's start with the first topic, observability and what I mean by observability, so observability uh is an ability to make sure what happens inside your cluster. So the core components for that uh which we will go through later are locks matrix and the network, which was which is observable through the service mesh, for example.

A

So those are three important aspects of observability and why do you need observability? Obviously, you need to always know what happens in your cluster when it's in production, you need to make sure that it's operating correctly, you need to make sure you're getting alerts when something might get wrong in the nearby future or just happened wrong a moment ago.

A

So the first one will be logs and for locks. I put or selected two components. Two systems one is elk, which is elastic, search log station kibana and the name may be also uh the often used name is also efk, which is elasticsearch fluently kibana, and this is a widely used tool, um very common, very popular, uh very capable of handling extremely large amounts of logs and and storage, but also quite complicated with the configuration, especially if the configuration has to be ha so highly available, and you store a lot of locks.

A

It might be rather complicated to configure elk if you have no experience. The huge advantage of it is the query language which provides an ability for the full text search, uh which is uh not that common. This is something uh which uh elk is great with, and uh this makes searching clocks much easier than some of the other tools, um the disadvantage of the elk itself. Maybe it's not a huge disadvantage.

A

It's just the uh small limitation that uh the security is not part of the open source part, uh but there is a workaround for that which is called the open disk show. It was created, I think, by amazon, and this is a version which extends the the basic tool set with the security and other important tools which are not present in the basic installation.

A

So if you need the security part multi-tenancy, uh you can use the open distro for that, but sometimes even for the production cluster, the elk stack is too big or too complicated to configure, and for that situation you have griffana loki, it has much lower the resource footprint than elk and it shares the ui with the graffana uh grafana is used for the matrix.

A

It's very easy to install too um the disadvantage of it is it's much simpler. It doesn't have the visualization that kibana has for the matrix and the query. Language is very similar to the prometus query language, and this means it's limited. It's not the full text search, it's just it's more like regex, so it is uh limited in this uh in this case, and it might be harder to find the lock you are looking for, uh but both are great.

A

It just depends on your on your needs next topic matrix and for that actually I just picked one solution, because it's so popular and so widespread um that I think it's something I could recommend easily, and this is the set of grafana parameters and alert manager and most of the installers, for example, the prometus operator installs the full set at once, because uh all of them depend on each other.

A

The prometus is a tool which gathers and stores the metrics, um so this is like a pull or push model depending on the configuration and the metrics are stored in the tsdb, which is a database holding the metrics sorted by a timestamp, so time series db and to see the matrix to see the dashboards and display them for the user. There is a grafana which is very nice tool, highly configurable for creating dashboards, and it reads directly from prometus, so it doesn't really require a lot of storage and a lot of database.

A

um It just stores the dashboards and all the metrics data is held in most cases. It depends on your caching strategy too, but in most cases the data is pulled directly from promedus and for the metrics.

A

The metrics are also based for alerts, and there is a tool which is called alert manager, which is also part of this tool set, and this tool is able to alert and notify you about the problems or alarms created based on the metrics either the metrics coming right now or the historical ones, and you can, for example, create an alert when the cpu usage is slow is too high.

A

The amount of storage available is getting low and, for example, if the bandwidth is getting low for some specific services, so it's highly configurable you can use any metrics. You want and very complicated mathematical formulas if it's required for calculating a specific alert and as a tool set.

A

It's really great, especially for the start, when you need to monitor your cluster, because if you use the promotions operator installer, but it's not just limited to that, you can easily install it using as a single crd, but the crd is not just for installing the parameters graphene and ad manager, but also it is able to read- or you are able to use the crd to create alerts and dashboards dynamically, which is great, and you can even allow users to add them their own dashboards to their tools and most of the tools which are on this presentation and most of the open source tools from the cncf landscape already have the examples or existing grafana dashboards, which is also great.

A

So so the support for promising graffana is very often built in, and sometimes you need to scale up the metrics system and there are also open source. Two solutions: cortex antennas, they're very similar.

A

The difference is that the cortex wasn't designed for scalability, while thanos was designed with a small footprint in mind, so how cortex works? It's a centralized system, while the thanos is deployed as a sidecar.

A

So for the thanos there are more primitives instances, but smaller ones. Cortex is a single big one and the this is also different from the query perspective and the storage perspective, because thanos has to query all the promote uses and then gather the results which it has a great great code for for making queries and the final system, which makes it very very fast uh when the cortex has a centralized storage.

A

So the matrix all metrics are sent to the single storage, so this is, for example, easier to backup.

A

So those are two different tools uh when you need to scale up your metric system and the other thing you might need for observability is the service mesh.

A

A service mesh is a dedicated layer for making service to service calls or container to container calls, and the idea is to solve the challenges of the developers when they need to call remote endpoints or the endpoints inside the cluster like making the secure by default or adding the service discovery, but also a service mesh is a set of a proxies which abstract the network inside the cluster and because all the traffic goes through.

A

This those proxies uh there is very often observability, built in which makes it easier to gather metrics about the bandwidth bandwidth being used, amount of connections failing or being successful.

A

um So that's a very important topic too, and for this topic I picked two solutions. um They are very very similar and there are not that many differences. uh The istio is very popular.

A

I think it's the most popular service merch right now, and it has a lot of examples and code snippets and a lot of stuff already built in especially in terms of documentation and the articles, and it has a multi-cluster support, but compared to linker d, is slightly less performance than istio uh that it's slightly less performant than linkardi uh when the traffic uh is high. So there are high loads of large amounts of data being transferred.

A

The link id was built with performance in mind, it doesn't have multi-cluster support and some of the features of eco-like circuit breaking are missing, but the resource footprint is very small and it makes it uh much faster if there is a large amount of data to be transferred.

A

So the next topic is automation or continuous integration, delivery and each cloud native solution. Each production cluster has to consider this. This aspect of deploying and developing applications so making sure the builds are reportable and observable, and also proper version control and for production clusters, mainly the application delivery system, which is also reliable.

A

So, let's start with the rather complicated topic, which is a githubs and why githubs is complicated. It is complicated because it tries to solve a very important aspect of the development.

A

So it's trying to solve the the problem that developer has to be able to deploy their applications automatically from the development to production, and the problem with that is, um there has to be a single source of truth like, for example, the git repository, which holds all the configuration, and then it's pulled for uh for changes by the argo, cd or reflex in this case, and if the state of the cluster differs from the repository, it has to be updated.

A

So in theory this is rather uh easy concept, but there are caveats that are very hard to solve, like, for example, the secrets. The secrets are not really safe. In kubernetes, if you store them uh as a secret because they are not encrypted by default, and so the githubs has to read this secret somewhere and you shouldn't put it them, obviously in each repository.

A

So this is the part of configuration that is very often uh challenging and the configuration otherwise is rather simple and why I picked those tools. I've tested both of them and they're, all that they're both nice and there are also small differences between argo and flux.

A

The argo has a very nice ui, so it's easier just to look at it and see how it behaves, and it has a great multi-cluster support. So for each project, each component you configure in there you can set the target cluster, um so there can be multiple targets.

A

So so the argo technically supports the the multi-cluster design, while the flux uh is able to only read one remote repository and one target cluster. So that's a limitation, um but not a big one, because in most cases uh you can live with just having one with flux in the in your cluster and also it's not. It doesn't have an ui, but it has a nice cli for management.

A

um Both of the tools are only continuously very git obstacles, so there is no continuous integration and the continuous integration will be the next part.

A

So for the continuous integration I picked two tools, it's jenkins and the conkers, um and both of them are great. um The jenkins is widely used. uh It has a huge adoption. I think almost every company or every developer have used the jenkins at some point of their of their journey and it has tons of plugins available.

A

So you can install almost everything uh as a plugin, but it's also a little bit monolithic, it's harder to install and configure than than the other tools and the uh configuration as a code is a little bit strange, compared, for example, to the conquers, because it's partially configured through ui and partially through the code.

A

The lightweight alternative to jenkins is the already mentioned. Concourse.

A

The conquest is very easy to install and it has a very great system of pipelines which are deployed through cli the fly and are written in the ammo, so the ui is very clear: all pipelines are described in the ammo so that there is literally no way to change the configuration to the user interface, and it has a very nice way of it was designed in a way that the workers are very lightweight and fully isolated, so each worker in the concourse runs in the container, which is fully isolated in jenkins.

A

It is also possible, but this is not how it was designed initially, so it requires some more a little bit more sophisticated configuration.

A

The next important topic for your cluster is the ingress controller and what is ingress controller? Let's start with that. Ingest controller is a combination of load balancer and a proxy which is responsible for reading the kubernetes, ingress resource or object and based on those objects. It draws the traffic incoming traffic uh for your cluster to a specific service or set of or set of services.

A

For this I picked three uh three possible ingress controllers. There are much more of them. There is, and there are literally, I think, it's 12 or something like that. There are a lot of ingress controllers. I'm not saying those three are the best.

A

Those three are the ones I have tested, so the first one and the the one which is part of the official kubernetes documentation is the kubernetes ingest, controller or nginx ringers controller, because it's based on nginx- and this is a small disadvantage of this tool, because there are two ingress controllers: one is kubernetes ingress controller and one is nginx increased controller and even though the kubernetes ingress controller is based on nginx, this is not the same thing.

A

So if you go to the web pages and try to figure out uh what's the difference and which one is which it's slightly more um more complicated than it seems initially, but both of them have a great advantages. So a lot of people know nginx, it works great and it has nice configuration.

A

It requires some knowledge to write the nginx configuration correctly. But if you have this knowledge, you can configure almost everything and some or maybe, let's say a lot of these configuration options are available in the ingress object as annotations.

A

So you you can use, take advantage of most of the configuration or configurability of nginx just using the annotations on the ingress objects, which is very great.

A

um The difference between the kubernetes, the ingress controller and the nginx ingress controller uh is not big. um The engine x1 provides the query parameter support as the extension of just route and path and the kubernetes one has.

A

Additionally, the authentication part like basic authentication can be configured using the uh the kubernetes ingress controller and the two alternatives. I picked. uh The traffic, which has now version two and he proxy and the traffic advantage is it. It has a very nice user interface, so um if you need to quickly look at what's wrong, what's going on in in in the ingress controller, the traffic is great. For that you have an e easy to use user interface.

A

You cannot make any configuration changes for from there, which is also rather good, because you can uh allow access to this ui for for the developers and it's really easy to install and the disadvantages of this one.

A

So when it was switched from version one to version two is it lost support for some of the features? uh I think they might just add them later on in the in the development, and it supports the native uh resources native ingress resource, but it also uses its own crds, which is slightly confusing when you just start from there and the hd proxy alternative.

A

I put it here because it's very performant compared to any other ingress controller from some of the tests and some of the comparisons it may be, even the most performant load balancer. It has a lot of configuration. You can configure a lot of things, but it doesn't have a user interface and the configuration may be slightly harder than the kubernetes and traffic ones, because there is less less resources available for that.

A

But if you need a really performant ingress controller, the the hi proxy might be maybe your choice.

A

So the security um security in the cluster is a very important topic and, as I said previously in the in this presentation, the configuration part of the cluster like making sure only authorized users are able to access it and the role-based access control are back for making sure only specified people can make.

A

Specific changes are very important, um but you're not really limited to those uh two or three aspects of the kubernetes security. There are tools that can be installed that help you with with making sure that cluster is secure.

A

um So first important topic is open, id or oidc open id connect provider and those providers um are just identity, layer for verifying end users, authentication by using external authorization, provider or third party, and both of them are very easy to use- and it's just nice to have the open id or oidc provider inside your cluster, um because you can easily change the third party which is used for authorization and verifying the identity of the user. So that's great- and here I picked two of them and uh they're also very similar.

A

um The dex is just a simpler tool than the key cloak. It's easy to use. It's simple: it's just the oidc proxy, so it processes your authorization, authentication requests to the other other provider, but it has a slightly limited capabilities compared to kick lock and it's just a proxy. No automation, no custom claims nothing like that. It just process your requests, but if you need more than dex provides, you can use the key clock which is very extensible and advanced in configuration, and it has an ui.

A

um So that's two nice things to have, but also it allows you to create a custom flows. Two-Factor authentication, so you you can configure more.

A

More secure system with that and um disadvantage of this solution will be it's harder to configure than dex and requires additional database for storing all this configuration and the user claims all the custom custom profile changes, uh so even conceptually it's just bigger to to deploy uh just it's just a bigger solution has a bigger resource footprint than dex and the next backup and restore so backup and restart.

A

I for me personally: it's either the most important topic or the second one uh in in the security, uh and this is because a lot of people take backup and resistor for granted, um because a lot of solutions provide this kind of thing and a lot of companies do not test the restore functionality.

A

So as long as the backup works it's tested once initially or even not that- and this is all- and the backup and restore toolkit is probably the most important part of your production ready deployment, because you cannot rely on the fact that what you created is so great it's. It will survive any kind of disaster, uh the underlying infrastructure, like aws cloud or or azure cloud, it's so resilient.

A

It doesn't fail that often, um but there is a very small chance that it may fail and if the underlying infrastructure fails- and you have no backup- you have a very- you- have to have a very great disaster recovery plan to recover from that. And you, if you have a backup, it's just as easy as restoring the backup.

A

The bigger problem is that how to configure the backup and restore correctly, especially if you're, using either on-premises kubernetes deployment or it's using the uh not so widely used and supported cloud because, for example, some of the clouds like eks um they they use for the persistent volumes. The this storage did the volumes from the cloud provider which are which can be configured to be backed up automatically but by the um by the provider.

A

But then um you might need to copy the data somewhere else. In case something happens with that provider and the valero is a solution which is in most cases independent on the provider, because it has a support for all common clouds. Like aws azure, google cloud. uh You can think of anything like that, but it also has um a tool built in which is called rustic, which allows to make an image of the persistent volume which is not supported directly by the valero.

A

So if you have a provider like, I don't know openstack, which might be not supported by valero, you can use rustic to to make a backup of of this volume, just a plain backup bit to beat, but still a backup um and just have a working backup and resource strategy like that.

A

um The disadvantage is that it doesn't have an user interface, it's probably not the biggest disadvantage in the world and the backup metadata is start without versioning. So sometimes uh you might. You may be able to break the mechanism by just removing the file by mistake or altering the file. So it's not really recommended, but just make a backup of your backup tool, let's say, um but there is also an alternative which is not an external tool but a part of kubernetes itself.

A

It's called volume snapshot and volume snapshot is a new resource in kubernetes, um but it was introduced recently and it's now supported. But but technically um it's really new and it requires not just a new kubernetes version, but also a supported csi driver version for infrastructure.

A

So the infrastructure provider has to implement the support for volume snapshot in the csi driver for that infrastructure provider for the club provider and then and only then, the volume snapshot is going to work.

A

But if you are able to run the volume snapshots, uh it's a great solution, because it's native it's supported by natively by kubernetes, it's easy to configure. You can configure the backups as a part of your cluster configuration um and it's natively supported by the csi driver. So there is no external tool for monitoring like valero to make sure it works. So if it works, it's uh controlled by the kubernetes.

A

And to finish the security part, there is one more tool that I wanted to talk about, and this tool is an open policy agent.

A

What is open policy agent? It is a tool which supports the maybe, let's say the different way. Open policy agent is a tool which extends the role-based access control abilities with role-based access control. You are able to configure for specific user or group of user. What they can do and anything you don't configure is just denied.

A

So this is only it's kind of a white listing way of configuring, the security and only based on raw resources of kubernetes. So you can say the user is able to create a deployment. The user is able to create a pod. The user is able to delete or update the ingress, but you are not able to say user is able to create a deployment which only contains a single pot.

A

So this is not possible we've with airbag, but this is possible with the open policy agent and the gatekeeper tool which is used by it and what the uh open policy, agent or opa does it provides and system a component which can verify the policies written in the regal language versus the json documents, and this tool is configured through web. You can configure the open policy agent or gatekeeper as an admission webhook in kubernetes, and then each change to the cluster is sent to opa and to be verified if it should be accepted or denied.

A

um So, for example, you can configure or create a policy in open policy agent service uh using the regular language to say, for example, you are only allowed to to create a deployment, a pod using our container registry, so you have, for example, artifactory installed locally in your network, and you don't want people to use the containers coming from the internet because sometimes the internet connection is quiet uh and there is always a possibility to use the firewall to limit this kind of behavior.

A

But with the open policy agent you can just say. The only allowed container registry is is our artifactory using the regex versus the image name, and this would work great, and it also allows you to only limit this behavior to either specific namespace or specific set of namespaces.

A

So, for example, you might want to have a space where you need to make some experiments which might require access to a different repository external one or just different one, and then you can just say: okay, this open policy agent is checking the namespace too inside the policy and verifies that or you can just annotate the namespace to make sure that it's not being taken care by the open policy agent.

A

So for our all more sophisticated policies or more sophisticated security, the opa is a very great tool to expand what you can do with airbag and make sure that the actual behavior of people is more controlled than than what you can do just with the airbag and firewall rules.

A

um So that's into that! That's all for that topic. um I hope you. You have learned something about those tools and you have seen that the open source may not seem uh initially as a greatest solution, but it contains everything you need to create a working, reliable and secure kubernetes production ready cluster.

A

um So that's it. Thank you and have a good day or have a good night.

A