Cloud Native Computing Foundation Kubernetes Community Days (KCD) Chennai 2022, 30 Jul 2022

From YouTube: Production Grade Kafka on K8s by Anand Iyer

Description

Stateful workloads have always been a challenge in the Kubernetes ecosystem. Kafka, one of the most widely used event-driven systems, is becoming a backbone system, and with most infrastructure workloads now on Kubernetes, we need a robust and native way to communicate with Kafka inside Kubernetes. In this talk, I would like to introduce a Kafka Operator for Kubernetes that eases the deployment and management of Kafka workloads.

A

My name is Anand. Currently I'm working as a staff engineer at Zscaler.

A

Today I'm going to present production-grade Kafka on Kubernetes. We'll see how Kafka can be built in as a first-class citizen in Kubernetes, so that people who are familiar with Kubernetes will have an easy time deploying Kafka.

A

Let's look at the agenda first. We're going to cover an introduction to Kafka, Kafka's capabilities, a typical traditional deployment and how the Strimzi project helps. We'll then look at the overall deployment architecture and the Kubernetes operator design for Kafka, and we'll look at a demo as well. So, Apache Kafka. For folks who are new to Kafka, I'll take another five minutes to quickly brush you through the concepts.

A

Kafka is an event streaming technology with the capability to handle trillions of records. It is essentially a commit log with a basic data structure. Since being created and open-sourced by LinkedIn in 2011, it has been used as a full-fledged streaming platform.
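The commit-log idea can be illustrated with a small Python sketch. This is a toy model under stated assumptions, not Kafka's implementation: the CommitLog class and its method names are hypothetical, and the log lives in memory only.

```python
from dataclasses import dataclass, field


@dataclass
class CommitLog:
    """Toy append-only log: records are only ever appended, never modified."""

    records: list = field(default_factory=list)

    def append(self, record: str) -> int:
        """Append a record and return its offset (its position in the log)."""
        self.records.append(record)
        return len(self.records) - 1

    def read_from(self, offset: int) -> list:
        """Return every record at or after the given offset."""
        return self.records[offset:]


log = CommitLog()
for event in ["created", "paid", "shipped"]:
    log.append(event)

# A consumer that rewinds to offset 0 sees the full history again.
replayed = log.read_from(0)
# A consumer that already processed offset 0 resumes from offset 1.
resumed = log.read_from(1)
print(replayed, resumed)
```

Because readers only track an offset, rewinding and replaying is just a matter of reading from an earlier position.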

A

A Kafka cluster is a group of Kafka brokers that handle the delivery of messages. The brokers use Apache ZooKeeper for storing configuration and for cluster coordination; the typical leader election mechanism is also taken care of by ZooKeeper. As for capabilities, microservices use Kafka for sharing data.

A

It is highly useful if your data requires high throughput and low latency. It guarantees message ordering. It provides a rewind and replay mechanism so that you can reconstruct your complete application state. Message compaction is provided. You can horizontally scale your cluster configuration, replicate data to control your fault tolerance, and retain high volumes of data for immediate access. All of these you get with Kafka. As for use cases, it's of course very popular in event-driven architectures.

A

It is also used in event sourcing to capture changes, message brokering, activity tracking, operational monitoring through metrics, log collection and aggregation, commit logs for distributed systems, and also stream processing, so that applications can respond to data in real time. If you are building pipelines, Kafka would be the central nervous system in most of them.

A

There are certain concepts and terminologies. Let's go through them quickly so that you understand what we mean. A broker is basically a server or a node that orchestrates the storage and passing of messages.

A

A topic is more of a logical concept: it provides a destination for the storage of data, and each topic is split into partitions.

A

A cluster is a group of broker instances. Partitioning takes a single topic log and breaks it into multiple logs, each of which can live on a separate node in the Kafka cluster. This way, the work of storing messages, writing new messages and processing existing messages can be split among many nodes in the cluster. So the partition is a key concept in achieving high availability in Kafka.
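A minimal Python sketch of how one topic log can be split into several partition logs by message key. The function names are hypothetical, and crc32 is a stand-in hash chosen for illustration; Kafka's own default partitioner uses a different hash function.

```python
import zlib


def choose_partition(key: str, num_partitions: int) -> int:
    """Map a key to a partition index with a stable hash (crc32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_messages(messages, num_partitions):
    """Split (key, value) pairs into per-partition logs, preserving arrival order."""
    partitions = {p: [] for p in range(num_partitions)}
    for key, value in messages:
        partitions[choose_partition(key, num_partitions)].append((key, value))
    return partitions


messages = [("user-1", "login"), ("user-2", "login"), ("user-1", "logout")]
partitions = partition_messages(messages, num_partitions=3)
for index, partition_log in partitions.items():
    print(index, partition_log)
```

Messages with the same key always land in the same partition, so their relative order is preserved within that partition even though the topic as a whole is spread over several nodes.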

A

Then you have a partition leader, which handles all the producer requests. Then you have a follower, which replicates the leader's data and can also serve consumer requests. So in total, a Kafka cluster comprises multiple brokers. Brokers contain topics that receive and store data. Topics are split into partitions, where the data is written, and partitions are replicated across brokers for fault tolerance.

A

Let's look at the typical component interaction. Here you can see the Kafka cluster, which is the set of brokers, and then you have ZooKeeper.

A

These internal communications are secured with TLS. If you want to build metrics on top of it, you have the Kafka Exporter. If you want your clients to talk over HTTP, you can have the Kafka Bridge. You may also have use cases where you want to integrate an external system directly with Kafka, or where Kafka needs to send data directly to an external system.

A

That's where Kafka Connect comes into the picture: we can use source connectors and sink connectors to do this integration. Kafka MirrorMaker is mostly used for data replication scenarios. That's pretty much the typical set of Kafka components you would see. Now let's look at a traditional deployment. Just imagine there is no Strimzi project: how would you approach a Kafka deployment on a Kubernetes environment?

A

Basically, we will create StatefulSets, because you need persistent volumes to make sure the commit logs you are storing are quickly accessible. So you are going to create StatefulSets for ZooKeeper and the brokers, deploy these replica sets and manage the endpoints for external access. You have to manage the versions of all the resources.

A

Remember that for a given broker version, you need to have the right ZooKeeper version as well. Then you have to build your own observability stack. You have to perform upgrades and rollbacks and manage the scalability challenges, and you will also have to build a lot of tools just to maintain this complete stack. So it's complex. Just imagine a production-grade scenario where you have more than 100 brokers and more than 20,000 partitions.

A

What would that setup look like? It's going to be very complex, given that you have to manage all of these resources, and that's where Strimzi comes to our rescue. Strimzi provides a way to run a complete Apache Kafka cluster in Kubernetes.

A

It provides lots of deployment configurations. For development, it's as easy as running it on kind. For production, it gives many capabilities, such as rack awareness, deploying across different availability zones, and applying taints and tolerations to make sure Kafka runs on dedicated nodes. All of that is possible. It also allows us to expose Kafka to end clients in a more secure way. It provides access through NodePort, LoadBalancer, Ingress and OpenShift Routes.

A

For security, it provides mTLS, SCRAM-SHA and a layer of authentication plus authorization as well. The Kubernetes-native management of Kafka is not limited to the broker: Strimzi allows you to manage topics, users, MirrorMaker and Connect too. Everything is done using custom resources.

A

So it's a one-stop shop for deploying everything related to Kafka. This means we can now use familiar Kubernetes processes and tools to manage our complete Kafka application. The whole idea is to make Kafka a first-class citizen in the Kubernetes world.

A

This benefits all of our SREs, because they are already looking at Kubernetes resources, and now one of our critical applications behaves like a Kubernetes resource too. That's where the Strimzi project comes in handy. Let's look at some of the features. It allows you to deploy and run Kafka clusters with seamless installation, deployment and upgrade processes. You can manage all the Kafka components. You can manage the different dependencies as well.

A

Whenever you deploy a particular version of a broker, it will make sure to spin up the right version of ZooKeeper as well. It gives you very configurable access to Kafka. It provides a secure way of accessing it. Upgrading Kafka is easy.

A

Apart from the deployment process, you can also use the same approach for creating and managing topics and for managing users. All in all, anything related to Kafka is managed, so you don't have to look at any external tool for these things.

A

Let's look at the design. In Strimzi, the majority of things are governed by operators. Different operators have their own responsibilities.

A

Here, the cluster operator is responsible for deploying and managing your complete Kafka cluster. It is responsible for upgrading your brokers and your ZooKeeper nodes, and for making sure you have the right deployments and the right set of replicas running. All of that is governed by the cluster operator.

A

The topic operator manages anything related to Kafka topics, so you can create Kafka topics on Kubernetes using custom resources (CRs), and the same goes for the user operator. In general, you are giving the operators more power to manage your Kafka clusters. At the same time, you also get the isolation of roles that this offers.
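As a rough illustration of a topic expressed as a custom resource, the Python sketch below builds a KafkaTopic manifest as a dictionary and prints it as JSON. The helper name kafka_topic_manifest and the names "orders" and "my-cluster" are hypothetical; the field layout follows the Strimzi v1beta2 API as I understand it and should be checked against the Strimzi version in use.

```python
import json


def kafka_topic_manifest(name: str, cluster: str, partitions: int, replicas: int) -> dict:
    """Build a KafkaTopic custom resource as a plain dictionary."""
    return {
        "apiVersion": "kafka.strimzi.io/v1beta2",
        "kind": "KafkaTopic",
        "metadata": {
            "name": name,
            # This label tells the topic operator which Kafka cluster owns the topic.
            "labels": {"strimzi.io/cluster": cluster},
        },
        "spec": {"partitions": partitions, "replicas": replicas},
    }


# "orders" and "my-cluster" are hypothetical names used for illustration.
manifest = kafka_topic_manifest("orders", "my-cluster", partitions=3, replicas=1)
print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON as well as YAML, the printed document could be saved to a file and applied like any other Kubernetes resource.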

A

In fact, if you are not interested in using the user and topic operators, you can deploy only the cluster operator, use it just for cluster management, and use any other Kafka tooling to create the topics and users. It supports these different deployment options as well.

A

Let's look at what the complete deployment architecture would look like in a particular scenario. I want to present more of a ten-thousand-foot view of how you would see a Strimzi-based project deployed on your Kubernetes cluster.

A

One side of the diagram is the Kubernetes cluster where all of your client applications and services are running. Those clients connect to the other side, which holds your Kafka deployments. This is where you will deploy Kafka, and it will run on dedicated nodes. Here you can see it's running on a dedicated Kubernetes cluster, under a kafka namespace.

A

We have also divided them across different availability zones, so brokers as well as ZooKeeper nodes are in different availability zones, and you can see it has been exposed as a load balancer service. Any service trying to communicate with Kafka will use this load balancer service.

A

Of course, there are different ways to expose this.

A

The ideal production practice, typically in the AWS scenario, is to have both of these VPCs peered and to expose this NLB as an internal load balancer. The internal load balancer makes sure that no outside access is granted and only the applications can communicate with it.

A

Similarly, you also have a scenario where your SREs can communicate more efficiently: here you can see they talk directly to the operator. Let's look at how your Kafka deployment works. Here you can see an operator, which deploys and manages your complete deployment.

A

Your operator talks to the API server and constantly reconciles, making sure you have the right number of replicas of your brokers and your ZooKeeper nodes running. SREs can also apply these CRs to build more on top of it, for example to create a Kafka Connect cluster or a Kafka MirrorMaker. All of these are then honored accordingly once your Kafka cluster is deployed.
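The reconciliation idea can be sketched in a few lines of Python. This is a toy model of a single reconcile pass, assuming desired and actual state are simple replica counts; the function name and the action strings are hypothetical and do not reflect the operator's real logic.

```python
def reconcile(desired: dict, actual: dict) -> list:
    """Compare desired and actual replica counts and return the actions needed.

    Both arguments map a component name to a replica count.
    """
    actions = []
    for component, want in desired.items():
        have = actual.get(component, 0)
        if have < want:
            actions.append(f"scale up {component}: {have} -> {want}")
        elif have > want:
            actions.append(f"scale down {component}: {have} -> {want}")
    return actions


desired = {"zookeeper": 3, "kafka": 3}
actual = {"zookeeper": 3, "kafka": 1}

# One pass reports the drift; once actual matches desired, a pass is a no-op.
print(reconcile(desired, actual))
print(reconcile(desired, desired))
```

An operator runs a step like this repeatedly, so manual drift or a failed pod is corrected on a later pass without anyone re-running a deployment script.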

A

Most of the time, the ZooKeeper and load balancer services are never exposed outside unless there are certain use cases where you want to do debugging. Otherwise, access is mostly restricted to the SRE or ops teams only.

A

So, holistically, when you say a complete Kafka system or ecosystem is needed, what Strimzi offers is the complete ecosystem needed for production-grade systems. We just discussed the Kafka components, the brokers and ZooKeeper. We also talked about the Kafka cluster operator, which allows you to upgrade, manage and maintain your clusters.

A

On the other hand, you also have the Kafka resource operators, which are mostly for creating topics and users. You have an observability stack, which allows you to generate metrics from your resources, track those resources and create alerts on top of them. It has a very nice set of configurations, and a lot of open-source configurations are available that you can tune. Strimzi comes with a lot of sample Grafana dashboards, which we can use to look at all of these metrics.

A

Then we also have the Cruise Control capability, where you can make sure the load across all of your brokers is evenly balanced.

A

It performs anomaly detection to make sure that none of the brokers exceed certain thresholds. It also helps you avoid throttling based on CPU, requests, memory or incoming events. You have many options, such as the number of topics or the resources being used, so you can make sure that no given broker is overloaded.

A

All of these capabilities can be taken care of with Cruise Control. Cruise Control is another interesting project from LinkedIn, and it makes sure that your cluster is evenly balanced on a system at scale. Then you have the connectors ecosystem. It's one of the ways you can optimize how you communicate with external systems.

A

A very classic use case is when you are generating events onto Kafka and want to send this data on to some external system. You don't need a microservice for that; you can directly use a Kafka connector. For example, we have done this for Snowflake and for Neo4j, where data flows from Kafka events straight into Snowflake tables with the help of the connector. You can also specify how many tasks can run on that particular Kafka Connect cluster.
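A hedged Python sketch of what a connector declared as a custom resource might look like, including the task limit mentioned above. The helper name, the connector name "file-source", the Connect cluster name "my-connect" and the config values are all hypothetical, and the file source connector is only a simple stand-in for the Snowflake and Neo4j connectors from the talk. The field layout follows the Strimzi v1beta2 KafkaConnector API as I understand it.

```python
import json


def kafka_connector_manifest(name: str, connect_cluster: str, tasks_max: int) -> dict:
    """Build a KafkaConnector custom resource as a plain dictionary."""
    if tasks_max < 1:
        raise ValueError("tasks_max must be at least 1")
    return {
        "apiVersion": "kafka.strimzi.io/v1beta2",
        "kind": "KafkaConnector",
        "metadata": {
            "name": name,
            # Ties the connector to a Kafka Connect cluster managed by the operator.
            "labels": {"strimzi.io/cluster": connect_cluster},
        },
        "spec": {
            "class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
            # Upper bound on the number of tasks the connector may run.
            "tasksMax": tasks_max,
            "config": {"file": "/tmp/input.txt", "topic": "orders"},
        },
    }


connector = kafka_connector_manifest("file-source", "my-connect", tasks_max=2)
print(json.dumps(connector, indent=2))
```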

A

All of those are configurable as well. Kafka Bridge comes in where you want an HTTP connection model rather than a general TCP connection; that is also supported. Kafka MirrorMaker gives you a disaster recovery solution for your complete Kafka system. Currently, the more prevalent version is Kafka MirrorMaker 2, which uses the Kafka Connect design pattern for disaster recovery. It supports both active-active and active-passive setups.

A

You can also build more and more tools on top of it, such as a Kafka UI; a simple Kowl-based Kafka UI can just be embedded into it. The great thing about this is that your complete project is very extensible. Strimzi allows you to plug and play all of these components very easily and, at the same time, manage all of these resources on a central plane.

A

So, let's look at the demo and see how things work. Okay, I have an operator running. I'll just show you my project structure, to make sure that I have nothing deployed.

A

Currently my cluster looks like this: I only have an operator running, and you can see the operator logs here. Let's go ahead and do a deployment. I'm going to deploy a CRD; we'll talk about the CRD in a minute. Let's look at the operator. Okay, some action has happened, as you can see here.

A

It has created this. Let's look at the activities happening here. It has started to create ZooKeeper first, and you will see that, with the CRD, your deployments are streamlined and sequenced: ZooKeeper is deployed first and only then the brokers, and all of this is taken care of by Strimzi itself. You can see ZooKeeper is now running, and it will schedule the broker immediately.

A

Yes, you can see the broker has also started. Another way to look at this is by running kubectl get kafka. You are now able to track Kafka resources using a Kafka CR, so this is a one-stop shop for me to look at everything related to Kafka. It says that I have one desired Kafka broker and one desired replica for my ZooKeeper. It also gives me columns called ready and warnings. Until the cluster is ready, I will not allow connections to be opened to my end clients.

A

If there are warnings, I can make sure that connections are not opened until I fix those warnings. So here we make sure that ZooKeeper and Kafka are running, and then we are also running an entity operator.

A

The entity operator has three containers: the topic operator, the user operator and other sidecars, which combine together as the entity operator. Until all of these are running, the cluster will never show up as ready. If I do a describe on this, you can see the complete CRD here. We'll go into detail about the CRD, but the status makes sure that until this is met, we will not open up the connections.
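The "do not open connections until ready" check can be sketched in Python against the status section of the resource. This is a minimal sketch, assuming the status carries a list of conditions with "type" and "status" fields and that warnings appear as conditions of type "Warning"; the sample payloads are hand-written for illustration, not captured from a real cluster.

```python
def is_ready(kafka_resource: dict) -> bool:
    """Return True when the resource reports a Ready condition with status True."""
    conditions = kafka_resource.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True" for c in conditions
    )


def warning_messages(kafka_resource: dict) -> list:
    """Collect the messages of any conditions of type Warning."""
    conditions = kafka_resource.get("status", {}).get("conditions", [])
    return [c.get("message", "") for c in conditions if c.get("type") == "Warning"]


# Hypothetical status payloads for a cluster that is deploying and one that is done.
deploying = {"status": {"conditions": [{"type": "NotReady", "status": "True"}]}}
deployed = {"status": {"conditions": [{"type": "Ready", "status": "True"}]}}

for resource in (deploying, deployed):
    allow_clients = is_ready(resource) and not warning_messages(resource)
    print(allow_clients)
```

A gate like this could sit in a deployment pipeline so that client traffic is only enabled once the cluster reports ready with no outstanding warnings.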

A

Here the deployment is still in progress. Yeah, I can see that this is also done now. If I get kafka again, I should see that it is ready.

A

Everything looks good. Let's look at all the components that are available. We can see it creates a set of pods. These are the three pods: one is the broker, another is ZooKeeper, and this is the entity operator.

A

It has created certain load balancers: one load balancer is for the external bootstrap and the other is for the Kafka broker. Then you can see the entity operator, which is a deployment; that's the pod you saw. And these are the two StatefulSets: Kafka and ZooKeeper are both StatefulSets. So, just with the CRD, we have deployed a complete Kafka cluster, and you can see all of these resources available.

A

You can keep increasing the number of pods, the number of brokers and the number of ZooKeeper nodes. All of this gets added, and you only have to manage the CRD. Let's look at the CRD for a minute.

A

So this is the CRD we have used, and here in the CRD you can see what we have. We have an external bootstrap, which is the particular kind of endpoint we expose.

A

We have support for NLB. We have support for exposing it internally rather than outside. Similarly, we have affinity strategies, like pod affinity and anti-affinity. Just remember this is of kind Kafka: this is the CR we're using, coming from the API version kafka.strimzi.io/v1beta2. Then we assign some replicas.

A

You can update these replicas and redeploy, and new brokers will come up. You have different options for exposing the endpoints. We never recommend something like tls: false, an unsecured load balancer, for external usage. We always recommend something mTLS-based. It's as simple as setting this to true; all the certificate management is handled by Strimzi.

A

Similarly, you can have a SCRAM-SHA-512-based mechanism as well, or a combination of TLS with authorization. The interesting thing is that you can use JBOD storage and have multiple volumes available. As more and more load comes in, you can add volumes, and these volumes get attached to your existing brokers. Any configuration supported by Kafka can be specified here, and Strimzi will inject it as part of your broker and ZooKeeper deployments.

A

Any time you make a change here, a rolling update happens for all of these pods. It also supports deployment on dedicated node groups: you can have taints and tolerations. All of that is supported.
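To make the pieces of the walkthrough concrete, here is a hedged Python sketch that assembles a Kafka custom resource with an mTLS load balancer listener, JBOD storage and a toleration for dedicated nodes. The helper name, the cluster name, the volume sizes, the port and the "dedicated=kafka" taint key are assumptions for illustration; the field layout follows the Strimzi v1beta2 API for ZooKeeper-based clusters as I understand it and should be verified against the Strimzi version in use.

```python
import json


def kafka_cluster_manifest(name: str, brokers: int, volume_sizes: list) -> dict:
    """Build a Kafka custom resource with an mTLS listener and JBOD storage."""
    # One persistent volume per entry; JBOD volumes are identified by a numeric id.
    volumes = [
        {"id": i, "type": "persistent-claim", "size": size, "deleteClaim": False}
        for i, size in enumerate(volume_sizes)
    ]
    return {
        "apiVersion": "kafka.strimzi.io/v1beta2",
        "kind": "Kafka",
        "metadata": {"name": name},
        "spec": {
            "kafka": {
                "replicas": brokers,
                "listeners": [
                    {
                        "name": "external",
                        "port": 9094,
                        "type": "loadbalancer",
                        "tls": True,  # encrypted listener
                        "authentication": {"type": "tls"},  # mutual TLS for clients
                    }
                ],
                "storage": {"type": "jbod", "volumes": volumes},
                "template": {
                    "pod": {
                        # Lets broker pods schedule onto nodes tainted for Kafka only.
                        "tolerations": [
                            {
                                "key": "dedicated",
                                "operator": "Equal",
                                "value": "kafka",
                                "effect": "NoSchedule",
                            }
                        ]
                    }
                },
            },
            "zookeeper": {
                "replicas": 3,
                "storage": {"type": "persistent-claim", "size": "10Gi", "deleteClaim": False},
            },
            "entityOperator": {"topicOperator": {}, "userOperator": {}},
            "cruiseControl": {},
        },
    }


manifest = kafka_cluster_manifest("my-cluster", brokers=3, volume_sizes=["100Gi", "100Gi"])
print(json.dumps(manifest, indent=2))
```

Scaling out then amounts to changing the brokers argument or appending a volume size and re-applying the output, which is the single-blueprint workflow shown in the demo.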

A

It is similar for ZooKeeper as well. We saw all of these. This is the entity operator, which contains our topic operator and user operator. We have also used Cruise Control so that we can declare some capacities and make sure that your pods never go beyond them. If they do, it gives you hard goals and optimization goals, with suggestions based on which you can trigger optimization plans.

A

So this is a single blueprint for me to deploy a complete Kafka cluster, rather than looking at different resources. I play around with these values, and my Kafka deployment is updated based on the changes, because for any change event there is a reconciliation by the cluster operator.

A

That's what I wanted to present in the demo. These are the references we have. You can look at the website and the GitHub project. You can look at the Strimzi channel on the CNCF Slack. If you have any interesting use case, we will help you get onboarded with Kafka.

A

That's what I had to cover. Thank you so much. Please ask questions if you have any; I'll be happy to answer them.