Cloud Native Computing Foundation Kubernetes Community Days (KCD) Africa 2022, 24 Jul 2022

From YouTube: Site Reliability Engineering with Kubernetes by Frank Adu

Description

Reliability in production environments: design best practices with Kubernetes to deliver 99.99% service uptime.

---
KCD Africa 2022 is the 2nd iteration of the Kubernetes Community Days Africa, a CNCF-powered free community event. Visit https://kcdafrica.com for more information.

A

Okay, the next speaker is Frank Adu, who is going to be leading us on the topic "Site Reliability Engineering with Kubernetes". Frank is an engineering graduate with over 60 years of experience building and developing distributed systems. He has worked as a senior software engineer, principal software consultant, automation and release engineer, platform engineer and much more in the cloud ecosystem. So it's good to have you here, Frank.

B

Hello, can anyone hear me?

A

Oh yes, I can hear you. I'll let you have the stage.

B

All right, that's cool. I'm so excited to be here today, and it's a privilege to be part of the CNCF today to talk about site reliability

C

engineering in the Kubernetes space. I could rightly say site reliability in the cloud native space, but everything is cloud native at the moment. Without much further ado, I think

B

I will go ahead

C

to talk about what site reliability actually is. We've heard

B

the term, the buzzword right now: "I'm an SRE",

C

or, I mean, "I'm a DevOps engineer" as well. Basically, an SRE ensures that systems are reliable. That is how I define it, and that is how I coined it. So,

B

being an SRE, you cut across the entire

C

stack of the delivery chain. You start from the design stage and go through to the monitoring stage. You cut across the entire stack within the SDLC process, and you also

C

work very well with the developers and also with the ops team. So there is a lot of responsibility in being a site reliability engineer in the tech space, whether within an organization or as a consultant.

C

So, let's dive in. These are the key areas. I've been asked so many questions privately, either on LinkedIn or on Twitter: what is the difference between SRE and DevOps? It's a bit tricky to separate the two. Basically, SRE extends the functionality of DevOps itself. There is a school of thought among some of my key folks;

C

some IT professionals agree that DevOps should not be coined "DevOps engineering", because DevOps is a culture. So there is an argument and a school of thought around that as well.

C

So basically, SRE leverages DevOps design and extends the functionality that DevOps practices bring to the tech stack. So, being an SRE, an SRE consultant or an SRE practitioner within a team,

C

you have a lot of key responsibilities that fall on you within that space. You are more concerned about reliability, availability and system observability. I was very happy when the service mesh topic came up. It was a very cool topic, and it has given me a very good starting point in terms of system observability, and also strategies for high availability and DR, and most critically automation, built-in fault tolerance and GitOps.

C

I was pretty excited when I saw the GitOps session; it was pretty cool. The containers part was also taken care of today, so I think they have made my presentation a bit easier for me. There is also what we call SPOF, single point of failure, and monitoring. This is where some of the issues come in for an SRE: what do you have to monitor?

C

What are the key functionalities that you have to monitor? What are the metrics the business is interested in? Basically, for you to be successful as an SRE, whether within an SRE team, as a consultant or as a core engineer within the team, you should see yourself as a technical customer within the team. Let's say a customer out there doesn't know when the system is going to fail.

C

All the customer is interested in is basically: I'm getting responses to my HTTP traffic, I want to do this, and I'm able to do it. So in terms of monitoring, what you have to monitor is a very key topic.

C

It's a very full topic, and I will try as much as possible to dive into some of the key areas that you have to monitor and some of the tools you can use. This is also where the extended functionality comes in: in terms of management, you as an SRE handle on-call plans and prepare incident management tools as well.

C

You do RCAs, which is root cause analysis.

C

This is where your tracing comes in: you use Jaeger, Zipkin and so on to come up with good RCAs and to make sure the issue is mitigated. We also have the blameless postmortem. If you have an SRE, or if you are an SRE team,

C

this is a practice that is very crucial for the team to sustain: the blameless postmortem. There is a template for it as well. What it means, basically, is that when anything happens, the entire team takes full responsibility, rather than pointing at someone else: it was his call, it was that code, it was bad code that caused the problem.

C

When you have a blameless postmortem that is documented, it is much easier to keep that issue from repeating itself. And most critical is the software design part.

C

You are always part of the architectural design discussions, because before the software is released you are part of every SDLC process in terms of architecture. To the architecture you bring all the key metrics: the service uptime, the way the microservices are being built, what kind of mesh system is used. There you are looking at observability and also at availability, and so on. So the software design part is very critical in this.

C

So now, let's dive into the next one. I say this every time: there is nothing as much fun and as challenging as working in this area. A Kubernetes environment is very fun, and so is working in the cloud native space. I've never regretted it. But sometimes it comes with very big pains as well, trying to resolve issues that could leave you wondering why you chose this field.

C

Okay, next one: availability. I mentioned HA and DR. I think SREs can now say we are a bit advantaged, compared with working in an environment that is not Kubernetes-based or cloud-native driven. Kubernetes was built

C

production ready, in the sense that it is fault-tolerance ready and also high-availability ready. I liked the way Gabriel described it: it is constantly checking that every state is being maintained.

C

So one of the critical tools for an SRE to have in the Kubernetes space is the ability to write CRDs and operators well. When you have that behind you, it makes managing the infrastructure a bit easier.

C

When you have CRDs, you are extending the functionality of the Kubernetes environment itself. I agree with Ray, who mentioned that it gets to a point where Kubernetes cannot take care of itself anymore. You need to enhance it, and that is where you need the tool set: the CRDs and the operators. It is key for you as an SRE to have that tool set. You need to learn how to write CRDs.

C

Sometimes you have to go as far as writing some Go as well. I write in Go, and a little Python too. Now, in terms of availability: fine, Kubernetes is built ready, it's fault-tolerance ready and all that, but you have to come in with the level of architecture that you really need. So you'll have your three-node, normal standard setup:

C

one master and two workers running your workloads, which is the normal setup. For you to be more fault tolerant, you will need an extra three nodes, with one acting separately as a load balancer and the other two acting as masters as well, a second and a third master. That is the fully highly available environment you are planning for, because all these things start at the planning stage.

C

Remember I talked about software design: you are always part of the design process. This is where you have to come in as an SRE to plan it out. If you're running in the cloud, you're quite lucky: you just do the configuration and add nodes. But if you're not so lucky and you're on bare metal, which happens to be the environment I work in most of the time,

C

you'll need to deploy this yourself, manually. You need to get extra nodes in there. I know you'll be wondering what the etcd is doing out there. As an SRE, you care about performance, so for a performance-driven architecture it is crucial to have etcd running separately, on its own node or cluster. With that, you can be sure of your speed and performance.
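The reason three masters (and an odd number of etcd members) matter can be shown with a little arithmetic. etcd uses the Raft consensus algorithm, so it stays writable only while a majority of members is reachable. The sketch below is illustrative only; the function names are made up for this example and are not part of any etcd or Kubernetes API.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """How many members can be lost while a majority is still reachable."""
    return members - quorum(members)


# Compare a single master with two-, three- and five-member control planes.
for n in (1, 2, 3, 5):
    print(f"{n} member(s): quorum={quorum(n)}, "
          f"tolerates {failures_tolerated(n)} failure(s)")
```

Note that two members tolerate no more failures than one, which is why the highly available layout described above goes straight from one master to three.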

C

Here I use HAProxy. You can use any other proxy out there; you can use NGINX or another proxy to act as the load balancer, and also have your ingress controllers working as well. So that is that. Please keep your questions coming; if you have questions, I can see them. Then there is reliability, which happens to be very crucial.

C

I can give you a real case scenario that I had in the past, when I was in the e-commerce space.

C

Fridays are usually very heavy with traffic, and certain hours of the day are heavy on traffic as well. You'll see a huge spike in traffic coming in, so how will you be able to mitigate that? A very clear example is Jumia doing a Black Friday, or doing a promo at a certain hour of the day.

C

How can you mitigate that traffic coming in? No matter how you try to do some throttling, you'll still face resource drain issues, so managing resources is critical to service reliability. One strategy at this level is vertical scaling.

C

You have vertical scaling running, which is your VPA, the vertical pod autoscaler, and you also have your horizontal pod autoscaler, the HPA, which scales as well. So it gets to a threshold: when usage is above the 60 percent that you have set for your HPAs,

C

what you need now is your CA, which happens to be your cluster

C

autoscaler. It has to kick in because the requests coming in on this particular day, or at this particular time, are above 60 percent. So there is an algorithm that is already running.
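As a rough illustration of that algorithm, the Kubernetes documentation describes the horizontal pod autoscaler as computing desired replicas = ceil(current replicas x current metric / target metric). The sketch below applies that formula with the 60 percent target mentioned in the talk. The replica bounds and the function name are assumptions for this example, and the tolerance value is assumed to mirror the commonly documented default.

```python
import math


def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float = 60.0,  # threshold from the talk
                     min_replicas: int = 1,             # assumed bound
                     max_replicas: int = 10,            # assumed bound
                     tolerance: float = 0.1) -> int:    # assumed default
    """Replica count an HPA-style controller would ask for."""
    ratio = current_utilization / target_utilization
    # Skip scaling when the metric is already close enough to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Never go below the floor or above the ceiling configured for the workload.
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(4, 90))   # traffic spike: scale out
print(desired_replicas(4, 62))   # within tolerance: no change
print(desired_replicas(8, 120))  # capped by max_replicas
```

When the requested pods no longer fit on the existing nodes, that is the point where the cluster autoscaler (or, on bare metal, extra physical nodes) has to come in.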

C

You use your CRD to do that; if you want the full functionality, you'll have it running to set in and kick in at the threshold to handle that. I believe, from experience going back two or three years, you will have seen a jam with, for instance, the JAMB portal. Sorry, I'm just using that as an example. It gets so bad

C

that no one can access the site at a particular time because of traffic. You'll hear that there's so much traffic, the site is inaccessible, or it's pretty slow and takes a lot of time. This is the problem. So as an SRE within the team, you need to make sure you mitigate against that. How do you do that? You come up with strategies, and part of the strategy you have to come up with is this:

C

you have your HPAs to increase the number of pods that are running within your clusters. You also need to include vertical scaling within your environment. This is interesting because it happens a lot. If you're running in the cloud, you already have a big advantage: you have your autoscalers running in the cloud and working pretty well. So how do you handle that in a bare-metal situation?

C

That is where the physical nodes come in, which you have to fit in. So that is that. Also, to improve traffic handling and latency,

C

you need to bring in your caching system as well. The most important caching issue, one I have had to correct many times during my consulting, has been

C

the failure to have a caching strategy at the back end, between the service and the database layer. Most times they leave that part out, so there is always a struggle: there is always back-and-forth communication, which results in slow queries. So in terms of query performance, it is necessary to have some caching strategies. You can use Redis; Redis is pretty cool.
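One common way to place a cache between the service and the database is the cache-aside pattern: read from the cache first, and only go to the database on a miss. The sketch below is a minimal in-memory illustration under that assumption; the talk does not say which pattern was used. The class name, the `load_from_db` callback and the TTL value are hypothetical, and in practice the dictionary would be replaced by a store such as Redis.

```python
import time
from typing import Any, Callable, Dict, Tuple


class CacheAside:
    """Read-through helper: serve from cache, fall back to the database."""

    def __init__(self, load_from_db: Callable[[str], Any],
                 ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}  # key -> (expiry, value)
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1          # fresh entry: no database round trip
            return entry[1]
        self.misses += 1            # missing or expired: query the database
        value = self._load(key)
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: str) -> None:
        """Drop a key after the underlying row has been written to."""
        self._entries.pop(key, None)
```

Tracking hits and misses also gives a simple cache hit ratio that can be exported as a metric alongside query latency.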

C

There is another tool I've worked with as well, which is also part of cloud native, a very good tool to work with, and there are other caching systems out there that you can work with. The whole idea is to make sure you improve query performance at the database back-end layer, which in turn improves the latency at the API layer. I think the previous speaker mentioned API calls and so on.

C

This is an extension of what she mentioned. Now let's move on to the next one. Please keep the questions coming if you have any. In terms of observability, thank you, Jibri, for covering service mesh; if you missed that, you can watch the video to grasp what a service mesh is. He mentioned sidecars, which is very good.

C

In the SRE world, we hear phrases like "injection of sidecars" into a running service, and it's the same thing he explained earlier.

C

It is very crucial and important for an SRE within any team to have eyes into the services that are running at the back-end level. If you have a service mesh running, you can directly scrape or monitor the endpoints for the interconnectivity, to check which one is failing or not working. This will also help you mitigate against service degradation.

C

It's something I see almost every day in some teams: they have a lot of services that are degraded and do not pick up again because of the visibility strategy they use. There are different strategies you can put in to make that work: you can have retries, you can have single transactions.
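As an illustration of the retry strategy mentioned here, the sketch below retries a failing call with exponential backoff and jitter. It is a generic pattern rather than the speaker's implementation; the function name, attempt count and delay values are assumptions, and a service mesh can usually apply equivalent retry policies without changing application code.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(operation: Callable[[], T],
                      max_attempts: int = 3,
                      base_delay: float = 0.2,
                      max_delay: float = 5.0,
                      sleep: Callable[[float], None] = time.sleep,
                      rng: Callable[[], float] = random.random) -> T:
    """Run `operation`, retrying on any exception with backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Double the delay on each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter: wait between 50% and 100% of the computed delay.
            sleep(delay * (0.5 + rng() / 2))
    raise AssertionError("unreachable")
```

Retries are only safe for operations that can be repeated without side effects, so calls that write data need to be idempotent before a policy like this is applied.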

C

There are a lot of strategies you can put in to make that happen, and all of this happens at the design stage, during the architectural design. If you remember, when I started I talked about the SRE being part of the software design plan. You are in every meeting, contributing your own experience and making sure the software that is about to be delivered conforms 99 or 100 percent to best practices.

C

You cannot imagine a service being put out there, with you as an SRE within the team, and there is no service mesh in place, no way to improve the interconnectivity of the microservices that are running. You could say, fine,

C

we are running an SPA, just a single application. You can still have some level of observability there: you can have an endpoint scraped periodically to check that it is up. So this is not only for microservices; you can have it done in an SPA environment as well. Then there is distributed tracing.

C

I remember when I got into a particular team, there were a lot of issues, and you didn't know where they were coming from. You could not even tell whether it was a 200, a 502 or a 504; all you saw was "internal server error", which is quite vague. Having developers or software engineers start tracing and checking takes a lot of time, and it becomes mundane and repetitive.

C

So one of the key areas of SRE is to automate, as much as possible, every repetitive task and process. One of the things you can do is bring in distributed tracing. A very good tool you can use in the microservices space, which is part of cloud native, is what we call Jaeger. It runs on AWS, on Azure and on Google Cloud, and it also runs on bare metal.

C

It's something I have been exposed to a couple of times, and it's one of the tools I use. With distributed tracing, the developers can tell: is it a 200,

C

is it a 504, or is it a 500? It comes up and drills down to the exact service that is failing, so you can visualize it and see it.

C

I've started a whole big topic here that I could talk on and on about. So that we can all gain clarity and visibility, there is also APM. I mentioned APM just now: application performance monitoring. Here you are drilling down to a particular endpoint directly and scraping it every few seconds. You can use Prometheus for that, you can use other enterprise tools, and you can use New Relic as well.

C

You can use Datadog as well. But my go-to tool, which is quite flexible for me, is Prometheus. I use Prometheus to do this almost all the time, depending on the environment I find myself in. You set that up with it, and you set alerts on the endpoints that you're monitoring.

C

If it fails, you see it first, before the client or the customer sees it, so you can mitigate it. That is where you have your rollback, almost immediately. A very good scenario is when a new release is out and an endpoint fails. What happens? You have to do a quick rollback.

C

That is where your GitOps automation comes in. I will talk about that in the next couple of slides. Back-end observability is also very critical.

C

You have to monitor your database. Right now there is even a new role called database observability engineer, basically taking care of the database part: getting logs, checking query latency and making sure there is full observability of the back-end services running on the database. Also, a lot of us have heard about Fluentd. Fluentd is a very good logging tool that I think every SRE out there should have as part of their tooling.

C

With Fluentd, you have full visibility into the services that are running. You can do a whole lot with Fluentd. If I'm given the opportunity some time, I could talk about Fluentd in the SRE space extensively. I think I should move to the next slide now; my time is almost gone. Okay, so fault tolerance. What is fault tolerance? Within the SRE space, you use a chaos method, a chaos mechanism.

C

Fine, you have your QA, and your QA is working. With fault tolerance, you can also automate that process within your build pipeline; you automate the full pipeline that is running. You could be running in a traditional Jenkins environment, for instance, or, if you're lucky enough, in a GitHub Actions environment, so you automate the entire process with your QA within it. Most importantly, what you have to add to make your chaos method complete within the pipeline is this:

C

you need to have endpoint checks within your pipeline. Before the release happens, the pipeline makes an API call to the endpoint. If the endpoint fails, the release should not get into production; it should not even make it into the sandbox area or into staging.
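A pipeline step of this kind can be as small as a script that calls each endpoint and blocks the release when any of them fails. The sketch below uses only the Python standard library; the function names, the 2xx success rule and the example URL in the usage note are assumptions for illustration, not part of Jenkins or GitHub Actions.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable, Optional


def http_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status for `url`, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None      # DNS failure, refused connection, timeout, ...


def release_gate(endpoints: Iterable[str],
                 fetch_status: Callable[[str], Optional[int]] = http_status
                 ) -> Dict[str, Optional[int]]:
    """Return the endpoints that failed; an empty dict means safe to promote."""
    failures: Dict[str, Optional[int]] = {}
    for url in endpoints:
        status = fetch_status(url)
        if status is None or not 200 <= status < 300:
            failures[url] = status
    return failures
```

In a pipeline, a wrapper would call something like `release_gate(["https://staging.example.com/healthz"])` (a hypothetical URL) and exit with a non-zero code when the returned dictionary is not empty, so the promotion step never runs.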

C

So it's a very critical area and a fairly new way to make sure that, before you go to production, everything is at a hundred percent, because the business focus is to make sure the services out there are at one hundred percent. They don't want to know if it's ninety-nine percent; for the business, everything is a hundred percent. So the chaos method is one of the critical areas before you go to production.

C

You have to make sure that happens as well. The mesh system has also made fault-tolerance testing easy. In the diagram here, you can see the load balancer coming in, the ingress data plane, the ingress control plane and the service mesh control plane coming in and hitting the service proxies. Jibri talked about service proxies and all that.

C

So this is the intercommunication within the microservices themselves, and it also improves security in terms of inter-service communication. From what you can see here, they are all communicating separately and independently, hitting their service proxies directly. Each of them has a separate policy that enables service discovery and also helps avoid service degradation.

C

What kind of mechanism do you have? How many seconds does it take, or what constitutes a service being degraded? Is it that after three tries the service is degraded? And if it's degraded, is there a replica, is there replication? That is where you have a strategy within the K8s space, where the StatefulSet comes in. With StatefulSets, you can be as fault tolerant as possible.
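One way to make "degraded after three tries" concrete is to count consecutive failed checks and flip a flag at a threshold. The sketch below shows that rule in isolation; the class name and the default threshold of three are illustrative assumptions, and in a real mesh the equivalent policy is normally configured declaratively rather than written by hand.

```python
class DegradationTracker:
    """Marks a service as degraded after N consecutive failed checks."""

    def __init__(self, failure_threshold: int = 3) -> None:
        if failure_threshold < 1:
            raise ValueError("failure_threshold must be at least 1")
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def degraded(self) -> bool:
        return self.consecutive_failures >= self.failure_threshold

    def record(self, success: bool) -> bool:
        """Record one check result and return the current degraded state."""
        if success:
            self.consecutive_failures = 0   # any success resets the streak
        else:
            self.consecutive_failures += 1
        return self.degraded
```

Once the flag flips, traffic can be shifted to a replica and an alert raised, which is the point at which the replication questions above start to matter.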

D

Okay, now to the next one.

C

Automation, the critical part. I think for every SRE this is one of the core functions within the SRE space: automation, code compliance, infrastructure as code, and having Helm act as your app management tool set. Like I said here, it is one of the cheapest yet most expensive practices. Some good practices for good quality:

C

you can have linters to make sure your infrastructure as code is well checked, linters to check your YAML, and also keep it under control. It is, like I said, one of the cheapest practices. One of the use cases is failover.

C

Let's give a practical example. You have your infrastructure built using Terraform, for instance, and you spin up the EKS, AKS or GKE environment that you're managing, and something happens to the region or to the data center. You can easily roll back: you can spin up the infrastructure in a minimal period of time, unlike having to do snapshots and geo-replication.

C

That is quite expensive. So it's cheap, but it's expensive in terms of usability, because sometimes that is not put into perspective. With an SRE in a team, the SRE ensures that all infrastructure is built as code, not through a UI.

C

The UI is highly discouraged because it is prone to errors, it is a mundane approach, and it takes a long time to bring infrastructure back up if it goes down. Another good practice is automating your backups, your SSL certs and your installations. For instance, if you want to install some software or packages through the pipeline, you can use Ansible to do that. It makes the entire infrastructure more robust and quite easy to maintain.

C

Okay, I think this was dealt with yesterday, and I have very limited time. GitOps was dealt with very well yesterday, and I was pretty cool with it: they talked about Flux and Argo CD as well.

C

There is so much talk about it: should I go for Argo CD, or should I use Flux? Flux can only observe one repo at a time. I don't know if they've been able to update that; I think the lady from Weaveworks can speak correctly on that. With Argo CD, you can observe multiple repos. So before you use a tool, you should know the reason why you have to use that particular tool.

C

Remember I talked about rollback and how quickly you can get back up again to maintain some level of reliability and availability. GitOps automation has come to stay, and we have Argo CD and Flux to do that, but I prefer using Argo CD because of the multiple repos that I work with.

C

I think it's one of the areas where Flux has to improve; I don't know if they have a new version that can do that at the moment. GitOps is also a source of truth, gives release integrity and is ownership driven. Then containerization: I really enjoyed the container session that was done by Grits. I loved it, and I enjoyed my lunch as well. We all know the regular container health checks.

C

We could tell you: okay, fine, the life cycle. You have the liveness probe, the readiness probe and the startup probe, and they're pretty cool, but there is more to it. If you are working in a real cloud space, a data platform or a data streaming platform, for instance, they may want some Java-based, Python-based or Go-based applications to run side by side with the containers.

C

What you have to do is use the init container, so you initialize the software first, before the containers that have the images run. There are sets of scripts that you cannot find within your images, so it's something you'll have to write. I think we all know this very simple command that we can remember.

C

You have the shell command with sh, echo "blah blah", sleep and all that, and you have the init containers that initialize the containers you need to run side by side. I think I can show you very quickly how that works. By the way, if you're wondering what kind of UI I'm using, this is one of my tools; it is what we call Lens. This is an example of an init container running.

C

So this is how an init container runs: it runs side by side with the app that has the images. The advantage of this is that you don't have much app throttling or image throttling, or your image shutting down unnecessarily, because you've taken the heavy weight off it. The code for setup and everything else is in the other container. So it's like a separation of processes.
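To make the pieces named above concrete, the sketch below builds a Pod manifest with one init container and the three probes, as a plain Python dictionary that can be dumped to JSON (which `kubectl apply -f` accepts alongside YAML). The image names, port, paths and timing values are placeholders chosen for this example. One detail worth keeping in mind: in Kubernetes, a regular init container runs to completion before the app containers start, so the setup work happens ahead of the app rather than concurrently with it.

```python
import json
from typing import Any, Dict


def pod_with_init_container(name: str, app_image: str,
                            init_image: str = "busybox:1.36") -> Dict[str, Any]:
    """Build a Pod manifest with a setup init container and health probes."""
    http_check = {"httpGet": {"path": "/healthz", "port": 8080}}  # placeholder
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Setup scripts that are not baked into the app image go here.
            "initContainers": [{
                "name": "setup",
                "image": init_image,
                "command": ["sh", "-c",
                            "echo preparing environment && sleep 5"],
            }],
            "containers": [{
                "name": "app",
                "image": app_image,
                # Startup probe: gives a slow-starting app time to boot.
                "startupProbe": {**http_check, "periodSeconds": 5,
                                 "failureThreshold": 30},
                # Readiness probe: gates traffic until the app can serve.
                "readinessProbe": {**http_check, "periodSeconds": 10},
                # Liveness probe: restarts the container if it hangs.
                "livenessProbe": {**http_check, "periodSeconds": 10},
            }],
        },
    }


manifest = pod_with_init_container("demo-app", "registry.example.com/demo:1.0")
print(json.dumps(manifest, indent=2))
```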

C

Okay, so let me move to the next one. My time.

B

All right, so single point of failure. You have to have a multi-region

C

deployment strategy. You need to have your AZs, as in the diagram here. I think I need to be fast as well. Sorry, Abubakar, I'm taking more time than I have. Maybe you could go to the video and look through it to see how this works. You have the multi-region and also... okay, okay, thank you. Thank you. All right, so let me go ahead.

C

Okay, cool. So the best way to avoid failures in your production environment is to have multi-region and your DR. Like I mentioned, there is IaC, which is infrastructure as code; I'm not going to delve into that again. You have your disaster recovery mechanism and your multi-region as well. In terms of your DR, it's always very critical to have your database structured into the DR process as well.

C

All right, in the DR as well. Your DR is for your database. DRs are very expensive, so before you do that, you should know it's going to cost a huge amount of money to go ahead with. You have the transaction logs as well, and with the regions, you have your cluster and then a master.

C

All this helps with your replication and also with your data performance. I think I will rush through the other ones so I can take questions. Okay, interesting, I think I'm on the last part now. Cool, this is it. We've heard about SLOs and SLIs.

C

I think, Abubakar, you need to bring me back again so I can talk more about SLOs and SLIs. In the DevOps or cloud native community, only some people really understand what an SLO is and how to set one. I remember my very first consulting job: we wanted to draw up an SLO for the company's application, and it was very difficult for me. I had to go through Google's documentation to understand what an SLO was. I wasn't looking at it from a business perspective.

C

I was looking at it from an engineering perspective. Once I started looking at it from a business perspective, it became quite easy for me to relate to what an SLO is to a business and how important it is. When you're making a call to an application, streaming on Netflix for instance, the business wants to make sense of what the response time is. Very critical is your latency. This is a typical example of an SLO dashboard that was made in-house with PromQL, which runs on Prometheus, and it was done with a Grafana dashboard.

C

You have the throughput, and error budget is something I could talk about from now for six hours. Error budget is quite huge. Take any company that is running 99.9.

C

What that means is that, over a whole year, you can calculate the amount of time they are actually losing money, and that is very critical for the business. Any time I present an SLO dashboard, the business gets excited about it, so they put more pressure on the team to make sure they perform optimally as well. I got into a team that was making 95, and that was a huge loss in terms of error budget, but in their engineering world 95 percent is okay. It's a good timing.
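
To make the yearly calculation concrete, the following Python sketch converts an availability target into the downtime it allows per year. It assumes a 365-day year and treats availability as pure uptime; the function name is chosen for this example.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent):
    """Hours per year a service may be down and still hit the target."""
    return HOURS_PER_YEAR * (100 - availability_percent) / 100


for target in (95.0, 99.9, 99.99):
    hours = allowed_downtime_hours(target)
    print(f"{target}% -> {hours:.2f} hours of downtime per year")
```

Under these assumptions, 95 percent allows about 438 hours (roughly 18 days) of downtime a year, 99.9 percent allows about 8.76 hours, and 99.99 percent allows under an hour.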

C

It's a good one. But no, for an SRE it's a very, very bad time. 95 percent is not a good one, not a good one. You can have three nines, and if you can have four nines, that is fantastic. Most of the time that leads to the question: why are we having a bad error budget?

C

Why are we really getting the error budget wrong? One of the causes of a bad error budget is the tech stack. The environment that had 95 percent was actually running a legacy application. I'm not trying to say some languages are bad, but if your application is going to be interoperable, you need to make sure you are running a modern stack.

C

If you understand what I'm trying to say: running a Java application, for instance, is quite different from having an application running in Go or an application running purely on Python. The speed is not going to be the same. So they were using an old stack, and they had to change that into a modern stack, and they were able to achieve 98.7.

C

So that was a great improvement for the business when it was calculated over a whole year. And for you to have an effective error budget, you need to measure that particular service for a minimum of 14 days. That is when you can actually come up with something. SLO maturity actually starts from 30 days.

C

If you build an SLO dashboard now, you're not going to get the exact picture. You have to wait for 30 days before you can really make sense of the dashboard that you've made for the business. The business actually loves dashboards. This is one of the beauties of being an SRE: you communicate with the business. You are a business person within a technical team, the technical business person who makes sure all these parts are followed and taken care of.
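
The measurement window described above can be sketched in Python as follows. The code refuses to report an error budget until at least 14 days of data exist, then reports the share of the budget still unspent. The per-day request counts, the 99.9% target, and the function names are assumptions for illustration only.

```python
def availability(daily_counts):
    """daily_counts is a list of (good_requests, total_requests), one per day."""
    good = sum(g for g, _ in daily_counts)
    total = sum(t for _, t in daily_counts)
    if total == 0:
        raise ValueError("no traffic in the window")
    return good / total


def error_budget_remaining(daily_counts, slo_target, min_days=14):
    """Fraction of the error budget left, or None if the window is too short."""
    if len(daily_counts) < min_days:
        return None  # fewer than min_days of data: too early to judge
    total = sum(t for _, t in daily_counts)
    bad = total - sum(g for g, _ in daily_counts)
    allowed_bad = total * (1 - slo_target)
    if allowed_bad == 0:
        raise ValueError("a 100% target (or zero traffic) leaves no error budget")
    return 1 - bad / allowed_bad


# Hypothetical window: 30 days, 10,000 requests a day, 5 failures a day.
window = [(9_995, 10_000)] * 30

print(f"availability: {availability(window):.4%}")
print(f"budget left:  {error_budget_remaining(window, slo_target=0.999):.0%}")
print(error_budget_remaining(window[:10], slo_target=0.999))  # None: only 10 days
```

With these made-up numbers the service fails 150 of 300,000 requests against an allowance of about 300, so roughly half of the budget is left. A negative result would mean the budget is overspent.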

C

In terms of incident management, there is a template for that. You can chat me up and I could share it. There is also a template for RCAs, and a template for postmortems to work with. And on-call: thankfully, there is a lot of software now that does on-call rotation. You can use Squadcast to make that happen.

C

You can use New Relic's built-in on-call, or you can have your own internal on-call system to handle that. The whole essence of on-call is to make sure... well, I think I will touch on that as well. There are procedures that come with on-call and kick in when a software goes down. Who do you have to talk to? Who do you need to speak with? Not so in our traditional setting.

C

The software goes down and everybody goes haywire. We don't know who is responsible, who is the next person, who is the right person to speak with. So SRE brings in processes within the software paradigm. The entire SDLC chain, you are part of it until the very last part. When everybody goes to sleep, as an SRE you are still considering how to make sure the software stays up and is always available. So there is a lot to talk about.
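
For teams that build their own internal on-call system, the "who do you talk to" question can be answered by a simple rotation. The Python sketch below assumes a weekly rotation and an escalation order that follows the rotation; the engineer names and start date are hypothetical, and this is not how Squadcast or New Relic implement on-call.

```python
from datetime import date

# Hypothetical team and rotation start (4 July 2022 was a Monday).
ROTATION = ["ama", "kofi", "zainab"]
ROTATION_START = date(2022, 7, 4)


def on_call(day, rotation=ROTATION, start=ROTATION_START):
    """Return the primary on-call engineer for the week containing day."""
    weeks_elapsed = (day - start).days // 7
    return rotation[weeks_elapsed % len(rotation)]


def escalation_chain(day, rotation=ROTATION, start=ROTATION_START):
    """Primary first, then the rest of the rotation in order."""
    first = rotation.index(on_call(day, rotation, start))
    return rotation[first:] + rotation[:first]


incident_day = date(2022, 7, 24)
print("page first:", on_call(incident_day))
print("then escalate to:", escalation_chain(incident_day)[1:])
```

The point of the sketch is that the answer is decided before the incident: when the software goes down, the first person to page and the next people to escalate to are already known.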

C

Being an SRE, there is so much, so much. This is quite a limited time for me to talk. I would have also loved to do some demos, but time is not going to permit me to do that. I can quickly show you something. I don't know if I can share my screen right now.

D

Okay, where's my screen now? What should I share now? Okay, I think I can share something now. All right, so.

C

Cool, all right. So this is a dashboard, a streaming dashboard built with Kafka. The business wants to see this. The business is interested in this and wants to see how it is happening. This is how the business makes sense of what is happening in the back end, and this is a very classic example of how this works.

C

So this is Kafka running at the back end for data stream processing, which is live right now. This is happening live, and you're getting live streams and seeing the calls that are happening. So this is a practical example of what your day-to-day as an SRE is. You have to build it. This is why I call it SRE: you are part of it, you build it, you monitor it. Those are the three core areas you have to really look into as well. So, thank you.

C

I think I'm good now.

E

Awesome. I have been sharing everywhere that it seems as if illyria yoda planned this schedule with such vision and carefully placed your session to be the last one. You've been touching on almost all the previous sessions from yesterday. Yeah, comments have been going on around that. This is very interesting, and definitely, if we had known, we would have made it one hour altogether. Come to think of it, one hour.

C

Yeah, it's quite big. It's quite huge. You know.

E

Yeah, but we will definitely reach out next time if we are hosting more events, so that at least you can share more of this, especially the SLOs and SLIs, yeah.

C

Very critical, a lot of confusion around that area.

E

We don't have any questions in the chat, but I've shared your LinkedIn URL so anyone who has questions to ask you can reach out. And, yeah, I don't know if you have a Twitter handle. Most of our comments.

B

Yeah, I have a Twitter handle, foodrank, @foodrank29.

E

Can you drop it in the message here, so I don't make a mistake?

C

Okay, foodrank. Don't mind me.

C

I hope I remember my twitter handle.

E

Okay, so yeah, if you want to reach out to Frank, it's @foodrank on Twitter, so you can have conversations about things you want to learn around SRE. SRE is a major thing, there is a lot of interest, and a lot of companies are looking for more talented SREs, literally. Yep. Okay, thank you very much for your time, Frank.

C

All right, see you soon, bye.