From YouTube: SLOconf: Left Shift your SLOs - by Michael Friedrich
Description
Everyone talks about security shifting left in your CI/CD pipeline: tools and cultural changes enable teams to scale and avoid deployment problems. SLOs are left out of this. What if a software change triggers a regression and your production SLOs fail? As a developer, you want to detect these problems as early as possible. This talk dives deep into CI/CD pipelines and discusses ideas to calculate and match SLOs in the development lifecycle, early in your pull or merge request for review.
Hello everyone, today I want to show you how to left-shift your SLOs. A little bit about myself: I'm working at GitLab as a Developer Evangelist, and you can find me online as dnsmichi, which is d-n-s-m-i-c-h-i, on Twitter, on LinkedIn and everywhere else.
A
Now,
let's
dive
into
a
little
bit
of
history
when
we
have
been
slowless,
meaning
to
say
well
we're
turning
back
time
a
little
and
saying.
Well,
we
had
the
traditional
monitoring
state
black
box
monitoring
slowly
have
been
adding
metrics
to
our
monitoring
journey.
We
had
traditional
sla
reporting
at
some
point,
probably
generating
some
pdfs
for
our
managers
and
and
also
calculating
the
service
level
availability
for
our
customers.
So we were tracking state changes over time, maybe adding some metric data points and trends, and in the end we kind of needed to learn SLA, SLO and SLI, and that's it. Let's dive a little more into the history and find some situations where it could have been useful to develop some rocket science and also add SLOs.
We were thinking about a CPU overload with threads involved and considered using lightweight coroutines in C++, meaning to say those are like fibers or asynchronous routines, stackless: the function's state is put on the heap when the function is being paused and the stack unwinding happens, ready for later continuation. Sounds easy, and we continued implementing that. Well, and then we had to debug it, because it went into production.
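To make the stackless part concrete, here is a minimal C++20 sketch, assuming a trivial task type rather than the actual code from this story: the compiler allocates the coroutine frame, the locals plus the resume point, on the heap, so the function can be paused and continued later.

```cpp
#include <coroutine>
#include <iostream>

// Minimal C++20 stackless coroutine sketch (not the code from the story).
// The coroutine frame lives on the heap, which is what lets the function
// pause and resume without keeping a dedicated stack.
struct Task {
    struct promise_type {
        Task get_return_object() {
            return Task{std::coroutine_handle<promise_type>::from_promise(*this)};
        }
        std::suspend_always initial_suspend() noexcept { return {}; }
        std::suspend_always final_suspend() noexcept { return {}; }
        void return_void() {}
        void unhandled_exception() { std::terminate(); }
    };

    std::coroutine_handle<promise_type> handle;
    ~Task() { if (handle) handle.destroy(); }
};

Task handle_api_client() {
    std::cout << "request received\n";
    co_await std::suspend_always{};   // paused: state lives in the heap frame
    std::cout << "response sent\n";
}

int main() {
    Task t = handle_api_client();     // created suspended (initial_suspend)
    t.handle.resume();                // runs until the co_await pauses it
    t.handle.resume();                // continues after the pause
}
```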
It went into a release, and there was a certain crash, in production only, and only with something like 100 to 1,000 API clients or a little more than that. The memory was corrupted at some point, and it also sort of got exhausted, which pointed to some kind of leak. We were not sure about it, and lots of time went into debugging.
This was when we could have created a staging environment and defined our SLOs, because we actually want to ensure that the stack and heap memory meets the operational requirements. So we define a service level objective saying: hey, we want to ensure that heap and stack memory stays at the defined usage level, for example 2 gigabytes of RAM, because we cannot really exceed that from an agent perspective. On the other side, we actually needed to measure that in real production environments; "it works on my machine" is not a valid argument when the application is crashing. We also probably would have needed some sort of chaos engineering for our API requests. None of that was there, and this is just one of the many situations where SLOs would have been helpful.
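As a rough illustration of what checking such a memory budget could look like, here is a hypothetical sketch, assuming a Linux environment where the resident set size can be read from /proc/self/status; it is not the actual agent code:

```cpp
#include <fstream>
#include <iostream>
#include <limits>
#include <string>

// Hypothetical illustration of the 2 GB memory SLO: read the process's
// resident set size (VmRSS, reported in kB) from /proc/self/status on Linux
// and compare it against the budget. A non-zero exit fails a CI quality gate.
long current_rss_kb() {
    std::ifstream status("/proc/self/status");
    std::string key;
    while (status >> key) {
        if (key == "VmRSS:") {
            long value = 0;
            status >> value;
            return value;
        }
        status.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
    }
    return -1;  // VmRSS not found (e.g. non-Linux platform)
}

int main() {
    const long budget_kb = 2L * 1024 * 1024;  // 2 GiB budget, in kB
    const long rss_kb = current_rss_kb();
    std::cout << "VmRSS: " << rss_kb << " kB, budget: " << budget_kb << " kB\n";
    return (rss_kb >= 0 && rss_kb <= budget_kb) ? 0 : 1;
}
```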
Another story is when it goes really slow. You maybe remember the time when there was a report about GTA Online: the loading times were like eight or ten minutes, and a user reverse engineered the binary, guessing from the assembler code how often a specific function was being called, and worked around it by creating a preload DLL, saving 70 percent of the loading time.
Well, it could have been prevented in our development process. The mitigation as a developer would have been to measure the loading time at application timing points in a first iteration, add some metrics in a second iteration, then look into tracing spans to figure out what's going on and make the application usable. The thing is, we still need to add some sort of service level objective to achieve, and this is now an SLO in a staging environment.
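A first iteration of those timing points could be as simple as the following sketch; load_assets and the phase names are hypothetical placeholders, not the real GTA Online code:

```cpp
#include <chrono>
#include <iostream>
#include <thread>

// First-iteration timing points: wrap the suspected phases of the loading
// path in simple wall-clock measurements. load_assets() stands in for
// whatever the real application does during loading.
void load_assets() {
    std::this_thread::sleep_for(std::chrono::milliseconds(150));  // placeholder work
}

template <typename Fn>
void timed(const char* phase, Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start);
    // In a second iteration these prints would become real metrics,
    // and later on, tracing spans.
    std::cout << phase << " took " << elapsed.count() << " ms\n";
}

int main() {
    timed("parse_config", [] { /* ... */ });
    timed("load_assets", load_assets);
}
```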
We want to ensure that merge requests and pull requests are directly deployed from a CI/CD pipeline into a staging environment, and we're defining end-to-end test scopes, saying: well, the user is now doing the login, that is one of the test scopes; then the user is playing, and also experiencing things like logout and login issues, and so on, all for defining an actual SLO.
Let's just imagine saying: hey, the login time needs to be lower than two minutes, but that is for low-latency connections, for example, a metric which we can correlate from application performance monitoring and user monitoring in a certain way. If it's a high-latency connection, we want to raise the allowed login time to five minutes, because there could be traffic involved and other things. And when we detect a certain failure of the SLO, we have a quality gate.
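A minimal sketch of such a latency-aware check, assuming a hypothetical 100 ms cut-off between low- and high-latency connections, might look like this:

```cpp
#include <chrono>
#include <iostream>

using namespace std::chrono;

// Hypothetical latency-aware SLO check: the login-time objective is two
// minutes on low-latency connections and is relaxed to five minutes on
// high-latency ones; the 100 ms cut-off is an assumption for illustration.
bool login_slo_met(seconds login_time, milliseconds connection_latency) {
    const seconds threshold =
        connection_latency < milliseconds(100) ? minutes(2) : minutes(5);
    return login_time <= threshold;  // false would trip the quality gate
}

int main() {
    // A 3-minute login fails on a fast link but passes on a slow one.
    std::cout << login_slo_met(seconds(180), milliseconds(20)) << '\n';   // 0
    std::cout << login_slo_met(seconds(180), milliseconds(250)) << '\n';  // 1
}
```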
This brings me to defining an SLO process. In order to shift left, we need to update our culture, we need to update our workflows, and we need to define an actual process for that, meaning to say we need to be aware of quality gates in our CI/CD pipelines. So, for example, we can use Keptn as a quality gate, and we can combine it with Prometheus for the metrics measuring the SLOs, but we could also add chaos engineering, like adding LitmusChaos to it. In addition to quality gates, we also need to define application observability: adding the traditional application performance monitoring and enriching it with logs and tracing, combining everything into OpenTelemetry. So our SLOs are actually defined early enough in the CI/CD pipelines, with quality gates and with application observability.
Now, when we want to measure everything, we certainly need to find ways to automate things. So probably defining an SLO means we generate it from a predefined DSL, and I would recommend watching Andrew's talk around this: how we define it on GitLab.com, matching several things, even symptoms as SLIs, in order to go SLO.
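Purely as a hypothetical sketch, and not GitLab's actual DSL (see Andrew's talk for that), generating SLOs from a small declarative format could start like this:

```cpp
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical only: a tiny declarative format of the shape
// "name objective_percent window_days", parsed into SLO definitions
// from which tooling could generate rules and quality-gate config.
struct Slo {
    std::string name;
    double objective;   // e.g. 99.9 (percent)
    int window_days;
};

std::vector<Slo> parse_dsl(const std::string& text) {
    std::vector<Slo> slos;
    std::istringstream in(text);
    Slo s;
    while (in >> s.name >> s.objective >> s.window_days) {
        slos.push_back(s);
    }
    return slos;
}

int main() {
    const std::string dsl =
        "login_availability 99.9 30\n"
        "loading_time_p95   95.0 7\n";
    for (const auto& s : parse_dsl(dsl)) {
        // From here one could generate monitoring rules or gate config.
        std::cout << s.name << ": " << s.objective << "% over "
                  << s.window_days << " days\n";
    }
}
```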
And this brings me to basically summarizing what I've learned in my past: developers need to get the SLO feedback. It doesn't help when someone is paged or someone watches a graph; it needs to be there, directly integrated into the merge request or pull request. You need to get the details: why was the pipeline failing, what exactly is the SLO, and more details to read up on. And, of course, we need to train the teams to adopt it.
So we could have prevented introducing the coroutines, because the memory corruption would never have reached customers in large-scale environments, the GTA Online loading algorithm would never have been introduced in that way, and, of course, we would have added more code quality checks. This is everything we can do for improving SLOs, and in order to go even more SLO, we really want to correlate the SLOs with our instrumentation and observability capabilities in the future of our DevSecOps workflows.
We need to train everyone and make everyone aware that this is a benefit, and also make use of dynamic resource environments, because this is a huge benefit of auto-scaling virtual machines in AWS or whatever cloud is needed. We can also reuse the power of container clusters and add chaos engineering, saying: well, I want to really battle-test the application before it reaches production. And there was once an SLO that made it to prod.