From YouTube: CI/CD Observability with OpenTelemetry - Developer Evangelism Research Update 2022-05-05
Description
Feature proposal: https://gitlab.com/gitlab-org/gitlab/-/issues/338943
GitLab with OpenTelemetry Research: https://gitlab.com/everyonecancontribute/observability/opentelemetry-gitlab-research
Hello everyone! My name is Marty, and I'm a Senior Developer Evangelist at GitLab. Today I want to do a little deep dive around CI/CD observability, tracing, OpenTelemetry, and everything I've been collecting and thinking about in the past months. To begin with (I will link the issue in the video description later on): why are we planning to add OpenTelemetry tracing to CI/CD? How would it work with GitLab? Which components are involved? What are the most challenging parts, and where do you actually get started?
I will do a recap, or rather a summary, in my cdCon talk, where I'm talking about how we build CI/CD observability with OpenTelemetry. This will be a snapshot of the current development stage, or rather research stage, because there has turned out to be quite a lot to consider.
One of the first questions is: why are we actually doing this? Oftentimes there is the problem that pipelines are being blocked, or you want specific insights not only into the job duration inside the pipeline, but also into more context: it might be the Docker executor having a problem, or Kubernetes, or something else. So we came to, or rather discussed, the idea of saying: hey,
we really want to have traces and spans in this regard. There are specific ways to integrate with external tools and just produce the traces with a specific CLI, for example buildevents from Honeycomb, then send them to an external receiver and backend and do the visualization there. But over time a standard has been defined, which is called OpenTelemetry.
OpenTelemetry is a framework and a specification for traces and, later on, metrics and logs, but for now we will focus on tracing. One of the ideas behind OpenTelemetry is that you have a generic collector service that you just forward the traces to, and this matches the idea of saying: okay, within the GitLab pipeline, we want to send these traces. So, to really get started,
one needs to understand the flow of the traces, and for that I would just start from the issue, which has the proposed steps: getting started with variables, adding it to the GitLab Runner component, adding it to the GitLab server component, and then doing more inside those components. I will jump quickly into my slide draft to make it a little clearer what we are planning, or rather what we actually should be doing. More on that in the talk soon; stay tuned.
The thing is, when we want to start there, the runner is much more isolated in that regard: it allows you to compile a standalone binary, just execute something, and see the result on the terminal, and later on do the same on the server. On the server there is more going on: the pipeline is being triggered, there is a queue, the runner communicates with the server, and so on. So this can be a little tricky.
But to begin with, I decided: okay, I want to compile the GitLab Runner binary, which is written in Go. Inside, where the job is being executed, we're actually using the run method. We have different executors, and we can use, for example, the executor's name string as metadata for traces, enriching the context with more data. That is not just a metric, a measurement; instead it could be any type of string,
number, or boolean: whatever attribute is useful for later debugging why the pipeline could be blocking, for example, or why the runner resources are being drained. So let's start with the runner. When I started with the runner, one of the biggest challenges was: okay, now I need to understand how OpenTelemetry works. From the design principles we have the collector, which then stores the traces in Jaeger tracing later on. But for now the challenge is really to say:
I want to modify the GitLab Runner code to send traces to OpenTelemetry, which in practice means a TCP port listening somewhere. So I started my research in the development documentation, or rather the instrumentation documentation, of OpenTelemetry, and also started looking into the GitLab Runner development documentation and the Go guide, just to get an idea of what is necessary, what is needed, and how things work. From there I thought: well, I need a local GitLab Runner development environment, which basically means following the documentation and then understanding the make commands.
I ran into some trouble there, for which I will be creating some merge requests and issues after finishing up, but it's pretty straightforward.
One thing I couldn't make work are the helper Docker images, but at some point I saw something on the terminal which made me believe: okay, jobs are being executed, and I can go on and do something.
The other thing I did was to kind of automate the installation of the runner with the token, and to ensure that the runner is using a specific tag, like otel for OpenTelemetry. That allows me to verify that the jobs I'm triggering on the server are only picked up by this specific runner and are not interfering with some other configuration. I didn't want to start with setting up the GitLab server environment, because that comes at a later point.
I started out with creating a playground for this specific task, which lives over here in this specific group. It is more or less documentation: me documenting the steps, which development environment setup I did, and so on. Long story short, everything is linked together in the issue as well.
The thing is, this repository also provides the GitLab CI/CD configuration, which is basically just generating jobs using a matrix build and assigning the job tag, which is basically the magic; the other trick is to retry the pipelines all the time. So the idea is to just generate pipeline jobs and then see what the runner, or rather the modified runner binary, is doing later on.
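As a rough sketch, such a job-generating configuration can look like this; the job name, the variables, and the exact tag value are assumptions, not the actual playground file:

```yaml
# Sketch of the playground configuration: a matrix build generating many
# jobs, all pinned to the locally compiled runner via the otel tag.
stages:
  - generate

matrix-jobs:
  stage: generate
  tags:
    - otel            # only the research runner picks these jobs up
  parallel:
    matrix:
      - TOOL: [curl, wget, dig]
        ROUND: ["1", "2", "3"]
  script:
    - echo "Job for $TOOL, round $ROUND"
    - sleep 5
```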
This is being done over here, and the idea is really to execute the runner, which will be shown in this view; you can see that I already tested some things, or rather have been testing it for like two weeks now. The other piece, which will come later on, is the OpenTelemetry collector. But for now I can just execute
the GitLab Runner in debug mode, providing a custom configuration which defines the base images, the helper image, the executors, and so on. There is nothing really special about it; I was just following the development documentation.
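For reference, a local runner configuration of that shape looks roughly like this; the URL, token, and image tags here are placeholders, not the real values:

```toml
# Sketch of a local config.toml for the hand-compiled runner binary.
concurrent = 2

[[runners]]
  name = "otel-research-runner"
  url = "https://gitlab.com/"
  token = "REDACTED"
  executor = "docker"
  [runners.docker]
    # Default base image for jobs, plus the helper image mentioned earlier.
    image = "alpine:latest"
    helper_image = "gitlab/gitlab-runner-helper:x86_64-latest"

# The debug-mode invocation then looks something like:
#   ./out/binaries/gitlab-runner --debug run --config ./config.toml
```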
Once I started modifying everything, I was pretty much overwhelmed by OpenTelemetry: what can I do, how do I send traces, and so on. Then I figured: well, let's take a step back and only print the traces to standard output, to the console. The OpenTelemetry Go SDK provides a stdout trace exporter, which is super useful: you don't have to worry about any specific network connections or anything like that, and in addition to stdout you can also write into a file stream as a text file. But in the end one needs to learn how to initialize the tracer and then start adding some spans, which are then printed to the console.
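A minimal sketch of that first step, using the stdouttrace exporter from the OpenTelemetry Go SDK (this is a standalone example, not the actual runner patch):

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	// Exporter that pretty-prints spans to stdout: no collector and no
	// network connection needed.
	exp, err := stdouttrace.New(stdouttrace.WithPrettyPrint())
	if err != nil {
		log.Fatalf("creating stdout exporter: %v", err)
	}

	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
	otel.SetTracerProvider(tp)
	ctx := context.Background()
	// Shutdown flushes buffered spans before the process exits.
	defer func() { _ = tp.Shutdown(ctx) }()

	// A first span; the name mirrors what later shows up in Jaeger.
	_, span := otel.Tracer("gitlab-runner-poc").Start(ctx, "process-runner")
	span.AddEvent("executor acquired")
	span.End()
}
```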
Well, this is the manual instrumentation which I did, and I also ran into certain things which didn't work so well: I was initializing the tracer and then using defer to shut it down again, which made it so that my traces and spans were never sent anywhere. But yeah, after some hours of debugging I figured it out.
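Reconstructed as a hedged sketch, the pitfall likely looks like this: if the deferred shutdown lives inside the setup helper, the provider is torn down the moment the helper returns, and spans created afterwards are dropped.

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// setupTracing returns the provider so that the caller owns the shutdown.
// The bug: writing `defer tp.Shutdown(ctx)` inside this function shuts the
// provider down as soon as setup returns, so spans created later are
// silently dropped.
func setupTracing() (*sdktrace.TracerProvider, error) {
	exp, err := stdouttrace.New()
	if err != nil {
		return nil, err
	}
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
	otel.SetTracerProvider(tp)
	return tp, nil
}

func main() {
	tp, err := setupTracing()
	if err != nil {
		log.Fatal(err)
	}
	// Correct place for the deferred shutdown: the end of main, where it
	// flushes any buffered spans right before the process exits.
	defer func() { _ = tp.Shutdown(context.Background()) }()

	_, span := otel.Tracer("gitlab-runner-poc").Start(context.Background(), "demo")
	span.End()
}
```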
Long story short, the idea was to really make it work on standard output first, and at some point I got to: hey, I can see something. Just to give you an idea, what I actually did in the beginning was to set up the instrumentation, which works in a similar fashion everywhere.
What does "batcher" mean? What is a "resource"? There are so many new terms to learn when using the SDK.
Next: let's actually store the traces in Jaeger. Jaeger tracing is basically a collector and a storage, with Elasticsearch, Kafka, or ClickHouse as the backend, and it allows you to search for traces, visualize them, and so on. In connection with OpenTelemetry, you send the traces to the OpenTelemetry collector, and the OpenTelemetry collector sends them on to Jaeger tracing; that's basically the setup. Before doing so, I wanted to again take a step back and say: I'm first sending something to just the OpenTelemetry collector, because I learned from the documentation that you don't need to have a backend.
You can just use the logging exporter, which sends everything to the standard output on the console, and you can also enable debug logging, so you basically see everything that is going on and happening when traces are being sent over.
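A minimal sketch of such a backend-less collector configuration, using the collector's logging exporter as it shipped at the time:

```yaml
# Sketch: receive OTLP and print everything via the logging exporter.
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  logging:
    loglevel: debug   # show every span as it arrives

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [logging]
```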
After I had this success moment, I said: okay, what is next? How can I bring up Jaeger tracing with the OpenTelemetry collector, with, for example, ClickHouse as a backend?
Yesterday I started out and said: okay, I want to use the OpenTelemetry operator, the Jaeger operator, and the ClickHouse operator on Kubernetes, but for some reason I couldn't make it work; it was something like ports not being exposed and some TLS issues. For today I said: well, let's take a step back and focus on a local setup, done not only with this binary compilation, but just focusing on what's needed for a Docker Compose setup.
The OpenTelemetry contrib project has an example which also uses different traces and metrics and everything else. I removed everything which is not needed and just focused on adding the specific things I need: the Jaeger all-in-one container, exposing the ports, and the OpenTelemetry collector, which exposes two different ports. This is important: 4317 is for gRPC and 4318 is for HTTP.
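A minimal sketch of that Docker Compose file, assuming the commonly used jaegertracing/all-in-one and otel/opentelemetry-collector images rather than the exact file from the demo:

```yaml
# Sketch: Jaeger all-in-one plus the OpenTelemetry Collector.
version: "3"
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"        # Jaeger UI
      - "14250:14250"        # gRPC ingest used by the collector
  otel-collector:
    image: otel/opentelemetry-collector:latest
    command: ["--config=/etc/otel-config.yaml"]
    volumes:
      - ./otel-config.yaml:/etc/otel-config.yaml
    ports:
      - "4317:4317"          # OTLP gRPC
      - "4318:4318"          # OTLP HTTP
    depends_on:
      - jaeger
```

To actually forward spans, the collector configuration then needs a jaeger exporter pointing at jaeger:14250 in addition to, or instead of, the logging exporter shown earlier.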
I also needed to ensure the specific ports where the gRPC service is listening are reachable, which is strangely mapped with Rancher Desktop over here, but it doesn't matter: it's running, it's working, it's receiving data, and it also exposes the Jaeger tracing UI, which we can see over here. I'm clicking on Find Traces, and I've selected a custom time range.
I have a specific service already. I have no idea why this is called "no service name", because within the code I've configured a service name, but yeah. From the implementation-specific side, the idea is to send the traces, which are now instrumented in the code, to OpenTelemetry using gRPC. You can see the instrumentation on the slide (which is not that huge) and within Visual Studio Code.
It sets a context and then creates a new OpenTelemetry trace gRPC client, sets the insecure mode (not recommended), and also overrides the endpoint, because for some reason it didn't use my default, or it didn't read the environment variables; I decided I want to debug this later and just use what I set manually. Then it creates a new exporter, registers the tracer provider, and uses a batcher. So basically everything has been set up.
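As a hedged reconstruction of that pattern (not the exact diff; the endpoint is an assumption for a local collector):

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// gRPC client in insecure mode (no TLS; fine locally, not
	// recommended otherwise), with the endpoint overridden manually
	// instead of relying on the environment variables.
	client := otlptracegrpc.NewClient(
		otlptracegrpc.WithInsecure(),
		otlptracegrpc.WithEndpoint("localhost:4317"), // assumed local collector
	)
	exp, err := otlptrace.New(ctx, client)
	if err != nil {
		log.Fatalf("creating OTLP trace exporter: %v", err)
	}

	// Register the tracer provider globally; the batcher buffers
	// finished spans and exports them in batches.
	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exp))
	otel.SetTracerProvider(tp)
	defer func() { _ = tp.Shutdown(ctx) }()
}
```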
The most important changes are actually done in, for example, the runner's process method, where a process-runner span is started and specific events are added: after the executor provider has been set, after it has been acquired, after the build has been acquired, and so on. I also tried to add some more descriptive things, using a string for the runner name and the worker ID as an integer, just to see what is possible when adding more context and more metadata to the traces and spans.
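A sketch of the shape of that change; the function name and attribute keys are illustrative, not the runner's real identifiers:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// runWorker sketches the change: a span around the runner's processing
// loop, lifecycle events, and typed attributes for later debugging.
func runWorker(ctx context.Context, runnerName string, workerID int) {
	tracer := otel.Tracer("gitlab.org/gitlab-runner")
	ctx, span := tracer.Start(ctx, "process-runner")
	defer span.End()

	span.AddEvent("executor provider set")
	span.AddEvent("build acquired")

	// Attributes are typed metadata, not just numeric measurements.
	span.SetAttributes(
		attribute.String("runner.name", runnerName),
		attribute.Int("worker.id", workerID),
	)

	_ = ctx // the job execution would continue with this context
}

func main() {
	runWorker(context.Background(), "otel-research-runner", 2)
}
```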
Now, this is a little complex, but in the end it's just a small code addition to really send some traces out of GitLab Runner, when something is being executed, towards OpenTelemetry and then Jaeger.
I've seen now that when the runner is being executed, it connects to the server and at first there is nothing going on. There are some internal logging things you can see, like the SDK being set to debug log, which helps to see: oh, there are some spans being collected, things are happening already. Now the runner is connected to... oops.
This is the diff to this project; I won't show you the settings now, because the runner token is in there, but we can retry the pipeline, and the pipeline is running, and we should be seeing lots of jobs being executed. Then there is the issue with the failed helper image, because it requires a certain tag, but I haven't bothered with that too much. So it says it's exporting spans, and when we navigate over here into the Jaeger UI, we can actually click Find Traces, and we have traces.
We can also not just limit the results to 20; let's say: just show us what you got. 86 traces in the last two minutes, or rather since whenever I started the GitLab Runner. So it's sending something, and as you can see, it's sending process-runner as a trace, and it has one span inside, which is expected, because I didn't map anything else. But we can already see different execution times, which is already super amazing, at least for me.
A first success! And now, when we navigate into the trace and the span, we can see the timeline and what happens: we can see the Docker message and so on. But let's just click on this and look inside. What did we set here? The client library name is specifically set to gitlab.org/gitlab-runner, and I think you can add more to that.
Different workers: this is still worker number two. Maybe one can filter by that; I haven't really figured out how this works on the frontend yet. Still, it's a great way to visualize it. Now, the thing is, this is just the first step of seeing something and getting something to work, and when I show you the code, or rather the things I'm currently doing, there are many to-dos inside.
So this really needs cleanup and more thinking, and it's also too early to push a draft MR to the upstream repository at this moment, because I'm not sure yet whether this proof of concept actually works, especially when navigating into what's next, what's coming. This needs to be done for the GitLab Runner: adding more context, adding more things which could be interesting as an MVC (minimal viable change).
But we also need to ensure that the entire platform, the DevOps platform, actually supports CI/CD observability, so the server needs to be added as well. After the first success with the GitLab Runner, providing proof that it works with Jaeger tracing and OpenTelemetry,
I will focus on the server development environment: getting things up, investigating and researching the OpenTelemetry Ruby SDK for the server components, and diving into the code for the pipeline execution backend, which I haven't worked with before. I'm curious to learn and see how far I can get with the documentation, starting my own contribution journey there. As an MVC, I'm thinking about pipeline start and end for a span, and job start and end for a span.
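To make that span model concrete, here is a tiny sketch, written in Go to match the earlier examples even though the server side would be Ruby; all names are illustrative:

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
)

// tracePipeline sketches the MVC span model: the pipeline is the root
// span of one trace, and every job becomes a child span, so the Jaeger
// timeline shows which job takes the longest.
func tracePipeline(ctx context.Context, jobs []string) {
	tracer := otel.Tracer("example/pipeline")

	pipelineCtx, pipelineSpan := tracer.Start(ctx, "pipeline")
	defer pipelineSpan.End()

	for _, job := range jobs {
		// Starting from pipelineCtx parents each job span to the
		// pipeline span within the same trace.
		_, jobSpan := tracer.Start(pipelineCtx, "job:"+job)
		jobSpan.End()
	}
}

func main() {
	tracePipeline(context.Background(), []string{"build", "test", "deploy"})
}
```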
This should be within one trace, and we really want to see which span is taking the longest, and so on. This also needs some testing and some tryouts on what is best for a trace: starting it, ending it, and so on. The same goes for visualization: for now I'm using the Jaeger tracing frontend, and we might be doing something different with Opstrace and that integration, which is on the long-term roadmap.
But for now it's really about having a development environment which can be set up rather easily; hopefully in the future there will be more to that. Other challenges are not only instrumenting the GitLab Runner and GitLab server code, but also allowing users to run a CLI or something else, or, within the domain-specific language of the CI/CD configuration, to execute something and say: this, for me, starts a new span. So I can kind of instrument my CI/CD YAML configuration.
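Purely as a thought experiment (none of these keywords exist in the GitLab CI/CD YAML today), user-defined spans in the configuration might look something like this:

```yaml
# Hypothetical only: sketching what user-defined spans in the CI/CD
# configuration could look like. No such keyword exists today.
build-job:
  script:
    - make build
  observability:
    spans:
      - name: compile        # user-defined span around the build step
```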
Now, the other thing is: when the deployment is done and the CI/CD job is being executed, the agent for Kubernetes might be running, and then the application is running, maybe in a staging environment, maybe in production environments. We also need to continue there with OpenTelemetry, with tracing, with observability.
So this is another long-term goal to achieve. The talk I will be recording soon is good practice for me, and a good way for you to get an update.
The thing is, there are other topics like security and multi-tenancy. Which data can we actually expose from within GitLab Runner and GitLab server to an external tracing endpoint? How many possibilities are there to leak something?
Can we limit the configuration to a specific small group? Multi-tenancy in the sense of limiting what can be sent where, so you don't see everything, but only your group or your project on GitLab.com, on the SaaS platform. For self-managed instances it's also a challenge to ensure that not only admins, or those who own the platform or the instance, see the traces, but that users can actually debug their pipelines and see how long the jobs are running, and why, and so on.
This is something we need to try out and discuss with our product, engineering, and security teams to really figure out the best way to move forward. It also brings up the problem, or rather the question: hey, we're not a small instance; we might be running lots of jobs, lots of pipelines, lots of executors and things. How does tracing, enabled by default, actually affect the performance of the GitLab instance or of the SaaS platform?
This is something I don't know yet. Will it be enabled by default? Can everyone just enable it? Will it need certain quotas or things like that?
It's still a proof of concept, so we will need to do benchmarks, reviews, different test scenarios, and so on. Me sharing this update now doesn't mean it will be finished in one month, two months, or however many months; it's just great to be able to provide an update now. I will dive more into all the ideas and things which are also documented in the issue, but for now let's just focus on this. And beyond that, as I've mentioned in the beginning:
we also need to consider how this is used within GitLab itself, and this ties into using LabKit for the Go and Ruby instrumentation from within GitLab, and seeing whether the efforts can be combined so we don't duplicate work. For now, the proof of concept is not focusing on what LabKit is currently doing, but this also needs collaboration and discussion on what is actually possible. When I have the first test MRs, I will be reaching out. For now,
it's really too early, too soon, to discuss whether to merge it or where to go with it. Yeah, and this demo has been running for a while now, potentially generating new traces somehow.
I will continue working on this for a little while, figuring out what other data I can provide, and then kind of freeze the Go code for now and focus on Ruby, potentially tomorrow or next week. I will be at KubeCon in one and a half weeks; I potentially won't have time to hack on this, but if there is time, I'm happy to do some hacking, or to discuss some ideas in person,
if someone is there. And yeah, as said, I will link the issue in the video description and vice versa, providing an update in the issue, and everything else is public. So if you want to read up on what I'm experiencing and documenting, you can follow along over here, as well as in the development environment guides on how to set things up. Everything is documented, or rather noted down, in here. And with that: thanks for watching, and see you soon in the next update!