From YouTube: Discuss CI/CD Tracing and OpenTelemetry issue
Description
James and Michael spent some time talking through an issue to add trace collection to GitLab CI/CD pipelines, the tech behind it, what's happening in the industry, and a possible future vision for this area of functionality.
https://gitlab.com/gitlab-org/gitlab/-/issues/338943
A
So this is our discussion around CI/CD tracing with OpenTelemetry. This started, I think, as a Slack message and turned into an issue that we've been discussing, and Michael, let me say I am super impressed with the issue that you have opened here. I'm gonna go ahead and share my screen so that we can talk through the issue, since you've authored it, and then we can take the conversation from there. Does that work for you, or do you want to share your screen?
B
Yes, I think I can go ahead and share my screen. To be honest, I think the issue has gotten a little long, but I think it really needs some context on which problem we are trying to solve, how we can do it, and what technical background you really need to know about OpenTelemetry and all the tooling involved, because it's a huge ecosystem, and I think there are many things you probably don't need to know when you think about tracing, but in the end it will help. So one of the things around CI/CD tracing was:
B
Why would I need that? And one of the cases is that we're not just having the single pipeline with five jobs where everything is fine. Oftentimes we design complex architectures of CI/CD power plants: we have asynchronous dependencies, we have side effects, we have durations which oftentimes exceed one minute, five minutes, and then the pipeline is being run and there is no fail-fast pattern involved or anything like that. So it consumes resources, and those resources might consume, or do consume, cloud resources, compute power and so on.
B
So it wastes time, essentially. And then there is the question: how would I approach this? Where should I be looking to get more efficient with the pipelines? We do have our pipeline efficiency talks, which started out from a CI monitoring workshop I did one year ago or so, and I was like, okay, we might have metrics, which we can solve with the Prometheus exporter from our community, but we also need to have a deeper look into our pipelines, and there is another issue which is being linked.
B
I think at the bottom somewhere, where we talked about presenting this in the UI or in the UX. So you have your pipelines. Let me see if I can find that quickly. Is it linked? Is it not linked? I think it's not linked yet, but the idea was to have the GitLab pipeline inside... no, I probably should have searched for it. Pipeline analytics? No, it's not this one, but I can probably find it later.
A
Is it an issue about showing the insights of the pipelines within the UI?
B
Yes, it has UX elements inside, and there is an animated GIF or a video inside which shows how you can navigate into the new pipeline view. You can zoom in, you can view the duration, to really see, like a weather map or heat map, which job, for example, is exceeding a certain limit.
B
Okay, the thing is, the duration is only one value, one metric you want to be looking at, and oftentimes it's not just this one job taking five minutes or taking quite a while; it can be a number of things, or a timeline of things, influencing the overall behavior of a slow pipeline, or of a pipeline which doesn't scale that well or consumes too many resources. And the idea around tracing is that you basically define a span which has a start window, a start timestamp and an end timestamp, and you, for example, measure the duration inside, of course.
B
But you can also enrich that with more metadata, similar to logging in a specific sense, but put into these span blocks, and there is not only one span in a trace; it can be a multitude of linked spans. I don't know, did I put a... I think I didn't put a picture in there, but yeah, we have more insight with tracing into that, and if I change that picture into saying:
B
Okay, I'm treating, for example, the job start and the job end as the bigger span I want to look into, but I could also be looking into (and this needs to get the runner inside) when the job starts, when the docker pull starts, when the preparation for the git submodules starts and ends, and just provide a more fine-granular insight into what's happening on the runner side, but also what's happening on the server side, for example uploading the artifacts or the caches, or something like that, so as to see the background infrastructure. But I also have the possibility to instrument our own jobs.
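As a rough illustration of the nested spans described above, a minimal sketch with the OpenTelemetry Ruby SDK could look like the following. The span and attribute names are made up for the example and are not anything GitLab or the runner emits today:

```ruby
require 'opentelemetry/sdk'

# Minimal sketch: configure the SDK; exporter and endpoint come from OTEL_* env vars.
OpenTelemetry::SDK.configure do |c|
  c.service_name = 'ci-job-tracing-demo' # hypothetical service name
end

tracer = OpenTelemetry.tracer_provider.tracer('ci-tracing-demo')

# One outer span for the job, with finer-grained child spans for each phase.
tracer.in_span('ci.job') do |job|
  job.set_attribute('ci.job.name', 'build')           # made-up attribute
  tracer.in_span('runner.docker_pull')      { sleep 0.1 } # stand-ins for real work
  tracer.in_span('runner.git_submodules')   { sleep 0.1 }
  tracer.in_span('job.script')              { sleep 0.1 }
  tracer.in_span('server.upload_artifacts') { sleep 0.1 }
end
```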
B
Maybe, and this is something which is linked with Honeycomb's buildevents over here (this was not intended): we have buildevents, which is basically a CLI which gets called, and then you start your span and end your span, and it communicates with Honeycomb. But embedding that into CI/CD scripts is a little cumbersome in that regard, because you need to have many environment variables and so on.
B
So from a user perspective, it's a little complicated to get started. Still, the tool itself is, I would say, straightforward to use once you dive into it, but again, it doesn't feel out of the box, and it's probably easier, at least for me as a consumer, as a user, to just say: hey, I want to enable CI/CD tracing, or configure that from my CI/CD template, and then I get an insight and, depending on how I configured it, the tool basically sends the traces to a backend, and then there's the frontend where I can view and analyze this, in the best possible scenario.
B
It should focus on OpenTelemetry as a framework, as a tool which allows us to define where to send the traces, like using a defined backend, which is shown in the picture. So OpenTelemetry acts as a collector service, where you send the traces to, and it stores them, or rather it allows you to store them in a backend which is not OpenTelemetry itself: it can be Jaeger tracing, it can be Grafana Tempo, it can be something else for tracing, a vendor providing this.
B
I don't know, I think Splunk has something in that regard. So there are many different vendors also involved in the OpenTelemetry project, and the idea is to provide a generic interface, merged from OpenTracing and OpenCensus into OpenTelemetry, and store all these traces there. OpenTelemetry has another specification coming up for metrics and also for logs, so it's not just OpenTelemetry for tracing, but that is out of scope here; it's just additional information that OpenTelemetry covers more than just traces.
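To make the collector-and-backend wiring concrete, a minimal sketch with the Ruby SDK and the OTLP exporter could look roughly like this. The collector endpoint and service name are made up, and the collector is assumed to forward to whatever backend (Jaeger, Tempo, a vendor) it is configured for:

```ruby
require 'opentelemetry/sdk'
require 'opentelemetry/exporter/otlp'

# Minimal sketch: export spans via OTLP to a collector, which forwards them to the
# configured tracing backend (Jaeger, Grafana Tempo, a vendor, ...).
exporter = OpenTelemetry::Exporter::OTLP::Exporter.new(
  endpoint: 'http://otel-collector.example.com:4318/v1/traces' # made-up endpoint
)

OpenTelemetry::SDK.configure do |c|
  c.service_name = 'gitlab-ci-tracing-demo' # hypothetical
  c.add_span_processor(
    OpenTelemetry::SDK::Trace::Export::BatchSpanProcessor.new(exporter)
  )
end
```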
A
So you're thinking... I mean, I think that the problem is well founded. We see, based on what we've heard from customers about pipelines: when pipelines get too long or they unexpectedly change, they want to understand why, and we don't have great... we don't provide great visibility into that today out of the box.
A
So I think the problem is a problem space that's ripe for improvement. It sounds like the MVC, then, would be: you can enable spans or tracing within your GitLab CI/CD pipeline and configure a place to go collect it. All we're going to do is say, for every step within the jobs, we're going to start and end a trace as part of the pipeline run for you, so that you can then go look at those metrics. You could then leverage GitLab to do that within our Jaeger implementation or Prometheus,
A
if you wanted. And then a further iteration on that might be a non-free feature set that allows you to visualize that alongside, like, the pipeline editor, or somewhere in that experience you can see: hey, this is taking a long time, either at the last run, or you could start to dig in and say, hey, historically this job is taking longer, this pipeline is taking longer, whatever it might be. Is that what you're thinking?
B
Yeah, that describes it very well. I tried to break it up into as many parts as possible, because I know that everyone really wants to have, like, the fancy web interface, but we can really, or we should really, focus on using what's already there, and oftentimes it's also the case that, for example, someone already has Jaeger tracing in their environment and everything else. So we want to do the integration, and when I saw the Datadog CI Visibility feature, I looked at what exactly happens in the background, so which events are there and so on, and from there I was like: hey, if we do that for Datadog, we can also do that, for example, using an OpenTelemetry integration. Or something which I found today, actually (I need to scroll down a little): I've seen at KubeCon that the CNCF started CloudEvents, which is something around a common way of describing event data, also for CI/CD.
B
So this could be something outside of the issue which I created for OpenTelemetry, but keeping it in mind: for every workflow in our CI/CD pipelines being executed, either in Ruby or in Go, there could be a tracing window or tracing information, but it could also be metrics or events or something like that, and this is where it touches base with everything. But to really scope it down:
B
I would want us to add the following, and this is the proposal. First off, get a feel for how the OpenTelemetry Ruby client works, so this is the SDK; also have a look into the Go SDK for the runner components; and then follow what is already there,
B
what has been added, for example, with Datadog and the specific implementations, and based on that use a demo environment around Jaeger tracing as the backend and OpenTelemetry as the collector. I've found that there is a Kubernetes operator, for example, which can be used. And the minimal change for configuring this would be a config setting which allows us to specify host, port, authorization, and I think you need to specify the backend as well, but I didn't really dive into it yet; it's more of a high-level idea.
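Just to give that high-level idea a shape, the kind of setting being imagined might look something like the following. This is purely a hypothetical sketch; none of these keys exist in GitLab today, and the token variable is made up:

```ruby
# Purely hypothetical shape of such a config setting, not an existing GitLab option.
ci_tracing = {
  enabled: true,
  collector_host: 'otel-collector.example.com', # where to send the traces (assumed OTLP collector)
  collector_port: 4318,
  authorization: ENV['CI_TRACING_TOKEN'],       # made-up token variable
  backend: 'jaeger'                             # e.g. jaeger, tempo, or a vendor
}
```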
B
This is how I would propose the steps, and when this is working and we can see success, by pulling it into the UI for CI/CD tracing, we can then, for example, compare it with the Datadog integration, compare it with other CI/CD tools on the market and how they solve the problem, and from there we can see whether we build a UI integration that allows adding, for example, specific monitoring and observability in regards to tracing analytics, something which would then be in the Enterprise tiers, for example.
B
But this is bound to: hey, let's make this work now. And since it is groundwork, and also combining existing resources, and also usable for a single developer, I was proposing it for the free tier, for the Core version, and based on that, and hopefully increased adoption of the feature and gathering feedback, we can build out more features for customers: dashboards, CD dashboards and so on.
A
Yeah, I like leveraging this for free, so you can turn it on and specify a collector, or by default even, we could make our Jaeger tracing implementation be the collector if we go the tracing route. I'm gonna do the product manager thing and say: what other ways can we solve this problem outside of tracing? If we stepped back from OpenTelemetry and said, I don't have insights into my pipelines?
B
Or, like: when I prepared the CI pipeline monitoring webcast last year, actually, I was like, yeah, we totally have all the tools and all the metrics and everything is there, and then I learned we don't have that much. We have certain metrics in our PostgreSQL backend, but these are rather expensive to store, and accessing these metrics is expensive.
B
So when I found the GitLab CI pipeline exporter from our community, which is also linked in the pipeline efficiency docs, it was like, okay, we have something around that. So, how to say it: we want to have self-contained monitoring inside GitLab, and it should be enabled by default.
B
I think currently it's not enabled by default. And on the other side, we also want to provide an interface for users who have existing monitoring and observability solutions, so exposing the pipeline metrics from /metrics on GitLab.com somehow would also be nice, and also for self-managed installations. Right now this isn't possible; what is possible is to query the REST API, which then does SQL queries in the backend, and to have a daemon running in front which calculates the metrics in the Prometheus format.
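That daemon-in-front-of-the-REST-API pattern is roughly the following. A minimal sketch, assuming a project ID and token in hypothetical environment variables, and printing a single made-up duration metric instead of serving a real /metrics endpoint:

```ruby
require 'net/http'
require 'json'
require 'uri'

# Minimal sketch of the "daemon in front of the REST API" pattern:
# poll the pipelines API and emit Prometheus-formatted metrics.
# GITLAB_PROJECT_ID and GITLAB_TOKEN are hypothetical environment variables.
project = ENV.fetch('GITLAB_PROJECT_ID')
token   = ENV.fetch('GITLAB_TOKEN')

def api_get(uri, token)
  req = Net::HTTP::Get.new(uri)
  req['PRIVATE-TOKEN'] = token
  res = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) { |http| http.request(req) }
  JSON.parse(res.body)
end

pipelines = api_get(URI("https://gitlab.com/api/v4/projects/#{project}/pipelines?per_page=5"), token)

pipelines.each do |p|
  # The per-pipeline endpoint includes the duration in seconds.
  detail = api_get(URI("https://gitlab.com/api/v4/projects/#{project}/pipelines/#{p['id']}"), token)
  # Made-up metric name; a real exporter would serve this on /metrics instead of printing it.
  puts %(gitlab_ci_pipeline_duration_seconds{ref="#{p['ref']}",status="#{p['status']}"} #{detail['duration'] || 0})
end
```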
B
So it's a really nice project, and I don't want to, like, shut it down and just make it available in GitLab itself, but getting inspired by it, providing more cached metrics and faster access to the performance values, could give us the possibility to have something like this, in an abstracted way, in the default pipeline overview, so that we can, for example, see, maybe with smaller bullets or something: hey, there is a pattern of failed jobs, or hey, there's a pattern of flaky unit tests.
B
I think we have some sort of detection for this already, but more in a way that, when I navigate into a project (let me see if I can quickly find one; I did a pipeline efficiency workshop a while ago and I started playing), and when I, for example, just navigate into the pipeline overview, is there just something which is usable? Yeah, maybe let's use a different one.
B
We do have... I did a workshop on Saturday around CI/CD, and we had this pipeline. Yeah, this looks a little more interesting. So when I have this overview, it would be interesting to not only hover over that and see the dependencies, but also get a pop-up, for example: this job took one minute and this job took 20 seconds, and just because there is a default threshold of 40 seconds, the 60-second one turns red automatically, and I can change the threshold, for example, which is...
B
Like, it's a little do-it-yourself monitoring, but it allows you to see, or to filter visually, the duration of jobs. And the duration is an indicator of how long the job took; it's not really a root cause analysis, because it could be network dependencies or latencies, it could be that the docker pull is taking too long because the image is 10 gigabytes, and much, much more. But really getting a high-level indication of: yeah,
B
this actually looks good, but the duration is a little off here, or, in the past this job failed three out of ten times; having that detail here and not having to navigate into Analytics and CI/CD, because it's interesting over there as well, don't get me wrong on that, but it's a different scope.
A
Yeah, being able to see from your branch, like: here's all of the pipeline executions within this branch, here's how it compares to your target branch if you have an MR open. Being able to see the history, being able to set that threshold, being able to dig in and say, here's how this job has changed, maybe those are contributing factors, and, you know, dig in from there to the tests, if it's a job in the test stage, and back into the CI/CD analytics.

A
I think all of that is super interesting as future iterations. That makes sense.
B
If I move into, like, the merge request view: if there is, for example, a deployment into a staging environment and I can measure the metrics... so I know we have the possibility to collect metric reports ourselves, that's there, but the possibility to have this in a more automated way. So it's not just the possibility to add something, but there is automated collection, maybe of, like, the job duration, or detecting something and having this presented
B
somehow. I don't know how exactly it fits the UI and the UX, because the merge request has a lot of information already, yeah, but, for example, if the pipeline were failing, it would be super helpful to see: hey, the pipeline failed, there were six jobs, and the duration was XYZ. So, to have:
B
where should you be looking? Should you open the pipeline editor? Should you open the pipeline view? And, like, improving the navigation; currently it's a long way to really get into the job and see why it failed, or see how long it took, for example.
B
It's actually like the overview, similar to Grafana: seeing jobs or patterns which have a long execution time, and if tracing exists, you click on it and you can immediately have a pane opened which shows you, on the right-hand side: hey, this is the trace of this job. So you can see, okay, the duration is my service level objective, let's call it the SLO:
B
it should be not more than one minute, but if it exceeds one minute, we mark it as red, as failed, and then you can start investigating, and not just read the job logs and try to figure out how long it took and manually calculate the steps from docker pull to actually executing a script, but have this in a visual way. So this would be my dream of analyzing this, yeah.
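The check itself is simple once the span data is there; a rough sketch, with a made-up one-minute SLO and hypothetical span timestamps as they might come back from a tracing backend:

```ruby
# Rough sketch: flag a job span that exceeds a made-up one-minute SLO.
SLO_SECONDS = 60

# Hypothetical span timestamps as they might come back from a tracing backend.
span = { name: 'ci.job:build', start_time: Time.now - 95, end_time: Time.now }

duration = span[:end_time] - span[:start_time]
status   = duration > SLO_SECONDS ? 'red (SLO exceeded)' : 'green'
puts "#{span[:name]}: #{duration.round}s -> #{status}"
```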
A
That would be, like you said, more of an operations view: if you're the DevOps team, or you're like our engineering productivity team, monitoring a pipeline over time, seeing when the pipeline starts taking much longer than it had been historically, quickly being able to diagnose within that what some of the top contributing factors are. Is it that, oh, we just have docker slowness right now, or these jobs changed, and after they changed, their duration took a long time.
B
Yeah, exactly. And I also don't want to, like, separate the teams and make them silos again, which we often see: like, hey, this CI/CD pipeline doesn't work, so you open a ticket for the infrastructure team to have them fix it, and they come back and say, yeah, but actually it's the pipeline configuration which makes it slow, it's up to you, and then you play ping-pong between the teams, which is not efficient.
B
A product manager, a designer, everyone who is working with the pipeline should be able to diagnose the problem and, with documented instructions or runbooks or something else, be able to fix it. Because when the pipeline is failing and you really want to test something, or use the review apps, for example, for testing UX, you cannot use that, and it blocks your workflows.
B
It's frustrating, and you always have to ask someone else for help. But what if no one is available? Then it's like, hey, I need to start reverse engineering the source code or the pipeline code. It works, but it's not much fun. And given that, there could be, you know, some sort of AI involved which tells you: we detected that the pipeline duration is going up, always on a Monday, from nine to five.
A
Some of those insights... from within the issue, there's an interesting article from the Slack engineering team talking about how they were instrumenting some of their CI/CD pipelines and how they made them a little bit better, or a lot better in some cases.
B
And the thing is, and I'm trying to find my own talks here, I do have a talk around efficient DevSecOps pipelines.
B
The first thing is: even though it looks nice to have that, or to have your own, like, GitLab in-house monitoring system or something like that, it's extra work. If you're a single engineering group, or if you're, like, the DevOps engineer on your team and you also need to do the monitoring, but just for GitLab, or you want to use what's already there, I think it's better to have it integrated. And oftentimes,
B
at least for me, I don't want to understand yet another tool which has a different configuration syntax. I probably want to have it, like, out of the box, or as a cloud service or something else. So I changed my opinion from "I'm hosting everything myself" to "let's pay someone to do it for me", because I don't want to learn it all, yeah.
B
It's super interesting if you're familiar with it, and if you want to do it: PromQL and Prometheus and Grafana are amazing tools, but still, it needs maintenance, it needs knowledge, and especially when that side breaks (not the CI/CD power plant breaking, but the monitoring breaking), there is this kind of additional maintenance, versus just having it inside Omnibus or inside the Helm charts where everything works out of the box, and even then it's not just an empty dashboard, you have predefined examples.
B
The thing is, when you do it on your own, like do-it-yourself monitoring, you don't know which metrics are important. So there's the best practice, the opinion from our engineers and from our product teams, and also using, or dogfooding, how our infrastructure team and SREs use that on a daily basis, because CI/CD on GitLab.com can also be, like, slow, and you need the insights.
B
I think it's better to have it, like, in the product, and not written down somewhere that you need to select this metric and, by the way, that one too; we can leverage the knowledge we have in production and also make the product better, yeah.
A
Yeah, if you're that person who notices the pipeline has slowed down, seeing those trace spans or the metrics or whatever it might be within the same tool, versus: now I need to go log into another vendor. What happens if that vendor is down? That just adds complexity to trying to solve your problem, because now you have to wait for them to be back up, et cetera, et cetera. So, okay, that makes sense.
A
Yeah, I like that we can predefine some dashboards and be opinionated about what a good pipeline looks like, or what things you should pay attention to within your pipeline, what might be contributing factors to it slowing down. I think that there's appetite for that, especially for those folks who are single engineering teams, or there's the one DevOps person on the team or within engineering, or, even worse, that DevOps person left. I've talked to a few customers like that.
B
And the other thing I wanted to mention, and now I found the slide and the issues, or the epics, which I can then link to (just let me quickly share the presentation with you in Slack): the idea is, like, the animated GIF is something I stole from the board, from the epic Vitica created, so that you can really see the timeline of the traces, or the spans, or the jobs, something which also gives you a health indication, a health indicator:
B
if this continues like this, the pipeline will be broken all the time. Something which adds the visual aspect, also with the DAG diagram over here, which is something where I think we saw it with GitHub Actions: making the pipeline view the one you really want to be looking at when debugging a problem, and not having to navigate into the job details all the time, because the job log is interesting and it's nicely formatted,
B
but it's too much, really. We really want to condense down the level of detail. And is there anything else I wanted to show on this slide? Potentially not. The other thing I'm, like, thinking about is: when we add CI/CD tracing and CI/CD insights, how can we build something like quality gates, with Keptn or something else, into CI/CD?
B
So, like, using a mechanism of saying: it's not only CI/CD as the framework, as the tooling, which we want to monitor or observe, but we also leverage that into: hey, the deployment doesn't perform that well. We provide an out-of-the-box possibility to monitor the deployment, and when the SLO is not right, and so on. So I'm not talking about duplicating Keptn's work here, but just having in mind that we also need to look beyond our own ecosystems and insights.
B
But what else comes to mind: when something around metrics is failing, maybe we do a rollback of the deployment, something like that. But I think this goes a little beyond the scope of pipeline insights, yeah. It's still an interesting idea, though.
A
And I've talked to Kevin about that: how does it compare to production, or compared to test data or previous production data; if traces are now different in a meaningful way, doing an auto-rollback based on that, things like that.
B
That's, again, me thinking about many topics: you know, you have traces, you have logs, you have metrics, and maybe you want to do profiling, or continuous profiling, things we have seen with Polar Signals now, or I think it's called Parca. parca.dev is the domain, yeah. So they open-sourced the tooling, or the tool stack, around continuous profiling. I didn't try it out yet, but having the possibility of what Polar Signals provides...
B
This is very much based on the application itself, so you would want to preload something, or see which syscalls are being performed by the application, or which internal function it calls very often, and if the function takes one second it's fine, but if it takes five seconds it's bad. But this is also an interesting idea: you deploy not only to production and profile your production environment,
B
but what if I deploy my merge request into a staging environment and everything gets monitored and profiled inside, and then I can compare a merge request, which is essentially like a git commit, and I can diff those things and compare everything which might have had an influence. And one of the examples I experienced myself in the past was (let me see if I can quickly find another talk of mine, which I will be giving at All Day DevOps
B
this week, too): we were building a daemon written in C++, and for some reason the API frontend was not fast enough, so we thought about moving from threading to coroutines, lightweight stacks, basically lightweight threads.
B
Basically, the problem was that the memory usage increased by using those, and we also had some crashes which we couldn't really detect over time, and, like, having something inside CI/CD while developing the feature and deploying it, and detecting those regressions beforehand, would have been really, really nice, but we didn't have that back then. So we debugged for half a year or something, bisecting
B
between the release and the next major release, which were like a thousand commits apart, and every single commit was deployed to production, ran for three days in a production environment, and then you continued searching for the problem, and after a while you figured out: hey, this is the commit that breaks it. Yeah, but if you apply the fix to it, it doesn't work.
B
Detecting those regressions or performance problems earlier in the process, before actually releasing that to customers who call you and say "please fix that, now, and yesterday", while you're saying "hey, I have no idea what to look at": this is something which plays into that as well.
B
And again, the problem of this issue, or this feature request, is really scoped to tracing, but it's like building Lego bricks, one ties into another. We have different tools available, we need to, like, choose what is already there, and then also collect feedback from our wider community and customers, so that we also know their use cases, because I can source from my 10 or 20 years of development experience,
B
but there are so many challenges out there, and different things in the programming languages, and you know from unit testing that it's challenging to find, like, the defined format. For CI/CD, I think, for our own pipeline and runners we can go our own way, because we collect that and we have that available. When it comes to SLO collection, quality gates, profiling and other stuff,
B
we can look into ways of making it easier for our community to use it, yeah: provide examples, use cases, demo videos and so on. But this is, like, me thinking out loud, or this is the three-year plan, or, I don't know, it's an arbitrary number of years it will take to really achieve that everything is lovable.
A
Yeah, I like the starting point, though: your pipeline, and you want more insights into your pipeline. You could go instrument it yourself and get the data out; there's a project out there for it, it works for GitLab, but you want more insights than what it provides, you want them within the same tool, and you don't want to stand up and host this thing yourself.
A
So if we just start with: we're providing data that you could point at a collector, and if we can then integrate with, or point that at, our own Jaeger collector or a Prometheus collector so you can see it within GitLab itself, still all within the free tier, I think it's a great start, and we can go from there and see what is super interesting, or what is interesting.
A
Does that sound reasonable as an MVC? Yes, I like the way that you've broken it down. I think the progression of steps makes sense. We'd want to get some engineering folks looking at it and poking at it a little bit, but I feel like the problem is a good problem to try to go solve and to tackle.
B
Yeah, I think we should be looking into tracing, because everyone is looking at it right now, so the technology is evolving, and there are possibilities to provide our feedback to the OpenTelemetry project and to the cloud native community. So, with metrics... I've linked the epic now, which proposes the views and the things which we already have. This is a separate problem to solve: we have metrics in our backend,
B
we want to present that in the UX, and the pipeline duration, but this can be tackled, I think, in a separate way. The only difference, or the only thing which it has in common, is measuring the overall job duration, for example, because the job duration can also be a tracing span.
B
That is an overlap, so it might make sense to have the same engineering teams work on this, or at least, like, sync on this. But from, like, the MVC standpoint, you can totally start adding OpenTelemetry into the Ruby code now; then, as a second step, look into the runner code, for example, and see how to communicate, how to, like, sync that even, but that's a separate step. First things first would be just to...
B
I did that a while ago with Jaeger tracing and C++, and, I know, I'm bad at estimating hours, I don't do that, but I would say it's an interesting technology and we should be looking into it, and if I can help, like giving thoughts or giving directions, I'm happy to jump in, yeah.
A
I think that's what I would describe as next steps, yeah. We might do just a little bit more validation around the problem space, to make sure that where we start is the right place. Whether we're just getting the job duration, and doing it from spans, might be interesting, as well as getting the spans and being able to collect them somewhere so you can visualize them yourself. I think that would be a good step, like a first good outcome rather than a first step, with the outcome we would want for users. Yep.
B
Okay: measuring or tracing what we already have in our code, and at a later point we can say we want to make it like a user-facing function in CI/CD, so you have, like, an exclamation-mark trace as the YAML function, or, I don't know, no idea about the implementation yet, and then think about how to solve it on the runner side, to trace a script execution, for example.
B
But this is beyond the scope of the first success, for now.
A
Yeah, cool. Well, anything else you want to cover before we wrap up? I think we're getting close to time.
B
I want us to focus on getting started and having the first MVC hopefully in the near future, and also see what things we can reuse. Maybe it makes sense for you to have a talk with Andrew Newdigate from our infrastructure team; he has knowledge around OpenTelemetry as well. So, yeah, just collecting more ideas for the future, but also finding a way to get started now, or in the coming release planning. Yep.
A
Yeah, we'll evaluate this against the existing priorities for the upcoming milestones and see where it fits in, but I think that this is something worthwhile to dig into a little bit more, for sure.
B
And as a last thought: we can create content and blog posts around it, sharing our insights and our learnings, and engage more with the cloud native community in that regard. Yep, agreed.