GitLab #EveryoneCanContribute cafe, 14 Apr 2021

From YouTube: 25. #everyonecancontribute cafe: Observability with Opstrace

Description

Opstrace starts at 5:56 after introductions.

Blog: https://everyonecancontribute.com/post/2021-04-14-cafe-25-opstrace-observability/
Twitter thread: https://twitter.com/dnsmichi/status/1382365947122581506
Website: https://opstrace.com/

Open Source observability is moving fast, it is hard to catch up. We want to make things easy to deploy and use.

Insights

- Quickstart installation in AWS.
- Opstrace deploys Loki, Cortex, Prometheus, Ingress Controller, APIs, UI, Grafana in the Kubernetes cluster in AWS.
- Authentication with Auth0, future brings Dex to provide SAML, etc. for SSO.
- Grafana comes with default dashboards.
- You can send data to Opstrace from a local demo environment with docker-compose.
- Metrics generated by Avalanche, scraped with Prometheus. Log messages scraped with Fluentd.
- Grafana combines Loki (logs) and Prometheus (metrics) as data sources.
- Easy to use Prometheus Alert Manager, configuration using an API for automated rules creation, or a UI. The Cortex functionality is proxied by Opstrace with an authentication token and API interface.
- Roadmap ideas: SLOs and error budgets - generate rules and provide templates out of the box.
- Monitoring Cloud Vendor Metrics, no Prometheus provisioning. Instead, send configuration over the API and a new cloudwatch_exporter container is deployed to the Opstrace tenant.
- Open discussion with ideas and questions:
- High Availability - out of the box, Cortex comes with 3 nodes by default, and cloud/Kubernetes takes care of failover.
- Which problems are not yet solved with monitoring/observability?
- Now focus on onboarding, easy to get started with Open Source, similar experience like Datadog.
- Improve usability of Grafana, should be much more collaborative as a UI. Make it a debug session, and instead of using Google docs / Notion, add text, graphs, etc. and have these documents live in there, even after a year.
- How to answer any question - links between logs, metrics, traces. Exemplars for linking metrics and traces, released in Prometheus 2.26. More on this Grafana blog post about Tempo and our 6. Cafe with Tempo when it was announced in October 2020.
- Integrating Opstrace, e.g. a graph into Merge Requests from a staging deployment.
- Join the issue tracker and Slack to discuss development ideas.
- Thought of integrating Vector for logs?
- What was the intention to create Opstrace?
- Ask infrastructure questions, and needed to collect data. We love Prometheus, but there is still so much to build.
- Datadog and it runs in your SaaS, first idea was more closed.
- Continued to iterate, we are standing on the shoulders of giants - make it an open source project. It is harder.
- Don’t re-implement everything, work together.
- Reporting dashboards & customization - make it easy to use.
- Incident management integrated with GitLab and alike.
- As a developer, I don’t care about the configuration or the service being run in Kubernetes. I want to see metrics from a staging deployment, and focus on the fun stuff.
- Security comes out of the box - communication between monitoring nodes. GDPR for logs, and compliance levels. What data is stored in the backend?
- We’ll revisit Opstrace in the future and see how things are going. And of course try it ourselves, maybe in a future #everyonecancontribute cafe.

A

Okie dokie, hello everyone, welcome to our 25th iteration of the Everyone Can Contribute coffee chat, or cafe as we call it. Today we stop the Kubernetes learning workshop and do a different topic, but somehow related, I suppose, and I'm super happy that we will dive into monitoring and observability with Opstrace.

A

It's a relatively new project, and I'm happy to welcome Sebastian and Matt today. I would say we start with a short introduction round to say hello, and then we'll kick off and Sebastian will do a live demo. To be honest, I don't know what to expect, but we will have crazy ideas in the next hour and I'm looking forward to it. I'm the crazy guy from Austria living in Germany, a Developer Evangelist at GitLab.

A

I do love monitoring and all things, and I host this. Over to you, Sebastian.

B

Hi, I'm Sebastian. I'm the CEO and co-founder of Opstrace, and this is a project that Matt, I, and a group of other folks started a year and a half ago and released a few months ago. This is a super early project and I'm super excited to talk about it. Originally I'm French and German, but I live in California.

B

I'm talking to you from Southern California, but I usually live in San Francisco. So yeah, very excited to be here and talk about all topics: observability, monitoring, and how to help end users with open source software. Matt?

C

Hey guys, I'm Matt, I'm the crazy Kiwi that is living in California. I am Sebastian's co-founder. We got together after spending a lot of time in the infrastructure space, and we decided to

C

finally solve some of the challenges that we see in the monitoring and observability industry itself, particularly with data governance, sending data outside of your network, and also being able to control a lot of your costs and really own your own platform instead of relying on others. So there are a lot of problems that we've been really excited to solve, and we're just getting started. We're excited to share a little bit about what we're doing.

D

Yeah, to go next: do we want to do a full introduction?

A

Just go ahead: just keep.

D

Okay, yeah, because I also like edge technologies. In my day job I'm working as a DevOps engineer, senior development engineer. Mostly I'm doing blockchain and Kubernetes, so I'm doing only x stuff, mostly for work, and that's also the reason why I really try to see new technologies. Michael, do you want to go next?

E

So my name is Michael, I'm from Austria, and I work as a product manager and software engineer at a company called ZKW Group.

E

We build headlamps for the automotive industry, and my day job is building high-performance compute stuff, so ray tracing engines, crd engines and such. I always try to learn new things, and I've always wanted to get into the tracing business, because I have normal Windows clients and not server clients, and getting telemetry data and such things is not as easy on the client side as on the server side.

E

Yes, I'm really excited about what you will show us today, and I'm happy that you're here.

C

Awesome. Thank you, philip.

F

Yeah, I'll go next. Hi, I'm Philip, I'm also from Germany, from Berlin, and to keep it short: I'm a Kubernetes systems engineer and I'm also interested in observability and other stuff. I also heard Sebastian talking on Clubhouse, so I'm really curious to see it now.

B

Awesome, so you listened to... Oh no, I didn't speak in French. It was English, so yeah, cool, awesome.

F

Like lucky for me, it was english and I didn't have to learn the new language but yeah. It was english.

B

Chris, you joined us. Chris is from our team.

G

Yeah, hi guys, sorry for being late. I'm Chris, I live in San Francisco. I worked with Matt and Seb at Mesosphere, and prior to that I managed the Mesos team at Twitter, back in the days when it was pretty nascent and growing. So that was a good chapter in the industry, and I'm looking to write the next one here. Happy to be here, thanks for having us.

B

All right, anyone else wants to do intros, otherwise happy to get going.

A

You can get going.

B

All right awesome, all right, I'm going to share my screen, then all right, just a second there we go all right. Desktop is shared.

B

Let's go over to the browser. So, Opstrace: very happy to be here and talk about what we're doing. Like I said in the intro, Matt and I got together about two years ago and decided to build a company. We're from the infrastructure space, and our goal was to find ways to help people with all the wonderful open source software that exists today. After talking to a lot of companies and a lot of users, and also looking at the things that we liked ourselves, we decided to go and help in the open source observability business.

B

Why? Because there's so much potential in there that is untapped and that is not easily accessible to end users and to companies of various sizes. It requires a lot of experts to do this, and, I mean, we're talking on the GitLab channel.

B

You know a lot of this as well from your perspective, in terms of what you're doing, but it's not an easy thing to use open source software. We didn't immediately have the idea to open source this, but now we're here, and the idea is to say anybody should be able to start using open source observability tooling like Prometheus, Cortex or Loki.

B

These are the technologies we've chosen, and you should be able to use them without having to become an expert in them. Everybody can become an expert, but it requires time and dedication. You need to learn a lot, and so this is the choice companies face: they either have to buy a solution or they have to invest in open source tooling and become experts.

B

So we chose a more narrow path than before. Before, we were in the container orchestration world, where everybody could run whatever they wanted on top of the infrastructure that we were building. In this case, we decided to go a bit further down and choose a piece of the stack that is necessary for that world, but not something that has to be everything for everyone. That might not be the right way to say this, but we'll get there.

B

So in the end, we decided to call this an open source observability distribution. The idea is that we assemble open source a bit like a Linux distribution: we assemble different open source pieces from the open source observability world and we make them usable. We make it secure out of the box, we test the upgrade path, all of these things that are very hard to do over time. I'll be doing a demo about this today. This is an early project. Today we have an installer and we have a lifecycle manager.

B

If you want, we call it a controller. It's a Kubernetes controller that runs on top of a Kubernetes cluster and manages the entire life cycle of this distribution. The goal is for the system to be up, be able to accept metrics, be able to accept logs, eventually also traces and other singular events that infrastructure creates, while staying up in a way that is just as easy as using a SaaS provider. So this is a long-term goal.

B

This is a big goal, but it's an achievable one, because there are no technical problems here that can't be solved. So why a distribution? We wrote a little bit about this yesterday. We introduced ourselves already, so I'm going to skip Matt's introduction. A distribution is important because we wanted to find a lens to explain what we were doing in a way

B

that is relatable to others, that people have seen before. A distribution says you can come and build together a platform that everybody will be able to use. A distribution says: we will be packaging things, we will be installing things for you, we will be helping you with upgrades, and we will put all the features that you need in one place and in an opinionated way. That is what Opstrace is in this case.

B

We chose certain projects very precisely because they have some characteristics that we needed. For example, we're built on top of things like Loki and Cortex, and we didn't choose them for no reason. We chose them because, while they use the principles of Prometheus, they store all their data in S3 or in Google Cloud Storage. That is important because, to be able to monitor and log things at scale, things eventually need to be cost effective.

B

People need to be encouraged to send more and more data to their platform, and people are discouraged from doing that when there's a price on each metric. There's still a price when you write to S3, but the advantage of doing it this way is that you can write much more and keep it for much longer than, for example, with a SaaS provider.

B

That's the idea. Cortex and Loki embody those characteristics because they chose to build a horizontally scalable system that you can throw logs and metrics at and then continue to scale horizontally and vertically without having to worry too much about the complexity of the system. At least that's the theory. In practice, you have to learn a lot about Cortex and a lot about Loki to be able to set them up. That's what we did in Opstrace: we codified it.

B

We created code to install, deploy and manage Cortex and Loki, and to help test it end to end. What I mean by end to end is: on one hand, you have the correctness of the system. Is it still firing alerts? Can I trust it? Is it up? Is the data that I'm sending and reading back correct? That is one thing

B

that we test. On the other hand, there is testing upgrades, testing a path to allow people to move from one version to another, especially on fast-moving software. All of the open source software that we use today is very fast moving and changes all the time.

B

So we need a way to channel this. That's what Linux distributions do, but we don't have this so much in the observability world. So that's a very high-level description. In the end, what Opstrace is, to start with, is a simple command line: with a simple config you deploy an entire system in your cloud account that enables you to ingest logs, metrics, eventually traces, and then set up alerts. We have more plans for this. We have quite an open roadmap.

B

The idea is there are things that we want to build that don't exist today in the open, like, for example, the total cost of ownership detailed in the UI. The system, since it's running in your account, should be able to tell you exactly how much it costs when you increase the number of nodes or when you start sending more metrics to it. We want to help with things like better UIs to do alerts. Alerts are not an easy thing to do, and today, with Prometheus, it requires you to know and write a lot of rules and become an expert in how Prometheus works.

B

This is something that we want to abstract and put into simple visual UIs that people can use, and in the future do more with it, like SLOs and error budgets. I'm going to move over to the quick start now. Since I've talked a lot at a high level, I'd like to show more exactly what we're doing, but I'll pause here if there are any questions or comments.

B

Nope? All right. Okay, so, our demo. The quick start, which actually takes half an hour to start up (unfortunately it's not that quick), is an example that we show on our website to get a cluster up and running. I will run through it today because it allows us to see different parts of the system and how it's being used.

B

I have an already running cluster, since it takes 30 minutes and we're not going to wait 30 minutes, but I'm going to kick off an install just to show you what it takes. So here's my terminal. In this case I'm just following the quick start. It works on Mac and Linux, of course, for the CLI. This is a TypeScript CLI that is basically compiled to a single binary,

B

with all the requirements in one place, so you don't have to have a Node install or any other complicated things on the end user's laptop. The simple CLI allows us to create, destroy, list and upgrade clusters.

B

This is the goal of the CLI. Eventually we'll have more of this in APIs, and also in ways to interact that are not just a CLI, but this is a great way to start and to go experiment with the system. We're going to choose a name, in this case we can say ecc-opstrace, and we have a config that is quite simple.

B

This is the minimum amount of stuff that we require in the config, or at least we could remove a couple of them. We show an example. Opstrace is a multi-tenant system, so we have multiple tenants. In this case, by default, we create a development, a staging and a production tenant. Each tenant is isolated, can query its logs and metrics, and gets its metrics sent separately. By default, we also use Let's Encrypt to get TLS on our interfaces.
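Editor's sketch: the shape of the config described here (a list of isolated tenants plus a TLS certificate issuer) can be modelled in a few lines of Python. The field names and the issuer value are illustrative assumptions, not the actual Opstrace config schema. The reserved `system` tenant reflects the built-in self-monitoring tenant shown later in the demo, and treating the name as reserved is also an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ClusterConfig:
    # Hypothetical field names; the real config keys may differ.
    tenants: List[str] = field(default_factory=lambda: ["dev", "staging", "prod"])
    cert_issuer: str = "letsencrypt-prod"  # assumed value, for illustration only

    def validate(self) -> None:
        """Reject configs that would make tenant isolation ambiguous."""
        if not self.tenants:
            raise ValueError("at least one tenant is required")
        if len(set(self.tenants)) != len(self.tenants):
            raise ValueError("tenant names must be unique")
        if "system" in self.tenants:
            # Assumption: the built-in self-monitoring tenant cannot be redefined.
            raise ValueError("'system' is reserved")

    def all_tenants(self) -> List[str]:
        """User tenants plus the built-in system tenant."""
        return ["system", *self.tenants]


if __name__ == "__main__":
    config = ClusterConfig()
    config.validate()
    print(config.all_tenants())
```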

B

So now we're going to kick off the install. The install here will ask us to review the API calls. This is an interesting thing that we can have a look at for each build. We call a lot of APIs inside of our users' cloud accounts, so every build of Opstrace generates a list of all the API calls that are going to get made, whether it's in AWS or in GCP.

B

In this case, we see a list of all the APIs that we are going to call to build an Opstrace cluster. This is not perfect; eventually we'll be able to show more details about what happens in a cluster, but this is to try to reassure the user about what's going to happen. Once we launch this, the CLI uses the cloud account's credentials to go create a cluster. The first thing: I've been redirected to a login screen.

B

This is because we've chosen the name ecc-opstrace, which will create an ecc-opstrace.opstrace.io domain, so we require people to authenticate to be able to get that domain. Custom domains and custom authentication are something that we want to build, but we wanted to start with the very easy approach of giving people a domain and a way to authenticate with their clusters, because these are the kinds of friction that you see in the open source world.

B

As soon as you need to set up domains and TLS, you need to talk to other people in your organization, and that's not that easy. So this is a way to get started right out of the box. Now what you're seeing is basically a log of the cluster getting created. We start, obviously, with all the AWS resources that are needed to create this. The CLI is entirely re-entrant: if I Ctrl-C this and start it again, it'll just go back to where it left off.

B

You'll see that I'm not going to get queried for DNS, because DNS has been set up, and we continue where we left off. This is true for install and it is true for destroy. There's nothing more frustrating than closing your laptop and then everything's gone. As I said, we will not be waiting half an hour for this one to come up, but we'll come visit it a bit later. In the meantime, I have one that is fully up, called eccdemo. I started it this morning.
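Editor's sketch: the re-entrant behaviour described here (interrupt the install, start it again, and it resumes without repeating finished steps) is the classic idempotent-steps pattern. The Python below is a toy model under stated assumptions, not the Opstrace CLI. The step names are taken loosely from the resources mentioned in the demo, and the `completed` set stands in for checking real cloud state.

```python
from typing import List, Optional, Set

# Illustrative step names, loosely following the resources named in the demo.
STEPS = ["dns", "vpc", "eks", "rds", "buckets", "controller"]


def run_install(completed: Set[str], interrupt_after: Optional[int] = None) -> List[str]:
    """Run all steps not yet completed and return the steps executed in this run.

    `completed` stands in for the actual cloud state; a real installer would
    query the cloud API to find out whether each resource already exists.
    `interrupt_after` simulates pressing Ctrl-C after that many steps.
    """
    executed: List[str] = []
    for step in STEPS:
        if step in completed:
            continue  # already done in an earlier run, skip it
        if interrupt_after is not None and len(executed) >= interrupt_after:
            break  # simulated interruption
        completed.add(step)
        executed.append(step)
    return executed


if __name__ == "__main__":
    state: Set[str] = set()
    print("first run :", run_install(state, interrupt_after=2))
    print("second run:", run_install(state))
```

Running the second call after the interrupted first one picks up at the first unfinished step, which is the behaviour shown in the terminal.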

B

You can see the logs all the way to the end. We're not going to go through them in detail, but suffice to say what happened here is: first, a VPC was deployed. Then in this VPC we deployed an EKS cluster. We also create an RDS database for our config and the buckets that are needed, and then, on top of this EKS cluster, we deploy the Opstrace controller. So we're not deploying a bunch of Helm charts in there.

B

We deploy one piece of code, and that's the repository that you can see over here. All the code is in there; like I said, everything is open. The controller gets deployed in the cluster and then starts deploying everything that is needed to run Opstrace: it deploys Cortex, it deploys Loki, it deploys the ingress controllers and Prometheus, and it also deploys the custom Opstrace UIs and APIs.

B

Once that's done, we can just click and go here to eccdemo.opstrace.io. I have pre-opened the tab... well, I have not, so let's click. There we go. The tab gets us to our very bare-bones UI. Remember, this is an early project; our goal is to showcase what we're doing here. The bare-bones UI doesn't show much. What it shows today is that we have a couple of tenants.

B

I talked to you about dev, staging and prod. System is a default tenant in every Opstrace cluster, because Opstrace clusters monitor themselves. I can add users here, for example matt at opstrace.com, and he will be allowed to log in to the cluster. I should log out to show you the login. The login is quite straightforward.

B

We provide it through Auth0, so you can log in right away with the email of the account that you used to create the cluster. I created the cluster with seb at opstrace.com, so I'm allowed to log in. I added Matt, so Matt is allowed to log in. Again, this is our first step for authentication.

B

The goal is to put things like Dex or others inside of the system to allow full custom authentication with OAuth,

B

I mean OpenID Connect or SAML or things like that. But this is where we start, because it makes it very easy and it works out of the box. Once you're logged in and you see the multiple tenants, we can do one thing: every tenant has a Grafana instance. So let's go to the system tenant.

B

The system tenant has a Grafana instance, and we can see some default dashboards that we've put in. We have dashboards for the Kubernetes system underneath, and we have dashboards for Cortex and Loki. All the default dashboards are in there. This one is a default dashboard for the quick start. So what do we see here? We see at the top that our tenants are not sending any data, but we still have some active series and active logs inside, because the system monitors itself as well.

B

So we can see that we have a bit under 125k active series by default, well, 121k, and then we have an ever-increasing amount of logs getting created by the system. These are going into Cortex and into Loki as well. For the rest of the quick start, I'm going to start sending data to the system.

B

I'm going to do it from my laptop.

B

In this case we're going to simulate what you would do if you had, for example, a Kubernetes cluster or an Amazon instance or whatever you want to use Prometheus and Fluentd to get the logs and metrics off of. I'm going to create a config for Fluentd and Prometheus. This is the config.

B

The config gets metrics from a local container, and the way Prometheus is configured to talk to Opstrace is these few lines. We put in the URL cortex.staging, so we're using the staging tenant and we're sending it to Cortex, and the cluster is ecc-opstrace. Well, in my case, we're going to use the one that I created earlier.

B

So it's eccdemo. Okay, so the cluster is eccdemo.opstrace.io, and this is a normal Prometheus remote write push, for those that are familiar with this, and it is authenticated with a token. The token, I will show you here right now, is generated when we create the cluster. Eventually you'll be able to download them from inside of the cluster.
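Editor's sketch: the few lines of Prometheus configuration described here can be rendered from Python. The host pattern `cortex.<tenant>.<cluster>.opstrace.io` follows the URL spoken in the demo. The `/api/v1/push` path is the standard Cortex remote write endpoint and is assumed to be what the cluster exposes. The token file path is hypothetical. `remote_write`, `url` and `bearer_token_file` are regular Prometheus configuration fields, and since JSON is a subset of YAML, the printed output can be pasted into a `prometheus.yml`.

```python
import json


def remote_write_config(cluster: str, tenant: str, token_file: str) -> dict:
    """Build a Prometheus `remote_write` stanza for one tenant of a cluster."""
    return {
        "remote_write": [
            {
                # Assumed URL layout, inferred from the demo narration.
                "url": f"https://cortex.{tenant}.{cluster}.opstrace.io/api/v1/push",
                # Prometheus reads the bearer token from this file on each request.
                "bearer_token_file": token_file,
            }
        ]
    }


if __name__ == "__main__":
    # "tenant-api-token-staging" is a placeholder file name.
    config = remote_write_config("eccdemo", "staging", "tenant-api-token-staging")
    print(json.dumps(config, indent=2))
```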

B

So, for example, in this case it'll be the staging token. There's not much to see here; it's just to show you that these are standard JWT tokens that are used to authenticate with the system. Why did we choose to put authentication and TLS in there by default? Because in open source it's always left as an exercise to the user, and we also noticed that companies setting up and building on top of open source simply don't set it up, because it's hard and complicated.
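Editor's sketch: a JWT is three base64url segments (header, payload, signature) joined by dots, so its claims can be inspected with the Python standard library alone. The snippet decodes without verifying the signature, which is fine for looking at a token but never for authentication. The demo token and its claims are made up for illustration and say nothing about what Opstrace tenant tokens actually contain.

```python
import base64
import json
from typing import Tuple


def b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring the padding that JWTs strip."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def b64url_encode(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def decode_jwt_unverified(token: str) -> Tuple[dict, dict]:
    """Return (header, claims) WITHOUT checking the signature."""
    header_b64, payload_b64, _signature = token.split(".")
    return json.loads(b64url_decode(header_b64)), json.loads(b64url_decode(payload_b64))


def make_demo_token(claims: dict) -> str:
    """Build an unsigned, made-up token purely to exercise the decoder."""
    header = {"alg": "none", "typ": "JWT"}
    parts = [
        b64url_encode(json.dumps(header).encode()),
        b64url_encode(json.dumps(claims).encode()),
        b64url_encode(b"not-a-real-signature"),
    ]
    return ".".join(parts)


if __name__ == "__main__":
    token = make_demo_token({"sub": "staging"})  # hypothetical claim
    print(decode_jwt_unverified(token))
```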

B

That's not a reason not to do security. So there we go, let's create this config. Fluentd works the same way, if you're familiar with Fluentd: there's a URL to change to send the logs to, and there's the token to be authenticated. So again, here we copy it. Oops, wrong copy.

B

There we go. And then, last but not least, though this is less interesting: a Docker Compose file that runs Fluentd and Prometheus locally, as well as a container called Avalanche. Avalanche is a scale tester for Prometheus, and it allows us to generate some data. So now we're going to bring that up. No, come on.

B

Okay. That is a fail on my side. Sorry about.

B

So, okay, whenever some ski is.

B

Okay thanks. Thankfully linux distributions are not too bad. That's what they help you with um yeah, so now we're creating the containers.

B

There we go still pulling them. Why.

B

Docker logging driver. Why does that work here?

D

There's some more some more uh things already.

B

See I think it's running.

C

All right, let's.

B

Connection refuse to what seconds logging driver.

D

Oh yeah, Docker daemon, probably.

B

Yeah? Okay, why is this not working.

B

Apologies, this is not normal. This has to do with my local docker.

D

Can you check docker info? It's probably because you configured the system logging driver to use upstream or another one.

B

This is definitely something with my with my docker install. This is not okay, okay, okay, let's see what we're gonna do.

B

All right give me just a second.

H

[Name unclear] here from Opstrace. Can you maybe open the Docker Compose config again? Yeah.

B

Let's see this is.

H

What does that Fluentd address...

B

Oh, I got it, I found it. No, no, I found the problem. Okay, this is it. So we're going to move opstrace-quickstart here to opstrace-quickstart-live, and then we're going to move eccdemo to opstrace-quickstart, and we're going to go to opstrace-quickstart.

H

That logging driver topic is about uh collecting the logs from those local containers right on your machine right. We don't even need that at all.

B

Okay, I fixed the first problem. Opstrace demos.

H

Like, we specify this Fluentd logging driver there, and then maybe we can just remove that from the config. Where is it in this Docker Compose config? Oh, that's weird! You see those, you know, localhost two four.

B

Oh, no, okay, so basically these containers are not coming up, so this logging driver is not happy. This logging driver is not happy either, because Fluentd is not coming up. So the question is why Fluentd is not coming up.

B

So Fluentd is not coming up for some reason. So let's see, fluentd.conf. Apologies, I moved directories around and stuff like this, and this is never good. Okay, so Fluentd, there.

F

Don't worry, we love debugging here, okay,.

F

So the Fluentd.

B

Here created up right status up okay, so.

H

So if we can have a peek into the Fluentd container log, it will show us the problem.

B

No, everything's working now. The problem was that I had created a directory called opstrace-quickstart. Then I renamed it to eccdemo.

B

Then I showed you how to create a new cluster in opstrace-quickstart, but when I copy-pasted everything out of the quick start, it still had basically my full path in here.

H

So that was an rtfm.

B

Problem, yes, no problem, so everything's up and actually sending yep. The containers are up so they're sending data to uh to this. So now we're going to actually go to this url here well to see if it's coming in so now we're going into the staging tenant. So not the system tenant. So it's another grafana instance. So if you're familiar with the um what's it called, um let's see.

B

Metrics. Why is the query not going in there? It's weird, the URL looks bad. Query.

B

Oh, no, it's just... Oh no, I have to... sorry, again an RTFM problem. So here we go, here's the query to put in.

B

Sorry, quick, I'm just checking here. This is the data coming in, and then the quick start... sorry, things are moving around here. Here we go, so we're going to copy the whole thing here and put it here. Okay, so this is not ugly data, but you can basically see that these are the metrics from Avalanche. Avalanche generates random Prometheus data with a lot of different labels, and it is currently getting sent from my laptop to the staging tenant inside of the Opstrace cluster.
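Editor's sketch: what a load generator like Avalanche produces can be imitated with a few lines of Python that emit many series with many labels in the Prometheus text exposition format. This is an illustration of the idea only, not Avalanche's implementation, and the metric names, label names and parameters are invented for the example.

```python
import random
from typing import List


def synthetic_series(metric_count: int, series_per_metric: int,
                     label_count: int, seed: int = 0) -> List[str]:
    """Return exposition-format lines such as: demo_metric_0{series_id="0",...} 42"""
    rng = random.Random(seed)  # seeded so repeated runs give the same output
    lines: List[str] = []
    for m in range(metric_count):
        for s in range(series_per_metric):
            labels = [f'series_id="{s}"']
            labels += [f'label_{i}="value_{m}_{s}_{i}"' for i in range(label_count)]
            value = rng.randint(0, 100)
            lines.append(f"demo_metric_{m}{{{','.join(labels)}}} {value}")
    return lines


if __name__ == "__main__":
    for line in synthetic_series(metric_count=2, series_per_metric=2, label_count=3):
        print(line)
```

Each distinct label combination counts as one active series, so the total is `metric_count * series_per_metric`. That product is the knob a scale test turns.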

B

The staging tenant uses the same Cortex but has its own Grafana and its own authentication token, separate from the other tenants. Imagine this is how you would separate staging, prod and other environments, however you want to split them up. The cool thing is, if you're familiar with the Amazon Grafana as a service that they launched, you basically go to Amazon and you say: give me a Grafana, give me another Grafana, give me another Grafana.

B

Well, this is this, but for multiple clouds. You can do this on Amazon, you can do this on GCP, and if people continue to work on this together, you will have this on Azure and more. And when I say people: the use cases for where to run this are very vast, and it requires a big community to get it to run in a lot of different places,

B

just like Kubernetes did, for example. So this is the data coming in, and let's continue in the quick start. We're not going to go see that the prod tenant is empty; that's not that interesting. I think we all understand that if we're only sending data to the staging tenant, it's only there. But what we can go see here that is interesting is, let's see.

B

Sorry about this moving around.

B

This last three hours, let's do last five.

B

Okay, I'm not sending enough. I don't think I'm sending enough for this to be interesting.

B

Okay, back to the rest of the quick start. I showed how to add a user, and I want to move on to another part of the demo. Oh yeah, logs are an interesting part. Obviously this is not just metrics, this is also logs.

B

This is a typical Loki/Cortex setup, so you can go from metrics to logs. These metrics have a label called quickstart. If you go to logs now, Grafana allows you to switch to a logging view and see the logs that correspond to the same label. If people are not familiar with how Loki works: Loki is a kind of Prometheus, but for logs. The way it works

B

is you have labels, just like in Prometheus, and then you have a value, which is the log line instead of a metric. Obviously this is where the comparisons stop. But what's cool about this is that you can pre-index your logs to be able to

B

do nice distributed greps on top of them. So this is not a full Elasticsearch type of setup, where you build a huge, complicated, very useful but expensive index that you can then ask any question. That's cool, but for most infra logs, we believe for eighty percent of the cases, people want a distributed grep. And a distributed grep is basically: I filter by a few labels,

B

just like I do for Prometheus, and then we have a horizontally scalable system that fetches the logs from S3 and greps what you need out of them, and you can tail them. That allows you to do a lot in a very cost-effective way compared to an Elasticsearch index, which requires live machines, a lot of RAM, and a lot of disk space, which in the cloud world are very expensive.
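
As an illustration of the "filter by a few labels, then grep" idea described here, the following is a minimal in-memory sketch. It is not Loki's implementation: the stream data, the function names, and the exact-match label semantics are all assumptions made for the example, and a real system would fetch chunks from object storage and scan them in parallel.

```python
import re

# Toy "chunk store": each stream is a label set plus its raw log lines.
# All labels and log lines here are made up for illustration.
STREAMS = [
    ({"job": "quickstart", "env": "staging"}, ["GET /health 200", "POST /login 500 timeout"]),
    ({"job": "quickstart", "env": "prod"}, ["GET /health 200"]),
    ({"job": "other", "env": "staging"}, ["disk 91% full", "timeout talking to db"]),
]


def select_streams(streams, matchers):
    """Step 1: pick streams using only the small label index (no log content is read)."""
    return [
        (labels, lines)
        for labels, lines in streams
        if all(labels.get(key) == value for key, value in matchers.items())
    ]


def grep(streams, pattern):
    """Step 2: brute-force scan only the selected streams, like a grep."""
    regex = re.compile(pattern)
    return [line for _, lines in streams for line in lines if regex.search(line)]


def query(streams, matchers, pattern):
    """Label selection followed by a line filter."""
    return grep(select_streams(streams, matchers), pattern)


if __name__ == "__main__":
    print(query(STREAMS, {"job": "quickstart", "env": "staging"}, r"timeout"))
```

The point of the design is that only the labels are indexed, so the index stays small, while the cost of scanning is paid only for the streams a query actually selects.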

B

So this is to showcase that both are coming from my laptop: the logs and the metrics.

B

What we can do beyond that is set up things like alerts. Alerts are interesting. We've barely started with alerts inside of Opstrace. What we've decided is: let's make the Alertmanager from Prometheus and Cortex very easy to use. What do we add to it so that an end user can use it without too many complications?

B

So first we start with things that are table stakes, which is putting APIs out there to accept rules and alerts with a simple curl,

B

if you want. For this, we simply and transparently proxy what Cortex gives us. Cortex gives us a way to send alerts.

B

Cortex gives us a way to send rules, and we proxy it with an authenticated token. It is quite simple. I'm here in the docs. First you post a typical Alertmanager config, for those that are familiar with it, and you need to have the token to be able to post it. Then, later down the road, you post actual rules to be able to be alerted.
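
The "post a config with a token" flow can be sketched as below. This only builds the HTTP request and does not send it. The host name is a placeholder, the token is fake, and the path `/api/v1/alerts` follows the upstream Cortex API for setting an Alertmanager configuration; whether the proxy shown in the demo exposes exactly this path is an assumption.

```python
import urllib.request

# Minimal Alertmanager configuration wrapped the way the Cortex API expects it.
ALERTMANAGER_YAML = """\
alertmanager_config: |
  route:
    receiver: default
  receivers:
    - name: default
"""


def build_config_request(base_url, path, token, yaml_body):
    """Build (but do not send) an authenticated POST carrying a YAML document."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=yaml_body.encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",  # tenant API token
            "Content-Type": "application/yaml",
        },
    )


# Hypothetical endpoint and token, for illustration only.
request = build_config_request(
    "https://cortex.example.invalid/",
    "/api/v1/alerts",
    "FAKE-TENANT-TOKEN",
    ALERTMANAGER_YAML,
)

if __name__ == "__main__":
    # Sending would be: urllib.request.urlopen(request)
    print(request.get_method(), request.full_url)
```

Rules would be submitted the same way with a different path and body; in upstream Cortex that is a POST of one rule group to `/api/v1/rules/<namespace>`.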

B

In this case, the example in our documentation uses a system called Dead Man's Snitch, which allows you to know whether the Alertmanager itself is down. So this is an example of how you can simply post simple YAML to this API, and then you can go to the Alertmanager UI. Let's see, that's something we can actually do right now. The Alertmanager UI is available in every tenant. So if you delete all of this,

B

it's not configured, because we haven't sent anything. So let's configure it. We want to configure the Alertmanager for staging. Sorry, I'm going to move Zoom out of the way here, because it's really annoying. All right, so in the quick start, to configure the alerts, like I said, we're going to use a very basic thing.

B

Let's take this url here we're going to have to change the urls, but that's fine.

B

So what are we changing here? First, we have to put staging here. Obviously this is for the staging tenant and.

B

The staging token and then that's it actually.

B

Oops. Nope. Sorry! Sorry, sorry, sorry, no of course this is ecc demo, never mind.

B

So this is ecc demo, because ecc demo is the name of our cluster, and the tenant is derived from the key that is sent to this interface. So now the Alertmanager should be configured and we should be able to go here. There we go, now we have a functioning Alertmanager. For those that are familiar with this, this is the normal Prometheus interface, and this one is served from Cortex. It's empty; we don't have any alerts. So now we can go and simply create a couple.

B

So we're going to take this one.

B

Of course, I'm doing it with curl. We also have a UI to submit them; I'll show it in a bit, but we're still finalizing that UI. So, staging again. No, this is the same one. Sorry.

B

There we go so staging and then here.

B

This is ecc demo. What did it do? These are not commands, obviously. All right, sorry, I must have pasted something wrong. Let's start again.

B

Actually, success. Oh, never mind. We should be better in our API: I saw an error, but I didn't see that there was a success, so we already have one. So let's go see in the UI. There we go.

B

[unclear] Give me a second, just checking. Let's curl it and see if we have it; that's a good way to do this. So, staging, ecc demo. Yep, it exists. Everything is in there. Okay, so back to Alertmanager.

B

Why does it say "no alert groups found"? That's a weird thing.

D

Did we use the correct Alertmanager? I will do.

B

That's what I'm checking. That's one of the things with that.

E

Is the staging token the staging token?

B

I mean, we can cycle through them. That's easy, but this one is not configured, see. Staging is the only one that's configured. Oh, there it is. It just takes a little bit. You have to understand that the way things work here is: we submit it to an API, then we write it to our configuration database, and then things have to be synchronized into Cortex.

B

These configs get written to S3. It's a bit finicky here, but things do end up working when you have patience. So there you go, we have our example. I'm not going to hook this one up to Slack, because it's a bit redundant and boring, but that's the idea: this is what you can do with a normal Alertmanager, and you can go and create these as you want. That's the first step.

B

We also want to allow you to look at these things through a UI. Right now, obviously, this is very limited, very basic. This is just a way to submit the config itself and the templates through an editor in the UI, but the editor validates the config.

B

I was lucky to have a correct config here, but if you, for example, try to publish a broken one, it will validate it. Templates are another thing that's useful for Alertmanager. Templates are how you configure Alertmanager to create the nice messages in the emails or the Slack alerts.

B

Currently we're doing this through this editor, but we also have a PR open to work on a very simple CRUD UI for these kinds of alerts and rules. That's where we're going to start. Then, roadmap-wise, the idea is: okay, that's great, now we've created ways and UIs to set up alerts. What's the next step?

B

The next step is this: if you've ever heard of SLOs and error budgets and tried to set them up with Prometheus, they're awesome and you can do a lot of things with them, but they require you to set up 20 or so rules, and they require you to be quite an expert in Prometheus.
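
To give a sense of why an SLO needs so many rules, here is a small sketch that generates Prometheus-style recording and alerting rules for a single availability objective. It is not Opstrace code. The window and burn-rate values follow the multi-window, multi-burn-rate pattern from Google's SRE Workbook; the metric name `http_requests_total`, the error selector, and the rule names are assumptions chosen for illustration.

```python
# (long window, short window, burn-rate factor, severity)
WINDOWS = [
    ("1h", "5m", 14.4, "page"),
    ("6h", "30m", 6.0, "page"),
    ("1d", "2h", 3.0, "ticket"),
    ("3d", "6h", 1.0, "ticket"),
]


def error_ratio_record(window):
    """Name of the recording rule holding the error ratio over one window."""
    return f"slo:error_ratio:rate{window}"


def generate_slo_rules(total_metric, error_selector, objective):
    """Return a list of rule dicts: one recording rule per window, one alert per window pair."""
    budget = 1.0 - objective  # allowed error ratio, e.g. 0.001 for 99.9%

    # Collect each distinct window once, preserving order.
    windows = []
    for long_w, short_w, _, _ in WINDOWS:
        for window in (long_w, short_w):
            if window not in windows:
                windows.append(window)

    rules = []
    for window in windows:
        rules.append({
            "record": error_ratio_record(window),
            "expr": (
                f"sum(rate({total_metric}{{{error_selector}}}[{window}])) / "
                f"sum(rate({total_metric}[{window}]))"
            ),
        })

    for long_w, short_w, factor, severity in WINDOWS:
        threshold = f"{factor * budget:.6g}"
        rules.append({
            "alert": f"ErrorBudgetBurn_{long_w}",
            # Fire only if both the long and the short window exceed the threshold.
            "expr": (
                f"{error_ratio_record(long_w)} > {threshold} and "
                f"{error_ratio_record(short_w)} > {threshold}"
            ),
            "labels": {"severity": severity},
        })
    return rules


if __name__ == "__main__":
    for rule in generate_slo_rules("http_requests_total", 'code=~"5.."', 0.999):
        print(rule)
```

Even this stripped-down version emits seven recording rules and four alerts for one objective on one metric, which is the kind of boilerplate a higher-level tool could generate from a single target value.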

B

Now that we have a basic UI and basic APIs to do this, we want to build higher-level tools to generate those rules for the user. So that's where the system is going. Opstrace right now is a lot of the table stakes to enable these higher-level user interactions and features. Most people have never heard of an SLO or an error budget, or how to do it right. Why?

B

Because it's scary from all ends: the definitions, what it is, what it means, and also how to implement it technically. We want to remove that. We can't really show anything about this now; these are ideas that we can talk about. And then the last thing that I wanted to show for the demo is quite cool. So, oh, what is this guy doing here?

B

It's monitoring cloud vendor metrics. It's great to have Prometheus, but the power of Prometheus is being able to couple it with a lot of exporters. Exporters need to be deployed; exporters need to be installed. When you have an instance of Opstrace, you want to utilize these exporters to go get metrics out of things that are not in Prometheus, like, for example, GCS metrics on GCP, or S3 metrics, or anything that's in CloudWatch, for load balancers.

B

All these things don't exist in Prometheus and you need to go get them with existing exporters. So, in the tradition of how we've built the system, we've integrated the exporters, but in a way where we created a multi-tenant API where, with your token, you first submit the AWS or Google credentials that you're going to need, since you're watching multiple different cloud accounts. Opstrace runs in its own cloud account.

B

So if it's going to have to go query CloudWatch metrics from other cloud accounts, whether it's different clouds or the same cloud but a different account, it'll need tokens. So we created an API to submit credentials and store them on the Opstrace side, so that they don't have to be in the configs themselves. Then we also allow you to submit these simple YAMLs to create these exporters and start scraping things.

B

For example, this example here will deploy an AWS exporter inside of the dev tenant, and it will go get CPU utilization for instances and all metrics that are in the Buildkite namespace. We use Buildkite as our CI system, so they generate metrics, and this is how we can scrape them out. Everything you can configure inside of this exporter can be submitted to this API to generate multiple instances of these exporters that go get multiple things from different providers. See, we have work to do; we have to add a screenshot here.
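
The exporter configuration described here can be sketched as data. The keys `region`, `metrics`, `aws_namespace`, `aws_metric_name`, `aws_dimensions`, `aws_statistics`, and `period_seconds` are configuration fields of the Prometheus cloudwatch_exporter. The region, the Buildkite metric and dimension names, and the helper functions are assumptions for illustration, and the envelope Opstrace wraps around this document when it is submitted is not shown because the talk does not spell it out.

```python
import json


def cloudwatch_metric(namespace, name, dimensions, statistics, period_seconds=60):
    """Build one entry of the exporter's `metrics` list."""
    return {
        "aws_namespace": namespace,
        "aws_metric_name": name,
        "aws_dimensions": list(dimensions),
        "aws_statistics": list(statistics),
        "period_seconds": period_seconds,
    }


def build_exporter_config(region, metrics):
    """Assemble a cloudwatch_exporter configuration document."""
    metrics = list(metrics)
    if not metrics:
        raise ValueError("at least one metric is required")
    return {"region": region, "metrics": metrics}


config = build_exporter_config(
    "us-west-2",  # assumed region
    [
        cloudwatch_metric("AWS/EC2", "CPUUtilization", ["InstanceId"], ["Average"]),
        # Namespace, metric, and dimension names below are assumed for illustration.
        cloudwatch_metric("Buildkite", "ScheduledJobsCount", ["Queue"], ["Maximum"]),
    ],
)

# JSON is a subset of YAML, so this text can be used where YAML is expected.
document = json.dumps(config, indent=2)

if __name__ == "__main__":
    print(document)
```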

B

GCP works the same way: you say what you want to get out of the Google APIs and how often you want to scrape these things, and this is how you submit it. So this is an example for Stackdriver.

B

Same as above, same as CloudWatch: go get the metrics where they are, go get the logs where they are. This is our system to enable this. Another example of an exporter that we don't even have docs for yet is, let's see, I have it here, I think: the blackbox exporter.

B

For those familiar with the blackbox exporter, it's a way with Prometheus to do live HTTP, ping, or DNS probes: checking if a DNS name still exists, checking if an HTTP server is still up. You configure it with this kind of configuration. You can look it up in the blackbox exporter documentation.
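
What such an active probe measures can be sketched in a few lines. This is not the blackbox exporter itself, only a simplified stand-in: the function name and the "2xx means success" rule are assumptions, while the result keys mirror metric names the blackbox exporter exposes (`probe_success`, `probe_http_status_code`, `probe_duration_seconds`).

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout=2.0):
    """Issue one HTTP GET and report success, status code, and duration."""
    start = time.monotonic()
    status = 0
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx/5xx).
        status = err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, connection refused, timeout, and similar.
        status = 0
    duration = time.monotonic() - start
    return {
        "probe_success": 1 if 200 <= status < 300 else 0,
        "probe_http_status_code": status,
        "probe_duration_seconds": duration,
    }

# Example call (the target URL is whatever endpoint should be watched):
#   result = http_probe("http://localhost:8080/healthz")
```

A scraper would call this on an interval and store the three values as time series, which is what makes an alert on `probe_success == 0` possible.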

B

We allow you to deploy as many instances of this exporter per tenant as you want. So you can not just receive metrics, but actively monitor whether certain things are up. And how we're going to build on top of this: we're going to use something like this blackbox exporter to create, in the future, something like synthetics.

B

Synthetics is a weird name for active monitoring done from around the world; Pingdom, for those that are familiar with it. Well, once you have access to cloud accounts and APIs you can program against, doing a world-ping, Pingdom, or synthetics approach to monitor your uptime from different parts of the world becomes: okay, I need to deploy a container like the blackbox exporter somewhere, and I need to deploy it with something to scrape it, like a Prometheus.

B

These are two containers that are quite cheap to run with Cloud Run or with ECS Fargate, and we can orchestrate that and deploy it in different parts of the world with a cloud account that Opstrace is configured with. I hope that makes sense, but this is to give you an idea of how one can build things in the open just by utilizing what's there and improving on it. So yeah, these are the examples that we have.

B

I'm not going to destroy the cluster yet; I'm going to leave it up in case there are questions. But I'm going to stop here and ask for questions and discussion, because this was a very long demo, I admit. So there we go.

D

I have a question directly, because we can also talk about the exporters. Currently, what would be the case when I want to bring my own exporter? Say I have, for example, another CloudWatch exporter that's already written, and I want to use this instead of the standard exporter.

B

That's a good question. Currently, the way they are integrated, you would have to go edit this Go code to add this capability. It's not that hard; it's done in a way that enables that. That's why we added the blackbox exporter.

B

We first had the exporters for AWS and the other, and we wanted to see what it takes to add another one, and another one, and another one. So yeah, it takes editing code and adding it to the system, but we're happy to come up with an idea. That's the goal: we won't know until people like you come and say, hey, I have a special exporter here, I need to use it, how do I do that?

D

Now that's cool.

A

Right, I have a question. Everything was basically Kubernetes-focused so far. If I don't have Kubernetes, how can I use Opstrace?

B

That's a very good question. Opstrace itself uses Kubernetes, but that's hidden away, so to speak. In our case, yes, we started talking about how to monitor Kubernetes clusters, because that's where the Prometheus world really took off. But we're also working, for example, with one of our customers that uses ECS. When you use ECS, the world is a bit different, but they can still use Prometheus.

B

They can still use Fluentd. So that's one example. Or if you have single instances, which a lot of people have, then you have options: you can keep using Prometheus, but if that's too big, you have things like the Grafana Agent. We also support the Datadog metrics interface, but that's more for migration.

B

I wouldn't start using it to monitor single instances. But yeah, what we decided to do is: let's embrace APIs that are open, like Prometheus, like Loki. So you don't have to be using Kubernetes to send us data. OpenTelemetry is one of the next places to go, to continue to make this bigger. There's the OpenTelemetry Collector that could be deployed and utilized a bit like our exporters

B

here, as an example. Does that answer your question?

A

Yeah, thanks, and I have a new question as well. If you're talking about integrating OpenTelemetry, does this also involve using Elasticsearch data instead of Loki? How far do you want to go with opening up, using APIs, or providing data sources? Because, yeah,

A

I could imagine that at a certain point you get into maintainer hell somehow, because you need to support a lot of interfaces. You have probably made plans around how to make this easier, and I would love to hear your thoughts or your roadmap.

B

Yeah, absolutely. It is true: one of the main purposes when we started was to say, it becomes insane to have to maintain a thousand different projects and different APIs, so which ones are the best for what? You mentioned Elasticsearch. There is a need for things like fully indexed logs; that is a thing that's needed. But you don't need it for 100% of your data.

B

So what we want to do is enable you, eventually, to deploy with Opstrace an Elasticsearch managed by Opstrace, or OpenSearch, whichever one is the winner, and send a percentage of your log traffic or tracing to it, the part that you deem useful. And so yeah, we want to be careful about the APIs that we choose. Thankfully, OpenTelemetry is trying to get things to one kind of API. We'll see where it goes.

B

It's still super early. And then in terms of back ends, we're trying to build not a distribution that has thousands of packages and thousands of choices, but a couple that are well tested. One thing that I couldn't show in this demo is that everything in here is completely tested through and through.

B

We test for correctness and we have CI for everything. That's one of the important things, and you can't do that if you widen the area too much. But if you focus on just a few, the ones that people want to use, you can actually make that quite robust.

A

Yeah, I would totally love to try it out; I need more time. I saw a blog post, I think yesterday, where you discussed that you want to be fully open source, and I was wondering what drove this decision. And what is your way to make money with Opstrace, to be honest?

B

Yes, cool. The decision to go open source is this: if you want a community around these things, just like for Kubernetes or others, you can't have something that's crippled in some way, for example having the security features closed. That's something that happens very often. We decided to go fully open source because we believe that if we build this distribution out, we can do what we do with our current POC customers, which is run it for them inside of their accounts.

B

So we can actually sell a subscription to an open source product and we can then manage it for the customer in their accounts.

B

So what we're building as a company is not just this product, which is the thing that we need to build together with a lot of people, but also some kind of distributed SaaS, if you want. Instead of managing one big SaaS where we run Opstrace and people send us the data, we get people to run Opstrace in their accounts, or we run it for them in their accounts.

B

And then they pay us to keep it up: not just for the maintenance of the distribution and not just for the support, but for the actual act of being on call for that infrastructure. When you use Datadog today, well, you do wake up when Datadog is broken, and it's a big pain in the ass, but you don't wake up to go fix Datadog. The Datadog team fixes Datadog for you.

B

Well, this is the same for us. When we run Opstrace inside of the customer's account, we fix it for them if it breaks. That's why we're building Opstrace in this more restrictive manner: to be able to manage not just one but thousands of instances of Opstrace in a similar way as you would one. If you design for that from scratch, you can actually do it. It's not that different from a big central SaaS, where you have to manage thousands of users in one place.

B

Does that make sense?

A

Yeah, that's great. I was just thinking: if I'm, for example, located in Germany, I probably don't want to have the data in a data center outside Germany. So my follow-up question would be: are you planning to, for example, provide European hosted instances, or something close to my DMZ, or something like that?

B

Yeah, I mean, that's why we built it this way. The region here, look, the region is freely configurable. Oh okay, this is an exporter, but the region is freely configurable everywhere in Opstrace. You run

B

it in the country that you're in. That's the entire point: not just the country you're in, but your cloud accounts, close to your firewall, all of the above. That's the only way to run Opstrace. And, more importantly, if you have serious infrastructure, you probably don't even want to run your monitoring system in the same Amazon region as your other Amazon stuff, regardless of countries. If you have all your stuff in us-east-1,

B

why would you put your monitoring in us-east-1? Today, people put their monitoring in us-east-1 when they build it themselves, because it's so much work to do multi-region stuff that they have no choice. That's what we want to provide. We're saying: you can take Opstrace and run it anywhere in the world, in any region, and it's made to be run on the internet. It's made to be exposed on the internet. You can firewall it.

B

You can do all these things, of course, but it is made to be there. That's why I did the demo from my laptop. When you use Datadog, when you use a SaaS provider, you start sending data and that's it. If you want to add restrictions on top of it, great. So yeah.

B

I added a bit more to the question, but.

A

No worries, I'm trying to push you into the thought, because I know that the requirement of having the data in the cloud, but still next to me, is a strong one, and it could be a strong requirement for decision makers. And just to add to your point about wanting monitoring in a different region: you probably want to have agents in different regions, providing a distributed view of my deployment, of my application.

A

So monitoring and observability tools should probably always be outside of your stack, and you probably also need to build something which monitors the monitor, because if it breaks... exactly. I know that you could use Thanos or Cortex as a distributed Prometheus backend, and also use it for high availability.

A

Is that something I get out of the box with Opstrace?

B

Yeah. I haven't gone enough into details about what's in there, but when we say we run a Cortex, out of the box it is three nodes, and if one dies, another node comes back to replace it: the typical Kubernetes/cloud type of world. And yeah, it's a horizontally scalable Cortex and Loki inside.

B

But then, if you want to monitor Opstrace itself: well, we ourselves use Opstrace to do that, and to monitor our customers' instances, and in our CI and stuff like this. But down the road, depending on your needs and on the size of your organization, you can stack things.

A

Yeah, I think that's great, and I would probably recommend focusing the marketing a little bit on that, because for me, high availability and redundancy are super important. I've been building monitoring systems where you have active-active setups and you do failover and other scenarios, and I know it's really a business requirement to have it. You probably have a blog post or website on that already, but I could see

A

this is also what makes Opstrace different from a standalone monitoring tool which you install on a virtual machine, and it breaks, and then your monitoring is dead, which is bad.

B

Yeah, that's exactly it. How do you monitor at scale? That's the biggest reason we started this. Monitoring at a low scale is one thing, but where people really start running into trouble is at scale, when all these questions that you're asking start coming up.

B

So yeah, that's what got us here. I wanted to add something, but now I forgot.

A

I have too many questions, I'm sorry; if someone else wants to step in, please do. What are the problems you see which are not yet solved with observability or monitoring? So not the immediate roadmap you have for this year, but what is the three-year or five-year vision you have?

B

Okay, so way further down the line. Closer to now, we're helping with onboarding, helping people get started with open source and everything. But really further down the line: first things first, Grafana is awesome, and Grafana is the de facto standard to query Prometheus data, but Grafana is unfortunately also not that easy to use. You saw me click around and mess up a thousand times. Beyond dashboards,

B

I didn't even have a session that I could show you and say, this is the demo. So we do believe that it should be much more collaborative as a UI. It should be much more of a UI where you come in and you say: okay, I have dashboards that are fixed, that's great, but what about the ephemeral debugging session? What about the ephemeral moment where I go in and I want to debug my problem, and maybe I'll come back to it a year later?

B

If you use things like Google Docs or Notion, you just create sessions and you collaborate in place. Instead of creating complicated dashboards, one should be able to just create documents and say: I want to add a little text here, I want to add a few graphs about certain things. These documents are live things in which you can work, debug a problem, and link to other things. So it should be much more, yeah,

B

the UI should be much easier to approach, use, share, and collaborate with. If I want to collaborate with you right now in Grafana, I have to send you a dashboard that I built, which is not easy to do and is just a static dashboard, or I have to send you 10 tabs, 10 links of my session. That's not a good way to work, especially in a remote world where you can't just get somebody behind your screen anymore.

B

So really, a more collaborative and more intuitive version of Grafana would be good. That's the long-term type of stuff that we'd like to build on top of this. And then, long term, there are other things. Today we use things like Cortex and Loki, and we're thinking about what we're going to add for tracing. But long term, the world is going towards: how do I ask any question of my system? The open source

B

databases are not there yet, but eventually we'll get to a place where we will have many more links between our logs, traces, metrics, and also random events, and we'll be able to ask much more fluid questions, if that makes sense. That's not possible today because of the state of open source, which is fine. These things don't exist yet, but they will. The SaaS vendors are there, and when the SaaS vendors are somewhere, eventually the open source world will be there too.

C

Actually, sorry, Michael: making the whole experience really simple in the end is important, because if you use a product like Datadog, it's really easy to get started. They make it really easy to deploy the agent and just start using and interacting with your data, and the open source world of observability just isn't like that.

C

If anybody has tried to set up Prometheus and configure it with all of your relabeling, and then Fluentd, once you get deep into all of that, it becomes really, really difficult, and it's too easy to get it wrong. We think of observability as the whole system, from where the data lives and where you start collecting it, right through to having insightful answers to the questions that you have. So collection is also going to be a big area of focus for us, to make the whole thing

C

very simple in the end.

A

That's really great to hear. I just wanted to add that I've seen that Prometheus added exemplars, which allow linking metrics and traces, in one of the recent releases, and you will probably be integrating that, I think. The other thing for tracing is Grafana Tempo, which looks promising as well.

B

Yeah, this is a cool way. Finally, there's a way to start linking metrics to something else, via traces or something else, and that's fascinating. But it shows again, with things like Tempo and all these things, that the open source world is very fast moving, and keeping up with it is complicated.

B

So that's one of the things we want to do: how do we get people to use things like Tempo or exemplars faster? How do we get them to integrate OpenTelemetry code into their systems sooner? And more people, too: if you look at how many people are in our community, it's not that many.

B

There are thousands, of course, but there could be more; many people just pay and send things to Datadog or others. I have no problem with the SaaS providers, but in the end, paying so much for observability, in the way that you do when you use SaaS providers, creates the wrong incentives, and people stop measuring things. We want to get away from that and show them: okay, with open source

B

you can do it differently, and the world has done it with Kubernetes. This is the kind of thing where it's: let's not rebuild the same things over and over again, let's do something together. Okay.

A

Yeah, it's really great. I have a last question. I'm thinking about integrating metric reports and graphs into merge requests or pull requests. Say you have a staging environment somewhere, it's instrumented with Opstrace, for example, and I want to deploy my application, measure something, and then present it in a metric view. This is where I personally want to go, and make this happen somehow.

A

So the real question is: how would I start contributing to Opstrace? Where would you say someone who has just tried it and liked it should start? I really appreciate the docker-compose one, which makes it flexible to generate data. But how would I approach it and say: hey, I want to use this API, or I want to add something on top of it, like integrating an iframe or something?

B

So, join us in Slack if you want to talk live, or if you don't want to talk live, just start writing on GitHub. We're super open to any issue or anything. The code that we have is all there; it can be a bit daunting. But that's it: either talk to us on Slack or on GitHub.

B

We're ready to onboard people with their ideas. We're quite open to people coming with their own ideas, even if this thing seems quite opinionated in how it's built. So yeah.

A

I think you need to have some sort of opinion, because otherwise it will end up in chaos somehow. Exactly. Do you have any sort of development documentation, or a CONTRIBUTING.md, or something?

B

Absolutely. So here, opening it up, yeah: on opstrace.com, under Community. You can go there, and that links to a bunch of things. You can also go into the docs, and in the docs we have, let's see, the contributor guide. There we go: development environment. This is how to set up your development environment. It's up to date, it works. This is how to write tests for it.

B

We explain our workflows, we explain how we write docs. Even these docs are open and everybody can contribute to them, right? So that's where it starts.

B

If I wanted to build it myself, right, you can check out the repo, and there's a Makefile that is quite quick to follow to build it yourself. Where it starts getting complicated, I have to admit, is that we require you to have an AWS or a GCP account. What we're going to be working on is this: yes, this is a fully integrated system, but we are going to allow people to run the controller on their own Kubernetes.

B

Even if people can hurt themselves very badly with it, that's okay, because it will lower the hurdles to doing development, right? Just to get started running it locally. We haven't invested time in that, but we will. That's it.

A

I think that's great. You need to start at one point and you can always iterate and say: hey, there's a different environment for development. We added Gitpod to our contribution workflows, so you spin up something somewhere and it's super fast, in contrast to installing Node for 30 minutes and then not knowing where to start. So I really appreciate that you have documentation and an entry point for everyone to get things going.

A

I was thinking.

D

A question out of curiosity before we break up: when you showed the CLI, you said that the CLI will do all the work of creating all the AWS resources, right? Yes. So did you implement it with AWS CDK, or do you use another tool under the hood?

B

So we use the Node libraries, so basically the SDKs, not the CDK itself, but the SDKs for Node, for Google and for AWS. We then call the functions, and it's not.

D

Okay and but it's.

B

A bit more than that. Matt is the one who actually did it in the beginning. There's a whole system there. The reason we did this in TypeScript in the beginning is because of many reasons, but we wanted to have one code base that has the same code everywhere, right? Whether that's a long-term goal or solution is up for debate. We already have quite a bit of Go code for things like our APIs, right? Whenever code goes through

B

It live, like metrics going through something live, the APIs, it's all in Go. So yeah, we're open to quite a few things. But the idea was that we needed one code base not just to deploy the system, but also to run the system on top of Kubernetes, and also to be able to manage the cloud provider itself, right? What if I want to create a UI to add a node to the system, or to autoscale? This needs to happen from within the cluster.

B

So that's why this was started this way. You want to add something, Matt?

C

Yeah, we iterated and tried a bunch of different approaches here in the beginning. The natural one in the beginning was to use Terraform, stand everything up, and then deploy things on top of your environment from there. Then we found that Terraform didn't have the flexibility we needed to have control over the infrastructure and the containers that were also running on top of the infrastructure at run time.

C

So, to do what Seb said, we want the flexibility to be able to control the underlying infrastructure from an API or from a user interface, to be able to horizontally autoscale, for example. As more metrics are coming in, you might want to create a policy where you can automatically add new machines to the cluster, so that somebody doesn't have to come along and say: oh, the cluster is about to run out of memory,

C

I need to do something about it. So we moved from Terraform to Pulumi, because we thought: okay, we can have everything in a common language. What we found was that Pulumi was still too restrictive for us, even then. So that's when we decided to narrowly focus on the APIs that we needed and just build our own SDK for those, so that we had full flexibility over all of the infrastructure.
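
The kind of autoscaling policy described here can be sketched roughly as follows. This is a minimal illustration, not Opstrace code: the `ScalingPolicy` fields, the 80% threshold and the `desired_node_count` function are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    # Hypothetical policy fields, chosen for illustration only.
    memory_high_watermark: float = 0.80  # add a node above 80% memory use
    max_nodes: int = 10                  # never grow beyond this size


def desired_node_count(current_nodes: int,
                       memory_used_fraction: float,
                       policy: ScalingPolicy) -> int:
    """Return how many nodes the cluster should have under the policy."""
    if memory_used_fraction > policy.memory_high_watermark:
        # Grow by one machine, but respect the configured ceiling.
        return min(current_nodes + 1, policy.max_nodes)
    return current_nodes
```

A controller running inside the cluster could evaluate such a function on every metrics tick and then ask the cloud provider for one more machine, instead of waiting for someone to notice that memory is running out.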

D

Yeah, that's great. That was also my intention in asking why you didn't use Terraform for that, because in most businesses, and also in my work, what I write is mostly Terraform code. Mostly I write some code on the platform, so in Kubernetes, but it isn't tied directly to AWS or other cloud providers.

B

Our philosophy... oh sorry, go ahead.

D

And I really like your idea and vision in terms of observability, because it really reminds me of something. I also read a bit about Sebastian's background, because he worked at Docker, and Docker was in the same situation: containers also existed before Docker, but Docker took just the right step to enable a lot more people to use these technologies, and I think this is the key to getting adoption.

D

So observability, like you said, needs to be usable, and not so that I need to invest three days only to get the whole setup running and then don't know what to do. I want to get in simply and add simple stuff, and the technology should do it, and then I can build up my business.

C

Yeah absolutely absolutely.

B

Yeah, you're just making the case for us. Thank you very much. Now, one thing that I want to say about why it's all built in one language together.

B

I think it's about testing, right? When we started with this, of course, you, the user, spin up the cluster once and then it's supposed to stay up and that's it. So it seems like: why do you invest so much in install? Because we actually, on our side, want to install it again and again and again, so that we know the percentage rate of failure, so that we can actually test the whole thing end to end as much as we want, right? So that we can spin up a version from any branch anywhere and start building upgrade tests for it.

B

Right, we need to be able to fully automate the testing of a solution like this, and it's doable because, thankfully, it's observability: there's a signal that goes in, there are a lot of different shapes to that signal, and we will eventually have tests around these things. So it's a super fun thing to do.
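
The idea of installing again and again to learn the failure rate can be sketched like this. It is only an illustration under stated assumptions: `install_failure_rate` and `make_fake_install` are hypothetical names, and the fake installer stands in for a real end-to-end install, which the transcript does not describe in detail.

```python
from typing import Callable


def install_failure_rate(install: Callable[[], bool], attempts: int) -> float:
    """Run `install` repeatedly and return the fraction of failed attempts."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    failures = sum(1 for _ in range(attempts) if not install())
    return failures / attempts


def make_fake_install(fail_every: int) -> Callable[[], bool]:
    """Build a stand-in installer that fails on every `fail_every`-th call."""
    state = {"calls": 0}

    def install() -> bool:
        state["calls"] += 1
        return state["calls"] % fail_every != 0

    return install


# Eight simulated installs, every fourth one failing.
rate = install_failure_rate(make_fake_install(4), 8)
```

In a real pipeline the callable would create a cluster from a given branch, send a known signal in, check that it comes out, and tear the cluster down again.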

D

Yeah, and an additional idea came to my mind because you're talking about testing. I think two or ten episodes ago, instead of Fluentd we used Vector, you probably know the log shipper, and they also have a test framework built in where you can test your log shipper directly. So it's also really cool to use that.

B

Yes, that's cool. We looked at Vector; we haven't played with it, just because you can't play with everything, but like.

F

Yeah, it has a.

B

Ideas, and it's pretty nice.

C

Vector is very cool.

D

Make a pull request on how to integrate it into Opstrace.

B

Yeah, go ahead, feel free, if you find an idea.

B

Anyone else, any questions? It's been, wow, an hour and 25 minutes. Okay, I didn't see the time pass.

A

You probably missed some meetings. No, no.

B

No, all good, we blocked the time. You said this would take more time, so.

E

Maybe a last question: what was the motivation to build Opstrace? Was there an event or something like that, or an idea which grew over time?

B

Yeah, I mean it grew over time into what it is now, but the first part was, oh, observability. We interviewed a lot of companies of different sizes and asked them the same questions, right? How do you manage your infrastructure? What do you do? Infra questions. In these questions,

B

Obviously, when you talk about infra, there is monitoring, logging, observability and all this, right? After we had all that data, Matt looked at it for a while and reordered it, and it became quite clear that there was still a problem with observability. And honestly,

B

It was kind of a surprise, because in our heads we were like: yeah, we love Prometheus and all these things, so it's fine, right? And on the other side there are people who can use Datadog and other things. But it turns out, when we talked to these companies, there's still so much to build. And then the idea kind of iterated, right? In the very beginning, we wanted to build what I said earlier in our conversation: a SaaS that runs in the company's network, right?

B

So in the beginning we said: we're going to build Datadog, and Datadog is going to run in your account. Then we started iterating and seeing that we use all these open source projects, right? We're standing on the shoulders of giants. If it's closed, nah, it doesn't work. So we continued iterating, and we were quite inspired by GitLab, to be very honest, by how GitLab does things in the open and talks about what they do and everything. We were like:

B

Okay, GitLab and others, and I spent a bit of time at Red Hat, and with all these things we were like: okay, let's just make an open source project. It's different, it's harder! It requires a community, right? You can't just do it by yourself, but this is what we're doing now.

B

We're saying: let's build a community to do this, because we're not the only ones with this problem. And the companies that we're going after right now, sure, we're doing POCs where we manage it for some, but most companies that we want to talk to are companies that said: okay, we have already decided not to buy something, right, Datadog or something; we want to build something with open source. And then we tell them: cool.

B

Let's build this together. Don't re-implement everything. You don't want to re-implement auth, TLS, all these things again and again and again; that's not the core business of companies. And so that's exactly how we got to what we're doing today.

A

So I have a really last question, or recommendation. I shared it on Twitter already: I could talk all day about monitoring and observability. In the enterprise environment, oftentimes you have some classic service monitoring, and then you do state-based SLA reporting and generate nice PDFs with a pie chart and some other fancy things.

A

I know that you can do it in a certain way with Grafana, but I think, to enter the enterprise, or to reach the ops or admin personas who want to present reporting to their managers by just sending them PDFs, it would be a great asset to have this provisioned, to have this available as a template that you can customize in an easy way.

A

Maybe in the future with a UI, making people say: hey, I want to use Opstrace just because the reporting is awesome and it's so easy.

A

This is, I think, one of the things I heard so often in my past when maintaining Icinga, and it was really hard to fulfill as an open source project, because normally people want to pay for that, or do pay for that, because the PDF export is an enterprise feature, like in Kibana. And I could think of, if you want to find the, yeah, the special case.

A

SLA reporting in the boring way could also be an entry point, similar to incident management, which brings me to another idea: integrating GitLab incident management with Opstrace. I will hopefully find someone who does it, or I will do it myself.

C

This is exactly the premise of what we stand for, too: trying to talk to people like yourself who think of all these different things that they might want to get done with such a platform. That's what excites us each day: to think about all the various possibilities and to work with people to try and make this stuff happen. So I say: please keep the ideas rolling, come chat with us in Slack, and let's see what we can get done.

B

Yeah, for sure. We're in it for the long haul. We know that this is a big project, right? But I don't want to join another company one day and have to choose monitoring again. You know, it's just like: next time, I want to use Kubernetes, or whatever the state of the art of that moment is; I'm not saying that Kubernetes will be that forever. But that's it.

A

I think, from a developer's perspective, or with my backend engineer hat on, it would be amazing to have some sort of APM or tracing or instrumentation out of the box, because I really struggle when looking at an SDK for telemetry and trying to understand how to add instrumentation to my app. I don't know where it fits.

A

Does it affect the performance? All those things are at the top of my head, which I want to help solve, but I currently don't know how. And if you go in that direction, or plan to, I think lots of people will love Opstrace because it just works, and you don't need to figure out how to configure something. As a developer, I don't care if it's Kubernetes or whatever. I just want my app being deployed.

A

I want to see the metrics, and I probably want to be alerted or see some reports. But how this works in the background, yeah, maybe for debugging. In the end, I want to be fast and focus on the fun stuff, which is code, which is seeing my application being run in production.

C

Right, and that's the point: you want to build and focus on your own

C

Problems that you're trying to solve, and monitoring by now shouldn't have to be one of those things that you also have to focus on. So we think of that as our job, and we want to make sure that other developers like yourself don't have to think about it too deeply. It just works.

A

Yep, and security teams don't have to worry, because everything is TLS secured, and you could probably also harden security and say this cipher is not allowed because company policies forbid it, and things like that, because security is super complicated. Yep. Let's.

D

Also, logging is really complicated. For example, if you're sitting in Germany, you have the GDPR, and you need to check everything that your logs are currently shipping. For most teams that I currently work with, I try to work with the developers and say: don't put this information in the logs, and we won't have problems. Otherwise you need to do it on the other end, so then you need to put it into Fluentd or have some regular expressions to check that the stuff is not there, and so on.

D

It could also be a really cool use case to see how to do this at different compliance levels, because some of the enterprises are struggling with it as well.

D

Is a use case? Sorry.

B

To interrupt you, Matt, but in the past we've worked with enterprise companies who had these problems. We ourselves have to be GDPR compliant too, even if we're in the US, right? We know how it works: you just build an architecture where the logs don't exit the company's domain, and if they don't exit the company's domain, you don't have a problem, right?

B

You don't even have to prove that you're deleting them, because by law they're there and they're fine, right? Matt, you wanted to add something?

C

Yeah, I wanted to say that there's also the beauty of having an open source project which is extendable, particularly when you have an architecture like ours, where we have the flexibility to deploy new pieces or new containers into the logging pipeline. For example, there's no reason that we couldn't create a way to inject

C

A scrubber or something like this into your pipeline, so that if you do have really sensitive data that developers are accidentally logging, you can still take control of scrubbing that data out on the Opstrace platform end of the system. Things like this are an example of how the security team can take more control over the matter without having to bang on developers too much and tell them: hey, stop doing what you're doing. They can say, actually:

C

Oh, I'm just going to make sure that data that shouldn't be stored doesn't get stored in the database.
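
A scrubber step of the kind mentioned here could look roughly like the sketch below. It is not part of Opstrace; the two patterns, the placeholder strings and the `scrub` function are assumptions for illustration, and a real deployment would define its own rules for what counts as sensitive.

```python
import re

# Hypothetical patterns for illustration; real rules depend on the data.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def scrub(line: str) -> str:
    """Replace e-mail addresses and IPv4 addresses in a log line."""
    line = EMAIL.sub("[REDACTED_EMAIL]", line)
    line = IPV4.sub("[REDACTED_IP]", line)
    return line


cleaned = scrub("login failed for jane.doe@example.com from 10.0.0.7")
```

Running such a function on every log line before it reaches storage lets the platform side enforce the rule, even when an application logs something it should not.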

B

All right. I'm happy to answer more questions, I think Matt as well, but it's up to you guys.

A

I think everyone is installing it now.

D

I'm not doing it right now.

D

I've prepared everything, but mostly not yet. I need to do some maintenance work, and afterwards I can do it, or while waiting on pipelines.

C

Please reach out to us. We'd love to hear about your experience with it, and if you have any questions along the way, we'd love to help you in Slack too.

D

Yeah, I will do. Awesome.

B

And yeah, don't hesitate to say: this is crap, this is crap, this is crap. No worries, right? This is what this is about. We are building something where we know that the final users, most of them, are skeptics. Most of them are like: oh, not something like this again, right?

A

Yeah, thanks for joining today. I would totally love to do a wrap-up or a review from our side in half a year and see how things are going, with you joining again and sharing the latest demos, the latest things. It could also break; we love debugging together. And thanks for taking the time today. If there are no further questions, I would just wrap it up and say

A

Thanks for attending. I will write a blog post, or just publish the blog post later on, and the recording is on YouTube. We're all on Twitter, and we can also link LinkedIn and stuff. But with that, I would say we're saying bye on YouTube, but we can still.