From YouTube: 2021-05-11 CNCF TAG Observability Meeting
B: Happy post-KubeCon, everybody!
B: Okay, okay! Well, let's start in, let's say, a minute. So if you haven't signed in yet, please do.
B: All right — while we get started: this is a CNCF meeting, so the code of conduct applies and, as usual, we are kind to our colleagues. Today we have a couple of guests who are going to do some quick presentations; they're listed in the agenda. We've got the author of promdump, Ivan Sim, and we also have a very cool talk by Pixie.
B: So Zain is going to walk us through Pixie Labs and what they've been up to. I also wanted to extend a welcome to everyone who may be here for the first time. If anyone is here for the first time, or you've never introduced yourself before, do you want to take a quick second and just say hi?
A: Small world — hey, I can introduce myself. I've been lurking on the channel a little bit and I've attended a few other meetings. My name is Dave Cruzanti; I work at Comcast. We do a lot of stuff with observability, most of it not very homogeneous, so I'm trying to get some things going there, and I've just been contributing. Happy to be here.
D: Hey everybody, my name is Ivan. I work for Red Hat — first time here. I got an email from Matt, who invited me to come and share a personal side project that I've been working on.
A: I work in the open source program office, as well as on some other things, and observability is definitely high on our list of topics to do more work on. So it's been great to see what you all have been producing.
A: Great to have you — thanks! Welcome!

Hey, I'm Matthias! I haven't been here for a while — I think I joined the first couple of meetings and then transitioned to another company. I previously worked at Red Hat; now I work at Polar Signals. So, hi again, everybody!

Hello everybody, my name is Anaïs. I'm mainly here today because of the presentation — I'm really curious about Pixie. I just joined the SRE team at Civo, and before that I worked at Codefresh.
B: I think that's everyone — cool. Thank you for saying hi, everybody; I don't think we missed anybody. So I'll just take a very quick minute at the top: last week we talked about launching some initiatives for 2021.
B: I'll follow up on this with a blog post a little later today, but very quickly — I won't go through them all in detail — we have about five or six work streams, depending on how you count, and in the last week there have been a couple of additional suggestions. So if there are no objections, I think we'll plan to just use GitHub Projects and make a kanban board for the definition of the working groups.
B: So, if you're interested, the things we talked about last week — again, I won't elaborate — were: generating a big list of vendors and the projects they contribute to; a plan for intentionally engaging with other CNCF projects over the year, with a prioritization and a common interview template, if you will; a plan to foster in-person meetups once the world gets vaccinated; generating personas; and curating case studies — a compendium of case studies and other existing content that we want to surface from CNCF members and partners.
B: So those were the five that we talked about last week. If folks are interested, we can collaborate in GitHub and Slack, and then maybe two weeks from now we can present — once we've sorted it out online together — how we want to self-organize and self-manage. A number of people have reached out to me personally and said, "Hey, I'm really interested in helping drive one of those things," which is awesome to see. So, with that —
D: Yeah, thanks everybody — thanks for having me. Today I'd like to share a side project that I recently built and published. It's basically a kubectl plugin that allows SREs to dump and restore Prometheus persistent blocks. And a big thank-you to Bartek, who provided some feedback and took time out of his busy weekend to chat.
D: So the gist of it is: late last year I was talking to one of the Red Hat software engineers, and he told me there has been an increasing number of issues and bugs on our OpenShift customers' clusters where it would be super helpful if we could get a dump of their Prometheus data.
D: Even today, we usually tell the customer representatives: "Hey, capture the Prometheus TSDB data — here are about 15 lines of bash code; try to run it in their environment." And what that bash script does, really, is just grab everything.
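A minimal sketch of that kind of script — pod name, namespace, and paths are invented for illustration:

```sh
# Tar up the entire Prometheus data directory inside the pod...
kubectl -n monitoring exec prometheus-0 -c prometheus -- \
  tar -czf /tmp/promdata.tar.gz -C /prometheus .
# ...then copy the archive out to the local machine.
kubectl -n monitoring -c prometheus cp prometheus-0:/tmp/promdata.tar.gz ./promdata.tar.gz
```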
D: Now, it becomes more challenging if it is a production Prometheus instance with maybe 100 gigs of data — or 200, or 500 gigs. As we compress and tar up this file and try to restore it — say it's a 500-gig data directory — it's almost impossible for me to run it on my laptop.
D: You know, to try to reproduce the issue and see what's going on, I have to spin up clusters, port the file over, et cetera, et cetera. So I've been thinking about it: really, when we diagnose bugs, all we care about is a very specific chunk of data that falls within a very specific time window.
D: So I started researching into it and, as you folks might know, Prometheus currently offers two ways to dump data: one is through promtool, which has a tsdb subcommand, and then there's also a snapshot HTTP endpoint.
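For reference, those two mechanisms look roughly like this — a sketch; the snapshot endpoint only works when Prometheus is started with --web.enable-admin-api:

```sh
# Dump samples from a local TSDB data directory with promtool.
promtool tsdb dump /prometheus-data
# Ask a running server to snapshot its TSDB (admin API must be enabled).
curl -XPOST http://localhost:9090/api/v1/admin/tsdb/snapshot
```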
D: Both of them don't really do what we want because, first of all, they dump everything — again, 200 gigs of data and metrics that we don't need. And secondly, with promtool there are GitHub issues, I believe, that have been raised about how the output is somewhat limited, such that you can't really just restore it onto another Prometheus instance.
D: So that led me to do some research in the evenings and look through code, to find out: is there a way to do this? And it turns out that Prometheus has this tsdb package that offers a very nice API.
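What makes a time-window dump feasible is the on-disk layout: each persistent block is a ULID-named directory whose meta.json records the block's time range, so a tool can pick just the blocks that overlap the requested window. A sketch — the ULIDs and timestamps below are invented:

```sh
$ ls /prometheus
01F56GRY9D2REDB9WCTS3B4Y2G  01F5DYCQ1A4RF0DE9WCTS3B4Y2G  chunks_head  wal
$ cat /prometheus/01F56GRY9D2REDB9WCTS3B4Y2G/meta.json
{"ulid":"01F56GRY9D2REDB9WCTS3B4Y2G","minTime":1620691200000,"maxTime":1620698400000, ...}
```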
D: It does exactly what we wanted: capture data blocks, but allow us to filter them by a time range. So that's kind of it, and I want to go into a very quick demo — there are really only three commands that I need to run. Right now I have two kind clusters running; one of them is listening on port 9090, and there's this demo http_requests_total metric.
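The demo setup is roughly the following — a sketch; the contexts, namespace, and service names are invented:

```sh
# Two local kind clusters, each running Prometheus, forwarded to local ports.
kubectl --context kind-demo-1 -n monitoring port-forward svc/prometheus 9090:9090 &
kubectl --context kind-demo-2 -n monitoring port-forward svc/prometheus 19090:9090 &
```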
D: It's just a dummy metric that I created for demonstration purposes, and as you can see on this cluster there's some weirdness going on: the summation rate dropped all of a sudden — some weird stuff happened here. Pretend this is a 200-gig data directory; I don't want to dump and restore the entire Prometheus instance.
D: I only care about this time frame here, to help me reproduce and diagnose the problem. Now, I have a second instance of Prometheus listening at port 1991.
D: Okay, cool. So I have one kind cluster running here, and the first thing I'm going to do — so, promdump is just a kubectl plugin, and it provides maybe two subcommands and one main command. Actually, let me just make sure I'm looking at the right clusters.
D: Okay, so I want to run a very quick meta first, just to make sure I'm pointing at the right cluster.
D: What that essentially does is: the plugin — the CLI — sends a POST request to the exec subresource of the Prometheus container, attaches to the container's stdin and stdout, and streams a second binary — a second application, essentially — into the container via stdin. Then it tells the container: "Here's a stream of data; when you get it, untar it, run this command, and then stream the output back to me via stdout."
D: So what we see here is the meta subcommand output: it gives me a sense of how much data I'm dealing with — the head block, the persistent blocks, details about Prometheus storage. TL;DR: there's the head block in memory, memory-mapped files, and immutable persistent blocks.
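For reference, that first command looks roughly like this — a sketch; the flag spellings are approximate, not authoritative:

```sh
kubectl promdump meta -p prometheus-0 -n monitoring -c prometheus
```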
D: Persistent blocks are more like quote-unquote long-term storage, where the data becomes immutable and gets compressed and basically just sits on disk. As you can see, my head block in memory has about 25K series, and the data size in persistent blocks is maybe 37 megs or so.
D: So that gives me a sense of: okay, how much data am I dealing with? Now I'll dump the data that I'm interested in. If I switch back to my Prometheus console — maybe I'm only interested in the data between, say, around four o'clock and two o'clock there — so I'll tell it to go and look for that data.
D: What date is this? This is the 11th, and the times are all in UTC. So I set the start time equal to 2021, the 11th, and then — what was it?
D: Okay, let's say one o'clock in the afternoon. And then I'm just going to redirect the output to a .tgz file on my local filesystem.
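So the dump invocation looks roughly like this — a sketch; the timestamps are the demo's, the flag spellings approximate:

```sh
kubectl promdump -p prometheus-0 -n monitoring \
  --start-time "2021-05-11 04:00:00" --end-time "2021-05-11 13:00:00" \
  > dump.tar.gz
```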
D: Okay, so now I should have a local file here — 23 megs of data. Again, not too bad; pretend it's 200 gigs in total. Let's do a tar -tzf and see what we got. Essentially, it captured the persistent block that falls within the period of time I'm interested in, and at the same time it also captured everything in the chunks_head and the WAL files — I talked with Bartek briefly about this.
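The inspection step, roughly — a sketch; the archive entries are invented but follow the TSDB layout:

```sh
tar -tzf dump.tar.gz
# 01F56GRY9D2REDB9WCTS3B4Y2G/meta.json
# 01F56GRY9D2REDB9WCTS3B4Y2G/index
# 01F56GRY9D2REDB9WCTS3B4Y2G/chunks/000001
# chunks_head/000001
# wal/00000002
```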
D: I think the head block does allow you to query by time range as well, but it's not as easy to split. So I just took the easy way out and said: hey, give me everything in the head block and everything in the WAL files, because those are pretty much confined to about two hours of data anyway — relatively small compared to all the other persistent blocks in the container.
D: So I've got my data. Now I'm going to switch over to my other cluster — I hope it's still running — okay, there it is. I run the meta command just to see what's already in the container. This is a new Prometheus instance that I just installed via Helm, and you can see from the last column here that it's less than half an hour old, so there are no persistent data blocks yet — it does have some head-block data, though.
D: So what we're going to do is run the restore subcommand, with -t telling it to pick up this dump file — and I think that should be all I need. All right, that was really quick — relatively quick, because it's just 27 megs of data, so not a lot there. Let's run the meta command again — oops, did it pick it up? Okay, yeah.
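The restore step, roughly — a sketch; flag spellings approximate:

```sh
kubectl promdump restore -p prometheus-0 -n monitoring -t dump.tar.gz
kubectl promdump meta -p prometheus-0 -n monitoring   # verify the blocks arrived
```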
D: So now you can see I have persistent blocks here. Now, for those of us who are more observant, you might notice: hey, how come there are two blocks here? When promdump talks to the tsdb package, there's really no way to say: "Hey, break up your persistent blocks — give me a quarter of one, or half of one."
D: Having slightly more data is hopefully better than having missing data, right? So I've restored it, and now I'm going to have to restart my pod. The Prometheus server has a PVC underneath it, so killing the pod will not destroy the restored data. The reason I need to restart it is just to give Prometheus a chance to replay all the WAL files and pick up the data that has been restored. So now I'm going to watch it, and if I've done everything correctly, this should not crash.
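That restart amounts to something like the following — a sketch; the pod name is invented, and the PVC preserves the restored data across the restart:

```sh
kubectl -n monitoring delete pod prometheus-0   # Prometheus replays the WAL on startup
kubectl -n monitoring get pods -w               # watch that it comes back without crashing
```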
D: When trying this on OpenShift, sometimes after the data is restored it will crash, complaining about the head blocks not being in order — but that seems to happen only on OpenShift. On plain vanilla Kubernetes — kind, minikube, that kind of thing — I haven't seen any data corruption at all so far in all my testing with data restoration. Okay, so this restored. Now I need to redo my port-forward to the second instance of Prometheus, because, as we all know, port-forwarding to a service is really just port-forwarding to one of the pods.
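Re-establishing it amounts to — a sketch; the local port is illustrative:

```sh
kubectl -n monitoring port-forward svc/prometheus 19090:9090
```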
D: Yeah — so if we compare the chart here, it's almost identical, but not quite; there's some missing data. So what we're seeing here, essentially — I guess there's not enough data here; I wish I had more. Actually, no, let's try this.
D: If I look at that — okay, in this case it may have restored everything that we need. I wish I had enough data to show it more clearly.
D: Ideally, only this chunk would show up, and that chunk wouldn't get restored, because it falls under a different block. So it's kind of like what I was saying: I tell the tsdb package, "Hey, I need data from this time range," and it just grabs everything — the entire block, or blocks in this case — and restores it. Here, this slice is the head chunk — the part that got replayed as the Prometheus pod restarted — and this is the data that was sent over in the restore, landing in those one or two persistent data blocks, because we asked for a time range that spanned two blocks.
D: That's one way to do it, anyway. So that's it, in general. I've had a chance to demo it to our teams, and the support team found it helpful, so I hope it will be something that the community will find helpful too.
C: I ran into a similar issue with an etcd backup tool that I wrote a couple of years ago, and it was not fun, because you can't reproduce stuff if you don't have the data. And you can't blame anyone, because obviously you can't just give someone external a dump of your production system — absolutely not. But that is an issue, and maybe you are smarter than me and found some way around it; I didn't.
D: Yeah — I actually opened an issue on the Prometheus GitHub repo. I was hoping that if enough people are interested in it, I can move the entire source code to the prometheus-community repo, so others can help me maintain it and I don't have to spend all my evenings trying to figure out bugs and stuff like that.
D: But I always remind people that this is not a backup-and-restore tool; it's just an SRE tool with a very specific SRE use case in mind. Prometheus backup and restore is an entire thing — an entire product — and I haven't quit my day job to do that. But yeah, I hope that helps scope it.
E: We could play with some tool that would essentially, for example, remote-read the data from Prometheus using its APIs — selecting the actual series, by series and by time — and essentially build the block ourselves, because this is what our backfilling tools in Prometheus already do, from recording rules, from CSV, and from JSON. So if we could point them at an existing Prometheus, that could be another input for such a block, which could be much, much smaller, because it would have only the one series that you really care about.
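For context, the backfilling entry points being referred to look like this — a sketch; these promtool subcommands exist in Prometheus releases of roughly that era, so check your version:

```sh
# Build TSDB blocks from an OpenMetrics-format sample file.
promtool tsdb create-blocks-from openmetrics samples.om ./data
# Backfill the output of recording rules by querying an existing server.
promtool tsdb create-blocks-from rules \
  --start 2021-05-11T04:00:00Z --end 2021-05-11T13:00:00Z \
  --url http://localhost:9090 rules.yml
```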
E: So that's something that would maybe make your tool even more lightweight. And you're also welcome to propose this project for the prometheus-community repository — though it's worth mentioning that this is not a way to just remove the project from your area of interest; it's not a random place full of maintainers with lots of time. Rather, it's how to make it a more adopted, more visible project for everyone else, with higher chances of having someone else help you. But yeah, just ideas to improve it.

We could also work on a better importing procedure. Right now you have to restart the whole Prometheus, and maybe even remove the actual existing data directory. There are ways to reload that dynamically, or maybe even add blocks on top of other blocks so they get vertically merged or vertically queried — I think some of those things are already doable, maybe behind some flags. So we could improve this behavior as well. But, just ideas, yeah.
B: With my MC hat on — we've got to cut it there. Thank you so much for showing us; please join us in our Slack channel. I bet there are lots of other ideas, and I'm actually interested in this discussion, but we need to move on to guard our time. So thank you so much again. Zain, are you ready to present? Cool — welcome, take it away.
F: I notice it says "SIG Observability" — it's a TAG now, right? All right — well, thanks, everyone, for having us over here. We're super excited to be part of the community. Just for context: I'm Zain Asgar. I'm currently the GM and GVP of Pixie and open source at New Relic, and prior to this I was the co-founder and CEO of Pixie Labs; we were acquired at the end of last year.
F: I'll go through the story of why we built Pixie and what Pixie does, and we'd love to get your feedback and your thoughts. As of about two weeks ago, everything over here is open source, so we're happy to get feedback and also collaborate on building this out in the open.
F: So why are we building Pixie now, and what problems were we aiming to solve? I'm not going to go through this entire marketing pitch, but basically: software is more decoupled.
F: Most of the Pixie team actually comes from a data systems and machine learning background, and observability is actually a relatively new area for us — so please feel free to give us product feedback. But some of the challenges we noticed: when you're building production systems, you usually end up adding a bunch of instrumentation, and a lot of this instrumentation is added up front.
F: Some of it is added as the programs are built out and services are deployed. We wanted to find a way to do, I'd say, 80% of this automatically, quickly, or in some way extensible after the fact — and we understand that in many cases it's actually useful to manually add instrumentation, where you have specific business logic you want to capture.
F: Another area we really looked at, especially as we started to tap into new data sources, is just the sheer data volume that's generated — especially captured metrics, traces, and logs — which you then dump over to some centralized backend, usually hosted by some provider. And then, just the extensibility of the interfaces.
F: So one of the things we did with Pixie is move to this model where — similar, I think, in many ways to Prometheus — everything is hosted inside of the cluster. One of the differences is that we focus a little bit more on the logs and traces; we do do metrics, but we're less optimized for some of the metric use cases.
F: More specifically, we moved away from having very rigid, predetermined collection to doing everything as code, with on-the-fly, no-instrumentation collection — we'll see how this works in a couple of minutes. Pixie also moves away from a cloud-only model to this edge-plus-cloud model.
F: One of the things we heavily leverage — I think someone earlier mentioned eBPF — is eBPF inside of Pixie, to automatically go and capture data and also to allow Pixie to be extended. Part of the challenge there is that you're basically flooded with so much data that it's very difficult to move it off the machine, so we had to build our compute platform to handle all this data while it's sitting on the actual machines where it's first collected.
F: Everything is API-driven and scriptable using this Python dialect. Pandas, if you're coming from the data systems world, is a pretty common way to represent data programs, and that's what we use in Pixie.
F: So, from a product perspective, what is Pixie? Again, it's instant, code-driven debugging. Part of our goal is: how much data can you get within less than five minutes of installing Pixie? That's the experience we optimize for — you get Pixie installed, and then what instant visibility can you get? We've been primarily focused on APM — application performance monitoring — and debugging, to provide this baseline level of visibility and start to get into code-level context.
F: Some of the people in the broader Pixie community — like Jana at AWS — have been looking at doing some security stuff with Pixie, because our scripting system is pretty extensible; I'll show that in a second. So, the way Pixie works at a very high level is that we have these things called Pixie Edge Modules. It's a little hard to see over here, so let me just blow this up.
F: Cool. So we have these Pixie Edge Modules, which install themselves as DaemonSets on Kubernetes nodes, and then we have another layer called Vizier, which basically serves as our semi-centralized data system. So our data system is split between storage on the edge and storage inside of the centralized collection system. And then we do have a cloud system to manage things like authentication, RBAC, metadata, search, and stuff like that.
F: All the components over here are actually open source, and you can launch your own public or private Pixie cloud, connect multiple clusters to it, and get administration across them. Or you could use our essentially free community hosting, which we have on Pixie Labs — it basically just hosts the open source version of Pixie for everyone to use. So there are two main components to Pixie.
F: As I mentioned: the cloud and the Vizier. Both of these are available, and then you can interface with the CLI — which is available on our GitHub repo over here — the web UI, or the APIs, which are either located in this directory or a copy in this repo.
F: So you can install Pixie with our CLI, and once you install it you can immediately — within a couple of minutes, depending on how long it takes Kubernetes to deploy all the pods and services — get visibility into your application. Let me actually start with the cluster-level view.
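For reference, the install flow as I recall it from Pixie's public docs at the time — treat the URL and commands as approximate, not authoritative:

```sh
bash -c "$(curl -fsSL https://withpixie.ai/install.sh)"   # install the px CLI
px deploy                                                 # deploy Pixie to the current cluster
```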
F: So there are a bunch of services running on this cluster; some of them have been making connections or requests to other services. You can get pretty deep inspection of what's going on — how much traffic is flowing, how many errors, all that stuff. Everything over here was done without adding any manual instrumentation to the code, or relying on any instrumentation
F: that's already in the code. I'll dive into something quickly — like "plc", which is where Pixie Cloud is hosted on this account over here — and you can actually go in and see where the requests are being made. Over here you can see that there's a proxy service that talks to the API service, which talks to the authentication service. We basically discover the whole graph and everything automatically.
F: This was a query that was done; this is the response status, and this is the response body. It's been truncated because it's long, but we actually do full-body tracing of requests, and the system works regardless of whether you're using SSL or not. In most cases — even all of this traffic is actually SSL — we can capture the SSL traffic and show you what's in it.
F: One of the things we've been really focused on is getting code-level context. If you're trying to debug slow requests and you click on the API server, you can see how a specific pod is behaving. And if you ever run into a performance issue, you can narrow down the time window and then see where your code performance issues are happening.
F: Part of our long-term goal is to make Pixie as developer-friendly as possible, which in our mind means getting more and more code-level stuff into Pixie and making it easier for engineers who are actually trying to debug performance issues — like, "Oh, where in the code should I go and look?" I don't want to dive into too many details of exactly how the flame graph works, but basically wider bars mean you're spending a lot of time over there.
F: So it's a good area to go debug. Just to add one little thing on here: you can see that our context starts at the entire cluster and goes down to a specific pod and container. We can actually profile across the entire cluster and tell you how different pods have different performance profiles.
B: Yeah — Zain, do you want questions inline, or do you want to finish?
F: Take them in any order — I was just going to show two more things quickly. One of the things I wanted to quickly show is that everything inside of Pixie is done using scripts. We have this Python dialect, PxL script — it's basically based on Pandas — so you can say, "Here's a DataFrame," and then you can do all sorts of operations on it. Our goal is to basically build up a bunch of scripts with the community.
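To give a flavor, here is a minimal PxL sketch (PxL is Pixie's Pandas-like Python dialect; the table name is as I recall it from Pixie's docs, and the service name and aggregation are illustrative):

```python
import px

# Pull the last 5 minutes of auto-traced HTTP events.
df = px.DataFrame(table='http_events', start_time='-5m')
# Keep one service, then count requests per path.
df = df[df.ctx['service'] == 'pl/api-service']
df = df.groupby('req_path').agg(requests=('latency', px.count))
px.display(df)
```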
F: There are already a lot of them that you can run. Our long-term goal is to be able to build workflows for specific things or specific types of problems — like, "Oh, I want to debug something that's slow: what are the best things to look at, what is the right flow?"
F: So everything is done using this Python dialect, compiled and executed on the fly. It's pretty powerful, in the sense that we can actually go — this one is super ugly, but here the code is actually from bpftrace.
F: You can use it in many different areas — in this case it's basically aggregated by source and destination endpoints. So, I guess, a step back: we have the notion of pre-built probes that are always instrumented, but you can also add temporary probes that you want to instrument and capture data from. Some of the other functionality we have, for the pre-built stuff, is:
F: say you want to take a look at Postgres data. We support many different database protocols, so you can actually see what the exact Postgres queries are — here's a Postgres query that was executed, and it got parsed; we have the raw string of all the Postgres events that are occurring. You can do similar things for MySQL or various other databases that we support.
F: So those are the built-in features, but then we're extensible by being able to add either bpftrace code or dynamic tracing for specific needs in your program.
F: I know I went through a lot, but feel free to ask me questions — and check it out and give us feedback. I'll stop there.
C: Yeah, that's amazing — it's really fascinating what you can do there. I have two questions, but the one I'd really like to understand — and maybe you can help me — is: if you compare it with Hubble, what are the differences or the overlaps, or when would you use one or the other? Is it totally different? What would you say in terms of comparing Pixie with Hubble?
F: Yeah, let's see — I'm trying to remember exactly which...
C: It's essentially the same high-level pitch, in the sense of using eBPF to provide that kind of insight.
F: Yeah, I remember when Hubble was coming out — I think it came out a little bit before we open-sourced Pixie, or rather right when we were launching Pixie as a product. I think the main difference is that Hubble is focused on generating metrics from the data.
F: If I understand correctly — whereas we help you do a lot more of the raw analysis. And in addition to the eBPF stuff, Pixie comes with an entire data system behind it. Like I said, we originally come from this machine learning and data systems background.
F: So for us, eBPF was: how do we get lots of data into Pixie very quickly, so we can actually do more things with it? It kind of came from that direction, whereas I think Hubble is coming from the other direction. So I definitely think there's a future where they're both complementary and can work together.
B: Yeah, thank you. I had a quick question, if I could, briefly — also sort of around exporting. Is there any facility in Pixie to export not a continuous stream of things, but findings? When you were talking about a systemic view across all pods of a cluster, for example: you might be doing a right-sizing experiment, where you're going from a staging cluster to a production cluster and you want to put in place the right kind of quotas or resource limits based on actual performance.
B: Not, you know, whatever defaults were either copy-pasted or put in, potentially, alongside what's running. Or if you wanted to identify problematic areas and generate some sort of actionable report that you could then give to a team and say: "Hey, everything seems to be working across these two dozen services, but these two are having problems." Is there any kind of reporting feature or export of findings?
F: Yeah — in some ways, our standard answer to that would be: you should just write a script that exports the findings, and then you could stream the output of the script. We do have streaming support, so you can aggregate the data — say you want to do this over a 30-minute window or something. We are also adding the ability to manage persistent scripts, so you can say, "Hey, run this script in the background" — and then, obviously, it flags errors
F: if the script stops for some reason. But run the script in the background, and if the script says "every 30 minutes make a request somewhere" to give some update, it will do that. Or you could call the script through the API and just watch the results, or request the results via the CLI.
B: Yeah — I was kind of thinking of running things at scale, across multiple regions and things like that. At least from folks I've talked to, you often have a fairly small team doing a whole lot of things. So if there was some way to say: of all the things happening, we've predefined these sorts of alerts or conditions — here are the things to look at.
F: I think that absolutely makes sense. We haven't focused on those use cases, but that's actually been brought up to us many, many times by various people — including Kelsey, who was really trying to convince us to do resource optimization stuff. We ultimately focused on performance monitoring, but I think he's actually going to demo some resource optimization stuff using Pixie in a couple of weeks — so we'll see how that works out — basically watching resources and being able to adapt resources on
C: Kubernetes. We have another question: Arthur asked how the SQL analysis works with stuff that runs outside — not in the cluster — and I think that's a really good question.
F: Yep — the short answer is that as long as one of the network endpoints is located in the cluster, you won't be able to tell the difference. So if you have a database running outside of the cluster, but the service accessing it is inside the cluster, it'll still work fine. We'll switch the tracing automatically from server side to client side, depending on which end is located within the cluster.
F: We obviously can't get you the information about the database itself — for example, we do JVM metrics and stuff, and you won't be able to get that if it's running on a different node. But over time — we haven't really prioritized this, but it's highly requested — we want to be able to run our PEMs, which is our DaemonSet, on an arbitrary Linux VM and have them phone the data home, as long as you have one Kubernetes cluster.
F: You can have as many VMs as you want. We do rely on Kubernetes to host our data system, and that's pretty hard for us to move away from. But as long as you have one Kubernetes cluster, you can connect as many Linux VMs as you want. Thanks.
E: A question — coming from the Prometheus world, I usually care about...
F: I guess I should have been a little more clear about this up front, but Pixie doesn't do long-term retention, and it doesn't do things like alerting. Our goal is to collect lots of data for a short period of time and then help you work through it and process that large volume of data.
B: Thanks. So are you saying that there's a plan — or there already is the ability — to effectively treat Pixie as an exporter of sorts, from a Prometheus perspective? Like, you've got this sea of data from which you could derive interesting time series, either directly or by computing rates live, and then provide a scrapable endpoint. Is that what you meant by exporting?
F: Our goal would be more the former: you have a script that's basically generating metrics from all this data, and then those metrics can get exported to Prometheus, either by us exposing an endpoint or through some push gateway or something — we haven't quite figured that out yet. There is a prototype that basically uses the Prometheus push gateway, but we're open to suggestions.
B: I'd love to talk after, then — we've written a Prometheus aggregating push gateway for metrics from our front end for the same thing: we needed to get some visibility but had the problem of way too much data, so we aggregate, versus a standard push gateway. But — awesome.
E: Thank you for explaining this; this is quite epic. I wonder what's involved in installing this eBPF kind of instrumentation inside my cluster. What should I do? I have a Kubernetes cluster — what steps do I need to take to ensure that things are installed properly, and that Pixie knows how my service is actually named and which service endpoints I really care about?
F: Yeah — right now we capture everything; we plan to provide more configuration around "don't record this" or "record more info about this." But in the current state, we connect to the Kubernetes API and discover all the services and pods and everything, so it's all transparent. It actually only takes about two or three minutes to install Pixie, and the dashboards I showed you were not specifically configured — they should just automatically work on those clusters.
F: We do have some challenges with certain environments. Running on things like kind — Kubernetes-in-Docker — can be very challenging for us, because kind actually runs on your local Linux machine. But on typical Kubernetes clusters — even minikube, or GKE, or AKS, or EKS, or whatever — we don't have any issues with self-hosted.
F: The continuous profiler, for example, is about half a percent of overhead. If you just use the continuous profiler, it's pretty constant — it'll use up half a percent of your CPU regardless of what your server load is. And typically, for the network tracing, we see somewhere between two to four percent.
A: So it's proportional to the traffic, more or less?
F: The regular network tracing and the database stuff is proportional to the traffic. The continuous profiler is independent of the traffic and of the load, because we basically do stack sampling — so it's a constant overhead, even if the server load varies.
B: On the data model underneath Pixie — where you're storing things both in node-local storage and elsewhere — is that one security domain, if you will? Is it sort of: you have access to all the things, or not? With eBPF, obviously, you can get inside SSL tunnels, you can access all kinds of things, and when combined with a script, this is basically god mode plus visibility.
F: Yeah, that's what we call it — that god-mode visibility for clusters. Our goal is actually — sorry, Matt, I think I spoke over you. Were you asking about plans for adding more security?
B: Well, no — I was curious whether there's currently an RBAC model, or whether there are any ways to compartmentalize the data. Is it just access to all or nothing, or is there a security framework or model to allow for protecting some of the data that may be sensitive?
F: Right now it's sort of all or nothing. We do want to move to an RBAC model where we can restrict people from accessing certain tables, or even certain fields in certain tables, but we're not there yet. For example, we could prevent users from accessing encrypted traffic — but, again, we're not there yet. That's something on our roadmap that we plan to build out, because it's obviously very important as we scale out.
B: Sure — and I know we're getting short on time; this has been fascinating, I could watch this all day. But how should people get involved if they're interested in this — say they want to play with it, or they have ideas, or they want to make it better?
F: I'd say three things. One — as someone pointed out, there's a Slack community, so please join the Slack community. It'll be super helpful for us to get feedback over there, and it's an area where you can have a lot more discussion with somebody on our team.
F: The other thing is to file GitHub issues. We're not very actively asking for contributions yet, but if there's some area you're really interested in working on, file an issue and we'll figure out a process for that. We will be opening up for much more active contributions in a couple of months — it's just that we're trying to get everything organized and figured out, because we started out as a SaaS product and we're making everything open source.
F: Yeah — so I think there are a couple of areas where, in the short term, we could actively use contributions. One of them: we're working on a Grafana plugin for Pixie. Part of our thing is that we want to work with all the other tools in the ecosystem — our goal is not to build some amazing UI, or —
F: We have a UI for debugging, but we want people to be able to use Grafana and other popular tools. We're probably actually going to flip the switch on the Grafana repo today, so we'd be very happy to get contributions and help on that. The other thing, obviously — I think we mentioned the Prometheus side — is that we'd love to get some help figuring out the right way to do that.
B: Any other questions before we break? We're almost at time. All right — well, thank you so much to everyone who had questions and discussion, and for the presentations themselves. This is really exciting stuff to see, and I will see you all online and next week.