From YouTube: Monitoring Future Vision Workflows Discussion
Description
Conversation with Sid (CEO, GitLab), Sarah (Product Manager, Monitor, GitLab), and Kenny (Director of Product, Ops, GitLab).
https://about.gitlab.com/direction/monitor/#future-vision
Sid: Hey, I've been noticing that we have a great vision and we've got some great worksheets, where we talk about how to triage, how to resolve, and how to improve monitoring further, but we have some things that are still very minimal and are getting lapped. Today, for example, the logs are not aggregated; the logs all live in the individual containers, which is kind of a bad practice.
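For context on why per-container logs are painful: with nothing aggregating them, searching logs means visiting every container through the Kubernetes API, and the logs vanish when the pod does. A minimal sketch of that manual workflow (the namespace and looping script are illustrative assumptions, not anything GitLab ships):

```python
import json
import subprocess

def dump_all_container_logs(namespace="default"):
    """What un-aggregated logging looks like: visit every container.

    Without aggregation, the only way to search logs is to pull them
    pod by pod and container by container via kubectl, and anything
    from a deleted pod is already gone.
    """
    pods = json.loads(subprocess.check_output(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"]))
    for pod in pods["items"]:
        name = pod["metadata"]["name"]
        for container in pod["spec"]["containers"]:
            print(f"=== {name}/{container['name']} ===")
            subprocess.run(["kubectl", "logs", "-n", namespace,
                            name, "-c", container["name"]])
```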
Kenny: And I'll just say: there are two halves of our Monitor group. One is the health side, which Sarah is the product manager on, and the other one is the APM side, where I have a new product manager named Doug, who came to us from Elastic and who was not able to attend. We have on the schedule for 12.5 the installation of Elastic into your Kubernetes cluster, so that we can start aggregating the logs, and he also has an issue this release that, at a minimum, lets…
Sid: I think those are exactly the right things: take the existing functionality, although very, very minimal, and make it at least work, and then also work on Elasticsearch. And I want to check an assumption. The assumption is that when you install Elasticsearch, GitLab will automatically start sending all your logs there, with whatever it is, Logstash or something else, Filebeat, on that? Cool, that's great.
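The shipping step Sid is asking about is normally handled by a log shipper such as Filebeat or Logstash running in the cluster. As a rough sketch of what such a shipper does, here is a hypothetical Python version of the core step, using Elasticsearch's real bulk API; the endpoint, index name, and field layout are illustrative assumptions:

```python
import json
from datetime import datetime, timezone

import requests

ES_URL = "http://elasticsearch:9200"  # hypothetical in-cluster endpoint

def ship_container_logs(log_path, pod, container, index="container-logs"):
    """Bulk-index one container's log file into Elasticsearch.

    Stands in for what a shipper like Filebeat does continuously:
    tail each per-container log file and forward every line with
    enough metadata to search across the whole cluster later.
    """
    lines = []
    now = datetime.now(timezone.utc).isoformat()
    with open(log_path) as f:
        for line in f:
            lines.append(json.dumps({"index": {"_index": index}}))
            lines.append(json.dumps({
                "@timestamp": now,
                "message": line.rstrip("\n"),
                "kubernetes": {"pod": pod, "container": container},
            }))
    resp = requests.post(
        f"{ES_URL}/_bulk",
        data="\n".join(lines) + "\n",
        headers={"Content-Type": "application/x-ndjson"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["errors"] is False  # True when every line indexed
```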
Sid: And Sarah, I'm sorry if this is not super relevant for you, because I think you've been doing great work on the health side, and I like a lot of that; there's a lot of progress, and I'm not worried. The only question there, and I think I asked it before: what's important is how we're doing on auto-remediation. Is that on track to launch this year? Auto-remediation is: hey, a vulnerability was found in a dependency I use, and it gets updated without me doing anything.
Kenny: Yeah, correct, in the sense that it doesn't create an incident; it just creates a merge request. Sarah's world in health is really about incidents and that kind of firefighting response. There's also a bunch of work that Sarah recently did, an opportunity canvas review around error tracking, which is important not just for production applications, but for developers.
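A toy outline of the auto-remediation flow described above, sketched against GitLab's REST API (commit a fix to a branch, then open a merge request). The instance URL, project id, and helper name are hypothetical; the real feature lives inside GitLab's security scanning rather than a script like this:

```python
import requests

GITLAB = "https://gitlab.example.com/api/v4"  # hypothetical instance
PROJECT = 42                                  # hypothetical project id
HEADERS = {"PRIVATE-TOKEN": "<token>"}        # access token placeholder

def open_remediation_mr(dep, fixed_version, patched_file, patched_content):
    """Open a merge request that bumps a vulnerable dependency.

    Mirrors the shape of auto-remediation: a branch with the fix is
    created and a merge request is opened; no incident is involved.
    """
    branch = f"remediate-{dep}-{fixed_version}"
    # 1. Commit the patched dependency file to a new branch.
    requests.post(
        f"{GITLAB}/projects/{PROJECT}/repository/commits",
        headers=HEADERS,
        json={
            "branch": branch,
            "start_branch": "master",
            "commit_message": f"Update {dep} to {fixed_version}",
            "actions": [{
                "action": "update",
                "file_path": patched_file,
                "content": patched_content,
            }],
        },
    ).raise_for_status()
    # 2. Open the merge request for a human to review and merge.
    mr = requests.post(
        f"{GITLAB}/projects/{PROJECT}/merge_requests",
        headers=HEADERS,
        json={
            "source_branch": branch,
            "target_branch": "master",
            "title": f"Resolve vulnerability: update {dep} to {fixed_version}",
        },
    )
    mr.raise_for_status()
    return mr.json()["web_url"]
```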
Sarah: And then we are further embedding it in the GitLab development workflows: correlation between Sentry errors and releases, Sentry errors and merge requests, surfacing all of this intelligence that we get from our open source partner in different places within GitLab. The ultimate goal being: how much information can we provide someone within GitLab so that they don't ever need to go to Sentry, without actually rebuilding all of the workflows and functionality?
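On the mechanics of correlating Sentry errors with releases and merge requests: the usual building block is tagging every event with a release identifier when the SDK is initialized. A minimal sketch using the standard sentry_sdk API and the CI_COMMIT_SHA variable that GitLab CI provides; this illustrates the Sentry side of the correlation, not GitLab's integration itself:

```python
import os

import sentry_sdk

# Tag every event with a release (here, the commit SHA that GitLab CI
# exposes) so that errors can later be correlated back to releases and
# the merge requests that shipped them.
sentry_sdk.init(
    dsn=os.environ["SENTRY_DSN"],             # project DSN from Sentry
    release=os.environ.get("CI_COMMIT_SHA"),  # set by GitLab CI
    environment="production",
)
```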
Sid: That is a strategy that I'm loving. I think it's so great that we're first using Sentry to the maximum, so we can get value to our customers as soon as possible, and then we don't forget about the long-term goal of having everything in a single interface. But I agree that, even when it's in the GitLab interface, we can still, for example, aggregate errors and make sure that errors that are about the same thing get combined together.

That's such a hard thing that takes years of experience, and Sentry is doing such a good job at it. They're so good about their open source values; everything on sentry.com ships in their open source client. So I love that you're combining that, and I think viewing Sentry as a piece of infrastructure, like we view Prometheus, for example, is exactly the right path.
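To make concrete what "combining errors about the same thing" involves: grouping typically fingerprints each event by a normalized stack trace, so repeats of one bug collapse into a single group. A deliberately simplified illustration follows; Sentry's real grouping is far more sophisticated (frame weighting, exception types, server-side rules):

```python
import hashlib
import re

def fingerprint(stack_trace):
    """Toy grouping key: hash a normalized stack trace.

    Strips details that vary between occurrences of the same bug
    (hex addresses, line numbers, other literals) so that repeats
    hash to the same group key.
    """
    normalized = re.sub(r"0x[0-9a-f]+", "ADDR", stack_trace)
    normalized = re.sub(r"\d+", "N", normalized)
    return hashlib.sha1(normalized.encode()).hexdigest()[:12]

groups = {}
for event in ("ValueError in worker.py line 10 at 0x7f3a",
              "ValueError in worker.py line 10 at 0x9b2c"):
    groups.setdefault(fingerprint(event), []).append(event)

assert len(groups) == 1  # both occurrences collapse into one group
```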
A
I'm
I'm
gonna
keep
trying
to
have
a
question.
That's
at
least
relevant
to
you
I'm,
not
sure
this
is
one
but
I'm
interested
in
it
as
I'll
ask
anyway,
when
you
look
at
the
health
of
things,
it's
really
important
to
get
kind
of
an
overview
of
the
entire
cluster
and
I
think
there's
a
nice
cluster
view
in
data
dog
rat
hat
is
making
progress
Jiali
that
has
a
nice
cluster
view.
Sarah: That's part of the future vision that Kenny, Doug, and I are collaborating on. My vision of what that would look like is holistic: your infrastructure systems and where they are running, with different visual indications that allow you to easily drill in on that part of the service. And so, as you pick a system or a service, your view expands to include just that section of your application. You're provided with the metrics, the aggregated logs, the stack traces, the synthetics, the impact on your user.

Additionally, business metrics that you've set within GitLab, and where you are at or have exceeded your service level objective thresholds. Splunk does this a little bit, where they have the bigger view and then a really easy way to drill in and then link to other services that give you more insight. So: yes, it's on the roadmap; no, we do not have mocks right now, but that's the ultimate goal, from a health perspective, of where I would like to be. Cool?
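A short worked example of what "exceeded your service level objective thresholds" means in numbers: a 99.9% availability SLO over a window of ten million requests permits 10,000 failures, so 12,345 observed failures blow the budget. The function below is an illustrative sketch, not a GitLab feature:

```python
def error_budget_report(slo, total, failed):
    """Report error-budget consumption against an SLO.

    slo: target success ratio, e.g. 0.999 for "three nines".
    total/failed: request counts over the SLO window (e.g. 30 days).
    """
    budget = (1 - slo) * total  # failures the SLO permits
    consumed = failed / budget if budget else float("inf")
    status = "EXCEEDED" if failed > budget else "within budget"
    return (f"allowed failures: {budget:.0f}, observed: {failed}, "
            f"budget consumed: {consumed:.0%} ({status})")

# 10M requests at a 99.9% SLO -> 10,000 allowed failures.
print(error_budget_report(0.999, 10_000_000, 12_345))  # EXCEEDED
```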
Kenny: I think we see it a little bit. Well, first of all, from our dogfooding it: we're starting to build a project within GitLab called GitLab Services that has a number of different internal, smaller applications running on it, and so we're starting to see more usage there. And we're seeing customers ask for it, maybe in a roundabout way, because they're doing infrastructure as code, where they're managing their platform as a specific GitLab project and then using Terraform to deploy and, kind of, monitor the health of it. So today, to the extent that people aren't doing that directly with Kubernetes, we're seeing it as infrastructure as code. We'd like to continue to support teams who are doing it with Kubernetes, and we have a dogfooding use case today. Cool.
Sid: If there's a problem, it's really important to go from the alert, to probably the metrics that caused the alert, probably to the logs, to look at what specifically is going wrong. And a way to do that is to select a time in that incident and then view the relevant logs. I think, for example, Datadog does a great job of showing that. When are we going to get there with GitLab, and if so, what's the route to that?
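The "select a time in the incident, then view the relevant logs" workflow maps naturally onto a time-range query against the aggregated log store. A sketch against Elasticsearch's search API, reusing the hypothetical endpoint and index from the earlier shipping example:

```python
import requests

ES_URL = "http://elasticsearch:9200"  # hypothetical endpoint, as before

def logs_for_incident(start, end, index="container-logs"):
    """Fetch log lines inside an incident's time window.

    start/end are ISO-8601 timestamps taken from the selected slice
    of the metrics chart; the range filter narrows the search to
    exactly the period when the alert fired.
    """
    query = {
        "query": {"range": {"@timestamp": {"gte": start, "lte": end}}},
        "sort": [{"@timestamp": "asc"}],
        "size": 500,
    }
    resp = requests.post(f"{ES_URL}/{index}/_search", json=query, timeout=30)
    resp.raise_for_status()
    return [hit["_source"] for hit in resp.json()["hits"]["hits"]]
```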
Kenny: We should also have each group be assigned some of these workflows, so that we don't lose sight of things like: oh, Doug is working on logging, but Sarah really needs that log aggregation time slicing for triage. So that we don't lose sight of the interconnection, because the categories in the monitor world today are just the legacy effect of the different companies that used to have point solutions here, but it's becoming more about this combined workflow. Yeah.
Sid: For sure, yeah. With us it started a bit earlier; we're a single application for the DevOps workflow. But it's clearly monitoring where the consolidation started earlier, and you see all the great companies, like Datadog, like Honeycomb; even Splunk, I think, is adding some metrics to their offerings, or they acquired a metrics company. So I think the consolidation is already in full swing there. It was... SignalFx, that's it, thank you. Great acquisition, great company.
Sarah: You want to do that? Yeah, give me a moment. Okay, so while I search for these issues, I'll just give another recap of the plan. We're going to start by making the view of Sentry errors in GitLab far more robust, and then adding detailed error views as you want to drill in, and then an easy connection to creating issues, which will allow us to leverage the incident management workflows we just spent the last quarter building. The next milestone following that will be all about correlation: how can we tie errors tighter into the GitLab workflow?
Okay, so what we're looking at is our first iteration on the error list within GitLab. You have the ability to filter, to search, and to click out to view an error in Sentry, and that connects you to a handful of different actions that you can take on errors within Sentry. So again, our goal is to provide enough information so that nobody needs to leave the tool; give them enough pertinent details so they can decide if they need to ignore, triage, or create issues beyond this. Cool.
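The actions Sarah lists sit on top of Sentry's web API: listing a project's error groups and updating an individual issue's status. A minimal sketch with hypothetical org and project slugs and an elided auth token:

```python
import requests

SENTRY = "https://sentry.io/api/0"
HEADERS = {"Authorization": "Bearer <token>"}  # auth token placeholder
ORG, PROJ = "my-org", "my-project"             # hypothetical slugs

def list_errors(search="is:unresolved"):
    """List a project's error groups, as the GitLab error list does."""
    resp = requests.get(
        f"{SENTRY}/projects/{ORG}/{PROJ}/issues/",
        headers=HEADERS,
        params={"query": search},
    )
    resp.raise_for_status()
    return resp.json()

def ignore_error(issue_id):
    """One of the actions surfaced next to each error: ignore it."""
    requests.put(
        f"{SENTRY}/issues/{issue_id}/",
        headers=HEADERS,
        json={"status": "ignored"},
    ).raise_for_status()
```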
Sid: Cool. Yes: what's taking us so long to improve the logging, Kenny? Not about people, but is there an organizational thing? Were there too few engineers? What's there, missing direction? Or is it just that we focused on metrics, and it was so much work to get the charts into GitLab, and we did that part?
Kenny: Part of it is that, I would also remind you, the monitoring team is responsible for the self-monitoring too, and so we've spent a lot of time, as part of the self-managed scalability working group, improving our access to metrics for the GitLab instance itself. But I think, if I'm being honest, part of it is also direction. We spent some time asking: is it Elasticsearch, or is it some other tool?
Sid: That makes sense. Yeah, I've looked around to see alternatives, because Elastic requires so much compute if you do that at, like, exabyte scale, but it doesn't seem like there's anything else. And I think the alternative we might have considered was the one from the Prometheus people, with, like, indexing. That's right.