From YouTube: Discussion about product analytics export functionality
Description
Max Woolf, Dennis Tang, and Sam Kerr discuss the problem, proposal, and requirements around exporting user data for Product Analytics.
A
Perfect. So we're here today to talk through some questions about the requirements around exporting data from Product Analytics. We're currently scoping that to two different issues, but for this discussion we're talking about the one about exporting to JSON files.
B
That's right. The overall aim is that data in GitLab's new Product Analytics capability should be portable: exportable and movable to other tools. That's the general gist of what we're trying to aim for. Would you agree with that, Sam?
A
Maybe a little bit softer on that last part about integrating with other tools. That's a value-add feature, and it's something we want to do, but it's not a critical requirement. Really, the base-level requirement is that we need to make sure customers and users don't feel like they're locked into GitLab because we're holding their data and won't give it to them.
B
Okay, so that's a useful "why". The problem I've come up with here is that in the issue itself, we're discussing implementing an API that connects to Cube and then exports all the user data, which sounds right. However, Cube doesn't really allow you to do that. My understanding of Cube, at least, is that it takes data and produces information based on dimensions and measures.
B
So the idea that we could use Cube to output a list of raw JSON objects, I don't think, is possible. And even if it were, it's not really what Cube is designed for, so I'm not sure how scalable it would be. Which leads me to two questions. I guess one is: is the outcome here that we just want access to the raw data?
B
Do we want to be able to take the raw data that's in ClickHouse and just provide that as a JSON file to the user? Or do we want to produce something that Cube can give us, which is probably going to be aggregated by some sort of time window, and then provide that? Because that's essentially what Cube does anyway: it just produces JSON that you can then turn into graphs, if you want to.
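For context, that aggregated output is what Cube's REST API serves at its documented `/cubejs-api/v1/load` endpoint. A minimal sketch of the kind of query involved, assuming a local Cube deployment and a hypothetical `Events` cube with a `count` measure and a timestamp dimension (the schema names are illustrative, not GitLab's actual setup):

```python
import json
import urllib.parse
import urllib.request

# Hypothetical Cube query: a count of events bucketed by day.
# Cube returns rows shaped by measures and dimensions, not the raw
# underlying events -- which is the limitation discussed above.
query = {
    "measures": ["Events.count"],
    "timeDimensions": [
        {"dimension": "Events.timestamp", "granularity": "day"}
    ],
}

url = (
    "http://localhost:4000/cubejs-api/v1/load?query="
    + urllib.parse.quote(json.dumps(query))
)
# Cube expects an auth token in the Authorization header; placeholder here.
req = urllib.request.Request(url, headers={"Authorization": "CUBE_API_TOKEN"})
with urllib.request.urlopen(req) as resp:
    rows = json.load(resp)["data"]
    print(rows)  # e.g. [{"Events.timestamp.day": "...", "Events.count": "42"}, ...]
```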
B
So, yeah, I guess I want to straighten out whether the outcome here is that we want access to ClickHouse data that hasn't been edited or analyzed in any way whatsoever, which is what I'm looking for. In which case, right now we don't have that direct connection from GitLab to ClickHouse, and Cube handles all of that security stuff that stops the data leaking between projects. So that's just not something we've really considered right now. So yeah, that's where we are.
A
Yeah, so I think there are two parts to the question you asked. One is about the raw data: that's definitely where we want to be, with the intention being that you could take all your data from GitLab in a lossless format, and you could do any sort of analysis or whatever you want to do with it after it's out of our system.
A
The second part, about JSON specifically, I'm a little bit more flexible on. We picked that from an earlier discussion because we thought it would be simpler to do versus some other format, but JSON itself is not a hard requirement. If there's something easier to use, that's fine!
A
Sorry, the reason I say that is I saw Rob's comments about Cube having a backup option, which it looks like just exports a zip file. A zip file would be totally fine as well, if it's feasible to go that route.
C
I don't think format's the main concern here, though; it's more about how we're exporting the data, or where from, right? And I agree with Max's assessment that Cube is not the right tool, because I think one thing we could have clarified is that exporting is just that.
C
From my perspective, from our production readiness conversations as well, I think there might be some prerequisite steps we would need to require for us to feel comfortable with that, since Max is mentioning that, you know, Cube has been handling all these security contexts and stuff like that. I really just last-minute dropped the link to CHProxy, which handles authentication, rate limiting of requests, and things like that. So at some point, I guess, this is turning into my perspective on it, and I apologize for that; I should have just started with that perspective.
C
But then, even if we implement CHProxy, which is just another set of security tools, and I do like that you can map users to others. Anyway, I think we need a layer in front of ClickHouse before we expose it, is what I'm saying, because it seems like that's what we would need for this exporting job. And that doesn't even get into the further details of, like, where do we store these artifacts?
C
You know, eventually we're talking about gigabytes, if not terabytes, of data. How does that work? So my...
B
Yeah, you're basically exactly right. At the moment, ClickHouse is not accessible to the internet, essentially, and that buys us the security we need, in the sense that you can't access it. The only way you can access it is via Cube, and Cube does all the stuff that says you can't access this row, but you can access this table, based on who you are. Which is great, and that's exactly what we need, yeah.
B
The bigger problem we're going to have is if we need to export ClickHouse data in some sort of automated fashion. How will we do that? We'll need to do exactly the same thing we're doing with Cube. So, in short, I agree with Dennis, in the sense that anything's possible; we can do anything we want. Whether or not we can do that as part of an MVC is very much up for debate, I would say. I mean, it comes down to priorities.
A
Well, let me ask another question about the zip file, because the thing that piqued my interest there was not the change in format versus JSON, but the fact that now we'd be managing a single binary blob. That might open up some simpler options for us, because I agree: exposing ClickHouse directly to the internet is not the direction I think we should go in, for a myriad of concerns.
A
If we have a backup zip file, though, would that open up the option for us to store that file somewhere, say in an S3 bucket, and send the user an email saying, "Here's where your backup is, go download it"?
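For what it's worth, the "email a download link" flow usually leans on pre-signed URLs so the bucket itself stays private. A minimal sketch with boto3, with a hypothetical bucket and key layout (nothing here was decided in this call):

```python
import boto3

# Placeholder names; the bucket/key layout is illustrative only.
BUCKET = "product-analytics-exports"
KEY = "group-123/project-456/export.zip"

s3 = boto3.client("s3")

# Generate a time-limited download link instead of making the object public.
url = s3.generate_presigned_url(
    ClientMethod="get_object",
    Params={"Bucket": BUCKET, "Key": KEY},
    ExpiresIn=7 * 24 * 3600,  # link expires after 7 days (the SigV4 maximum)
)
print(url)  # this is what the notification email would contain
```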
B
I mean, it's a solution, but it doesn't solve the problem that we have. It just pushes the problem further down the line, where we're potentially going to be sending gigabytes or terabytes of data. If we're sending that, essentially, between one data center and another data center, then there's a cost implication in terms of bandwidth there, especially for very, very large files. But if we ignore all that, yes, that can totally be done. That's probably...

B
Right, that helps. Yes, we can totally do that from a technical point of view; if we put cost and bandwidth to the side, that's doable. But the concern, then, is that we need to write the logic and the code that says: when I back up project X, that needs to export the particular ClickHouse database.
C
We have existing configuration options for this right now, I believe, with regards to project exports and those artifacts. Those will write to your server, in a writable directory if the instance resides there, or to cloud storage if you have that defined. So I guess, for the purposes of this MVC, and ignoring the extreme edge cases of, you know, terabytes of data and not having enough space for that: do you think that's a suitable MVC start?
C
What would have to happen, it sounds like, is that instead of doing this backup option that Rob has posted, which I think does make sense, we would at least require some type of minimal backup service to interface with ClickHouse or CHProxy, which I think would be the preferred option. So then there are a couple of dependencies here, right? We have CHProxy, and we have a backup service to actually execute these backup arguments and then say: hey, once you've done...
B
I mean, the one thing we've got going for us here is that ClickHouse has native integration with S3 specifically, which probably reduces a little bit of the complexity, if we set that up.
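As a concrete illustration of that native integration, ClickHouse can write a backup straight to S3 from SQL. A rough sketch over ClickHouse's HTTP interface; the table, bucket, and credentials are placeholders, not anything agreed in this discussion:

```python
import urllib.request

# Placeholder values; the real database/table layout and bucket were
# not decided in this discussion.
backup_sql = """
BACKUP TABLE analytics.events
TO S3(
    'https://product-analytics-exports.s3.amazonaws.com/project-456/backup',
    'AWS_KEY_ID',
    'AWS_SECRET'
)
"""

# ClickHouse accepts SQL statements POSTed to its HTTP interface (port 8123).
req = urllib.request.Request(
    "http://clickhouse:8123/",
    data=backup_sql.encode("utf-8"),
    method="POST",
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode())  # backup id and status on success
```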
B
That will allow us to say: okay, cool, we'll take this project and send it to our storage, using the predefined S3 credentials if it's been set up to do so, and that can be downloaded. That bit I'm not worried about; I think that seems totally fine, and it will take as long as it takes in terms of size and bandwidth. Yeah, the bigger concern is those dependencies on CHProxy.
B
And how we interface with that safely, because my biggest concern is that we end up introducing some sort of vulnerability that means someone can export the data from one project, and then access it and store it in S3, and then how do we make sure...
C
...that we can download it, yeah. So, as I understand it, CHProxy allows you to define arbitrary users to then interface with clusters or databases that actually have users authenticated to them. So I wonder if there's some type of way we can set that up dynamically, so that the backup job only has access for that window when it's initiated, and that's it, or something like that. But that's very investigatory, for CHProxy specifically.
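To make the "dynamic user" idea concrete, one possible shape (purely a sketch, not a validated design) is generating a short-lived CHProxy user entry when a backup job starts. CHProxy is configured via YAML, so a job runner could render something like the following; the user names, limits, and cluster layout here are all hypothetical:

```python
import secrets
import yaml  # PyYAML; CHProxy reads a YAML config file

def ephemeral_export_user(project_id: int) -> dict:
    """Build a CHProxy user entry scoped to one export job.

    Hypothetical mapping: the proxy-side user forwards to a pre-created
    read-only ClickHouse user, with tight limits so a single export
    can't monopolize the cluster.
    """
    return {
        "name": f"export-{project_id}-{secrets.token_hex(4)}",
        "password": secrets.token_urlsafe(24),
        "to_cluster": "analytics",       # cluster name is illustrative
        "to_user": "readonly_exporter",  # read-only ClickHouse-side user
        "max_concurrent_queries": 1,
        "max_execution_time": "30m",
        "requests_per_minute": 5,
    }

config = {
    "server": {"http": {"listen_addr": ":9090"}},
    "users": [ephemeral_export_user(456)],
    "clusters": [
        {
            "name": "analytics",
            "nodes": ["clickhouse:8123"],
            "users": [{"name": "readonly_exporter", "password": "..."}],
        }
    ],
}
print(yaml.safe_dump(config, sort_keys=False))
```

The entry would then need to be removed when the job finishes, which is exactly the investigatory part mentioned above.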
C
I think that's the last question we need in terms of requirements, because we were thinking about something similar in a previous group, with Compliance, for example. When we wanted to push events somewhere, we just said: okay, for our first iteration we're going to support only AWS, right? So, which...
B
Amazon only right now. I'm just looking now, but judging by the backup documentation, it only supports AWS S3. Yeah, I'm pretty sure you can back up to a local disk, so you can generate the file, but you've got to find somewhere locally to store that file, and then you could, if you wanted to, send that to GCS. But again, that's another dependency we'd have to, yeah...
C
Because that's a stacked dependency then: I have to attach another storage volume to the ClickHouse instance to save to, which then has to be big enough, to transfer to GCS, right? Yeah, which is fine. And that's the question, Max: should we go away from AWS? These are the things we have to figure out.
B
There could be a plugin for it. This seems like a problem where we're probably not the first people to have it, so that may be worth looking into. All I'm doing is looking at the official ClickHouse documentation, but it's fairly extensible, so it's possible there's a way to do this elsewhere.
B
I was going to say, for self-managed, no, because, well, it depends on the size of the project. But I've run into this problem before administering self-managed GitLab instances: you've got GitLab backups, and if you're storing them on the same servers that you're running GitLab on, eventually they just fill up. And you know, if your instance is 10, 50, 100 gigabytes, a terabyte, that can cause trouble, and you can end up losing those files, essentially. So yeah, it feels like we should be saying to people using this:
B
if you're going to back up from Product Analytics, don't store it locally over a certain size, for example.
C
Your question, Sam, was: can we use that location to simplify things? That doesn't really work, because that's the final location where the user can retrieve it, but we're talking about the service in between. We don't want to expose ClickHouse; we want to limit its access to anything, right, as much as possible. So we don't want that writing to the GitLab server.
A
Yeah, yeah, that makes sense; thanks for explaining that. Something else that comes to mind on this is that we will need to put some sort of rate limit or abuse protection in here, because if you click export 5,000 times a second with a script or something, that could just flood us completely, yeah.
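As a sketch of the kind of guard being described here (the actual enforcement, per the reply below, would live at the CHProxy level and in the existing Rails controls), a per-user throttle on the export endpoint could be as simple as a fixed-window counter; the window size and limit are arbitrary illustrations:

```python
import time
from collections import defaultdict

# Arbitrary illustrative limits: at most 2 export requests
# per user per hour.
WINDOW_SECONDS = 3600
MAX_REQUESTS = 2

_requests: dict[tuple[int, int], int] = defaultdict(int)

def allow_export(user_id: int) -> bool:
    """Fixed-window rate limiter keyed by (user, hour bucket)."""
    bucket = int(time.time()) // WINDOW_SECONDS
    key = (user_id, bucket)
    if _requests[key] >= MAX_REQUESTS:
        return False  # reject: over the per-window limit
    _requests[key] += 1
    return True

# The export endpoint would check this before enqueueing a backup job:
if not allow_export(user_id=42):
    print("429 Too Many Requests")
```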
C
So that will be implemented at the CHProxy level, as well as in the backup service, since we should already have controls in Rails for rate limiting. So yeah, okay. I guess the summary, then, is that this is obviously a lot more complicated than we thought, and I don't think it's immediately actionable. So potentially, then, it means, I don't know, for Max, perhaps it might be best to look into either session aggregation or funnels, yeah.
B
Yeah, I'd probably agree. I mean, I'm happy to look into this stuff, but I think, by the sounds of it, the initial problem is an infrastructure one rather than a GitLab code problem. In terms of other things, I'm looking at the aggregation thing as well, and it's throwing up some potential issues too, but that's a discussion for later, I think, for another call.
A
Yeah, it sounds like this is a lot more complicated than originally speculated. So how about we do this for next steps: it almost sounds like we need to have this issue sit in planning breakdown for a while to get an implementation plan against it, potentially with a spike issue.
A
If it's that big, we also need some requirements updates, as well as some implementation notes that I'll need y'all's help on, to kind of capture what we discussed today. Some of the things that I think will need to be done in the requirements updates, and I can do these, let me know what y'all think, are to remove some of the specificity around JSON files, or to re-clarify JSON as just a placeholder for something simple, and change it if it's easier to do as a zip file,
A
a tar file, whatever. Also, talking a little bit more about the security requirements: customer A can't export customer B's data, can't see other customers' data, that sort of thing, as well as the abuse and inappropriate-use protections and rate limits.
A
Okay, what do you all think about that from a requirements perspective? I don't want to talk about implementation details in requirements, just to make sure we have the freedom to go and do whatever we need to in terms of solving the problem.
A
...to export the dashboard, because: I'm an analyst, I built this really cool thing that shows my boss here's what we need to do. Here you go, here's the report. Great, I get a promotion. This is really targeted more at the pain point of: I need to change vendors, or I'm moving to a different system and I need to export all my data wholesale.
B
In which case, I could probably do with some input from either or both of you about where, because I was probably going to spend the majority of this milestone looking at this, until I dug into it and realized that maybe that's not happening. So, how should I think about where my effort will be best concentrated this milestone: have a look at pre-aggregation, funnels...
C
...or anything else? I think sessions, ultimately, just off the top of my head, because the direction we wanted to get towards is shipping dashboards, and I think sessions play a big part of the audience dashboard. Okay, so if there are complications with that, though, I'm happy to set up another call to workshop that, yeah.
A
Right, I'd agree with the sessions direction. That's one of the key things people are going to be looking for once that's available.
B
Yeah, well, in that case, can you set up a call with you and me tomorrow, Dennis, like tomorrow afternoon my time? We can go over that then, sure.
C
Okay, sounds good. I'll try to find a spot on the calendars. Alrighty, so we'll create some more issues out of this and set up some spikes, and then we'll try to figure out where it...
C
We're going to have to be more thoughtful about this, because there's no easy one-click thing I can knock out of this, unfortunately.
A
Cool, well, thanks for the discussion. This was helpful for me; hopefully it was for you all as well. Definitely.