From YouTube: Product Analytics Sync - Data Export
A: So the thing we're trying to solve, well, it's not a problem exactly, is that we want to be able to export the underlying data from ClickHouse, or in other words, get the data out in some way so that it's portable. There are two options that we've got so far.
A: One is, in my mind at least, low effort, medium reward, and the other is high effort, high reward. The low-effort option is using Cube.js to surface a data export. As far as I'm aware, there's no engineering work there, only documentation work. Just to recap for anyone watching: by listing a set of dimensions, which map pretty much one-to-one to ClickHouse columns, or ClickHouse rows I should say, we can essentially output a table.
A: There are a few glaring downsides to this, one of which is that the data is returned in JSON format. Whoever's exporting data using this method would have to either be okay with that or process it themselves into CSV or whatever other format they required; by default it would be in JSON.
A: A couple of other problems are that there's a SQL row limit, whether that's set in Cube by us I'm not sure, but even if it's not, there will come a point where, if you're trying to export a hundred thousand, a million, ten million rows, you're going to hit limits somewhere, and that's going to be slow or even impossible.
A: In some cases we could probably do some magic with pre-aggregations, but then we'd end up storing essentially an entire copy of a ClickHouse table in a pre-aggregation, which doesn't really solve any problems. Whoever's exporting this data would probably also have to implement pagination, so it could get long, and it could get complicated.
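The pagination mentioned here could be sketched along these lines, assuming an offset-and-limit style API; the `fetch_page` lambda is a stand-in for real calls through the proxy, not an actual client:

```ruby
# Sketch of offset-based pagination over a Cube-style query API.
# `fetch_page` stands in for a real call through the proxy API and is
# expected to return an array of rows for the given window.
def export_all_rows(page_size:, fetch_page:)
  rows = []
  offset = 0
  loop do
    page = fetch_page.call(limit: page_size, offset: offset)
    rows.concat(page)
    break if page.size < page_size # a short page means the data is drained
    offset += page_size
  end
  rows
end

# Fake data source standing in for the ClickHouse-backed results.
data = (1..25).map { |i| { "eventId" => i } }
fetch = ->(limit:, offset:) { data[offset, limit] || [] }
all = export_all_rows(page_size: 10, fetch_page: fetch)
puts all.size # 25
```

This keeps each request under whatever row limit Cube enforces, at the cost of one round trip per page.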
The main glaring upside to this is that there's no engineering work to do right now.
A: So anyone with maintainer access to a project could, in theory, send this query, or something similar, via the proxy API and get the correct data out, as long as they had access to Product Analytics.
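For illustration, a Cube-style load query could look like the following minimal Ruby sketch; the dimension names, the row limit, and the endpoint path in the comment are assumptions for the example, not the documented schema:

```ruby
require 'json'

# Hypothetical Cube.js "load" query. The dimension names below are
# placeholders for whatever the documented schema exposes; they map
# roughly one-to-one to ClickHouse columns.
query = {
  dimensions: %w[
    TrackedEvents.eventId
    TrackedEvents.eventName
    TrackedEvents.derivedTstamp
  ],
  limit: 5_000 # Cube caps rows per request, so large exports need paging
}

# JSON body that would be POSTed to the proxy API (the endpoint path
# below is an assumption for the sketch):
#   POST /api/v4/projects/:id/product_analytics/request/load
body = JSON.generate(query: query)
puts body
```

The response would come back as JSON rows, which is where the format and row-limit caveats above apply.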
B: Okay, so I guess that's where my point of confusion was, because I was looking at the screenshot of the playground, which is not enabled in production mode. Gotcha. It is actually enabled right now in the production cluster, which shouldn't be the case, but that's a different problem. But I think what you just said, that the input and the output all happen through our proxy with the Cube API, right? So that's okay.
A: Yeah, so it would require us to document what the columns are. I guess if we moved to Snowplow that could change, but again, that's something we could document as part of our standard documentation process. So yeah, in terms of no engineering work.
A: That does work, with the caveats that I've outlined in terms of limits. Obviously the hope is that those using this will hit hundreds of thousands, if not millions, of data points, which would be great for us in terms of user adoption, but will give us trouble, and we'd need to load test it, or test it pretty hard.
C: That really clarifies my point of confusion, because I wasn't quite sure, when we said no additional work, whether we were talking about just the backend piece. It sounds like, since you just said maintainers, I as a customer would be able to use the docs and go directly to do the download, with scaling considerations, which I think gets me back to where I started.
C: My initial understanding was that a customer could use the documentation to get what they need, which I'd be very happy about for our first boring sort of iteration. Definitely, on the scale comment: if we do get to that point, and I'm assuming we will, that seems like something we could come back and revisit, but at least for our first iteration of export.
C: Really, our goal is to make sure people aren't locked into GitLab: if they want to get their data to take it to another service provider, or if they want to do some custom processing with Ruby or Python or whatever, we're not locking their data in. So that makes me feel good about this, since we clarified that point. Thank you for walking through that. Okay.
A: That makes sense; I think we're on the same page there. In my mind at least, it makes for a do-nothing first iteration, which is always nice. In terms of not locking people in, this goes some way towards that, short of the more extensive solution, which I'll talk about in a minute.
A: There's nothing to stop us, if we're only talking about a handful of customers initially, and one of those is very keen on exporting their data, from manually doing so on their behalf.
A
Obviously,
that's
not
a
scalable
solution,
but
if,
if
someone
wants
their
data,
we
have
it
and
we
can
give
it
to
them,
we
can
even
upload
it
into
object,
storage
for
them
from
clickhouse,
which
leads
me
nicely
to
sort
of
the
better
point
of
a
better
word
solution,
which
is
to
export
the
clickhouse
data
directly
from
clickhouse
to
object,
storage
of
their
choice
or
whatever
it's
configured
to
so
github's
already
got
support
for
object,
storage,
I,
assume,
S3
and
gcp,
or
possibly
others
I'm,
not
really
sure,
and
the
idea
would
be
to
start
a
background
job
which
would
export
clickhouse
table
using
native
clickhouse
functions
for
exporting,
save
that
file
locally
on
the
file
system
and
then
upload
it
to
whichever
defined
object.
A: That storage is whatever the self-managed instance has configured, or in the case of .com, wherever we keep object storage. The first thing that jumps out at me is that on .com, I assume we're the administrators of the object storage, which means we could be storing potentially very large files, and there's a cost implication there.
A
But
the
the
big
thing
here
is
that
it's
scalable
in
the
sense
of
we
just
use
clickhouse
to
export
the
file
and
any
any
number
of
Records
should
be
able
to
handle
this,
especially
if
it's
running
a
background
job.
But
it
will
take
time.
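As a rough sketch of that native export step, ClickHouse's `INTO OUTFILE` clause (as used from clickhouse-client) can write query results straight to a local file; this helper only builds the statement, and the table name and path are illustrative placeholders:

```ruby
# Builds a ClickHouse export statement using the native INTO OUTFILE
# clause. Table name, path, and format are illustrative placeholders,
# not the real Product Analytics schema.
def export_statement(table:, path:, format: 'CSVWithNames')
  <<~SQL.strip
    SELECT *
    FROM #{table}
    INTO OUTFILE '#{path}'
    FORMAT #{format}
  SQL
end

sql = export_statement(table: 'tracked_events',
                       path: '/tmp/tracked_events.csv')
puts sql
```

The exported file would then be picked up by the background job and pushed to object storage.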
A: This feels like the scalable solution, in the sense that even if we only make the download available for, I don't know, a few hours or 24 hours, and then remove the file from object storage, we're providing it as an export. It won't be available forever; if you want it again, you'll have to generate a new export, which should keep costs down to a fairly manageable level.
A: Yeah, this is the solution, but the big red flag, as I'm sure Dennis is thinking right now, is that it requires direct access from the GitLab application to ClickHouse, which currently we only have via Cube's abstraction layer. So that becomes an infra task then.
B: Yeah, and we don't want to directly expose ClickHouse anyway; we want to implement a proxy in front of it, which I've mentioned before, CH proxy, and we'd have to map out that whole interaction model. But we've spoken about this particular solution before; I think it was the last time we actually had a synchronous discussion about exporting. So we'll have to map that out, but ultimately, I think, yeah.
A: So the implementation would be made up of a background job, an export background job, which would, in the background, make the ClickHouse call to generate the file. I don't know how the internals work in ClickHouse; I don't know if exporting a single table locks anything else.
A: So we need to make sure that it doesn't affect any other consumers or any other projects. But assuming it doesn't, the ClickHouse command, which we'd run directly via CH proxy, would export a file, which we'd save somewhere locally, temporarily. That same background job would then use fog, for whichever object storage was configured, upload the file somewhere useful, and make it available to a set number of users.
A: I assume we'd have a signed URL, which we would then pass to whoever generated the export job, and put a lifecycle rule on that file for it to expire after a certain amount of time, at which point we can delete it. That person has access to their data and can do what they like with it.
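Put together, the job described above could look roughly like this; the collaborators are injected so the flow can run in isolation, and every class and method name here is a hypothetical stand-in (real code would call ClickHouse via CH proxy and upload via fog):

```ruby
# Rough sketch of the export background job. Both collaborators are
# injected stand-ins: `clickhouse` for the CH proxy call, `storage`
# for a fog-backed object store.
class ProductAnalyticsExportJob
  URL_TTL = 24 * 60 * 60 # seconds before the signed URL (and file) expire

  def initialize(clickhouse:, storage:)
    @clickhouse = clickhouse # responds to #export(table, local_path)
    @storage = storage       # responds to #upload and #signed_url
  end

  def perform(table)
    local_path = "/tmp/#{table}-export.csv"
    # 1. ClickHouse writes the table to a local file (native export).
    @clickhouse.export(table, local_path)
    # 2. Upload the file to whichever object storage is configured.
    key = "exports/#{table}/#{Time.now.to_i}.csv"
    @storage.upload(key, local_path)
    # 3. Hand back a signed URL; a lifecycle rule would delete the
    #    object after URL_TTL so exports are not kept around forever.
    @storage.signed_url(key, expires_in: URL_TTL)
  end
end

# In-memory fakes standing in for CH proxy and fog.
class FakeClickhouse
  attr_reader :exports

  def initialize
    @exports = []
  end

  def export(table, path)
    @exports << [table, path]
  end
end

class FakeStorage
  attr_reader :objects

  def initialize
    @objects = {}
  end

  def upload(key, path)
    @objects[key] = path
  end

  def signed_url(key, expires_in:)
    "https://storage.example/#{key}?expires_in=#{expires_in}"
  end
end

ch = FakeClickhouse.new
store = FakeStorage.new
job = ProductAnalyticsExportJob.new(clickhouse: ch, storage: store)
url = job.perform('tracked_events')
puts url
```

Injecting the ClickHouse and storage adapters keeps the job testable before the CH proxy interaction model is mapped out.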
A: Well, in which case I will write up two issues, one for each of these, and ping you both on them, just so we make sure we're in agreement about what it is we want to do. We can schedule the first one as soon as you like, Sam, and the second one, as Dennis said, might just take a bit longer to think through exactly how it's going to work.
B: Yeah, I think for the second one, the export to object storage, you might be involved. I believe there will be follow-up investigations for the frontend, in terms of how to trigger the job, or whether we do API only, but also for the infrastructure.
B: It's an investigation, which is why I said you might actually be involved. But I think once we have the plan laid out from your perspective, from the backend, then we can start the investigations from the other angles. Okay.
A: Cool, well, I'll do that this afternoon, so you should have those issues today or tomorrow. Right, awesome.
C: This was good; I'm glad we got the chance to sync up on this on a call, much easier than going back and forth on the issue. Oh yeah, for sure.
B: Yep, as far as the export is concerned, I think we're good, and then we'll follow up on these and get working on the documentation side of it.