From YouTube: 2023-06-12 Analytics Section Sync Recording
A
So, okay, based on the follow-up question, I just wanted to add a point at the beginning of the conversation, because I got the impression that there was a little bit of confusion about what Service Ping is and what Snowplow is, or what the distinction between those two is. I think clarifying that will help us be more direct in the follow-up points.
A
But this is not event-level data. This is aggregated information based on the number of rows in the database (the PostgreSQL database), like the number of rows in the issues table, or other information like the instance license MD5 or SHA hash. So, various kinds of information. All in all we have roughly two thousand metrics in Service Ping, and all of these metrics come together as a single JSON object sent via an HTTP POST request.
A
So it's definitely a very different structure from the event-based data, which reports atomic bits of information via single events. The second part of the conversation is Snowplow for SaaS, for GitLab.com, which reports event-based data that gets collected and aggregated later on in the downstream systems. Those two are very independent from each other and not really compatible.
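For illustration, here is a minimal sketch of the two shapes being contrasted above. Every field name and value is hypothetical; this is not the actual Service Ping or Snowplow schema.

```python
# Illustrative only: an aggregated, instance-level Service Ping-style payload
# versus a single event-level record. All fields below are made up.
import json

# Aggregated payload: one JSON object with counts, posted periodically.
service_ping_style_payload = {
    "uuid": "00000000-0000-0000-0000-000000000000",
    "version": "16.0.0",
    "counts": {
        "issues": 12345,          # e.g. number of rows in the issues table
        "projects": 678,
        "merge_requests": 9012,
    },
}

# Event-level record: one atomic action at one point in time.
event_style_record = {
    "event": "issue_created",
    "user_pseudonymized_id": "a1b2c3",
    "timestamp": "2023-06-12T10:15:00Z",
}

print(json.dumps(service_ping_style_payload, indent=2))
print(json.dumps(event_style_record, indent=2))
```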
B
I guess it's probably me that misunderstood that, because I'm pretty sure you've explained this to me before, but I appreciate you highlighting it again, just so myself and others are aware. Is it correct to say that Snowplow kind of came after Service Ping, or did they come around at roughly the same time? The reason I ask is that there are multiple events, thousands of events, being collected in Service Ping that aren't in Snowplow.
A
I don't have very specific insights, but I believe Service Ping was a little bit earlier than Snowplow. The main driver for the difference, for those two tools diverging from each other, was the ability to report from self-managed instances, and we got a lot of pushback and constructive feedback from the wider community.
A
That was when we tried to introduce event tracking at the instance level. Service Ping in its aggregated form is much more privacy-conscious, because it doesn't really report any information about what a singular user is doing (the number of issues or the number of projects doesn't really tell you anything about the user), and it's also a much smaller data stream coming from the instance. So the instance administrators feel much more in control: they can just review the payload and see exactly what is going out of the instance. With the Snowplow events, which could be millions, it's simply impossible to monitor every bit of data that's coming out.
A
There are a number of initiatives to bring those two data sets closer together, also driven by the fact that we have a bit of a challenge: how to replicate the Service Ping metrics per customer at the SaaS level. Service Ping runs at the instance level, and for self-managed, one instance is one customer; for SaaS that's not the case, it's thousands of customers combined together, which makes the life of customer support and customer success harder.
A
So there are a number of initiatives that help break down the SaaS instance into the particular namespaces which represent the customers, and for that we're also using the Snowplow events to mirror or replicate a summary of the Service Ping metrics, but not all of them. There is also an initiative called internal events tracking, which was recently started, where we try to provide a more cohesive, singular events API, because we've seen the confusion that you faced.
A
This is where we've come to on GitLab: we've had a lot of feedback that people are really confused about which events to track, and about what this tool does versus what the other tool does. So we recently started the initiative to bring all those tools together and provide a singular interface, but it's a very early start, and we're basically wrapping up the groundwork to even start building on top of it.
B
Thanks for that; that rings a lot of bells with what you previously explained to me about the method of collection. I mean, some people still send it manually; it's part of, I think, the license check for Service Ping in some contracts, for people that are in air-gapped environments, for example, or who just aren't regularly sending that payload.
B
So that's obviously a lot more of a process to figure out when we then have to start collecting potentially millions of rows of event data, things like that. So that makes sense, and I also now remember the problem with .com: we don't have that same level of granularity, since we are a much larger instance with different cohorts of customers, and things like that.
B
So thanks for clarifying that. Cool. So, on to why Nikolai clarified that: I had a couple of questions that I really wanted to bounce off of Analytics Instrumentation. There have been a few discussions that I've had with some members of Analytics Instrumentation, some with Product Analytics as well, and I wanted to understand better, with all of us in the room, what's actually possible.
B
What we can maybe move forward with, or what's not actually possible, in terms of using what I thought was Service Ping data (but which, to correct myself, is actually Snowplow data from .com) and then potentially using that as a way for us to instrument GitLab.com, or at least view that GitLab.com event data within product analytics dashboards. So I'm just trying to figure out which person and which point to refer to; I guess we can just go in order.
B
If you go to 2a1 here, which I'm currently highlighting in the agenda, I'm asking: is it worth potentially setting up some kind of ETL where we can grab data (and I'll clarify that to be Snowplow data from .com) and periodically import it into the production cluster, for the purpose of using it for product analytics features with the .com data that we already collect? And then, well, I see you have a point there.
A
I think the subject of the ETL was covered by Basti down below, with the suggestion to use different endpoints and not really backfill the data, but rather keep collecting going forward. But I will ask Basti to voice that, because I think he has a broader suggestion to share.
B
Cool. I'll just quickly cover my point (the one Basti responded to asynchronously), just for context. I was recalling that in our early discussions, when we wanted to implement Snowplow for product analytics, we wanted to use the self-describing event schema, and part of that was because what we currently use on .com for Snowplow is an older schema. So that was my understanding, but Basti, you've clarified it here.
C
Yeah, I think that understanding was correct. Our .com schema right now uses what I think Snowplow calls structured events, which is an event that always has, I think, a label, an action, a category, so a bunch of different string identifiers, and that's a kind of deprecated structure. I think it was originally inspired by Google Analytics, so it's roughly the same structure that Google Analytics uses, but Snowplow by now recommends these self-describing events and not using this old structure.
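For illustration, a minimal sketch of the two event styles being discussed; the field names and the schema URI below are hypothetical, not GitLab's actual definitions.

```python
# Illustrative only: the rough shape of a legacy "structured" Snowplow event
# versus a self-describing event. Values and the schema URI are made up.

# Structured event: a fixed set of string identifiers
# (category / action / label / property / value), similar to Google Analytics.
structured_event = {
    "se_category": "projects:issues:index",
    "se_action": "click_button",
    "se_label": "create_issue",
    "se_property": None,
    "se_value": None,
}

# Self-describing event: a reference to a versioned JSON schema plus data
# that must validate against that schema.
self_describing_event = {
    "schema": "iglu:com.example/issue_created/jsonschema/1-0-0",  # hypothetical schema URI
    "data": {
        "project_id": 123,
        "label_count": 2,
    },
}
```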
C
I think if we were really building Snowplow from scratch on gitlab.com, regardless of product analytics, we would also use self-describing events. So in theory, I think we could still do something like an ETL where we take the Snowplow data. The Snowplow data from .com is actually running on AWS, in an Amazon Kinesis pipeline, and then put into an S3 bucket. So there's a bunch of S3 buckets which just hold text files with the Snowplow data, and in theory we could grab those and actually transform them.
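As a sketch of what that grab-and-transform step could look like: the bucket name, key prefix, and column positions below are assumptions for illustration, and the real enriched-event format has far more columns than shown here.

```python
# Sketch of the ETL idea: read Snowplow enriched-event files out of S3 and
# reshape legacy structured events into a self-describing-style payload.
import csv
import io
import json

import boto3

BUCKET = "example-snowplow-enriched"   # hypothetical bucket name
PREFIX = "enriched/2023/06/"           # hypothetical key prefix

s3 = boto3.client("s3")

def iter_enriched_rows(bucket, prefix):
    """Yield tab-separated enriched-event rows from every object under the prefix."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            yield from csv.reader(io.StringIO(body.decode("utf-8")), delimiter="\t")

def to_self_describing(row):
    """Fold the legacy category/action/label fields into one illustrative payload."""
    category, action, label = row[0], row[1], row[2]  # column positions are illustrative
    return {
        "schema": "iglu:com.example/legacy_event/jsonschema/1-0-0",  # hypothetical
        "data": {"name": f"{category}_{action}_{label}"},
    }

for row in iter_enriched_rows(BUCKET, PREFIX):
    print(json.dumps(to_self_describing(row)))
```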
C
For example, our custom event right now has, I think, just a name. So in theory you could imagine taking this label, action, and so on, and either putting them into specific properties or just concatenating them into one long string, transforming it that way and then ingesting that data. At the same time, I think what's important to differentiate here is between events and page views, or the other standard events within Snowplow.
C
So there are page views; there are page pings, which tell you roughly how long a page view has been going on; and there are things like link tracking, so events that are specifically sent from certain plugins. Those are not affected by this; they're a special kind of event and not part of this custom self-describing event thing. So page views and so on we could just take and use as-is, in theory, but they're all part of the same big bunch of S3 buckets.
C
So just by looking at the bucket you can differentiate between those two. And then (we already thought a bit about this, and it's what Nikolai was already referring to) there is this confusion among our internal users, so people in GitLab who need to instrument an event, around: okay, what kind of instrumentation do I need to use?
C
Do I need to use this Service Ping Redis stuff? Do I need to use Snowplow? For that reason, I think what we would prefer, and what we're already doing right now, is encapsulating it all in one API. So you just call one method, called track event or something, and in the background we use all those separate systems automatically.
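For illustration, a minimal sketch of that kind of single entry point. The helper stand-ins and routing rules below are assumptions made up for the example, not GitLab's actual internal events API.

```python
# Sketch of a single track_event call that fans out to the relevant backends.
from collections import Counter

# Stand-ins for the real backends.
redis_counters = Counter()
snowplow_queue = []
product_analytics_queue = []

IS_SAAS = True  # assumption: event-level forwarding only happens on the SaaS instance

def track_event(name, **properties):
    """Record one product event through every backend that applies."""
    # Aggregated counters (Service Ping style): bump a counter, keep no per-user detail.
    redis_counters[name] += 1

    # Event-level streams (Snowplow and the product analytics cluster):
    # only where event forwarding is enabled.
    if IS_SAAS:
        snowplow_queue.append({"event": name, **properties})
        product_analytics_queue.append({"event": name, **properties})

# Instrumentation code only ever calls track_event; it never needs to know
# which backend the event ends up in.
track_event("issue_created", namespace_id=42)
```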
C
So we send things to Service Ping if it's on a self-managed instance, we send things to Snowplow if it's on .com, and you would then use the same API to send events to our product analytics cluster as well. That would be our idea. Then, in the beginning, we could figure out a way not to send all events at once, because what's also important to consider is the amount of events we have on GitLab: that's around 60 million events per day.
C
Around 7 million of those are page views, and this kind of volume, I think, also takes a toll on the infrastructure we have. So even if we just ETLed it into our system, ClickHouse would still need to be able to handle the millions of events that would accrue over time.
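As a rough back-of-the-envelope check on that volume: the daily figures are the ones quoted above, while the one-year window is an arbitrary assumption.

```python
# Rough volume estimate based on the figures mentioned in the discussion.
events_per_day = 60_000_000      # quoted above
page_views_per_day = 7_000_000   # quoted above
days = 365                       # assumed window for the estimate

print(f"~{events_per_day * days / 1e9:.1f} billion events per year")          # ~21.9
print(f"~{page_views_per_day * days / 1e9:.1f} billion page views per year")  # ~2.6
```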
B
The fire hose, like that... but no, I'm actually particularly interested in the fact that we have that much data coming in, so that we could actually use it for testing. It would be interesting to see. And my understanding (correct me if I'm wrong) from your response here is that we could do an ETL for old events, to bring the old event data into the new one.
B
But given the recent encapsulation efforts, it might be better just to move forward with sending new events to the product analytics clusters and doing it that way. That way we're not really spending any effort on old data; we just do it for new events and still basically get the same value, and since we have enough events, it wouldn't matter anyway.
C
Except if for some reason we want to test our system with billions of events at once; then we could, because all this data is still available in S3 buckets, and there are S3 buckets going back a long time. So if for some reason we need billions of rows of data, we can ingest it. We can, for example, write a script to ingest just the page views, if we're only interested in looking at graphs of page views, or transform the events that are there into a structure that's feasible for us to work with. And there is also existing knowledge about working with this data: the data platform team, for example, recently wrote scripts to go through all of this data to remove IP addresses, so they already had Python scripts running on the existing S3 bucket data.
A
ClickHouse has an integration with S3, as far as I know, so it would probably be possible to just connect the ClickHouse instance directly to the S3 buckets with the event files and build the whole ETL pipeline just in ClickHouse.
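For illustration, roughly what reading those files through ClickHouse's s3() table function over its HTTP interface could look like. The endpoint, bucket URL, target table, and column list are placeholders, and a real Snowplow enriched file has many more columns.

```python
# Sketch: ask ClickHouse to pull Snowplow files straight from S3 via the
# s3() table function. All names here are placeholders.
import requests

CLICKHOUSE_URL = "http://localhost:8123/"  # assumed ClickHouse HTTP endpoint

query = """
INSERT INTO events_raw (collector_tstamp, event, se_category, se_action, se_label)
SELECT collector_tstamp, event, se_category, se_action, se_label
FROM s3(
    'https://example-snowplow-enriched.s3.amazonaws.com/enriched/2023/06/*.tsv',
    'TSV',
    'collector_tstamp DateTime, event String, se_category String, se_action String, se_label String'
)
"""

response = requests.post(CLICKHOUSE_URL, data=query, timeout=300)
response.raise_for_status()
```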
B
Cool, yeah. Not to mention just the billions-of-rows scale, but also if we found it interesting enough that we wanted to connect the historical data we currently have with what we're collecting now, or when we do set that up. A random follow-up question that's come out of that, just to understand how much historical data is still on S3: did we ever get around to defining a data retention policy, or do we still have basically everything that we've ever collected? Okay, cool. So that's interesting.
C
And as far as I know, it's all also in S3. So it's not only in the data warehouse, because S3 is our actual backup of all these events in case the data warehouse would go down, so we could re-ingest them, but...
A
Okay, and we can go back by month, by year, whatever we decide fits. So we have a little bit of control over how far back in the past we want to go.
B
Yeah, that'll be really important, because that'll make the process a lot easier, especially for making sure we have enough space for all of it in the cluster, or whether we need multiple clusters, but also depending on our interest, whether we really want to go back that far or not. So it's good that it's organized like that. I didn't write it down, but as an action item I'll create something, probably an epic around this and some issues, to pursue this further. But I appreciate, Nikolai and Basti, hearing your input on this.
B
Sorry, I can't type; I'm terrible at taking meeting notes, since I can't type and talk at the same time. So then I asked about using the browser SDK, and at this point, well, I think it's still relevant, because what we're looking at as far as Snowplow is concerned is, specifically, that we're now collecting page views as well. I'm not sure if I should even mention this question, but basically the reason I asked it was: could we use a browser SDK on .com if it were able to collect events in the same way that Snowplow currently does on .com? But, Basti, to your point, that requires the use of the Kinesis pseudonymization pipeline, which would then basically be an additional part that wouldn't really mesh well with our current flow as far as pseudonymization is concerned. So I'll maybe table this part for now, if people are happy to read through it at this point.
C
I think it might still be... I mean, we have time, so I think it still might be helpful to just lay out all the options, because we have this ability to either, like we said before, ETL all the data in, or just new data in. But handling this amount of data with an ETL pipeline can also be a challenge.
C
I mean, we'd just have to talk to our data platform team, who handle this every day. The alternative is to actually add these additional SDKs.
C
And there, I think, the main point that's important to understand is that we are pseudonymizing quite a bit of stuff right now for gitlab.com, and it's written down in our terms of service or something that this data isn't collected. So it would be important for us to do the same if we ever add an additional layer of collection to gitlab.com. There are enrichers and kind of easy ways to do pseudonymization with Snowplow, but mostly on specific properties. The one thing that I think we are doing specially, which is not easily covered, is actually looking at URLs, because in theory, in GitLab, namespaces or group names or project names could expose personal data, like the name of a project, and so on. So those are actually getting pseudonymized, or something like that, in this pipeline.
C
I think the important part here is also that in the current pipeline this is running on Kinesis, but the part that does the conversion is a Ruby Lambda, so it's just a bunch of Ruby code. We could theoretically also try to modify that to run with Kafka as well, and the actual code to do the pseudonymization could probably stay the same, because the structure of the Snowplow data is the same.
C
It's just the connection between them that's a different type of connection: instead of doing it with Kinesis, you do it with Kafka. There are also alternatives to replicating that code, because it's not that much, the actual part that takes the data and pseudonymizes it. You could also convert that, for example, into JavaScript, which would be easy to use as an enricher: Snowplow has the possibility to just add JavaScript enrichers directly, so you don't need to set up a separate function or anything.
C
So this would also be a possibility, but I think the big part here is that this is very different from what our customers probably want, because this is a very GitLab-specific thing. So either we would have to have a different cluster, I think, just for the GitLab data to go through (which, I don't know, maybe is a good idea anyway), or, alternatively, have the code look at: okay, is this a GitLab project? Then do the pseudonymization, and if not, then not.
C
The one thing is, I think, that with the Ruby SDK we can theoretically just choose to send data that doesn't need to be pseudonymized: as long as we don't send namespaces, user IDs and so on, we could just start sending events without any additional user information. With the browser SDK it's different, because it's the browser, it's page views being tracked, and so on. So as soon as you put that in, you're going to send information that is potentially PII data.
C
So, theoretically, on the Ruby side we could start implementing without this being in place; on the JavaScript side (the web, mobile web, browser side), no.
C
Yeah, so it's similar to our current cluster, the Snowplow part and so on. This is also just set up on AWS; it's not in Kubernetes, it's EC2 instances for the Snowplow collectors, and Kinesis instead of Kafka, but otherwise it's a similar pipeline. It's just then not ending up in ClickHouse, but rather in these S3 buckets, which are then getting ingested into Snowflake, which is our data warehouse.
A
Around the page views, I want to specify one thing: I'm like 95% sure that the URL pseudonymization you mentioned actually happens on the server side, in GitLab, when the page is loaded into the browser.
A
So the Snowplow browser SDK is already sending pseudonymized URLs, because on Kinesis we don't have a way to map the namespace names to the namespace IDs. So the GitLab server resolves it: it gets the HTTP request, finds the namespace name or project name, resolves it back to the ID, prepares the pseudonymized URLs, and sends them with the HTML. From that, the URL is replaced just for Snowplow and sent back through the pipeline.
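For illustration, a rough sketch of that server-side masking step: before the page is rendered, namespace and project names in the path are swapped for numeric IDs, so the browser-side SDK only ever sees pseudonymized URLs. The lookup tables and the placeholder format below are assumptions, not GitLab's actual implementation.

```python
# Sketch: replace the namespace and project segments of a request path with IDs
# before the URL reaches the browser-side tracker.
NAMESPACE_IDS = {"my-secret-group": 42}   # hypothetical name -> ID map
PROJECT_IDS = {"my-project": 1337}        # hypothetical name -> ID map

def mask_path(path):
    """Swap the first two path segments (namespace/project) for opaque IDs."""
    segments = [s for s in path.split("/") if s]
    if len(segments) >= 2:
        segments[0] = f"namespace{NAMESPACE_IDS.get(segments[0], 0)}"
        segments[1] = f"project{PROJECT_IDS.get(segments[1], 0)}"
    return "/" + "/".join(segments)

print(mask_path("/my-secret-group/my-project/-/issues/1"))
# -> /namespace42/project1337/-/issues/1
```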
B
Well, I guess first we have to get it out to customers so they can try it, but, you know, Basti mentioned we're not sure whether they would want this. But if we're building this for application developers, and they're able to see (and now I can't say it anymore) pseudonymized namespaces when they're doing it in their applications, that might actually reduce the barrier for them to be able to collect event data without collecting PII.
B
So for the same reasons that we're trying to get instrumentation set up more, maybe there are potentially customers that have a similar situation as well, but I guess we won't know which one will have to come first. At any rate, I think we would likely need a separate cluster and environment for this specifically, which we can do now: connecting a project to a different cluster.
B
So we could theoretically connect gitlab-org/gitlab to a separate cluster, which would then have something to do the pseudonymization (I can say that version of it), and then later on we can explore whether other customers need it, and maybe actually build it in for everyone. But anyway, bottom line is I'll create an issue for us to really outline this, since we've started getting into the technical design for it. But cool, good discussion. Thank you, thanks.
B
Everyone, I think we're at time at this point. Basti and I had wanted to just give an overview of what's going on; Basti has already written up what's happening, and I will write down what's happening too, but we're already at time. So unless there's anything anyone else would like to call out, I'll just pause for a second, in case anyone wants to talk about anything in the few minutes we have left.
B
All right, well, good to see everyone. Thanks for the discussion, and I hope everyone has a good rest of their Monday and a good rest of your week. Take care.