From YouTube: 2023-08-28 Analytics Section Meeting
A
Hello everyone, and welcome to the August 28th Analytics Section meeting. It's a fairly light agenda, so please feel free to add anything you'd like to discuss, or show off in our little show-and-tell section.
A
But the one topic that I really wanted to talk about is potential solutions for how we could better deal with events that are missing columns. In the past couple of weeks, and again today, or over the weekend, we've had some events come in that are, I guess, missing columns, and as ClickHouse is trying to parse them, it just fails.
A
ClickHouse is saying it's missing tabs, because the events are tab-separated values, but it seems like we might be missing values as well. On one hand, we should be able to handle these better once we don't use the Kafka engine in the in-cluster ClickHouse, because ClickHouse Cloud doesn't support that. But on the other hand, this is still going to be a concern for self-managed installations of the cluster and things like that.
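(To make the failure mode concrete, here is a minimal sketch, not the actual ingestion code: a TSV row with a missing value no longer has one field per schema column, which is roughly what ClickHouse reports as missing tabs. The column names are assumptions.)

```python
# Minimal sketch (not the real pipeline) of the TSV failure mode being
# discussed: a row missing a value has fewer tab-separated fields than the
# schema expects. Column names here are assumptions.
COLUMNS = ["event_id", "collector_tstamp", "stm"]

def parse_tsv_row(line: str) -> dict:
    values = line.rstrip("\n").split("\t")
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(values)}")
    return dict(zip(COLUMNS, values))

print(parse_tsv_row("abc-123\t2023-08-28 10:00:00\t1693216800000"))  # parses
try:
    parse_tsv_row("abc-123\t2023-08-28 10:00:00")  # one value short
except ValueError as err:
    print(err)  # expected 3 fields, got 2
```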
A
So I was curious if we had any ideas about how we could potentially better handle these. There's a related issue, linked by someone who's not on the call, for the Analytics Instrumentation group to look at, so he invites any input there. And then, Anka, you had a point here as well. Would you like to vocalize that?
B
Yeah, sure. So I already looked at that particular issue where the event failed, and it seems like the stm field, which is a timestamp, is not being passed in the payload. That's the reason it fails: in ClickHouse, for the Snowplow events table, we defined the schema with that column as a DateTime, and the value isn't getting there. So there are three possible solutions. We accept the event and add the current timestamp...
B
...if it's not present; we can discard the event in the enrichment process; or, as the third option, we update the schema for the Snowplow events table to accept nulls. But we also need to keep in mind to check all the fields that are present, because right now we only know that stm is an issue, but it could be other fields as well.
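(To make the trade-off concrete, here is a hedged sketch of how those three options might look at the enrichment step; `stm` comes from the discussion above, but the function name, policy names, and queue handling are hypothetical.)

```python
# Hypothetical sketch of the three proposed policies for an event missing
# `stm` (the Snowplow sent timestamp). Not the real enricher's API.
import time
from typing import Optional

def handle_missing_stm(event: dict, policy: str = "discard") -> Optional[dict]:
    if event.get("stm"):
        return event  # field present, nothing to do
    if policy == "default":
        # Option 1: accept the event and stamp it with the current time.
        event["stm"] = str(int(time.time() * 1000))
        return event
    if policy == "discard":
        # Option 2: drop it during enrichment (ideally into a bad-events queue).
        return None
    if policy == "allow_null":
        # Option 3 is really a ClickHouse schema change, e.g. declaring the
        # column Nullable(DateTime); the enricher then passes the event through.
        return event
    raise ValueError(f"unknown policy: {policy}")
```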
A
Right. I guess it speaks more to how we want to handle fields in general, like null values and all that kind of thing. Because we can address it for stm, like you said, but if someone else leaves out another column, or the schema changes, then this issue could be reproduced again, right?
A
It's interesting how a missing field is causing a parsing issue, though. I'm not clear on it yet; I still haven't looked at the ClickHouse error more in depth, but I'm curious how that works. Is it trying to match against... no, it's not even able to parse it correctly. Because I was going to ask: is there a mismatch between the columns to insert and the missing field?
A
But if it's an optional field as well, then that wouldn't make sense. Yeah, so you're saying that we could potentially deal with fields that come in missing in the enrichment process?
B
So what do you think? Should we discard the event if it doesn't have an stm, or should we try it with some placeholder value there, or...?
A
That's fine, but we need to handle that better, so it's not just halting the processing of other events. So I think discarding it is fine, but it's not just an issue specific to the JavaScript SDK; I think it has to be something more general. I don't know if that's something we can add to the overall enrichment process. It can go in the enrichment process? Okay, okay, yeah, that would make sense then.
A
Yeah, so in that case I would say I'm fine with discarding events, because we know that we should be sending the right event fields if they're using our SDK. I'm less concerned otherwise, and I'm open to everyone else; I'm curious...
A
...what everyone else thinks about this, but I'm less concerned about the case where someone sends us a custom event, it doesn't work, and they come to us and say, hey, I didn't use your SDK, but it's not working. Well, we can work with them to figure out how to get it working for their use case, but we really should only be supporting the SDK. I think adding, not placeholder data, but default values for fields that they haven't supplied doesn't seem like the right fix there.
A
So I would say, basically, if you're not using the SDK and your events get discarded, that's not really on us, so to say. Any other thoughts on that?
C
I share the same opinion on this topic. I also feel like we should discard these events, and probably push them to the bad events queue. And one thing worth noting is that, right now, we have this problem with the stm attribute, but it would probably be best to find a more holistic approach, where we make sure that it will not break again because someone misses some other attribute.
A
That's exactly it. It sounds like it's good that we can do this in the enrichment process, because I wasn't sure if we'd have to introduce another event validation process, which would obviously make things a bit more complicated and complex. But yeah, if we can add something in the enrichment process to just put these events into the bad queue, then we at least have a place where we can recognize that there are people sending events in this way.
A
But is there a way we can handle that with our existing schema, such that it's not just specific to the stm field? That's an open question for the Analytics Instrumentation group, I guess: is there a way to discard events generally? I don't know if this just means we have to validate against our schema every time an event goes through, but how do we deal with it so that it handles any field, not just stm?
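(One hedged sketch of what that more holistic check could look like, assuming the enrichment step knows the table's required fields; all the names here are placeholders.)

```python
# Illustrative only: a schema-driven check that catches any missing required
# field, not just `stm`, and routes the event to a bad-events queue with a
# reason attached so it stays debuggable. Field names are assumptions.
REQUIRED_FIELDS = {"event_id", "collector_tstamp", "stm"}

def route_event(event: dict, good_queue: list, bad_queue: list) -> None:
    present = {k for k, v in event.items() if v not in (None, "")}
    missing = REQUIRED_FIELDS - present
    if missing:
        bad_queue.append({"event": event, "missing_fields": sorted(missing)})
    else:
        good_queue.append(event)
```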
D
C
A
No, but that sounds like a good lead as far as a potential solution goes. So I'll check the related issue that was linked, and then I can update it to say, hey, this is the direction I want to head towards, and then we can continue down that path. I think that's probably the best way to go about it. And, semi-related, you'll see this in the production readiness review merge request.
A
In that merge request, which I've linked in the show and tell, there's a bit of a different architecture for what we're trying to do for .com. For context, for those that don't know yet: for a ClickHouse instance that's part of the cluster, there's a table engine called Kafka, which allows ClickHouse to actually pull the events from Kafka, but ClickHouse Cloud does not support that.
A
So we actually have to push events from Kafka to ClickHouse, and that's why I've introduced another piece into the cluster, called Vector, which could potentially do this validation process.
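(As a rough sketch of that push model: this is hand-written Python using kafka-python and clickhouse-driver, not our actual Vector configuration, and the topic, host, and table names are assumptions.)

```python
# Sketch of pushing events from Kafka into ClickHouse Cloud, since the Kafka
# table engine isn't available there. Vector does this in practice; the names
# and the lack of type conversion and batching here are simplifications.
from kafka import KafkaConsumer          # pip install kafka-python
from clickhouse_driver import Client     # pip install clickhouse-driver

consumer = KafkaConsumer("snowplow_enriched_good", bootstrap_servers="kafka:9092")
clickhouse = Client(host="clickhouse.example.com")

for message in consumer:
    row = tuple(message.value.decode().rstrip("\n").split("\t"))
    # A service like this is also a natural place to validate rows and skip
    # or reroute bad ones instead of halting consumption.
    clickhouse.execute("INSERT INTO snowplow_events VALUES", [row])
```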
A
But of course that is a further downstream process, where the event has already gone through the Snowplow enrichment flow. So if we can have a solution in the enrichment process, then I think that would be the best-case scenario, as opposed to trying to validate events again, or discarding events again, in a further downstream service.
A
So, cool, I'll update the issue there. Thanks for chiming in. And then, semi-related, a little bit of show and tell, if no one has anything else. I just wanted to share the production readiness review merge request. It's quite an old request, but the architecture has changed quite a bit from when we initially used Jitsu and switched over to Snowplow.
A
I just wanted to call that out for people who haven't seen it yet; there's an updated architecture diagram as well. I wanted to call attention to that if you're curious about how everything is working there. I'm open to any feedback or questions, also because it talks about disaster recovery, backups, and monitoring, if you have any ideas about how we could do that. A few of us, as we've been going through these production incidents, have been discussing...
A
...how we could monitor each part of the event pipeline, to have better visibility and be more proactive about these types of issues. I'm open to any ideas there as well. But yeah, cool, thanks for the discussion. That's the agenda, so if there's nothing else... well, let me ask it this way: is there anything that anyone would like to discuss?
A
Cool. There was a follow-up question that I was curious about, related to this potential solution, where I was asking if the service ping event pipeline was susceptible to this. Of course, the architecture is a little bit different, since it's going through Wanda and, I believe, Kinesis, because the data team set that up. Did you want to vocalize that point, or...?
D
Yeah, sure. I think it's more of a question for the data team, because when we enrich Snowplow events, they all land in an S3 data lake, and from there they are picked up to be included in the Snowflake models.
A
It's just a thought, because, of course, it's completely different: service ping is not something that we open up for others to... well, I mean, they can submit it through CustomersDot, but it's a little bit of a different process in terms of how the data gets handled, like you mentioned. I was just curious.
A
You know, another alternative, or another potential solution, is what I mentioned with Vector processing the events. Maybe it's just a good sign for us not to use the Kafka table engine, because it seems to just get hung up on one bad event, unless there's a configuration setting to say, hey, if you can't parse this part of the Kafka log after X amount of tries, you should just move on. But yeah.
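(For what it's worth, the Kafka table engine does have a setting in that spirit: `kafka_skip_broken_messages`, which tolerates up to N unparseable messages per block instead of stalling the consumer. A sketch follows; the table definition, topic, and threshold are assumptions.)

```python
# kafka_skip_broken_messages lets the Kafka engine skip up to N messages it
# cannot parse per block, rather than getting stuck on one bad event.
# The table definition, topic, and threshold of 100 are assumptions.
from clickhouse_driver import Client

Client(host="clickhouse.example.com").execute("""
    CREATE TABLE IF NOT EXISTS snowplow_kafka_source
    (
        event_id String,
        collector_tstamp DateTime,
        stm DateTime
    )
    ENGINE = Kafka
    SETTINGS kafka_broker_list = 'kafka:9092',
             kafka_topic_list = 'snowplow_enriched_good',
             kafka_group_name = 'clickhouse_consumer',
             kafka_format = 'TSV',
             kafka_skip_broken_messages = 100
""")
```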
A
The good thing, though, is that once you clear a bad event, the rest go through. So all the events from the past few days should be showing up now. But yeah, it's all good practice for when customers will be playing with our service. Cool, that's the end of the agenda then, so I'll give it another few seconds if anyone wants to talk about anything. Otherwise, everyone gets 17 minutes back. Thanks for joining the call, and have a good rest of your Monday and rest of your week.