From YouTube: 2021-09-16 Paul, Ved, Dennis and Radovan discuss how to explore the data pipeline setup
Description
Discussing the solutions we will explore in order to find the optimal approach for the Postgres extraction part.
A
Yes, this is about exploring the gitlab.com data pipeline setup. As I understand this issue, it's about creating an extraction tool, or finding some other way to do the same thing. As I said yesterday, I also spoke about this, and the proposed solution was maybe to split this work, try each point from here, spend some time on that, and also open an agenda. So put your information here so we can draw some conclusions about how we can move forward with this issue.
B
So from here, what we see is that we have five options, five or six in total. For one of the options, the licensed DB extract, we haven't put a lot of description around it, but for the licensed DB extract they pump the data from Postgres to our bucket, we pick it up from there, and our Snowpipe picks it up and loads the data into Snowflake.
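As a rough illustration of that bucket-to-Snowpipe pattern, here is a minimal Python sketch using snowflake-connector-python; the stage, pipe, table names, and bucket URL are hypothetical, not the actual setup.

    # Minimal sketch of the bucket -> Snowpipe -> Snowflake pattern described
    # above. All object names and the bucket URL are made-up examples.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="example_account", user="loader", password="...",
        warehouse="LOADING", database="RAW", schema="LICENSE_DB",
    )
    cur = conn.cursor()

    # External stage pointing at the bucket the source side dumps into
    # (in practice a storage integration would be needed for auth).
    cur.execute("""
        CREATE STAGE IF NOT EXISTS license_dump_stage
        URL = 'gcs://example-bucket/license-db/'
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
    """)

    # Snowpipe that auto-loads new files from the stage into a raw table.
    cur.execute("""
        CREATE PIPE IF NOT EXISTS license_dump_pipe AUTO_INGEST = TRUE AS
        COPY INTO licenses_raw FROM @license_dump_stage
    """)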
B
So it is an automated process, but the volume of data there is still low. If we need to do a hundred percent refresh of all the SCD tables, I'm not sure whether that would be sufficient, copying GBs of data on every daily run. But yeah, there are six options that we have, and I think the task for us is to decide from here which one would be the best, either in this OKR or in the next OKR.
B
That's the first thing. The second point is whether it is possible at all, what the blockers are and so on, because for most of the options the challenge is getting access to the Postgres logs, access to the production PostgreSQL logs. I am not sure how comfortable the infrastructure team or the DB team would be sharing those openly, or giving access to a third-party tool like Stitch or Fivetran, when, for Stitch for instance, there was a huge concern that someone else is accessing their data.
B
I'm not sure how, but yeah, until we start digging into those areas we will not be able to say where we will end up with this. So what we're trying to finalize in this meeting is what we get out of this issue as a next step: do a PoC and come up with results, for example that out of the six options five are doable and one is not.
B
Then maybe two of those have hard blockers that we can't take further, because the infrastructure team has raised a red flag on top of it, or the security team has raised a red flag on top of it, and we can't bypass that. Then for the remaining three we come up with the time and effort it takes to do it. That's my take on this. Dennis, can you add something?
C
Yeah. The only thing, as I understand it right now, is that executing a PoC for all five options may be a little bit of overkill; that's quite time-consuming.
C
So what I would prefer is to rank the solutions that we now have on the table, decide which one is the most preferred solution for us with all the information that we have right now, and with that get in contact with the infra team, get in contact with the security team, and have a hypothetical, theoretical discussion. If they don't have anything against it, then start to do the PoC. I think that's a little bit more efficient.
B
So is there a way they can expose the log file? Do they see any concern with exposing the log files, so that whatever changes happen, we can pick them up from the logging system? If they say yes, then we have the other four options open, and those four can be worked out. Otherwise those four stop then and there, if they say that they will not share the log file or they can't share it.
C
I think the log files are the most convenient way forward, for two reasons. I think it's the most efficient option, because you don't need to scan a complete table in terms of a full load or incremental loads. You also need to filter, and some tables don't have an index for that, so that's also time-consuming. So I think the log files would be a huge benefit, one, for performance and scalability, but also to detect any deletions: right now we're not going to catch deletions unless we do a complete table scan and check it against our own database.
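To make that cost concrete: without log access, catching deletes means diffing the full key set of a source table against our own copy, along these lines. A minimal sketch assuming psycopg2, with hypothetical connection strings and table names.

    # Sketch of delete detection WITHOUT log access: a full key-set diff.
    # Connection strings and table names are hypothetical examples.
    import psycopg2

    src = psycopg2.connect("dbname=gitlabhq_production host=replica.example")
    copy = psycopg2.connect("dbname=warehouse_copy host=copy.example")

    with src.cursor() as cur:
        cur.execute("SELECT id FROM projects")   # full scan of the source keys
        source_ids = {row[0] for row in cur}

    with copy.cursor() as cur:
        cur.execute("SELECT id FROM projects")   # full scan of our copy
        copied_ids = {row[0] for row in cur}

    # Rows we still hold that no longer exist upstream were deleted at source.
    deleted_ids = copied_ids - source_ids
    print(f"{len(deleted_ids)} rows deleted upstream")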
C
But if we need to do that for all the tables, I think that is too cost-consuming. So I think the log files are the way forward. And then, honestly, I think Fivetran and Stitch are not an option, because we would be attaching a third-party tool to our internal systems. I think that would be a no-go for security, and I think it also costs some money, because we have quite some data to be transferred, and for Fivetran and Stitch the cost model is data-driven: the more data you transfer, the more you pay.
C
So my preference is to investigate to which extent we can connect to the log files and whether that's an option, and then find the way forward: what the best way of implementation is. Is that creating something ourselves based on the log files, or are we going to reuse a Singer tap which is already available, which we can leverage in Meltano or maybe somewhere else?
D
Yeah, it's in the comments: Justin did also do some investigation on this, I'm not sure if everyone saw it, but he does not recommend Meltano for this use case, basically. So yeah, if we were going to do a PoC on Fivetran or Stitch I'd say we should pick one, but at the same time I'm also very reluctant to move there. There are a few reasons that I also put in the comments there; yeah, I would also prefer to keep the process internal.
D
I completely agree that using the log files would probably be the best next step. I also think we can probably include some parts of step one: I think source optimization is always a good idea, and we should probably look at any easy wins that we can get out of that. Although I'm not sure that there are many easy wins left, it's always good to do a small check on that.
B
Logging, so monitoring, is basically the major thing, and the stability of the overall product itself.
D
The last one, right? Yeah, yeah, just a few points up there, this one.
B
Okay, so Stitch and Fivetran would be a very costly process, even if we get access. So I think the very first step is to check with infra whether they can give us access to the logs. Only if they give access to the logs can we do in-house development, using Kafka or something like that: we do Kafka streaming of the bin logs and dump them into Snowflake. That's a good option technically, but it might take time to develop it; the development and all of that comes second.
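A rough sketch of what that in-house streaming path could look like, assuming psycopg2's logical replication support, an existing wal2json-style replication slot, and kafka-python. All names here are hypothetical illustrations, not an agreed design.

    # Sketch: stream Postgres logical-decoding changes into a Kafka topic,
    # from which a separate consumer could load Snowflake. Assumes a wal2json
    # replication slot already exists; all names are hypothetical.
    import psycopg2
    import psycopg2.extras
    from kafka import KafkaProducer

    conn = psycopg2.connect(
        "dbname=gitlabhq_production host=replica.example",
        connection_factory=psycopg2.extras.LogicalReplicationConnection,
    )
    cur = conn.cursor()
    cur.start_replication(slot_name="datateam_slot", decode=True)

    producer = KafkaProducer(bootstrap_servers="kafka.example:9092")

    def forward(msg):
        # Each payload is a JSON change record (insert/update/delete).
        producer.send("postgres.changes", msg.payload.encode("utf-8"))
        msg.cursor.send_feedback(flush_lsn=msg.data_start)  # ack WAL position

    cur.consume_stream(forward)  # blocks, forwarding changes as they arrive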
B
But only if we get access to the log files first. If not, then all of this stops, and then we have to look in a totally different direction: how do we make the current process more optimized? Now that they are doing a decomposition of the main production database and splitting the DB itself, will that reduce the turnaround time for us as well? Because then we would have to extract only ten tables from one DB, which might be quite a bit faster, and there would be less load on the main database.
D
Yeah, I'm not sure, because, are you thinking that it's running too long at the moment? There are definitely a few places. I think we still read the file into a pandas DataFrame from CSV, but for getting the data out of the database we're actually running Postgres COPY commands, which I think are one of the fastest ways you can physically extract the data.
D
It's just that we then parse it into a DataFrame and do some other data parsing. So if you just wanted to speed up the process in general, we could probably look at some of the code that we have that's doing the Python processing. But if you wanted to speed up the actual database extraction, or if we have a particular concern, say that one SCD table is timing out or something like that, then we'd need to look into something on the PostgreSQL side.
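For reference, the pattern described here, COPY out of Postgres and then read the CSV into pandas, looks roughly like this. A minimal sketch with hypothetical table and connection names, assuming psycopg2 and pandas.

    # Sketch of the extract path described above: Postgres COPY -> CSV -> pandas.
    # Table, column, and connection names are hypothetical examples.
    import io
    import pandas as pd
    import psycopg2

    conn = psycopg2.connect("dbname=gitlabhq_production host=replica.example")
    buf = io.StringIO()

    with conn.cursor() as cur:
        # COPY is generally the fastest way to bulk-export rows from Postgres.
        cur.copy_expert(
            "COPY (SELECT * FROM projects "
            "      WHERE updated_at >= now() - interval '1 day') "
            "TO STDOUT WITH CSV HEADER",
            buf,
        )

    buf.seek(0)
    df = pd.read_csv(buf)  # the parsing step that adds overhead on top of COPY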
C
To duplicate or replicate the data out of the Postgres database to our Snowflake system, that would be fantastic. Of course, I don't know which use case you're going to use it for, because dbt runs every day, but it would be cool to say we now have a real-time mirror. So, let's see.
B
Yes, okay, so for us to take away the next steps from what we have discussed, I'm just taking a note. I've got two points as the next main course of action, apart from optimizing the source and looking at the bottleneck there. The first is to check with infra about getting access to the Postgres log files.
B
Whether they can share it, and if yes, can we then get a sample of a log file, to see whether we can process it through the Singer tap, or post-process it through something like Meltano? Just a sample, not a huge file, a small one of a few MBs, to check that this works, that this file can be consumed, and that it will work out if we want to take it forward. So those are the two courses of action we can definitely look into.
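If such a sample arrived in a wal2json-style JSON format, which is an assumption since no file has been shared yet, a first sanity check could be as small as this sketch; the file name is hypothetical.

    # Sketch: sanity-check a sample logical-decoding dump (wal2json-style
    # JSON lines), counting change kinds per table. File name is hypothetical.
    import json
    from collections import Counter

    counts = Counter()
    with open("sample_wal.json") as f:
        for line in f:
            record = json.loads(line)
            for change in record.get("change", []):
                counts[(change["table"], change["kind"])] += 1

    for (table, kind), n in sorted(counts.items()):
        print(f"{table:30s} {kind:7s} {n}")  # e.g. projects insert 1234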
B
I have not explored this licensing DB extraction part. Paul, have you worked on this, the licensing DB extract?
D
It was before, you know, not much, but to be honest, I'm a bit skeptical that it'll improve things. Like I say, the problem is that they'll basically have to run the same COPY commands, and then we'll be picking the files up with the same Python commands that we're running at the moment. I mean, I don't think we'll be able to speed things up in the way we'd want to.
D
But I'm just saying that the command that they'll run is physically the same thing that we're running in the normal one. So yeah, we could also confirm it; I'm not 100% certain that's the case, but I'm pretty skeptical that it will help. I think it's basically just a process that's separated out in this case, and it's all in our space, in the gitlab.com space. Okay, okay.
C
Well, can you give some context regarding the setup in the issue here, describing the current licensed DB extraction solution? Because I think then we have the complete picture, and then indeed we can say to everybody that it is not an improvement, because of this and this.
D
Right. In that case, something to keep in mind with Fivetran and Stitch is that they're actually also just running Singer taps, so anything that you can do with the tap is the same. I mean, to be honest, I'm pretty sure most of them are also just running Python code, so I'm not sure how much faster they'll be than our in-house Python, but it would be good to compare.
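For context, the Singer pattern referenced here is a tap process piped into a target process. A minimal sketch of driving that from Python, assuming tap-postgres and target-snowflake are installed; the config file paths are hypothetical, and flag names vary between tap implementations.

    # Sketch: the Singer tap | target pipeline that services like Stitch wrap.
    # Assumes tap-postgres and target-snowflake are on PATH; some taps use
    # --properties instead of --catalog. All file paths are hypothetical.
    import subprocess

    tap = subprocess.Popen(
        ["tap-postgres", "--config", "tap_config.json",
         "--catalog", "catalog.json"],
        stdout=subprocess.PIPE,
    )
    target = subprocess.Popen(
        ["target-snowflake", "--config", "target_config.json"],
        stdin=tap.stdout,
    )
    tap.stdout.close()  # let the tap receive SIGPIPE if the target dies
    target.wait()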
C
I don't know if it is faster, but it is a completely different way of extracting the data. Now we read the data ourselves out of the database, and because we're reading out of a database, we saw a huge lag in the past over there. So that's why we got provisioned a new copy database, which refreshes only once per day.
A
Also, just to understand the full context: I assume, from what Dennis said, we have that copy of the Postgres database, which is a copy of production, and after that we move the data into Snowflake, right, and then do the calculations there. What stops us from querying the Postgres database directly? Do we need all of this data, or just some sums or, I don't know, aggregated stuff? You know what I mean.
C
Yeah, I don't think there is a limitation. I think there is also a good solution in getting the data pushed directly to Snowflake instead of to that replica; that can also be a solution. And I believe, maybe you know this better than I do, that the replica that's provided to us is also based on WAL-E. I'm not sure about that.
B
So the replica that is provided to us is a clone of the production instance. It's the gitlab.com production Postgres DB; they have given a clone to us. That's the replica we have. But I didn't get Radovan's question, the last part.
A
My assumption, or my question for us, since I don't know the full context, is: does anything stop us from querying and summing up data directly from Postgres and just uploading the results we are interested in to Snowflake, to reduce the space? Because you have...
B
That was happening over there before. I saw a couple of queries over there, yes, but Snowflake is treated as the cloud warehouse, and you can't rely on the source, because the data over there in Postgres keeps changing. You have certain deletes happening, certain things which are not being captured. So if you rely on the source DB, which keeps updating everything, you are not able to maintain the history.
B
That was a blocker for the whole thing; that's why they moved everything from Postgres to Snowflake, in near real time, within a day, so that whenever any change happens, we are able to track the historical content of the data. Those were, I believe, the reasons, because initially I saw a couple of queries sitting somewhere that ran all the calculations on Postgres itself.
B
We were reading directly from production. As I say, now we have got a clone, but before this clone environment we had real-time replication, so we were reading from the slave server of production.
B
So if you put a heavy query over there, on top of everything, it was already throwing a lot of lag, like 48 hours, one month even. I think there were...
B
...performance issues copying from master to slave. I think, mostly because of those things taken into consideration, they push everything towards Snowflake, because there we have scalable compute; we can scale up as much as we want to do the costly computation, and Postgres is simply not the place where we can do the whole model preparation. Okay.
A
Okay, okay, that was interesting for me. I see that limitation because of keeping history: currently we keep all the data, including history, because we have only a snapshot in Postgres. So in case you want to travel back and pick up something from, I don't know, last year, it can be a problem, because you've lost the data, right.
B
This system was in place, but the gut feeling was that the Postgres DB itself would not be able to handle this much compute, and they can't scale it up dynamically like Snowflake can. A few of our dbt models run for five hours in Snowflake, in parallel, running lots of clusters of XL size; we can't scale up that much in Postgres. So it will definitely...
B
We have been talking so much, we have almost run out of time, but shall we pick it up like this: I trigger a discussion with Gary, asking whether we can get access to the log files, whether this is an option, and what his idea is on how we can optimize our pipeline from the infrastructure side. Do they think something like Kafka can work?
B
Because when I spoke to him very briefly in one of the coffee chats, he said that they are planning to use Kafka on the application side as well somewhere, and he said that if our team wants to use Kafka as a streamer to consume the data, then we need to talk to some infrastructure head, someone whose name I didn't get, but that was at a very early stage.
B
So the first thing I should be checking with him is access to the log files, how easy it is for them to give it to us, or whether we have any restrictions. And if he says to raise an issue, I can create an issue, tag all of us, and then we can all discuss it there. Would that be the first next step to be done, or do you feel something different?
C
Cool, thanks for organizing, thanks for that.