Description
Monthly catch-up meeting for the Data Platform, PI, and Infra teams.
A
Okay, so on the agenda: Gary is not here, which is fine; he will watch the async update. Also, we are at the point where the decomposition project is about to go live, I think sometime next month, and there is an issue created for it. I think I have access; if not, I'm just putting it over in the chat. This was created by Dylan for us to start migrating to the new one and to make our changes as well. So, yes.
B
[inaudible]

A
Yeah, that's fine! What we were after, basically, is this: we have another instance, but there are already two databases, one for ci1. Can we get a clone of the ci database as well? It was basically a question to Gary: I go and create an issue with all the information, and he can spin up a clone instance for us, so that we can test that, yes, we are able to connect.
B
I have good news for you: we have already chosen the ZFS replicas, one in staging and one in production, from the ci cluster. Why? Because I want to keep everything running properly, and when we do the switchover you will already have this. What I need to follow up on with Gary is the snapshots, the scripts that he uses for snapshotting; I think they are not installed there yet, but the ZFS replica is there. I can give you the names if you want, and after that we can talk about it.
A
Yeah, so that is what we need, basically: a snapshotted environment so that we can start picking things up. And since we talked about ZFS, there is a question we have about this ZFS snapshot. It is like a thin database, as I understand it. What does it contain? Does it contain only what changed between the last snapshot and this one, that is, the delta changes, or does it contain everything inside it?
B
Internally, that is exactly what ZFS does: it has an initial image and then the deltas of what is changing. That is the internal layout. But from your side you see all the data; you don't see only the delta.
B
Yes. With the snapshots, suppose that you don't have any changes and you take a second snapshot: it will take only a few kilobytes to record the difference, say a few updates. If you have a massive update, it includes that too. For example, we have one database of one terabyte, and suppose you have a hundred gigabytes of changes.
The thin clone, the next clone, will hold those 100 gigabytes. As for how to tell whether those blocks are updates rather than deletes: I don't know, because as far as I know this is recorded as disk blocks, so you can't tell what happened. The best way to do that would be logical decoding; with that you can get what happened and keep a log of it, but not from ZFS.
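To make the space accounting described above concrete, here is a minimal sketch, driven from Python, of how a snapshot and a thin clone behave at the zfs CLI level. The dataset names (tank/pgdata and friends) are hypothetical placeholders.

```python
import subprocess

def zfs(*args):
    """Run a zfs command and return its stdout (requires zfs privileges)."""
    return subprocess.run(["zfs", *args], check=True,
                          capture_output=True, text=True).stdout

# Snapshot the dataset holding the Postgres data directory.
# 'tank/pgdata' is a hypothetical dataset name.
zfs("snapshot", "tank/pgdata@before-migration")

# A clone is a writable, copy-on-write view of the snapshot: it starts
# near zero size and grows only by the blocks that change afterwards.
zfs("clone", "tank/pgdata@before-migration", "tank/pgdata-clone")

# USED on the snapshot/clone shows only the delta blocks, while
# REFERENCED shows the full logical size the dataset points at.
print(zfs("list", "-t", "all", "-o", "name,used,referenced"))
```

On the one-terabyte database above with a hundred gigabytes of changed blocks, the snapshot's USED column would sit near 100 GB while REFERENCED stays near 1 TB.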
B
As far as I understand, ZFS is much more a disk-wise, block-level mapping of changes; it is not smart enough to understand the technology running on top of it. So to know what happened between one snapshot and the other, I believe you will need to run some experiments with logical decoding.
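As a starting point for those experiments, here is a minimal sketch of logical decoding using Postgres's built-in test_decoding plugin. The connection string and slot name are hypothetical; pg_create_logical_replication_slot and pg_logical_slot_get_changes are standard Postgres functions, and the server must run with wal_level = logical.

```python
import psycopg2

# Hypothetical connection string; the role needs REPLICATION privilege.
conn = psycopg2.connect("dbname=ci user=postgres")
conn.autocommit = True
cur = conn.cursor()

# Create a logical replication slot with the built-in test_decoding plugin.
cur.execute(
    "SELECT pg_create_logical_replication_slot(%s, 'test_decoding')",
    ("audit_slot",),
)

# ... run some INSERT/UPDATE/DELETE traffic against the database ...

# Read back the decoded changes: each row describes a logical change
# (INSERT/UPDATE/DELETE) rather than an opaque block-level diff.
cur.execute(
    "SELECT lsn, xid, data FROM pg_logical_slot_get_changes(%s, NULL, NULL)",
    ("audit_slot",),
)
for lsn, xid, data in cur.fetchall():
    print(lsn, xid, data)
```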
C
Yes, and as Jose said, just to be on the same page: what you know from Oracle is more friendly than Postgres here. As Gary said, Postgres is not so friendly in this task, I would say; yes, compared to other database providers.
B
We usually export pg_stat_statements to Prometheus, so you can see this at least on the live database. But I want to mention one quite interesting thing here: we have a statement timeout of 15 seconds in production. So if a transaction runs long enough to hit the timeout, you process a bunch of data and generate a lot of WAL, but it is not committed, so it never shows up in the statements. So watch out for that as well.
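A minimal sketch of what that looks like at the query level, with a hypothetical connection string. Note that pg_stat_statements only records statements that complete, which is exactly why a statement cancelled by the timeout never appears despite the WAL it generated.

```python
import psycopg2

conn = psycopg2.connect("dbname=ci user=postgres")  # hypothetical DSN
cur = conn.cursor()

# The production statement timeout mentioned above (e.g. '15s').
cur.execute("SHOW statement_timeout")
print("statement_timeout =", cur.fetchone()[0])

# Top statements by total execution time. The column is named
# total_time instead of total_exec_time before Postgres 13.
cur.execute("""
    SELECT query, calls, total_exec_time
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 10
""")
for query, calls, total_ms in cur.fetchall():
    print(f"{calls:>8} calls  {total_ms:>12.1f} ms  {query[:60]}")
```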
A
Okay, a good thing to keep in mind as we move forward. But for now we will raise a request to Gary to create the snapshot environment for us and to set it up in the firewall, the VPC, the network and so on, so that we have access from our Kubernetes cluster. Then we test that we have access to both of them with our changes. Because until now our changes are theoretical: we have done all the configuration changes, but we didn't have two databases to connect to.
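The connectivity test itself can be as small as the following sketch; the two DSNs are hypothetical placeholders for the existing instance and the clone.

```python
import psycopg2

# Hypothetical endpoints for the existing instance and the new clone.
DSNS = {
    "ci": "host=ci-db.internal dbname=ci user=app",
    "ci-clone": "host=ci-clone.internal dbname=ci user=app",
}

# A connection plus one trivial round-trip query per instance is enough
# to prove the VPC/firewall rules let the cluster reach both databases.
for name, dsn in DSNS.items():
    try:
        with psycopg2.connect(dsn, connect_timeout=5) as conn:
            with conn.cursor() as cur:
                cur.execute("SELECT 1")
                cur.fetchone()
        print(f"{name}: reachable")
    except psycopg2.OperationalError as exc:
        print(f"{name}: FAILED ({exc})")
```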
C
The bad timestamps, yeah, if I'm not wrong. What happened is that when we retrieve the data, for a couple of records, more than one, we find timestamps from before Christ, easily before our era. We filter that data out; we have a workaround and it's working fine, but we're just curious what's going on. What do you see in production? Do you see an issue because of that or not?
B
In Postgres you can do arithmetic operations with dates, for example a date plus one day, and things like that. It is quite interesting because you need to cast things properly throughout. But if you add, let's say, ten thousand, you can get some strange dates, because you'll end up with something ten thousand years away.
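A minimal sketch of the arithmetic being described, with illustrative values only. Note the cast to text in the second query: Python's datetime, much like Snowflake, cannot represent BC years.

```python
import psycopg2

conn = psycopg2.connect("dbname=ci user=postgres")  # hypothetical DSN
cur = conn.cursor()

# date + integer means "plus that many days" in Postgres, so a number
# that was meant as something else silently lands decades away.
cur.execute("SELECT DATE '2024-01-01' + 10000")
print(cur.fetchone()[0])  # 2051-05-19

# Interval arithmetic with a bad magnitude walks out of the common era
# entirely; Postgres happily returns a BC timestamp.
cur.execute("SELECT (TIMESTAMP '2024-01-01' - INTERVAL '4000 years')::text")
print(cur.fetchone()[0])  # 1977-01-01 00:00:00 BC
```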
C
Probably this is four thousand years before the current moment, so something suspicious happened, and it's a couple of records. But yeah, it creates an issue for us: we expect a timestamp. I reproduced the issue on a Postgres database and table and was able to insert that value into a timestamp column. It is a legal, legitimate value for a timestamp in Postgres, but for us, for Snowflake, it doesn't seem valid. But yeah, we will see. I will prepare everything for you after the meeting. Thanks.
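A minimal sketch of that reproduction, using a hypothetical temp table. Postgres accepts timestamps back to 4713 BC, while Snowflake's documented range starts at year 1, so the same value is legal on one side and rejected on the other.

```python
import psycopg2

conn = psycopg2.connect("dbname=ci user=postgres")  # hypothetical DSN
conn.autocommit = True
cur = conn.cursor()

# Hypothetical table standing in for the affected source table.
cur.execute("CREATE TEMP TABLE events (id int, created_at timestamp)")

# Legal in Postgres: its timestamp range extends back to 4713 BC.
cur.execute("INSERT INTO events VALUES (1, TIMESTAMP '1977-01-01 BC')")

# Read back as text: Python's datetime (and Snowflake's TIMESTAMP) cannot
# represent BC years, which is exactly the mismatch the pipeline hit.
cur.execute("SELECT created_at::text FROM events")
print(cur.fetchone()[0])  # 1977-01-01 00:00:00 BC
```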
A
The last one: we had an incident last weekend where our clone was down for a long time, more than 12, maybe 30, hours, because the recovery took more than 60 minutes and then we didn't have access. What we understood from the on-call people is that this service, this Postgres instance, is not part of on-call support.
How shall we approach bringing this whole clone-instance service into on-call support, so that they monitor it? We don't do overnight, 24/7 support of our data business. By the time we realize that there is a database issue, we have 400 Slack alerts saying that all 400 tasks have failed. So we just want to kick-start that process. We don't have to get an answer now, but how shall we approach it?
A
So I got it. In that incident there is a write-up from you as well.
A
Yeah, so they said that they will support it, but they will not pick up the failure when it happens, because it doesn't send a Slack or pager alert to them. So they do support it: when we ping them, they jump on the call and we get a solution. But by the time we ping, it is like four to six hours later. The failure happens at, for example, 11:45 pm UTC.
A
We pick it up by 3 pm UTC or 4 pm UTC, since I am in India time when I look at it. But if we don't have people in the India time zone, it is close to 6 pm UTC. So we have two batch runs by that time, and we have like 500 to 600 alerts to look after. So we wanted it to be picked up under an SLO or SLA kind of arrangement. So we need to move your...
B
...alerts to the main cluster, and we need to alert infra directly. For example, if your batch fails, we need to raise an alert to SRE at that same moment and notify you as well, right? Because from what I understand, your pipelines are failing and you're getting the notifications three or four hours later, right? Then you're looking for the root cause, and then you're paging an SRE or someone to support you. Right? Yes.
A
So Gary has already set up that alert mechanism. It goes to the alert channel, but it is an alert channel with thousands of alerts, so it is not being picked up. So we see that our pipeline failed, then I go to the alert channel and check whether an alert was triggered for it: yes, an alert has fired, so there is an incident over there.
So we would just like to make it a formal thing: if it fails, instead of us picking it up, the on-call people pick it up and it gets fixed then and there, because if there is a runbook it is a repeatable step. They redo the exercise, and by the time our task kicks in, things are restored.
B
Possibly. When you adjust the alert so that it is under the SRE radar, then we are fine, and we put in the runbook. Please put the runbook there as well, as the reference in the alert. Usually the best way is: you have an alert, and you have the runbook; the SREs can react better because they see what to do.
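As a sketch of the pattern B is describing, here is a hypothetical Prometheus-style alerting rule, generated from Python for illustration, that routes to SRE and carries the runbook reference in its annotations. The alert name, expression, labels, and URL are all placeholders, not the team's actual configuration.

```python
import yaml  # pip install pyyaml

# Hedged sketch of an alerting rule: page SRE directly when the clone
# instance goes down, and link the runbook in the alert itself.
rule = {
    "groups": [{
        "name": "data-platform",
        "rules": [{
            "alert": "CloneInstanceDown",                # hypothetical name
            "expr": 'pg_up{instance="ci-clone"} == 0',   # hypothetical metric
            "for": "5m",
            "labels": {"severity": "page", "team": "sre"},
            "annotations": {
                "summary": "Postgres clone instance is down",
                # Hypothetical runbook location, as discussed above.
                "runbook_url": "https://wiki.internal/runbooks/ci-clone",
            },
        }],
    }],
}

print(yaml.safe_dump(rule, sort_keys=False))
```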
A
So that's all on the agenda from my side today. These are the two main things: one is to get a clone for the decomposition, and the second, the infrastructure thing with the backdated timestamps, already has an issue under way, so he will create an issue for that. I will create two issues: one to bring in the on-call SRE coverage, and one to get a clone created by Gary and mapped to the VPC.