From YouTube: Marcus, Dennis and Radovan discussing how to approach the PostHog backfilling process
Description
Marcus, Dennis and Radovan discussing how to approach the PostHog backfilling process in order to overcome low performance with the PostHog ingestion.
B
I put some current status in, and we just finished the regular weekly meeting. The main problem, or the main issue, is the low performance. I spoke with the PostHog team and with Jeff Martin from our side. He is able to bump this to 50 pods instead of three pods. That means, theoretically speaking, about 17 times faster. I also want to check with you about cost efficiency; in the Slack thread he explains everything. It's not costly, because under the current setup they will pay the same amount of money for the current cluster anyway.
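The "17 times faster" figure is just the pod ratio. A quick sanity check on that arithmetic (a sketch; the 3-pod and 50-pod figures come from the discussion, and linear scaling with pod count is an idealization):

```python
# Theoretical ingestion speedup from scaling PostHog ingestion pods.
# Assumes throughput scales linearly with pod count (an idealization;
# real speedup is capped by other bottlenecks, e.g. Kafka partitions).
def theoretical_speedup(current_pods: int, target_pods: int) -> float:
    return target_pods / current_pods

speedup = theoretical_speedup(3, 50)
print(f"{speedup:.1f}x faster")  # ~16.7x, i.e. roughly the 17x quoted above
```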
B
Exactly, this is what Jeff Martin nicely explained. He talked about the VM instances and pods, how things are done and how it's executed on the GCP cluster. I was scared and was worrying about cost efficiency; that's always one angle we need to consider. What he said: Kubernetes magic, based on the current pod usage.
C
But it's not... yes, that's true. But let's say even if it cost us 25,000 dollars. If you say, yeah, you also already paid 6 million, then that is indeed small and not material. On the other hand, it's still 25k, right? So, exactly.
B
That's
true
and
anyway,
it
will
be
scaled
automatically
up
to
50
ports.
That's
a
maximum
if,
if
it's
idle,
no
cost
at
all,
but
it's
scale,
I
would
say
on
demand,
so
because
current
what
what
posco
team
explained
to
us,
they
can
put
out
to
scale
automatically.
But
it's
not
the
case
for
our
cluster,
because
it's
scale,
that's
the
reason
why
everything
is
so
humble,
it's
poc
and
you
want
to
save
some
money.
Of
course,
don't
don't
spend
it
very
very
quickly.
B
Correct, roughly. But yeah, as for what you know, I need to check that, of course. But if the PostHog pods scale and there are 50 pods, I can check it tomorrow morning and immediately see: okay, is this done? At least I will check one day of data, or one hour of data, something small, so I can scale up and calculate it more precisely. This is like: where is our sweet spot?
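The "measure something small, then extrapolate" idea can be sketched as follows (hypothetical numbers; the one-day test sample and the extrapolation function are illustrations, not figures from the meeting):

```python
# Extrapolate total backfill duration from a small measured sample,
# assuming the ingestion rate observed on the sample stays constant.
def estimated_total_hours(sample_records: int, sample_hours: float,
                          total_records: int) -> float:
    rate = sample_records / sample_hours  # records ingested per hour
    return total_records / rate

# Hypothetical: a test run moved 1M records in 4 hours; 90M records remain.
print(estimated_total_hours(1_000_000, 4.0, 90_000_000))  # 360.0 hours
```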
B
They said three months of data, so my total calculation is based on that, but I think he could survive without the full range — or maybe that is not possible. I think one month would be sufficient. I don't know; that's what I read between the lines at the last meeting, because I spoke with him. And also, the deadline for this is next Friday.
B
Three months — but in the previous meeting he said, okay... I know he's aware of the performance issue, but he said: okay, if three months is not possible, maybe we can negotiate something else, like a small decrease, or decouple a smaller amount of data. But, as I said, if we can scale this to 50 pods, I can try one hour or one day of data to be sure everything is done.
C
You need to hand it over, and that's not ideal. So if it was one month, then we could get started right now, and by the end of tomorrow we would have all the data in, we could close the issue, and the evaluation could continue. That's why I'm asking. But let's say, if we do need to go for three months, then I think there is indeed no other possibility than to bump the cluster up to 50 pods.
B
Yes. Also, what I want to say, and raise a question mark about: I need to check everything before we move on. Okay, I will check with Jeff Martin about the cost; I just need to test everything before we go live — theoretically it will work. That's one thing. The second thing is what I checked with the PostHog team: when I try to insert one record, there is a problem with the PostHog installation, because there is a bug on their side.
B
So I don't get an error, but the data is not where I expect it; I can't see it in PostHog, so they need to sort this out as well. From my point of view, my action points are: check with Jeff about the cost of 50 pods, just to have the measures for three or four days of usage; check with the PostHog team to fix this issue; and then I need to test everything and say we are ready. My estimation is fine.
B
Later on, if it's okay, I can run the DAG, and if something fails, someone can just rerun it. You can do it, and Marcus can — it's very simple. But we need to ensure the measure is correct, that it's fine end to end, and that the data is there in the test project. Then we should switch to the real project and say: okay, this is the data for you — and of course the schema is okay. For now they have some additional requirements — we're talking about Carolyn and Dave — but that is what it is.
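The "if something fails, someone can just rerun it" property depends on each step being safe to repeat. A minimal rerun wrapper might look like this (a sketch, not the actual DAG code; the function and parameter names are invented):

```python
import time

# Rerun a backfill step a few times before giving up. This is only safe
# because each step is assumed idempotent: rerunning it must not
# duplicate already-ingested data.
def run_with_retries(step, attempts: int = 3, delay_s: float = 1.0):
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception as exc:
            if attempt == attempts:
                raise  # out of attempts: surface the error to the operator
            print(f"attempt {attempt} failed ({exc}); rerunning")
            time.sleep(delay_s)
```

In an orchestrator such as Airflow the same effect comes from the task-level retry settings, but a plain wrapper like this keeps the behavior visible to whoever takes over as backup.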
C
Good. And then the third point: we need to have a backup, because you're on holiday from Friday — let's say from tomorrow. Who can actually step in in case of an emergency? You say: yeah, we can run the DAG. But if we have an error with that DAG, then I cannot solve it, and nor can Marcus. So we need to have someone as a backup.
B
So
I
would
say
from
my
point
of
view,
anyone
of
indeed
engineer
engineering
team
can
do
that.
What
is
my
intention
is
to
fully
documented
everything
it's
already
done.
Code
is,
let's
say
self-descriptive
it
all
specification.
Each
function
has
explanation
usage
input
output.
I
think
it's
in
a
good
shape
if
something
fails
any
moment
or
any
part
can
be
debugged
and
fixed,
because
it's
building
components
so
also.
I
take
that
action
point
for
me
with
me
to
create
a
backup
what
I
can
do
can
organization.
C
Yes, exactly, that was my proposal. Indeed, please have a sync — one hour, or a 30-minute sync, I don't know how much time you need — and please put that in the status. So the status as of now, or let's say tomorrow, is: yes, we have a technical issue that needs to be tackled with PostHog, something we cannot solve ourselves.
C
Lastly,
push
one
record
and
it
doesn't
exist
in
post
hoc
so
that
that's
what
indeed,
then
please
check
indeed
with
jeff
about
the
cost
impact
bump
it
up
to
50
points
if
it
is
not
too
much
and
then
if
this
technical
one
is
solved
by
tomorrow,
start
the
thing
and
then
back
fent
is
a
backup
in
case
of
an
emergency
that
something
breaks,
but
that
really
is
something
great
right.
If
there
is
a
schema
change
needed,
I
think
that's
too
bad.
That
needs
to
wait
up
until
you're
back.
B
Infrastructure issues — any kind of infrastructure issue, any kind of triage issue, nothing more than that. That is what it is. To prove it works I need to test everything, and hope there will not be an additional problem. I also need to sort this out with the PostHog team, sort out the pods with Jeff, and create a backup — everything.
C
Yes. And then my last question — and I look at both of you. Let's say happy flow: this technical issue is solved by tomorrow, we bump the cluster up to 50 pods, we run it tomorrow and it's finished on Sunday. Yeah, ideal situation — let's hope for the best.
C
Sure, other things can go wrong, but this was just for me to get a clear overview: are we now covered with our plan? And apparently we are indeed covered, so let's see — I would not want to be surprised. It was already a surprise for me that you need to do this backfill; I thought it was only installation-wise. So that was my checklist: the installation is done by Kev.
B
Also, for whoever gets picked as the unlucky backup guy, we will consider who the key players are if something is wrong, so they can take over. For instance: something screwed up with the PostHog installation, you know it's buggy, we can't insert the data — what to do? Ping the PostHog team. If you need to scale something, ping Jeff Martin. Just to create a state machine of scenarios: if something goes wrong, what can be wrong.
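That "state machine of scenarios" could start as nothing more than a lookup table. A sketch (the scenarios and contacts come from the discussion above; the structure and names are invented):

```python
# Minimal emergency runbook: map each failure scenario to the next action.
RUNBOOK = {
    "posthog_installation_buggy": "Ping the PostHog team in the Slack channel",
    "cannot_insert_data":         "Ping the PostHog team in the Slack channel",
    "cluster_needs_scaling":      "Ping Jeff Martin",
    "dag_run_failed":             "Rerun the DAG; escalate if it fails again",
}

def next_action(scenario: str) -> str:
    # Unknown scenarios fall through to a safe default.
    return RUNBOOK.get(scenario, "Unknown scenario: escalate to the team")

print(next_action("cluster_needs_scaling"))  # Ping Jeff Martin
```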
C
Maybe not a smart thing to say, but don't make it too difficult — we can't think through a hundred scenarios. But indeed, like you said, if we just have proper documentation and we know who's involved, we will find our way, of course. So of course, if you have some time to do this kind of additional documentation, that would be fantastic. But the key thing, of course, is that we said we need to tackle that technical issue, because that's a blocker. So that has more priority for me: getting that one solved.
B
So: ping them, drop a message in the PostHog channel. And, as I said, I hope I will be able to do the testing first thing in the morning — okay, I have end-to-end testing for one day and I can see it's done — or, I hope, there are no further problems. But anyway, I will do all the additional steps and let you know. On your honest question about the latest state: I have the current status here and will post it for tomorrow as well, so you have the full picture.
B
No, no, just to see what's going on, yeah, so you can easily connect the dots — all the strings, whatever. So the two key things for us for today are: ping Jeff to do the bump and create 50 pods — before that I will ask for the costs, of course, and put it here, since this is a confidential issue — and then ask the PostHog team to fix the issue with Kafka. For now.
B
That is point one. We also discussed this, and he also mentioned — heads-up now — the number of partitions for Kafka: it's only one partition at the moment. He can take that action point, but I don't know when he plans to do this. Because with one partition for Kafka, the question is: you're able to push everything to Kafka, but are you able to push it efficiently through PostHog to ClickHouse? And can we upgrade to the latest version to get rid of some bugs? So what can happen?
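The single-partition concern matters because, within one Kafka consumer group, each partition is consumed by at most one consumer, so consumer parallelism is capped by the partition count — extra ingestion pods beyond that sit idle. A sketch of the arithmetic:

```python
# Within one consumer group, effective parallelism is capped by the
# partition count: at most one consumer per partition is active.
def effective_consumers(num_consumers: int, num_partitions: int) -> int:
    return min(num_consumers, num_partitions)

# With a single partition, even 50 ingestion pods drain Kafka one at a time.
print(effective_consumers(50, 1))   # 1
print(effective_consumers(50, 64))  # 50
```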
B
Everything is perfect — that's the happy flow. But there could be a problem with Kafka, or a problem with this events issue: a pod can't accept the requests. That can be a potential problem, I would say. I hope everything will be fine, but in case of any problem we need to ping them, or drop a message in the PostHog Slack channel.
C
Okay — it's buggy, but then you cannot backfill data?
C
Now, let's do so — right or not? Sorry, let's do so. But I don't know to what extent it can be done today. If it can be done today — and I think we need to push for it a lot; maybe Marcus, you can push that, because most of the folks are in the US — if we can make it happen that we are on 1.37 by the end of the day, we can start fresh tomorrow.
C
Action item number one: yeah.
A
...needs to do it — is that correct? Okay. You know, I'm definitely going to ask. I'm not sure if he will be able to do it today, whatever he currently has on his plate, because we're throwing this in last minute; let's go with the assumption he won't. From the call that we were on before, it's like: well, they recommend us upgrading, but it should still load regardless. Now it might not, but it's a lot to do right before running, and so on.
C
That's, I think, the biggest blocker that we have currently. Of course, the performance is an issue too, but if we cannot even insert one record because of an issue on the PostHog side, that needs to get out of the way first, of course. And if I understand you correctly: regardless of the upgrade, you should still be able to insert one row — or not?
A
They just strongly recommend it because of all the bugs, but it doesn't mean — it's not black and white, that it doesn't work at all on the old version and only works with the new one. I think it's just that some stuff was buggy, but it still somewhat works, because if I log in right now, I see data. Yeah, but...
B
I picked up a zillion microphones because my computer is crashing — sorry for that; let's go on. The problem is that now I can't insert even one record; something crashed on their side in the current version. They claim the upgrade will probably sort this out. If there is no upgrade, I need to ping them and wait for them to fix it.
B
I loaded data two days ago, and then something crashed yesterday morning.
B
Yes, so it wasn't a hard blocker from the beginning; I was able to load slowly — I have one million records there in my test project. But since yesterday morning nothing can be inserted, and it's a black box for me: there is no exception, just silence. It means the PostHog library for Python accepts my record and pushes it to Kafka, where it stays forever; it's not processed into ClickHouse and it's not visible in the PostHog application.
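Because the failure is silent — the client accepts the event, nothing raises, and the event never reaches ClickHouse — an end-to-end canary check is the only reliable signal. A hedged sketch with the send and query steps injected: `send_event` and `count_events` stand in for the real PostHog capture call and a ClickHouse/PostHog query, which are not shown here:

```python
import time
import uuid

# Detect silent ingestion failure: send a uniquely-tagged canary event,
# then poll the destination store until it shows up or we time out.
def canary_ok(send_event, count_events,
              timeout_s: float = 30.0, poll_s: float = 1.0) -> bool:
    canary_id = str(uuid.uuid4())
    send_event({"event": "backfill_canary", "canary_id": canary_id})
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if count_events(canary_id) > 0:
            return True   # event made it all the way to the store
        time.sleep(poll_s)
    return False          # accepted by the client but never ingested
```

Running this once before kicking off a multi-day backfill would have turned "a black box, just silence" into an immediate red flag.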
A
Okay, let me back up and go to the worst-case scenario: we'll have PostHog give the right stuff to Jeff and we'll ask him to do it. Provided... if you can do it now, we're great. If you can't do it before you leave, we cannot start the load; somebody else needs to kick it off, or we say we're not going to make our OKR. Yes.
C
And I don't want to go all-in on an upgrade if it is not needed — then let's not do it; let's stay on the old version, 1.36. But we need to get that loading issue out of the way anyhow, and maybe we need to push this one back to PostHog, saying: hey, this is the issue right now, we cannot insert — how do we solve it? Can you solve it, or do we need to upgrade to the next version?
C
Let them make the call. If we need to upgrade, then they need to provide the Helm chart and Jeff has to execute the upgrade. If they can solve it otherwise, I'm a huge fan of that, because upgrading takes more time, involves more people, and will make it more complex than it already is. So, right: we need to get the loading issue out today.
A
Yeah, I'll follow up — if there's not enough traction, or whatever the response is, I'll follow up where needed. So overnight for you — for you all, overnight — hopefully something will be done. Similarly, if they say it needs to be the upgrade, I'll have them sync with Jeff too. To your point, Dennis: indeed, it feels like a big scary thing to do right before somebody goes on holiday — like, if it doesn't work, it doesn't work, and they're out, and that's not it.
A
So if that really is the only option we have, then yeah, hopefully all this gets sorted before your time tomorrow, which is, I don't know, 12 hours or so. Otherwise — so what is our... I know we're out of time. Do we even have a backup plan? What if it doesn't... Well, if it doesn't get solved in the next 24 hours — but preferably more like 12 hours, really, to kick it off and test it — it basically needs to be done within 12 hours.
B
That's the backup option, Dennis; I think you're right here. Probably that, or anyone, as I said, can follow up on what I did and just try to do the same next Monday or Tuesday, whatever, and then finish before next Friday, if the happy flow failed for some reason — or for any reason. But, as I said, we will do something in the next 24, 22 or 21 hours for sure, and then we will take it over from the last status.
B
I will also stay and try to provide the documentation and this message immediately, just to be prepared for tomorrow. And tomorrow I'll do my best — and hope my laptop will survive, because I don't know what's going on; there's something crazy. You should see kernel_task hitting my memory three times — 3.5 times. But anyway, I will prepare what is possible tonight.
C
Yeah. And since it was working on the existing version, I won't put all my hopes on an upgrade, because to me it seems unrelated to whether the software was working or not. Of course, if there is an issue where you need to upgrade, fine; and if it had never worked, then I would say: yeah, upgrade to another version, maybe that will solve it. But, like I said, I don't want to go all-in on upgrades, especially not where we are right now with the evaluation and the timing that we have.
C
Well, let's rumble.