Description
Chris and Radovan discuss next steps for decoupling the dbt DAG in Airflow.
B
Yeah, great. Regarding the data engineering internship, we just started on the dbt and Airflow infrastructure things here. So at the beginning, as I said, dbt is one DAG from this list, and it's executed in exactly the same way as any other DAG here. Our Airflow is deployed on a cluster, and it has three components: the database, the web server, and the scheduler.
B
Yeah, I will do that, thanks for the reminder. I stopped sharing and started the recording, and yeah.
Great, so once more: dbt running on Airflow. For now it's just one monolithic structure with a lot of tasks inside. Actually, just one step back about Airflow: it has three components, as I said, the web server, the scheduler, and the database. What is most important for us is the scheduler, the component behind the web server that manages how things are orchestrated, and all data is persisted in the database layer.
B
Long story short, any of these DAGs is set up to be spun up in the same way, and dbt is no different. In that case, as I said, when we call the task, it spins up a Kubernetes pod in our cluster and everything is done inside that pod, I mean everything is executed there. That is a little bit more difficult to maintain and manage, but the great advantage is that you're not limited to the machine where Airflow is deployed. So you detach orchestration from execution: in this case you have Airflow, and you say, okay, this will be done.
B
I want to execute dbt, and then a pod spins up and everything is done on that pod. It can be scaled, practically speaking, infinitely, so we have no problem with any issue when it comes to memory or things like that. And also, if you take a look into the code here, you will see the command we run when we run dbt, let's say for non-product models and a smaller warehouse.
B
This will run one pod on Kubernetes, this will run the second one, and also, let's say, dbt test will run the third one, etc., but long story short: we actually run the dbt command as a script, it spins up on the cluster, and on that cluster everything is executed and the data is transformed. We also define which environment, which profile, what we exclude, what we want to include, and some additional flags as per the dbt documentation, and that's it.
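Not from the recording, but as a rough illustration of what that looks like: such a dbt invocation could be assembled like this before being handed to the pod. The profile, target, and tag names below are hypothetical placeholders, not the project's real ones; `--target`, `--profile`, `--select`, and `--exclude` are standard dbt CLI flags.

```python
def build_dbt_command(action, target, profile, select=None, exclude=None):
    """Assemble a dbt CLI invocation as an argument list, the form a
    Kubernetes pod would receive as its command. Placeholder names only."""
    cmd = ["dbt", action, "--target", target, "--profile", profile]
    if select:
        cmd += ["--select", *select]
    if exclude:
        cmd += ["--exclude", *exclude]
    return cmd

# Hypothetical example: run only Salesforce-tagged models on a dev target.
command = build_dbt_command("run", target="dev", profile="analytics",
                            select=["tag:salesforce"],
                            exclude=["tag:product"])
print(" ".join(command))
# → dbt run --target dev --profile analytics --select tag:salesforce --exclude tag:product
```

Keeping the invocation as a plain argument list is convenient because the same structure can be reused per environment by swapping only the target and profile.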
B
So actually you can do the same: you can create a mimic of the production environment on the local cluster you mentioned, which we talked about last time. You're also able to run dbt locally, and of course it will target not production but your private schema, like the one Chris prepared. Okay.
B
I don't know, did you find any obstacle or any showstopper or blocker for us, or are we good to go with that, let's say, set of tables? Because I think it's imported or extracted using Fivetran, if I'm not wrong, and it lands in the raw layer. So from that raw layer we need to decompose the dbt jobs.
B
So, from my point of view, if you can go with Salesforce, what we need to do is a couple of things: analyze the models, tag all of them, see how the dependencies go, and then create a command. We can run it first locally and then wrap it up inside a new DAG in Airflow, right? Yeah, that's how I see this.
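The dependency analysis described above can be sketched as a small graph walk; the model names below are hypothetical stand-ins for the real Salesforce models, and in practice dbt's own graph selector (`dbt ls --select +model`) performs this resolution for you.

```python
def upstream_closure(model, parents):
    """Collect a model plus all of its upstream dependencies,
    mimicking what dbt's '+model' graph selector resolves to."""
    seen = set()
    stack = [model]
    while stack:
        m = stack.pop()
        if m not in seen:
            seen.add(m)
            stack.extend(parents.get(m, []))
    return seen

# Hypothetical slice of the DAG: a fact model fed by one staging model.
parents = {
    "fct_opportunities": ["stg_salesforce__opportunities"],
    "stg_salesforce__opportunities": [],
}
print(sorted(upstream_closure("fct_opportunities", parents)))
# → ['fct_opportunities', 'stg_salesforce__opportunities']
```

Once every model in such a closure carries a shared tag, the whole subset can be selected with a single `tag:` selector instead of listing models by hand.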
A
Let's see, so where is the Stitch integration with Salesforce, because...
B
Everything is nicely described here for each data source we have, but we will stick with Salesforce, and it's Stitch. We will also have one lesson to cover how to determine, define, or predefine something in Stitch versus Fivetran. It can also be a good candidate to take a brief look in Stitch at what we have regarding Salesforce: the data is landing in the raw schema, and you can also see what the ETA for this data is.
B
No, actually Stitch is a SaaS platform where you define all of the things, source and target, and everything runs on their resources. So it's the typical SaaS story: okay, I want to get the data from Salesforce, these are my credentials, and this is the set of tables you can actually import into your data warehouse. Then you say: okay, I want this table, I don't want this table. I have a target, like Snowflake in our case, and I want to put everything into the database.
A
So, actually, is there a way of triggering the dbt DAG from Stitch? So once the Stitch process has run, trigger it, use that as an input for the dbt stuff, yeah.
B
Actually, I think dbt is blind in that case; you can't tell dbt. Usually, you know, if you go to Stitch and trigger your data import, you also approximately know how much time is needed to finish the extraction. So, based on that, you can, let's say, schedule your dbt run, yeah. Or you can be smarter and use some sensors or something like that to check: okay, is this done or not, and then, based on that parameter, run dbt.
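The sensor idea can be sketched as a simple poll loop. Here `check` is a hypothetical callable, for example one querying Stitch's API or the latest loaded-at timestamp in the raw table; in Airflow this role is normally played by a sensor operator with `poke_interval` and `timeout` settings.

```python
import time

def wait_for_extraction(check, timeout=3600, poke_interval=60):
    """Poll a completion check until it returns True, so the downstream
    dbt run only starts once the load is confirmed; give up after
    `timeout` seconds. `check` is a hypothetical callable."""
    waited = 0
    while waited < timeout:
        if check():
            return True
        time.sleep(poke_interval)
        waited += poke_interval
    raise TimeoutError("extraction did not finish in time")
```

Compared with a fixed start time, this makes dbt start exactly when the extraction is done rather than relying on an estimate of how long the import usually takes.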
B
This will be a point of fine-tuning for us, because as of now, from what I know, we usually estimate the time when we want to start something, and dbt is just one more structure for now. So in that case it runs after: okay, the Postgres pipeline is done, this job is done, the Fivetran integration is done, and then dbt will run. So you can consider Stitch and Fivetran the same type of tool: SaaS pipelines for the extraction part of ETL processing.
B
To ensure everything is working fine locally, we dbt run those commands. So once we have the command, what we actually want to run, we then incorporate the command into a new Airflow DAG. But we need to be sure everything is cool here first, and then move on to the DAG creation, which will give more context about the data engineering stuff here, yeah, so yep.
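Wrapping the verified command into the new DAG, in a KubernetesPodOperator-style setup, roughly means building task arguments like these; the image name and namespace are hypothetical placeholders, and the real operator would be instantiated inside the Airflow DAG file.

```python
def dbt_pod_task(task_id, dbt_args, image="dbt-runner:latest",
                 namespace="data-eng"):
    """Build the keyword arguments one might pass to Airflow's
    KubernetesPodOperator so each dbt command runs in its own pod,
    keeping orchestration separate from execution. Image and
    namespace are illustrative placeholders."""
    return {
        "task_id": task_id,
        "namespace": namespace,
        "image": image,
        "cmds": ["dbt"],
        "arguments": dbt_args,
        "get_logs": True,
    }

# Hypothetical task: one pod that runs only Salesforce-tagged models.
spec = dbt_pod_task("dbt_run_salesforce",
                    ["run", "--select", "tag:salesforce"])
```

One pod per dbt command (run, test, and so on) is what keeps the memory and scaling concerns off the Airflow machine itself, as discussed earlier.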
B
We could exclude it physically, but at this level I would say that for the dbt sharding it's enough to define, okay, what we actually want to load, to see whether we can somehow improve something, whether it's okay or not. But the main point and main focus is to extract it from that big bunch in dbt and have something like dbt-salesforce, right? And then at that level we will schedule everything: when we want to run, how we want to run, what the command is. We'll actually have the command as the outcome of this issue, and that's it.
B
Then we can run it locally on your own environment or whatever, and ensure we include everything needed for that run. Yeah, and after that, once you do a test, we can move to the DAG creation, which will be a separate, let's say, working session for us: real data engineering stuff, mainly Python coding plus Airflow. So that's just my assumption; anything you think you should add here, please add, feel free to rephrase or add more details, whatever you need. But we can start from here if you're okay, yeah.
A
I will do that. I think Dennis mentioned that there was some requirement: some people were looking for the Salesforce data to be delivered into a certain table more frequently than once a day. So, great, I think that would be it: find out which table they were looking for and whether it's... yeah, and then maybe take it that far, because we wouldn't need to run everything that relies on Salesforce data more frequently than once a day.
A
It was only a sort of subset of it, something to do with opportunities, maybe. Yeah, definitely, I agree: load the Salesforce data, sorry, transform it.
B
So we need to align with that part. And also, once you write down everything regarding the team epic, we can think: okay, we can make the Salesforce loading more frequent than it is now, because, as I said, you can schedule it in Stitch very, very simply. So we'll cover and touch on that part as well; we'll start from that. But still, I would say, what we can do is done, right.