Description
Session number 5: Chris Shar, Radovan Bacovic and Dennis Van Rooijen - pillars of successful DBT sharding
A
What is the agenda for today? I haven't had time, Chris, sorry, to check why this Airflow pod running locally is not spinning up properly, but we'll check it probably tomorrow; on a Friday it will be much better with my free time. And it also secures us time to try to cover, let's say, this exploration. He wants to share his experience about the late arriving dimensions principles.
A
Opportunity data, yeah. If I'm not wrong, we pull the data every six hours in the extraction part, and we also want to get closer to that when it comes to transformation, but also to open some doors to do it more frequently. And, as I said, Chris did a great job analyzing the models and the approximate execution time for this set of models. We want to focus and actually do the extraction from the main DAG and put it into a separate DAG. For that reason, also, we discovered a couple of good options.
A
I already mentioned the selectors option earlier today to Dennis: instead of having a long sausage of commands you want to use, you can say dbt run --selector and the selector name, right, and in that case it can be a good general approach for the entire transformation part, or all dbt jobs, but...
A
...to do whatever you want, right. Yeah, and actually, Dennis, we need your help here about what we talked about, this late arriving dimensions principle, and how we should implement it. Because the main idea is to try to convert each model from full to incremental load where it's suitable and makes sense, and not over-complicate everything; but that's the cherry on top. From my point of view, we need these principles together with late arriving dimensions, I imagine, so we want to hear from you.
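For reference, a minimal sketch of what such a selector could look like in dbt's selectors.yml; the selector and tag names here are illustrative, not the actual ones in this project:

```yaml
# selectors.yml -- illustrative names, not the project's actual selectors
selectors:
  - name: six_hourly_salesforce_opportunity
    description: One named entry point instead of a long chain of dbt flags.
    definition:
      method: tag
      value: six_hourly_salesforce

# Invoked as:
#   dbt run --selector six_hourly_salesforce_opportunity
```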
C
The idea behind late arriving dimensions is that facts, most of the time, have higher-frequency data. I'd have to take a look at the same opportunity data, and I don't know what other dimensions we have there, sorry. And one more thing: it only applies to facts and dimensions, not to mart tables, not to report tables, not to prep tables. So really, facts.
C
Okay, the idea behind this is that facts are updated more frequently than dimensions. Let's say you have a fact table, opportunity, out of salesforce.com, and it also has a dimension table; I don't know exactly which dimensions apply to this fact table, but say a dim table called client, for example. The client itself, all its attributes, will not change frequently, yeah. The name of the client will not change frequently. The contact person of the client will not change frequently. Maybe they do, but that's why they're also called slowly changing dimensions.
C
Fact data, most of the time, especially if you have an event-driven fact table, is updated more frequently. So there are a lot of opportunities, hopefully, put into Salesforce, so you see a lot of new opportunities. From that perspective, it makes sense to update the fact table more frequently, rather than updating the dimension more frequently. If you want to see a high-level number, a high-level KPI, the number of opportunities as of now, most likely you don't need all the attributes that are in the dimension.
C
Maybe you want to divide or aggregate or group by on some of those dimensions, on those attributes, but because they will not change that frequently, it makes perfect sense to only focus on the fact table. The problem is: as soon as you put in a new fact line, so in this case a new opportunity into the fact table opportunity, and you don't update the dimension table first, you cannot create a join between the fact and the dimension table, and then you get missing data if you do an inner join between the fact table and the dimension table.
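As a sketch of that join problem, with invented table and column names: an inner join silently drops any new fact whose dimension row has not arrived yet, while a left join at least keeps the fact visible:

```sql
-- Invented names. An inner join loses opportunities whose client
-- has not been loaded into the dimension yet:
select f.opportunity_id, f.amount, d.client_name
from fct_opportunity f
join dim_client d on d.client_id = f.client_id;

-- A left join keeps the fact row and exposes the gap instead:
select f.opportunity_id, f.amount,
       coalesce(d.client_name, 'Unknown (late arriving)') as client_name
from fct_opportunity f
left join dim_client d on d.client_id = f.client_id;
```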
C
So the principle behind late arriving dimensions is that you already put a placeholder in the dimension table with, if you have surrogate keys, the surrogate key and the natural key, but none of the attributes. And then, let's say once per night, you update all the attributes following the normal frequency we already have right now. So that means you don't have a heavy loading process where you need to capture changing dimensions.
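A rough dbt sketch of that placeholder idea, assuming hypothetical model and column names; it shows only the high-frequency piece that inserts skeleton rows, while the nightly job that fills in the real attributes is omitted:

```sql
-- Hypothetical sketch: dim_client kept incremental so the frequent run
-- can append placeholder rows for natural keys that arrive with the facts first.
{{ config(materialized='incremental', unique_key='client_id') }}

select distinct
    f.client_id,                        -- natural key arriving with the fact
    'placeholder' as client_name,       -- real attributes come with the nightly refresh
    cast(null as varchar) as contact_person
from {{ ref('stg_salesforce__opportunity') }} as f
{% if is_incremental() %}
where f.client_id not in (select client_id from {{ this }})
{% endif %}
```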
C
Yeah, I have, but not in dbt. I used, for example, Talend, Informatica PowerCenter, IBM DataStage for doing this, but not via dbt, yeah.
B
Yeah, all the models that we've looked at for the Salesforce opportunity, they are all full refreshes, and, given that they're short, they run pretty quickly, because there are only like 250,000 rows or something, so yeah.
A
Because we just want to create some kind of exercise, if we wanted to play and pretend we want to switch from, let's say, full refresh to incremental. But you also spoke with Pete Rimpy, right, Chris, about one model we found that is kind of the most complex; in that case, if we try to reorganize it, it will be very, very... yeah.
B
Right, yeah. We looked at making some of these incremental, and, I mean, incremental models are really useful for large models, but it does increase the complexity of the code that goes into them. A lot of these have many different inputs, so we sort of felt that it was more trouble than it was going to be worth: it would have increased the complexity, and I don't think you would have seen much improvement in runtimes.
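For context, the conversion being weighed up is roughly this shape; a minimal sketch with invented column and model names, not the actual Salesforce models:

```sql
-- Hypothetical sketch of a full-refresh model converted to incremental.
{{ config(materialized='incremental', unique_key='opportunity_id') }}

select opportunity_id, stage_name, amount, updated_at
from {{ ref('stg_salesforce__opportunity') }}
{% if is_incremental() %}
-- Only rows changed since the last run. Repeating this kind of predicate
-- across many upstream inputs is where the extra complexity comes from.
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```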
C
That's true. On the other hand, if we want to roll this out further, it can also be a showcase to show that this is a good way to work in the future. So when, if we don't do it right now, when do we want to do it? That's basically the...
A
...other question, yeah. That's true, that's true, but we started optimistic, like, okay, let's pick one model and try to reorganize it to be incremental instead of full load, but, as Chris said, it's super complex. So when you try to touch it, you actually need to write it from scratch, probably, or most of the things will not be used, or will be discarded. So...
A
This late arriving data goes against a complex model under the hood. For now, Chris labeled everything needed for the new DAG and also excluded everything from the existing DAG, so now we have the material to create a completely separate DAG. I just need to check why the dbt model run is not spinning up on the cluster in the testing environment. And the main question is, okay: how do we approach this, how do we optimize this? That is, what do you think?
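What that labeling could look like, as a sketch with an illustrative tag name (not necessarily the tags Chris used): the moved models carry a tag, the new DAG selects on it, and the existing DAG excludes it:

```sql
-- Illustrative tag name. In each model moved to the new DAG:
{{ config(tags=['six_hourly_salesforce']) }}

-- The new DAG then runs only the tagged models,
-- and the existing DAG excludes them:
--   dbt run --select  tag:six_hourly_salesforce
--   dbt run --exclude tag:six_hourly_salesforce
```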
B
I don't think so. Looking into some of these, the prep CRM opportunity model has many different sources, so it would be quite complex to update it to be incremental.
C
Another thing I'm asking, because what I hear right now, basically, is that we exclude a certain flow from some source models across both layers, so, let's say, from raw to prep: we exclude it from the regular one, we do it in a new one, and we schedule that four times a day. So I'm trying to see where we can raise the bar a...
B
So this is actually the model, and it uses a macro. I wasn't involved enough to speak to Michelle about how this was put together, but...
B
This macro... I've never made a macro into an incremental model, but I mean, there would be ways of doing it. What was the name of the actual...?
A
Also,
the
pro
one
of
the
problem
with
this
model,
then,
is
because
many
Avengers
are
hardcoded
here
not
exposed
to
the
table
but
expose
in
hard
code
manner
as
a
table
like
one
two,
three,
four,
five,
the
meaning
of
four
all
these
facts.
So
that's
the
catch
here.
One
of
the
issue,
I,
would
say-
and
this
is
built
based
on
marker
right,
Chris
and
I-
think
it's
using
two
times.
C2
live
and
one
another
parameter
right.
Yeah.
B
That's right, if it's...
A
Yeah, in this case, theoretically, probably. No, but what Dennis pointed out makes perfect sense to me, because sometimes you can get late arriving data from some other side and you miss that information. And if you apply the late arriving data principles and fill the facts completely, because you have the facts, but let's say you miss a new user, new player, new customer, whatever, then you need to put this placeholder record in the dimension.
B
Yeah, I'm not sure. I think, you know, if we were looking at a larger, longer-running model, then definitely, and we do generally make those into incrementals, but for these small, very quick-running ones, it's often better for the simplicity of a full refresh.
B
So we've got a selector called six hourly Salesforce opportunity, and then the DAG itself.
A
Yeah, this is what we implement then, for information, to make everything flexible. Let's say we want to run it: there was a discussion, okay, how to change something to run hourly, but actually our need is to have six-hourly in a first iteration, and later on to be able to decrease the interval down to five minutes.
A
Half
an
hour
one
hour
whatever
and
yeah
Chris
found
this
very
nice
and
elegant
solution
with
these
selectors,
so
you
can
easily
just
find
and
replace
tags
like
six
hours,
so
we
combine
Salesforce
six
hours
tomorrow
we
can
combine
Salesforce
with
five
minutes
run.
If
you
know
what
I
mean,
so
it's
very
easy
to
keep
everything
in
control,
so
that
part
is
also
covered
and
I
think
it's
also
a
very
nice
feature
to
help,
because
we
will
run
this.
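One hedged way to get that find-and-replace flexibility, again with invented names: keep the cadence as its own tag inside the selector definition, so switching Salesforce from a six-hours to a five-minutes run is a one-line change:

```yaml
# Illustrative selectors.yml fragment
selectors:
  - name: salesforce_six_hours
    definition:
      intersection:
        - method: tag
          value: salesforce
        - method: tag
          value: six_hours   # swap for a 'five_minutes' tag to change cadence
```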
A
But the main question for us is how to pick one showcase for converting a full load to an incremental load. That's the question mark for us, because...
A
...every model here, all of them, are, I'd say, fast and quick in execution, even with a full load. But we want to expose one example: okay, we know how to do this, yeah. But yeah, as you said, the main concern here is the over-complexity for this showcase, and also that it contains several sources, which can create...
A
It will pick up the latest data it has, yeah. Also, one thing we're considering is to come up with some sensor to check whether the extraction is done or not. I know from previous projects I was working on that there was a catch: theoretically speaking, in order not to be blocked, you will just load the data from some point in time, actually whatever you find in the raw layer in Snowflake, because that's how dbt works.
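Since the orchestration discussed here is Airflow, one possible shape for that "is extraction done?" check is a SqlSensor; the connection ID, table, and freshness rule below are invented for illustration:

```python
# Hypothetical Airflow sketch: block the dbt task until the raw layer in
# Snowflake looks fresh. Connection, schema, and threshold are invented.
from airflow.providers.common.sql.sensors.sql import SqlSensor

wait_for_extraction = SqlSensor(
    task_id="wait_for_salesforce_extraction",
    conn_id="snowflake_raw",
    # Truthy (count > 0) once the newest raw row is under six hours old.
    sql="""
        select count(*)
        from raw.salesforce.opportunity
        where _uploaded_at > dateadd(hour, -6, current_timestamp())
    """,
    poke_interval=300,    # re-check every five minutes
    timeout=6 * 60 * 60,  # fail the run after six hours of waiting
)
```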
C
Exactly. So, in my opinion, what I would like... well, my personal vision here is that a downstream data model...
C
...should run properly regardless of what happens upstream. So it needs to be, let's say, flexible; can I call it flexible, or adaptive to what happens? Because anything can happen. What also can happen is that the extraction for the opportunity table goes well and there is an error on the user extraction, right. So if we now have the philosophy that we do a full refresh because we don't know exactly what happened, for some hocus-pocus reason, I think that's not the right thing to do, and I think the impact is less.
A
Actually, when you define the Stitch integration, you can say which way you want to do it. As Dennis said, in some other sources it's called SCD-1, SCD-2; here it's called something like not tracked, key-based incremental, full refresh, something like that. But usually... I'm looking now; I can share my screen, if you want, just to show you. They just call it by a different name, baptize it with a different method. See, say, the account table: it's key-based incremental, so...
A
And then on our side we do a full load, as Dennis said, because of course of these hocus-pocus blockers; we don't want to mess with this. Maybe it's a good time to challenge that approach and try to do it somehow differently. As Dennis explained nicely, it can happen that something arrives late or something is screwed up, with low probability, of course, but we need to be prepared for that. That's just me thinking about the best use case, how to raise the bar and expose...
C
In this case, where we do an exploratory thing on dbt sharding, maybe for this one we can do it super boring and just schedule it four times a day, every six hours. But I'd also see this as a showcase for rolling this out any further. So from that respect I would say, yeah, indeed, don't over-engineer it, but...
C
I
think
if
we
want
to
do
TBT,
sharding
or
if
you
want
to
shout
out
the
Big
Deck,
what
we
have
right
now
run
it.
The
second
use
case
run
it
also
multiple
times
per
day,
to
give
our
consumers
data
more
frequently
I
think,
then
we
have
to
come
up
with
something.
What
we
don't
have
right
now,
because
a
a
a
a
more
frequent
load
on
the
models
that
we
have
right
now.
I
think
that's
not
doable,
because
it
already
runs
for
eight
or
nine
hours
and
if.
C
Do
it
that's
not
possible
so
from
a
shorty
perspective,
I
think
that
selector
mechanism
is
fantastic
because
it
gives
a
lot
of
flexibility
to
short
out
decks,
but
to
make
this
next
step
to
to
raise
the
bar
a
little
bit.
I
also
would
like
that's
my
opinion
here.
If
we
also
can
implement
something
that
leads
to
more
efficient
loading,
a
more
robust
loading
to
increase
the
frequency,
I
think
that
would
be
great
as.
A
For now the sharding plan is: how to shard one DAG for any provider. The second takeaway is to switch to the selector option and make it very, very flexible; then you open up various options to decrease the load, increase the load, do whatever you want, include, exclude. And the first takeaway, what we are actually missing here and want to agree on, is to raise the bar and provide a comprehensive Swiss knife for how to shard everything: that is, incremental load from full, and also late arriving dimensions, right. And this is how I see it. This is...
C
...my vision exactly, and a great example is gitlab.com data. Previously we had just one database instance where all the data was in; now we already have two main instances. If you want to combine those data sets, you're basically combining data from two different data sources, technically. So if one pipeline fails, we have this now: sometimes you also need to prepare for a situation where the models in dbt have to be robust enough to handle these kinds of situations, where a pipeline could fail, and now we have two instances.
C
Hopefully
in
the
future,
we
have
hundreds
of
thousands
of
instances
if
we
go
to
a
portion
post,
sharding
architecture,
where
we
have
thousands
of
database
instances
for
our
soft
platform
and
I
get
guarantee.
If
we
have
hundreds
of
ports
where
we
need
to
extract
the
data
from
some
of
them
will
fail
in
the
end
right.
Every
now
and.
C
...then. The more you have, the more the likelihood that something will break increases as well. So the data landscape will only get more complex, and the question now, of course, is, yeah: to what extent do we want to prepare ourselves for that? Well, that's basically a little bit up to you, I think. Indeed, the selector mechanism is already right; you can do good sharding with that mechanism.
A
I think selectors are kind of ready; we just need to implement this somewhere, but we know the mechanism, it's fairly simple, and we can simplify the process. As I said, the first takeaway is to establish a good process for how to do sharding: you have specific steps, like, we have a method and...
C
Because
that
that
selector
will
make
a
decision
and
putting
that
in
a
yaml
file,
I
think
that's
also
very
beneficial
for
the
analytic
Engineers,
because
right
now,
if
there
are
new
models
created
at
least,
if
I
do
it,
for
example,
for
certain
source
to
provide
the
data
in
a
workspace
model.
For
me,
it
is
unclear
how
I
can
scale
that
one.
A
...of them, also not in sharding but generally speaking. And the first pain point for us at the moment is how to find the optimal way for robust dbt load creation, in case something fails or something rapidly, radically grows, like from 1 to 100 databases, which is a really possible scenario in the next couple of years, right. We spoke with the Robinhood company then, as you remember; they have 30 databases which are fairly small, but you have a lot of components, and that also leads to a high probability that something will fail.
C
Maybe
the
fourth
takeaway
number
D
yeah
find
a
optimal
way
to
provide
high
frequency
loading
or
data
processing
in
DBT
and
optimal
weight,
provide
more
frequent
data
loading
or
data
processing.
I
think
we
should
call.
A
Because
I
I
connected
with
the
first
one,
but
just
my
practical
order-
how
to
put
here
so
I,
will
also
put
this
in
issue
thanks
for
help.
Dennis
you're
welcome,
Chris
I'll
just
put
everything
in
agenda
and
also
in
issue.
So
what's
our
next
steps?
Now
it's
holiday
season,
so
probably
a
few
weeks.
This
will
be
like
a
bit
and
after
that.