From YouTube: Data Pump Demo for Salesforce Team
Description
A demo from data engineering reviewing the data pump framework, discussed as a solution for a Snowflake-to-Salesforce integration.
A: Great, so let's go ahead and jump in. We're going to jump into a working example of how data pumps work today. All right, so the list that I have here in the agenda is — let me make this bigger — we're going to start with the model, so you can kind of see how this corresponds to this list. So first we're going to look at a model. (I broke my thing.) So this model is here. This is our dbt documentation.
dbt is a tool that we use to do all of our modeling. We also use it for our docs. So here are the dbt docs for the pump_marketing_contact model, which is the model that we're using to pump data from Snowflake to Marketo. You can see we have the column names here and what the data type is, and the way that we've been using the docs is that we use this description field to tell you what field we're mapping to, right?
So this way the integrations team that Daniel is on has a reference point when they're building out those fields and those mappings. If we're working with someone like Jack or Jim on the systems side, or the technical owners' side, they have a source to see what we're building and how we're imagining a mapping, and they can help us contribute to that mapping. So here we have a documented, version-controlled record of it.
This exists in the YAML file, also in our public repository for the data team, which we can point to later if you guys are interested. But yeah, this is the model, let's say — and if you go into Snowflake, you can query pump_marketing_contact, and this is the SQL that generates it, etc.
You can even, if you really care to see how our warehouse works, see a lineage graph of where the data comes from. So we're trying to leverage and resurface as much of what we already have in place as possible. So this is the model. Then the next step is in pumps.yaml.
We have this pump_marketing_contact model record, and then we have some attributes. We have a timestamp column: this is the column that the pump framework will use to increment the data. So if I want to query for everything that's changed since the last time I got data up to now, for this model we're going to use the last_changed column.
If there isn't an incrementing timestamp column — meaning we always want to get all the data every time we query — we would just put a null here and data pump will handle that. If the data is sensitive — in this case it is, and I think in many of the cases where we're going to a CRM we'll probably have some sensitive data as well —
we're going to mark that as true, and what that means is that this will look at the sensitive schema. So in the process here — we'll go through this in more detail in a second — the first step is to create a data model. Once that's done, or during that process, we'll evaluate what the data is like. If it's sensitive data, then as part of that model creation process we make sure it goes to the right place in the warehouse, right?
So if it has personal information in it, we have to be careful about where it goes. We have a dedicated schema in our warehouse for that sensitive data; it'll go there, and then this flag just tells the data pump that's where it is. This owner field is not handled at all by the data pump framework or by what's happening in the orchestration; it's just for documentation in this record.
So if something goes wrong, we want to know who to reach out to, and it's in the same place as this other information, so it kind of keeps it all together. That's that — any questions about pumps.yaml?
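(For reference, a record in pumps.yaml along the lines described above might look like the following sketch; the exact field names are illustrative, not taken from the actual file.)

```yaml
# Hypothetical pumps.yaml entry -- field names are illustrative,
# reconstructed from the attributes described in the demo.
pumps:
  - model: pump_marketing_contact
    timestamp_column: last_changed  # set to null to re-pull the full table each run
    sensitive: true                 # read the model from the sensitive schema
    owner: data-engineering         # documentation only; ignored by the framework
```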
A: Once it's in pumps.yaml, it'll be available in the Airflow DAG. So this is what Airflow looks like: each of these squares is a task instance, so each one is a time that it runs. You can see that we've been running this pump_marketing_contact — and this pump_hash_marketing_contact, which is a hash-de-identified version for testing — for some time. Just last week, Michael Walker on the engineering team added a product usage pump. This could be what we use for Salesforce, by the way.
I think this is the data that we're actually interested in; we'd just have to plug it in. So we can start using that to test if we'd like. Once it's in here, this is going to be running. We can look at a log here, and under the hood it's going to... I don't know if it's really obvious here what it's doing — maybe not!
It's a joke, yeah — just assume! I guess I could make the log verbose, but I don't think we benefit much from that. So here's the S3 bucket where it's landing, right? This one's actually from previous testing, so this isn't getting anything — actually, this might still be getting stuff; I have another task running in the background from testing that does this.
We have the marketing contact, and then we have this new one that Michael just added, pump_subscription_product_usage, and it's just going to spit out CSVs. If you're curious, the name here is the query ID from Snowflake, and it'd be fairly easy for us to add anything to the object naming here.
So one of the things we'll talk about at the end is that we need to add some testing, development, and monitoring to this as well. Part of the development that we'll probably add is a separate object name in S3 for when we're in a testing or development phase, right? So these are some things we haven't added yet, but we will — and it's really easy to add.
So let's say we wanted to add the target to the object name in S3. Let's say pump_subscription_product_usage is relevant to a bunch of places — which it might be — and maybe we want to split that out into separate files for some reason. Maybe we don't, but maybe we could. We could quickly add, say, a target variable to the pump.yaml and then feed that into the way that this name is constructed. It's all happening via Python, so it's pretty straightforward to make those changes.
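(A minimal sketch of how the object naming could fold in a target variable, as described; all names here are hypothetical, since the real naming code isn't shown.)

```python
from typing import Optional

# Hypothetical sketch of the S3 object naming described above. The
# framework names objects by Snowflake query ID today; a "target"
# variable from pump.yaml could be folded into the key like this.
def build_object_key(model: str, query_id: str,
                     target: Optional[str] = None, env: str = "prod") -> str:
    """Construct the S3 key for one pump run's CSV output."""
    parts = [env, model]
    if target:
        parts.append(target)  # e.g. split one pump's output per target system
    parts.append(f"{query_id}.csv")
    return "/".join(parts)

# build_object_key("pump_subscription_product_usage", "01a2b3c4", target="salesforce")
# -> "prod/pump_subscription_product_usage/salesforce/01a2b3c4.csv"
```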
Now, after that point, like I said, Workato can read S3; it's got permission. I just have a link here to all the integrations that are currently supported by Workato. Anyone who's ever used a tool like Workato before — whether it's MuleSoft or Workato or, what's the other one, Informatica —
knows that not all of these are created equal, right? So even if something's in this list, it doesn't necessarily mean that we could do it the way that we'd hope; this is just a list of what's there. Daniel Parker's on the call — he's the technical owner for Workato at the moment. Thanks for joining, Daniel. If you guys have questions about Workato, we can jump into that now, or if you kind of get it, we can keep going.
B: Well, I have a whole slew of questions about the CSV output into the S3 bucket, but that's kind of — you know what, hold on, Daniel might be able to answer something by showing off. Daniel, how are you ingesting those CSVs — the format and the timing? It wasn't like one per day, or that's what it looked like. How are you ingesting those into Workato and doing anything with them? If you want to show it off, that'd be amazing.
A: You're already sharing, so you can go first, and then we'll jump to mine. — Yeah, so the way that Workato works, the last time I set something like this up (and I imagine this hasn't changed), is that Workato basically lists the objects in the bucket at some interval — I think it's like five minutes or something — and then if it notices a change, so it knows this is a new object, then it will go run — I guess it's the download to get the file, or the read — and then it'll work with that batch that way. All right.
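(The trigger behavior described here is roughly the following polling pattern; this sketch is illustrative only and is not Workato's actual implementation.)

```python
import time
import boto3

# Illustrative sketch of the S3 polling pattern described above
# (not Workato's actual code): list the bucket at an interval and
# treat any unseen key as a new object to process.
s3 = boto3.client("s3")
seen = set()

def new_objects(bucket, prefix):
    """Return keys that have appeared since the last poll."""
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    keys = [obj["Key"] for obj in resp.get("Contents", [])]
    fresh = [k for k in keys if k not in seen]
    seen.update(fresh)
    return fresh

while True:
    for key in new_objects("data-pump-bucket", "prod/"):
        print(f"new object: {key}")  # a real recipe would download and process it
    time.sleep(300)  # roughly the five-minute interval mentioned
```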
B: So yeah — well, Dave's got... maybe I can ask, because I'll just drill in while we're right here. So currently, you know, the way we're doing it again is a snapshot model, which is: we have the complete data set and we're snapping it once a week — just an arbitrary amount of time; that's what made sense. Again, it's a manual process. So when this runs — Justin, before, you said that it's actually an appending process — when we clicked into these CSVs, is that a portion of the data set, or is that a snapshot of the data we —
A: That would be determined by two things: the model that's created, right — so what's in that model. For example, we could have a model that, you know, replicates data — so every time, you've got a timestamp and this is all the data for yesterday — you could have rows in that same table for the same data, but what it was yesterday. So that could be one way.
The other way is it could just be the current state of the table. And then what we would do — let's see if I still have it up — in pumps.yaml here, again, we have this timestamp column. So if you give a null value here, then our pumps job, or our pumps framework, will just say: well, there's not one, so I'm just going to query the whole thing. And I can show you really quickly what that looks like; it's just in this module.
What is it... well, actually — never mind, keep going, I'm interested. Yeah, so basically all this is doing here is generating a COPY command, which is the command you'd run in Snowflake. And if it doesn't have this timestamp column — where is it? right here — if timestamp equals None, then it's just select star from the table, right? If it does have a timestamp, then it will add the WHERE clause, which would have the time frame.
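(A minimal sketch of the COPY-generation logic just described, with hypothetical names — the real module isn't shown in full here.)

```python
# Hypothetical sketch of the COPY-command generation described above.
# Table name, stage path, and parameter names are illustrative.
def build_copy_command(table, stage, timestamp_column=None, last_run=None):
    """Generate the Snowflake COPY INTO command for one pump run."""
    query = f"select * from {table}"
    if timestamp_column is not None and last_run is not None:
        # Incremental pull: only rows changed since the last successful run.
        query += (f" where {timestamp_column} > '{last_run}'"
                  f" and {timestamp_column} <= current_timestamp()")
    # COPY INTO <stage> unloads the query result to S3 as CSV.
    return (f"copy into {stage} from ({query}) "
            "file_format = (type = csv) header = true")
```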
A: Just imagine Snowflake is a database like any other database — like MySQL — and you can have arbitrary whatever in it, right, whether it's good or bad. And because that's arbitrary is the reason why, in pumps.yaml, we make you tell us what the name of the timestamp column is — because you could name it, you know, gerald, and maybe gerald is the column that has the last time the data changed, or when you want to pull — what it really means is when you want to send it in the pump. You just tell us the name of that column, and as long as it's a date or timestamp field, it'll work. Yeah.
B: And you have the ability to say: listen, we'd like a full snap of the data at an interval. And the only thing that S3 kind of fails at is that the only clue we have to say, like, "this data, this snap, is for this date" is hiding it in the title, right? I know you can append — can this append metadata to anything, to the files dumped into S3, or do we have to rely on the title?
A: It depends on what you mean by metadata. I mean, there are a few things in what you're asking, and maybe we could do this in a separate conversation, because it could take a while to get through this. But basically, what's happening is: the way that we can tell what we're sending and when would be a combination of —
B: How do we say that this is — OK, now it's a two-parter. Let's say it's a single table; the goal is just to get that table into a CSV where someone can consume it up on S3 — awesome. We just need to know metadata about what this is, such as: is it complete? when did it fire? even the rows — obviously we'd get that in about two seconds reading it, but, you know, any —
B: Almost positive — and again, we're using the real example of us firing off an Apex job that can go up to S3 and talk. I would hope we were able to get at the metadata; that's why I asked. I would much, much rather have it there than, like, parse a CSV title. Sadly, in the past I've, you know, resorted to the latter. We can't — no, thank you. Can your code that pumps into S3 set this metadata?
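(For what it's worth, S3 does support user-defined object metadata set at upload time; a sketch of how the pump's upload step could attach it, assuming boto3 and hypothetical key names.)

```python
import boto3

# Illustrative sketch: attach user-defined metadata when uploading a
# pump CSV to S3. Bucket, key, and metadata fields are hypothetical.
s3 = boto3.client("s3")
s3.upload_file(
    "/tmp/01a2b3c4.csv",
    "data-pump-bucket",
    "prod/pump_marketing_contact/01a2b3c4.csv",
    ExtraArgs={
        "Metadata": {  # stored as x-amz-meta-* headers on the object
            "snapshot-date": "2021-06-01",
            "complete": "true",
        }
    },
)

# A consumer can read it back without downloading the file:
# s3.head_object(Bucket="data-pump-bucket", Key="prod/...")["Metadata"]
```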
B: So it'd go off when it arrived — and that's okay, that's okay, we can roll with that. It's like, "our data, as of this date" — it's a fact. Awesome — that brings me to my next question. You mentioned this very quickly: for data that might be dependent upon even intervals, right — if there was a problem, or if there was something where people are expecting it to come in on a certain day and it's not there — it's not all the time, it's not always, right — what would we, you know...?
A: There are two ways — there are two parts of this; I can talk to one part and Daniel can speak to the other part. The ideal way — the way that we want to do this — is we want to keep Workato and S3 as agnostic staging intermediaries, and keep the mappings within Airflow. What would happen is: if there was a failure in Airflow, this would be red, and we would get an error that would pump out to our monitoring.
A
It
would
show
up
enough
for
us
in
slack
it
would
be
up
to
whatever
data
engineer
was
triaging
that
day
they
might
end
up
sending
it
to
me
at
this
point,
since
this
is
new
to
go
and
resolve
it
and
fix
it,
and
then
it's
really
easy
for
us
to
rerun
these
jobs,
whether
that
be
through
this
interface
or
I
can
also
exec
into
the
container.
This
is
running
on
and
I
can
run
a
command
that
can
give
me
like
a
window
of
like
run
all
of
the
tasks
between
these
time
frames.
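(Rerunning a window of tasks like this is typically Airflow's backfill command; a sketch — the DAG ID is hypothetical and the exact flags depend on the Airflow version in use.)

```bash
# Re-run every task instance for a DAG between two dates
# (Airflow 2.x CLI; the DAG id "data_pumps" is illustrative).
airflow dags backfill --start-date 2021-06-01 --end-date 2021-06-07 data_pumps
```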
A
That
would
be
the
ideal
way
now
there's
a
possibility
when
this
is
very
new,
so
we're
not
sure
exactly
how
it's
going
to
work
out,
we're
confident,
positive
or
confident
and
optimistic,
but
there's
also
some
possibility.
We
don't
know
what
it
is
that
everything
works
well
in
airflow
and
in
snowflake
and
an
s3
but
workato
for
some
reason
has
a
problem.
Maybe
it's
with
the
api
or
something
else
sure
absolutely
rocado
does
have
a
way
to
do.
Error
handling
and
they've
got
some
retry
setup
over
there,
but
I'll.
B: — that would create this, but yeah — oh, there are about a billion things that can go wrong. All right, excellent. So — is there a thought here, just while we're on it? And then, if we want to, we can go on to snapshot versus appending. Obviously we have a live data set that's appending inside S3, meaning when we show up and ask for it, we do the snapshotting, if you will, on our side — meaning that's when we pulled it.
B
The
data
set
updates,
updates,
updates,
that's
when
we
pull
it
here
and
so
this
now
the
the
consuming
system
is
now
the
only
keeper
of
those
slices
right,
because
it's
just
one
appending
table
up
on
s3.
B: Right — and so, what I'm saying is: Workato is going to move the data; I don't think it's going to store the data.

A: It will, for a time — Daniel can speak to it better than I can.

B: Okay, neat — whatever that thing is that goes through there. Okay, but that's not going to be the new keeper of the snapshots.
B
What
I'm
saying
is
is
that
if
we
did
it
where
s3
is
constantly
just
here's,
the
most
recent
data
set
that
you
can
get
at
mr
consumer
consumer
awesome
is
that
now
that
if
the
consumers
are
using
it
in
the
snapshot
model,
not
just
like
a
live,
oh
here's,
the
data,
but
we
need
it
for
trending,
a
very
important
part
of
a
lot
of
the
stuff
we're
doing
is.
We
need
to
know
that
answer
at
a
specific
point
of
time.
So
therefore
we
need
snaps.
A: So, in my opinion, if we had a more rigorous need for auditability and we wanted to do some sort of snapshots, I would actually prefer to handle that up in the model in Snowflake, right? So the idea would be: hey, maybe we do a snapshot model in the sense of, like, we're always sending all the data that we have available today, but maybe we still send that on a timestamp, and what we have in that model is the actual full history, right? And then that would give us full auditability.
So we would have the ability to do that based on the model we had there, but — in my opinion — I want to keep that out of the orchestration and the data push here.
B: Which — and I love that answer — my question: currently in Snowflake, if someone asks, "I need to know the number of licenses we had 10 days ago," how is that solved by the — ?
A: Yeah, so there are two ways: there's, like, the emergency way, and then the way where we knew about it ahead of time, so we were supporting it, right? Yeah — so Snowflake... let's —
We have — so dbt, our transformation tool, has built-in capabilities for snapshotting the data. So we basically just point the snapshot feature at the model we want snapshotted, and then every time you run the job it's like: hey, just keep copying, snapshotting, and time-stamping, and there's a valid —
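(A minimal dbt snapshot sketch of what "pointing the snapshot feature at the model" looks like; the snapshot name and unique_key here are illustrative. dbt adds dbt_valid_from/dbt_valid_to columns giving each row its validity window — likely the "valid" being referred to above.)

```sql
-- Minimal dbt snapshot sketch; name and unique_key are illustrative.
{% snapshot pump_marketing_contact_snapshot %}
{{
    config(
        target_schema='snapshots',
        unique_key='contact_id',
        strategy='timestamp',
        updated_at='last_changed'
    )
}}
select * from {{ ref('pump_marketing_contact') }}
{% endsnapshot %}
```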
B: Going forward — because I love the idea of solving it in the model; love it — there could be things where, hey, we see it differently, things change, and what we don't want to do is become the keeper of what is effectively, you know, the license data that we have just moved over into Salesforce. If we're the only ones with the historical data, people are like, "How did you get this number?" and we'd be like, "It's —"
B
What
was
on
the
s3
bucket
on
that
thing,
right,
yeah
that
that
that's
the
best
answer
we
got,
which
I'm
not
saying
that's
a
terrible
answer.
You
know
we're
we're
doing
this,
to
attempt
to
inform
our
end
users,
the
sales
folks
on
how
to
best
serve
our
our
customers,
it's
all
for
for
good,
but
the
the
it
person
in
me
deep
down,
says
like
okay,
that
there's
a
issue
where
we're
now
the
keeper
of
data
that
we
don't
own
and
we're
the
only
one
with
the
copy
of
it
and
that's
bad
yeah.
We
could.
Workato is obviously more on-rails — I don't want to call it low-code, but it could be called a low-code-style thing. So I'd love to see how you get around all the humps and hurdles Justin just talked about — like, how does it handle that? I know how a MuleSoft would handle it, because I have experience with that, but Workato — how does that work?
A
Well,
given
that
we're
at
time
now,
maybe
it'd
be
worth
having
a
separate
meeting
and
I'm
happy
to
help
or
be
involved
with
daniel
parker
he's
he
and
his
team
he
gruner
went
through
and
they
set
up
a
pretty
smart
retry
framework
for
how
they
want
to
handle
this
in
general
in
work,
auto
and
obviously
like
you're
kind
of
pointing
at
instead
of
building
it
ourselves.
B
All
right,
excellent,
we
are
at
time
we'll
go
in,
did
you
want
to
stay
on
and
we
could
we
could
catch
up?
I
haven't
talked
to
dan
in
a
long
time
cool
all
right.
Justin.
Does
this
conclude
the
recording
yeah?
I
think
so.
Yep.