From YouTube: Prezi: Migrating Data Pipelines to Dagster
Description
Tamas Nemeth presents why and how Prezi migrated their production data pipelines into Dagster from a homegrown orchestration solution.
🎞 Slides 🎞
Prezi & Dagster (Tamas Nemeth)➡️ :
https://prezi.com/view/kveaLi8KasReSs4pyP5l/
🌟 Socials 🌟
Follow us on Twitter ➡️ https://twitter.com/dagsterio
Check out our GitHub ➡️ https://github.com/dagster-io/dagster
Join our Slack ➡️ http://dagster-slackin.herokuapp.com
Visit our Website ➡️ https://dagster.io/
Check out our Documentation ➡️ https://docs.dagster.io/
Thanks for the nice words. I'm super happy to be here and to show you what we did in the Dagster world: our journey to Dagster and how we migrated to Dagster at Prezi. So, let's start with how we started.
I think the data engineering team, the data team, started around eight years ago at Prezi. We started, I think, like most companies: with a bunch of shell scripts scheduled with cron. Of course, at some point, as the number of ETL jobs started to grow, we figured out that it wouldn't scale, so we had to come up with some kind of solution. That was about six years ago, and we looked around the open-source world
for an orchestrator that would work for us, and that's how we decided: okay, let's build our own. We call it Flowkeeper; that's what you can see on the screen. This is our homegrown orchestrator, and the main design decision behind creating a new one, rather than going with an existing one, was simplicity. That was one of the core requirements from our users.
What you can see here is a pretty simple JSON descriptor, where you can define the scheduling type (we have two types, daily and hourly schedules) and the inputs your job is using. As you can see there, you can give each input a friendly name, and there is a path you can define; in this case, it's an S3 path.
You can also define what kind of datasets your job will generate. So in this case, this job takes some S3 location as input and then produces a Redshift table. And you should know that these inputs and outputs are what we use to build up the whole dependency graph in our orchestrator.
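To make the descriptor concrete, here is a sketch of what such a job descriptor could look like. The talk only mentions a schedule type, inputs with friendly names and paths, outputs, a job type, and a tier; every field name and value below is illustrative, not Prezi's real schema.

```python
# Illustrative Flowkeeper-style job descriptor; all field names and
# values are assumptions for the sake of the example.
descriptor = {
    "name": "load_user_events",          # hypothetical job name
    "type": "redshift-load",             # one of the predefined job types
    "schedule": "daily",                 # daily or hourly
    "tier": 2,                           # lower tier = scheduled earlier
    "inputs": [
        {"friendly_name": "raw_events",  # name used inside the job
         "path": "s3://bucket/events/"}  # S3 location produced elsewhere
    ],
    "outputs": [
        {"friendly_name": "user_events",
         "path": "redshift://analytics.user_events"}  # Redshift table
    ],
}
```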
We did not go with the concept you can see in other orchestrators, where you define dependencies between two jobs by referencing job names. Here we went down the path where you only have to know which datasets you want to work with, and based on that, we figure out the dependencies and which jobs need to be connected.
So basically, if you said that your input is the S3 location you can see here, and we saw that another job generates that same S3 location, we connected the two jobs. That is how we set up the dependencies between these jobs. I think it's pretty simple, and it worked for us, because the users usually know which datasets they are working with, but they are not really aware of which job, or jar, produces them. We also defined a couple of predefined job types you could use; the example here is a Redshift load.
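The dataset-based dependency resolution described above can be sketched in a few lines: map each output path to the job that produces it, then connect any job whose input path appears in that map. This is a toy reconstruction of the idea, assuming the descriptor shape used for illustration earlier, not Flowkeeper's actual code.

```python
# Sketch of Flowkeeper's dataset-based dependency resolution: jobs are
# connected when one job's input path matches another job's output path.
def build_dependencies(descriptors):
    # Map each output path to the name of the job that produces it.
    producers = {}
    for job in descriptors:
        for out in job["outputs"]:
            producers[out["path"]] = job["name"]
    # For each job, look up which upstream jobs produce its inputs.
    deps = {}
    for job in descriptors:
        upstream = {producers[inp["path"]]
                    for inp in job["inputs"]
                    if inp["path"] in producers}
        deps[job["name"]] = sorted(upstream)
    return deps

jobs = [
    {"name": "export_events", "inputs": [],
     "outputs": [{"path": "s3://bucket/events/"}]},
    {"name": "load_events",
     "inputs": [{"path": "s3://bucket/events/"}],
     "outputs": [{"path": "redshift://analytics.events"}]},
]
print(build_dependencies(jobs))
# {'export_events': [], 'load_events': ['export_events']}
```

The benefit of this design is exactly what the talk calls out: users only declare the datasets they read and write, and the graph falls out of the descriptors.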
Basically, what a Redshift load does: you specify an input, and we load the input data into Redshift with the parameters you can see down there. We have a few job types, like Redshift load, Redshift transform (which basically runs a SQL script), Spark jobs, Python jobs, and a few others. We also defined tiers: every dataset is put into some kind of tier, which is basically its priority. What does that mean?
You can imagine having a bunch of datasets, especially if you have hundreds of datasets and hundreds of ETL jobs. Then it can happen that two jobs could run at the same time, but the resource you want to run them on can't handle running the two in parallel. In that case, you have to make sure the more important dataset will be ready earlier, and this is what tiers mean here.
The lower the tier, the earlier the job gets scheduled, if possible. Another thing I failed to mention: the job type (in this example, a Redshift load) also defines the resource we are going to use, in this case Redshift. Even in our homegrown scheduler we had these resource queues, where we basically made sure that you can't overload the resources a job is using.
You can imagine, I guess: if you have hundreds of jobs that can run in parallel, and you actually ran those hundreds of heavy jobs on Redshift, you would most probably kill it. So this was the state: this was our own scheduler that we built. And we built a nice, user-friendly UI, which is a pretty simple grid where you can see which jobs finished and what their status is, and if something fails, you can see that there as well.
So things were looking good, and it seemed like the users really liked it, and we ended up with a dependency graph like this. We had around 900 jobs, and if you have 900 jobs, you will face a few issues. That's why we were seriously considering whether we wanted to fix those in our current homegrown orchestrator or look for some open-source alternative. So why did we decide not to keep improving our homegrown orchestrator?
One thing is the maintenance overhead. The data engineering team is a handful of people at Prezi, so we did not really have the capacity to fully focus on working on the orchestrator. Another thing is the grid you saw before: you can only see the actual job that fails, but you can't really see the dependencies between the jobs. So if a job fails, you can't tell from the grid which other jobs are affected by that failure.
Another thing: Flowkeeper was running on one EC2 machine, and if that machine died, we were in trouble; we had to start a new machine and set everything up there. There were also problems because we were running all of our jobs on one machine, so it could happen that two jobs interfered with each other.
You can imagine one job generating too high a CPU load, or just eating up the disk space. Or even worse: you have users who start expecting that they can write to a temporary folder, and without defining a dependency between the two jobs, one job puts a file down there and the other one expects to pick it up. And of course, the infrastructure at Prezi is moving to Kubernetes, so our data infrastructure needed to move to Kubernetes as well.
We did not have much time to fully work on that, and it was written in a, you know, not very extendable way, so it was hard to add new job types, etc. And the last one is the lack of a testing environment. Before, when users wanted to test their jobs, they mostly had to log into one machine, copy their files over, and try things out from that specific machine.
We wanted to provide a way better user experience for them, and that was basically the time when we talked with the Dagster team. They convinced us to try out their tool and see how it works for us, and that's when we decided: okay, let's try to migrate to this new system. But of course, if you want to migrate to a new system, you don't want to rewrite all of your ETL jobs from scratch.
So our first requirement when we tried to move to Dagster was basically being able to keep our descriptors and use them to generate solids in Dagster. Basically, we had a car and we wanted to replace the engine with a way better, much more reliable engine. And this is what we did: we kept our job descriptors.
First of all, we took the job descriptor and started to generate solids from it. What does this look like? First, we generated a solid config (which, I now see, should be called a config schema). Basically, if you treat a solid as a function that has parameters, then the config schema describes those parameters and their types. As you can see here, we had the original JSON descriptor (you can see down there that it's a Redshift transform), and we generated a nice config schema for it.
What you can see on the right side is a screenshot from Dagster. For every descriptor type, we generate one specific solid, and that's why the schema is so strict: you can't change the Redshift transform here to any other type, because the inputs, and even the processing, wouldn't make sense. So, as you can see, you can only specify a Redshift transform there, together with all the parameters that can be used in a Redshift transform. In this case, for example, the SQL file parameter says which SQL file needs to be run on Redshift when this job runs.
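The idea of deriving a strict, per-job-type schema from the descriptor can be sketched in plain Python (this deliberately avoids the real Dagster `config_schema` API; the job-type-to-parameter mapping is an assumption for illustration):

```python
# Rough sketch: derive a config schema from a job descriptor, where each
# generated solid is locked to one job type. Parameter names are assumed.
PARAMS_BY_TYPE = {
    "redshift-transform": {"sql_file": str, "output_table": str},
    "redshift-load": {"input_path": str, "output_table": str},
}

def config_schema_for(descriptor):
    job_type = descriptor["type"]
    # The type is fixed: other job types wouldn't make sense for this solid.
    schema = {"type": job_type}
    for param, py_type in PARAMS_BY_TYPE[job_type].items():
        schema[param] = py_type.__name__
    return schema

print(config_schema_for({"type": "redshift-transform"}))
# {'type': 'redshift-transform', 'sql_file': 'str', 'output_table': 'str'}
```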
Of course, now you have a function and all its parameters, so you have a solid and its config schema, but you still need the actual values you want to pass in. These are the presets. We also generate the preset YAML from our JSON descriptor; if you check the right side here, that one is the generated one.
The left side is basically what is in our JSON, and as you can see there, we generated a nice preset where we pre-fill all the values that are in the JSON descriptor. Later on, in the playground, you can of course change it if you want to do a test run, but basically, you don't have to do anything: we pre-fill it for you.
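A minimal sketch of what generating such a preset from the descriptor might look like; the nesting shape loosely follows Dagster's solids/config run-config layout, but the exact keys here are assumptions, not Prezi's generator:

```python
# Sketch: build the pre-filled preset values from a job descriptor, so
# users never have to type config by hand in the playground.
def preset_from(descriptor):
    return {
        "solids": {
            descriptor["name"]: {
                "config": {
                    "type": descriptor["type"],
                    "sql_file": descriptor.get("sql_file"),
                }
            }
        }
    }

preset = preset_from({"name": "daily_rollup",
                      "type": "redshift-transform",
                      "sql_file": "rollup.sql"})
```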
Basically, the solid body is predefined by us. It gets all the properties from the solid presets, and based on those, we decide what kind of job type we need to run; so if it's a Redshift transform, then we will run a Redshift transform, and we do some other steps as well. In the solid body, basically, what we do is check the inputs and do the actual job execution.
For the solid inputs: in your SQL, you can use the friendly name of an input, and then we replace the friendly name in your SQL with the actual table names.
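That friendly-name substitution can be sketched as a simple string replacement. The `{name}` placeholder syntax here is an assumption for illustration, not necessarily Flowkeeper's real convention:

```python
# Sketch of the friendly-name substitution the solid body performs:
# users write SQL against an input's friendly name, and the actual
# table name is swapped in at execution time.
def resolve_sql(sql, input_tables):
    # input_tables: friendly name -> fully qualified table name
    for friendly, table in input_tables.items():
        sql = sql.replace("{" + friendly + "}", table)
    return sql

sql = "SELECT count(*) FROM {user_events}"
print(resolve_sql(sql, {"user_events": "analytics.user_events"}))
# SELECT count(*) FROM analytics.user_events
```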
Now you have a nice solid, with the config, the presets, and the body, but you still have to define the dependencies between the solids, and what kind of inputs and outputs there are. Here as well, we use the JSON descriptor, and as you can see there, we generate a typed input. In this case, because it's a Redshift table, we generate a Redshift Flowkeeper table type for it.
A
Basically,
what
we
do
we
do,
the
same
depends
and
dependency
set
up.
What
we
what
I
mentioned
earlier,
basically
based
on
the
inputs
and
outputs
output
paths
and
and
table
names,
we
look
up
which
job
generate
that
and
we
do
the
connection
between
the
solids.
Based
on
that-
and
here
you
go,
there
is
a
nice
small
pipeline
defined.
A
And
last
but
not
least,
we
also
add
some
solid
metadata
which
not
needed
for
the
solid
itself,
but
it's
more
like
like
dexter
as
the
orchestrator,
and
also
because
we
want
to
add
some
nice
tagging
onto
these
solids.
So
just
a
few
examples
here
when
we
set
the
max
retries.
Basically
this.
This
is
what
what
which
says
that,
how
many
times
we
want
to
retry
a
failing
job
before
failing,
actually
and
and
stopping
retrying,
and
also
we
set
the
tier
here
and
based
on
the
this
tier.
A
A
Or
for
the
resource
cues,
so
now
we
have
a
nice
solid.
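A sketch of what generating those tags might look like. The `dagster-celery/queue` key is, to my knowledge, the tag the Celery executor actually reads for queue routing; the other keys and the type-to-queue mapping are illustrative assumptions:

```python
# Sketch: derive the solid tags described in the talk (retries, tier,
# and the resource queue implied by the job type).
QUEUE_BY_TYPE = {"redshift-transform": "redshift",
                 "spark": "hadoop",
                 "python": "python"}

def tags_for(descriptor):
    return {
        "max_retries": descriptor.get("max_retries", 3),  # retry limit (assumed default)
        "tier": descriptor["tier"],                       # lower = scheduled earlier
        "dagster-celery/queue": QUEUE_BY_TYPE[descriptor["type"]],
    }

print(tags_for({"type": "redshift-transform", "tier": 1}))
# {'max_retries': 3, 'tier': 1, 'dagster-celery/queue': 'redshift'}
```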
We could do this for one specific job, but we wanted to make this migration as painless as possible: basically, transparent to our users. To give you an example, we wanted to see our daily pipeline as one huge pipeline. That was one of the requirements from the beginning, because what we saw with other orchestrators is that they start to have problems if you have hundreds of jobs, or solids, in one pipeline, and that's why you basically have to split your pipeline into multiple pipelines and make the connections between those pipelines.
But the problem with most of these tools is that you can't really see the connections between the pipelines, and that was one of the reasons we really wanted to keep everything in one place and not have it taken apart. Another thing, of course, is that in the current state we are not really able to do this splitting anyway: we have 900 jobs now, and it would take a significant amount of time.
There is this nice selector where you can select just a subset of the pipeline, which can be super useful, especially if you are trying to understand your pipeline, or if you want to change some job and are interested in which other jobs could be affected by that change, or even if you are doing some kind of debugging where you are interested in what else could be affected if this job failed.
So I think it's a pretty cool thing. Now we have all of the jobs, we can generate solids from our jobs, and all of those solids can be loaded into Dagster.
And here is the workflow we came up with for how you develop a new ETL job. Basically, the workflow is the following: you, as a user, start working on your new shiny ETL job. You start the local development environment, which is basically Dagster running locally in Docker, and there you can start working on your job and testing it, and you can even access services with your own credentials.
When you are happy with your job, you create a pull request in GitHub, somebody reviews it, and in the meantime, Jenkins runs a check on the job as well. What we are actually checking relies on another, I think, pretty nice feature in Dagster: modes. You can create multiple modes, and we introduced a test mode which doesn't actually touch any of the resources; what it does is run the whole pipeline and basically check things.
It checks whether there are circular dependencies, whether there are any config issues, and whether we are able to run the whole pipeline without running on the actual resources, which is cool.
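One of the structural checks mentioned, circular-dependency detection, can be illustrated with a small depth-first search. Dagster performs this kind of validation itself when a pipeline is constructed; this is just a toy reconstruction of the idea, not Prezi's Jenkins check:

```python
# Toy cycle detection over a job dependency graph, the kind of structural
# check a no-resources test run can perform.
def find_cycle(deps):
    # deps: job name -> list of upstream job names
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True  # back edge found: circular dependency
        visiting.add(node)
        if any(visit(up) for up in deps.get(node, [])):
            return True
        visiting.discard(node)
        done.add(node)
        return False

    return any(visit(n) for n in deps)

assert find_cycle({"a": ["b"], "b": ["a"]}) is True   # cycle
assert find_cycle({"a": ["b"], "b": []}) is False     # acyclic
```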
If that check passes, then you can deploy. We are using the Kubernetes executor, specifically the Celery Kubernetes executor. So what happens in this case? In the end, what you do is basically just commit a JSON file into a repo, and based on that, we run everything.
Then, when you start a run, or a new pipeline run is scheduled, these jobs go into Celery, into various resource queues. We defined a separate queue for Redshift, for Presto, for Hadoop (that's where the Spark jobs run), and for Python. Basically, in this way, we can make sure that when jobs are executing from the Redshift queue, only, say, five parallel jobs are running, and it can't happen that we overrun the Redshift cluster so that no one else can query it, which would not be a good thing.
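The per-queue throttling can be illustrated with a toy counter. In production this limiting is done by Celery worker concurrency, not application code, and the limit of five is just the example figure from the talk:

```python
# Toy illustration of per-resource-queue throttling: only a fixed number
# of jobs may run concurrently per queue (e.g. five on the Redshift queue).
from collections import defaultdict

class ResourceQueues:
    def __init__(self, limits):
        self.limits = limits             # queue name -> max parallel jobs
        self.running = defaultdict(int)  # queue name -> currently active

    def try_start(self, queue):
        if self.running[queue] >= self.limits[queue]:
            return False                 # queue saturated, job must wait
        self.running[queue] += 1
        return True

    def finish(self, queue):
        self.running[queue] -= 1

q = ResourceQueues({"redshift": 5})
started = [q.try_start("redshift") for _ in range(6)]
print(started)  # [True, True, True, True, True, False]
```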
And another benefit of running on Kubernetes is that all of these jobs run in separate pods, which is nice, because jobs can no longer interfere with each other: if a job uses too much memory or CPU, or whatever, its pod gets killed, but the other jobs can keep running, which is cool. Another cool thing in the Celery executor is that all the prioritization is there; it even respects our priority settings, which is super nice. And we also got a few nice additional benefits from using Dagster.
One is the nice data lineage visualization that I showed you before. The other one is pipeline performance monitoring, which is pretty nice, because most of the time, if it turns out your pipeline is running slower than expected, you are interested in why, and in which solid your job runs longer than before. Maybe somebody committed a change there which caused it, or maybe there is some issue with your Hadoop cluster, or whatever.
Another thing is easier pipeline debugging. I think it's a pretty nice UI, where you can basically see the logs immediately, and you have this nice filter as well for narrowing down what you are looking at, and the solid selectors, where you can see only the portion of the pipeline you are really interested in.
And the testing capability is super nice. Actually, that's what I showed you in the GitHub and Jenkins example. It's super cool, and we can make sure that we are letting way less garbage in by running the whole pipeline in a test run, and of course, there is this nice type and config checking, which comes automatically.
We need to do more extensive user testing and also onboard all of our analysts, and in this way, we can basically speed up the migration. Actually, that's what we are currently working on: some kind of migration guide that we can hand over to them, so they can migrate their own jobs on their own.
Another item is improved backfill capabilities. I was super happy to see that there will be a bunch of improvements around that; we would really like to see them. This is something we are working on improving, together with introducing better quality checks.
So currently, as I told you, the quality checks are basically: is there a file or not, is there a table or not, is there at least one row in the table or not.
A
We
would
like
to
introduce
more
sophisticated
quality
checks
as
well
later
on
and
last
but
not
least,
thank
you
dexter
team.
I
think
it's
super
nice
and
I
I'll
be
really
happy
with
the
cooperation
and
all
of
these
things
what
you
implemented.
I
think
it's
super
nice
and
I
we
started
to
work
on
this
almost
a
year
ago
and
when
it
was
the
extra
year
again,
but
now
it's
incredible
where
you
get
there
and
I
think
you
are
like
really
in
a
ludicrous
mode,
so
releasing
new
features.