Description
Slides:
0.9.0 Slides: https://docs.google.com/presentation/d/17ts4HfRUEqLbBF52DZ66G1QHI1PutK2MnY_cnlTWbug/edit#slide=id.g9380d0b962_0_0
0.10.0 Slides: https://docs.google.com/presentation/d/1qwV2i_4wp-72HsCeQza1ZEP02QPoz8KLIC4czcNzw5s/edit#slide=id.g94347a5a1e_0_14
Prezi Slides: https://prezi.com/view/kveaLi8KasReSs4pyP5l/
A
Alrighty, I think we'll get started here. I assume everyone can hear me. Thank you, and welcome to the first Dagster community meeting; we're planning on making this a monthly affair. We want to keep it pretty lightweight. For today we actually have quite a bit of prepared content, but as we go on, we want to make this more community-driven and lightweight.

A
So we'll just give updates on what's going on, we can get feedback from you, and we can answer any questions. We're also interested in getting more community-created content, so if you're interested in speaking or want to share what you've been working on, please feel free to do so. So everyone can see, I'm doing an old-school slide.
A
Max is going to talk about 0.9.0 and what we think are relevant features for you; feel free to give feedback, and we'll do Q&A at the end. Then Sandy is also going to talk about our plans for 0.10.0: we're doing these major releases every few months, and we're targeting that for mid-November. And then Tamas from Prezi is going to talk about how they're using Dagster at Prezi and scaling it on Kubernetes. Then we'll wrap it up and have time for Q&A. So, a loose agenda.
A
Oh, by the way, I should probably introduce myself: I'm Nick Schrock, I'm the founder of Elementl. Thank you all for coming and for being users; if you're not users, thank you for coming anyway. And with that, I will hand it off to Max, who's going to talk about relevant features from 0.9.0 and what we've been working on the last couple of months.
B
Hi everybody, thank you for coming. My name is Max, I'm an engineer at Elementl, and I want to talk a little bit about 0.9.0, what we did, and introduce some of the new features.
B
Our 0.9.0 release, codenamed "laundry service," was a shorter release than 0.8.0 before it, or than 0.10.0 after it is going to be. We focused on cleanup and hardening. I think we should mention at the top level that we've officially dropped support for Python 3.5, for two reasons: no one in the community was using it, and supporting 3.5 is actually more work than supporting Python 2. We've also added full support for Python 3.8, which is largely thanks to PySpark finally being 3.8-compatible.
B
I want to give a short shout-out to all of our community contributors: thank you so much; contributions are ramping up. If you would like to contribute, please let us know, and we can help you do that and make it easier for your contributions to merge back into the code base. I want to start with some of the internal work that we did in 0.9.0. The biggest thing is much better user code isolation.
B
If
user
code,
like
the
code
inside
a
solid
definition
or
inside
a
resource,
init
function,
could
not
crash,
the
framework
could
not
crash
tag
it
even
if
it
said
false,
and
so
we've
done.
We've
we've
isolated
user
code
at
the
process
level
and
now
all
the
code
that
you
write
in
like
a
solid
definition
or
any
other
function.
That's
executed
by
the
framework
is
executed
in
a
separate
process
and
that
process
communicates
with
the
framework
over
grpc
for
inter-process
communication.
B
So this is pretty exciting. It makes operating the framework much more robust. It also opens the door to a couple of cool infrastructural things that are coming down the pipe, or in some cases are already here. The first of these is executing user code in containers. That's something that is already present on master and that you can start playing with; the containers will communicate with the framework over gRPC.
B
In the future, there's the possibility of doing more extravagant things, like writing Dagster-native code in other languages. For instance, you could have a Scala solid, but as long as it communicated over gRPC with the same protocol, it would still be possible to execute that user code alongside the Python code that you expect. So that's a pretty neat internal change that's going to enable a bunch of features coming down the pipe.
B
We also did a lot of work to harden the Kubernetes deploy strategy, not because Kubernetes is the blessed or only way to deploy Dagster, but because we think it's important that people be able to deploy in a robust way on Kubernetes. That starts with a bunch of small improvements and bug fixes: the Helm deploy is much more flexible, and the scheduling is much more stable.
B
We're building towards seamless, no-downtime user code deploys; some of that has already landed, and some of that is coming in 0.10.0, and I think Sandy will talk a little bit more about that. We've also separated out some of the code into separate packages. Previously, concerns were a little bit mushed together in dagster-celery between using Celery as a run queue, using Celery to launch user code in separate containers, and the Kubernetes-specific stuff; that has all been separated out.
B
We have also introduced support for Kubernetes-native cron, so if you're running on Kubernetes, you no longer need to use system cron. I want to talk about two user-facing API changes as well that we think are pretty exciting. The first of these is experimental: it's hooks. We've had a lot of demand for the ability to run arbitrary code on solid success or failure, and hooks are a way of doing that.
B
So here's an example of writing little hooks that, on failure or success of a solid, send a message to Slack. Hooks like this can be attached either to pipelines or to specific solid invocations inside of a pipeline. This is still an experimental API, and it's very much in flux; we'd really appreciate you playing around with it and letting us know how, or if, the API needs to change. I'm very excited to see what people do with this; it's going to be very powerful.
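For reference, here is a minimal sketch of that kind of hook. The hook, the solid, and the slack resource below are stand-ins, not the exact code from the slides:

```python
from dagster import ModeDefinition, failure_hook, pipeline, resource, solid

@resource
def slack(_):
    # Stand-in for a real Slack client resource (e.g. dagster_slack.slack_resource).
    class FakeSlack:
        def chat_postMessage(self, channel, text):
            print(f"[{channel}] {text}")
    return FakeSlack()

@failure_hook(required_resource_keys={"slack"})
def slack_on_failure(context):
    # Runs whenever a solid in the decorated pipeline fails.
    context.resources.slack.chat_postMessage(
        channel="#pipelines", text=f"Solid {context.solid.name} failed"
    )

@solid
def do_stuff(_):
    raise Exception("boom")

@slack_on_failure  # pipeline-level: fires for every solid in the pipeline
@pipeline(mode_defs=[ModeDefinition(resource_defs={"slack": slack})])
def my_pipeline():
    do_stuff()
```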
B
The second feature I want to talk about is the configured wrapper. This is a very flexible idea, but basically, if you take some configurable object, like a solid, you may have chunks of config that you really don't want to repeat every time that solid is invoked. You want to package the config up with the object; the same goes for resources and the other configurable objects in the Dagster API.
B
So
what
configured
lets
you
do
is
is
exactly
that
you
can
take
some
solid
and
some
chunk
of
config
package
them
together
and
then
that
basically
constitutes
a
new,
solid
definition
that
you
can
reuse,
but
that
already
has
the
config
fragment
packaged
with
it,
and
this
is
a
very
flexible
idea.
It
also
includes
the
possibility
to
have
custom
config
functions
which
take
in
some
config
specified
on
the
configured
object
and
then
transform
it
to
to
config.
That
is
valid
for
the
wrapped
object.
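A rough illustration of the idea, with a made-up resource and config fields (not from the talk):

```python
from dagster import configured, resource

@resource(config_schema={"region": str, "bucket": str})
def s3_client(context):
    # Pretend this returns a real S3 client for the configured region/bucket.
    return dict(context.resource_config)

# Bake the region in once; the new resource only asks for the bucket.
# The config function maps the outer config onto the wrapped schema.
@configured(s3_client, config_schema={"bucket": str})
def us_east_s3_client(outer_config):
    return {"region": "us-east-1", "bucket": outer_config["bucket"]}
```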
B
So
we
think
this
is
really
going
to
enable
a
lot
of
code,
reuse
and
reduce
the
possibility
for
errors
in,
like
repeated
config,
fragments,
again
very
excited
to
see
what
everyone
does
with
this
api
new
in
I
know,
or
rather
newly
committed
to,
we
want
to
kind
of
formally
express
a
deprecation
policy.
I
talked
about
hooks
as
being
experimental,
that's
part
of
what
we're
we're
doing
here,
but
we
are
formally
committing
to
you
guys
not
to
break
public
apis
in
minor
releases
by
public
apis.
B
We will mark APIs as deprecated before they're broken, and we have new code-level support for that: when you use a deprecated API, you should get a warning printed to the console. Please enable warnings in your test suites if they're not already enabled; that'll help you keep track of our plans to deprecate APIs. And we've introduced an experimental wrapper, so when you use an experimental API, you'll also get a warning.
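One generic way to surface those warnings in a pytest suite (standard pytest setup, not something specific from the talk):

```python
# conftest.py: make deprecation warnings visible during tests.
import warnings

def pytest_configure(config):
    # Show every DeprecationWarning instead of Python's default "once" policy.
    warnings.simplefilter("always", DeprecationWarning)
```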
B
I want to briefly touch on a couple of notable public API changes in 0.9.0. These are mostly just renamings, and I know that they induce a degree of thrash; thank you for putting up with it, but we think it's going to make the platform much easier for new users to start using and to understand. Some of these are just renamings, like config to config_schema. We've renamed Materialization to AssetMaterialization, in part to highlight that materializations are meant to play with the asset system.
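A before-and-after sketch of those two renames, with an illustrative solid (not from the slides):

```python
from dagster import AssetMaterialization, Output, solid

# 0.8.x style (for comparison):
#   @solid(config={"path": str})
#   def save(context):
#       yield Materialization(label="users_table")
#       yield Output(None)

# 0.9.0 style:
@solid(config_schema={"path": str})
def save(context):
    yield AssetMaterialization(
        asset_key="users_table",
        description=f"written to {context.solid_config['path']}",
    )
    yield Output(None)
```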
B
Some
of
them
are
sort
of
signposts
of
more
thorough,
going
conceptual
changes
that
are
coming
down
the
pipe,
for
instance,
the
system
storage,
which
is,
is,
I
think,
a
very
difficult
thing
to
understand,
and
a
difficult
thing
to
build.
A
custom
system
storage
is
becoming
an
intermediate
storage
and
that's
paving
the
way
to
changing
slightly
the
way
that
it
it
works
with
the
internals
so
that
it's
it's
easier
to
understand
and
easier
to
build
them.
B
And eventually, perhaps, it will incorporate some of what is currently done by the run launcher; calling it an executor is a little bit more true to that. And then there are some just terrible naming choices that we've changed, like input_hydration_config, which really is a barrier for someone trying to adopt the system. We're now calling that a loader, which makes a lot more sense. Thank you for putting up with these; they are some of the things you'll have to grep for if you are migrating to 0.9.0. So thank you.
B
If there are any questions about 0.9.0, I'm very happy to talk about it later, at the end of everyone's presentations. And now I'll turn it over to Sandy to talk about all the exciting stuff that we're planning to do over the next couple of months. Thank you so much.
D
Awesome. So my name is Sandy, I'm an engineer at Elementl, and I'm going to spend some time on what we're thinking about for the upcoming 0.10.0 release.
D
For those of you who aren't familiar with Dagster's release cycle, we do a major release like 0.10.0 roughly once every three months, so what I'm going to be talking about here is the set of features we're thinking about, roughly, for November. We're not going to get to every one of these items, so don't interpret this as a concrete plan. Instead, it's a set of areas that we think are important, and we're talking about them because we want to understand what you think is important, and also what we're missing.
D
So, how do we decide what to work on? The first thing we asked ourselves was: what are the parts of our system that give our users the biggest headaches? There are some difficulties that come up over and over again in Slack questions, GitHub issues, and direct conversations with users.
D
Some
of
these
panes
just
require
sanding
down
some
rough
edges,
with
the
docks
that
others
require.
Bigger
architectural
changes
to
less
mature
parts
of
the
system.
The
second
thing
we
asked
ourselves
was:
how
can
we
push
the
envelope
of
an
orchestration
system?
So
what
are
the
situations
where
the
dagster
way
of
looking
at
things
lets
us
offer
some
powerful
capability?
That's
maybe
even
beyond
what
our
users
would
have
expected.
So
these
are
opportunities,
really
double
down
on,
what's
unique
about
daxter
and
try
to
offer
something
really
exciting.
D
So
where
are
these
pans
and
opportunities
located
in
the
system?
We
think
of
dexter
in
roughly
three
layers,
the
top
layer
is
the
world
of
data
assets.
So
these
are
the
data,
warehouse
tables
or
machine
learning.
Models
that
we're
actually
building
our
pipelines
to
produce
the
middle
layer
is
the
world
of
execution.
D
So
that
is
solids
pipelines.
These
are
the
set
of
primitives
that
dagstr
actually
uses
for
running
stuff
and
holding
it
all
up
is
the
world
of
deployment
and
instance,
management,
one.
Second,
everything
is
required
to
keep
to
basically
set
up
a
production,
dashboard
installation
and
keep
it
going.
D
I
just
realized
that
my
face
is
hidden
there.
We
go.
Here's
me
so
for
each
of
these
layers,
I'm
going
to
give
an
overview
of
the
improvements
that
we're
thinking
about
and
then
maybe
dive
in
deeper
to
a
few
of
the
interesting
ones.
D
The second pain we hear about a lot is around pipelines writing intermediates to some sort of data warehouse or data lake. Many users have expressed confusion about the relationship between the asset and intermediate concepts that we have in Dagster. Both of these concepts represent data that's produced by Dagster pipelines, but they function in slightly different ways, so we think there's an opportunity to smooth out some of those wrinkles. And then there's our exciting-opportunities bucket.
D
The
big
thing
we're
thinking
about
is
what
we're
calling
version
based
memorization
kind
of
a
mouthful,
but
what
this
means
is
tracking
the
version
of
the
code
at
each
step
in
a
pipeline.
So
then
you
can
avoid
re-running
steps
whose
outputs
you
already
computed
in
a
previous
run,
the
same
capability
could
actually
also
make
backfills
easier
to
manage
so
talking
about
backfills
for
a
minute.
This
is
something
that
we
actually
added
to
dagster
recently
and
we
call
it
the
step
partition
matrix
for
any
partition
run.
D
You
can
find
this
view
and
dag
it,
and
we
have
for
the
rows
the
set
of
steps
for
that
pipeline
and
then
for
the
columns,
the
set
of
partitions
that
that
pipeline
has
been
run
over
and
then
you
can
click
on
any
particular
partition
and
understand
all
of
the
runs
that
affected
that
partition
in
the
past.
D
So
what
this
is
really
useful
for
is
basically
understanding.
Where
are
the
gaps?
Where
did
I
have
failures
that
made
it
so
that
certain
steps
were
unable
to
run
for
certain
prior
partitions?
So
I
can
come
in
later.
Look
at
all
the
runs
that
affected
those
particular
partitions
and
then
under
and
then
rerun
ones
that
are
likely
to
cause
problems.
D
Then you might make some changes to the compute function for the third solid. When you want to try out these changes, it's kind of a waste of time and resources to rerun the solids that haven't changed, but most engineers actually just end up re-running the entire pipeline, because it's difficult to track what's changed and what stayed the same.
D
Ideally,
we
would
only
auto
run
the
set
of
steps
that
are
actually
stale,
because
those
are
the
ones
that
we're
sort
of
interested
in
the
new
results
from
so
the
version
based
memorization
feature
allows
you
to
tag
any
solid
with
a
version.
The
idea
is
that
the
version
should
stay
the
same
as
long
as
the
solid's
compute
function
stays
the
same.
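A hypothetical sketch of that tagging. The feature was still being designed at the time of this talk, so the API shown here is illustrative:

```python
from dagster import solid

@solid(version="2")  # bump this whenever the compute function changes
def clean_data(_, raw):
    # If a previous run already produced this step's output under
    # version "2", a memoized re-execution could skip the step and
    # reuse the stored output instead of recomputing it.
    return [row for row in raw if row is not None]
```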
D
This
is
particularly
useful
if
you're
sort
of
in
like
a
tight
dev
loop,
where
you're
going
back
and
forth
between
making
changes
to
a
pipeline
and
then
rerunning
that
pipeline,
especially
if
you
know,
as
is
the
case
in
most
pipelines,
changes
that
you
make
in
one
step
have
impacts
on
downstream
changes,
and
you
end
up
having
to
make
changes
to
multiple
steps
to
make
sure
everything
works.
Instead
of
re-running
the
entire
pipeline
each
iteration,
you
can
simply
rerun
the
stuff
that
has
changed.
D
Another
place,
as
we
talked
about
the
versions
can
be
helpful,
is
managing
backfills.
So
often
we
run
backfills
when
we
made
a
code
change
and
there
were
a
set
of
prior
partitions
that
we
generated
with
an
old
version
of
our
code.
So
we
can
leverage
that
same
step,
partition,
matrix
view
that
we
showed
earlier
in
the
flesh
to
give
us
an
understanding
of
which
partitions
are
stale
after
a
code
change.
So
you
know
in
in
this
little
visualization.
D
All
of
the
yellow
steps
are
stale,
because
we've
made
a
code
change
to
the
solids
that
generate
those
outputs.
D
Dynamic orchestration refers to situations where, for example, someone wants to kick off a task for every file that's discovered in a directory. Right now, Dagster is limited because it requires users to specify a fixed set of tasks when they define a pipeline. The idea behind dynamic orchestration would be that you could actually determine how many parallel tasks to run based on information that's only available at execution time.
D
Crosstag
dependencies
are
when
two
pipelines
don't
make
sense
to
merge
into
a
single
pipeline,
but
the
latter
has
maybe
some
data
dependencies
on
the
former,
and
we
don't
want
to
begin
execution
of
the
ladder
pipeline
until
the
former
pipeline
is
completed.
D
In
the
exciting
execution
opportunities
bucket
one-
and
you
know
this
is
a
little
bit
more
speculative-
is
multi-container
orchestration.
The
idea
here
would
be
to
enable
users
to
assemble
pipelines
out
of
tasks
that
each
live
in
their
own
containers.
This
could
make
it
so
that
the
set
of
dependencies
for
different
tasks
within
a
pipeline
aren't
tied
to
each
other.
So.
D
It's a more speculative feature, and we have to think about what it would look like to build, but it could also be a very cool capability. The second, more speculative feature is what we call event-driven scheduling. This refers to launching pipeline runs, instead of at a fixed tick, in response to external events: maybe new data has landed in a storage bucket, and we want to kick off a pipeline to process that data.
D
So,
lastly,
for
the
deployment
layer,
we
don't
have
as
many
exciting
new
features,
but
it's
maybe
the
most
important
area
to
address
pain,
because
if
you
just
have
trouble
managing
their
attacks
or
instances
in
production,
there's
not
a
ton,
they
can't
accomplish
with
dexter
at
all.
The
main
things
we've
heard
in
terms
of
difficulties
in
deployment
are
difficulties
operating
the
scheduler
difficulties
deploying
on
kubernetes
and
difficulties.
Managing
large
numbers
of
pipeline
runs,
so
I'm
going
to
dive
a
tiny
bit
into
what
we've
observed
about
the
scheduler.
D
We
observed
a
couple
of
main
challenges
with
axiorys
current
scheduler,
so
first
of
all
is
the
fact
that
it
depends
on
cron
for
the
actual
scheduling.
This
creates
kind
of
a
split
brand
scenario
where
you
have
to
manage
both
the
dagster
process,
that's
managing
the
crown
process
and
and
then
cron
itself.
D
The
second
one
is
sort
of
a
availability
concern
where
node
failures
can
actually
make
the
scheduler
miss
ticks.
So
you
know,
while
the
scheduler
node
is
down,
it
won't
be
launching
any
tick.
So
when
it
comes
back
up,
it
doesn't
know
that
it
missed
those
ticks.
D
We
think
that
the
best
way
to
address
both
of
these
is
by
making
the
scheduler
a
first
class
component
of
dagster
itself,
so,
instead
of
sort
of
trying
to
outsource
it
to
kron,
we
want
to
build
the
scheduler
into
dagster,
because
scheduling
is
so
important
to
what
dagster
does
so.
This
will
both
make
operating
the
schedule.
E
D
A
Great. Thanks, Sandy, and thanks, Max, for putting that together. There's a ton of exciting stuff going on, and I think 0.10.0 is going to be pretty epic, actually, in terms of things to come. I highly encourage everyone to play with both hooks and configured; configured especially, I think, solves a lot of common issues that a lot of people mention, and it can pay enormous dividends. For example, with configured:
A
If
you
have
tons
of
repeated
blocks
of
config
in
your
config
files
that
never
change
between
runs.
I
highly
encourage
you
to
look
at
configured
to
kind
of
capture
that
in
code
and
it
kind
of,
makes
everything
simpler
and
then
I'm
personally
really
this
version
based
memoization.
This
version
based
memorization
stuff,
is
going
to
be
incredible,
so
both
for
dev
loops
and
for
backfills.
So
it's
great
stuff
next
on
top,
is
tamas.
A
So
tomas
tomas
is
a
staff
engineer
at
prezi,
they've
been
working
with
daxter
since
effectively
the
beginning
of
the
year
migrating
their
existing
entire
production
system
to
dagster
and
they've
been
unbelievable
partners
through
this,
and
if
any
of
you
are
using
the
kubernetes
infrastructure,
you
have
tomas
and
crew
to
thank
for
grinding
out
a
lot
of
the
rough
edges
before
you
had
to,
and
so
we
are
all
personally
indebted
to
tomas
and
crew.
A
So without further ado, I will hand it off to Tamas, to both show off the features of Prezi itself in this presentation and explain the data infrastructure behind it.
E
Yeah, hi. Thanks for the nice words. I'm super happy to be here and to show you what we did in the Dagster world, and what our journey to Dagster, and migrating to Dagster at Prezi, looked like. So let's start with how we started.
E
I
think
the
like
the
data
data
engineering
and
the
data
team
started
around
like
eight
years
ago
at
prezi
we
started
like
I
think,
like
most
of
the
companies,
where
we
have
a
bunch
of
shell
script
and
scheduled
with
chrome,
of
course,
at
some
point,
as
the
uti
jobs
started
to
grow,
we
basically
we
figured
out
that
it
won't
scale,
so
we
had
to
come
up
with
some
kind
of
solution
on
that
it
was
like.
Six
years
ago
we
were
looking
around
on
the
open
source
bird.
E
We call this Flowkeeper; that's what you can see on the screen. This is our homegrown orchestrator, and one of the main design decisions, and why we decided to create a new one instead of going with some existing one, is simplicity. One of the requirements from our users was basically that they did not want to write code to have a pipeline, and that's why we came up with a new orchestrator with a JSON-based config.
E
This is a pretty simple JSON descriptor, what you can see here. You can define the scheduling type (there are two types, daily and hourly schedules), you can define the inputs your job is using, you can give some kind of friendly name, and there is a path you can define, in this case an S3 path. You can also define what kind of data sets your job will generate. So in this case, this job takes some S3 location as input, and it will produce some kind of Redshift table. And you should know that these inputs and outputs are really what we use to build up the whole dependency graph in our orchestration.
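For illustration, here is a made-up descriptor in that spirit. The field names are hypothetical; the talk describes the general shape, not an exact schema:

```python
# Hypothetical Flowkeeper-style job descriptor, shown as a Python dict.
job_descriptor = {
    "name": "load_user_events",              # friendly name
    "schedule": "daily",                      # "daily" or "hourly"
    "type": "redshift_load",                  # one of the predefined job types
    "tier": 2,                                # lower tier = scheduled earlier
    "inputs": [
        {"s3_path": "s3://prezi-data/events/{date}/"},
    ],
    "outputs": [
        {"redshift_table": "analytics.user_events"},
    ],
}
```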
E
We
did
not
go
with
that
concept,
but
you
can
see
somewhere
else
or
like
a
like
other
orchestrator,
where
you
can
just
basically
define
your
job
names
and
that's
the
way
how
you
define
dependencies
between
the
two
jobs.
Here
we
basically
went
in
the
past
that
that
you
only
have
to
know
what
kind
of
data
sets
you
want
to
work
with
and
based
on
that,
we
will
figure
out
the
dependencies
and
which
job
it
needs
to
be
connected.
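That matching, connecting each job's inputs to whichever job produces the same location, could be sketched like this (simplified, hypothetical code, not Prezi's actual implementation):

```python
def infer_dependencies(descriptors):
    """Connect jobs whose input locations match another job's outputs."""
    # Map every produced location (S3 path or table name) to its producer.
    producers = {}
    for job in descriptors:
        for output in job["outputs"]:
            for location in output.values():
                producers[location] = job["name"]

    # A job depends on the producer of each of its input locations.
    deps = {job["name"]: set() for job in descriptors}
    for job in descriptors:
        for inp in job["inputs"]:
            for location in inp.values():
                if location in producers:
                    deps[job["name"]].add(producers[location])
    return deps
```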
E
So,
basically,
if
you
define,
if
you
said
that
your
input
is
this
s3
location,
what
you
can
see
here-
and
we
saw
that
other
jobs
generated
the
same
as
your
location.
We
connected
the
two
jobs.
Basically,
this
this
this
was
or
how
we
set
up
the
dependencies
between
these
jobs.
I
think
it's
pretty
simple
and
it's
it's.
It's
worked
for
us
because
usually
the
user
knows
what
they
are
working
on
a
bit.
E
And we defined a couple of predefined job types that you could use; the one in this example is a Redshift load.
E
Basically, you specify an input, and we load the input data into Redshift with the parameters you can see down there. We have a few job types, like Redshift load and Redshift transform, which basically runs a SQL script, and we have Spark jobs, Python jobs, and a few others. We also defined tiers: every data set is put in some kind of tier, which is basically its priority. What does that mean?
E
You can imagine that you have a bunch of data sets, especially if you have hundreds of data sets and hundreds of ETL jobs. Then it can happen that you have two jobs that could run at the same time, but the resource you want to run them on can't handle running the two in parallel. In that case, you have to make sure the more important data set will be ready earlier, and this is what tiers mean here: the lower the tier, the earlier the job gets scheduled, if possible. Another thing I failed to mention is the job type (in this case a Redshift load); the job type also defines the resource we are going to use, so in this case Redshift. Even in our homegrown scheduler we had these resource queues, where we basically made sure that you can't overload the resources the jobs are using. You can imagine: if you have hundreds of heavy jobs that can run in parallel, and you ran them all on Redshift, you would most probably kill it. So this was the state of things; this was our own scheduler.
E
So things were looking good, and it seemed like users really liked it, and we ended up with a dependency graph like this. We had around 900 jobs, and if you have 900 jobs, then you will face a few issues. That's why we were really thinking about whether we wanted to fix those in our current homegrown orchestrator, or look for some open source alternatives. Here's why we decided against improving our homegrown orchestrator.
E
Another thing is that this orchestrator was running on one EC2 machine, and if it dies, then we are in trouble: we would have to start a new machine and set up everything there. There is also the problem that, because we run all of our jobs on one machine, two jobs can interfere with each other.
E
You
can
imagine
if
one
job,
basically
that
it's
too
high
cpu
load
or
just
eat
up
the
disk
space,
or
even
first,
when
when
basically,
you
have
some
vp
users
and
they
just
start
expecting
that
they
can
write
to
a
temporary
folder
and
one
job
without
defining
the
as
a
dependency
between
each
other
one
just
put
down
some
files
there
and
the
other
one
expects
to
pick
it
up
and,
of
course,
the
infrastructure
that
crazy
is
moving
to
kubernetes
so
or
data
usage
structure
needed
to
move
as
well
to
kubernetes.
E
Also, if our users wanted to test their jobs, mostly they had to log into one machine, copy their file over, and try it out from that specific machine, and we wanted to provide a way better user experience. That was the time when we talked with the Dagster team, and they convinced us to try out their tool and see how it works for us, and that's when we decided: okay, let's try to migrate to this new system.
E
But
of
course,
if
you
want
to
migrate
to
a
new
system,
you
don't
want
to
write
all
of
your
epi
jobs
from
scratch.
So
what
what
was
our
first
requirement
when
we
try
to
move
to
dexter
is
basically
to
being
able
to
keep
or
descriptors
and
migrating
and
using
it
for
generating
solids
in
dexter.
So,
basically,
what
we
wanted,
we
had
a
car
and
we
wanted
to
replace
the
engine
a
way
better
engine
and
very
more
reliable
engine,
and
this
is
what
we
did
so
keeping
our
job
descriptors.
E
First
of
all,
we
use
the
job
descriptor
and
started
to
generate
solids
from
it.
How
this
looks
like
first
of
all,
we
generated
a
solid
config,
which
I
now
saw
that
it
should
be
config
schema
which
basically,
if
you
treat
solid
as
a
function
which
has
parameters
then
config
config
are
the
parameters
and
its
types
and,
as
you
can
see
here,
we
had
the
original
json
descriptor,
but
you
can
see
down
there.
It's
a
redshift
transform
and
we
generated
a
nice
schema.
A
config
schema
for
that.
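A simplified sketch of that generation step. The factory and schema fields here are assumptions; the talk shows Dagit screenshots rather than code:

```python
from dagster import Field, solid

def solid_from_descriptor(descriptor):
    """Generate a dedicated solid for one Flowkeeper job descriptor."""
    @solid(
        name=descriptor["name"],
        config_schema={
            # Each predefined job type gets its own parameters; a
            # Redshift transform mainly needs the SQL file to run.
            "sql_file": Field(str, description="SQL to run on Redshift"),
            "tier": Field(int, default_value=3),
        },
    )
    def _solid(context):
        context.log.info(f"Running {descriptor['type']} job")
        # ... check inputs, then execute the actual job ...
    return _solid
```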
E
What you can see on the right side is a screenshot from Dagit. For every descriptor, we generate one specific solid, so it's rigid: here you can't change "Redshift transform" to any other type, because the inputs, and even the processing, wouldn't make sense. And there are all the parameters which can be used in the Redshift transform.
E
So in this case, that's the SQL file parameter, which basically says which SQL file needs to be run on Redshift when you're running this job.
E
Of
course
now
so
now
you
have
a
function
and
you
have
all
the
parameters,
so
you
have
a
solid
and
or
and
and
the
config
schemas,
but
you
need
all
the
parameters
or
the
values
that
you
want
to
pass.
That,
and
this
is
this,
these
are
the
presets
we
also
generating
the
preset
email
and
from
or
json
descriptor
like.
If
you
check
here
on
the
right
side,
this
is
this
one
is
generated,
one.
E
The
left
side
is
basically
one
which
is
in
or
json
and,
as
you
can
see
there,
we
generated
a
nice
preset
where
we
say
that,
where
we
pre-fill
all
the
values,
what's
what
are
what
are
in
the
json
descriptor
and
later
on?
If
you
want,
of
course,
on
the
playground,
you
can
you
can
change
it
if
you
want
to
run
some
test
run,
but
but
basically
you
don't
have
to
do
anything,
we
do
it.
We
pre-fill
it
for
you
like.
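In Dagster terms, such a generated preset might look roughly like this (hypothetical values; the real ones come from each JSON descriptor):

```python
from dagster import PresetDefinition

# One preset per descriptor, pre-filling the solid's config values.
generated_preset = PresetDefinition(
    name="from_descriptor",
    run_config={
        "solids": {
            "load_user_events": {
                "config": {"sql_file": "sql/user_events.sql", "tier": 2}
            }
        }
    },
    mode="default",
)
```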
E
We have the preset; then we have the solid body. The solid body is predefined by us: it gets all the properties from the solid presets, and based on that, we decide what kind of job type we need to run. So if it's a Redshift transform, then we will run a Redshift transform, and we do some other steps as well. In the solid body, basically, what we do is check the inputs and do the actual job execution.
E
Now you have a nice solid with the configs, the presets, and the body, but you have to define the dependencies between the solids, and what kinds of inputs and outputs there are. Here, as well, we are using the JSON descriptor, and as you can see, we are generating a typed input. In this case, because it's a Redshift table, we generate a Redshift Flowkeeper path type, as it's called in this example.
E
It's
called
in
this
example,
and
and
that's
where
we
are
generating
for
the
second
input
and
also
we
generate
the
output
and
when
we
are
generating
the
dependency.
Basically,
what
we
do
we
do.
The
same
depends
and
dependency
setup.
What
we
what
I
mentioned
earlier,
basically
based
on
the
inputs
and
outputs
output
paths
and
table
names,
we
look
up.
We
job
generate
that
and
we
do
the
connection
between
the
solids.
Based
on
that-
and
here
you
go,
there
is
a
nice
small
pipeline
defined.
E
And
last
but
not
least,
we
also
add
some
solid
metadata
which
not
needed
for
the
solid
itself,
but
it's
more
like
like
dexter
as
the
orchestrator,
and
also
because
we
want
to
add
some
nice
tagging
onto
these
solids.
So
just
a
few
examples
here
when
we
set
the
max
returns.
Basically,
this.
This
is
what
what
which
says
that,
how
many
times
we
want
to
retry
a
failing
job
before,
failing,
actually
and
and
stopping
retrying,
and
also
we
set
the
tier
here
and
based
on
the
this
tier.
E
Based on this tier, we also set the Dagster priority for the orchestrator, and we also set the Dagster Celery queue based on the job type, which I mentioned before, for resource-based scheduling.
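For example, tags along these lines. The dagster/priority and dagster-celery/queue tag keys are real tags consumed by the Celery executor; the retry tag name and the values are illustrative:

```python
from dagster import solid

@solid(
    tags={
        "flowkeeper/max_retries": "3",       # hypothetical custom retry tag
        "dagster/priority": "2",             # consumed by the Celery executor
        "dagster-celery/queue": "redshift",  # route to the Redshift queue
    }
)
def redshift_transform(context):
    context.log.info("running transform")
```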
E
That was one of the requirements we wanted to achieve from the beginning, because what we saw with other orchestrators is that they start to have problems if you have hundreds of jobs or solids in one pipeline, and that's why you basically have to split your pipeline into multiple pipelines and do the connections between those pipelines.
E
What
we
did
we
got
all
of
the
descriptors
and
loaded
into
dexter,
and
let
me
show
you
how
this
looks
like
the
whole
pipeline
put
it
into
texture.
E
There is this nice selector where you can just select a subset of the pipeline, which can be super useful, especially if you're trying to understand your pipeline, or if you want to change some job and you're interested in what other jobs could be affected by that change, or even if you're doing some kind of debugging where you're interested in, if this job failed, what else can be affected.
E
So
I
think
it's
a
pretty
cool
thing,
so
now
we
have
all
of
the
jobs
and
and
we
we
can
load
into
dexter
all
of
these
and
we
can
generate
from
our
jobs,
solids
and
all
of
the
services
can
be
loaded
into
dexter.
E
But
another
thing
what
we
wanted
to
achieve,
like
the
similar
user
experience,
what
we
have
currently
or
even
better
here
as
well-
and
here
is
the
workflow-
what
we
come
up,
how
you,
how
you
develop
your
ether
on
your
with
a
job?
E
So
basically
the
workflow
is
the
following:
you
as
a
user.
You
start
working
on
your
new
shiny
ethereal
job.
You
start
a
local
development
environment.
Local
development
environment
is
basically
doc
dexter
running
in
docker
and
locally
so
and
there
you
can
start
working
on
your
job
testing
and
even
you
can
go
to
access
services
with
your
own
credential.
When
you
are
happy
with
your
jobs
with
your
job,
you
have
to
create
a
pull
request
in
github,
and
then
somebody
reviews
that
and
in
the
meantime,
as
well
jenkins,
runs
a
check
on
this
job.
E
What
we
do,
what
we
are
actually
checking
it's
another.
I
think
pretty
nice
feature
indexer,
that
you
can
introduce
modes
as
well,
you
you
can
create
multiple
modes
and
we
introduce
this
test
mode
where
which
actually
not
touching
any
of
the
resources,
but
what
it
does.
E
It
just
runs
the
whole
pipeline
and
basically
checks
if
there
are
circular
dependencies,
if
there
is
any
config
issues
and
and
if
we
are
able
to
run
the
whole
pipeline
without
running
on
actual
resources,
which
is
cool,
if
that
passed,
then
you
can
deploy,
we
are
using
the
kubernetes
executor
with
select
as
a
salary
executor.
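A rough sketch of that kind of test mode. The resource and mode names here are made up; the idea is just swapping the real resources for no-op stand-ins so the whole pipeline can execute:

```python
from dagster import ModeDefinition, pipeline, resource, solid

@resource
def redshift(_):
    raise NotImplementedError("real Redshift client goes here")

@resource
def fake_redshift(_):
    # No-op stand-in: lets the whole pipeline execute without touching
    # the real cluster, so config and wiring still get checked.
    class Fake:
        def run_sql(self, sql):
            pass
    return Fake()

@solid(required_resource_keys={"redshift"})
def transform(context):
    context.resources.redshift.run_sql("select 1")

@pipeline(
    mode_defs=[
        ModeDefinition("prod", resource_defs={"redshift": redshift}),
        ModeDefinition("test", resource_defs={"redshift": fake_redshift}),
    ]
)
def etl_pipeline():
    transform()
```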
E
So what happens in this case: in the end, what you do is basically just commit a JSON file into a repo. Based on that, we do a test run, and if everything is fine, then we create a Docker image from all of these descriptors and deploy it to Dagster as user code, separately.
E
In
the
salary
queue
in
various
resource
resource
queues,
like
we
defined
a
separate
queue
for
regis
for
presto
and
hadoop
and
python
in
hadoop
that
one
where
like
spark
and
jobs,
are
running
and
basically
in
this
way,
we
can
make
sure
that
these
cues,
when
when
a
job,
is
executing
from
the
red
shift
keys,
then
we
can
make
sure
that
only
like
five
parallel
jobs
is
running
and
it
can
happen
can't
happen
that
we
were
overrunning
a
ratchet
cluster
and
no
one
else
can
be
basically
querying
it,
which
is
not
a
good
thing
if
it
would
happen
and
another
benefit
running
on
kubernetes.
E
All
of
these
jobs
are
running
in
a
separate
pod,
which
is
nice,
because
jobs
can
interfere
with
each
other
if
it's
using
to
manage
memory
cpu,
whatever
the
pod
got
killed,
but
the
other
jobs
can
run,
which
is
cool
and
also
another
cool
thing
in
the
salary
executor
that
all
this
prioritization
is
there.
So
it's
even
three
dollar
priority
settings
incident
treat
priority
settings
and
it's
super
nice,
but-
and
we
also
got
a
few
nice
additional
values
using
dexter.
One.
Is
this
nice
data,
lineage
visualization?
E
That's what I showed you before. The other one is pipeline performance monitoring, which is pretty nice, because most of the time you want to figure out whether your pipeline is running slower than expected, and then you're interested in why, and in which solid your job runs longer than before: maybe somebody committed a change there which caused this, or maybe there is some issue with your Hadoop cluster, or whatever.
E
I
think
it's
a
pretty
nice
ui,
where
you
can
basically
see
the
logs
immediately
and
you
have
this
nice
filter
as
well
twittering
down
what
you
are,
and
here
the
the
solid
selectors,
where
you
can
only
see
that
portion
of
the
pipeline.
What
you
are
really
interested
in.
E
And
the
testing
capability,
it's
super
nice.
Actually,
that's
what
I
showed
you
in
the
in
the
github
example
or
the
junkies
example.
So
it's
super
cool
and
and
we
we
can
make
sure
that
we
are
letting
way
less
gear
begin
with
this
running
the
whole
pipeline
in
a
test
run
and,
of
course,
it's
nice
type
and
config
checking
which
comes
automatically.
E
So
this
is
where
we
are
and
actually
but
we
still,
we
are
still
working.
So
this
migration
is
still
in
progress.
So
basically
we
are
now
at
10
percentage.
So
we
migrated
that
10
percent
of
all
of
our
jobs.
We
are
slowly
migrating.
We
are
basically
marketing
a
few
jobs
testing
if
it
works
fine
and
then
going
back
and
and
trying
to
migrate,
more
jobs.
E
This
type
of
to
migrate
their
own
jobs,
improve
backfield
capabilities.
I
was
super
happy
seeing
that
there
will
be
a
bunch
of
improvements
around
that
we,
we
really
would
like
to
see
that
and
and
yeah.
This
is
something
what
we
as
we
are
working
on,
to
improve
that
and
introduce
other
quality
checks.
E
So
currently,
as
I
told
you,
quality
checks
is
basically
if
there
is
a
file
or
not
or
that
if
there
is
a
table
or
not
or
if
there
is
at
least
one
row
in
the
table
or
not,
but
we
can.
We
would
like
to
introduce
more
sophisticated
quality
checks
as
well
later
on
and
last
but
not
least,
thank
you
dexter
team.
I
think
it's
super
nice
and
I
I
we're
really
happy
with
the
cooperation
and
all
of
these
things
what
you
implemented.
A
Yeah, thanks so much, Tamas; that was excellent. You guys have done a ton of interesting work, and thank you likewise for being great partners, and also for setting a new standard in presentation production values: that was an unreal journey through time and space, so we're going to have to up our game, I think. But yeah, I just wanted to open it up.
A
You
know
we
have
about
five
minutes
left
and
wanted
to
open
up
for
any
questions
feel
free
to
either
put
it
in
the
chat
or
to
speak
up.
There's
not
too
many
people
here.
So
I
think
we
can
manage
it,
and
you
know
questions
about.
Oh,
and
I
know
stafford
tenno
plans
are
anything
that
tomas
has
worked.
A
Okay, looks like everyone's a bit shy. We'll wait a few more seconds.
A
Okay, cool. Well, everyone, feel free to follow up; obviously we're in Slack all day, every day. Oh, here we go: "Regarding the version memoization, will this include changes in run config?" Hugo, you may want to speak up; that's a fairly generic question, and I'm not sure exactly what type of run config changes you mean.
C
Yep, so the question essentially is: you illustrated how, using that feature, you'll be able to rerun pipelines and only re-run solids that have had code changes. My question is: would that functionality also apply if, say, you changed configurations for solids down the pipeline? Would you be able to use that to only rerun those solids which have had configuration changes? Does that make sense?
A
Yeah, it makes total sense. Sandy, do you want to weigh in on that?
D
Hey, I'm sorry, do you mind saying that one more time?
C
Sure, yeah. The question is: if you're making code changes, you'll be able to rerun the pipeline and only rerun solids that have had code changes. Would that apply to making changes in the run config of solids down the pipeline?
D
Got it. Yeah, that's a great question, and the answer is yes. The version for a particular step would be based on the version of the solid definition itself, and it would also be based on all of the run config that affects that step: the configuration for that particular solid, and then the configuration for any resources that that solid depends on.
A
Cool, thank you, Hugo. And with that, thank you all for coming, and thank you to all the presenters. Tamas, you put an incredible amount of effort into that, so thank you very much. And, you know, Sandy and Max, we're going to have to up our slide deck game, right? But both of you did a great job as well. So thanks to the whole team, and we're looking forward to producing all this stuff for you in the next few months.
A
I think there's a lot of exciting stuff coming, so thank you all, and we will be posting this online as well.