Description
This is a kickoff meeting for the Dagster community announcing Dagster 0.8.0. This is the biggest release for the project since inception.
So, first order of business: I just want to thank everyone here for being early users. Our purpose and mission is to serve you all and ensure that you're successful, but we also couldn't do our jobs without your feedback and without your willingness to take a bet on a system that's still in development. We're deeply appreciative of how positive your engagement has been, how patient you've been with us, and how good your feedback has been. So thank you very much for that.
A
So
what
are
we
going
to
talk
about
today?
Kind
of
the
bulk
of
the
meeting
will
be
updates
on
the
OEO
release
that
we've
codenamed
in
the
zone,
and
it
really
is
the
biggest
and
most
significant
release
since
project
inception.
Both
in
terms
of
the
features
exposed
to
you
as
well
as
kind
of
the
core
architectural
changes
that
we
think
will
improve
stability
as
well
as
open
up
our
design
space
for
the
future.
We also want to get a sense of the community. Hopefully you can start talking to each other, because a lot of our success will depend on developing an active community where people can help each other out, support each other, and build tools for each other. And then we want to talk about community growth: we're about to enter a new phase of the project where we'll be more public. That's in the interest of all of us, and it also requires help from the community, so I want to talk about that a bit.
So let me summarize the big topic areas we're going to cover. What are our core architecture changes? As you'll see, they're pretty significant, even though they didn't require breaking changes, and we're really excited about the direction. We completely revamped Dagit, and we think it's a really positive change. We also added a totally new feature which is still in its infancy, but which I think indicates, in a lot of ways, the direction we're going to go. And I'm going to describe a grab bag of things that we think are relevant to everyone.
Likewise, we want to continue to improve our operational stability, and we think a lot of these changes do that. Organically, we've had a few different archetypes of users and teams approach us, and one that we're very excited about is teams that want Dagster to enable the other teams they serve to interact better with each other.
Very typically, it's a data platform team that has data science constituents, analyst constituents, and data engineering constituents, and we think the changes we've made will really enable that multi-team platform. We also eliminated some duplicative concepts — that kind of accretion happens over time — and we think this opens up our design space significantly for future development. So what did we actually do to execute on those motivations?
A
One
is
host
user
process
separation,
so
this
specifically
solves
the
problem
daggit
being
in
process
with
user
code,
and
this
is
a
major
architectural
change.
We
enable
multiple
repositories,
so
this
will
be
the
natural
seam
by
which
you
know
a
data
platform
team
can
serve
a
couple
different
teams
who
have
kind
of
their
own
namespace
of
pipelines.
We consolidated the notions of "start" and "launch," which were confusing both to our users and, frankly, to the Elementl team itself — so that's very useful. And we've also started to serialize our pipelines and execution plans as metadata, which has some interesting applications.
Okay, so let's dig into the actual features here. Prior to 0.8.0, the architecture was like this: you had a user-defined repository, which was loaded in-process by Dagit — or any "host tool," in our terminology — and this had a number of problems.
One is that user dependencies were intermixed with system dependencies. In particular, we have a fairly heavyweight GraphQL stack in Dagit, and that brings in a ton of dependencies, which sometimes conflicted with user dependencies. Those user dependencies were also intermixed with each other — so even a data science team, which has a wholly different stack than your data engineering team, was totally intermixed.
The big problem there, as I mentioned before, is that we had to restart Dagit whenever user code changed, and, as I just mentioned, you could not separate different user environments from each other. And in general, hosting user code in a process that should stay up forever is kind of risky.
A user crash could bring down Dagit. The separation also gives us the opportunity to provide fixed images for tools such as Dagit that can be used off the shelf without modification — so there's lots of exciting stuff enabled here. So what did we end up doing?
A
Well,
you
still
write
your
code
and
you
have
to
load
it
into
a
user
process,
but
instead
of
it
instead
of
Daggett
loading
it
into
process,
we
communicate
over
an
API.
Currently,
this
is
kind
of
just
a
shell,
a
command-line
tool.
We
will
quickly
be
moving
this
to
G
RPC.
Actually,
so
this
API
is
used
to
both
query
metadata,
so
Daggett
will
query
the
user
process
for
the
shape
of
pipelines
and
other
artifacts
in
the
system.
The API is also used to instigate computation, meaning both executing pipelines and executing subsets of execution plans. Oh — this section reminds me: for those who are just joining, or those listening on YouTube, I should have stated this at the beginning. This presentation is targeted towards people who have quite a bit of familiarity with the system.
The other thing this enables is a multi-repository world. Now that we have an API, we can query different repositories that live in two different environments, and this is really exciting because, for example, these can be totally separate Python environments — in fact, different Python versions. So imagine you have one team that's stuck on Python 2.7, but you want to migrate the rest of the system. Well, this allows you to do that.
A
So
in
this
example,
you
have
on
the
left
Dexter
using
the
flat
modern
design,
which
indicates
Python
3
and
this
other
environment.
Up
top,
let's
say
it's
a
data
science
team
or
something
that's
using
the
legacy:
Python
2
7,
environment
and
that
works
fine
yeah,
and
then
you
can
docker
eyes
it.
Alright — we don't support that yet, but expect that support, at least in early form, within the next week or two, where we'll effectively communicate with a Docker container over this gRPC interface. And just to tease this: it opens up the design space to actually support additional languages in the future that can implement the full Dagster spec. So this is a really exciting direction for us.
You can also have multiple repositories within the same — what we call — repository location. Lots of folks do this, where they want to standardize their environment but still have multiple teams for logical organization. So we support both of those operating modalities.
Okay, so to support this new concept, we added workspaces. A workspace is both a file format and an abstraction, and it defines the collection of repositories and repository locations that a host tool interacts with. Users express their workspace using this new YAML-based format, which replaces repository.yaml — we will support backwards compatibility for a while, with the timeline for removal TBD. So what does this new format look like?
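For reference, a minimal workspace.yaml along these lines might look like the following (a sketch from memory of the 0.8.0 docs — treat the exact key names as illustrative, and the file names as placeholders):

```yaml
# workspace.yaml — defines the repository locations a host tool loads
load_from:
  # load a repository from a Python file (the repository function
  # can be auto-discovered, so no function name is required)
  - python_file: repo_one.py
  # or load from an installed Python module
  - python_module: my_team_repos
```

Each entry becomes a repository location, which is why separate entries can live in separate Python environments.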
The simple form is very familiar to the old repository.yaml — the only real differences are the key names. It just says: load the repository from this Python file. And, as you'll see, we also added auto-discovery of repositories in a Python module, which means you no longer have to specify the name of the function where the repository lives — something that certainly drove me crazy.
And now you'll see here — this is a preview; we'll go through Dagit more carefully in a bit — there's this new switcher, so you can switch between your different repositories. It actually shows you that they're coming from different Python environments. So if we instigated execution from a pipeline here, it would actually execute within Python 2.7, and over here execution happens within Python 3.6 or 3.7, whatever is configured. And so, yeah, this works.
Okay, this next one is in the weeds, and the technical details don't matter that much, but we now preserve pipeline structures historically. What that means is that previously, if you had a pipeline — say, named foo — and you executed it and then changed its shape, often you couldn't view it historically: if you clicked on it from the runs view, it would show the current pipeline shape, which was actually pretty misleading.
So now we actually persist a metadata format for the pipeline structure, and this is both content-addressable and normalized for efficient storage. That means — don't worry — we're not going to over-persist anything: if you run the same pipeline shape a thousand times, its serialized representation is only stored once. But I think the big takeaway is the direction we're going: we want your instance DB to be an immutable log of everything.
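The content-addressing idea can be sketched in a few lines of plain Python — this illustrates the storage technique, not Dagster's actual implementation: serialize the pipeline's shape canonically, hash it, and key the stored snapshot by that hash, so identical shapes are stored once no matter how many runs reference them.

```python
import hashlib
import json

class SnapshotStore:
    """Content-addressable store: identical snapshots are persisted once."""

    def __init__(self):
        self._snapshots = {}  # snapshot_id -> serialized snapshot
        self._run_index = {}  # run_id -> snapshot_id

    def record_run(self, run_id, pipeline_shape):
        # Canonical serialization: sorted keys so equal shapes hash equally.
        blob = json.dumps(pipeline_shape, sort_keys=True)
        snapshot_id = hashlib.sha1(blob.encode()).hexdigest()
        self._snapshots.setdefault(snapshot_id, blob)  # dedupe on content
        self._run_index[run_id] = snapshot_id
        return snapshot_id

    def shape_for_run(self, run_id):
        # Historical view: the shape as it was when that run happened.
        return json.loads(self._snapshots[self._run_index[run_id]])
```

Running the same shape a thousand times stores one blob; changing the shape stores a second one, and old runs still resolve to the old shape.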
Alright — as I was saying: pipeline structures are persisted historically, content-addressed and normalized for efficient storage. But the big takeaway is that the instance DB is an immutable log of everything that's happened in your data application. That's kind of our philosophy as we mature the system.
What we really want to be able to do is this: any time a computation is instigated, you can go back and really understand what happened and why it happened. This is a layer on which we'll build more advanced lineage tools, compliance tools, and so on — a great fundamental base to build on.
Next: the Dagit revamp. The big philosophical change with Dagit is that we're much more pipeline-centric now, rather than functional-area-centric, and we think this is a much better way to structure the system.
We'll navigate to a pipeline, and here you can click through to this overview page, which I think is incredibly useful. You can click on a pipeline and almost instantaneously get context on its shape — and also, if you were a good engineer and actually provided a rich description of the thing (which I have not), you'd see that here too. You can see: oh, this thing is running on a schedule, it last ran on June 13th, it has all these assets — which we'll get into.
We think this is a really exciting direction, where you can quickly navigate to a pipeline and get immediate context on everything that's going on. We also have a much improved run page — performance has been dramatically improved, which is great. And a lot of you have really been using tags to great effect, so they've become more important to the system.
Alright, let's talk about the asset management piece. You'll notice there's a new thing you can do: when you yield a materialization from a solid, there's a new concept called an asset key. Previously we had this thing called a label, which we still support, but that was pure metadata displayed in your event log. An asset key has much more precise semantics: we actually index asset keys, and that lets you build up a collection of assets.
You could create asset keys if you had a scheme for identifying, say, the emails that you send. So these can be more traditional data assets, like a table or a partition somewhere, but you can also use them for a whole number of purposes — we try not to tailor these things too tightly to one use case, so that the community can innovate on top of them.
I fully expect there will be use cases you all come up with that we don't anticipate, because the universe of data applications is so heterogeneous — there are novel things out there. The moment you start indexing these assets, you get access to this asset manager. You'll notice the asset keys are surfaced in a typeahead, and you can navigate to one of them.
You can see the last run that touched it, the last time a materialization happened, and its properties — and you can graph the properties you annotate it with over time. Obvious uses are tracking the number of database rows over time, or the number of bytes stored. But again, this is really flexible, and we anticipate a lot of uses.
This really gives us the opportunity to build a novel system of record for metadata, and we believe the novelty is the linkage between computation, assets, and the programming model. The moment the leaf developer yields a materialization with an asset key, it automatically gets indexed in the system. And this is just the beginning of this direction.
This one doesn't have any graphs, but we can see the last time it was materialized. Another interesting thing is that asset keys are actually hierarchical. You can see a key here — something like cost dashboards, dot, traffic dashboard — so I can navigate down, and this ends up being a dynamic folder structure. You can imagine an asset key scheme for your S3 data that mimics how you lay out your partitions, while still letting you navigate directly to the asset key in general.
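A sketch of how hierarchical keys produce that dynamic folder structure — my own illustration, not Dagster internals: treat each dotted key as a path of components, then group indexed keys by prefix for browsing, while keeping direct substring search.

```python
class AssetIndex:
    """Groups dotted asset keys into a browsable folder structure."""

    def __init__(self):
        self._keys = set()

    def index(self, asset_key):
        # e.g. "dashboards.traffic" -> ("dashboards", "traffic")
        self._keys.add(tuple(asset_key.split(".")))

    def browse(self, *prefix):
        # List the immediate children under a prefix, like folders.
        depth = len(prefix)
        return sorted(
            {key[depth] for key in self._keys
             if key[:depth] == prefix and len(key) > depth}
        )

    def search(self, fragment):
        # Direct navigation: find any full key containing the fragment.
        return sorted(".".join(k) for k in self._keys if fragment in ".".join(k))
```

So a partition scheme like `warehouse.events.daily` browses like nested folders but is still reachable by typing any fragment of the key.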
We think this is going to be super useful. The idea is: imagine you have some file in S3 and you have no idea where it came from, when it was last updated, what computations touched it, and so on. You can navigate to this asset manager, type in part of the S3 key, and figure out what's going on. We think this is going to be an incredibly powerful tool, and it's fairly obvious how you can extend it — to data quality, to lineage, to more sophisticated metadata properties, et cetera.
Alright, some additional things we've added — I won't belabor them too much. Prior to 0.8.0, the scheduler was one of the rougher parts of the system, and for those of you who have worked with us on that, thank you so much for bearing with us. We've effectively built better tools for debugging and reliability, so I hope that helps.
What I mean by that is: we've always had these re-execution features, but they were actually quite confusing, buried, and had a lot of problems. So let's go to the playground here. This is a very simple pipeline that has four solids, and we've configured it to actually persist the intermediates — and now we can launch this computation.
We don't really express the lineage visually, but it's formally modeled in the system, and we group runs to show that we've been retrying this stuff. You can say, for example: actually, I want to start from this solid and everything after it. So you can see we've selected those two things, we've actually filtered the event log by that, and now we can re-execute only those two steps.
A
So
we
think
this
is
going
to
be
super
useful
both
for
local
development,
as
well
as
for
operational
standpoint.
I
don't
have
time
to
do
an
advanced
EMA,
but
you
can
actually
imagine
a
world
where
there's
a
long-running
computation
and
if
Forks
and
then
one
of
the
forks
fails-
and
you
know
you
have
to
debug
something
fix
something.
Yuri
instigate
computation
of
that
fork
thing
that
failed.
Now
you
have
to
running
computations
that
are
all
part
of
what
we
call
the
same
run
group.
A
If
you
go
to
the
runs
page,
you
can
see
the
lineage
information
here.
So
if
you
wanted
to
see
all
the
runs
related
to
some
root
idea,
you
could
actually
click
on
it
by
the
way
guys,
we
should
probably
add
the
root
remedy
to
the
actual,
the
okay.
So
we
think
that's
a
really
fun
feature
again,
that's
kind
of
like
just
getting
started.
So
please
give
us
feedback
on
that.
Okay, so we have improved PySpark support. I'm not going to dig into it, because not everyone uses PySpark, but if you do use PySpark, your world has gotten a hell of a lot better. You can express your PySpark computations more abstractly — you just express your business logic — and we currently support EMR, which we wrote, and Databricks, which is a community contribution.
To the contributor, if you're on the line: thank you very much for that extraordinary work. You can shift the computation between your local machine, EMR, and Databricks with no business-logic changes whatsoever, which is pretty much a game changer.
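That "shift computation without touching business logic" idea is essentially a strategy pattern: the business logic is a plain function, and the execution target is a pluggable launcher chosen by configuration. A minimal sketch with hypothetical names — not the dagster-pyspark API:

```python
def word_lengths(words):
    # Pure business logic: knows nothing about where it runs.
    return [len(w) for w in words]

class LocalLauncher:
    def run(self, fn, *args):
        return fn(*args)

class EmrLauncher:
    """Stand-in: a real launcher would ship fn to a cluster step."""
    def run(self, fn, *args):
        # ...submit to the cluster; here we just simulate it locally.
        return fn(*args)

LAUNCHERS = {"local": LocalLauncher(), "emr": EmrLauncher()}

def execute(target, fn, *args):
    # Only the `target` configuration changes between environments.
    return LAUNCHERS[target].run(fn, *args)
```

Swapping `"local"` for `"emr"` changes where the work runs, not what the function says.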
We also have a new, experimental programming model called Lakehouse. I won't dig into that in any sort of detail right now, but think "dbt for Spark code" — not just SQL, but that same sort of philosophy — and we think that's a really exciting direction.
So if you want to be on the bleeding edge of PySpark support, we're really interested in working with users, and we think we can be a massive game changer for those developers. I also want to emphasize that we now have a fully supported — though still in development — Dagster-native orchestration cluster, meaning that if you want to distribute orchestration across a cluster in Kubernetes, we do have an out-of-the-box-ish solution for that. It has a ton of nice properties.
Every step can be configured to — or by default, I believe, does — execute in its own ephemeral pod, so you have total per-step isolation. You can set resource limits, such as memory limits, on a per-step basis. You can set up queues so that you have parallelism limits: imagine you only want to run four outstanding queries against Redshift at any one time — you can limit that parallelism using this system. And you can actually do in-flight code updates if you configure it properly.
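The Redshift example — capping outstanding work against one resource — is the classic semaphore pattern. Dagster's queues implement this at the infrastructure level; this plain-Python sketch just illustrates the mechanics:

```python
import threading

class LimitedPool:
    """Runs tasks, but never lets more than `limit` run concurrently."""

    def __init__(self, limit):
        self._sem = threading.Semaphore(limit)
        self._active = 0
        self._peak = 0
        self._lock = threading.Lock()

    def run(self, task):
        with self._sem:  # blocks while `limit` tasks are in flight
            with self._lock:
                self._active += 1
                self._peak = max(self._peak, self._active)
            try:
                return task()
            finally:
                with self._lock:
                    self._active -= 1

def run_all(pool, tasks):
    # Launch every task on its own thread; the pool enforces the cap.
    threads = [threading.Thread(target=pool.run, args=(t,)) for t in tasks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return pool._peak
```

Twelve queued "queries" against a limit of four never see more than four running at once.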
We have some API and naming improvements, and this will continue to evolve. We renamed environment_dict to run_config. For those of you who have been in the system from the beginning: we used to have a run config, then got rid of it, and now we've reintroduced it.
We think that's a better final end state for config in the system. On all of our definitions — solid definition, resource definition, and so on — we renamed config to config_schema. We think this makes it much clearer that you're passing a schema into one API, and then passing the body of the config, which must conform to that schema, into another API. And then we have a new decorator for repositories, which is much more similar in spirit to our other definitions in the system.
This is actually very convenient: before, you defined a function that returned a thing; now the information is encoded in a much more concise manner. This also enables auto-discovery, which is really nice. You can just point Dagit at a Python module — either a file or an installed module — and it will discover the repositories and load them. A small thing, but very, very nice. As I mentioned earlier in the presentation, that function-name requirement was driving me crazy, and I'm sure it drove some of you crazy too.
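Auto-discovery of this kind is straightforward to sketch in plain Python — this is illustrative only, not Dagster's @repository decorator or loader: the decorator marks repository functions, and the loader scans a module for marked objects instead of requiring their names up front.

```python
import types

def repository(fn):
    # Mark the function so a loader can find it without knowing its name.
    fn._is_repository = True
    return fn

def discover_repositories(module):
    # Scan every attribute of the module for marked repository functions.
    return [
        obj for obj in vars(module).values()
        if callable(obj) and getattr(obj, "_is_repository", False)
    ]

# Build a throwaway module to demonstrate discovery.
mod = types.ModuleType("my_repos")

@repository
def my_repo():
    return ["pipeline_a", "pipeline_b"]

mod.my_repo = my_repo
```

A workspace entry then only needs to name the module; the loader finds the decorated function on its own.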
Earlier in the life of the system, we really emphasized that Dagster can be used as a software abstraction over Airflow — and that's still true — but it's clear that we are seen as an alternative to Airflow, and we're going to reframe our communication, and some of our systems, to accurately address that.
There are kind of two components here. One: we have Dagster-native execution environments — we have folks running on the Kubernetes-plus-Celery package, and we have folks also using Dask as an orchestration cluster.
So you don't need to use Airflow as an execution engine. And now we have a feature where you can automatically ingest Airflow DAGs and DagBags into Dagster and execute them on that Dagster-native infrastructure. Let me quickly talk about how we think about this: we're trying to account for a couple of organizational dynamics in the way these things get adopted.
Often it's the leaf developers, as we call them, who see our fancy new tool — they see the Gantt chart and say, "I want to use this" — but there's an existing ops team that runs all the infrastructure, and they say, "Oh, cute, they want to use another tool. We're not going to do that." So we have one path we call "inject," which has existed since well before this release: leaf developers can write to the Dagster API, and then you can put a function call in your DAG file in Airflow that compiles that Dagster pipeline to Airflow operators, running on the Airflow infrastructure. You do have to add a little bit of infrastructure to support that, but you can use Airflow as the execution engine and still use the Dagster-native tools.
That's useful for wedging your way into the system. But then at some point, maybe the ops people — whether they're the ones who instigated this or they start to see the value — say, "Maybe we should move our whole system over." Then the dynamic inverts: now the ops people want the leaf teams to migrate their code, and those teams say, "We're not going to do that." That's where the path we call "ingest" comes in.
We're going to push out a blog post a week from today — a big, long one like last year's — and we'd love help from you all publicizing it that day: upvotes, retweets, all that stuff. We're also going to start producing technical content, and as a nice corollary to that, we believe our documentation will dramatically improve from here on out. We did a lot of work in 0.8.0, and we plan on doing a lot more.
Our team is growing, and we're kicking off an internal planning process. Right now, as early adopters, you have the opportunity to really influence our roadmap — and we want you to. So please proactively reach out to us and give us your feedback: what's your biggest pain point, where do you see the opportunities? We will incorporate that feedback into our planning process. And with that, we're kind of wrapping up here — I know that was probably a firehose of content.
[Q&A] David Gamba — not David Katz — you're muted. ...That is definitely a direction we're looking in, and we're also looking for feedback on exactly what your use cases are — whether it's S3 or other event sources. Next question: Netflix has a project along these lines — a lattice of pipelines, with backfills reversed in time?
Yeah — I was thinking of "lattice" more in the materialized-view world, but I think in the pipeline sense it's a kind of DAG. [The question continues:] Say I have a node and a materialized asset, and the code is wrong. We redeploy the Python code, and we want to simply backfill that node — which of course will yield a new materialized asset. If I have downstream nodes in the DAG that depend on that materialized asset — for example, yesterday's data on a node — could I backfill those too?
It's a super interesting thought. I do think this asset abstraction gives us the opportunity to model those dependencies. I can't, just off the top of my head, think through all the implications of that interrelation, but one of the directions we can certainly focus on is going for fully incremental computation.
A question on the Kubernetes/Celery executor and run launcher: if the workers spawn separate Kubernetes jobs, why do we need to use Celery workers at all? Maybe Nate or Alex can speak to this, but I believe you can also just use a pure Kubernetes setup. Effectively, why we use Celery is that Celery implements a number of features — like resource pooling and prioritization — that we felt we could use out of the box rather than implementing our own version of. Nate, do you want to speak to this?
[Nate] That's pretty much it. We wanted a way to control the parallelism — the number of concurrent Kubernetes jobs being spawned during execution of a pipeline — and we use the Celery workers as that intermediate layer. But there's no reason you couldn't have a version of this that just invokes the jobs directly.
Thanks, Phil. And Demetri, to get to your question — I didn't see that one before and skipped it: there are interesting services out there focused on what you'd call choreography between microservices. Cadence comes to mind — I think their feature set is way more invasive than ours, but they have this interesting kind of durable-function
architecture, which allows you to suspend the entire choreography for days at a time, and Dagster really isn't oriented to do that. But I think you're right that while we are currently focused on consuming and producing data assets, you can actually model a number of different software systems using this framework. A build pipeline, for example, could be easily modeled on Dagster, and certain strains of choreography between microservices could also be orchestrated.
Let's go to Demetri's question about Dagit's separation from pipeline user code: what happens if the pipeline is updated in the repository while it's being executed by Dagit? This depends on the semantics of your run launcher — that's the most important thing, I would say. For example, the default run launcher that comes out of the box is process-based: once you have launched a run, the process — and the code it's running — is fixed, and that's just the way it works.
But there are also configurations where it automatically picks up code that's been updated. For example, in the Dagster-Celery-Kubernetes execution environment, if you upload a new Docker container that contains your user code, then as execution unfolds, the updated container will be automatically pulled by the workers. So the answer is: it's very context-dependent.
The other interesting question to ask here — and I think we still have some internal confusion, and there's still work to do — is exactly what the difference is between a pipeline and a composite solid, because they're very, very similar abstractions. I think work along that line could clarify things as well. So to answer your question: yes, we're definitely thinking about it.
One of the premises — the thesis — of the system is that these aren't just a bunch of independent pipelines: this is a data application with complex, interrelated parts. So it's certainly in line with our mission. Thanks for all the nice questions. Next, from Demetri: have you implemented the Docker-in-Docker option?
[Nate] We're pretty focused on trying to mature the Celery/Kubernetes integration and get that to a good, production-grade spot, but I think we will continue exploring other models for execution. If I remember correctly, you were looking at ECS execution, and that's certainly something on our roadmap — to get something in place for execution on ECS.
Okay: can we perform pipeline re-execution from the point of failure? Yes — that's in the re-execution menu. We have three options: full pipeline, selected subset, and "restart me from where I failed." So we directly cover that use case, but it sounds like we need to surface that feature more clearly.
Okay, we're about five minutes away from the end. Here we go, great: what would be your recommended execution environment for production at the moment — Celery plus K8s? Yeah, I think that's our most well-supported execution environment, and it's a part of the system we've really matured. But one of our theses is that we really want to be execution-environment independent, so we want to be very clear in communicating that.
A
This
is
like
one
vertically
one
vertically
integrated
kind
of
configuration
we
support,
but
you
don't
need
it
because
you
know
for
a
lot
of
people
could
raise
the
overkill,
probably
for
most
people
that
use
it.
And
you
know
we
have
users
who,
for
example,
wrote
a
custom,
run
launcher
to
execute
and
allocate
computational
resources
on
their
own
custom
pass.
So
you
know
if
you
want
to
use
communities
if
you
want
to,
if
you
already
have
a
kinase
cluster
you
can
deploy
compute
to
you
know
that
would
be
a
recommended
execution
environment.
Probably,
but
you
know.
A
A
A
A
Can we get two minutes on where gRPC fits and when to use GraphQL? I think the broad way to frame it is that GraphQL is the interface to the instance itself, whereas gRPC is the way that inter-process communication happens within an instance. So, for example, if you're building a tool that's a peer tool to Dagit, you will almost certainly be on the GraphQL interface — imagine we build a CLI remote control; that would operate over the GraphQL interface. Whereas something like a run launcher that is instigating computation within a container — that's inside the instance, so to speak — would use the gRPC interface. But this is still an evolving architecture and it's still settling, so that framing may change.
Great question, Joe. I expect that anyone who works on this for a while and has any sort of domain-specific or application-specific logic for allocating compute resources will end up writing their own run launcher. So, for example, some folks want to tag certain runs and then, based on that, allocate specific compute resources — say, with GPUs. It's very difficult for us to do that generically.
I expect that anyone who doesn't take an off-the-shelf deployment — teams running on custom infrastructure — will end up building their own run launcher. Schedulers are interesting: I do believe the instigation framework — the thinking I mentioned with respect to event-based pipeline instigation — will inevitably become more pluggable, but I think it would be quite brave to write a new scheduler implementation at this moment. Alex, you may want to weigh in to back up that assessment — or not.
Great — alrighty, everyone. We can wrap this up now. Thank you so much for coming, and again, thank you for being early users of the system — keep that feedback coming. I hope this presentation was useful, and if you found it useful, we can keep on doing these sorts of events.