From YouTube: Dagster Community Meeting - Featuring Thinking Machines and 0.12.0 Features | July 13, 2021
Description
In our eighth Dagster Community Meeting, we heard from Carlson Cheng at Thinking Machines on running Dagster for ML pipelines in production. For our new 0.12.0 release, our team presented the new features and the road to 1.0.
👨🏫 Today's Agenda 👩🏫
Introduction: 0:00
Carlson Cheng at Thinking Machines: 1:26
0.12.0 and Road to 1.0: 19:37
Q&A: 43:08
Special Announcement: 45:11
🌟 Socials 🌟
Check out our GitHub ➡️ https://github.com/dagster-io/dagster
Check out our Documentation ➡️ https://docs.dagster.io/
Join our Slack ➡️ http://dagster-slackin.herokuapp.com
Visit our Website ➡️ https://dagster.io/
Follow us on Twitter ➡️ https://twitter.com/dagsterio
A
Thank you for coming to the July 2021 Dagster community meeting. Today we have a great agenda. We have two speakers. One is Carlson Cheng from Thinking Machines; they're one of our earliest and most enthusiastic users, and they've deployed Dagster across a bunch of different companies in Southeast Asia. They especially use it for ML training pipelines, which is a really exciting use case. Then Sandy from the Elementl team is going to speak about the 0.12.0 release, and in particular a bunch of new experimental but core APIs that will replace a lot of our core abstractions and fix a lot of long-running issues in the system, which we're really excited about. But we're also going to be asking a lot of you in the next six months to eventually move code, so we want to go over that and talk about the value of it. Then some Q&A, and then a special announcement after that. So without further ado, Carlson, do you want to take over?
B
All right, sure. Let me just share a screen, so I might need some permissions. Okay, got it. So hello, everyone, I'm Carlson from Thinking Machines; I'm the head of the ML engineering team there. Thanks for inviting us over so that we can talk about how we use Dagster for running ML pipelines in production.
B
Just a brief intro to Thinking Machines: we are a global technology consultancy building AI and ML solutions and data warehousing platforms to solve high-impact problems for our clients. Our clients range from large corporations in Southeast Asia to global nonprofit organizations, and our main goal is to empower these business users with valuable data and insights so that they can make better decisions.
B
We're internationally recognized in the field of data science; we've presented at top machine learning conferences, most recently ICML and NeurIPS 2020, where we were awarded a best paper award at one of the NeurIPS machine learning workshops for our research on geospatial poverty estimation using satellite imagery.
B
We'll be focusing more on the third use case, which is MLOps, starting with one of our projects where we've used Dagster: building a smart, unified search app. This app consolidates a number of data sources and allows users to search through these sources and get relevant information.
B
One example of this: a user would ask our application "how do I apply for vacation leave?", and our application, using ML algorithms, would then surface the most relevant section of the employee handbook, highlighting the steps you would need to actually file for a leave. Our use cases extend from there, also allowing our users to query relevant entities like people and companies.
B
A user could then search for the company GameStop and get relevant information regarding that company. Our search app is composed of three main search features. One is semantic search, which, based on your search query, gives you the most relevant FAQ document.
B
Second is Q&A search, which is our initial example, where you get the most relevant section of the employee handbook for your question. And third is entity search, giving you the most relevant person or company information. All of these search results are piped into a search ranker: an additional model that prioritizes, among your search results, which one to list as the top, most relevant result.
B
Here's a simple architecture for our project. You can see that our training and test sets are piped into the automated training and evaluation pipelines that we've built using Dagster. Based on this, we can do continuous training on newer data, build new models that we then stage inside an S3 bucket, and from there redeploy our web application servers with these new models as our users use our application.
B
Their interactions then give us more relevant information and user feedback, so that we can rebuild our models and create more fine-tuned models for their application needs. We also have an additional step here, where we created a few Dagster pipelines for meta pipeline monitoring.
B
This is our ML automation workflow in general: it goes from POC to dev to prod. Focusing on the first stage, the proof-of-concept phase is primarily done inside Jupyter notebooks, where our data scientists can fully test out their different ML approaches and run experiments. After that, they finalize their ML methodology, and that's when we start migrating their Jupyter notebooks into Dagster pipelines, where we can do further fine tuning and polishing.
B
So how does this work in practice? On the left side you can see the Jupyter notebook that a typical data scientist would create. In this case, after they've done some initial data prep, they start doing hyperparameter optimization. They're using a module here called Optuna, which is used for hyperparameter search, given a set number of trials, say 100 trials.
B
They would then get the best-scoring model, with its hyperparameters and accuracy score, and that's what we save and export as our best model from that number of trials.
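For reference, here is a minimal sketch of the kind of Optuna search being described; the dataset, model, and parameter ranges are illustrative stand-ins, not Thinking Machines' actual notebook code:

```python
import optuna
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Stand-in training data; the real notebook would use the prepared train set.
X_train, y_train = make_classification(n_samples=500, random_state=0)

def objective(trial):
    # Optuna suggests a candidate set of hyperparameters for each trial.
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 10, 200),
        "max_depth": trial.suggest_int("max_depth", 2, 16),
    }
    model = RandomForestClassifier(**params, random_state=0)
    return cross_val_score(model, X_train, y_train, cv=3).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)  # a set number of trials, e.g. 100

# The best-scoring model's hyperparameters and accuracy score, as described above.
print(study.best_params, study.best_value)
```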
B
This is a typical thing that you would get from a data scientist's notebook, and once we finalize and polish it, we can actually start moving it over to our Dagster pipelines. The convenient thing here is that you can pretty much just copy-paste your notebook code into a Dagster solid. Since Dagster is very Pythonic, it doesn't really require you to write anything extra; most Dagster solids are just Python functions.
B
You can pretty much just port it over to your Dagster solid. The additional steps, which you do in coordination with your data scientists, are to add descriptions to your solid definition and its input and output definitions. This is important because we'll need these definitions later on when we're validating and debugging our pipeline inside the UI. Some additional steps are adding logs and Dagster assets for ML metadata tracking; more on this later.
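A minimal sketch of what that porting step can look like with the 0.12-era solid APIs; the names, types, and descriptions here are hypothetical:

```python
from dagster import InputDefinition, OutputDefinition, solid

@solid(
    description="Train the model exactly as in the notebook and export it.",
    input_defs=[
        InputDefinition("train_set", dagster_type=list, description="Prepared training rows."),
    ],
    output_defs=[
        OutputDefinition(str, name="model_path", description="Path of the exported best model."),
    ],
)
def train_model(context, train_set):
    context.log.info(f"Training on {len(train_set)} examples")  # the extra logging step
    model_path = "/tmp/best_model.pkl"
    # ... notebook training code pasted here, then serialize the best model ...
    return model_path
```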
B
Some learnings we've had when running ML pipelines alongside existing Dagster pipelines: usually, when we're porting our Jupyter notebooks into more standard ML pipelines, we already have an existing Dagster infrastructure in place for more traditional ETL and ELT pipelines. So we don't need to create a whole new platform for our ML workflows; we can just make use of our existing Dagster infrastructure and add our ML pipelines there.
B
One thing to take note of is that we should organize our pipelines into logical groups. For example, you would have a group for your different ETL pipelines for source A and source B, and then other groups for your ML pipelines, say for a specific model X and another model Y. We make use of Dagster's repositories feature, which helps us isolate the individual groups, and this also further helps us isolate the dependencies for each of these pipelines, so you can avoid conflicts.
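A hedged sketch of that grouping using Dagster repositories; the pipeline names are hypothetical, and each repository would typically be loaded from its own environment via workspace.yaml so the groups' dependencies stay isolated:

```python
from dagster import pipeline, repository, solid

@solid
def extract_source_a(_):
    return ["rows from source a"]

@pipeline
def source_a_etl():
    extract_source_a()

@solid
def train_model_x(_):
    return "model artifact"

@pipeline
def model_x_training():
    train_model_x()

# One repository per logical group; Dagit shows each group separately.
@repository
def etl_repo():
    return [source_a_etl]

@repository
def ml_repo():
    return [model_x_training]
```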
B
Some additional learnings we've had: Dagster makes it very easy to move over to prod, since the pipeline implementation is pretty much the same whether you're running on, say, your local machine or your Kubernetes production environment. There are very minimal changes when moving your pipelines over; most of the changes are done inside the high-level Dagster configurations, but on the pipeline level you don't really need to change much to port it over.
B
Speaking of production, moving on to that: this is where we can fully make use of our automated model training and evaluation pipelines, producing new models and deploying them to our servers, where we can do further monitoring and get new data. For pipeline monitoring for MLOps, we make use of Dagster's asset materialization feature so we can keep track of ML metadata. Coming back to our initial example, we have a code snippet here where we create an asset called generated_model.
B
This is important because later on, when you're checking your Dagster UI, you can view the metadata for each of your training runs. Here your latest run can show you your file name, the hyperparameters of the best-scoring model, and your accuracy score. So your data scientists can keep track of the scores of each of your training runs, and we can easily see whether a certain run is performing well.
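A minimal sketch of the asset materialization step being described, using the metadata dictionary form; the asset key follows the talk's generated_model example, while the path and metadata labels are hypothetical:

```python
from dagster import AssetMaterialization, Output, solid

@solid
def export_best_model(context, best_params: dict, accuracy: float):
    model_path = "s3://my-model-bucket/generated_model.pkl"  # hypothetical staging path
    # ... serialize and upload the best model here ...
    yield AssetMaterialization(
        asset_key="generated_model",
        description="Best model from this training run.",
        metadata={
            "file name": model_path,
            "hyperparameters": str(best_params),
            "accuracy": accuracy,  # shows up per training run in the Dagster UI
        },
    )
    yield Output(model_path)
```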
B
Based on the UI, we also really appreciate the pipeline definitions. Since Dagster pipelines are very data-aware, we can see the input and output coming through each of the solids, so we can keep track of how our data is processed and changes throughout the pipeline. That's as opposed to the Airflow UI, where the data is abstracted away from you: you don't really get to see how the data is processed in your pipelines, and overall it makes for a more intimidating UI to work with.
B
Data scientists can also re-run the pipelines, or just subsets of the pipelines, for further debugging. For additional pipeline monitoring, we make use of Slack notifications. In general, our company uses Slack for day-to-day communication, so this allows us to spend less time manually checking the UI for pipeline success or failure messages, and we find out when something happens as soon as it happens.
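A hedged sketch of the failure-notification side of this, using the pipeline failure sensors that shipped in 0.12.0 together with the Slack SDK; the channel, token variable, and Dagit URL here are hypothetical:

```python
import os

from dagster import PipelineFailureSensorContext, pipeline_failure_sensor
from slack_sdk import WebClient

@pipeline_failure_sensor
def slack_on_pipeline_failure(context: PipelineFailureSensorContext):
    run = context.pipeline_run
    # Post the failed pipeline, the error message, and a link back to the run.
    WebClient(token=os.environ["SLACK_BOT_TOKEN"]).chat_postMessage(
        channel="#data-alerts",
        text=(
            f"Pipeline {run.pipeline_name} failed: {context.failure_event.message}\n"
            f"Debug the run at https://dagit.example.com/instance/runs/{run.run_id}"
        ),
    )
```

The dagster-slack integration also ships a prebuilt helper for this pattern, if you'd rather not hand-roll the message.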
B
Just an example here: we can see that a Dagster ELT ingestion pipeline ran and succeeded, although it didn't pick up any new data; this still counts as a success for us. We can see the elapsed time, which lets us flag issues with CPU resources if it's ever incredibly high, and we can see the S3 link to the ML model path that the pipeline created and used. Similarly, we can check pipeline errors whenever they happen: we can see the specific pipeline that failed, which solid actually failed, and even the error message that showed up in that solid. Further on, we created a handy link here that sends us over to the Dagster run itself.
B
There we can do further debugging. As an extra step on top of pipeline monitoring, we do meta pipeline monitoring, where we've created a Dagster pipeline that checks other pipelines. How we do this is we build a separate Dagster pipeline to regularly do a health check on our production pipelines and summarize their status in Slack.
B
As you can see here, we get a summary on a day-to-day basis, giving us all of the different production pipelines that we have and their success scores over the past few runs. We can easily see that some of the pipelines are working as expected, while some are not doing as well and might need to be flagged for further debugging. We even have the last success date, which helps us further check whether there are any issues based on that date.
B
How we do this is we have a pipeline that accesses the Dagster database, primarily the runs table. We issue a simple SQL query that just checks the status of the pipeline runs for each pipeline, based on, say, the past 10 runs of that pipeline.
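A hedged sketch of that health-check query; the connection string is hypothetical, the column names follow the runs table schema in Dagster's Postgres run storage, and the real pipeline would limit this to the most recent N runs per pipeline:

```python
import os

import psycopg2

# Summarize recent run outcomes per pipeline straight from Dagster's runs table.
HEALTH_CHECK_SQL = """
SELECT pipeline_name,
       COUNT(*) FILTER (WHERE status = 'SUCCESS') AS successes,
       COUNT(*) AS total_runs,
       MAX(update_timestamp) FILTER (WHERE status = 'SUCCESS') AS last_success
FROM runs
GROUP BY pipeline_name;
"""

with psycopg2.connect(os.environ["DAGSTER_PG_DSN"]) as conn:
    with conn.cursor() as cur:
        cur.execute(HEALTH_CHECK_SQL)
        for name, successes, total, last_success in cur.fetchall():
            # This per-pipeline summary is what gets posted to Slack each day.
            print(f"{name}: {successes}/{total} recent runs succeeded; last success {last_success}")
```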
B
So, in conclusion, why we think Dagster works for ML pipelines in production: data scientists overall get a very user-friendly UI, which enables them to run data pipelines without fear; they can easily monitor and debug their pipelines from the UI. Secondly, Dagster is very versatile: we would usually have a Dagster infrastructure that already supports ETL and ELT pipelines, and we can easily extend that to support ML pipelines. This removes the overhead of setting up something completely new for our ML workflows.
B
And thirdly, Dagster uniquely works for MLOps, because unlike other orchestrators, Dagster has features that support MLOps on top of automating our training and evaluation.
B
Just some extra things that we're planning on working on next: in the future, we plan on migrating to gRPC servers inside Kubernetes, so that we can separate the pipeline code from the core Dagster infrastructure. This helps us update our pipelines separately from the Dagster daemons, like the scheduler and the sensors, so that we can avoid redeploying them all together. This is just a step toward more process isolation inside Kubernetes.
B
We also want to try out dynamic orchestration for ETL, allowing us to generate solids dynamically at runtime instead of having to define them manually. This has the bonus of making things very easy to check inside the UI, since it lets you view those dynamic solids a lot more easily than manually defined ones.
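A hedged sketch of what that could look like with Dagster's dynamic orchestration, written here with the newer op/graph-style APIs; the file names and ops are hypothetical:

```python
from dagster import DynamicOut, DynamicOutput, graph, op

@op(out=DynamicOut(str))
def discover_sources():
    # Fan out one DynamicOutput per source discovered at runtime.
    for path in ["source_a.csv", "source_b.csv"]:
        yield DynamicOutput(path, mapping_key=path.replace(".", "_"))

@op
def ingest(path: str) -> str:
    return f"ingested {path}"

@graph
def dynamic_etl():
    # Each dynamic output gets its own mapped ingest step, visible in the UI.
    discover_sources().map(ingest)

dynamic_etl_job = dynamic_etl.to_job()
```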
B
So yeah, that's pretty much it. Thank you for listening, and I'm free for any questions later on. Thanks.
A
That was fantastic, Carlson. Anyone in the audience, feel free: we can take brief questions now. You can put anything in the chat, or, if you're feeling brave, you can just unmute and pop in; this is a pretty unregulated Zoom call.
C
I actually have a quick question; this is Rebecca here. Thanks for the presentation; it's really cool to see how you guys use it. I just had a question about your Slack notifications about the pipelines that run. The summary one looks really great. For the ones that report on individual pipelines, I may have misunderstood, but if that's what they're doing, does it get spammy? Is it something that's helpful for you to monitor systems and that kind of stuff, or how do you use that?
B
Yeah, for the Slack notifications in general, the most important ones are the ones that actually fail. Success notifications might not be as important for us; those are just an additional check that we do. But the ones that actually do fail, those are the ones that we tag users on. We also automate the tagging functionality, so whenever we have a pipeline error, we tag the relevant person for that pipeline.
A
Cool, there are no further questions. If any more come to mind, you can always pop them into the chat and we'll be able to get to them at the end of the meeting. I'm going to hand this off to Sandy now, who's going to present on the new core APIs that are now released in 0.12.0.
D
All right, hello everybody. My name is Sandy; I'm an engineer at Elementl, and I lead the team that builds and maintains the core Dagster APIs. I'm going to talk to you about a set of changes and improvements that recently arrived in the project: we just released Dagster 0.12.0 last week.
D
The release includes a bunch of stuff that we're really excited about. On the left are the new features; these are additions to Dagster that make it easier to build reliable and observable data pipelines. Pipeline failure sensors help address our most upvoted GitHub issue of all time.
D
Solid-level retries are a core orchestration feature that we had been missing and are excited to include. A new set of testing APIs offers really nice and elegant ways to verify any of the functions you provide to build Dagster definitions. And dbt and MLflow are two of the systems most commonly used with Dagster. On the right we have a set of more fundamental changes; I'm going to spend the bulk of this presentation on those.
D
One of the things we've heard is that people grasp the basics of constructing a pipeline very quickly, but it takes them quite a while to understand modes, presets, partition sets, composite solids, and the like. Part of what's difficult here is inherent complexity in the problem domain that we're helping to model, but part of what's difficult is also that many of the concepts are similar. So, for example, modes and presets are both ways of specializing pipelines to particular execution environments.
D
Pipelines and composite solids are both ways of defining dependency graphs of solids. The relationship between these three concepts — pipelines, composite solids, and solids — can inspire a decent bit of confusion. For example, users ask us why they can't nest pipelines inside of their pipelines, and solids and composite solids, while named similarly, work very differently.
D
The main difference is that the code inside a solid runs when the pipeline actually runs, but the code inside a composite solid runs when the pipeline is being defined, and that can be especially tricky to grasp given their similar names. Coming from a different direction, but one that's ultimately related, is a difficulty we've heard about using resources in tests. One of the core goals of Dagster's resource system is to make it easy to test pipelines.
D
The idea is that you can supply different implementations, injecting pieces of your environment that might not actually exist inside of a unit test. But it can be very difficult to actually take advantage of the resource system in unit tests, and that's because all resources need to be supplied to the pipeline at the place where the pipeline is defined. So, for example, here's a test where we'd like to construct a mock resource, supply some particular values that are relevant to that test, and execute a pipeline with it.
D
That's a problem, because we can't necessarily anticipate all the ways that we're going to want to test a pipeline at the time we're defining that pipeline. Another separate but related point of awkwardness is that instances typically include modes and presets that cannot or should not be launched on them. This is a screenshot from a production Dagit instance, but it's displaying a local partition set.
D
If a pipeline includes a prod mode and a local mode, the Dagit running in production will display both of those pipeline modes, even though in many setups the local mode should never actually be used in that environment. And then, last but not least, one of the most persistent pieces of critical feedback we've gotten about Dagster's APIs has just been the name.
D
Solid. People who've spent a lot of time with Dagster mostly get used to it, but new users often find it difficult to understand what the name "solid" has to do with executing graphs of data computations. This is a quote from one of our users. So we asked ourselves: would we be comfortable shipping a 1.0 release with these issues outstanding? For us, 1.0 means a stable set of APIs that users can expect to remain the same for a very long time.
D
Before making that commitment to stability, though, we want to make sure we can confidently say that our APIs are as intuitive and simple as they can be. So I'm going to jump in and talk about these core changes that we're planning on making — changes intended to bring our APIs to the point where we feel comfortable releasing 1.0 and committing to them for a very long time.
D
None of these changes are set in stone; I'm going to make an appeal at the end of this talk for you to try these out while they're still experimental and give us your feedback, so we can change them and fix issues that you encounter. So, jumping in: graph and job are a pair of new abstractions that we're planning to introduce. They're going to replace pipelines, modes, presets, and partition sets. Before we talk about them, let's look a little bit at how pipelines are structured in Dagster's current APIs.
D
Every pipeline includes a set of solids and the dependencies between those solids; that's what's included in the body of the function that's used to define the pipeline. This is the part of the pipeline that stays constant no matter where or how the pipeline is running, because it's not bound to any particular environment. We sometimes call it the logical part of the pipeline, as opposed to the physical specialization of a pipeline that is tied to a particular set of resources or config.
D
In your production environment, you might include a resource that represents your production database, whereas in a development environment you might include a resource that represents your development database. Pipelines often also include presets, each of which corresponds to one of the pipeline's modes and supplies configuration for the pipeline; that's another way of specializing pipelines to particular environments, but this one focusing on configuration instead of on resources.
D
With the new APIs, instead of defining a pipeline with modes and presets, you define a graph, and then you build jobs from that graph, each of which is specialized for a particular environment. A graph is the logical piece of the pipeline: it's a DAG of logical computations. A job is a specialization of that graph to a particular environment. It's an operational unit, something that you might want to monitor, something that you might want to execute, tied to development or production.
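A minimal sketch of that shape under the new experimental APIs; the resource and op names are hypothetical:

```python
from dagster import graph, op, resource

@op(required_resource_keys={"warehouse"})
def sync_tables(context):
    context.log.info(f"syncing against {context.resources.warehouse}")

@graph
def sync():
    sync_tables()

@resource
def prod_warehouse(_):
    return "prod-db-connection"  # stand-in for a real connection

@resource
def dev_warehouse(_):
    return "dev-db-connection"

# One logical graph, specialized into one job per environment.
prod_job = sync.to_job(name="sync_prod", resource_defs={"warehouse": prod_warehouse})
dev_job = sync.to_job(name="sync_dev", resource_defs={"warehouse": dev_warehouse})
```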
D
Here's how graphs and jobs fit together. Up on top we've got a representation of the data model of a single job, and down here we've got some code that defines a graph and creates three jobs that reference it. To connect the diagram with the code: the graph is the set of solids and their dependencies. It corresponds to the logical components that we talked about when we were talking about pipelines. It's a logical object that can be referenced by multiple jobs, as well as embedded inside other graphs.
D
The job is a single operational unit, usually bound to a particular environment: your production job contains production resources and production config, your dev job contains dev resources and dev config, and so on. You create a job by taking a graph, invoking to_job on it, and supplying the set of resources and config that correspond to that job.
D
You end up with a job that references the graph you invoked to_job on and has these additional environmental pieces. Each schedule or sensor points to a particular job.
D
We also require that no more than one schedule or sensor points to any particular job. This results in a simpler Dagit experience: in the new left navigation pane, we simply show a list of jobs, and those jobs can have icons next to them to represent their schedules or sensors.
D
This corresponds to the fact that when you're working with Dagster in any production or even development environment, you're typically zoomed in on a particular job: you want to understand all the runs of your production job, or you want to relaunch your development job as part of your development workflow. So the UI becomes a lot more focused on jobs, although it does still allow you to connect a set of jobs that all correspond to the same graph.
D
This change has a few positive consequences; it makes life easier in a few different ways, and I'm going to go through some of them. This will be a bit of a whirlwind of code, so don't feel bad if you miss one or two of these things. The first implication of this change is that repositories will be able to selectively include jobs built from the graph. This means that your production instance no longer needs to be cluttered with the dev modes of all your pipelines.
D
What's going on in this code example is that we're defining two different repositories. Our development instance can reference the development repository and only show the development jobs, and then our production instance, through our production workspace.yaml, can reference the prod repository and only show the production jobs.
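A sketch of that split, reusing the hypothetical prod_job and dev_job from the earlier sketch; each instance's workspace.yaml would then point at the matching repository:

```python
from dagster import repository

@repository
def prod_repo():
    # The production instance loads only production jobs -- no dev clutter.
    return [prod_job]

@repository
def dev_repo():
    return [dev_job]
```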
D
A second benefit is solving the testability problem that we talked about earlier. On the left we have what it looks like to build mock resources for tests with the old APIs, and on the right we have the new APIs.
D
You can now execute a graph with resources that you constructed inside a unit test. It's no longer required to define all the possible resource parameterizations at the pipeline definition site, so you can construct resources inside your tests that have particular attributes relevant to that particular test and execute the pipeline with those. Part of the advantage here is requiring less boilerplate.
D
As you can see, there's less code on the right than on the left. The other part is actually enabling usages that were really awkward or nearly impossible with the old APIs. Now you can have 10 different tests that each construct their own resources, and you don't need to anticipate all 10 of those tests at the site where the pipeline is defined, and then see 10 different modes corresponding to those tests when you view the pipeline in Dagit.
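A minimal sketch of that testing pattern; the graph and the stand-in resource value are hypothetical:

```python
from dagster import graph, op

@op(required_resource_keys={"warehouse"})
def count_rows(context):
    context.log.info(f"found {len(context.resources.warehouse)} rows")

@graph
def rowcount():
    count_rows()

def test_rowcount():
    # Build the mock right inside the test -- no mode declared at definition time.
    result = rowcount.execute_in_process(resources={"warehouse": ["row1", "row2"]})
    assert result.success
```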
D
Another advantage is using pointers instead of strings to target jobs from schedules and sensors. Currently in Dagster, the way it works is that if you want to point a sensor or a schedule at a job, when you construct the sensor you supply the name of the pipeline as well as the mode of that pipeline.
D
This is a little bit error-prone, because if you mess up the name — if you type one of the characters wrong — it's difficult for your IDE to tell you about it. And the sensor object itself doesn't actually have a reference to the pipeline, so if you want to go and verify something with that sensor, you have to grab that reference from somewhere else and make sure those are synced up.
D
So, with the new APIs, instead of providing the pipeline name and mode as strings, you now point directly to Python objects when defining a schedule or sensor. This means you can discover errors earlier, because linters can tell you if your schedule points to a pipeline that doesn't exist. It also makes the code briefer. And then, arguably most importantly, the sensor object has a reference to the pipeline that it's targeting, so if you want to test that sensor, all you need is a reference to that sensor object.
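A sketch of that, again reusing the hypothetical prod_job from above:

```python
from dagster import RunRequest, ScheduleDefinition, sensor

# The schedule holds a direct reference to the job object -- no name strings.
nightly_sync = ScheduleDefinition(job=prod_job, cron_schedule="0 3 * * *")

@sensor(job=prod_job)
def new_file_sensor(_context):
    # ... check an external system; request a run when something new appears ...
    yield RunRequest(run_key="file_123")  # hypothetical run key

# A test can reach the targeted job straight through the schedule object.
assert nightly_sync.job.name == "sync_prod"
```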
D
Yet another advantage is that graphs can now be nested inside of other graphs, so graphs replace both pipelines and composite solids. This used to not make sense, because nesting a pipeline with multiple modes inside another pipeline has all sorts of thorny implications: you end up with a sort of combinatorial explosion of modes, where each mode in the sub-pipeline corresponds to a mode in the parent pipeline. But by exposing graphs as a logical concept that does not involve modes, we can now provide a single abstraction for composition.
D
A graph can include a graph, and that graph can include any number of graphs; then, ultimately, you take the top-level graph, build a job out of it, and supply resources at that point that apply to the entire hierarchy of graphs.
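A minimal sketch of that composition; the ops are hypothetical:

```python
from dagster import graph, op

@op
def extract():
    return [1, 2, 3]

@op
def transform(rows):
    return [r * 2 for r in rows]

@op
def load(context, rows):
    context.log.info(f"loaded {len(rows)} rows")

# An inner graph...
@graph
def extract_transform():
    return transform(extract())

# ...nested inside an outer graph: the same abstraction at every level.
@graph
def etl():
    load(extract_transform())

# Resources and config are supplied once, at the top of the hierarchy.
etl_job = etl.to_job()
```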
D
Last of all, it's now possible to uniformly apply a mode across all the pipelines in an environment without needing to provide it to each graph individually. Suppose you have a set of resources — maybe your standard production resources, including your production database and production credentials to some set of systems. In the past, if you wanted all of your production pipelines to reference those, you had to individually include, on each of those pipelines, a mode that referenced those resources.
D
With graphs and jobs, you can instead build a job from each graph and apply that shared set of resources in one place. So, to recap the benefits: no more string pointers; the ability to embed graphs inside other graphs, with a single abstraction for execution and composition; a better Dagit experience that allows you to focus on production jobs; and less boilerplate overall.
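A sketch of applying one shared set of production resources across several graphs in a single place, reusing hypothetical pieces from the earlier sketches:

```python
from dagster import repository

PROD_RESOURCES = {"warehouse": prod_warehouse}  # defined once, shared everywhere

ALL_GRAPHS = [sync, etl]  # the graphs from the sketches above

@repository
def prod_jobs_repo():
    # Build one production job per graph, with the same resources applied.
    return [g.to_job(name=f"{g.name}_prod", resource_defs=PROD_RESOURCES) for g in ALL_GRAPHS]
```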
D
So, with these changes, we asked ourselves: would we feel comfortable releasing 1.0, going back to our original criteria?
D
Do we feel like we're supplying the most intuitive and simple set of APIs that we can? I think the changes I just talked about are a massive leap forward for simplicity. But what about intuitiveness? One thing kept nagging us.
D
That thing was the name "solid". Releasing Dagster 1.0 with solid as the core abstraction would mean committing to a name that most of our users have met with confusion and aversion. It would mean many more years of having to explain the term to people and watching them squint as they try to understand how it relates to this process of executing graphs of data computations.
D
As with the changes I discussed above, this is currently experimental, and we're planning to maintain backwards compatibility for a long, long time. We don't make this change lightly, because we know it will mean changing a lot of code, but in the long run we think it's important for making the project as accessible and successful as possible, and for making the core abstractions as intuitive to understand as they can be.
D
The op decorator is also going to support a briefer way of defining inputs and outputs. In the current APIs we have these fairly verbose input definitions and output definitions; in the new APIs we just have these simple ins and outs, which allows the emphasis to be on the actual values supplied to these definitions instead of these kind of enormous strings.
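A sketch of the briefer form, mirroring the earlier hypothetical solid:

```python
from dagster import In, Out, op

@op(
    ins={"train_set": In(list, description="Prepared training rows.")},
    out=Out(str, description="Path of the exported best model."),
)
def train_model(train_set):
    # ... training code as before ...
    return "/tmp/best_model.pkl"
```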
D
Here's what the timeline looks like. We released 0.12.0 last week, and it includes graph, job, and op as experimental changes; pipeline, solid, mode, and preset are still the main stable APIs.
D
0.12.0 also includes an opt-in UI, which I'll talk about a couple of slides down, that allows you to focus on the job-centric Dagit experience I showed a little window of earlier.
D
In 0.13.0, our plan is essentially to make graph, job, and op stable, after we receive feedback from you, and I'll talk about that in a minute.
D
So our plan is to make graph, job, and op the new stable APIs. Pipeline, solid, mode, and preset will no longer be the preferred APIs, but again, they're going to stick around for a long time, so that we're not forcing people to immediately change their code. Also in 0.13.0, the new UI will be opt-out, so you'll be defaulted to the job-focused UI but will be able to revert to the old view. And our docs and tutorial are going to focus on graphs and ops.
D
We think these new APIs are a big improvement, but before we switch over to them, it's really important for us to hear how they work out for you. We would love for you to try them out and give us your honest feedback. None of this is yet set in stone, and you can have a lot of influence over what the final product looks like.
D
You don't have to switch over all at once. If you do want to try these new APIs, it means two things: the first is converting code to the new APIs, and the second is switching the appearance of Dagit. I'll post a link in Slack to a migration guide we wrote that explains how to take code written using the pipeline APIs and translate it into the graph APIs. It goes example by example, situation by situation.
D
I think you'll see that many of the APIs end up looking quite a bit simpler. Dagit now has a toggle that allows you to switch to a view based on the new APIs. You can find it by clicking on the gear in the top right; it takes you to a page where you can flip the switch. As I mentioned before, the big difference is that when working in Dagit, you'll now usually be working within a single job.
D
That means when you're looking at a list of runs, you'll be seeing runs for a particular job; you won't be distracted with modes and partition sets from other jobs. And Dagit, as I mentioned, will still be able to load pipelines defined using the old APIs; this works essentially by flattening them into multiple jobs if they have multiple modes.
D
So again, we'd love for you to try it out and give us your feedback, and we're pretty excited about the simplicity and intuitiveness that these changes have the opportunity to bring. That's all I have. Any questions?
E
Hi, thank you very much; it was really interesting. I have a question about the part where you showed us that in the repository you are iterating over a list of graphs and changing the mode — yeah, exactly this one. Could you change the annotations here and set other parameters, or is it only for modes?
D
E
I have another question: can we partially use these new abstractions side by side with the old ones?
D
Yes, you can. You can have a repository that includes both pipelines and jobs, and op is just a rename, so you can build a graph out of ops, you can build a pipeline out of ops, and you can build a graph out of solids.
A
Yeah. I just want to also reiterate the point that Sandy made: we highly encourage you to start using these today. We think there are immediate improvements, among a number of things. Once you unclutter your instance of irrelevant modes, you kind of can't go back; that feels really, really good. And especially the testing APIs are, I would say, dramatically better, and for those who care about that, you'll find immediate ergonomic improvements.
A
This is your opportunity: you live in this tool, so it's your opportunity to give feedback and shape the future of it, and we take the feedback super seriously. So yeah, please reach out to us with feedback if you didn't feel comfortable asking questions here, and start using it as soon as possible; we're really excited to work with you on that. We also really appreciate the patience of everyone here.
A
So we will be asking you to do work, and we sincerely appreciate it, but we think it's good for the long-term health of the system and the community. And then you never have to explain to any of your colleagues what the hell "solid" means, which is a nice bonus. So thanks again, and Sandy and the practitioner team have done amazing work on this; I think it's a dramatic improvement in the system, so thank you to everyone on the team there.
A
So we have one additional announcement here, and that is: we are a company, and we have to eventually make money and have a commercial product, and we have been actively working on that for months. We are working on Dagster Cloud, and we're announcing a closed beta today. That means there's a waitlist you can sign up for, and we are looking for, and working with, early design partners to improve the system.
A
This is a hosted version of Dagster, and the goal here is to enable our users to effortlessly deploy and operate Dagster. We hear feedback all the time that you can get local Dagit running in six lines of Python, and it's kind of immediately empowering: you're learning the concepts, executing on your laptop, super fun.
A
Then you have to deploy this thing, and you kind of hit this wall; it's really challenging to do. We want to have a centralized, hosted service that makes that as smooth as the Dagit experience on your laptop. So with this system, we will host the scheduler, the web server (Dagit), and the metadata database on your behalf.
A
You will never have to run dagster instance migrate ever again — everyone's favorite thing. We will handle version upgrades on the database side while maintaining backwards compatibility.
A
So it will not compel you to upgrade your code. There's also dynamic workspace management: instead of it being driven from a yaml file, you'll be able to dynamically add to your workspace using command line utilities. Authorization and RBAC will also be included in Dagster Cloud. And your data and your code are still owned by you, so you can run Dagster Cloud and still run the actual compute on your laptop, or in your VPC, in a Kubernetes cluster.
A
We're super excited about this; the team's done amazing work. It's really smooth to spin up and it's really fun to use, actually. So yeah, you can become a design partner. Right now there's a live link on dagster.io for Dagster Cloud.
A
There's also a link at the top of the page, and you can just go and plop your name in a form and we'll reach out to you, or you can just DM me if you want and we can start chatting. So yeah, it's a really exciting day for us.
A
We're starting to unveil what's going to be our commercial platform to the world. With that, if there are any follow-up questions, we can talk about them, or we can end the meeting; we'll wait one minute for any additional questions for any of the speakers — Carlson, Sandy, or myself.
A
Well, I got a private message from someone saying they're pumped up about Dagster Cloud, so thanks, Peter. Okay, so this is a ton to absorb, especially the new 0.12.0 core API changes that will serve as the core of 1.0. So again, please play with it and reach out to us; we're excited to engage with you on Slack to really suss this out and iron out the kinks, but we think it's a massive improvement. And I think we can close out the meeting.