Description
Nick will cover the principles and origin of Dagster. Dagster is a new type of workflow engine: a data orchestrator. Moving beyond just managing the ordering and physical execution of data computations, Dagster considers the entire data application lifecycle. Practitioners in Dagster build data-aware dependency graphs designed for local development and testing; deploy those graphs to a multi-tenant, cloud-native orchestration engine; and then monitor and observe the data assets produced by those computations.
This talk will also cover our new major release, which makes significant changes to our core API that dramatically improve usability and ergonomics.
A
We have our sponsor, Aya Kemp, providing the Zoom for us, so we have a great talk today. Before I introduce the speaker, I'll just give you a ten-second introduction to our upcoming events. Besides today's talk, we also have another talk in October by Ryan Blue, who's going to talk about Apache Iceberg, and after that we have a machine-learning-related talk on semantic search and information retrieval, which is in November.
A
So for today, as I said, we have a special talk from... I'm just going to click on this.
A
Well, the Meetup page is a little slow. So, we have a special talk by Nick Schrock, who's the founder and CEO of Elementl. This is the company behind Dagster. He previously worked at Facebook, where he famously co-created GraphQL.
A
As we know, when working on a data platform and orchestrating different pipelines, we need some kind of workflow engine. Dagster is one of the alternatives: people probably know about Airflow, but Dagster is one of the other engines that's currently pretty popular. So with that, I'll turn it over to Nick and he'll tell you the whole story. I'm going to stop sharing here. Nick, it's all yours.
B
Great, thanks for that intro. Let me share my screen here. Thanks for having me, I'm really excited to speak here today. I'm going to trust that everyone can hear me and see my slides, so I'm just going to go for it.
B
Let me know in the chat if there are any problems. So yeah, as mentioned, my name is Nick Schrock. I'm the founder of Elementl, the company behind Dagster. As mentioned, the bulk of my engineering career was spent at Facebook, where I founded a team called Product Infrastructure. That team ended up creating React, React Native, and the thing I was personally involved with, which is GraphQL.
B
I like to joke that I was present at the creation of the full hipster stack. I really saw the impact of open-source projects and how broadly they can be adopted, and it was really exhilarating to be part of that.
B
Data must come from somewhere and it must go somewhere, and that means dependencies. It's the workflow manager, the orchestrator, that models these dependencies and ensures that the computations that produce data are scheduled and ordered correctly. You can see that even in the example on the screen here.
B
This is not a particularly complicated data platform, but as you can see, it has multiple different technologies at play: ingest tools like Fivetran pulling from SaaS, Python scraping the web and storing those results in S3, the data warehouse, dbt over that data warehouse. You have Census at the bottom funneling that data back into the SaaS products, and a BI tool and ML after the data warehouse. There's just a ton going on here, and it's typical at companies for data platforms to be more complicated than this. And it's the orchestration layer that's really the beating heart of it, the natural center of gravity for the data platform. Because the orchestrator encodes dependencies between all the tools, it encodes the very structure of the data platform itself, and it's a sort of beast, nearly an octopus, you could say, like the logo. I'm just joking around there, but it grows across your entire organization.
B
It serves a ton of different constituencies, and it's the natural place where all the operational work happens as well. And today's workflow managers really struggle under the weight of this task. They can't handle the complexity, dev life cycles are slow, and it's really difficult to understand what DAGs are doing. So today I'm here to talk about Dagster, which we like to call a data orchestration platform built for productivity, because we think productivity is a key underlying problem in data platforms. Everything is just too slow.
B
It's too hard to make changes. So I want to dig into this notion of the orchestrator being the central point of gravity. We divide interactions with the orchestrator into three different roles. Role one is the data practitioner: the person responsible for producing data assets for downstream consumers and stakeholders. A data practitioner could be a data engineer producing Parquet files, an analytics engineer producing tables in a data warehouse, or a data scientist producing ML models.
B
Then you have the infrastructure engineers who support all of this. They're responsible for reliable data infrastructure and they naturally interface with the orchestrator as well. And lastly, you have the asset stakeholders, who are the consumers of data assets. This could be a business user interested in the state of a critical data asset produced by one of these business processes, or it could be a peer practitioner team. We think it's really critical to serve all of these different stakeholders.
B
So, as mentioned in the intro, Airflow is the dominant incumbent in the workflow management space, and what we hear from users about that system, as well as other peer systems in the space, is the following. We hear that you can't develop your DAGs locally.
B
It's just a very slow developer life cycle: the moment you hit the orchestrator, you run into this productivity wall. Related to that, you can't test your DAGs, and testing is the bedrock of productivity, in my opinion, because without tests you can't have that fast feedback loop.
B
Next, there are all these infrastructure problems with the existing orchestrators. A huge one is dealing with dependency management, meaning that team A wants Python packages X, Y, and Z.
B
It's also difficult to reliably and independently deploy code. And last, there's the monitor-and-observe phase: it's difficult to debug DAGs quickly, there's a lack of observability and visibility into the computations inside them, and you can't keep track of the data assets within the orchestration context. We think that's critically important, because the whole point of these systems is typically to produce, monitor, and observe data assets. And so, as you'll see, we think in a deep, fundamental way that the orchestration layer should be data-asset aware.
B
So here's how we approach this. I didn't mention it, but the previous slide chopped that feedback up into three parts of the life cycle, and we really try to think about the full life cycle in a thoughtful way. The first phase is develop and test, and with Dagster we want you to be able to efficiently and productively build well-structured, testable computations.
B
Call it a data pipeline or a data platform; the application here is that we're building a recommendation engine for Hacker News. We're going to download the data via a web API, use Spark to compact it into Parquet, and load it into BigQuery. Then we're going to split it and have an ML team take it out of BigQuery and build a recommendation model using pandas, and an analytics engineering team use dbt to build analytics dashboards.
B
Right: if a data error gets to your CEO's dashboard, you might get fired, or it might take a really long time to fix, whereas if you catch it in local development, that's no big deal. Errors are orders of magnitude more expensive in later stages, along a few different dimensions. So what's the goal here? We think that a properly engineered orchestrator can bend this curve so that more errors are caught earlier in the developer life cycle.
B
Now, we're not going to claim that all errors can be, because of the nature of this domain: data quality tests, for example, often don't pass or fail until you catch them in production. But because of the dynamic that errors are so much more expensive later, this bending of the curve represents a massive increase in the overall productivity of a data organization.
B
Okay, let's actually dig into some code. This is hello world in Dagster, and this is how we accomplish it. We have the notion of a job, which is essentially a graph of computations bound to a particular environment, and then we have an op, which is our unit of computation. And you'll see here at the bottom that you can simply take that job and execute it in process. It's a very straightforward Python API.
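The job/op shape described above can be sketched in plain Python, with no Dagster dependency. The names `execute_in_process` and the dict-based graph here are illustrative, not Dagster's actual API (Dagster uses `@op` and `@job` decorators); the point is just that an op is a function and a job is a dependency graph of ops you can run in-process:

```python
# Plain-Python sketch of the op/job idea (hypothetical names, not Dagster's API):
# an "op" is just a function; a "job" is a graph of ops wired together and
# executable in-process.

def get_name():          # op: produces a value
    return "dagster"

def hello(name):         # op: consumes the upstream output
    return f"hello, {name}"

def execute_in_process(graph):
    """Run ops in the order listed, passing upstream outputs downstream.
    graph: {op_name: (fn, tuple_of_upstream_op_names)}, listed in dependency order."""
    results = {}
    for op_name, (fn, deps) in graph.items():
        results[op_name] = fn(*(results[d] for d in deps))
    return results

# The "job": a graph of computations encoded as {op: (fn, upstream_ops)}.
hello_job = {"get_name": (get_name, ()), "hello": (hello, ("get_name",))}

print(execute_in_process(hello_job)["hello"])
```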
B
We think inherently that all data pipelines are effectively graphs of functions, and systems that don't capture parameterization like that are not capturing some of the real underlying complexity and the true nature of what these things are doing. So every node in our graph, every op, is a function. The body is completely arbitrary Python: you can do whatever you want. You can do data processing in Python directly, or you can call out to external systems.
B
And,
lastly,
this
critical
notion
of
separation
of
I
o
and
compute.
This
is
the
bedrock
of
our
testability
and
development
loop.
So
you
notice
here
that
we
output
a
data
frame
which
is
a
logical
in
memory
construct.
We
don't
output
a
4k
file,
we
don't
output
a
csv
file,
we
capture
how
to
persist
the
data
frame
in
a
different
dimension.
We
call
I
o
managers.
This
allows
for
a
lot
of
power.
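The separation of IO and compute can be sketched as follows. This is an illustration of the pattern, not Dagster's actual `IOManager` class; the class and method names are hypothetical. The op returns an in-memory value, and a pluggable IO manager decides how it gets persisted, so tests can swap in an in-memory store where production would write, say, Parquet to S3:

```python
# Sketch of separating compute from IO (illustrative, not Dagster's IOManager API):
# the op returns an in-memory value; a pluggable "IO manager" decides persistence.

class InMemoryIOManager:
    """Test-friendly store; a prod variant might write Parquet files to S3."""
    def __init__(self):
        self.store = {}

    def handle_output(self, key, obj):
        self.store[key] = obj

    def load_input(self, key):
        return self.store[key]

def make_rows():
    # Op body: pure compute, returns a logical in-memory result.
    return [{"id": 1}, {"id": 2}]

def run_step(io_manager):
    # The framework, not the op, routes outputs through the IO manager.
    io_manager.handle_output("rows", make_rows())
    return io_manager.load_input("rows")

io = InMemoryIOManager()
assert run_step(io) == [{"id": 1}, {"id": 2}]
```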
B
The moment you structure your code like this within the Dagster Python framework, you immediately get tools. The first one you'll use is Dagit, which is our UI in the browser. All you do is hop into your terminal, type `dagit`, and boom: you load your pipeline, and you can use this tool almost like a local IDE, which I'm going to show you right now.
B
So we have some code here which we've loaded up. Oh, and by the way, this is a completely new reskin. If you go to our website right now, everything looks completely different; on Thursday we're doing a major new release with completely revamped core APIs and a completely new look and feel, so this is a little preview. Chester has promised not to upload this YouTube video until after that, and I will hold you to that, Chester. But back to the topic at hand.
B
The moment you put one of these pipelines in this format, you get all this tooling. For example, this is the code I was showing you before, and you can see that in the web UI these descriptions, which live in the code, are exposed right here. You can see what the inputs and outputs are, you can see where the outputs go, et cetera. It's this very rich UI for figuring out what's going on, and what's unique to Dagster is that we render this prior to computation.
B
So here we have this UI, which allows you to launch ad hoc computations, which ends up being super useful in both development and operational contexts. Now I'm going to simulate an error, so I'm going to go in here and put in a programming error.
B
Okay, this is going to take a couple of seconds, but you'll notice this is a live, updating, reactive UI with a live-updating Gantt chart, so you can get a sense of the performance. Down here is a structured event log where you can tell what's going on in the system, which makes it very searchable. Lo and behold, there's an error here, so I'm going to click on that, and here's the error.
B
I go back here, and then I can just re-execute the single step right here, or I can launch all the computations after that error. You'll see this boot up; it should take a couple of seconds.
B
And there we go, it now completes. So if you're a user of, say, Airflow, this type of rich local development loop is just not in the realm of possibility in that system, and not for incidental reasons: it's because of deep philosophical reasons and how the system manifests that philosophy. Dagster was built from the ground up to enable exactly this use case, as well as other local testability, and we just got through that demo.
B
So working with an IDE-like tool is nice, but you might be an engineer and say to me: that's all cute, but I'm never going to use it because I work in unit-testing code. Well, that's actually how I feel too. I spend a lot of time in unit-testing code, so I want to be able to run this in a CI/CD pipeline, and I want to be able to do TDD if that's my thing. And it's really straightforward to call individual ops in your jobs.
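Because an op is just a function over logical inputs, a plain pytest-style test can call it directly, with no scheduler or cluster involved. A minimal sketch (the op name and data are made up for illustration):

```python
# Sketch: unit-testing an op body directly, since it is just a Python function.

def dedupe_comments(rows):
    """Op body: pure business logic that drops duplicate ids, keeping order."""
    seen, out = set(), []
    for r in rows:
        if r["id"] not in seen:
            seen.add(r["id"])
            out.append(r)
    return out

def test_dedupe_comments():
    rows = [{"id": 1}, {"id": 1}, {"id": 2}]
    assert dedupe_comments(rows) == [{"id": 1}, {"id": 2}]

test_dedupe_comments()
```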
B
This chunk of code is actually from a presentation at last year's Airflow Summit, and it shows the code you need to write in order to detect cycles in all your Airflow DAGs. Because Airflow wasn't designed for local development and doesn't have flexible Python APIs, you effectively have to re-implement the logic of the scheduler. In order to do that, you have to load every DAG, assert that it's valid, call this undocumented API, et cetera.
B
By contrast, with Dagster you simply assert that the job exists, because if it has a cycle, it won't get constructed. Very simple, very straightforward, and we have tons of other examples of places where you catch errors earlier in the life cycle.
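The idea of failing at construction time can be sketched with a small topological-sort check (an illustration of the principle, not Dagster's internals): building the job validates the dependency graph, so a cyclic job raises immediately and the test only has to assert that construction succeeds.

```python
# Sketch of catching cycles at job-construction time (illustrative).

def build_job(deps):
    """deps: {op: set of upstream ops}. Returns a valid execution order,
    or raises ValueError on a cycle, long before anything runs."""
    order, seen, in_progress = [], set(), set()

    def visit(node):
        if node in in_progress:
            raise ValueError(f"cycle detected at {node!r}")
        if node not in seen:
            in_progress.add(node)
            for up in deps.get(node, ()):
                visit(up)
            in_progress.discard(node)
            seen.add(node)
            order.append(node)

    for n in deps:
        visit(n)
    return order

# An acyclic job constructs fine:
assert build_job({"load": set(), "transform": {"load"}}) == ["load", "transform"]

# A cyclic one fails at construction:
try:
    build_job({"a": {"b"}, "b": {"a"}})
    raised = False
except ValueError:
    raised = True
assert raised
```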
But I also want to talk about deeper testing, meaning testing the actual business logic of these DAGs, not just their structure, and this is actually super challenging.
B
What you really want is to be able to take your business logic and hold it constant while you change something else about the computation, and we model this desire directly in our abstractions. We have ops, which are environment-neutral business logic that input and output logical constructs like ints and data frames, and then we have resources, which are responsible for binding that computation to a specific environment.
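The op/resource split can be sketched like this (illustrative names, not Dagster's `resource_defs` API): the op holds environment-neutral logic and only sees a logical "warehouse" resource, and which concrete implementation it binds to is decided per environment.

```python
# Sketch of ops + resources: the same op binds to different environments.

class FakeWarehouse:
    """Test resource. A prod resource with the same interface might be
    backed by BigQuery or Snowflake instead."""
    def __init__(self):
        self.tables = {}

    def write(self, name, rows):
        self.tables[name] = rows

def publish_comments(warehouse, rows):
    # Op: environment-neutral business logic against the resource interface.
    warehouse.write("comments", rows)
    return len(rows)

# In tests, bind to the in-memory fake; the business logic is held constant.
wh = FakeWarehouse()
assert publish_comments(wh, [{"id": 1}]) == 1
assert "comments" in wh.tables
```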
B
But Dagster is also a production orchestrator that orders computations in prod, and we really designed it from the ground up for multi-tenancy and for the cloud era.
B
Data platforms are naturally multi-tenant systems, even just internally at a company: both the data science team and the data engineering team are going to use it, and usually it's way more than that. And the world is cloud-native; that's the default way things are computed now, and that's how we designed our system.
B
So we have these core components that are deployed. One is what we call the daemon, effectively the thing responsible for scheduling runs. The other is the web server, Dagit, which is responsible for observation, monitoring, et cetera. A critical architectural point about this system is that all interaction with user-defined code happens over a structured API, and this provides process isolation, which is critical for reliability.
B
So, for example, if one of your teams somehow pushed up a Python syntax error, the system just notes that it wasn't able to load that code and everything continues on normally, rather than having the scheduler load those DAGs directly into its own process and bring down the entire system, which is what would happen in Airflow.
B
This lends itself to dramatically more horizontal scalability and the ability to do things like run on spot instances for cost. So this is very much designed for the cloud, and all of this is customizable and pluggable, so you can run it on your own infrastructure with a lot of flexibility. But we also come with an out-of-the-box, production-grade Kubernetes deployment: a very flexible, nice Helm chart.
B
So what does this look like? I'm going to hop to a different web UI. You'll see this is demo.elemental.show, and it looks very similar to what I just showed you. The difference here is that this is the same set of ops bound to a different set of resources, and that's just a code question. Now I can launch this, and instead of just launching a little local process on my computer, as you can see down here, this is actually spinning up Kubernetes run workers.
B
Every single step here is booted in its own Kubernetes pod for process isolation. So all this infrastructure stuff has changed, but the business logic has been held constant, and that is really the critical component of testability. I'll be spending a little more time in this UI as we go on. Let me go back here.
B
So we've deployed the computations and kicked off a computation in prod while holding the business logic constant. Now let's talk about the monitor-and-observe part of the life cycle, and the way we're going to do this is by adding data science. So, the computation I kicked off ad hoc:
B
What it did is it downloaded the data, kicked off a Spark job to compact it into Parquet, and then loaded it into BigQuery. Now we want a different team to fetch that data out of BigQuery and use it to build a recommendation model. But before we get into the details of that:
B
I want to step back and think about how a data platform should think about the different roles it serves. The way we think a platform should think about it is that there's this full life cycle, and every single stakeholder has it: every practitioner develops an asset, and they need to monitor it. You want to enable end-to-end ownership, and really the job of a data platform engineer is to fill in all the boxes here so that each role has its own end-to-end life cycle.
B
You want a uniform surface for deployment, execution, and monitoring and observation; you should be able to use this unified substrate, and then the different practitioners can focus on their business logic using the tool of their choice. A data scientist just wants to write Python; maybe they just want to use scikit-learn, pandas, et cetera.
B
They want to be able to leverage infrastructure built by the data engineers, and that's great, but in the end the goal is to be able to focus on business logic in the tool of your choice. So, thinking about this a little: we're building a recommendation engine, which typically takes a DAG of computation, and this is how you do that in Dagster.
B
Each one of these nodes in the graph is an op, and you can see here that you construct the graph the job contains by calling functions. This doesn't actually initiate the computation; it just builds the dependency tree, and the actual bodies of those functions are invoked later.
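The "calling a function only wires the graph" idea can be sketched in a few lines. This is a toy decorator, not Dagster's actual `@op`/`@job` machinery: inside the graph definition, calling an op records a node and its upstream edges instead of running the body, which an executor invokes later.

```python
# Sketch of deferred graph construction (toy decorator, not Dagster's API).

class Node:
    def __init__(self, fn, upstream):
        self.fn, self.upstream = fn, upstream

def op(fn):
    # Calling the decorated name builds a graph node; the body runs later.
    def build(*upstream):
        return Node(fn, upstream)
    return build

@op
def extract():
    return [1, 2, 3]

@op
def train(data):
    return sum(data)

# This "call" only wires the dependency tree; no op body has executed yet.
model_node = train(extract())

def run(node):
    # The executor invokes bodies in dependency order.
    return node.fn(*(run(up) for up in node.upstream))

assert run(model_node) == 6
```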
B
So this is what the world of Dagster looks like to a data science user, and it's just plain old Python. You might not know what `TruncatedSVD.fit` does, but data scientists do, so they're just using the plain old Python they know, love, or tolerate.
B
But you can add some sugar with Dagster, which allows you to attach metadata to all these different events, both the software artifacts and the persisted events, and this enables a lot of power. The way this is configured is that every single step in this pipeline, every single op, produces what we call an asset.
B
It could be a pickled model, and it's been really powerful, both for us as dogfooders, meaning using our own technology, and for our user base, to really embrace this concept, because operationally we have what we think is a critical insight, obvious in retrospect: people care about assets, not pipelines. I hate to break it to you, the data engineer, but if you go talk to a business stakeholder, they don't care about your pipeline. No one cares.
B
They
only
care
about
the
assets
that
you
produce,
and
you
know
this
way
of
thinking.
If
you
fully
embrace
it
is
really
really
empowering.
B
One great quote from a favorite user of ours, David Wallace, who's a staff engineer, previously at Drizly, is that Dagster empowers his stakeholder teams to own their data assets like no other orchestrator can, and that's specifically because of this philosophical view we take: that assets should be part of the game.
B
So, Dagster computations emit a stream of structured events which tell the system what is going on, and one of the events the user can emit is what we call an asset materialization. (See how nice and fast that UI is.) These are events that say: hey, I produced a specific materialization, it's going to outlive the compute, and I'm going to attach metadata to it.
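The structured event stream can be sketched as a generator of event records. The event shapes here are hypothetical, simplified stand-ins for Dagster's event objects: an op yields an asset-materialization event carrying an asset key and metadata, and the system indexes those events into a catalog.

```python
# Sketch of the structured event stream (hypothetical event shapes).

def download_items():
    """Op body as a generator: yields structured events, not bare values."""
    rows = [{"id": 1}, {"id": 2}, {"id": 3}]
    # ...persist rows somewhere durable, then record the materialization:
    yield {"type": "asset_materialization",
           "asset_key": "s3://bucket/items",
           "metadata": {"row_count": len(rows)}}
    yield {"type": "output", "value": rows}

# The system consumes the event stream and indexes materializations.
catalog = {}
for event in download_items():
    if event["type"] == "asset_materialization":
        catalog[event["asset_key"]] = event["metadata"]

assert catalog["s3://bucket/items"]["row_count"] == 3
```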
B
The interesting thing about that is that we keep track of it in what we call the asset catalog. If you go and view this asset, here's this items asset, which is an S3 file containing the items produced by something, and you can see we have all this interesting metadata about what's going on. We have the row count, which goes up and down over time.
B
We have the step execution time, which has gone up recently. That's interesting, and useful too. But where this gets really interesting, in my view, is that it allows a completely new way to index into the information encoded by the orchestrator.
B
So let's pretend we're some stakeholder and we know there's a comments table in the data warehouse. I can just go to Dagster and search for that thing. Let's see: comments. And actually, oh, we're in Snowflake now, not BigQuery; we ported it and I need to update the slides. But look, we can find the Hacker News comments table, search for it, and see its properties, and then we can see: oh, this was produced by this Hacker News API download pipeline. The last time it was touched was at 1:31, six minutes ago, by this run. So you can navigate back and forth from the computations to the assets and back to the computations, and this is a super powerful way to navigate and operationalize your data systems. Let me go back here.
B
You might be familiar with the concept of the data mesh, which is all the rage, but I think the most powerful and interesting idea to come out of the data mesh is that assets should be the interface between teams, and our system encodes that directly. So in this case we have one data science job that is, in the abstract, downstream from the data engineering job, and here's the way we hook them together.
B
You'll notice this live-updating page, and these dots represent the last time we checked whether the asset key has been updated. My demo took a little longer, so I need to go back to the one-day view, and you'll see that a run was kicked off at 1:32, which happened right after that.
B
The ad hoc job that was kicked off has just completed, and we can go here, click on this, and get all the exact tooling you're used to. I can go here, check out the asset materializations, and see what's been going on, and this ends up being a super powerful operating modality for data platforms.
B
Even as we were developing this internally, the people who built the data science job told me: hey, I didn't need to know anything about their pipeline. All we did was agree on a mutual asset key at the very beginning of development: you're going to produce this table, I'm going to consume that table, and after that we live in different worlds. That's a super powerful way of operating.
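The "asset key as the contract between teams" idea can be sketched with a polling sensor. The sensor shape here is a hypothetical simplification (Dagster has its own sensor APIs): the downstream team only knows the agreed key, and a sensor watches the catalog and requests a downstream run whenever a newer materialization lands.

```python
# Sketch: an asset key is the only contract between two teams (illustrative).

catalog = {"comments": {"last_materialized": 100}}  # written by the upstream team
downstream_runs = []                                # runs of the data-science job

def comments_sensor(last_seen):
    """Poll the agreed key; request a downstream run if it was re-materialized.
    Returns the new cursor."""
    ts = catalog["comments"]["last_materialized"]
    if ts > last_seen:
        downstream_runs.append(ts)  # kick off the downstream job
        return ts
    return last_seen

cursor = comments_sensor(0)                          # first materialization seen
catalog["comments"]["last_materialized"] = 200       # upstream publishes again
cursor = comments_sensor(cursor)                     # downstream reacts
assert downstream_runs == [100, 200]
```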
B
So that's the life of a data scientist in the system. I'm just going to quickly go over what the workflow of an analytics engineer looks like. If you're not familiar with the term, "analytics engineer" is effectively a term invented because a technology called dbt exists, and you can think of dbt as a way to turn an analyst into an engineer, an analytics engineer. It's a super powerful tool.
B
So if you're an analytics engineer listening to this presentation up until now, you might hear "develop and test" and think: oh, the orchestrator is going to take over the develop-and-test part of my life cycle. Well, I don't want that, because I use dbt.
B
I like dbt a lot, I like to talk about how much I like dbt, and I have no interest in abandoning that tool. And we totally agree with you. We fundamentally believe that people should use the tools they want for data processing and that the orchestrator should facilitate that, so in this case we have a fully functioning dbt integration that integrates very nicely with the system.
B
dbt metrics: this is a very straightforward DAG; all it does is invoke `dbt run` and `dbt test`. If you go here... oh, the sensor is off, how sad, I messed that up. If the sensor had been on, you would have seen one running about ten minutes ago. But just like with our other tools, we can go to the run that I did last night.
B
One of the built-in capabilities we have is the ability to produce asset materializations as a result of a dbt run: we ingest the metadata that comes out of dbt and persist it in our system. This allows Dagster to be the single operational pane of glass where you can manage all your assets and computations, no matter what tools they end up being computed in or stored in, and that's extremely powerful.
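The dbt-to-materialization translation can be sketched against dbt's `run_results.json` artifact (the fields used here, `unique_id`, `status`, and `execution_time`, appear in that artifact, though this is a simplified stand-in, not Dagster's actual dbt integration): each successful model becomes an asset materialization with the dbt metadata attached.

```python
# Sketch: translating a dbt run_results.json into asset materializations.
import json

# Simplified excerpt of the artifact dbt writes after `dbt run`.
run_results = json.loads("""{
  "results": [
    {"unique_id": "model.analytics.comments_daily",
     "status": "success", "execution_time": 1.9},
    {"unique_id": "model.analytics.broken_model",
     "status": "error", "execution_time": 0.2}
  ]
}""")

# One materialization per successful model, keyed by the model name,
# carrying dbt's metadata into the orchestrator's catalog.
materializations = [
    {"asset_key": r["unique_id"].split(".")[-1],
     "metadata": {"execution_time_s": r["execution_time"]}}
    for r in run_results["results"] if r["status"] == "success"
]

assert materializations == [
    {"asset_key": "comments_daily", "metadata": {"execution_time_s": 1.9}}
]
```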
B
So, to sum up: we think orchestration is a point of leverage, both for you as a practitioner or platform engineer and for broad ecosystem progress. The ecosystem is incredibly fragmented and difficult to navigate, and the orchestration layer is really where all this stuff has to come together: all the tools, all the practitioners, all the storage systems. So we think it's a massive point of leverage for improvement.
B
We fully consider developing and testing in this system, deploying and executing the computations, and monitoring and observing both the computations and the produced assets. This enables a really fast developer workflow and end-to-end ownership, which means dramatic increases in both individual and organizational productivity. And it's an open-source Python framework: you can end this presentation, sign off, and go install it for free.
B
As with any open-source project, you can check us out on GitHub; we have our docs, and we have our Slack, which is very active and the best place to interact with our community. And, like I mentioned, this is actually the first presentation we've given with our updated look and feel and our updated core APIs. The release is on Thursday, and it's probably the biggest release since the initial release of the project, so we're super excited about that. And without further ado, I will take questions.
A
Great talk, Nick. So I'll wait for your instructions on when I can make the video public.
B
Well, you'll send me the video, and then I'll see if it's good, and then I'll decide when we'll push it out. No, I'm just kidding.
A
Let me say, that's a very interesting topic, because it echoes some of the stuff we do on my team. Frankly, we did build our own separation of dev, staging, and production environments, and we have our own testing environments for our workflow engines, trying to let the same code run in different environments without changing it. So it's very interesting to see how Dagster does it.
A
I do have a question on one of the pieces you mentioned, which is also interesting. Dagster actually comes with the asset catalog, which is almost like another system, similar to data cataloging, where you can discover what tables you have, what metadata you have, the min and max, that sort of thing. So it seems like Dagster actually bundles that together. In other words, this is probably something like Amundsen.
B
So the most obvious one is a direct linkage to the run that produced the asset, and being able to click that really easily, being able to set that up without integrating yet another tool. Our goal is not to replace all the asset catalogs out there; there are tons of companies devoted to that.
B
They often have asset catalogs that are also scraping for information that was manually created, they have manual annotation systems, and they have more complicated ontologies. What we wanted to do is, one, provide a built-in data asset catalog for simple use cases that people get out of the box, and two, be able to leverage it for operational use cases where it really makes sense to integrate with the orchestrator, and you'll continue to see us double down in that direction.
B
We fully plan on having our metadata database be queryable, so you can query it yourself and ingest it, and we produce a structured event stream, so we fully expect that to produce exhaust that would be consumed by the likes of DataHub or Amundsen, et cetera.
A
Yeah, I can see that, but it seems like the majority of a company's data results are actually produced through the orchestration workflow engines, so there'll be a big chunk coming from here, I guess.
B
Correct, yeah, but lots of companies will have many different orchestrators, with teams making their own decisions about stuff, so we don't have any illusions that we're going to be the one asset catalog to rule them all anytime soon, nor do we want to cover a lot of the use cases that I covered.
B
I mean, I think the real answer is nearly every company. We like to say that every company has a data platform; it's a question of whether you acknowledge it or not, because if you don't acknowledge it and staff it, it still exists, but it becomes a complete, unorganized mess that's not operationalized or well engineered. And what a data platform is is where you manage and curate all of the data assets of your org, meaning:
B
The
data
has
been
ripped
out
of
its
original
context,
whether
it's
a
sas
app
one
of
your
operational
databases,
slash
system
of
record
or
like
scrape
from
the
web
or
something
and
so
any
time
that
you
are
doing
that
and
you're
stitching
together,
more
than
one
say,
computational
runtime
and
you
need
operational
robustness
around
that.
You
need
something
like
dagster
and
at
that
point
it's
kind
of
like
it's,
the
type
of
thing
where,
like,
if
you're
going
to
build
something
you
want
to
do
it
right.
B
So
it's
you
know
to
me
if
you
don't
use
something
like
dexter,
it's
kind
of
like
saying
when
you
start
writing
a
computer
program.
It's
like!
Oh
I'm
just
going
to
write
some
code,
I'm
going
to
refactor
into
functions
later.
It's
like
no,
you
just
want
to
start
like
engineering
it
properly
from
day
zero
in
order
to
build
a
well-structured
system
where
the
entropy
isn't
going
to
get
out
of
control.
B
So
you
know
most
companies,
I
know
of
require
a
system
like
this
because
they
need
to
ingest
data
from
sas
apps.
They
need
to
integrate
those,
they
need
to
compute
data
that
resides
in
their
data
warehouse
and
then
they
do
something
with
that,
and
even
in
the
simplest
cases
of
that,
you
need
an
orchestrator
to
make
that
all
work
well
and
reliably.
A
So, the next question: this one says it's a great demo, and they want to know, can you share the integration with Databricks?
B
And so there's an expectation of a sustainable business at some point.
B
So this is the now more broadly used, so-called hybrid SaaS model, used by the likes of Buildkite and Databricks. We handle upgrades for you, maintain the performance of the metadata database and the web server, handle all upgrades, and so on and so forth, and then we also add enterprise features like authentication, RBAC, auditing, and a bunch of CI/CD help.
A
Yeah, I think that clears up a lot of people's security concerns and other concerns about the company, right. Let me see what's next. Someone agrees with your idea that the interface should be assets; that's just a comment.
B
Yeah, well, Dagster is actively used on all the major cloud providers, and Dagster Cloud is itself hosted on AWS, if you're asking about that. But that's actually very opaque to you, and Dagster is flexible enough to use on any cloud provider, as well as on-prem.
A
Cool. So how big is your team? If I recall, one of my acquaintances, somebody from Canada, joined your team, if I remember right.
B
You're probably referring to Sandy. Yes, very talented; he leads our practitioner team, which is responsible for most of the open-source, practitioner-facing APIs. We have a team of 18 people, including me, and we are actively hiring: a head of product, devrel, a tech writer, and also engineers.
A
Okay, we've got one more question, actually two more, kind of related. Do you have an SDK to get job dependency lineage, to integrate with Amundsen or DataHub?
B
That does not exist. We're not opposed to building it or having a community contribution, but it just hasn't come up, actually. We build and support integrations based on demand, but we're philosophically very open to it, as I mentioned.
B
So yes: if you go to our open-source repo, you can see all the different libraries that have been built, and there are also libraries out there in the wild. Something we need to do is build a catalog of those. But one of the advantages of an open-source system is that you can see all the integrations, and often those are community-contributed, which is great.
A
Great. I think that's all the questions from the audience, and I'd like to thank Nick again for the great talk and for introducing Dagster to our communities. We hope to welcome you back next time to give more insights on Dagster.