From YouTube: Fireside Chat: Nick Schrock (Founder & CEO, Elementl) with Matt Turck (Partner, FirstMark)
Description
Nick Schrock, Founder and CEO of Elementl, spoke with FirstMark's Matt Turck in a virtual fireside chat at Data Driven NYC in June 2021. They spoke about Dagster, open source, and much more.
Data Driven NYC is a monthly event covering Big Data and data-driven products and startups, hosted by Matt Turck, partner at FirstMark Capital.
A
Nick Schrock is the founder of Elementl, the company behind the popular open source project Dagster. Prior to Elementl, Nick was a principal engineer and director at Facebook from 2009 to 2017, where he founded the product infrastructure team and co-created GraphQL, another very popular open source project.
A
Welcome, Nick.

B
Thanks for having me.

A
Very good. So I'd love to start with a discussion of the current state of the data ecosystem, basically the premise for the creation of Dagster and then Elementl. What were the problems that you saw that you wanted to address?
B
Yeah, that's a great question. It really is a fundamental thing. If you go to, for example, a conference about machine learning or data science, you often hear people say, "I spend 90% of my time data cleaning and 10% of my time doing my job," and that represents a fundamental problem in the entire ecosystem: people feel like they're not doing what they're supposed to be doing, and they express that in terms of that pain.
B
But it's actually often a more complex issue than that, and the analogy that I had in my head was to front-end engineering in the early 2010s, where what you would hear people say is, "I spend 90% of my time fighting the browser and 10% of my time doing my job." For those of you who knew any web developers in 2010, if you mentioned IE6,
B
you might get an intensely emotional reaction in terms of browsers, for example. That kind of stuff doesn't happen anymore, and not just because IE6 doesn't exist or because the browsers got better. It was really a holistic ecosystem problem, and it required tooling on many different dimensions.
A
And just to double-click on that: I think you've said that if you look at the evolution of the data space, a lot of the early wins over the last few years have been around managing scale, and that we're now switching to a phase where the primary challenges are higher in the stack, around productivity, testing, integration. Is that fair?
B
I think that's fair. You can think of it sort of like Maslow's hierarchy of needs. When the amount of data started to explode, there was a very critical need: you couldn't even process it on a technical level, you could not scale out compute, and that's why in the big data revolution you called it "big" data. You started with Hadoop, then Spark, and now the cloud data warehouse.
B
So now, if you have petabytes of data, you are able to process that efficiently at scale, but the problems are higher up on Maslow's hierarchy, so to speak. Now it's: okay, we can actually process the data. Does it have high quality? Are the people who actually build the computations that process that data productive? Can you track that data throughout the ecosystem? And so on and so forth.
B
There was an amazing engineering achievement throughout the 2010s, which was solving these massive, purely technical scale problems. But now we're talking about organizational scale, dealing with complexity, dealing with developer productivity, and any number of other dimensions.
B
Yes, that's the favorite question any engineer gets: explaining what they do without confusing the hell out of everyone. The way I would explain it is that these data systems involve multiple types of people. Often you would have, say, a data engineer, an analytics engineer, and a data scientist. They use different tools and they have different skill sets, but in these systems all data has to come from somewhere and go somewhere.
B
What would literally happen is that station one in the factory would take usually two hours to complete its task, and then the next person would start after that, but there was no way to actually manage those dependencies. So, literally, it would say: okay, it usually finishes in two hours, and the first thing started at 1 AM, so I'm going to start every day at 4 AM. That's the situation in the factory analogy.
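The difference between guessing at start times and actually managing dependencies can be sketched in a few lines of Python. This is a toy illustration of the idea, not any real scheduler's API; the station names and the cron-style schedule are invented. The cron-style approach hard-codes a 4 AM start and breaks on any day when station one runs long, while the dependency-aware version simply runs each station the moment its upstream finishes.

```python
# Toy sketch: cron-style scheduling vs. dependency-aware execution.
# All names and durations here are invented for illustration.

def run_station(name, upstream_output=None):
    """Pretend to run one factory station and return its output."""
    return f"{name}({upstream_output})" if upstream_output else name

# Cron-style: station_two just starts at a fixed time and *hopes* that
# station_one has finished. If station_one takes three hours instead of
# two, station_two reads incomplete data.
CRON_SCHEDULE = {"station_one": "01:00", "station_two": "04:00"}

# Dependency-aware: the graph encodes "station_two needs station_one",
# so each station starts exactly when its inputs are ready.
DEPENDENCIES = {
    "station_one": [],
    "station_two": ["station_one"],
    "station_three": ["station_two"],
}

def run_in_dependency_order(deps):
    done, outputs = set(), {}
    while len(done) < len(deps):
        for station, upstream in deps.items():
            if station not in done and all(u in done for u in upstream):
                inp = outputs[upstream[0]] if upstream else None
                outputs[station] = run_station(station, inp)
                done.add(station)
    return outputs

print(run_in_dependency_order(DEPENDENCIES)["station_three"])
```

The dependency-aware runner never needs a clock at all: the edges of the graph are the schedule.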
A
B
Right. So I gave the example where there's one assembly line and three stations. Well, that's not the way it ends up working. You end up having thousands of assembly lines, sometimes with thousands of stations, and they have all sorts of crazy interconnections. And what's even more interesting about it is that the interaction I described between one person and the next.
B
It happens both on a macro scale, meaning teams interrelate in the same way, but also the individual practitioner will build their own little assembly line, because it makes sense for them to do it. As they're figuring out what their data is, they might say: oh, this is an intermediate data product that will be generally useful, and I kind of want to have a checkpoint right there in the process.
B
So what ends up happening is that these systems explode in complexity, and you need to be able to take the assembly line offline, test it on test data, and put it back. If something goes wrong, you need to halt the assembly line, maybe start it from a certain point, move things around. And then it also makes sense for this assembly line to be aware of and track the things that are actually coming out of it. So: okay, I have this widget over here; it came from this previous intermediate, which came from this one, etc.
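The "this widget came from this intermediate, which came from this" idea is provenance tracking. A minimal sketch, with invented artifact names and no real library, of how an assembly line could record where each artifact came from so that any output can be walked backwards to its sources:

```python
# Toy lineage tracker: each produced artifact records its direct parent,
# so any output can be traced back to its raw input. Names are invented.

lineage = {}  # artifact name -> parent artifact name (None for raw inputs)

def produce(name, parent=None):
    """Register an artifact and remember which artifact it came from."""
    lineage[name] = parent
    return name

def trace_back(artifact):
    """Walk the recorded lineage from an artifact back to its raw input."""
    chain = [artifact]
    while lineage.get(chain[-1]) is not None:
        chain.append(lineage[chain[-1]])
    return chain

raw = produce("raw_events")
intermediate = produce("cleaned_events", parent=raw)
widget = produce("daily_report", parent=intermediate)

print(trace_back(widget))  # ['daily_report', 'cleaned_events', 'raw_events']
```

A real orchestrator records much richer metadata per edge, but the back-link structure is the same.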
B
So we think this orchestrator is really the central leverage point, the thing that makes sense to be the core of the data platform. It's also the place where people actually build their machines, their software programs, to continue with the factory analogy. This is where they actually build their machines, and so it's critical to think of it not just as an operational tool, but as a place where people can productively work.
A
Great. And so when did you start Dagster? Tell us about the project.
B
Yeah, so I formed the company in kind of an exploratory sense in 2018, but we didn't really start working on it in earnest until late 2018, early 2019, and then we launched the project publicly in the summer of 2019.
B
So I guess the question was: how did it start? I was looking around and exploring, taking an interest in the space, and talking a lot to people.
B
I talked a lot with Abe, one of the next panelists, during this time, and that's where we really honed in on orchestration as a critical part. So I just started building, experimenting with stuff, building out in the open, getting feedback, working with people, and that was really the genesis of the project.
B
And the genesis of the project was really the insight that people think of data sets as physical things, right, like a table in a database. But in the modern world, where you're really applying software engineering processes to data, all that data ends up being computed. And so what we thought is that you should move the primary focus of energy from the produced data set to the process that produces it, because maybe you throw away that data set and you need to recompute it, reproduce it.
B
It's really the computation which matters; that's the primary focus. And we thought that you could really make the focus of the orchestrator this sort of virtual data set concept, and that was kind of the genesis of the project.
A
I'd love to continue down that path and maybe double-click on how Dagster works, some of the high-level concepts. There's, for example, the concept of a solid, which is, I think, your atomic unit. Do you want to talk about that and how Dagster works in general?
B
Sure. So Dagster is an open source Python project. If you're a programmer, you can just type a simple command, pip install dagster, and you're off to the races. You write a little bit of code, and what you effectively do is build these functions, which we call solids, that define a computation, meaning a step in the factory, to keep with that analogy. Then you can construct graphs out of those, and the moment you use our APIs and structure your code this way, you immediately have access to all sorts of tooling, without any infrastructure, on your laptop; you don't need to deploy anything.
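Dagster's real decorator names have changed across releases, so rather than quote its API from memory, here is a toy sketch of the concept being described: a "solid" is just a plain Python function tagged so that tooling can discover it, and a graph is those functions wired output-to-input. None of these names are Dagster's; they are invented for illustration.

```python
# Toy sketch of the solids-and-graphs idea (NOT Dagster's actual API).

def solid(fn):
    """Mark a plain function as a step in a computation graph."""
    fn.is_solid = True  # tag the function so tooling could discover it
    return fn

@solid
def extract():
    return [3, 1, 2]

@solid
def transform(rows):
    return sorted(rows)

@solid
def load(rows):
    return f"loaded {len(rows)} rows"

def run_graph():
    # The graph is ordinary function composition: outputs flow to inputs,
    # which is what lets a tool draw, schedule, and test the computation.
    return load(transform(extract()))

print(run_graph())  # loaded 3 rows
```

Because each step is an ordinary function, any step can be run in isolation against test data, which is the local dev-and-test loop described below.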
B
And then you can also use that tooling as almost like a local IDE for these graphs. One of the things that we really focus on with Dagster is being able to execute these computations, these programs, in different environments: you can develop on your laptop and then deploy to a piece of infrastructure without having to change the core business logic. That's something we really focused on, and it's a critical piece of this. So you develop locally, you have this fast local development loop, you can test things, and then you can deploy to any infrastructure. From that point, you're scheduling your computations, so you're saying, for example, I want to run this every day.
B
Or, in a first-class way, I want to run this whenever something upstream changes. And then you have all sorts of what we consider consumer-grade tools to monitor and observe those computations. You can track the process as it unfolds, using a live Gantt chart viewer, for example, and we also track the assets that are produced by those computations and can back-link and say:
B
Oh, this thing came from this computation, which is extremely useful when you're actually figuring out what's going on in these systems. So we take this through the entire process, thinking about: okay, what's the fast develop-and-test lifecycle? There's a huge opportunity there for massive improvements.
B
How do you deploy it reliably and allow multiple teams to use common infrastructure? That's a critical thing, since we think the orchestrator is the center of the data platform. And then, lastly, monitoring and observing those things, both the computations and the produced assets, an asset being a table, an ML model, any physical materialization of something.
A
Great. Who's a good customer or type of user for Dagster? Do you need to have a certain type of infrastructure in place? Do you need to be a certain size? Do you need a specific type of talent on the team? Who's a good profile?
B
Yeah, so we've really found two classes of user who really gravitate toward the system.
B
One is what we consider an emerging title, the data platform engineer. A lot of people self-title that way, and a lot of data engineers act as data platform engineers. What they see is that, hey, inside every company there's a data platform, whether they acknowledge it or not, and this data platform is where all these people come together: the data engineers can work with the data scientists, and all this stuff can execute on time.
B
You can have a single point of management for all the important data and all the heterogeneous tools. So that user, maybe they'll start out and say: okay, all we need is an ingest tool like Fivetran.
A
B
With what's called the modern data stack, once they need to do anything outside of that, they need an orchestrator, and they want an orchestrator that's in line with the values of those tools. A lot of those users have really gravitated towards Dagster. As one of our users said: what dbt did for our SQL, Dagster did for our Python, in very concrete ways. And I would say the second type of user that we really find starting to gravitate towards this
B
are people building end-to-end model training pipelines. They want to work in Python, they want to use tools like pandas and scikit-learn, they want a fast development workflow, and they want an orchestrator because they need it, and this tool really speaks to them. Often they have to roll their own infrastructure.
B
And one of the things that Dagster does is attempt to thoughtfully think about the interface between what we call practitioners, those who are responsible for the production of data assets, and infrastructure folks. Those are kind of two jobs, and sometimes one human is doing both, and the way to make that manageable is to have nice software abstractions that deal with it.
A
Great, thanks. Help us understand how you position Dagster in the orchestrator segment. Like all categories in the data world, there are other folks: we had Jeremiah from Prefect, for example, at a prior event; there's Airflow; and then there are the historical ones like Luigi; we had Kedro at the event as well. So for folks, what are the bright lines in terms of thinking about Dagster in comparison to some of those others?
B
It's a great question, and as you know, positioning is always an evolving art. But I think the primary difference, and I'll focus on Airflow and Prefect and start with those, Airflow being definitely the dominant incumbent in the space as it's traditionally defined, is that they don't consider the full lifecycle of developing data products.
B
They view their mandate as very narrow: purely the operational use cases of ordering, this comes after that, and that comes after that, plus the operational complexity in that. So you need to know how to retry things, and so on and so forth.
But we think the graphs that the orchestrator encodes are, one, complex enough that they need a fully thought-out local development lifecycle, and two, in some ways the structure of the applications themselves, especially the ones that are written in Python. So we really think about the dev and test lifecycle. We also want to be data-aware, and that means that we have data dependencies that are encoded, meaning that not only do we order the machines, we also know that this input comes in from upstream and this output comes out, without getting into details.
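The distinction between merely ordering the machines and encoding data dependencies can be made concrete. In an ordering-only model the scheduler knows only "B runs after A", and any data must travel through a side channel the framework can't see; in a data-aware model the edge itself carries A's output into B. A toy sketch, with invented names and no real library:

```python
# Ordering-only: the edge says "run b after a" but carries no data;
# the tasks smuggle state through an external side channel that the
# scheduler knows nothing about.
side_channel = {}

def a_task():
    side_channel["rows"] = [1, 2, 3]

def b_task():
    return sum(side_channel["rows"])  # implicit, untracked dependency

# Data-aware: the dependency *is* the data. The framework can now see
# the edge, render lineage, and test b_solid with a fake input.
def a_solid():
    return [1, 2, 3]

def b_solid(rows):
    return sum(rows)

a_task()
ordering_result = b_task()
data_aware_result = b_solid(a_solid())
assert ordering_result == data_aware_result == 6
```

Both styles compute the same number, but only the second one lets tooling answer "where did this value come from?" without reading the function bodies.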
B
This is, we believe, a much more natural programming model for practitioners. Dagster also really embraces the fact that these technologies are inherently multi-tenant, meaning that multiple teams want to be able to deploy to common shared infrastructure, and Airflow just wasn't designed like that: all the teams share a Python environment.
B
These are all very technical things, but effectively one team can push up one mistake and bring down the entire system, which seems like a bad thing. And then Airflow is not data-aware. Prefect's lineage, I'd say, is very direct: Jeremiah was a primary contributor to Airflow, wanted to make some changes, wasn't able to do so, and went and started Prefect. And just like Airflow, Prefect views itself as a very strictly operational tool.
B
Their framing is about positive and negative engineering, right: we take care of the negative engineering, which is effectively ordering those computations, retrying things, and taking care of errors, which is effectively new words to describe exactly what Airflow's goal is. And then they also describe their system as an insurance company.
B
What's notable about that is that insurance companies don't make you more productive. I mean, I've never met an insurance company that makes you happier, for certain. And it doesn't really speak to the full end-to-end process: it plugs into an existing system, deals with this operational complexity, and that's its only remit. We believe that orchestration is so central that you need to think of it in terms of the full lifecycle, and especially to think of it as a productivity tool.
A
Great, thank you. So what's next? You guys are a thriving startup. I know you did some integrations with dbt and with Great Expectations, who we're going to speak with in a few minutes. What's next on the roadmap for the next year or so?
B
Yeah, a year is a long time as a startup, and you can never predict the future. I think we'll get to this, but Elementl is a venture-backed commercial company, so at some point we will need a revenue model, a business model. So, spoiler alert, we are working on that aspect of the business, but I can't talk about it in too much detail.
B
On the open source roadmap, I think there are two things.
B
One is that we've been in what I'll call an open, applied R&D phase of the company, where we've been really working with a set of targeted design partners and not hyper-focused on growing adoption too quickly, because we didn't want to have too many partners while we knew the technology was still changing a bit. We're about to land some changes that will really set us up for kind of a pre-1.0 release.
B
We've learned a ton from the last year and a half of work, and we really feel like we're settling on a set of abstractions and concepts that will serve as the foundation of the technology for years and years to come. So in the short term, that is really our focus. And beyond that, once we have that stable core, we also want to focus on expanding the set of stakeholders who can interact with the system, because we've seen a lot of early users do this really effectively, where non-technical ops folks are able to self-serve workflows without any intervention from the data platform team, which is just incredibly valuable, and we expect to double down on things like that. So that's kind of what we're thinking, in very broad terms.
A
Very, very good, okay. Well, as we're getting close to the end of the allocated time, why don't we finish with a couple of rapid-fire questions which have nothing to do with Dagster. So, first question: outside of Dagster, what is something in the data ecosystem, whether that's a tool or a project or a company, that you just love and think is the coolest thing?
B
I'll give probably a common answer, which is dbt. It's really taking over a big part of the stack, and I've known that team since 2018, when they gave a conference talk and I ran up to them because I was super excited about what they were saying, because I felt like we shared a lot of alignment on values in terms of the way you can structure these things.
B
You have data dependencies and all this stuff, but they were very specialized for SQL. And what's really cool is, I think, they're really on the forefront of saying: hey, the way to do this is not to try to remove analysts from the equation; the way to do this is to bring engineering processes into their lives and, so to speak, upskill them. That not only makes them more productive, it actually allows them to re-title themselves and be sufficiently differentiated that they actually make more money.
B
They associate their careers with the technology, and I really admire technologies like that. I feel like my previous work in front end is like that, where GraphQL and its kind of sibling team, React, show up on people's resumes, and people really invest in them in their careers. So I really like technologies that aren't just more efficient but help with careers. And then, actually...
A
You were only going to say one? Oh, okay. And last question, one minute or less: how does somebody like you learn? What are some of the data learning resources you recommend, whether that's a newsletter or a podcast or a conference, whatever comes to mind, just one or two quick ones?
B
Podcasts, totally. I really like Software Engineering Daily and the Data Engineering Podcast. And then the other thing is that you shouldn't fear reaching out to someone who's on one of these podcasts. Just email them and say: hey, I thought what you said was super interesting. As someone who's been on podcasts, if I ever get that email, I'm super excited to talk to that person. So, both of those.
A
Okay, wonderful. Look, this was great, super interesting. The popularity of Dagster and the number of times it comes up in conversation is really impressive, considering that you guys haven't been doing this for very long. So congratulations on everything you've built. Excited to see how you progress over the next few months and years, and I hope you come back soon and tell us about all the great things you've achieved.