From YouTube: Data Ingestion as Code - a showcase of managing managed services. Dagster Community Day - Dec 2022
Description
In this session, Ben Pankow provides a demo of 'Data Ingestion as Code', and Nick Schrock explains the broader context of 'managing managed services' from the perspective of the data orchestration layer.
Ben: Hi everyone, my name is Ben and I'm an engineer working on Dagster. Today, along with Nick, I'm excited to share our new data-ingestion-as-code functionality for our Airbyte and Fivetran integrations. This feature allows you to manage your Fivetran and Airbyte connections without leaving your Python code base.
Ben: Here we have an Airbyte resource that's pointing at our local Airbyte instance. We'll go ahead and define an AirbyteConnection object. We'll provide a name and then define the source and destination using typed Python classes, which are automatically generated from your Airbyte spec files. We'll input config, passing credentials from the environment. Then we'll specify the list of streams to sync, including the sync mode, and finally we'll tell Dagster to load this connection.
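The steps Ben walks through can be sketched in plain Python. This is a toy model of the pattern, not the actual dagster-airbyte API: the class names, fields, and sync-mode strings below are hypothetical stand-ins for the generated typed classes he describes.

```python
import os
from dataclasses import dataclass, field

# Toy model of "ingestion as code" -- hypothetical class names, NOT the
# real dagster-airbyte API. The idea: a connection is typed Python data,
# with credentials pulled from the environment rather than hardcoded.

@dataclass
class PostgresSource:          # stand-in for a generated source class
    host: str
    database: str
    username: str
    password: str

@dataclass
class SnowflakeDestination:    # stand-in for a generated destination class
    account: str
    warehouse: str
    password: str

@dataclass
class Connection:
    name: str
    source: PostgresSource
    destination: SnowflakeDestination
    streams: dict = field(default_factory=dict)  # stream name -> sync mode

conn = Connection(
    name="postgres_to_snowflake",
    source=PostgresSource(
        host="localhost",
        database="app",
        username="reader",
        password=os.getenv("PG_PASSWORD", ""),         # from the environment
    ),
    destination=SnowflakeDestination(
        account="acme",
        warehouse="LOADING",
        password=os.getenv("SNOWFLAKE_PASSWORD", ""),  # from the environment
    ),
    streams={"users": "incremental", "orders": "full_refresh"},
)
```

Because the connection is an ordinary Python object, it can be diffed in a pull request, reviewed, and versioned like any other code.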
Ben: If we move over to our Dagit instance and reload our asset definitions, we can see a new set of software-defined assets associated with the tables that Airbyte is generating in our destination in Snowflake. If we click on them, we even get metadata, such as the table schema that Airbyte is going to generate once we run a sync. Selecting our assets, we can go ahead and kick off the materialization.
Nick: Thanks, Ben, that was an awesome demo. First of all, for those who don't know me, my name is Nick Schrock. I'm the CTO and founder of Elementl. What Ben demonstrated here today was not just a feature of a couple of integrations, though what features they were; we really think it's a massive leap forward for practitioners in the modern data stack.
B
So
why
do
we
think
that?
Well
what
Dem?
What
this
demo
showed
is
that
now
ingestion
tools
can
be
a
first-class
citizen
in
your
engineering
workflow.
You
can
manage
their
behavior
in
modern
type
to
python.
You
can
manage
change
with
Source
control
and
get
all
the
associated
benefits.
Cicd.
You
can
review
changes,
you
can
roll
them
back.
You
can
test
them.
You
can
build
your
own
abstractions
on
top
of
them,
and
Dexter
remains
a
source
of
Truth
for
your
asset
definitions
rather
than
having
them
be
defined
as
state
in
a
managed
service
or
app.
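The "you can test them" point is literal once connections are plain Python data: an ordinary assertion can validate ingestion config in CI before it ever reaches production. A minimal sketch, with a hypothetical config shape:

```python
# Hypothetical connection config expressed as plain Python data.
connection = {
    "name": "postgres_to_snowflake",
    "streams": {"users": "incremental", "orders": "full_refresh"},
}

ALLOWED_SYNC_MODES = {"incremental", "full_refresh"}

def check_sync_modes(conn: dict) -> None:
    # The kind of guard a point-and-click UI can't run in CI:
    # fail the build if any stream has a typo'd sync mode.
    bad = {s: m for s, m in conn["streams"].items()
           if m not in ALLOWED_SYNC_MODES}
    assert not bad, f"invalid sync modes: {bad}"

check_sync_modes(connection)  # passes; a typo would fail the CI run
```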
Nick: This is the way that engineers want to work, but it will not end with Airbyte, Fivetran, and other ingestion tools. There are, and will be, other managed services that define and control the behavior of asset definitions within their tool. The question is: how do you want to manage change with those tools?
Nick: In other words, what manages the managed services? Let's start with our fundamental assumptions. We at Dagster believe a few things, and our work is centered around these beliefs. We believe that data management is a software engineering discipline, which means that all data assets should be defined in software, meaning code, because data assets are fundamentally business logic. And change in these systems, because it's software, should be managed through the software engineering life cycle.
Nick: So how does that apply to these ingestion tools and other managed services? Well, if you believe that data management is a software engineering discipline, you shouldn't be using your mouse, pointing and clicking around a UI, to make production changes and deploy them. Put another way: a data practitioner should not be forced to point and click in a UI to make changes to, and deploy, business logic. It's incredibly dangerous and fragile. A lot of this work actually stemmed from our own internal data platform experience.
Nick: We extensively use ingestion tools, and it became increasingly scary and nerve-wracking to make changes to our own ingestion logic. What if someone screwed it up, mistyped something, or clicked the wrong thing? How do you roll that back? How do you figure out what happened? How do you figure out who did it? Maybe you have an in-app audit feature that may or may not be complete, but even if it exists, it's totally disconnected from the rest of your processes.
Nick: Additionally, everything that's encoded in your ingestion tool is completely interconnected with what is going on in the rest of your platform. You have downstream computations that depend on it, so we really want to manage this with code. Well, isn't this infrastructure as code? That is in fact a commonly held belief, and it's a reasonable assumption.
Nick: This is in fact a discussion on one of Airbyte's forums, and they plainly say: "We would like to be able to manage and update our data sync operations as code," which is exactly what we just showed you. So we should just use Terraform, right? Nope, we don't think so. Let's talk about why for a second. Terraform is a bespoke, custom DSL designed for managing infrastructure.
Nick: It was designed so that infrastructure engineers could set up load balancers, EC2 instances, databases, and the like. It's at a fundamentally different layer of the stack, and it's designed for a completely different persona. It's for infrastructure, not business logic. As a result, we don't think a data practitioner should be forced to learn it in order to define data assets.
Nick: They shouldn't have to learn a completely foreign toolchain and language to make changes to what is fundamentally business logic. Furthermore, Terraform is a tool that has no knowledge of the rest of your data platform and assets. It's a completely siloed black box. How does one declare a dependency on an entity defined in Terraform? You can't; you'd have to double-encode it. You'd have to write it in Terraform, then probably write it again in your orchestrator, and then set dependencies on it.
Nick: We believe that the orchestrator is the ultimate source of truth for defining and operating your data assets. It's where all your dependencies are defined, and it's where all those dependencies are enforced, because it's the orchestrator that enforces the order of execution. And it's that single operational pane of glass for your entire data team. Additionally, in our view, the orchestrator is at the center of your data team's engineering and deployment life cycle. It's where everything has to come together.
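The "enforces the order of execution" point can be illustrated with a toy dependency graph, using only the Python standard library (hypothetical asset names, not Dagster's actual API): downstream assets declare their ingestion-produced upstreams in one place, and a topological sort yields the run order.

```python
from graphlib import TopologicalSorter

# Toy model: each asset maps to the set of assets it depends on.
# "airbyte_users" stands in for a table produced by the ingestion tool.
deps = {
    "airbyte_users": set(),              # produced by ingestion
    "users_cleaned": {"airbyte_users"},  # downstream transformation
    "daily_report": {"users_cleaned"},   # further downstream
}

# The orchestrator's job in miniature: ingestion runs first,
# then each downstream computation in dependency order.
order = list(TopologicalSorter(deps).static_order())
```

Because the graph lives in one system, there is no double encoding: the same declaration that defines a dependency is the one that schedules it.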
Nick: As a result, we think it's very natural for ingestion tools and other managed services to be peers to dbt, Spark, Python-driven assets, and all the other tools, and to have a single, cohesive workflow and system for defining your data platform. So thanks for your time, and thanks for coming to Dagster Day.