Description
Arun Kumar shares how the team at DoorDash uses Dagster to power their metrics layer and scale experimentation.
A
Hey everyone, my name is Arun and I work as a software engineer on the experimentation team at DoorDash. Are you able to see the slides? Okay, sorry, yes, moving on.
So today I'm going to talk about how DoorDash leveraged Dagster to power and scale experimentation analysis. Before getting into the talk, I want to give some quick context about who we are and what we do, for those of you who might not have heard about DoorDash. DoorDash is a local commerce platform that connects consumers with local businesses in multiple countries; it was founded in 2013.
We are building the infrastructure to enable a three-sided marketplace between merchants, consumers, and Dashers, where merchants provide services to consumers, consumers find their favorite local businesses within their locality, and Dashers deliver the goods from the merchants to the consumers.
DoorDash is truly a data-driven company. We have multiple data use cases, and we use Dagster for the ones that require dynamic orchestration: ML feature and training pipelines, forecasting pipelines, some dynamic Spark transformations, and experimentation analysis and reporting. Today I want to dive deep into the experimentation analysis use case and talk about a particular project that we recently implemented using Dagster.
Before getting into the project, I want to give a quick overview of what experimentation means, for those of you who might not have heard much about it.
Experiments are commonly used as an approach to making data-driven decisions. They help us statistically test the efficacy of any new feature that we want to introduce into our platform. Instead of just making decisions based on instinct, we look for statistical evidence to prove that the feature is really beneficial for our company's key performance metrics.
There are different types of experiments currently being performed in the industry, and A/B testing is probably the most commonly used experiment type. Here is how it works: let's say we have a search team that wants to test a particular search algorithm. We do not roll it out to the entire audience directly. What we usually do is bucket the users into control and treatment, and show the new search algorithm only to the treatment users, while the control users still see the old algorithm. Then we incrementally roll more users into the treatment group and measure various metrics that would demonstrate an improvement in the search algorithm, like click-through rate. We only ship more users into treatment when we see some upward movement in the metric for the treatment users.
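To make that decision rule concrete, here is a minimal, hypothetical significance check on click-through rates using a two-sample t-test from SciPy; the numbers and the choice of test are purely illustrative and are not DoorDash's actual statistical methodology.

```python
# Hypothetical example: is the treatment click-through rate significantly higher?
# The data and the plain two-sample t-test are illustrative only.
from scipy import stats

control_ctr = [0.11, 0.12, 0.10, 0.13, 0.11]    # daily CTR, control bucket
treatment_ctr = [0.13, 0.14, 0.12, 0.15, 0.14]  # daily CTR, treatment bucket

t_stat, p_value = stats.ttest_ind(treatment_ctr, control_ctr)
if p_value < 0.05:
    print(f"Statistically significant lift (p={p_value:.3f}); ramp up the treatment.")
else:
    print(f"No significant movement (p={p_value:.3f}); hold or roll back.")
```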
So, let's say we are not seeing any positive movement in our company's key business metric: then we will decide not to ship that particular search algorithm and instead roll back the entire feature. This is how we make sure that any new feature we introduce into our product is actually moving our company's key business metrics in the right direction.
To give a quick overview of how experiments are run: an experiment is a multi-step process involving multiple stakeholders, such as product engineers, product managers, and data scientists. We have built an end-to-end experimentation platform that guides and streamlines the entire experimentation process.
In the first step, the product engineers implement the feature and configure the experiment, and then the experiment goes live. Once the experiment goes live, it starts generating data, and that data starts flowing into our Snowflake warehouse. Then the data scientists start analyzing the data using our experiment analysis platform, called Curie. Curie is our in-house experiment analysis platform that automates most of the data analysis process for experiments.
To start with, when a particular user opens the DoorDash app, they get exposed to an experiment, and an exposure log flows in real time through our streaming pipeline into a Snowflake table called experiment exposures. This log contains the information about which bucket a user got exposed to for a particular experiment. So, let's say for experiment A, the user with ID cx123 got exposed to the control bucket.
We will get a log that records this information, and it gets ingested into the experiment exposures table. Once we have this experiment context, the data scientists come to Curie, add the business metric definitions, and use Curie to start analyzing the data by joining the experiment exposures with the business metrics.
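As a rough sketch of that join, here is an illustrative Snowflake query that compares a delivery metric between buckets; the table and column names are assumptions made for illustration, not DoorDash's actual schema.

```python
# Illustrative exposures-to-metrics join; schema names are assumptions.
ANALYSIS_SQL = """
SELECT
    e.experiment_name,
    e.bucket,                               -- control vs. treatment
    AVG(m.num_deliveries) AS avg_deliveries
FROM experiment_exposures AS e
JOIN consumer_delivery_metrics AS m
  ON m.consumer_id = e.consumer_id
 AND m.event_date >= TO_DATE(e.first_exposure_time)
WHERE e.experiment_name = 'experiment_a'
GROUP BY e.experiment_name, e.bucket
"""

def fetch_bucket_aggregates(conn) -> list:
    # `conn` is any DB-API connection to the warehouse (e.g. the Snowflake connector).
    # Returns one row per bucket, ready for statistical comparison.
    with conn.cursor() as cur:
        cur.execute(ANALYSIS_SQL)
        return cur.fetchall()
```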
Now I'm going to dive into Curie: I'll give a quick bit of context about Curie and how we analyze experiments, and then jump directly into the project. So, how we started Curie: when we started to build our own in-house experimentation analysis platform, the primary goal was simply to standardize all of the experimentation analysis methodologies into a single platform.
A
So
for
metrics
we
started
pretty
much
easy.
We
adopted
a
bring
your
own
SQL
approach
and
allowed
users
to
bring
their
own
ad
hocs
equals
to
query
in
order
to
fix
the
required
metrics
data.
This
actually
worked
really
well
like
there
was
not.
There
was
no
friction
on
the
USS,
because
most
of
the
users
were
already
using
ad
hocs
equals
to
analyze
most
of
the
experiments,
so
it
increased
the
adoption
to
our
platform
and
people
started
using
our
standard
analysis
methodologies
by
just
bringing
in
an
ad
hoc
SQL
that
they're
already
using.
However, as we gathered more adoption, we started facing a lot of challenges with this ad hoc approach. The first one is standardization: we did not have a single source of truth for all the business metric definitions.
In addition, there is now a dependency on subject matter experts, because everyone needs to know the metric definitions, how to fetch a metric, and where the metric needs to be fetched from. They need to know this domain-specific information, which made it hard to analyze experiments.
The second one is scalability. Obviously we don't have any control over ad hoc SQL: users can just write their own queries, and sometimes they write expensive ones that we have no control over. The only lever we had was to add more machines and scale up the compute, so ad hoc SQL was extremely hard to scale.
The main reason for that is that we were doing a lot of redundant computation. Let's say a particular metric is being used by different experiments from different teams. We ended up recomputing those metrics from scratch again and again for each of those experiments, because we did not know what a particular metric definition was: the only option was to blindly execute the SQL provided by the users.
So we identified that the challenges we faced, mentioned in the previous slide, were primarily caused by the lack of standardization of metrics and the lack of centralization and scalability of metrics computation. To tackle these issues, we chose to build a metrics layer for experimentation, otherwise known as a semantic layer or metrics store. If you have been following the recent news about dbt, you have probably heard of the semantic layer.
To give a quick overview, a metrics layer is a centralized framework that translates the data in the warehouse, the tables and columns, into business metric definitions and dimensional definitions. Using our metrics layer, users build metrics as reusable data models through a declarative DSL framework, and they consume the same metric definitions from other platforms like Curie. Every metric creation goes through a standard approval process with the appropriate domain stakeholders.
So this is how our DSL looks today. When someone wants to create a metric, it's basically a two-step process: first onboard a source, and then define a metric on top of it. To dig a bit deeper into the source: a source here mimics a dataset or a table, defined by a SELECT SQL.
In this case, the SELECT statement is returning a measure called num_deliveries, and it's returning two identifiers called delivery_id and consumer_id. The measures, as I mentioned, are the basic quantifiable values which will later be aggregated into the metric, and the identifiers are basically the join keys used to join with other tables.
There is a lot of other metadata tracked in the source, which will be used in our Dagster jobs. Once the measure is created, we refer to that measure in the metric definition and just add an aggregation function on top of it. That's how we define a metric. In this case, let's say we are going to define the number of deliveries made on DoorDash.
We just use the measure that was defined in the source layer and add an aggregation function, like SUM, in order to be able to compute the number-of-deliveries metric. That's all the user has to do: they build these two YAML files, push them to GitHub, and the files go through a standard approval process. The metric definitions are then synced to our backend using our gRPC servers. Once these models are created, the metrics are automatically displayed in our experimentation platform.
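To make the two-step DSL concrete, here is a hedged sketch of what a source and a metric definition might look like. The field names and YAML shape are assumptions for illustration rather than DoorDash's actual schema; the YAML is embedded in Python only so the snippet can be loaded and inspected.

```python
# Illustrative source + metric definitions in the spirit of the DSL described above;
# field names and table names are assumptions, not the real schema.
import yaml  # pip install pyyaml

SOURCE_YAML = """
name: deliveries_source
sql: |
  SELECT
    delivery_id,            -- identifier (join key)
    consumer_id,            -- identifier (join key)
    1 AS num_deliveries     -- measure (quantifiable value)
  FROM deliveries
measures: [num_deliveries]
identifiers: [delivery_id, consumer_id]
"""

METRIC_YAML = """
name: total_deliveries
description: Number of deliveries made on DoorDash
source: deliveries_source
measure: num_deliveries
aggregation: SUM   # aggregation applied at analysis time
"""

source = yaml.safe_load(SOURCE_YAML)
metric = yaml.safe_load(METRIC_YAML)
print(f"{metric['name']} = {metric['aggregation']}({metric['measure']}) over {source['name']}")
```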
So all the user has to do is go to our experimentation platform, create a config for their experiment, and then select the metrics from the dropdown. That's all. If the metrics are already available in our platform, they can skip the entire metric onboarding process and go directly to the UI to select the metrics.
With this, we were able to improve standardization. We started seeing users use common metrics across different experiments. As a platform team, we built some standard collections of metrics and auto-configured them for various experiments. We also saw a lot of non-technical stakeholders start analyzing experiments, because now they don't have to depend on the SQL definitions provided by the data scientists.
We also basically revamped the entire metrics computation engine on top of Dagster.
As I stated earlier, the original scaling concern came from the redundant computation that we performed. Because a single metric is used by different experiments across different teams, we repeatedly computed those metrics from scratch, performing all the redundant table scans and joins over and over. That was the primary root cause of most of our scaling problems.
In order to avoid this redundant computation, we first started materializing the measures defined in each source incrementally. The idea is to use those materialized source measures to compute metrics for different experiments from different teams, so that we don't repeat those redundant joins and scans of huge tables again and again. How did we do it? With Dagster: for every source that a user creates, we dynamically build a Dagster job and a sensor.
So, as I mentioned, if you look here, every source will have a Dagster job and a sensor. The sensor automatically tracks the upstream dependencies that are mentioned in the source definition; every SQL has to read from certain tables, so the sensor that we generate takes care of monitoring the upstream jobs that materialize those tables. Once the upstream dependencies are satisfied, the sensor triggers the source job to materialize the measures.
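Here is a minimal sketch of that pattern, dynamically building one Dagster job and one sensor per source; the `upstream_is_fresh` helper, the naming scheme, and the job body are assumptions for illustration, not DoorDash's actual implementation.

```python
# Sketch: build one Dagster job + sensor per user-defined source.
from datetime import date
from dagster import RunRequest, SkipReason, job, op, sensor


def upstream_is_fresh(table: str) -> bool:
    """Hypothetical freshness check; in practice this would query warehouse metadata."""
    return True


def build_source_job(source_name: str, select_sql: str):
    @op(name=f"materialize_{source_name}")
    def materialize_measures(context):
        # In the real pipeline this runs the source's SELECT SQL incrementally
        # and writes the measures into a Snowflake table.
        context.log.info(f"Materializing measures for {source_name}")

    @job(name=f"{source_name}_materialization_job")
    def source_job():
        materialize_measures()

    return source_job


def build_source_sensor(source_name: str, upstream_tables, source_job):
    @sensor(name=f"{source_name}_sensor", job=source_job)
    def source_sensor(_context):
        # Fire the materialization job only once all upstream tables are updated.
        if all(upstream_is_fresh(t) for t in upstream_tables):
            yield RunRequest(run_key=f"{source_name}-{date.today()}")
        else:
            yield SkipReason("Upstream tables are not yet updated")

    return source_sensor
```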
We also dynamically build a Dagster job and a sensor for every experiment, very similar to how we do it for sources: if a user comes to our UI and creates an experiment config, a Dagster job is automatically created for that experiment in the background. Again, these sensors wait for the source assets to be materialized. If you remember, there is an inherent relationship between the source and the metric through the measures.
So let's say a particular user wants to analyze experiment A, and they want to analyze metric B. That metric would depend on a certain measure from another source, right? So the sensors wait for those source assets to be materialized before starting to analyze that particular experiment.
Once the source has been materialized, the sensor triggers the Dagster job. Within the job, we join the experiment exposures with the materialized measures and then perform the aggregation to run all the statistical analysis. Once we have the experiment results, the job pushes them into a Postgres database, from which they are served in our Curie UI for the users.
That is the overall end-to-end flow of how we analyze experiments using the materialized sources today. Next, I'm going to talk about some specific aspects of our platform and the pipeline, and dive deep into some special cases.
First, backfills. As mentioned, all of our materialization jobs are incremental and partitioned by date.
Dagster has first-class support for partition-based backfills, so it's quite easy for us to run backfills without having to do anything manually: all we do is go to the backfill UI, select a set of partitions, and start triggering backfills. We also trigger the same backfills programmatically.
We allow users to trigger these backfills using GraphQL directly from our experimentation platform, so that they don't have to go into the Dagster UI or find the corresponding jobs.
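As an example of what triggering a backfill over GraphQL can look like, here is a hedged sketch that posts Dagster's launchPartitionBackfill mutation with `requests`; the endpoint URL, repository names, and the exact mutation and parameter fields are assumptions that should be checked against your Dagster version's GraphQL schema.

```python
# Hedged sketch: trigger a partition backfill through Dagster's GraphQL API.
# Field names follow the launchPartitionBackfill mutation but may differ across
# Dagster versions; the URL and repository names are illustrative.
import requests

DAGSTER_GRAPHQL_URL = "http://dagster-webserver.internal/graphql"  # hypothetical

BACKFILL_MUTATION = """
mutation LaunchBackfill($backfillParams: LaunchBackfillParams!) {
  launchPartitionBackfill(backfillParams: $backfillParams) {
    __typename
    ... on LaunchBackfillSuccess { backfillId }
    ... on PythonError { message }
  }
}
"""

def trigger_backfill(partition_set_name: str, partition_names: list) -> dict:
    variables = {
        "backfillParams": {
            "selector": {
                "repositorySelector": {
                    "repositoryName": "experimentation_repository",        # illustrative
                    "repositoryLocationName": "experimentation_location",  # illustrative
                },
                "partitionSetName": partition_set_name,
            },
            "partitionNames": partition_names,
        }
    }
    resp = requests.post(
        DAGSTER_GRAPHQL_URL,
        json={"query": BACKFILL_MUTATION, "variables": variables},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```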
I just wanted to add one callout here: by default, Dagster creates one job run for each date partition, which can be quite expensive, particularly when you're building Snowflake-based SQL pipelines, because this can end up in numerous job runs, each triggering one SQL statement to backfill a single day's worth of data.
This was quite a big problem for us, and initially we chose not to use the backfill UI due to this limitation. However, with recent Dagster versions, we now have a feature where we can batch multiple dates into a single run. So when you run a backfill for a specific date range, I think it is better to batch multiple date partitions into a single run rather than creating one run per date partition, particularly when you are building SQL-based pipelines. When we moved from single-partition runs to batched backfills, we saw close to a 5 to 10x improvement in our backfill performance.
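For reference, here is a hedged sketch of that batching idea. Recent Dagster versions expose it for partitioned assets through BackfillPolicy.single_run(), so one run covers the whole selected date range; our own pipelines are described above in terms of jobs, so the asset shape, table reference, and commented-out query below are illustrative assumptions rather than our exact setup.

```python
# Illustrative sketch: batch a whole date range into a single backfill run.
from dagster import AssetExecutionContext, BackfillPolicy, DailyPartitionsDefinition, asset


@asset(
    partitions_def=DailyPartitionsDefinition(start_date="2023-01-01"),
    backfill_policy=BackfillPolicy.single_run(),  # one run covers the selected range
)
def source_measures(context: AssetExecutionContext) -> None:
    window = context.partition_time_window  # start/end of the backfilled range
    context.log.info(
        f"Backfilling measures from {window.start} to {window.end} in one Snowflake query"
    )
    # run_snowflake_query(  # hypothetical helper
    #     f"... WHERE active_date >= '{window.start}' AND active_date < '{window.end}'"
    # )
```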
One other thing is the look-back period. Most of our jobs are designed to be look-back aware. What do we mean by look-back? As I mentioned, most of our jobs are incremental; however, some of the upstream dependencies that our sources depend on are not purely incremental.
For example, we have some fact tables that can change the last 90 days of data in a daily pipeline. If we have a source that depends on those upstream dependencies, the user adds something called a look-back period, which specifies the number of days that could change in the upstream dependencies, say the last 10 days, and we basically use Dagster's partition support to rebuild the last N days' partitions on a daily basis.
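A minimal sketch of that look-back rebuild, assuming a daily-partitioned job and a schedule that re-requests the last N partitions every day; the partition start date, cron string, and job body are illustrative, and RunRequest with a partition_key requires a reasonably recent Dagster version.

```python
# Sketch: rebuild the last N daily partitions on a schedule (look-back period).
from datetime import timedelta
from dagster import DailyPartitionsDefinition, RunRequest, job, op, schedule

partitions = DailyPartitionsDefinition(start_date="2023-01-01")

LOOKBACK_DAYS = 10  # illustrative; in practice this comes from the source definition


@op
def materialize_source(context):
    context.log.info(f"Rebuilding partition {context.partition_key}")


@job(partitions_def=partitions)
def source_materialization_job():
    materialize_source()


@schedule(job=source_materialization_job, cron_schedule="0 8 * * *")
def lookback_schedule(context):
    # Re-request the last N daily partitions so late-arriving upstream changes
    # (e.g. fact tables that restate recent history) are picked up.
    today = context.scheduled_execution_time.date()
    for offset in range(1, LOOKBACK_DAYS + 1):
        key = (today - timedelta(days=offset)).strftime("%Y-%m-%d")
        yield RunRequest(run_key=f"lookback-{key}", partition_key=key)
```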
Dagster's partition support makes this very easy and provides a clear audit trail of when the asset partitions actually changed. We also designed our pipelines to be self-healing: if a job fails to run on certain days, the next pipeline run will automatically catch up and backfill the unprocessed data based on the last-updated timestamp in the table. These steps ensure that the data is always up to date and complete.
Dagster is completely abstracted away from the users. Our jobs are dynamically orchestrated based on the models available in the backend, so if a user creates a new model, it should automatically be reflected in the Dagster repository without needing manual intervention. To enable that, we use a workaround suggested by the Dagster team; thanks to Daniel specifically for this workaround.
We implement the repository data class and override the logic to fetch the jobs, sensors, and schedule definitions. While creating the repository, we start a background thread that periodically fetches and refreshes all these definitions. This made it possible for any new jobs to be created automatically without requiring a Dagster deployment.
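Here is a hedged sketch of that workaround. RepositoryData is part of Dagster's repository machinery, but the exact methods to override vary across versions, so the ones below and the backend fetch helper are assumptions meant to show the background-refresh idea rather than a drop-in implementation.

```python
# Hedged sketch of a dynamically refreshing Dagster repository.
import threading
import time
from dagster import RepositoryData, repository


def fetch_definitions_from_backend():
    """Hypothetical helper: pull source/metric models from the gRPC backend and
    turn them into Dagster job, sensor, and schedule definitions."""
    return {"jobs": [], "sensors": [], "schedules": []}


class MetricsRepositoryData(RepositoryData):
    def __init__(self, refresh_seconds: int = 300):
        self._definitions = fetch_definitions_from_backend()
        # Background thread keeps the in-memory definitions fresh, so new user
        # models show up without redeploying the Dagster code location.
        thread = threading.Thread(
            target=self._refresh_loop, args=(refresh_seconds,), daemon=True
        )
        thread.start()

    def _refresh_loop(self, refresh_seconds: int):
        while True:
            time.sleep(refresh_seconds)
            self._definitions = fetch_definitions_from_backend()

    # Method names below are assumptions; check your Dagster version's API.
    def get_all_jobs(self):
        return self._definitions["jobs"]

    def get_all_sensors(self):
        return self._definitions["sensors"]

    def get_all_schedules(self):
        return self._definitions["schedules"]


@repository
def experimentation_repository():
    return MetricsRepositoryData()
```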
Next, metrics and monitoring. As a platform team, since we are responsible for the entire orchestration, we measure various metrics to make sure that the platform is healthy, and we use Prometheus to record most of these metrics. Again, we use Dagster primitives to enable this: metrics are recorded from run status sensors, where we use the Dagster API to fetch the job metadata, such as job latency, queuing time, and error status, and then push it to Prometheus from within the run status sensors.
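A minimal sketch of that recording path, assuming a Prometheus push gateway; the gateway address, metric name, and labels are illustrative, and the run-stats fields should be checked against your Dagster version.

```python
# Sketch: record job latency from a run status sensor and push it to Prometheus.
from dagster import DagsterRunStatus, RunStatusSensorContext, run_status_sensor
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

PUSHGATEWAY = "prometheus-pushgateway:9091"  # hypothetical address


@run_status_sensor(run_status=DagsterRunStatus.SUCCESS)
def record_job_latency(context: RunStatusSensorContext):
    run = context.dagster_run
    stats = context.instance.get_run_stats(run.run_id)
    latency = (stats.end_time or 0) - (stats.start_time or 0)

    registry = CollectorRegistry()
    gauge = Gauge(
        "dagster_job_latency_seconds", "End-to-end job latency",
        ["job_name"], registry=registry,
    )
    gauge.labels(job_name=run.job_name).set(latency)
    push_to_gateway(PUSHGATEWAY, job="dagster_metrics", registry=registry)
```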
We measure various important metrics related to jobs, like counts, latencies, failure rates, and queuing time. By measuring the queuing time, we know when a lot of jobs are piled up in the queue, and then we can go and scale up the machines or change the concurrency limits. We also measure the number of lagging jobs, to see whether too many jobs are lagging because they missed a particular trigger.
We also have detailed alerts and notifications. Based on the Prometheus metrics that you saw in the slide, we configured multiple alerts to proactively identify and fix issues at the platform level. We have strict thresholds for job wait times, job queuing times, job latencies, and sensor latencies, so that we get alerted before there is a bigger issue.
We also heavily use Dagster's Slack integration, and we have configured multiple alerts using it, like job failure alerts that go directly to the source owners. We also have something called the Dagster daily report, which provides a bird's-eye view of all the jobs that ran on the previous day: the jobs that are consistently failing, the jobs that are taking too much time to complete, and the jobs that are lagging.
Lastly,
like
we
use
heavily
used
adaptive
graphql
endpoints,
to
integrate
with
most
of
our
internal
tools
here
in
the
snapshot,
you
can
see
how
a
screenshot
from
our
experimentation
platform,
we
heavily
use
the
dragster
UI
in
order
to
show
all
the
job
statistics
and
asset
metadata
directly
on
our
platform,
so
the
users
have
to
don't
have
to
jump
between
Dexter
and
the
platform.
We also use the Dagster GraphQL API to surface all this metadata in our internal data catalog tool. Some of the metadata included is the Snowflake query URL used for the job and the row count, and we also link the job run URL directly, so that if there are any issues, users can jump straight into the job run that was responsible for that operation.
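For completeness, here is a small hedged sketch of fetching run status over GraphQL with the official dagster-graphql client so an internal tool can render it next to the linked run; the host and port are illustrative.

```python
# Hedged sketch: look up a run's status via Dagster's GraphQL API for display
# next to the linked job run URL in an internal tool.
from dagster_graphql import DagsterGraphQLClient

client = DagsterGraphQLClient("dagster-webserver.internal", port_number=3000)


def run_status_for_ui(run_id: str) -> str:
    # Returns e.g. "SUCCESS", "FAILURE", or "STARTED".
    return client.get_run_status(run_id).value
```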
From the scalability perspective, the metrics computation framework resulted in a 10x improvement in average experiment analysis latency compared to the previous ad hoc SQL approach, enabling users to make decisions and ship more features quickly. The improvement in scalability also enabled us to work on some advanced features like automated analyses, and helped us improve the overall reliability of our experimentation results.
Currently, we have been able to standardize most of our experimentation metrics, but the semantic layer, or metrics layer, as most of you might know, has much broader applications across a lot of other data-driven use cases. We are working to replicate this success in other areas like business intelligence, exploratory analysis, forecasting, and so on, and we're currently trying to find new use cases beyond experimentation for the same business metrics.
This entire presentation is actually a condensed version of the blog post that I wrote last month. If anyone wants to know more about this, I would recommend giving the blog a read. I can share the link in the Zoom chat once I'm done with the presentation.
That's all I have for today. I'm open to any questions, but I'm not sure if we have enough time; otherwise I'll be hanging out in the Zoom so I can answer any of the questions.
B
We have time for a couple of questions. Thanks, Arun, that was super interesting. The first question is: do you run backfills automatically, by detecting changes in the code, or manually?
A
That's a great question. Currently we are doing it manually, but we do have plans to do it automatically. Right now, whenever someone makes any changes to the source definition, they run the backfill manually, although we have some automation: users don't have to go into Dagster to find their jobs and trigger them, because we use the Dagster GraphQL integration to do it automatically from Curie, from the experimentation platform.