Description
Marcos Alcozer, Analytics Engineer at Alcozer Consulting LLC, presented on how K-12 education leverages analytics engineering using Dagster.
🌟 Socials 🌟
Check out (and star!) our GitHub ➡️ https://github.com/dagster-io/dagster
Check out our Documentation ➡️ https://docs.dagster.io/
Join our Slack ➡️ http://dagster-slackin.herokuapp.com
Visit our Website ➡️ https://dagster.io/
Follow us on Twitter ➡️ https://twitter.com/dagsterio
The responsibilities of an analytics engineer, broadly speaking, also apply to those of us in K-12 education. However, we do have our own K-12-specific goals. We do this work because our goal is to provide a complete view of a student. There are many things we want to know about students to ensure that they are on a path to success.
We want to know who they are. We want to know their grade level, their race, their ethnicity, and whether they have a learning disability, and we want to know things like their attendance and their grades. So let's look at the challenge we face with data in K-12: our data is disparate, spread across many different systems, and that may sound familiar and similar to your own environment. Student data, attendance, grades, assessments, and so on are housed in various systems, and unfortunately, Fivetran and Stitch just don't have connectors for us.
Sometimes we have to log into a vendor's system and download an extract of our student data, and sometimes we don't get anything at all and have to engage in conversations with that vendor to see what can be put on their roadmap. Now, once we get the data out, similar data can be represented very differently. In the left column we have the student last name. We might call that "student last name", but other vendors may call it "last name".
Some may call it "family name", and others "surname". So that's just a look at how disparate the data can be in terms of location, access, and what it actually looks like. At the end of the day, people are our customers: our educators, our school leaders, our students, our families. They just want the insight for their specific education organization: think school district, think state, think charter management organization.
They just want to know: what is their student attrition? How many students have left since the start of the school year? How many students are chronically absent, and who are they? Which students have a learning disability, or an IEP as we call it? And when you look at the data, how can you cut it by different dimensions, looking at things like attendance by race and ethnicity, or by gender, et cetera? So let's look at how we're solving that in K-12.
So the solution is being built through a community led by the Ed-Fi Alliance. It's a 501(c)(3) nonprofit, and the Ed-Fi Alliance has committed to publishing all of its technology open source under an Apache license.
The community works on a set of rules for collecting and organizing student data, and those rules manifest themselves in the Ed-Fi data standard. The Ed-Fi data standard is a set of RESTful APIs, so the screenshot on the right may look familiar to you: it's just a Swagger document. But the power is that all of the endpoints have been decided by the community, and then we can go to the vendors and say, this is how we collect student data.
We use these endpoints, the endpoints have these payloads, and they use this nomenclature. It sets the tone for the conversation, so we are all talking similarly and our conversations themselves can be interoperable.
So let's look at this diagram; here is where the collection of disparate operational data gets really cool. Through either state mandates, philanthropic funding, or just good intentions, the vendors create an Ed-Fi API client that takes their operational data and sends it back to customers via the customer's Ed-Fi API. Said another way, it is the responsibility of the vendor to submit data back to the education organization.
The education organization does not pull data from the source systems; they stand up their own Ed-Fi API that follows the data standard, and then they receive the data back. We have had a lot of success with states creating mandates. For example, in the states of Texas, Arizona, and Wisconsin, you can only do business in the state if you maintain an Ed-Fi integration.
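As a rough sketch of what that vendor-side responsibility looks like: the client maps its internal records into the Ed-Fi resource shape and POSTs them to the district's API. The `lastSurname` field and the `/data/v3/ed-fi/` resource path follow the Ed-Fi data standard, but treat the internal field names (`sis_id`, etc.) and the base URL here as illustrative assumptions, not the talk's actual code.

```python
# Hypothetical vendor-side mapping to the Ed-Fi /ed-fi/students resource.
# Internal field names (sis_id, first_name, ...) are made up for illustration.

def to_edfi_student(record: dict) -> dict:
    """Map a vendor's internal student record to an Ed-Fi students payload."""
    return {
        "studentUniqueId": record["sis_id"],
        "firstName": record["first_name"],
        "lastSurname": record["last_name"],  # Ed-Fi's name for "last name"
        "birthDate": record["dob"],          # ISO 8601 date string
    }


def edfi_resource_url(base_url: str, resource: str) -> str:
    """Build the resource URL on the education organization's Ed-Fi API."""
    return f"{base_url.rstrip('/')}/data/v3/ed-fi/{resource}"


# A real client would obtain an OAuth2 token (client-credentials grant)
# and POST the payload, e.g. with requests:
#   requests.post(edfi_resource_url(base, "students"),
#                 json=payload, headers={"Authorization": f"Bearer {token}"})
```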
So Ed-Fi eases the process for education organizations to receive their operational data via a common data standard and model. This solves the operational piece but does not touch on analytics, and that is where Dagster comes in, once an education organization has their disparate operational data in a common operational data store, or ODS.
So here's that Swagger document I shared, and the Dagster graph today looks a little something like this. It's your traditional ELT job, where you've got this central extract and load, but I've also got to run some ops at the top, because the Ed-Fi API has this idea of change versions, which allows me to pull deltas. I can pull all data for an education organization on the first run, and then on subsequent runs I can pull just what has been changed or deleted.
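The first-run-versus-delta decision can be sketched as a small helper. Ed-Fi change queries use `minChangeVersion`/`maxChangeVersion` query parameters; the function below is a minimal illustration of that logic, not the talk's actual op code.

```python
# Sketch of change-version logic: a full pull on the first run,
# delta pulls (changed or deleted records only) afterwards.
from typing import Optional


def build_query_params(previous_version: Optional[int], newest_version: int) -> dict:
    """Return Ed-Fi change-query params for a full or delta pull."""
    if previous_version is None:
        # First run: no lower bound, pull everything up to the newest version.
        return {"maxChangeVersion": newest_version}
    # Subsequent runs: only records changed since the last stored version.
    return {
        "minChangeVersion": previous_version + 1,
        "maxChangeVersion": newest_version,
    }
```

An op at the top of the graph would look up `newest_version` from the API's change-version endpoint and `previous_version` from the previous run's stored state.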
So I run those ops first and then I move into the extract and load. This is all one single op today because, as I'm fetching from the API, I'm using a generator to yield results back and upload to the data lake, reducing the memory footprint. It's definitely a to-do on my side to split that out if I can, and I will likely reach out to all of you on Slack for some thought partnership at some point. The final op then runs the Ed-Fi models.
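The generator pattern described here can be sketched as follows; `fetch_pages` stands in for the real paged API client and the upload call is stubbed out, so this is an illustration of the memory-footprint idea rather than the actual op.

```python
# Streaming extract-and-load: yield one API page at a time and upload
# each page as it arrives, so only one page is ever held in memory.
from typing import Iterator


def fetch_pages(resource: str, page_size: int = 2) -> Iterator[list]:
    """Stand-in for paged Ed-Fi API calls; yields one page of records at a time."""
    records = [{"id": i} for i in range(5)]  # pretend API data
    for start in range(0, len(records), page_size):
        yield records[start:start + page_size]


def extract_and_load(resource: str) -> int:
    """Stream pages straight to the lake instead of accumulating them."""
    total = 0
    for page in fetch_pages(resource):
        # upload_to_lake(resource, page) would write raw JSON to GCS here.
        total += len(page)
    return total
```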
Now, something that Sandy said earlier in this meeting really caught my ear: being able to mix assets and ops. That's going to be super powerful, because in this example I'm running a bunch of ops to produce an initial asset, which then produces additional assets via dbt tooling. If you haven't checked out software-defined assets, I really recommend it, because this graph is hiding the dbt models, those assets, and this is what software-defined assets allow you to do.
This is what that data lake looks like. I'm using Hive partitioning inside of Google Cloud Storage so I can have my various API resources, and so my data is segmented by school year; if you've used Hive partitioning, you know this equals-sign convention. The Dagster jobs always just extract and load into the data lake: nothing gets deleted, everything gets preserved, and everything is put there as raw JSON.
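The layout described above can be sketched as a path builder; the bucket and file names are illustrative, but the `key=value` segment is the Hive-partitioning convention that BigQuery external tables understand.

```python
# Hive-style partition layout for the GCS data lake: one prefix per
# API resource, with data segmented by school year via key=value.

def lake_path(bucket: str, resource: str, school_year: int, filename: str) -> str:
    """Build a Hive-partitioned object path for one raw JSON file."""
    return f"gs://{bucket}/{resource}/school_year={school_year}/{filename}"
```

For example, `lake_path("edfi-sandbox", "students", 2023, "part-0001.json")` yields `gs://edfi-sandbox/students/school_year=2023/part-0001.json`.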
Then, in BigQuery, I'm able to have tables that are external tables. If I look under Details, it's an external table pointing to my sandbox environment: the bucket that has my raw JSON, using Hive partitioning. So I can build dbt models that use SQL to query this external table, which is really just my portal into my data lake.
From there I can create my various data marts, things like a grades fact table, my dims, and what have you. So that's just a little look at that piece. A few other things I wanted to mention: this work can get pretty complex, so let's talk about community and see how that's taking place.
As a K-12 analytics engineer, I maintain open source repositories that implement everything you've seen here. You can access them at k12analyticsengineering.dev.
Anyone can learn how to do this, do it themselves, and have the community for support; again, it's all free and publicly accessible. One more thing before we move into questions: there are times when a vendor does not build an Ed-Fi API client to send data back to an education organization. There, we leverage Dagster as well.
For example, I maintain an open source repo that extracts data from the Google Forms API and submits that data to the Ed-Fi API via its survey endpoints. That's another part of this: when the work gets even more complex, we look to the community to solve things together and to share those resources so that we can all benefit, because we all have the same goal, which is to bring that data together to provide insights back to our educators, so that they know where and how to help students.
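The kind of reshaping such a bridge performs can be sketched as below. The input follows the general shape of a Google Forms API response, but the output field names are illustrative, not the exact Ed-Fi survey domain schema, and the real repo's code will differ.

```python
# Hypothetical Google Forms -> Ed-Fi bridge step: reshape one form
# response into a survey-response-style payload for the Ed-Fi API.
# Output field names are illustrative, not the exact Ed-Fi schema.

def to_survey_response(form_id: str, response: dict) -> dict:
    """Reshape one Google Forms API response into an Ed-Fi-style payload."""
    return {
        "surveyIdentifier": form_id,
        "surveyResponseIdentifier": response["responseId"],
        "responseDate": response["createTime"][:10],  # keep the date portion
        "answers": [
            {"questionId": qid, "textResponse": ans}
            for qid, ans in response["answers"].items()
        ],
    }
```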