From YouTube: Python Ingestion Framework Tour : Feb 19 2021
Description
Harshal Sheth describes the new python-based ingestion framework in DataHub, including support for Airflow.
Recorded at: DataHub Community Meeting: Feb 19, 2021
A: ...the scientist community, and we're using a lot of open source in our tools. Of course, as the AWS service teams that are building products, we always try to use the native AWS analytics services first, if there is a service that solves the problem, but if not, we look at third party or open source. One thing that people do not know is that some of our customers are Amazon themselves: Amazon has different business units, from Alexa to Amazon Go, and we actually help them as well, because they are using AWS. I joined Amazon in 2013, when Amazon Web Services had 15 services; today it's almost 200, so the platform has evolved over the years. I've been in data analytics for probably most of my career. My background is distributed computing, so I'm a big fan of what LinkedIn did with Kafka and other projects. I know Jay and Neha from a past life, before Confluent, so I'm always happy to join back for these meetups. Now, in my role, I think there is a tight correlation between the things we can do together on open source, so I'm happy to help, and again, I'm not selling any AWS services here; I'm just sharing knowledge and learning from you as well. So, nice to meet you. Cool, thanks.
B: Awesome.

C: Yep, so for the past couple of weeks I've been working on a new Python ingestion framework for DataHub. First off: why did we do this? The status quo was that the Python ingestion framework we had previously was really just a set of scripts, and people were already using that to ingest metadata into DataHub.
C: But there were a couple of shortcomings there. Specifically, it was hard to ingest via both Kafka and the REST API: if you wanted instantaneous feedback, you'd want the REST API, but using that with those scripts wasn't possible. Another thing was that we had these opaque JSON blobs that you would ingest into DataHub, based on Avro, which is a serialization format similar to Protobuf. The issue is that it's pretty difficult to use; there are a lot of sharp edges. People run into issues not knowing what the schema is or how to format their data, and so we get a lot of questions about what is even possible with it.
C: Specifically, not having type annotations around any of that was something a lot of people struggled with: they'd run into a bunch of runtime errors when they tried to execute their code, without any prior warning. The other thing was that, in order to configure your metadata ingestion, you needed to go and modify a bunch of code, which was not ideal. Ideally, you just modify some configuration and the code remains the same.
C: What we found is that people wanted to stick with Python because of the ecosystem around it: all the open source projects, things like Airflow and NumPy and TensorFlow and so forth. They were used to this, they wanted to continue using it to ingest data into DataHub, and we wanted a principled way to make that happen.
C: With Kafka you write and then forget: you don't necessarily know that an event was processed correctly until you receive the audit event or the failed metadata event back, whereas with the REST API it's a little more instantaneous. So for different use cases you'd want to use different things.
C: We wanted to enable that. We were inspired by Apache Gobblin for the architecture of this, and I'll go into a little more detail as to what that means. The final thing we made sure to do when architecting this was file-based configuration: you write your configuration in a YAML or a TOML file, and then you can just run the DataHub ingestion framework against that config.
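A recipe-driven run along these lines might look roughly like the sketch below, with the config written here as a Python dict instead of a YAML file. The Pipeline import path, the source/sink type names, and the config keys are assumptions about the framework, not details quoted from the talk.

```python
# Minimal sketch of a recipe-driven ingestion run (names assumed, not verbatim).
from datahub.ingestion.run.pipeline import Pipeline  # assumed import path

# The same structure you would put into a YAML/TOML recipe file.
recipe = {
    "source": {
        "type": "file",  # replay metadata change events from a local file
        "config": {"filename": "path/to/sample_mce.json"},  # placeholder path
    },
    "sink": {
        "type": "datahub-rest",  # or "datahub-kafka", "console", "file"
        "config": {"server": "http://localhost:8080"},
    },
}

pipeline = Pipeline.create(recipe)
pipeline.run()
pipeline.pretty_print_summary()  # assumed helper for the end-of-run summary
```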
C: So how did we architect this? As I mentioned, it was inspired by Apache Gobblin. Right now we have two main abstractions: one is the source and the other is the sink. Sinks are the pieces that take an event of some sort and write it into DataHub, and that can happen over Kafka, over REST, or, for debugging purposes, you can just dump it to the console or write it to a file.
C: The metadata change event is a central concept in DataHub: every single change in metadata is modeled as a metadata change event, and you can do basically all of the operations you want by emitting a number of MCEs, or metadata change events. And then, finally, we have the sources. These are wide and varied: everything from databases, to a file, to even ingesting the metadata of Kafka itself. As long as the source can create a metadata change event, it can plug into the framework.
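To make the metadata change event idea concrete, here is a rough sketch of what a single MCE for one dataset might look like when built with the generated classes. The class names, field names, and URN format below are assumptions based on the Avro-driven codegen discussed later in the talk.

```python
# Sketch only: one MetadataChangeEvent describing a single dataset (names assumed).
from datahub.metadata.schema_classes import (  # assumed import path
    DatasetPropertiesClass,
    DatasetSnapshotClass,
    MetadataChangeEventClass,
)

mce = MetadataChangeEventClass(
    proposedSnapshot=DatasetSnapshotClass(
        urn="urn:li:dataset:(urn:li:dataPlatform:mysql,coursedb.students,PROD)",  # illustrative URN
        aspects=[
            DatasetPropertiesClass(description="Students table from the course selection tool"),
        ],
    )
)
# Any sink (Kafka, REST, console, file) can then take this event and write it into DataHub.
```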
C: I'll talk a little bit more about what it actually takes to add a source as well.
C: As with all live demos, I'm going to take a stab at it, but things go wrong in live demos, so bear with me. Hopefully you can all still see my screen here.
C: All of this lives in the metadata-ingestion directory. For the easiest way to install it, we build the schemas from the rest of DataHub and then have a relatively simple set of commands; you can just copy and paste them into your terminal and get set up immediately.
C: The first thing you might want to do is ingest some sample data, and the way to do that is "datahub ingest"; we've included a number of examples.
C: So if we run this, we'll see... oops, that's not supposed to happen. As with all demos, bear with me one sec; I'll worry about debugging that afterwards. Theoretically it will work, so we'll fix that later.
C: What you'll see is that you go to DataHub, go under Datasets, and we'll see some entries here; we'll fix the ingest. The other thing that I did: I used to be a student at Yale, and I ran a course selection tool there backed by a MySQL database, so I actually ingested this real-world database into DataHub.
C: This is the configuration. There's a username and password above; scrolling down, we have the host and port, and then we have a filter rule, so you can filter out certain MySQL tables and allow the rest in. Initially I just printed the output to the console, but we can actually write these to DataHub, over Kafka this time, let's say. So... oops, I've got to change the recipe to run, and then we can run it.
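The recipe being shown probably looks something like the sketch below, again written as the Python-dict form of the YAML file. The option names such as host_port and table_pattern, and the datahub-kafka sink config, are assumptions about the config schema, not values read off the screen.

```python
# Rough sketch of a MySQL -> DataHub-over-Kafka recipe (option names assumed).
mysql_recipe = {
    "source": {
        "type": "mysql",
        "config": {
            "username": "datahub_reader",  # placeholder credentials
            "password": "...",
            "host_port": "localhost:3306",
            # Filter rule: deny MySQL-internal tables, allow everything else.
            "table_pattern": {"deny": ["mysql\\..*"], "allow": [".*"]},
        },
    },
    "sink": {
        "type": "datahub-kafka",  # was "console" while debugging
        "config": {"connection": {"bootstrap": "localhost:9092"}},
    },
}
# Run it the same way as before: Pipeline.create(mysql_recipe).run()
```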
C: It will configure all of the ingestion, then it will go through and ingest a number of tables, and then you get a nice summary of what happened; here it filtered out all of the MySQL internal tables.
C: Here are all the tables that it actually fetched. If we go into DataHub, we can take a look at, let's say, the students table, and we get the full schema plus a bunch of other information related to the tables we just ingested. So that's the usage side of things. Let's talk a little bit about what it takes to add a source, starting with the simplest source that we have.
C: These are the three operations that a source needs to support, and if it supports those three, then we can integrate it into the rest of the framework. As an example, here's a very simple source that just reads from a file.
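In rough terms, a minimal file-backed source might look something like the sketch below. The method names (get_workunits, get_report, close) and the report helper are assumptions about the interface being shown, not a verbatim copy of the code on screen.

```python
# Sketch of a minimal source that replays events from a local JSON file.
# Method names and helper types are assumptions; consult the metadata-ingestion
# source code for the real interface.
import json
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class SourceReport:
    workunits_produced: int = 0
    failures: List[str] = field(default_factory=list)


class FileSource:
    """Reads serialized metadata change events from a file and hands them to the framework."""

    def __init__(self, filename: str):
        self.filename = filename
        self.report = SourceReport()

    def get_workunits(self) -> Iterable[dict]:
        # Each object in the file becomes one unit of work (one MCE) for the sink.
        with open(self.filename) as f:
            for obj in json.load(f):
                self.report.workunits_produced += 1
                yield obj

    def get_report(self) -> SourceReport:
        return self.report

    def close(self) -> None:
        pass  # nothing to clean up for a plain file
```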
C: It reads the file, constructs metadata change events out of it using the codegen classes, and then sends them into the rest of the framework; that's how simple it is to add a source. For a slightly more complex one, let's take a look at the source for Kafka. This ingests metadata about the topics, partitions, and so forth in Kafka and sends it into DataHub. Here it's a little bit more involved: you connect, you construct your consumer, and then it's a similar thing.
C: You use things like aspects and snapshots and all of these concepts that we're relatively familiar with; as long as you can do that, the source plugs into the rest of the framework accordingly. All right, the last thing I wanted to touch upon is where we are headed with this.
B: Can I ask a question? Can you walk us through the code gen that you're doing from the Avro schemas? I thought that was one of the hard things about this project.

C: Yep, sure.
C: Let's take a look here. The .avsc file, which is the Avro schema file, looks like a big JSON blob, and it has all of the fields and everything. Accordingly, we have a codegen system that I took from an open source project and modified, and I'm also working to contribute that back. With the codegen system, let's say this is the ML properties class; actually, for this, let's use the dataset one. When you construct it, the type annotations are automatically there and everything associates correctly, and this way, if you run mypy on this code, it completely checks every single assignment, every single constructor, all of that, so you know for sure that the code is at least semantically correct in terms of types.
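As a rough illustration of what the generated, type-annotated classes buy you, the sketch below shows the kind of mistake mypy can catch statically. The class and field names follow the generated schema_classes naming and are assumptions, not the exact code shown on screen.

```python
# Sketch: mypy checks assignments and constructors against the generated classes.
from datahub.metadata.schema_classes import DatasetPropertiesClass  # assumed import path

# Fine: description is a string, customProperties maps strings to strings.
props = DatasetPropertiesClass(
    description="Course enrollment records",
    customProperties={"owner_team": "registrar"},
)

# mypy flags this before the code ever runs: an int is not a valid description.
bad_props = DatasetPropertiesClass(description=42)
```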
C: So that's a little bit on the code gen, and obviously it generates a truly massive file of 5,000 lines, so I'm glad we aren't writing this by hand and updating it. Did that answer your question?
B: Thank you, yeah.
C: So let's talk a little bit about where we're headed with ingestion. We're going to run a more formal RFC process on this relatively soon, and then hopefully also publish a package to PyPI, so you can pip install, say, datahub or something along those lines and start executing this, without needing to do the codegen yourself and all of those other steps.
C
The
other
things
are
more
functional
improvements
so,
for
example,
detecting
when
metadata
stale.
So
if
you've
deleted
a
a
table
in
your
source
in
let's
say
mysql,
we
want
to
be
able
to
detect
that
it
disappeared
and
perform
the
according
the
the
associated
deletion
in
data
hub
right.
Now,
it's
a
purely
like
additive
process
and
we
can
do
updates,
but
the
the
deletes
are
not
something
that
we
get
support.
C
Another
thing
is
validating
that
it
was
actually
ingested
correctly.
So
you
know,
if
you
run
with
kafka
and
you
ingest
via
that,
you
just
send
you
know,
let's
say
30
events
to
to
a
kafka
topic
or
to
the
broker.
You
don't
necessarily
know
that
those
all
got
accepted
correctly
and
so
doing
that
validation
step
would
also
be
very
helpful
and
then
on
the
java
ingestion
side.
So
we
have
python
and
we
also
have
java
we're.
C
So
if
people
really
love
the
the
python
or
we'll
invest
more
on
that,
if
people
still
want
to
be
able
to
use
the
the
java
we'll
we'll
do
that
accordingly
as
well
and
then
the
last
thing
is
standardizing
a
testing
harness
between
the
two
so
that
we
can
test
functional
parity
between
the
the
java
and
the
python.
D: I have one, Harshal. I was hoping you could talk a little bit about how someone would productionalize this, say in Airflow or something.
C: Yeah, sure. We include a couple of sample DAGs in our repo, so that you can use this directly within Airflow. It's actually quite simple: let's say for MySQL, you just create a pipeline, give it a configuration, and call run. As long as you can do that within a Python operator in Airflow, you can run this, and you'll get all the standard error reporting out of Airflow, as you would expect.
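A minimal sketch of such a DAG is shown below, reusing the Pipeline API from earlier inside a standard PythonOperator. The import paths, recipe contents, and schedule are assumptions and placeholders, not the sample DAG from the repo.

```python
# Sketch: running the DataHub ingestion pipeline from an Airflow DAG.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from datahub.ingestion.run.pipeline import Pipeline  # assumed import path


def ingest_mysql_metadata():
    # Same recipe structure as before; swap in real connection details.
    pipeline = Pipeline.create(
        {
            "source": {"type": "mysql", "config": {"host_port": "localhost:3306",
                                                   "username": "datahub_reader", "password": "..."}},
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()  # surface failures so Airflow marks the task as failed


with DAG(
    dag_id="datahub_mysql_ingestion",
    start_date=datetime(2021, 2, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="ingest_mysql", python_callable=ingest_mysql_metadata)
```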
C: So it is production-ready already, and you can use it however you like. I do think it makes sense to run it within an orchestration framework like Airflow, or even cron if you want something simple, so that you continually get updated metadata from a given source instead of just a one-time import.
B: Yeah, this is cool, because I know that DataHub has this positioning almost as push-based, and so a lot of people think, oh, it's push-based, so I cannot pull metadata into this system. I think this shows people how easy it is, once you have a push-based system, to add a pull-based system upstream of it to essentially pull metadata into your system. You don't really have to choose between the two; you can do both if you want. All right, so I had just a few very small logistical things to go over very quickly.