DataHub Product Demos, 5 Jan 2022

From YouTube: Sneak Peek!! UI-based Metadata Ingestion

Description

John Joyce (Acryl Data) gives a preview of upcoming functionality to ingest metadata into DataHub via the UI!

Learn more about DataHub: https://datahubproject.io
Join us on Slack: http://slack.datahubproject.io
Follow us on Twitter: https://twitter.com/datahubproject

A

Today, I'm going to talk a little bit about a feature we've been working on in the past month. It's called UI-based ingestion, and I'm just going to get right into it, first with a recap of the current state of the metadata ingestion framework. To ingest batch metadata into DataHub, as most of you are probably aware, it's kind of a three-step process.

A

The first step is you install the DataHub CLI from pip. The second step is you define a YAML source recipe, where you say which source you want to pull from and the configurations about that source: how to connect to it, and which tables and which databases to pull out of it. And then finally, you run this datahub ingest command in the CLI.
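The three steps described above can be sketched in Python. This is a minimal illustration, not the framework's own code: the helper names `build_mysql_recipe` and `ingest_command` are hypothetical, the config keys follow DataHub's MySQL source and REST sink docs as best recalled and should be checked against the current documentation, and the host, port, and server URL are placeholder values. The recipe is printed as JSON only to keep the sketch dependency-free; a real recipe would normally be written as YAML.

```python
import json
import shlex


def build_mysql_recipe(host_port, database, username, password, gms_server):
    """Return a recipe as a plain dict: one source to pull from, one sink to push to."""
    return {
        "source": {
            "type": "mysql",
            "config": {
                "host_port": host_port,  # how to connect to the source
                "database": database,    # which database to pull from
                "username": username,
                "password": password,
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": gms_server},  # where the metadata is sent
        },
    }


def ingest_command(recipe_path):
    """Step 3: the CLI invocation that runs a recipe file."""
    return ["datahub", "ingest", "-c", recipe_path]


# Step 1 (outside this script) would be: pip install acryl-datahub
# Step 2: define the recipe. All values below are placeholders.
recipe = build_mysql_recipe(
    host_port="localhost:3306",
    database="datahub",
    username="datahub",
    password="datahub",
    gms_server="http://localhost:8080",
)
recipe_text = json.dumps(recipe, indent=2)
print(recipe_text)

# Step 3: show the command that would be run against the saved recipe.
print(shlex.join(ingest_command("recipe.yml")))
```

Keeping the recipe as plain data is the point of the design: the same structure can be produced by hand, by a script, or by a UI.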

A

So on the right side here, you can see that process in a picture: we have the config, and then we run that datahub ingest command. This is all well and good. It actually has a lot of good things about it, and some not-as-good things. So I'm just going to talk about what we view as the pros and cons of the current metadata ingestion framework, and batch metadata ingestion in particular. You know, it's simple.

A

We've tried to design the framework in a way that has clean abstractions and clear separation of concerns. It's easy to extend: you can add new systems pretty easily on your own, and you can add custom systems to pull from. It's scalable, right?

A

So by virtue of having distributed producers of metadata, you can actually scale across environments. This is particularly useful in the really complex cloud setups that a lot of companies have today, or even in cases where larger companies have subsidiaries and actually want to push metadata from multiple disparate places, as opposed to having one centralized source.

A

Now we'll talk about the downsides of how we're doing this today. The major one is really just accessibility.

A

The framework is really intended to be consumed by a technical audience, right: developers, sysadmins, people who are comfortable going into the command line. But there's also operability, right? So once you've run that datahub ingest command

A

for the first time, it's great: you have metadata in DataHub. But it's not always clear how you actually productionize that and schedule it on a recurring cadence, which is what you're most likely going to want to do. So typically people end up using a third-party scheduler, something like Airflow or Prefect.

A

So what we've done over the last month is really just take a deeper look at the cons here, to see if we can help address them in any way. I think we came up with a couple of goals: to increase the accessibility, obviously, and to reduce the operational overhead associated with operating the current metadata ingestion framework. The approach we took was kind of twofold.

A

We wanted to simplify ingestion such that users don't need to code to ingest, and ingestion should take less than five minutes after you've set up DataHub: pretty much quickstart, then ingest some data within five minutes. But we also wanted to keep the good things about the metadata ingestion framework that I called out, and continue to build on top of it as a basic building block, as opposed to reinventing something completely new. If it ain't broke, don't fix it. So what we came up with is a way to do

A

UI-based metadata ingestion. Specifically, I'll talk through the feature set that we had in mind when we were first brainstorming this. That was a recipe builder in the UI: a way to configure ingestion using a pre-built template, or write your own custom recipe, in the UI. On-demand execution: being able to run your recipes in a single click once you've actually defined them.

A

Cron scheduling: a way to actually schedule your recipe to be run on a continuous cadence, addressing that productionization or operationalization issue.

A

A secret store: a way to create, store, and embed encrypted secrets, in DataHub itself, that are injected at ingestion time. And real-time monitoring: a way to actually monitor your pipelines as they run in real time. So now I'm going to take a step back and do a quick demo of what we came up with in the last month. A disclaimer: this was working this morning, so I'm really hoping it's going to work here. So we have a fresh instance of DataHub.

A

This is what you'd find if you ran quickstart and went to localhost:9002. In this case, I have a local instance going. You're going to see immediately that we have a new tab up here called Ingestion, so I'm just going to go ahead and click on that. Right now we don't have anything, it's pretty bare, but I'm going to walk through the process of setting up UI-based ingestion through this portal here. I'm going to start by clicking Create new source, and what you'll see is, you know,

A

we've got a bunch of these pre-built templates for different sources that are supported by DataHub. What I'm going to try to do is actually connect to the DataHub MySQL instance itself. It's kind of meta, we're doing a recursive thing here, but let's see how it goes.

A

So I'm going to click on MySQL, and what you're going to see here is an in-app recipe builder (I can come up here and click this to expand it a little bit) that has all of the configurations for the MySQL source already filled in. If I have any questions about these, I can of course go here and check out the MySQL source docs to see which configs are available, but I'm pretty familiar with it.

A

So I'm just going to go ahead and start filling it out. I'm going to be connecting to the datahub database, and my username and password are both just datahub. Then finally, I'm just going to point this at GMS directly. All right. So the next step, now that I've defined my recipe, is actually scheduling that on a particular cadence. What I'm going to do is schedule it for two minutes from now, so I'm going to say 46.

A

This is a cron scheduler, of course. I'm going to say every day.

A

What you're going to see is that this is going to run at 9:46 a.m., and I want it to run in America/Los_Angeles time, because that's where I live. Of course, you can use UTC or something else. Then I'm just going to give it a name: I'm going to say MySQL source 1.
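The schedule chosen in the demo (every day at 9:46 a.m., America/Los_Angeles) can be written as a standard five-field cron expression. The sketch below is an illustration using only the Python standard library, not DataHub's scheduler: the helper names `daily_cron` and `next_daily_run` are hypothetical, and the example date is chosen only for illustration.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo


def daily_cron(hour, minute):
    """Five-field cron expression: minute hour day-of-month month day-of-week."""
    return f"{minute} {hour} * * *"


def next_daily_run(now, hour, minute, tz_name):
    """Next wall-clock occurrence of hour:minute in the given time zone."""
    tz = ZoneInfo(tz_name)
    local_now = now.astimezone(tz)
    candidate = local_now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= local_now:
        # Today's slot has already passed, so the run moves to tomorrow.
        candidate += timedelta(days=1)
    return candidate


la = ZoneInfo("America/Los_Angeles")
# Two minutes before the scheduled time, as in the demo (date is illustrative).
now = datetime(2022, 1, 5, 9, 44, tzinfo=la)

cron = daily_cron(9, 46)
next_run = next_daily_run(now, 9, 46, "America/Los_Angeles")

print(cron)                                        # 46 9 * * *
print(next_run.isoformat())                        # local time of the next run
print(next_run.astimezone(ZoneInfo("UTC")).isoformat())  # same instant in UTC
```

The time zone matters because the cron fields only describe a wall-clock time; the same expression fires at a different instant depending on which zone it is evaluated in.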

A

Done, and now we have a new ingestion source. If you hover over here, it'll show you that it runs at 9:46, but you'll also notice that I have the ability to just go ahead and execute it. So I'm going to go over here and execute the source with the new recipe that I've just defined, and what you'll see is the source starts running.

A

So I can click this to see all of the executions of this source, and I can see that it's running. We'll just give it a couple of seconds here to see if it finishes.

A

Okay, so what I have over here is a window. Let's see. Okay, great, so I can see that after 44 seconds the pipeline ran. I can click it and see some ingestion output. I can see that we connected to MySQL, we were able to run it, and we extracted a bunch of information.

A

So now, if I go back home, the expectation is that I'll have some data. Yep. All right. Okay, maybe the indices haven't been fully updated yet, but let's go ahead and try to get in here. Yep, so I've got my MySQL database loaded in, which is great. Actually, if you come back here, you're going to notice that it's running again, and that's because it's hit that scheduled 9:46 run. So you can see there are two ways to execute a source.

A

You can execute it manually, or you can execute it on the schedule.

A

Refresh it. Yep, so that also succeeded. You can go ahead and check that out as well. Now, I'm just going to show you quickly what it looks like to define secrets. That's one of the other bullet points I wanted to call out there.

A

You know, in many cases you don't really want to put your sensitive information into this file directly, because the file itself isn't necessarily encrypted. So what we've come up with is a system within DataHub to define secrets. What I'm going to do is come to the Secrets tab and click Create new secret. This is a place where I can create a named value that will be encrypted by DataHub and then resolved at ingestion time.

A

So I'm going to create one called mysql username, give it the value datahub, and then create it. It'll take a second here to refresh; let's make sure that's created. Then I'm going to create one called mysql password, same exact thing. I'll just skip the description on this one.

A

Okay, so I've got two new secrets. I'm going to go back into here, edit this source, and actually use those secrets inside of my recipe. The way you do that is just the typical environment variable substitution.

A

So I'm going to reference my username and password secrets, and then hopefully this just works. Next, done. All right, cool. Now we're going to execute it once more. Let's make sure it succeeds this time.
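The substitution step described here can be sketched as follows. This is an illustration of the general idea, not DataHub's resolver: it assumes the familiar `${NAME}` environment-variable syntax, the function `resolve_secrets` and the secret names `MYSQL_USERNAME` and `MYSQL_PASSWORD` are hypothetical, and the secret store is modeled as a plain dict of already-decrypted values.

```python
import re

# Matches ${NAME} placeholders, where NAME looks like an environment variable.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_secrets(value, secrets):
    """Recursively replace ${NAME} placeholders in a recipe with secret values."""
    if isinstance(value, dict):
        return {key: resolve_secrets(item, secrets) for key, item in value.items()}
    if isinstance(value, list):
        return [resolve_secrets(item, secrets) for item in value]
    if isinstance(value, str):
        def lookup(match):
            name = match.group(1)
            if name not in secrets:
                # Fail loudly: an undefined secret is a misconfigured recipe.
                raise KeyError(f"secret {name!r} is not defined")
            return secrets[name]
        return _PLACEHOLDER.sub(lookup, value)
    return value  # numbers, booleans, None pass through unchanged


# The stored recipe only contains placeholders, never the sensitive values.
stored_recipe = {
    "source": {
        "type": "mysql",
        "config": {
            "host_port": "localhost:3306",
            "username": "${MYSQL_USERNAME}",
            "password": "${MYSQL_PASSWORD}",
        },
    }
}

# Hypothetical secret store contents, resolved only at ingestion time.
secret_store = {"MYSQL_USERNAME": "datahub", "MYSQL_PASSWORD": "datahub"}

resolved_recipe = resolve_secrets(stored_recipe, secret_store)
print(resolved_recipe["source"]["config"]["username"])
```

Resolving at run time rather than at save time means the stored recipe can be shared or displayed without exposing credentials, and an undefined secret shows up as a failed execution rather than a silent empty value.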

A

And then, once this is done, I'll show you one last thing, which is what things look like when execution fails. There are many cases in which execution can fail: you misconfigure a recipe, or your secrets aren't properly defined. So there you go, it succeeded.

A

We did resolve those secrets, and we were able to extract the metadata. I can show you the failure case by just changing this to, you know, localhost on a different port, which isn't actually active, and trying to run it one more time.

A

This one maybe.

A

One other thing I'll call out is that you can of course cancel running instances of the job as well. I know some jobs can take a really long time, or can get hung sometimes, so you can go ahead and cancel using the Cancel button there. And you can see that it's failed, because we failed to set it up properly. If you go into it, you'll see some debugging information about why that source failed, and I think it'll eventually say that it can't connect to,

A

you know, localhost: no connection. So this is really useful for us too, as the central team. When you're running these recipes, it's going to be really easy for you to share your screen or send these to us, so we can take a deeper look at why things maybe aren't working for you. The final thing I'll show is just the cancel, which I talked about. Hopefully we get a run. Okay, so it's running. I'm just going to try to cancel, and note that it won't remove any metadata

A

that's already been ingested. And there you go, I went ahead and canceled that one. So that's pretty much the ingestion framework. You can see that in this case, with the cancellation, we got through some of the installation steps, but we didn't actually ingest anything. So you'll have a little bit of detail in the case of a cancellation as well. But that's pretty much it, I think, for the demo. So I'll go back to this last.

A

All right, just a quick overview in pictures of what we built, for the people who weren't able to make this session. And now, as usual, I'll talk a little bit about how this actually works behind the scenes, for the technical audience who is interested. So how this works is we have a few new critical components that we introduced along with, of course, the UI screens that you saw.

A

First of all, we have an embedded scheduler that we added into the metadata service. It is simply responsible for listening to changes in your ingestion source schedules and then scheduling them on a local thread using a cron scheduler. What this is doing is just executing ingestion requests on a schedule. This means that if you take down the metadata service and you've missed a scheduled ingestion, then when it comes back up, it'll simply ignore the missed run and pick up where the schedule left off.
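The restart behavior described here (missed runs are ignored rather than replayed) can be sketched as below. This is a simplified model of that policy, not the metadata service's scheduler: it assumes a once-a-day schedule in UTC, and the function names and dates are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def next_fire_after(now, hour, minute):
    """First daily fire time strictly after `now`; earlier slots are never replayed."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


def runs_missed_during_downtime(went_down, came_back, hour, minute):
    """Fire times that fell inside the outage window (for reporting only)."""
    missed = []
    fire = next_fire_after(went_down, hour, minute)
    while fire <= came_back:
        missed.append(fire)
        fire += timedelta(days=1)
    return missed


# Hypothetical outage: the service is down for a little over two days.
went_down = datetime(2022, 1, 5, 9, 0, tzinfo=timezone.utc)
came_back = datetime(2022, 1, 7, 12, 0, tzinfo=timezone.utc)

missed = runs_missed_during_downtime(went_down, came_back, 9, 46)
resumes_at = next_fire_after(came_back, 9, 46)

print(f"skipped {len(missed)} scheduled runs")
print(f"schedule resumes at {resumes_at.isoformat()}")
```

Under this policy the scheduler only ever looks forward from the current time, so a long outage does not trigger a burst of catch-up runs when the service returns.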

A

So I think that's a pretty important piece there. The other big piece that we've introduced is what we're calling the actions framework. The actions framework is basically a subsystem that listens to changes in the metadata graph and then takes particular actions.

A

So in this case, we actually built one action on top of the actions framework, which is listening for requests to execute an ingestion source. What it's doing is in turn using what we call an executor, which is like an agent, to actually handle that command: to validate and run the actual datahub ingest recipe.
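The relationship between the actions framework, the ingestion action, and the executor can be sketched as a small publish/subscribe loop. Everything in this sketch is hypothetical (the class names, the `execution_request` event kind, the result tuples); it only illustrates the shape of "listen for a request, validate it, hand it to a worker", not the real implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Event:
    kind: str      # what changed, e.g. "execution_request"
    payload: dict  # details of the change


class ActionBus:
    """Stand-in for the actions framework: routes events to subscribed actions."""

    def __init__(self):
        self._handlers: Dict[str, List[Callable[[Event], None]]] = {}

    def subscribe(self, kind, handler):
        self._handlers.setdefault(kind, []).append(handler)

    def publish(self, event):
        for handler in self._handlers.get(event.kind, []):
            handler(event)


class Executor:
    """Stand-in for the executor: validates a recipe, runs it, records the outcome."""

    def __init__(self, run_recipe):
        self._run_recipe = run_recipe  # injected so the sketch stays self-contained
        self.results = []

    def handle(self, event):
        recipe = event.payload.get("recipe")
        if not isinstance(recipe, dict) or "source" not in recipe:
            self.results.append(("FAILURE", "recipe has no source"))
            return
        try:
            self._run_recipe(recipe)
            self.results.append(("SUCCESS", ""))
        except Exception as exc:  # report the error instead of crashing the listener
            self.results.append(("FAILURE", str(exc)))


def fake_run(recipe):
    """Pretend ingestion run; fails when pointed at an inactive port."""
    if recipe["source"].get("host_port") == "localhost:9999":
        raise ConnectionError("cannot connect to localhost:9999")


bus = ActionBus()
executor = Executor(fake_run)
bus.subscribe("execution_request", executor.handle)

bus.publish(Event("execution_request", {"recipe": {"source": {"host_port": "localhost:3306"}}}))
bus.publish(Event("execution_request", {"recipe": {"source": {"host_port": "localhost:9999"}}}))
bus.publish(Event("tag_change", {"tag": "pii"}))  # no action subscribed, so ignored

print(executor.results)
```

Separating the listener from the worker is what lets the same framework host other actions later: a new action is just another subscription on a different kind of change.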

A

So we're really excited about this, and the actions framework is going to be much broader than just ingestion. We're envisioning it as a place where we can define actions on tag changes, term changes, schema changes, and the like. Just a components overview again: we have this ingestion scheduler, an embedded cron scheduler inside of the DataHub metadata service. This is an important point, because it means that we don't have to integrate with something like Airflow or Prefect or something else as an add-on to DataHub. This is a native capability.

A

We have an actions framework which, as I mentioned, responds to changes in the metadata graph. We have an ingestion action, which actually runs the ingestion, and then we have the executor, which is basically the worker. Finally, I'll wrap up by talking about the vision for the future of this set of features in particular. We would love to clean up some of the configs that you saw us having to define in the recipe builder. Specifically, the sink configurations aren't really that useful, because we already know where your DataHub instance lives.

A

So we want to make that a little bit easier. In addition to that, we'd like to put a form, or a more friendly UI, on top of the recipe builder experience, with the ability to switch back and forth between the normal YAML-based editor and a more user-friendly form. We'd like to add the ability to test your connection to a data source directly inside of the recipe builder, so that you can get that quick feedback without having to go and click Execute and see that something's wrong.

A

We'd like to have in-flow secret creation, so as you're building that recipe, you're able to just create a secret and embed it right inside that flow. And then finally, based on demand from the community, we would love to introduce some sort of controls for rolling back ingestion runs in the UI, with the obvious upsides of that in the case of, you know, bad ingestions.

A

All right, and I just want to make one last call to action: help the core team prioritize these features. You can help us by reaching out to us on Slack, letting us know what you like and what you don't like about it, and, most useful of all, just trying it out when it's released, trying to build some flows on top of it, and really battle-testing the feature.

A

All right. This is coming soon. We're targeting the v1 rollout, which is everything you just saw, for the end of January 2022. So we're really excited to get this out to the community and start getting some feedback.