From YouTube: Snowflake Ingestion Improvements
Description
Shirshanka Das (Acryl Data) shares recent speed and functionality improvements to ingesting metadata from Snowflake during the July 2022 Town Hall.
Learn more about DataHub: https://datahubproject.io
Join us on Slack: http://slack.datahubproject.io
Follow us on Twitter: https://twitter.com/datahubproject
Awesome, it's raining ingestion today. We've gotten a lot of feedback about things that are working well, and also things that are not working quite as amazingly as we would expect. One common issue that comes up often, and we see this both with open-source DataHub adopters and with actual customers, is that people forget we have two connectors for the same warehouse, such as Snowflake and BigQuery.
We have two different connectors that bring in slightly different metadata for each warehouse. For Snowflake, the primary connector brings in table schemas, columns, descriptions, and lineage; it uses SHOW TABLES and additional queries to get all of that metadata out. Then there's the Snowflake usage connector, which gets your usage counts for both tables and columns, so the top users as well as the operational history of a table.

Historically, the reason we split those apart was that the first warehouse we did this for was BigQuery, and with BigQuery usage, the permissions you needed to access the logs, as well as the protocol being used, were so different that it felt weird to mash both of these sources into one connector. So we repeated the pattern for Snowflake, and then repeated it for a few more warehouses.
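In practice that split means maintaining two recipes against the same warehouse. A sketch of what that looks like (the `snowflake` and `snowflake-usage` source types are the ones named in the talk; the account, credential, and sink values below are illustrative placeholders, not real settings):

```yaml
# Recipe 1: technical metadata (schemas, columns, descriptions, lineage)
source:
  type: snowflake
  config:
    account_id: my_account          # placeholder
    username: datahub_user          # placeholder
    password: "${SNOWFLAKE_PASSWORD}"
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
---
# Recipe 2: usage statistics (query counts, top users) for the same warehouse
source:
  type: snowflake-usage
  config:
    account_id: my_account
    username: datahub_user
    password: "${SNOWFLAKE_PASSWORD}"
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
```

Two recipes means two sets of credentials, two schedules, and two chances to forget one of them, which is exactly the failure mode described next.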
But one year down this road, we're realizing it's not quite what we had anticipated in terms of ease of use, and it's not ideal. Most people set up one connector, then come to a Snowflake dataset page and ask: why don't I have my queries in here? Why don't I have top users and all these amazing things? How do I get them? So we're making big improvements to combine these connectors, so that you only have one connector to configure and you get the benefits of both streams of metadata from a single connector. Next slide. There's another common mistake that people end up making.
So those are two common problems that people run into with the split-connector approach. If you go further down, even the base connector, the one that pulls out technical schemas and the like, has issues. The first time we wrote these connectors, we layered them on top of SQLAlchemy. SQLAlchemy is great because it is a generic connector system: as long as you have a dialect for a database, you can get a bunch of metadata about anything you can connect to.
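As a rough illustration of that dialect-driven genericity, here is a minimal SQLAlchemy inspection loop. SQLite stands in for the warehouse here purely so the sketch is self-contained; a Snowflake dialect is driven through the same Inspector API:

```python
from sqlalchemy import create_engine, inspect, text

# SQLite as a stand-in: any database with a SQLAlchemy dialect is
# inspected through the exact same generic interface.
engine = create_engine("sqlite://")
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE pets (id INTEGER PRIMARY KEY, name TEXT)"))

inspector = inspect(engine)
tables = inspector.get_table_names()                      # one round-trip to list tables
columns = {t: inspector.get_columns(t) for t in tables}   # one more round-trip per table
```

Note the shape of the loop: one call to list tables, then one call per table for its columns. That per-object fan-out is precisely what gets expensive at warehouse scale, as described next.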
But it's actually pretty expensive. Essentially, the first query does the equivalent of SHOW DATABASES; then for each database it looks at all the schemas in that database; then for each schema it looks at all the different tables; and then for each table it goes and gets descriptions and other metadata. So the number of queries is on the order of the number of databases, times the number of schemas, times the number of tables, and that can take a while for most production warehouses.
We end up seeing it take anywhere from five minutes for small warehouses, to 20 or 30 minutes, sometimes even an hour, and that's not ideal.
Moving on: what we wanted to do was move to a much more efficient way of ingesting where, instead of depending on the SQLAlchemy connector, we just use the official client. For example, for the Snowflake source, if we use the official Snowflake client and make more efficient queries, we can do it on the order of D times S: the first query just gets the databases, and then for each database we either get all the schemas and table metadata with one query or, if that's too much, get the table metadata per schema. So we tried this out to see how much we could actually shave off these latencies, and the results were pretty surprising.
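To see why this helps, here is a back-of-envelope sketch of the query counts under both strategies. The exact query plans of the real connectors may differ, and the example numbers are made up purely for illustration:

```python
# Rough query-count model for the two metadata-extraction strategies.

def per_table_queries(d: int, s: int, t: int) -> int:
    """SQLAlchemy-style: 1 (list databases) + d (schemas per database)
    + d*s (tables per schema) + d*s*t (per-table descriptions etc.)."""
    return 1 + d + d * s + d * s * t

def batched_queries(d: int, s: int = 0, per_schema: bool = False) -> int:
    """Native-client style: 1 (list databases) + one bulk metadata query
    per database, or one per schema if a database is too large."""
    return (1 + d * s) if per_schema else (1 + d)

# Hypothetical warehouse: 5 databases x 10 schemas x 100 tables.
naive = per_table_queries(5, 10, 100)   # 1 + 5 + 50 + 5000 = 5056 queries
fast = batched_queries(5)               # 6 queries
```

Since per-query latency dominates small metadata queries, collapsing thousands of round-trips into a handful is where the wall-clock savings come from.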
We have an open PR that proposes adding a new connector, which we're calling the Snowflake beta connector. Mayuri ran some tests against the Long Tail deployment, and Long Tail is just a simple demo warehouse, not really a production warehouse, and even there we were able to shave the latency from about four and a half minutes to just 30 seconds, which is amazing. I can only imagine how much we'll be able to improve the latency of real production warehouses with thousands and thousands of tables.
So, good news: the PR is up, and we'll probably merge it sometime later this week. It is compatible with the current Snowflake connector in terms of getting schemas, lineage, and table-level profiling. So if that's all you're using your current Snowflake connector for, you can actually move to this right away.
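For those cases, migrating may be as small as changing the source type in the recipe. A sketch, assuming the new source type follows the "Snowflake beta" name used in the talk (the exact type name and config fields may differ in the merged PR, and the other values are illustrative placeholders):

```yaml
source:
  type: snowflake-beta       # assumed name; check the release notes
  config:
    account_id: my_account   # placeholder
    username: datahub_user   # placeholder
    password: "${SNOWFLAKE_PASSWORD}"
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
```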
We're first going to roll out the connector as part of the next release, with these caveats: try it out, give us feedback, and help us improve it. We'll follow up really quickly with the addition of stateful ingestion, column-level profiling, and the usage capabilities from the usage connector. That will give us everything we wanted: a fast, efficient connector with a unified config that lets you get both usage statistics and regular technical metadata.