Description
As a follow-up to our session "Why did we migrate to a Data Lakehouse on Delta Lake for T-Mobile Data Science and Analytics Team," Robert Thompson and Geoff Freeman, Members of Technical Staff at T-Mobile, continue their in-person discussion with Denny Lee on how their data lakehouse improves their data science and data analytics efforts.
Quick Links
Blog: https://delta.io/blog/2022-09-14-why-migrate-lakehouse-delta-lake-tmo-dsna/
Join us on Slack: https://go.delta.io/slack
Join the Google Group: https://groups.google.com/forum/#!forum/delta-users
A: We're just waiting to get the LinkedIn and the YouTube links up and running, but we are actually live together at the T-Mobile campus in cloudy Factoria, in Bellevue, Washington, so give us one or two more minutes to get ourselves up and running. Chime in through LinkedIn, YouTube, or WebEx, and tell us where you're based out of. Like I said, the three of us are live together, based out of the cloudy Bellevue, Washington area, in Factoria.
A: All right, so, sorry, since I haven't gotten the LinkedIn and YouTube links yet, I'm just wondering.
A: All right, if you can just ask Carly to send me the LinkedIn links, then we're good to go.
A: Send them to me by Slack. Are we good to go? Let's see, okay, perfect! Thank you kindly, all right. Well then, let's start the show while we're waiting for everything. Hi there, my name is Denny Lee. You are currently live from cloudy Factoria, Bellevue, Washington, at the T-Mobile campus.
A: This is "Implementing a Data Lakehouse for Improved Data Science and Analytics at T-Mobile." I'll start off by having the gentlemen here to my left and my right, or however you see it, introduce themselves: a little bit about who they are and why we're even talking to them today.
B: My name is Geoff Freeman. I'm a solutions architect on the procurement team at T-Mobile. I have been here for about three years. Before here, I was at Cruise for a little while, building autonomous vehicles, and before that I was at Microsoft for 13 years. My background is in cloud computing and data warehousing.
D: I am Robert Thompson. I'm also a solutions architect here at T-Mobile, on the same team. Our desks are side by side.
A: All right, well, hey! We actually had a webinar, I want to say a year ago now, but let's rehash a little bit. I actually want to start that rehash by having each of you share a little bit of your history, because, as opposed to me, which is very boring, you all have very interesting histories on how you got into this world in the first place. So, Robert, since you spoke last, why don't you start off a little bit?
D: You know, computer science is what I've always wanted to do. I got a computer when I was like 10 and was pretty much self-taught. I went to school in Louisiana, at Louisiana Tech, for computer science, did that for a little while, and then left and went to LA and worked on some planes. I worked on the Prince of Saudi Arabia's plane; that's the most famous one.
A: Yeah, that's your history! Let's see if we can shut off your video, because it looks like your video is interfering, and it's only showing yours as opposed to the panel. So it looks like the camera is turned off, right? Or is the camera on right now? Of course, you see it.
A: At least right now, for some reason, we can only see you, Robert, on the restream that's going out to LinkedIn, so I want to shut off yours just to see what happens. No! No! I'm sorry.
A: So I'm saying, why don't you shut it off and see if that helps. Sorry, we're still working on the technical stuff; let's see if this actually helps us. Okay, bring it up. If only we had some technical people in here. I know, I wish we had some people that knew what they were doing, but apparently that's not us.
A: Yes, apparently not. All right, so right now we're going to wait a couple of minutes just to see if this takes, because there's usually about a minute of delay on what's broadcast to LinkedIn.
A: Oh, it looks like we're good. Okay, all right. We are okay. For people who are on LinkedIn and YouTube, please validate and let us know that you actually are seeing us right now. I believe you are now, but we want to confirm that, so just chime in in the comments, please.
A: Logged into LinkedIn and YouTube... I haven't done that yet, so all right! Well, meanwhile, as I wait for that: Geoff, why don't you talk a little bit about yourself, your history, how you got involved in all this?
B: What do I like to do? I like to cook and I like to play with computers, and when I did a little pros-and-cons list, it turned out it was pretty heavily weighted toward the computers. So my first programming job was at a startup called Complete in Portland, Oregon, where I was the back-end and data-access-layer programmer. I worked there for a few years, and then I decided I was going to move to Seattle just to see what would happen, and I got a job at Microsoft, working for Sasha Berger and Mosha Pasumansky, and the first thing they said was: hey, we're building a distributed version of Analysis Services.
B: And it's about that time that I got introduced to Denny, and Denny had some experience building absurdly large things at that point.
B: Yeah, that would be the pot calling the kettle black. And then I worked for Bing for a little while. That's where I met Robert, and that's where Denny gave us, like, the largest machine that had ever been built for testing out tabular models.
B: Yeah, and then, while I was there... so I worked on that for Bing for a while, and then I was working in Azure on the Data Lake Analytics team, which, for anybody who used U-SQL (all three of you), taught me a lot about cloud data processing. From there I went to Cruise, where, basically, the amount of data that was coming off of these cars was just truly absurd.
B: So I basically built giant data shovels and a catalog, and when I left there, Robert was telling me he was doing some interesting things, having some interesting problems, at T-Mobile, and I decided, you know...
A: Right, which is basically just a small rehash for folks that didn't watch the previous video: what were the problems that you originally were facing? And when Robert said, "hey Geoff, come join us," what was the transition that you ended up making?
D: To start with that: we had a bunch of data silos, and no one could tell you the source of truth for a particular attribute. Like, where did this come from? "I got it from here." Well, where did that come from?
D: So there was no way to know the source of truth for pretty much anything. I mean, you had a couple of source systems, but it was tough to get access to those systems. You were told to go here, or here, or here, and then...
B: The big thing, the big source of contention that we had, was that we had analytical workloads and operational workloads trying to run at the same time on this Azure data warehouse that, you know, isn't really meant for that sort of combination of workloads. And the first thing we identified is: hey, we can't scale this up any further; or rather, it wouldn't be worth the money.
B: We can still use the data warehouse for presentation, but we'll just move those ETLs off. So we did a proof of concept with one of our allocation programs where we just did it all in Spark: we compiled it all to Spark using Databricks, output it to Delta tables, and then pointed the actual program that had been reading from the warehouse at those Delta tables instead, and we saw huge, immediate benefits like that.
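The proof of concept Geoff describes, with ETL jobs writing Delta tables while downstream programs read from them, works because Delta tables are versioned: a committed write publishes a new table snapshot without disturbing in-flight reads, which is what removes the read/write contention they were hitting on the warehouse. A minimal plain-Python sketch of that idea follows; the class, the `tower` rows, and all names are illustrative, not T-Mobile's code and not the actual Delta Lake API.

```python
# Sketch of a Delta-style versioned table: an ETL writer publishes a whole
# new version atomically, while readers keep an immutable snapshot of the
# version they started with, so ETL writes never block analytical reads.

class VersionedTable:
    def __init__(self):
        self._versions = [[]]          # version 0: empty table

    def latest_version(self):
        return len(self._versions) - 1

    def read(self, version=None):
        """Readers get an immutable snapshot; concurrent commits don't affect it."""
        if version is None:
            version = self.latest_version()
        return list(self._versions[version])

    def commit(self, rows):
        """An ETL job publishes a complete new version in one atomic step."""
        self._versions.append(list(rows))

table = VersionedTable()
table.commit([{"tower": "T1", "status": "planned"}])     # ETL run 1 -> version 1

snapshot = table.read()                                  # analyst query starts here
table.commit([{"tower": "T1", "status": "built"}])       # ETL run 2 -> version 2

assert snapshot == [{"tower": "T1", "status": "planned"}]  # reader unaffected
assert table.read()[0]["status"] == "built"                # new readers see version 2
```

Real Delta Lake implements this with a transaction log over Parquet files, which is also what makes point-in-time reads of older versions possible.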
A: So, when you talk about scale, I'm curious: what are the type of numbers we're talking about, like the amount of data?
B: Honestly, the amount of data for that is not huge. We're talking about, I don't know, P3 overall was probably analyzing 30 gigabytes of data, right? The problem was not the amount of data. The problem was the workload contention that we were seeing on a dedicated warehouse.
B: Right, yeah. The major problem we were having was concurrency, both in the number of users but also in the types of ETL, which would frequently block...
A: Each other, right. Gotcha.
B: Also, we would have operational workloads blocking the ETL, and then analytical workloads were waiting on the ETL to run. So, how do we continue to support these operational workloads that people had built organically, without necessarily having the authorization to do so? I mean, not that it was prohibited, but when all this data was centralized here, a lot of it had been centralized for the first time, and people just went: oh my God.
B: "We can build such powerful reporting off of this!" So they would go build reports without having, like, an SLO agreement, right? But by the time they were told, hey, you know, we did not agree to support this, right...
B: So the combination of analytical workloads and operational workloads would cause all this contention, both from the number of people that you can have concurrently querying a data warehouse, and from the simple question of whether the workload is going to complete, right?
B: So the biggest thing, the biggest driver for us to move to a lakehouse architecture from the very get-go, was just workload isolation, right? The fact that it scales out linearly instead of up was also a huge bonus. The fact that now we don't have to do as much data engineering to move data around was a huge bonus. But the real crux of what drove us in that direction was concurrency.
A: So I guess there's an implication also that your environments are very, very heterogeneous to begin with anyway, right? You've got lots of different systems trying to access this data, lots of different workloads trying to process updates into the same set of tables, and you also have a very diverse environment that's actually querying these tables at the exact same time. Can you provide a little bit of context?
D: You have to pick the proper location to build the tower, because if you build it in the wrong spot, and the fiber is on the other side of the river, and you have to drill a hole under the river in order to run the fiber, that's a problem, right? So all of these things go into, you know, where should I build the next tower?
A: So basically you're saying that, from the standpoint of the data that you're centralizing: before this, everybody had to go to very different environments to do the logistics of even building a tower. So it's like, okay, I can access one system that tells me whether I'm legally clear to purchase this land, and I have to access a completely different system that says where the heck the fiber is, yeah. And that's just the simple problem, let alone everything else, right? Yeah, okay, got it.
B: You know, DocuSign or whatever for the contracts. T-Mobile until, like, 2017 was just a marketing company with a technology side, yeah.
A: Not to mention, I'm just thinking to myself, different states and different counties have different records, with different formats, that provide you the information just on the land leases, let alone the locations of the land, let alone the rights versus the legal documentation that goes with it. Yeah, they're all in different forms, with the complete lack of standardization that we all know. If you go to a single county, even within the county it's actually going to mess you up, let alone going across counties or across states. Okay, wow.
A: Yeah, right, because you need to basically somehow standardize all of this data so people can actually go query it. So that's why your ETL processes were so complicated; or, not were, are still complicated, because it's actually about trying to standardize and conform all of the... forget about any of the facts, just the dimensions alone.
A: Okay, so, damn, okay. So that's why it was crucial to have this lakehouse, because, basically, before this you actually had to go to all these different systems. In order to be able to go from 300, sorry, 100 a month, to 1,500 a week, that wouldn't even be possible. It's not even physically possible to query all of that. It wasn't.
A: In every business, there's always that one person, that one guy or gal, that basically has the Excel spreadsheet. You know, this goes back to our Power Pivot and Power BI days: the one guy or the one gal, right, that knew the business domain, and he or she would be able to go, oh yeah, I'm gonna grab these 20 different sources and process these things; I want to create this one spreadsheet that merges it all together, and then, boom, you're good to go. Except...
A: You know, if we were just having the conversation to be like, cool, I get to run another cool pivot report, an Excel report or a Power BI report. But obviously that's no longer the case for you guys, right? You actually have other scenarios, other problems that you have to solve. In other words, what you did per the last webinar was describe, in essence, how you were able to scale up from 100 a month to 1,500 a week. Okay, great. What are the problems now?
B: We have got our system to the point where it's easy for us to ingest new data, take that data, and run ETLs that are going to shape it and present it. The two big problems that we're having right now are: one, how do we get the rest of the enterprise to do that?
B: That's one track, and the enterprise is going down that path with some velocity, which is great. And the other path is: how do we then make this data easily discoverable, and give people confidence in the quality of that data, where it's come from, and what it means? Okay.
A: So let's actually break those two things out. Absolutely. So, with data discoverability, the last thing you just said, and before that, you were talking about basically making this available for the enterprise, right. Is this, by any chance, the templated CDF stuff that we had been talking about? Okay, so yeah, let's definitely focus on that. Let's actually talk about that for the audience.
D: We're actually trying to templatize our platform altogether, so that we can hand off templates and say: okay, secure team over here that does marketing, sure, you should also build the same type of platform that we have in order to service that domain.
B: Got it. The benefit now is that all of your data is in Delta tables, which makes it really easy for us to share it across teams without having to move stuff, right? So you get all the data mobility off of that. The templatized change data feed, though, that we were talking about before, is: how do we use the power of Delta tables and the change data feed? Because there are still systems that require their own back end, or that require some other way of reading data.
B: Whether that's an ERP or some sort of web app or whatever, they need a live database that they can go query. So how do we take what's in the lakehouse and most efficiently get it there? What we just built the first version of, for our system, is reading from the change data feed. So, at what level of expertise are we talking to here, with everybody? If people need me to explain the change data feed, I can.
B: Yeah, so with Delta tables, as of Delta 2.0 (great job, Denny), yes: we've always been able to see a point in time with a Delta table, but Delta Lake 2.0 introduced the change data feed, which makes it possible not just to say "show me the state of the table at a given time," but "show me everything that's changed since that time." So you can say: the last time I loaded my table was yesterday at noon.
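To make that distinction concrete, here is a toy model in plain Python, not the actual Delta Lake API: time travel answers "what did the table look like at time t?", while the change data feed answers "what changed after time t?". The commit log, the integer timestamps, and the rows are all illustrative; the change-type names are loosely modeled on the feed's insert/update/delete row types.

```python
# A hypothetical commit log: (timestamp, change_type, row).
commits = [
    (1, "insert", {"id": 1, "status": "planned"}),
    (2, "insert", {"id": 2, "status": "planned"}),
    (3, "update_postimage", {"id": 1, "status": "built"}),
]

def snapshot_at(t):
    """Time travel: replay the log up to t to reconstruct the table state."""
    state = {}
    for ts, change, row in commits:
        if ts <= t:
            state[row["id"]] = row
    return state

def changes_since(t):
    """Change data feed: only the changes after t, not the whole table."""
    return [(change, row) for ts, change, row in commits if ts > t]

# "Show me the table as of time 2" vs "show me what changed after time 2".
assert snapshot_at(2) == {1: {"id": 1, "status": "planned"},
                          2: {"id": 2, "status": "planned"}}
assert changes_since(2) == [("update_postimage", {"id": 1, "status": "built"})]
```

The practical win is that a downstream consumer only has to process the (usually small) change set rather than re-reading the entire table on every load.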
B: Show me all the things that have changed since yesterday at noon, and then you just get a little output that says: this row was added, this row was deleted, here's the new version of this row. And you can load that into a staging table in your system and then do a merge. Depending on, you know, whatever type of system it is, the syntax may differ, but the concept is the same: I'm going to create a table that looks just like my Delta table or my Parquet file...
B: ...that says, here's all the things that have changed. You load it into that staging table, and then you use a merge based on whatever keys you have on that table. And instead of every time having to have, like, a rigid ETL, where you have to have strict standards of timing or strict standards of structure, you can just point it at a table, make a table in the database...
B: ...that looks just like it, and then, on a regular basis, you run something that keeps it in sync, and it becomes very, very easy to say: all of these tables that I have in my...
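A plain-Python sketch of that stage-and-merge step follows. In a real target system the apply step would be a SQL MERGE (or the equivalent upsert) keyed on the table's primary key, and the staged rows would come from Delta's change data feed rather than a hard-coded list; the change-type names and `id`/`status` columns here are illustrative.

```python
# Hypothetical change-feed output staged since the last load, as
# (change_type, row) pairs: one insert, one update, one delete.
staged_changes = [
    ("insert", {"id": 3, "status": "planned"}),
    ("update_postimage", {"id": 1, "status": "built"}),
    ("delete", {"id": 2, "status": "planned"}),
]

def merge(target, changes, key="id"):
    """Apply a staged change batch to the downstream table, keyed on its primary key."""
    for change, row in changes:
        if change == "delete":
            target.pop(row[key], None)
        elif change in ("insert", "update_postimage"):
            target[row[key]] = row        # upsert: matched -> update, else insert
        # rows carrying the old values (the "pre-image" of an update) are
        # not needed for a merge, so they would simply be skipped here
    return target

# Downstream table as of the previous sync.
downstream = {1: {"id": 1, "status": "planned"},
              2: {"id": 2, "status": "planned"}}
merge(downstream, staged_changes)

assert downstream == {1: {"id": 1, "status": "built"},
                      3: {"id": 3, "status": "planned"}}
```

Because the merge is keyed rather than positional, the same generic routine can be pointed at any table with a declared key, which is what makes the pattern templatizable.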
A: ...about getting it across to the enterprise: getting all these folks to be able to jump on board so they can leverage exactly these systems, so they can go and do their own scaling, not just operational data scaling and concurrency, but also their own business scaling. It goes back to, I'm going to repeat it again, 100 towers a month, to 1,500 a week, right? That's a massive jump that still blows my mind.
B: They just set up their Power BI or their Excel or whatever to point to our serverless endpoint, and they can query the data lake directly, right? But if they do have something more, say they're setting up machine learning that's going to take all of the towers that are going online and then find all the people, you know, based off of Pitney Bowes data or whatever, then all they do is they set up...
A: We've had this specific meeting any time we're talking about the fact that people need to be trained into this environment. There are very many vendors, which shall not be named, that will make the attempt to say: oh, let's take a DBA or an SSAS dev, and they can automatically be a data engineer or a data scientist.
A: Yeah, that's never going to work. Everybody who's trying, who wants to learn, absolutely they can learn that, don't get me wrong. In other words, if you're a former SSAS dev or a former DBA, absolutely you should. Yeah, absolutely; we're not discouraging people, quite the opposite, we're encouraging people. We just also don't want to tell people: oh yeah, magically, tomorrow, all of a sudden, you're a data engineer. I apologize for throwing that one in.
B: In all seriousness, T-Mobile is a big company with a lot of moving parts. It's gonna take a while, but there's a lot of recognition of the value of this, and so there are a number of teams that, even if they're not going to do it exactly like we have, have already moved in this direction. Right.
B: That's important, actually, because it's already showing the value. Like, when teams have their stuff in Delta tables, then in order for us to share that data, or combine it, all we have to do is grant permissions. It blows people's minds, yeah. I'm not kidding, people are like: oh my gosh, that was it? It was that easy? Yes: store your data in a data lake, in Delta tables, and sharing data becomes, like, a click of a button.
B: There are, you know, obviously pieces around that. You need to have the governance in place, right? You need to make sure that the way in which you're sharing data, the way you're controlling access, is something that you can audit, that you can rationalize, that you can standardize.
B: Yeah, yeah, but the key there is that now you're talking about a procedural hurdle to sharing data, and not a technological one. Right, right.
A: Make sure that when you share the data, you know exactly who you're sharing it with, how long you're sharing it for, the whole thing. And so, as much as people want to tell you, "oh, I'm frustrated": no, no, that's the whole point. You actually have both the scalability, and your frustrations are now policy, and that's a good frustration to have, let's be very clear.
A: So, okay, in some ways I've already implied why data discovery is so important with that last diatribe of mine. So yeah, why don't you take it, instead of me doing the diatribe.
D: So, in order for your users or your developers to trust the data: if they build a report and they go to an executive at some point, the executive, every time, is going to say, "where'd this data come from?"
D: We're going to show: this is where the data came from, and this is how it got to you. We're going to show every step of the way, and how that attribute got its number or its value, right. And that's what we're working on; that's what we're trying to build into this whole system now, and it's being a little cantankerous.
A: Like, with a standard SAP or ERP system of some type, right, where I've got like eight different sources, REST APIs or whatever else, and data warehouses that I'm dragging data in from, the lineage on just that alone is already relatively complicated. In your case, it seems like it's a permutation of God knows what at this point.
B: You had a bunch of the people who were the domain experts doing stuff in Excel on their desktops. Those processes, even though they might have sourced them from, or pointed them back to, the data that was centralized in our Azure data warehouse at first, still live on, right? They're still driving their business off of that. But what's happened, once you have all that data in the same place, is that previously those people were answerable to nobody, right? Who's going to tell them that their data is wrong?
B: But now that you have all this data together and you can do reconciliation, right, there come those questions about, oh, what is the correct way for this to actually be bubbled up, and how should it actually be structured? And even as you create the correct answer to that, their reports still live on, yeah.
B: Right. And now that all this data is centralized, anybody can just go find it; you know, if you have access, you just go find it, and you're like: oh, this is the way that so-and-so has been doing their reporting, and they gave me access to their view, and now I'm going to use that and go build something else off of it. The challenge there is...
B: ...your golden tables: how do you direct them to your golden tables, right, amongst all of these different things that have been floating around? How do you socialize that? And so one of the big things that we're working on right now is using Databricks Unity Catalog. I promise I'm not shilling for Databricks; I do not get paid. I think it's quite a lot in the other direction, actually.
B: Well, I mean, compared to the other options it's not bad. But because we're...
B: ...doing our ETLs in Databricks, we're onboarding to their Unity Catalog. We've already had a data catalog, but Unity Catalog is going to make it a lot easier for us to point people to: this is the actual source of truth, and this is the easiest place to get it. But it also has all of that lineage that Bobby was talking about, where all we have to do is export that into whatever the enterprise data catalog system is.
B: And with that, it becomes possible for people to recognize: oh, I've been reporting off of, you know, Dougie's view, that just drives his particular business and nobody else's. And they can see that it came from here, and it came from here; instead, here's the actual, like, business-approved view, and we can go redirect from there, right.
B: And so this becomes great for reporting, but the place where we're really looking forward, toward the future, is that this makes it possible to ensure the data that you're using to drive any ML processes, when you move from "we're reporting on what happened" to "we're going to predict what we think is going to happen," as a batch, in the future...
B: That's the piece that we're looking forward to with that. Discoverability is not just "how do we write better reports for SOX compliance," but "how do we make sure that the automation is what we think it is," right?
A: So one of the things that I definitely like talking about, especially now that we're shifting more toward the machine learning side of the house, right, is that there's a term coined as "explainable AI," which is basically this idea that I'm going to run a machine learning model. That's great!
A: And I have some idea what my features are, but I still need to be able to explain how I came to the conclusion that I got in the first place, because, most of the time, a lot of machine learning models are black boxes, right? So you have no idea, and I'll use a decision tree in context, right: you have no idea why it says, like, alpha is less than 0.2.
A: What does that even mean? To say that, okay, for the sake of argument, we're gonna go target a marketing campaign for Factoria, for some reason, you know; that type of thing, right? So even our attempts with traditional, like, not even general, easy machine learning models, there we go, like decision trees, comparatively, versus AI, especially, like, your LLMs or whatever else: we can't explain all that stuff unless we actually have full lineage, full policies, and one central place to go.
D: That is. So, having this lineage, even when you get down to the features, right, like the next column... One of the Databricks guys said it the best, and I forgot who it was, but he's like, "machine learning is just another T." Yep. Yes, it's just another T, and I was like, that's perfect, I'm going to steal that. And yes: the output of that T is another table, and the feature is just a column in that table.
D: Right, like, let's just break it down and oversimplify it, right? So, having the whole thing discoverable from end to end, all the way down to this next column, so the next guy can build another T on top of the T on top of the T, right... I mean, just being able to explain that all the way through is powerful, right? I can't think of a better word, but we'll...
B: How do we get there? To push back on people wearing Databricks hats just a little bit: I don't think we're actually going to, like, centralize it all in UC (oh yeah, amazing catalog), because a big part of the automation that we have built is in Azure Data Factory. So, unfortunately, ADF doesn't export...
B: Well, so he's building something that's going to output it to OpenLineage, but...
B: ...a job of exporting that lineage. So people will still go to Unity Catalog to discover data, but from a lineage perspective, I don't know if we're going to have it in an enterprise-wide catalog.
B: Our team has our own data catalog. But to bring it back to Delta tables: I mean, you could do the same thing with Astronomer and Datakin, or whatever it is your orchestration engine is. The really cool thing about Delta tables is that they're easy to automate and scale out, and as soon as you're doing it in a consistent manner, you're only having to figure out your lineage from one engine, right?
B: If every team has their own version of whatever, like, if every pipeline is custom, then it's going to be impossible for you to standardize that reporting, right? But when you centralize on one technology, like Delta tables, and then you scale it out using automation, using whatever your automation engine is going to be, right, it makes it possible to turn your attention to other things. Right, right.
A: No, I appreciate you doing that callout, because that's almost the whole point. You're gonna basically choose whatever technologies you're able to, whether it's a limiting factor or whether it's an "oh, now I can do something that I couldn't do before." The whole point is that you're not limited: you can just do what you need to do in order to be able to focus on the next thing, and then the next thing, and the next thing, so you can make this whole automation concept a lot simpler.
B: Yeah, exactly, yeah. One of the things I like to tell people on our team is that I want to code myself out of a job.
B: And so, you know, the frustration that people might have with "oh, now I have to go deal with governance" or, you know, the process type of workflow, is evidence that you've done a good job coding yourself out of a job, yeah. And now you get to turn your attention to this process piece, which might not be as exciting as data engineering, depending on what sort of nerd you are, but...
A: Yeah, actually, with that, I think that's a great segue to end it. So, no, no, literally: the lakehouse allows you to focus on more interesting problems, so I think that's a great way to close. If you have any other questions, please do chime in. By the same token, I think we're actually technically even out of time, so...
A: Thanks for helping us out, all of us, and Carly behind the scenes, thank you very much as well. So, without further ado, that should be it; we're done for today. Thank you very much, everybody.