Description
Considering shifting gears into Spark data engineering? Join this fun session with Simon Whiteley (@mrsiwhiteley) and Denny Lee (@dennylee) as they chat through their meandering journeys from SQL Server & BI to Apache Spark, Delta Lake, and the emerging Data Lakehouse approach. Be prepared for a geeky, trans-Atlantic event from two data nerds.
A: So, if you are watching us live, or are just launching in, you're probably seeing us go ahead and mess around with technology and all that stuff. Hi there, I'm Denny. Welcome to the tech chat with Simon and Denny. Basically, we're here to talk about our experiences from the standpoint of SQL Server and BI to Spark, Delta Lake, and lakehouses. So basically we wanted to go ahead and have you guys chat if you'd love to. Sorry, it looks like... there we go.
A: So, now that we've survived our technical issues: if you have any questions, please drop them into the chat or the Q&A, and for those of you on YouTube Live, go ahead and chime in there as well. I'll be monitoring YouTube Live, and Simon and I will be monitoring the Q&A when the other person isn't talking, basically. So, Simon, without further ado.
B: Okay, so hello, I'm Simon Whiteley. Hi! Yeah, who am I? I don't know, I'm just a tech... Spark person thing. So I run a consultancy in the UK, but generally I'm getting around all over the place at the moment: clients in the US, clients in Europe, clients all over the place, because there's lots of people suddenly doing lakes and Spark and big data stuff that haven't gone there previously. And that's kind of what I spend most of my time doing: going around helping people out, learning Spark, learning how it integrates with the rest of Microsoft Azure. I'm a Microsoft MVP, so I spend lots of time doing Microsoft stuff and talking to those guys about product directions and all that kind of stuff, and I've recently started making lots and lots of YouTube videos about the stuff that we know and we like, which happens to be Databricks. Hence why I'm suddenly talking to lots of dangerous people.
A: Rock on! Well, okay, then, that actually is a good segue to myself. My name is Denny Lee. I am a developer advocate at Databricks. I am a former SQL Server guy myself; literally, I was part of the SQL Server team itself. I was the... my goodness, what was my title? SQL Customer Advisory Team, Data Warehousing and BI Lead, I think that was it. That's why I had such a hard time figuring out what the heck it was: it's some asinine title. But nevertheless, the context is, I worked with some of the largest enterprises that use SQL Server. I was post-sales, not pre-sales: I was there to go ahead and help you get the max out of SQL Server. So, yeah. And then, due to a story which I'll probably talk about a little bit later, I went ahead and made that transition into Spark.
A: I was fortunate enough to be able to join Databricks, because this is an awesome company. And then, just like Simon, I'm really interested in having the conversations about how you work within the context of both the Microsoft Azure world and also the Databricks Spark, Delta Lake, you know, lakehouse world, right? And the fact is that those concepts actually aren't that far apart, right? In fact, at least from where I'm coming from (I'm sure Simon and I have slightly different experiences), the fact is, there's a lot of similarities, and the transition, while in some cases a little rough, wasn't actually that bad to go from one to the other. And so, in fact, that's probably really the basis for the first question that we want to ask. And by the way, please do continue asking questions inside the chat and the Q&A; we're going to go ahead and definitely answer them.
A: It's just that we're probably going to start with a little bit of a... not a script, because we have no script, by the way. This is completely live, and so if you hear my three-year-old in the background, that's live: you're listening to my three-year-old in the background, right? So the idea was more a matter of, we had some things that we want to cover first, just to provide some context, and then we want to definitely dive into the questions. So definitely chime in and put your questions in. But then, yeah, I mean, Simon, I think maybe the first thing we should be talking about a little bit, now that people have some background of who we are, is: what was your first data job like? Not your first job, like, not newspaper boy or anything like that, just your first data job, and how does that compare to your current job, what you're doing right now?
B: Yeah, I mean, I think I started in the same place that so many people started: just, you know, as a reporting grunt in a support team, right. So, fresh out of uni, having done the likes of business and stats and all that kind of stuff, I did one of those year-long intern things at IBM, and at the time it was a balance between CRM and some reporting. And it was like, your job for this week... it was like three days a week I was meant to be basically building a report using Lotus 1-2-3, in the dark ages, and it was like: manage all this data together, copy that column over. It was like a giant script of manual things to do to then put data into a report. So I was there for like a month, just chopping and changing columns around, and I was like, have we ever asked if we could just get it in the right format? Like, no, no, no, it's a big report.
B: And they had this whole Lotus Notes thing, taking emails and putting them into a reporting system, basically, from their CRM system, and then they could do views, which were like little reports and little sales dashboards. And I just took that and started building and building, and by the end of that year it was their fully fledged reporting tool; they did a load of stuff in it. So I went back to uni for a year and then kicked off joining support and stuff, doing a little bit of CRM and loads and loads of reporting, and then that was SQL Server, and that was Microsoft BI. Well...
B: Actually, at that point it was Access. You know, Monday morning, every Monday: run the giant Access database of doom that spits out a load of Access reports, and it takes the entire day cranking out PDFs. You know, world of pain. Eventually we migrated that over to SQL Server, started learning about BI, started introducing MDX, but it was just kind of, you know, slowly learning that stuff, and that was all internal.
B: I was there for like six, seven years, just again learning little bits, but at that point I wasn't at all involved in the community. You know, I didn't get out there and go to meetups and go to talks, so we were kind of scratching together what we knew and what we could learn. And then I left, looked outside, and went, oh, okay, yeah, we could have been doing this a whole lot differently. And then, yeah, after that, Microsoft, and onwards. But yeah, I think a lot of people have had that Monday morning "it's your job to make sure the reports go out" kind of pain.
A: No, I actually absolutely 1,000% agree with you on that one. I mean, the reality is, whenever you start talking about data projects, that's exactly it. Now, admittedly, I did not come from Lotus 1-2-3; that was not my background. But nevertheless, one of my first jobs basically was just to do web analytics. Actually, before that first job I want to talk about... because before that it was just sort of, you know, tossing some data inside a database. I did play with cubes, actually. Actually, no, maybe I should talk about the first one, because, in fact, I just realized my first one actually was part of Microsoft IT. And so, what was interesting about Microsoft IT?
A: It was that we were actually the first team to build the very first OLAP cube, like the very first Analysis Services cube ever. This is back when, you know, Microsoft had purchased that portion of Panorama Software, and the Netz brothers had come from Israel over to Redmond. And literally I was on the team; it was myself and a guy named Jim Berg, so shout out to Jim if he happens to be online right now, and Dave Shuba. Oh, I definitely want to give a shout out to him. We basically went ahead, as part of Microsoft IT, in HR, like, you know, human resources, all right. This is back in the time where the hierarchy from Bill Gates to myself, little me, was only seven levels. Okay, yeah, right, yeah, exactly. So we built the first Analysis...
A: ...oh, sorry, OLAP Services at the time, cube. And so that's how I got into it, and then slowly, because of that, the transition went into web analytics. And so that's actually where I was introduced to the idea of distributed computing early on, even though we didn't have Hadoop or anything like that; it was just more the concept of doing that, basically. And then, because I was doing web analytics and was constantly dealing with very large cubes, that transitioned to joining Bing to help them build really large SQL Server and Analysis Services instances.
A: Then, at the time, with a buddy of mine named Bella Obed (he's actually a solution architect here at Databricks as well), we built, at the time, the largest cube at adCenter, at 6.5 terabytes, or 6.1 terabytes, or whatever it was. That was a cube of that size, and that transitioned into the awesomeness of the Yahoo cube.
A: That's the... yeah, exactly. So, for those of you who don't know what that is: the Yahoo cube, at least in the Microsoft SQL Server BI side of the house, is a 24-terabyte cube that sat on top of a 5,000-node Hadoop cluster, with a massively large Oracle RAC as its staging server, actually, which is amazingly painful. Yeah, exactly, it was just a tad painful. So, yeah.
A: It turned out to be the largest cube, and shortly after we built that is actually when I got introduced to Spark, right? Because the Yahoo team was going, let's stop transferring all of that data from the 5,000-node Hadoop cluster over to this one cube, right? And just to give you some context (and this sort of leads to our next, you know, quote-unquote question of our script here): it took us 72 hours just to process a quarter's worth of data.
A: But I want to start with your story, Simon: tales from the trenches, like, what are some of the issues that we got into? And then I think we actually have some Q&A here from Bob that would probably be very applicable for us to talk about, things like how important ETL and data wrangling are to getting things done. You know, like, yeah, so...
A: Exactly, we'll get to that later. So, yeah, sorry, we're going to answer that one live. So we apologize... I apologize for going ahead and not calling that out. So go ahead: tales from the trenches, some of the pain and the headaches that you've been going through yourself, Simon.
A: No, no, you are, you are, don't worry. It's just, you know, a call out to everybody else: yes, Simon absolutely is. I just happen to have really good stories at the most insane sizes, that's all. That's actually not normal. If it's normal for you to build a 24-terabyte cube, you need to start really not doing that. That's the best answer: just don't do that.
B: So, the normal-ish one, which I look back on kind of fondly, with a sort of Stockholm syndrome: this was the thing that punished me, and it wasn't even that long a project. It was like a three-, four-month project, fairly early in my consultancy life. So I'd been working away as a reporting guy, done some MDX, joined the consultancy, and towards the end of my first year I was going, okay, actually, this is how all this cool stuff works.
B: There was a client who was doing some interesting stuff, like market research, trying to read, sort of, we'll compare that to that, and that, and that. And it's like, okay, this is fine. And, you know, consultancy: the client kind of goes, oh, and then on this, and it goes like this, this, this, this. And doing it in MDX just got gnarly. So essentially it's, you know, just a load of...
B: It wasn't pretty, but the fact that we got this thing actually working... I was like, okay, I'm amazed that it's actually accurate and gives you correct things. And they're like, it's not very fast. It's just like, what do you expect? And yeah, I just keep looking back at stuff like that. These days, you know, MDX isn't to be seen, no one goes near it, and it just makes me sad. I used to love MDX, because of some of the horrible glory.
B: Yeah... problems. But that one's always stuck with me as, kind of, the goalposts just moving and me just desperately trying to catch up, with it just growing and growing. Like, MDX calculations for gift cards? Please, no.
A: Oh no, oh no, I'm with you a thousand percent. I mean, exactly to your point: I remember doing additive measures myself, and like, oh, distinct count. That was the bane of my existence. The absolute bane. Funny, right? These days that's like, great, easy. Exactly, yeah, exactly, exactly. But at the time, right, at the time: the bane. And exactly to your point, in a lot of ways, actually, your cubes, what you had to build, were actually harder than mine, right? Because you actually had a lot of business logic, complex calculations that you had to go through, that you actually had to understand. It wasn't even about getting the data from the storage engine. Oh, sorry, to provide context: Analysis Services had a formula engine, which is basically how it does its queries, and it had a storage engine. The storage engine was grabbing data from disk and chucking it up to memory, and the formula engine was actually the part that did the calculations.
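Denny's "distinct count" pain is worth unpacking: unlike a sum, a distinct count cannot be pre-aggregated per partition and then added up, which is why the engine had to touch so much detail data. A minimal Python sketch of the problem (the user names and partitions are invented for illustration):

```python
# Why DISTINCT COUNT was "the bane of my existence" in cubes:
# SUM-style measures are additive across partitions, distinct counts are not,
# because the same user can show up in more than one partition.

partition_jan = ["alice", "bob", "carol"]   # visitors in January
partition_feb = ["bob", "carol", "dave"]    # visitors in February

# Additive measure: the total is just the sum of per-partition totals.
total_visits = len(partition_jan) + len(partition_feb)

# Naive "sum of distincts" overcounts users seen in both months.
naive_distinct = len(set(partition_jan)) + len(set(partition_feb))

# The correct distinct count needs the union of the underlying sets,
# i.e. the engine must see the detail data (or a sketch such as HyperLogLog).
true_distinct = len(set(partition_jan) | set(partition_feb))

print(total_visits, naive_distinct, true_distinct)  # 6 6 4
```

This is also why "these days that's like, great, easy": modern engines ship approximate distinct-count sketches that do merge cleanly across partitions.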
A: So, in your case, Simon, a lot of your stuff was actually very formula-engine heavy. In other words, it wasn't so much about getting the data off of disk; once you got it off disk, it would actually take a long time to process the data in memory. That was your problem. And so, in a lot of ways, my problems were simpler, right? Because even though I was dealing with a massively large cube, basically it was more about, er...
A: ...can I speed up the disk fast enough? That was it. It was all about, can I get the... because if you think about the calculations: if I were to do semi-additive measures on a 24-terabyte cube, it's not going to work. Yeah, it's just not going to work; let's not bother with the pretense, right? So I did simplify... we did simplify the cube, right. Hats off to Dave Mariani, who actually was leading the project, so I want to call that out: I helped him, I...
A: I wasn't the lead of it, because I was at Microsoft; Yahoo's the one who created it, so shout out to Dave Mariani here. Okay. So they went ahead, and it was just about getting the data into memory faster. So if you look at some of the older presentations Thomas Kejser and I did, it was things like, you know, we love SSDs, right, because they allowed random IOPS, so we could get the data from disk into memory fast enough.
A: No, no, you're right. No, you're absolutely right: the four V's, volume, variety, veracity and velocity, right? Exactly, the four V's. Sorry: volume, yeah, velocity...
B: Veracity came more as a kind of reaction to the lake stuff, you know. The fact that it kind of took off, when someone's like, I'll chuck data into this thing like it was a network drive and just go, it's fine. And, cool, oh yeah, it's fine, exactly. And then no one had a clue what was going in there, and no one could trust anything, and it's like, well, this thing is entirely pointless. So I kind of like putting in veracity just to say: you still have to manage your data. It's not magic.
B: It still needs some kind of management in there. And that's like another real thing right now when talking big data, you know. It's like the number of people where I say, you know, Spark's really good, Spark can help you out, and they go, we don't have big data. And it's like...
B: What do you mean? And they go, we don't have huge amounts of data. It's like, cool, but what kind of data are you dealing with? Oh, we're ingesting a live stream of till data that's in fairly nested JSON. It's like: you have a big data problem. That is an exotic data type with some gnarly unstructured stuff in there, and it's coming in as a stream. That's the very definition of dealing with a big data problem. But there's not that much data!
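Simon's till-data example is the variety problem in miniature: even a modest feed of nested JSON needs real wrangling before it looks like rows. Here is a toy pure-Python flattener (the event shape is invented); at scale, Spark expresses the same thing declaratively with schemas and nested-column access:

```python
import json

def flatten(record, prefix=""):
    """Flatten nested dicts into dotted column names, e.g. 'basket.total'."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat

# One invented point-of-sale event, as it might arrive off a stream.
event = json.loads("""
{
  "till_id": 42,
  "basket": {"total": 18.5, "items": 3},
  "store": {"region": "uk", "id": "LDN-01"}
}
""")

row = flatten(event)
print(row["basket.total"], row["store.id"])  # 18.5 LDN-01
```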
B: And that's what infuriates me: that everyone sees this thing as this big "I don't need that tool, because that tool is for people who are this tall", got to be this tall to ride, kind of thing, right? So, just pulling those things out: as much as I hated the four V's originally, because that's, like, you know, big data marketing people getting a buzzword out (oh yeah, yeah, yeah), it's a great thing to have that conversation around.
A: Right. And, more importantly, the callout... because this is actually one I was calling out when we originally... oh, sorry, I forgot, I think I forgot to mention this. One of the projects I was involved with: I was actually on the Project Isotope team. That was the team that actually built what is now currently known as HDInsight, okay? So we were the ones who brought Hadoop into Microsoft, which was a fun conversation to have with lots of people, by the way.
A: But, number one, I like your four V's, by the way, so I'm going to stick with that: volume, velocity, variety and veracity. I love that one. Number one, okay, so we're in agreement. Number two: what we would typically consider a big data problem wasn't the fact that you had volume and nothing else, or velocity and nothing else, because theoretically you could then just have a single system handle volume, right?
A: If it's just volume, you could literally chuck it into Azure Blob Storage or, you know, ADLS Gen2 and be done for the day, right? Or if it's, you know, velocity, you could literally write custom code for that type of thing, and so forth and so forth, right? The problem was that you had all of the above, or a combination, or some of them, like, you know, two of the four, or whatever, right? And typically, especially in this day and age...
A: ...it's really, in all seriousness, three of the four or four of the four anyway, right? Even if you don't have the volume aspect, you'll often have that velocity aspect, which is, just like you said, streaming JSON coming in. Then you have the variety, in terms of: you're not just looking at one set of data, right?
A: You're looking at JSON, plus you're looking at a database, plus some CSVs while you're at it, plus, you know, REST API calls for social or whatever else, and you've got to combine all that together. And exactly to your point: veracity, right? This is the entire idea that you actually need reliability underneath that, right? Because the great thing about data lakes was that I could go ahead and chuck all my data in there and not worry about it. Oh, by the way, somebody just chimed in and asked: is it possible to re-watch this?
A: We actually put this on YouTube Live concurrently as well, so you're more than welcome to go ahead and just watch it there. So, anyways, back to my point: if you have all this data coming in, the reality is you need to actually make sure it's reliable. You actually have to manage it. And so, at this point, this is where, I think (at least that's my guess, because, you know, in fact, I think this is only our second time)...
A: ...we've actually talked together. Even though we know each other, this is actually only our second time talking. This ultimately led us not just into Spark data engineering, even though we came from the SQL Server side of the house, but it also led us into things like Delta Lake, because it brought us back to ACID transactions: something we missed, something we actually loved having before. And so, yeah, I mean, Simon, what do you think? You know, paint a little picture.
B: One thing I want to pull out, just before stepping into that, is going back to... so Bob had that question about, you know, what about ETL, right? So...
B: We were talking about business logic, and we were talking about encoding all the actual end calculations that you do at runtime on top of a semantic layer, right? And, you know, you get to a point where, putting calculations in, there's an amount of domain understanding that you need for that, right? And then, from a consultancy point of view, you're going from client to client to client and seeing they all had the same problems getting the data in there, the same problems making it trustworthy and quality, the same problems trying to just hop through the same things. That then became the interesting tech problem, going:
B: How do we just solve that? How do we make it so that when we get to a client we can just go, yeah, fine, getting the data in is easy, and then we can start fixing your actual business problem and start getting into the nuts and bolts of it, right? You're going through client after client after client, and the tech is moving, and each time you do it there's a slightly better approach you can take. And then I find myself looking back going...
B: It's been a long time since I've actually looked at the customer data, honestly. You know, you can build an almost abstracted "how do you manage data", and I don't care what the data is about. You care about the shape of the data, how fast the data comes in, the volume of it, the requirements for doing that stuff. But whether you're a bank, or a retailer, or in marketing, or whoever you are, actually they all have the same data problems.
B: You know, there are some real common, similar data-engineering-style problems that you see across it, and that's when you start talking data engineering, right? So we mentioned Spark data engineering, but back when we started it was BI, ETL, all this. And exactly: there's a transition that people are making, going, you know, I'm no longer just building an SSIS package so I can get it into a cube.
B: I'm now designing a reasonable data pipeline that I'm actually sort of programming, and I'm actually having to take software engineering principles to make it decoupled, and have a microservice-style architecture and all of that kind of stuff. And that's a big shift, and I think that scares a lot of people. A lot of people are going: oh, they're talking software engineering, they're talking coding standards, they're talking unit testing on a Python build, making a wheel, and that sounds so far away. And, yeah, it has...
B: Things have changed in terms of what it takes to actually do that stuff. But for me, it's changed in that the amount of work has gotten much less, right? It's just slightly more complex.
B: You know, each new feed I need to load in, and you sit there copying and pasting a templated SSIS package and changing the connection, and the next one, and then the next one. Or you write a Biml script. Biml is a markup language, awesome tool, but geez: just writing C# and XML and having to hack them together into a nested loop that then generates things for you on the fly is fairly painful, to say the least. And, you know, I'm fairly cheeky about the, you know, DevOps story, right?
B: You've still got to get them out, and so that takes DevOps and deployments and slick processes. Whereas, you know, with Spark and all the modern stuff, we can just write a generic, reusable package that's metadata-driven. So if you want to say, I want to onboard a new data set, that's then configuration; it's a bit of JSON I'm going to add.
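The metadata-driven pattern Simon describes can be sketched in a few lines: one generic loader, with each feed described by configuration, so onboarding a new data set is another config entry rather than another copy-pasted package. This is a minimal sketch; the feed names, fields, and the stand-in reader are all invented:

```python
# Config, not code: one entry per feed, all driven through the same loader.
FEEDS = [
    {"name": "sales",     "source": "sales.csv",     "key": "order_id"},
    {"name": "customers", "source": "customers.csv", "key": "customer_id"},
    # Onboarding a new data set = adding one dict, not a new package:
    {"name": "stores",    "source": "stores.csv",    "key": "store_id"},
]

def load_feed(feed, read_source):
    """Generic load path: dedupe on the configured business key."""
    seen, deduped = set(), []
    for row in read_source(feed["source"]):
        if row[feed["key"]] not in seen:
            seen.add(row[feed["key"]])
            deduped.append(row)
    return deduped

def fake_reader(path):
    # Stand-in for real storage access; returns a duplicated row on purpose.
    return [{"order_id": 1}, {"order_id": 1}, {"order_id": 2}]

loaded = load_feed(FEEDS[0], fake_reader)
print(len(loaded))  # 2
```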
B: And, like, the whole, you know, slogan is: you can't deploy faster than not deploying, right? So if it doesn't involve deployments, that's going to go faster, no matter how good your code-generation stuff is. And it's like a whole evolution of thinking about that as an almost isolated problem, right? So going from very much "the customer's trying to predict this value", or "they're trying to report on this thing", where it's all about that end user, to taking steps back: how do you get smarter?
B: It becomes a separate technical challenge of its own, and that's super interesting to me, and that's where we are now. All these tools around Delta Lake and all these kinds of new technologies, they're all just evolutions of that same thing. They're all ways for us to make that slicker: remove work, remove pain. You know, so the way we'd been building data lakes, we used Parquet...
B: ...because it's a columnar store and it's super fast for aggregations and all that. And then your client's going, yeah, but I still need some Kimball-style stuff, so can you make that a slowly changing dimension? Slowly changing dimensions, the pain of my life, follow me everywhere. And you're having to build something going, okay, so I've got a gigantic table of Parquet, I've got some change coming in, so I need to write a script that says lift them both up...
B: It's like, that's my life: just taking buckets and buckets of script and going, okay, now it's just that much script; okay, now it's that much script; now it's just one command. And then life is just getting easier, and that's what does it for me. Delta is just a whole bucket of utilities, meaning a load of stuff that we could do before...
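The "one command" Simon lands on here is Delta Lake's MERGE (`MERGE INTO` in SQL, or `DeltaTable.merge` in the Python API). As a toy in-memory sketch of the upsert semantics it replaces those buckets of script with: matched keys get updated, unmatched keys get inserted. The table contents below are invented:

```python
# Target "table" keyed by id, plus a batch of incoming changes.
target = {
    1: {"id": 1, "city": "Seattle"},
    2: {"id": 2, "city": "London"},
}
changes = [
    {"id": 2, "city": "Leeds"},     # existing key -> update
    {"id": 3, "city": "New York"},  # new key -> insert
]

def merge(target, changes, key="id"):
    """Upsert: update when matched on the key, insert when not matched."""
    for row in changes:
        target[row[key]] = row
    return target

merged = merge(target, changes)
print(sorted(merged))  # [1, 2, 3]
```

In Delta itself the same intent is one statement, with the engine handling the file rewrites, concurrency, and the transaction log underneath.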
B: ...it just now takes a hell of a lot less work, and it's just a lot more approachable. And approachable is, like, the key, right? That whole imposter syndrome: I can't do big data, I'm not a big data person, I've never... I don't write Scala, I didn't do MapReduce back then. Being able to say, you know what, actually, if you can write a bit of SQL, you can actually use Delta, and you can start doing things that encapsulate all of the Parquet, all of the big data stuff...
B: ...all the big, massively parallel processing, distributed engine stuff. Basically, if you can write a bit of SQL, you can now use a load of it, and that's cool. And that's the big shift that, for me, has happened in the past... it's only really the past three, four years that it's become that approachable. Okay, you could do it before, but it took a bit more engineering and config and setting up, you know. So, HDInsight...
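For contrast, the map/shuffle/reduce plumbing that "doing big data" used to imply looked roughly like this, for a count that today is a one-line `GROUP BY` in Spark SQL. A pure-Python sketch with invented input lines:

```python
from itertools import groupby
from operator import itemgetter

lines = ["spark is fast", "delta is reliable", "spark is approachable"]

# map: emit (key, 1) pairs for every word
mapped = [(word, 1) for line in lines for word in line.split()]

# shuffle: bring each key's pairs together (groupby needs sorted input)
mapped.sort(key=itemgetter(0))
shuffled = {k: [v for _, v in g] for k, g in groupby(mapped, key=itemgetter(0))}

# reduce: sum each key's values
counts = {word: sum(ones) for word, ones in shuffled.items()}

print(counts["spark"], counts["is"])  # 2 3
```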
B: But I came fairly late to that stuff, honestly, because I went on a meandering journey, so...
B: Yeah, good. And then, you know, Data Factory v1 came out, and it was just the biggest bag of spanners known to man. It's like, great... but we went down the whole path of Data Lake Analytics. That was my first real kind of foray into that stuff. Oh my god, writing U-SQL. You know, because I'd been writing C# for various different things, and going, okay, it's like C# and SQL jammed together. I get this.
B
I
think
this
makes
sense
and
building
code
generators
right
to
spin
up
adla
jobs.
They
went
on
demand
and
that's
like
that's
great,
I'm
not
having
to
I'm
not
having
to
build
anything.
I've
got
my
little
job
function,
but
on
the
fly
writes
some
new
sql
scripts
kicks
the
job
off.
That
does
some
stuff.
B
I
I
don't
need
to
write
ssis
anymore,
I'm
just
like
happy
happy
days
and
then
from
there
you
know
that
evolved.
Like
kind
of
future
uncertainty,
kind
of
things,
data
breaks
came
released.
I
managed
to
sneak
on.
I
was
so
there
was
a
microsoft
internal
training
course
for
likes
of
some
of
the
microsoft's.
Like
csa
or
yeah,
yeah,
sp,
yeah,
exactly
yeah
and
me
as
a
partner,
we
were
like
you
know
what
we're
super
interested
in
databricks.
Now
we
can
sneak
you
in.
B: So it's like, me and, I think, a couple of others were just about the only people who weren't Microsoft in the room, going, they've not noticed, it's cool, when they were first doing the Databricks rollout in the UK. And then, you know, we just tried to use it, trying to piece it in. And that's fairly late in terms of Spark, right? We're already talking versions of Spark that had become so much easier. I'm looking at it going, wow, I am so glad I didn't start back in 2012 when it was literally just cranking things by hand. So, when you started, when you did your hop, was that RDD land? Was that it?
A: Oh, absolutely. No, no: I was involved with Spark back in 0.7, so when the project was still in Berkeley, right? So I was starting to mess around with it then, shortly after we made HDInsight beta, like when the name Isotope was still prevalent. And, by the way, there were only like nine of us that created that project, which is pretty sweet.
A: I had already dived into Spark, because a bunch of us who had been working on the project recognized some of the issues with Hadoop, right? It allowed us to process massive amounts of data, which was great, especially for the scenarios that we were trying to address. But there wasn't the same level of clarity that, for example, you have right now. At the time...
A: ...it was just more like: you couldn't process that data, period. So it wasn't about speed anymore; it was just about the fact that I couldn't even do it, right? Like, with the sheer... look, at the time, terabytes, hundreds of terabytes, was considered a really hard thing to do, and we were approaching petabytes already at that time. This was like 10 years ago, right? So, you know, that's why we had no choice but to distribute the problem.
A: It was a while before Michael Armbrust and Matei introduced this concept of schema RDDs, which of course became what is now known as DataFrames, back in the Spark 1.0 to 1.3 time frame, right? And so the funny story was that I actually had a regular sync with my friends at Yahoo, because, you know, that 24-terabyte cube, right? And what ended up happening is, purely by accident, all of our meetings ended up being about Spark.
A: So it was just like, oh, okay. Well, it wasn't like I was telling them Spark, or they were telling me; we both came to that conclusion in separate ways. At least in our case it was just, like I said, purely about the fact that the data was that large: we couldn't process it, so we had to have distribution, but we still wanted it to not take three days to process the data. Yeah, yeah, I'm just saying, you know, yeah, exactly. I mean, I realized...
A
A
A
The idea also that these queries that would take maybe hours were now taking
like
you
know,
minutes
and
so
that
and
then,
but
yes
exactly
to
your
point,
actually
the
running
joke was that, because I
was
actually
trying
to
initially
work
with
hadoop
right
in
terms
of
actually
working
with
the
internals
and
all
that
stuff
and
then,
of
course,
invariably
that
led
to
working
with
spark
and
its internals.
So
don't
get
me
wrong.
I'm
not
trying
to
pretend
that
I'm
really
good
at
what
I'm
doing.
A
I'm not; I'm okay, right?
I
started
writing
a
lot
of
my
code,
of
course,
in
scala
right
and
then,
and
so
the
reason
I
wanted
to
just
do.
A
real,
interesting
callout
is
because
holden
karau
she's
one
of
of
the
awesome
people
that
was
able
to
push
forward
with
pyspark
right.
So
because
you
know
why
we
have
pyspark,
I
mean
don't
get
me
wrong,
I'm
not
trying
to
discredit
other
people,
I'm
just
simply
calling
out
there.
A
There
are
a
lot
of
really
good
people
and
holden
was
one
of
those
people
right
that
helped
push
through
pyspark.
The
the
reason
I'm
calling
out
this
running
joke
is
because,
as
I
started,
diving
into
data
science,
of
course I just started
doing
python
myself
and
then
did
pyspark,
but
when
she
wanted
to
do
performance
she
started
getting
into
scala.
So
invariably,
even
though
I
started
in
scala
and
she
started
with pyspark,
her
most
recent
spark
book
was
written
in
primarily
in
scala.
A
I.e. the person who helped us create pyspark is now
writing
in
scala
and
then
my
first
book
in spark was learning pyspark.
That
book
actually
was
well
obviously
written
in
python,
even
though
I
was
a
scala
engineer
first,
and
so
yes,
and
at
that
time,
that
as
the
evolution
happened,
I
was
perfectly
happy
saying:
oh,
it's
bouncing
back
and forth between
scala
and
python.
In
fact,
here's
a
shameless
plug
to
learning
spark.
A
A
That
was
insanity
on
its
own
right,
okay,
but
I
do
mean,
like
writing,
like
at
least
it
wasn't
java.
So
at
least
I
wasn't
like
writing
like
the
reams
of
java
code,
but
still
I
would
write
like
scala
code
or
exactly
to
your
point,
like
the
merge
statement,
like
you
know
now,
with
spark.
Actually,
the
scala
api
is
actually
pretty
smooth
too,
but
the
idea
originally
what
I
had
to
write
in
scala
versus
right
now.
A
A
simple
little
merge
statement
is
in
sql
yeah,
exactly
to
your
point
like
as
time
progresses
and
as
people
who,
in
some
ways
I
wish.
I
didn't
hadn't
I
mean.
Obviously
I
don't
wish
I
didn't
have
to
do
it,
but
in
terms
of
if
I
was
starting
from
a
data
warehousing
perspective,
oh
boy,
yeah
it'd be
a
lot
simpler
to
jump
to
it.
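The merge pattern being discussed here, sketched as a spark sql upsert against a delta lake table (the table and column names are hypothetical, purely to illustrate the shape of the statement):

```sql
-- Upsert: update rows that already exist in the target, insert the rest.
-- `events` and `updates` are hypothetical table names.
MERGE INTO events AS target
USING updates AS source
  ON target.event_id = source.event_id
WHEN MATCHED THEN
  UPDATE SET target.value = source.value, target.updated_at = source.updated_at
WHEN NOT MATCHED THEN
  INSERT (event_id, value, updated_at)
  VALUES (source.event_id, source.value, source.updated_at);
```

Hand-writing the same upsert as a join-and-rewrite in scala is exactly the boilerplate being contrasted with the sql version here.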
A
Now,
because,
with
spark sql you
have
the
friendliness
and
the
awesomeness
and
the
power
of
your
sql
language,
yet
it
actually
can
be
distributed,
which
is
awesome,
sauce
right
and
at
the
same
time,
the
idea
that
now
we
have
delta
lake,
it
allowed
us
to
go
ahead
and
actually
have
acid
transactions
and
the
combination
of
the
two
together
allowed,
especially
with
some
of
the
apis
that
were
included
in
spark
3.0
with
delta,
lake
0.7.0.
A
That
allowed
us
to
actually
significantly
simplify
the
manageability
of
a
distributed
system,
because
we
all
know
a
distributed
system
actually
is
harder
to
maintain,
not
easy
to
maintain
like
if
I
was
to
maintain
a
single
sql
server
instance.
That's
actually
not
that
hard.
If
I'm
trying
to
maintain
50
of
them,
that's
a
little
tricky
yeah,
just
just
a
little
bit
so
yeah.
Sorry.
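The spark 3.0 plus delta lake 0.7.0 sql support mentioned above covers things like defining and inspecting delta tables straight from sql; a rough sketch with hypothetical names (exact command support varies by delta lake version):

```sql
-- Define a Delta table directly in SQL (Delta Lake 0.7.0+ on Spark 3.0).
CREATE TABLE customer_orders (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DECIMAL(10, 2),
  order_date  DATE
) USING DELTA
PARTITIONED BY (order_date);

-- Every write is an ACID transaction; the transaction log can be
-- inspected afterwards to see who changed what, and when.
DESCRIBE HISTORY customer_orders;
```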
B
But those kinds of things, you know. So many clients
like
we've,
worked
with
they've
gone.
Like
you
know,
someone
high
up
has
gone
we're
having
a
modern
platform,
we're
going
to
get
people
in
and
we're
going
to
put
a
modern
platform
in,
and
we
speak
to
the
warehousing
guys
and
they're
like
oh,
I
don't
want
to
learn scala or python,
they're,
just
looking
in.
A
B
Got
an
amazing
distributed
system,
we're
gonna,
do
some
cool
stuff:
let's
go,
and
then
you
start
working
and
it's
like
how
do
you
get
the
data
in
okay?
Well,
this
is
a
merge
statement.
I'm
like
wait.
What
it's
just
like
this
anchor
point
of
familiarity-
and
you
know
we
teach
them
pyspark.
We
teach
them.
B
B
A
B
Ham-fistedly
going
that's
probably
gonna.
A
Work
so,
okay,
we
actually
only
have
16
minutes
left
and
I
just
so
this
one
we
went
a
little
longer
than
we
thought,
but
all
right,
actually,
let's
dive
right
into
some
of
the
questions,
because
actually
this
is
a
good
segue.
So,
for
example,
let's
go
back
to
the
first
one,
which
is,
I
understand,
delta
lake
from
previous
presentations.
But
what
is
what
are?
What
is
a
lake
house
like?
Why
are
lake
houses
important,
so
we
actually
covered
it
without
actually
explicitly
calling
out.
A
So
why
don't
you
start
and,
and
then
I'll
chime
in
as
well
for
that
matter.
B
Okay,
so
specifically,
data
lakes
have
an
absolute
ton
of
flexibility
and
power
and
all
that
crazy,
big
datay
stuff
of
handling different kinds
of
exotic
data
types.
Vast
amounts
of
data
streaming
all
that
kind
of
stuff
data lakes
are
awesome
at
it,
but
structurally
having
that
kind
of
hey
I've
got
a
schema.
I
know
what
structure the data is in.
I'm
doing
some
regulatory
reporting
and
I
need
to
actually
manage
this
thing.
I
don't
want
to
use
it
to
come
in
and
just
accidentally
delete
my
data
because
they've
got
access
to
that.
B
All
of
that
kind
of
stuff
has
always
historically
been
a
little
bit
flaky
in
the
data
lake
land,
then
over
on
the
warehouse
side.
You've
got
all
of
that
so
transactional
consistency,
management
of
schemas,
deployability
control,
auditing,
awesome,
and
then
you
know
you
try
and
get
jason
in
there
and
it's
like.
B
Oh, we've got a json column now. How do you write a json parsing thing right there?
oh
god,
and
it's
just
like
there's
so
many
things
that
it's
just
hard
and
especially
these
days
when
people
are
going
there's
some
new
data
we've
got
an
opportunity.
Can
we
take
advantage
of
that
new
data
and,
like
traditional
warehousing
teams,
are
going
yeah
yeah?
Absolutely
our
next production deployment
is
scheduled
for
three
weeks
time.
Is
that
soon
enough
I
mean,
I
know
three
weeks
is
being
generous.
B
Normally
we're
talking,
you
know,
monthly
cadences
at
the
end
of
this
project
we
might
do
a
deployment
and
then,
by
that
point
you've
missed
the boat.
You
know,
so
it's
missing
out
on
all
the
stuff
in
the
lake
that
you
can
do
just
you
know
what
actually
bring
some
data
in.
Let's
do
some
generic
landing
of
some
data
and
we'll
figure
out
what
we
want
to
do
with
it
later
and
like
there's
kind
of
the
two
different
sides
of
the
fence.
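The json flexibility being described can be sketched in spark sql, where semi-structured files sitting in the lake are queried in place and nested fields addressed directly (the path and field names are made up for illustration):

```sql
-- Query raw JSON files in the lake without loading a warehouse first.
CREATE TEMPORARY VIEW raw_clicks
USING json
OPTIONS (path '/lake/landing/clicks/');

-- Nested attributes come out with plain dot notation.
SELECT device.os, COUNT(*) AS clicks
FROM raw_clicks
GROUP BY device.os;
```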
B
A
Perfect.
Well
I
mean
I
don't
think
I
could
do
much
better.
I
I'll
just
do
the
short,
the
short
phrase,
tag
line,
which
basically
is
best
of
both
worlds
of
data,
warehousing
and
data
lakes,
the
manageability
of
warehouses
with
the
flexibility
of
lakes,
right
that
that's
the
the
marketing
tagline,
but
you
dive
in
a
little
bit
deeper
when
you,
I
think
simon
called
it
out
perfectly
right.
The
reality
is
like
we're
not
done
yet.
A
There's,
obviously
things
the
community
as
a
whole,
whether
it's
the
spark
community
or
the
delta
lake
community,
or
just
the
overall
data
community.
You
know
we
still
have
more
work
to
do.
Let's call a spade a spade here.
There
are
things
that
we
can
do
to
improve,
but
the
reality
is
that's
what
lake
houses
are,
which
is
to
say
that
for
me
like
okay,
actually,
I
know
this
sounds
like
a
like.
A
I'm
going
off
track
a
little
bit,
but
in
fact
it's
actually
an
important
component
like
when,
when
simon
and
I
started
talking
about
our
past-
it's
not
just
because
we're
reminiscing.
Okay,
I
mean
yes,
we
are
too
okay,
but
but
the
reason
why
we're
doing
this
is
because
there's
a
fundamental
concept
that
there's
actually
a
model,
that's
supposed
to
be
applied
to
your
data
right
there's.
Actually,
your
data
is
actually
important
right
when
we
went
to
lakes.
The
whole
premise
is
that
we
were
just
trying
to
chuck
the
stuff
in
as
fast
as
we
could.
A
So
we
kept on saying things like schema on read, schema on read, schema on read: we'll worry about the schema later, we'll worry about whether it's important later, right,
and
there
is
value
to
that
statement
by
the
way,
I'm
not
saying
there
is
no
value
to
that
statement.
Quite
the
opposite.
A
We
needed
to
actually
go,
read
it
and
do
something
with
it
and
build
a
model
on
it
I.e.
The
schema
is
the
model
for
your
data
right.
We
needed
to
do
things
like
that.
So
that
way,
people
remember
what
the
value
of
that
is. We have
these
awesome
marketing
and
I'm
being
very
facetious.
When
I
say
this
awesome
marketing,
taglines,
like
oh
yeah,
there's
the new
oil
is
data
or
the
new
gold
is
data
or
whatever
you
know
that
type
of
bs
right
and
that's
that's
a
great
marketing
tagline,
A
But
the
fact
is
the truth of the statement,
If,
even
if
I
was
to
follow
the
marketing
tagline,
is
that
yeah
there's
a
lot
of
work,
though
right
oil
doesn't
just
automatically
come
out of the ground already processed, right?
data
is
the
same
thing
right,
so
it
required
us
to
do
a
lot
of
things
to
to
make
sense
of
it
and
so
for
me,
it's
not
just
about
manageability.
A
It's
also
about
remembering
the
value
of
your
data
and
reapplying
that
back,
which
is
why
lake
houses
are
so
important
to
me,
because
it's
not
just
the
technical
construct
or
you
know,
like
you,
sometimes
hear
us
say
paradigm
and, admittedly,
That's
a
marketing
term.
So
I
also
apologize
for
using
that.
But
the
reason
we
often
use
the
word
paradigm
is
because
it's
not
it's,
because
it's
not
just
a
tactical
innovation.
A
B
B
Right
and
then
there's
a
ton
of
like
evolution,
that's
been built on
it,
which
is
all
the
data
management,
patterns
and
things
like
slowly
changing dimensions
and
things
like
auditing,
lineage
columns
and
things
like
making
a
fact
table
with
all
your
data
quality,
lineage
data
and
calling
that
an
audit
fact,
and
if you think about it,
it's
just
data
management,
and
I
can
do
that
in
a
lake
really
easily
and
you
know
so
it
used
to
be.
B
You
know
if
you
had
like
some
kind
of
fact
table
and
dimension
tables
that
didn't
perform
too
well
in
spark,
but
spark
3,
because
we
got
dynamic partition pruning,
so
you
can
actually
filter
your
date table and have that actually correctly filter the fact table.
Suddenly
that
unlocks
all
that
stuff
that
stuff's
a
hell
of
a
lot
easier
to
do
so,
you
can
actually
get
a
lot
of
that fairly easily, which,
if
you
talk
to
a
big
data
person
and
say
I
am
doing
kimball
in
my
lake,
they
go.
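The spark 3 star-schema win being described, dynamic partition pruning, kicks in on queries shaped like this (hypothetical kimball-style tables, with the fact table partitioned on the date key):

```sql
-- Fact table partitioned by date_key; dim_date is a small dimension.
-- With dynamic partition pruning (Spark 3.0+), the filter on dim_date
-- is pushed across the join so only matching fact partitions are scanned.
SELECT d.fiscal_quarter, SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_date d
  ON f.date_key = d.date_key
WHERE d.fiscal_year = 2020
GROUP BY d.fiscal_quarter;
```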
B
B
A
B
So
you
can
go
to
a
certain
part
of
your
lake
and
you
know
the
data
in
there
is just as it came in. It's enter at your own risk.
You
need
to
be
able
to
understand
the
data
that
might
not
be
right.
The middle bit: that data
has
been
sanity
checked,
it's
been
sense-checked,
it's
had
some
quality
cleaning
done.
You
can
go
and
trust
it,
but
it's
still
its
original
format
so
requires
a
bit
of
skill
to
figure
it
out
and
know.
B
What's
going
on
there
and
then
some
kind
of
curated
this
is
this
is
managed
trustworthy.
This
has
been
shaped
for
ease
of
querying
now.
Sometimes
facts
and
dimensions
make
sense
for
that,
because
the
kind
of
data
and
the
kind
of
people
you're
showing
it
to
sometimes
it's
a
big
wide
reporting
table.
Sometimes
it's
a
mix
of
the
two
in
some
other
shape
and
that's
absolutely
fine
and
that's
just
options
right.
That's
just
that
you.
B
We
now
have
the
flexibility
for
all
of
the
different
data
management
paradigms
that
we've
used
to
get
in
so
many
different
places.
They
can
actually
fit
and
there's
no
longer
a
technological
barrier
saying
you
have
to
use
this
way.
You
have
to
use
it
you're
not
allowed
to
do
a
star schema.
It's
now
manage the
data
in
a
model
that
makes
sense
for
your
business
purpose,
which
is
great
right.
B
A
Right
and
actually
exactly
to
your
point,
so
this
is
a
slightly
plug
for
the
databricks
youtube
channel,
but
I
did
want
to
call
out
that,
like
in
the
databricks
data
and
ai
online
meetup
that
you're
on
right
now
in
the
databricks
youtube
channel.
We
actually
have
videos
a
la
kimball
talking
about
surrogate
key
generation,
the
importance
of
them
right
for
delta
lake
for
data
lakes,
slowly
changing
type
two
dimensions
in
your
data
lake
right
cdc
in
your
data
lake
right
it's
so
exactly
to
simon's
point
it's
like.
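A slowly changing dimension type 2 load, one of the warehouse patterns those videos cover, can be sketched as a delta lake merge (the dimension and its `is_current`/date columns are hypothetical, following one common SCD2 convention; a full load also needs a second pass, typically via a staged union, to insert the new version of rows it closes out):

```sql
-- Close out the current row for changed customers, insert brand-new ones.
MERGE INTO dim_customer AS t
USING staged_customers AS s
  ON t.customer_id = s.customer_id AND t.is_current = true
WHEN MATCHED AND t.address <> s.address THEN
  UPDATE SET t.is_current = false, t.end_date = current_date()
WHEN NOT MATCHED THEN
  INSERT (customer_id, address, is_current, start_date, end_date)
  VALUES (s.customer_id, s.address, true, current_date(), null);
```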
A
I
get
the
idea
of
using
this
these
techniques
and
in
some
cases
people
are
going
like.
Why
would
you
try
to
apply
that,
to
you
know
a
data
warehousing
technique
to
a
data
lake
and
I'm
going
like
well
because
it
it's
not
like
the
concept
was
wrong.
The
concepts
actually
make
a
ton
of
sense.
It's
just
that
now,
with
you
know,
spark
especially
with
spark,
3.0
and
and
and
and
delta,
like
I'm
actually
able
to
do
that
now.
A
B
Yeah,
so
previously
it
was
a
square
box.
I
had
a
massively
round
peg,
I'm
just
trying
to
squeeze
into
it
and
try
to
force
it
to
it
right.
Well,
it's
not
to
say
that
absolutely
everything
still
stands
right.
You
know,
from the olden days.
If
you
ever
had
a
string
or
sorry
have
you
ever
had
a
varchar
on
your
fact
table,
then
you
are
the
devil
and
that's.
B
Right,
that's
fine!
You
can
denormalize
some
things
onto
your
fact
table
for
ease
of
querying
because
actually
parquet, with column compression, dictionaries, run-length encoding.
All
of
that
stuff
squeezes
down
really
nicely
exactly
exactly
so
there's
some
stuff
so
that
that's
like
the
use
case
right
if
you've
got
like
a conformed
dimension,
you've
got
a
hierarchy
and
you
need
to
manage
that
if
you're
insisting
on
having
your
big
wide
reporting
tables,
and
then
you
need
to
change
how
your
product
hierarchy
works,
and
you
suddenly
have
to
regenerate
all
of
your
giant
transaction
reporting
tables.
B
B
You
can
use
and
go,
but
that
still
makes
still
makes
sense
but
yeah,
but
for
a
long
time
it
was
very
much
the
you
weren't
one
of
the
cool
kids
if
you're
trying
to
do
like
some
traditional
data
modeling
in
big
data
land,
but
people do want
to
do
that
and
those
things
make
sense
the
business
and
that's
how
they
think
about
their
data
in
a
lot
of
places.
A
Cool
okay,
so
you
know
what
we're
probably
gonna
need
to
wrap
up
in
two
minutes.
I
just
realized
because
we
were
running
along
so
for
all
the
people
that
have
asked
questions
in
both
youtube
live
and
the
q
a
or
the
chat
we
apologize
for
not
getting
to
all
of
them.
I
just
want
to
start
off
with
that,
based
on
the
feedback
that
we're
getting
it
sounds
like
simon.
You-
and
I
probably
should
do
this
a
couple
more
times,
so
so
so
we'll
definitely
plan
to
do
so.
A
A
B
B
A
No,
no,
no!
No!
It's!
Okay!
I'm
a
kimball
guy,
too,
all
right!
So
most
people,
a
lot
of
my
sql server
friends
are
like
are
like,
like
how
can
you
say:
you're
a
kimball
guy
when
you
went
into
hadoop
and
I'm
like
you
know
what
there
is
a
fairness
to
that
statement.
There
really
is
okay,
so
I'm
not
gonna
go
ahead
and
actually
disagree
with
that.
A
B
A
Idiot
all
right,
dude,
okay,
let's
wrap
it
up,
we're
gonna
have
to
go.
I
did
want
to
call
out
two
things
again,
one
the
youtube
link.
Is
there
put
your
questions
since
we're,
since
we
did
not
answer
your
questions
and
there
are
too
many,
we
apologize
simon
and
I
obviously
are
having
a
little
too
much
fun
here,
so
put
them
onto
youtube
chime
in
there.
A
We're
gonna
use
those
as
the
basis
for
our
next
show,
simon
I'll,
find
another
time
to
do
this
number
one
number
two:
we,
small
plug,
we
do
have
a
show
next
week
on
the
24th,
so
come
join
us
for
that.
Oh,
so,
that's
a
completely
different
show
on
the
automation
of
pyspark.
It's
actually
the
data
collab
lab
with
franco
myself,
so
it'll
be
a
little
bit
of
fun
there.
It's
also
very
much
sql
sql
centric,
so
I
definitely
would
love
you
guys
to
join
for
that.
A
B
Yes,
well
next
week,
we're
both
at
big.
A
data
london,
no,
no
I'm
not
going to
big
data,
london,
but
my
boss,
ali
is
gonna,
be
at big data london.
No,
no!
It's
okay,
we're
still
gonna
be
there.
I
just
I
can't
go
for
other
reasons.
That's
the
reason
why
that's
all
sorry.
B
So
yeah
ali
is
doing
a
keynote
on.
This
is
the
data
lake
house.
This
is
why
we
did
this
whole
thing,
I'm
doing
a
session
which
is
actually
here's
all
the
actual
individual
bits
of
delta,
which
enables
data
lake
house.
So if that's something you guys are interested in, big data london is
happening
next
week
on
wednesday
thursday,
I'm
going.
A
B
Yeah
well
again,
thanks
for
coming
again,
you've
got
your
youtube
channel.
I've
got
my
youtube
channels
to
look
out
for
advancing
analytics
and
we're
talking
all
things.
databricks and spark
and.