From YouTube: D3L2: The Journey Unifying Data Lake and Data Warehouse with Robert Kossendey at Claimsforce
Description
In this D3L2 episode, we chat with Robert Kossendey, Tech Lead at Claimsforce, about their journey unifying the data lake and the data warehouse. As Robert's team builds and expands, they chose Delta Lake and AWS Athena as the foundation for their lakehouse.
Quick Links
Read Our Newest Blog Post: https://delta.io/blog
Robert Kossendey: https://www.linkedin.com/in/robert-kossendey-303b0019a/
Denny Lee: https://www.linkedin.com/in/dennyglee/
Join us on Slack: https://go.delta.io/slack
Join the Google Group: https://groups.google.com/forum/#!forum/delta-users
Denny: Welcome to the new session of D3L2. Yeah, that's the name of this, right? Yes, it is. We're going to be talking to Robert here from Claimsforce about their journey of going from a data warehouse to a data lake using Delta Lake. We've still got a few minutes, so Robert, why don't you just introduce yourself a little bit, and then we'll do the formal introductions once we get the LinkedIn and YouTube streams going.
Robert: Sure, yeah. Hi, I'm Robert, the tech lead at Claimsforce. We are a small InsurTech startup based in Hamburg, Germany, and I'm leading the data team at our company; I'm responsible for the data architecture. I've been there for more than three years now. I started as a working student, transitioned into the data engineering role, and then also started leading our data team, or rather became the technical lead for our data team. Happy to be here.
Denny: Okay, all right, perfect. It looks like we are live on YouTube and live on LinkedIn. So thank you everybody, welcome aboard. All right, now we're going to do our official introduction. Give me a second; let me just type the welcome aboard into all our different channels. And I believe there's actually a Twitch stream going too, if I recall correctly. Yeah, that's pretty sweet, eh?
Denny: Yeah, so I believe Twitch is going; at least, we'll find out whether we're doing it or not after the fact, which is usually how that ends up happening. But saying that, welcome aboard again, everybody. This is a session of D3L2, The Journey Unifying Data Lake and Data Warehouse with Robert at Claimsforce. I want to have Robert introduce himself, but I did want to do a little housekeeping first. I just want to let you all know that there's the Data Council Austin conference, Data Council Austin 2023; that's in, well, Austin, Texas, from March 28th to March 30th. A bunch of us are going to be there, so if you want to look at the latest when it comes to data engineering and data science, make sure to show up at the Data Council Austin conference. All right, perfect, I did my little housekeeping callout.
Denny: Oh yes, and yes, we are on Twitch. The Twitch channel, by the way, is twitch.tv/deltalakeoss, just to let you know. All right, now, saying all that, Robert, why don't you introduce yourself more formally to this audience? Give people a little bit about your background and we'll go from there.
Robert: Yeah, sure. So, hey, I'm Robert, the tech lead of the data team at Claimsforce. We are a small InsurTech startup based in Hamburg, Germany. I joined Claimsforce more than three years ago, starting as a working student. I came directly from university; I studied computer science and business at the University of Applied Sciences in Hamburg, and because I had an interest in, or a focus on, data during my studies, I transitioned into the role of data engineer at Claimsforce and then also took leadership of our tech team in data. I'm now responsible for our whole data architecture and the data engineering department, plus BI and data science. Yeah, that's me. Thank you for the invite today; I'm happy to talk about our journey from the data warehouse and data lake to the lakehouse, finally.
Denny: Perfect. So actually, let's go backwards a little bit. You said you started off in Hamburg, going into university there, so I want to understand a little bit more about your background. We all have slightly different ways of getting into the data space, right? For example, I myself was originally going to go into gaming, then made this really weird turn into this field, and now I'm in data.
Robert: To be honest, I applied for computer science and business just out of curiosity; I've always been curious about working with computers. I didn't really know what to do with the degree, essentially, but I knew I wanted to do something with computers. During my last year at university I attended some NoSQL and big data lectures and really got interested in distributed computing, learning all about the Hadoop ecosystem, Spark, all the good stuff, and this led me to an interest in data itself. There was also a little bit of data science I did during my university time, which further grew my interest in data. So yeah, it kind of came naturally; it's not like I started my studies with the goal in mind of becoming a data guy.
Robert: No, no, we did a lot of Java at university, but also Python. The main part was Java, but once you specialized, the lectures, at least the data ones, were primarily Python.
Robert: Sure. We are a startup that focuses on insurance claims, so we provide software for participants in the insurance market. Let me give you an example. Let's say you have a water leakage claim at home. If it's a bigger one, then probably someone needs to come to your home: a so-called expert who assesses the claim. We provide software for those experts, and also for the clerks and back-office employees at insurers who manage those experts. So we help them do the disposition, we help them assess the claim correctly, providing them tools for measuring, for assessing the claim essentially, and then also for writing a report that states the claim damage and how much needs to be repaid to the policyholder.
Denny: Oh, that makes sense. So basically, in terms of data speak, you've got a lot of data that comes in with the initial claims from the original user, and you need to figure out how to process it. You provide data that allows the adjusters to determine what the actual claim is supposed to be worth, and you aggregate it so you can do financial reporting from both an individual perspective and an aggregate perspective. That way, as a startup, you provide that software as a service to everybody, so you can process all this information, yet at the exact same time make it as streamlined and as efficient as possible, so that folks aren't wasting money on the process.
Denny: So, basically, okay, wow, that's eye-opening, oh my goodness. Okay, all right, so no wonder you guys exist, because there's obviously a lot of things to optimize for. Okay, cool. Well, okay, sorry, I digress for all the folks there, but that just blew my mind. All right, let's talk a little bit about this. In terms of the data space, you obviously have tons of data coming from all these different locations. Why don't you describe the original state? Before we do, to provide context to everybody (and we're going to add this to the LinkedIn and YouTube links afterwards), Robert has actually written three really cool blogs about that journey, and so this is really a discussion based on those three blogs, in all seriousness. So why don't you start telling us about that original state, back when this all started, when you had this data warehouse? What did it look like? What were the sources, the sizes? What were your issues with processing, yada yada yada. I'm just curious.
Robert: Yeah, so the original state was that we actually didn't really have a data architecture at all; we basically just dumped data into S3. I joined Claimsforce very early on. The startup is not even five years old now, so I joined very early, and back then we basically just dumped our production data, because we used DynamoDB, and doing analytical queries on DynamoDB is practically, well, not possible at all. So we dumped our data into S3 and used Glue crawlers so we could use Athena to query the data, and this turned out to be very impractical because, first of all, we didn't have any schema enforcement there. So we decided to team up, or partner, with AWS Solutions Architects, and they helped us build our initial architecture, which consisted of a raw landing zone in S3, with the data then loaded into Redshift for further processing. So that was the initial architecture we had: basically the raw zone in S3, where you could do ad hoc queries via Athena, and then Redshift for further processing.
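For readers following along, the "no schema enforcement" pain Robert describes is exactly what a table format later solved for them. Here is a minimal sketch of Delta Lake's write-time behavior; the table path and columns are made up for illustration, and a Spark session with the Delta extensions is assumed (a configuration sketch appears later in this transcript):

```python
# Assumes `spark` is a SparkSession with the Delta Lake extensions enabled.
path = "s3://my-lake/silver/claims"  # hypothetical table path

spark.createDataFrame([(1, "water")], ["claim_id", "type"]) \
    .write.format("delta").mode("append").save(path)

# A frame whose schema doesn't match is rejected at write time instead of
# silently landing as a mismatched file, which is what raw S3 dumps allowed.
try:
    spark.createDataFrame([(2, "fire", 9.5)], ["claim_id", "type", "oops"]) \
        .write.format("delta").mode("append").save(path)
except Exception as err:
    print(f"Rejected by schema enforcement: {err}")
```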
Denny: Gotcha, gotcha, gotcha. Okay, so you provided a ton of context. So then I guess the question is, from your perspective, what were the issues, the problems with your initial setup? You mentioned that DynamoDB was problematic for those types of queries; that's a fair assertion. I'm just curious what else there was, basically.
Robert: We had schema-on-read, so we could just dump our production databases and query them ad hoc, which was great, right? And the data lake supported all data formats, so we could also dump the ton of photos and video footage we have there. But it also came with some disadvantages. Running queries on raw CSV, JSON, or even Parquet files with Athena gives you fairly low performance, and we didn't have any ACID transactions there. It was also only ever a raw landing zone, because we were only able to do append-only operations; there were no merge or upsert capabilities on top of S3. That's why we then had Redshift as our data warehouse, but that also came with some disadvantages, mainly maintaining two different places, two different storage systems.
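For context, the merge and upsert capability Robert is describing is what Delta Lake provides on top of plain S3 objects. A minimal sketch in PySpark; the table path, update source, and column names are hypothetical:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (
    SparkSession.builder
    # Register Delta Lake's SQL extension and catalog, required for MERGE.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Hypothetical: an existing Delta table of claims plus a batch of updates.
claims = DeltaTable.forPath(spark, "s3://my-lake/silver/claims")
updates = spark.read.format("json").load("s3://my-lake/raw/claim-updates")

# Upsert: update matching claim rows, insert new ones, in one atomic commit.
(claims.alias("t")
    .merge(updates.alias("u"), "t.claim_id = u.claim_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```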
Robert: Yeah, exactly, exactly. This leads to data staleness, first of all, because you essentially have two steps, right? And Redshift is really, really expensive, at least in our experience, and you are not really able to scale storage and compute independently. Back then there was no Redshift Serverless, which maybe would have helped us mitigate some of the costs, but still, it was very painful to work with Redshift. And we really liked the Athena interface; we really liked just being able to query the data on top of S3. But we didn't have the capability to move all of the workload that was performed on Redshift to the data lake, because of those disadvantages I mentioned: not ACID compliant, no merge.
Denny: Makes a lot of sense. Oh, by the way, let me just pop in a question one of the attendees asked. The question is about Claimsforce's services, so this is back to the business side, but I apologize for missing the question; it was asked eight minutes ago, so again, my apologies. For what you're doing, is this Property and Casualty insurance industry claims?
Robert: Yeah, yeah, it's only P&C. We only focus on P&C; we're not in the health business.
Denny: Which we're not going to get into today. All right, perfect. So, was it because of the problems you were facing, like data duplication, costs, complexity of pipelines, that you knew you wanted a lakehouse per se? I'm just curious how you got to the point where you knew what you wanted. Was it early on, or not?
Robert: It was pretty early on. We quickly realized, also because new requirements from our customers surfaced, for example real-time data, which gets really difficult with Redshift, that this architecture was not meant to be there for eternity and we needed to come up with a different solution. So then I ventured out and looked for alternatives, pretty quickly stumbled onto the lakehouse approach, read the white paper, and went on from there.
Denny: Perfect, and the white paper, I'm presuming, is the one by Michael Armbrust et al., "Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores"? Exactly. Okay, perfect.
Denny: So for everybody there, I'll send a link and include it in all of our different chats. Sorry, hang on, I'm just trying to send it to everyone; there we go, all right. The context for everybody, basically, is that this particular paper is about Delta Lake, and you'll notice that Robert keeps talking about the ACID transactions. So now let's talk about that from your perspective, because, cool, you found the paper, but obviously you were looking for either ACID transactions for data lakes specifically or something akin to that. So I'm just curious: why did you care about ACID transactions? What was it about them? Because, not to age myself, but you're younger than me, okay? You didn't necessarily come up through the world where the idea was that I have to have databases, I have to have ACID transactions. I come from that world; I'm a former SQL Server guy, so I'm like, yeah, of course you need databases for transactions, until I realized I was wrong. So I'm just curious what led you to recognize the fact that you did need ACID transactions.
Robert: So, first of all, data quality is very, very important, or not just data quality: data correctness is very, very important to our customers. So eventual consistency is, let's say, a no-go. We really want to have up-to-date data and correct data, and without consistency there, with transactions that might fail partially and then write inconsistent data to a database, that is simply a no-go. That's why ACID transactions were a must when we considered our solution.
Denny: So let's start with that. Okay, you said eventual consistency is a no-go. To provide context to everybody listening who may or may not know this: eventual consistency is basically the default mode when it comes to working with S3, and for that matter many other systems. In fact, you'll probably see old comments of mine about ACID versus BASE. BASE is basically the realm of the Hadoop world, where we introduced Basically Available, Soft state, Eventually consistent, with the eventual consistency being the key part, versus ACID, which is Atomicity, Consistency, Isolation, Durability, basically the realm of the database world. And so, from your perspective, because you were working with S3 and knew you were dealing with an eventually consistent position, you were basically worried about data corruption. But I'm just curious: why were you concerned about data corruption? What was inherent about your pipelines, or whatever else, that was causing it to happen?
Robert: So we were using AWS Glue as our compute engine, essentially, or as our Spark environment, and Glue has this bookmarking feature that captures which files have been processed already. For some reason, and we weren't able to figure out why, it sometimes failed on us.
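For readers who haven't used it: Glue job bookmarks track which input a job has already consumed between runs, keyed by a transformation_ctx string, and the state only persists when the job commits (bookmarks must also be enabled on the job via the job-bookmark-option argument). A minimal sketch of the pattern; the catalog database and table names are hypothetical:

```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)  # loads any existing bookmark state

# transformation_ctx is the key Glue uses to remember which files this
# node already processed on previous runs; only new files are returned.
frame = glue_context.create_dynamic_frame.from_catalog(
    database="claims_raw",         # hypothetical catalog database
    table_name="dynamodb_export",  # hypothetical table
    transformation_ctx="read_claims",
)

# ... transform and write `frame` here ...

job.commit()  # persists the bookmark; skipping this reprocesses everything
```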
Denny: Right, right, right. So basically it was the nature of the pipelines you're processing, because there's so much of it. And by the way, this is related to a question that's coming in from the Q&A: do you also put your audio and visual data into your lakehouse, or is it primarily the structured data? I'm just curious about your perspective.
Robert: Both, although we don't have audio data, just video and photo data, and both reside in S3, exactly. So both can essentially be used in the same breath when you want to do any machine learning tasks, for example.
Robert: And there is the next thing we always have to consider: compliance, as we are based in Germany, and this is a very important topic here. It was amazing that you get an audit log and are also able to delete data, essentially. You can do row-level deletes on top of Parquet files, which is great for our use cases.
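As an aside, the row-level delete Robert mentions is a one-liner on a Delta table, and every such operation is recorded in the table's transaction log, which is where the audit trail comes from. A sketch with a hypothetical path and predicate:

```python
from delta.tables import DeltaTable

# Hypothetical table of claim records, keyed by policyholder.
# Assumes a Delta-enabled `spark` session.
claims = DeltaTable.forPath(spark, "s3://my-lake/silver/claims")

# GDPR-style erasure: remove all rows for one policyholder. Delta rewrites
# only the affected Parquet files and commits the change atomically.
# (Physically purging old file versions additionally requires VACUUM.)
claims.delete("policyholder_id = 'abc-123'")

# The table history doubles as an audit log of every operation.
claims.history().select("version", "timestamp", "operation").show()
```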
Denny: I do understand; that process is extremely important in Germany, yeah. All right, let's not go into that, just because I know I'll rat-hole into it. So, you started off by building your data lake on Delta Lake to get that transactional protection, and then you started, I believe, with Athena and Glue to work with it? Okay, so how did that go?
Robert: Yeah, so Glue, at least back then, inherently only supported two Spark versions, essentially 2.4, I think, and 3.1, and even the 3.1 was a fork from Amazon, I think. And there was no native Delta integration, so you had to essentially bring the JARs to Glue yourself, but that also meant you had compatibility issues. You cannot use the latest Delta Lake versions, because Delta Lake needs a specific version of Spark, and that's when we quickly realized, okay, if we really want to do this in production, we need to look for an alternative.
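"Bringing the JARs" here means supplying the delta-core artifact yourself and wiring up the Delta extensions by hand, roughly as below. The version numbers are examples only; Delta's compatibility matrix dictates which pairings actually work, which is exactly the constraint Robert hit:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Pull in Delta Lake as an external dependency; the artifact version
    # must match the Spark version (per the Delta/Spark compatibility matrix).
    .config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.1")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Delta 1.0.x is the line that pairs with Spark 3.1; newer Delta releases
# require newer Spark, so a pinned Glue Spark version pins Delta too.
spark.range(5).write.format("delta").save("/tmp/delta-smoke-test")
```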
Robert: Thank God Glue now comes with native Delta support, so this is not a problem anymore, but back then, when we made our decision, it was still a problem for us.
Denny: Yeah, and a shout-out to our friends over at Glue. During AWS re:Invent late last year they announced Glue 4.0, and Glue 4.0 actually includes native Delta Lake support. They had Delta support in Glue 3.0, but that was with the manifest file, so Glue crawlers would basically go ahead and read the manifest files, and in terms of real time, even pseudo-real-time capability was not possible. Now the crawlers are able to query the Delta Lake tables directly, so it's included natively, no big deal. But that was back at the time, right?
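On the manifest approach Denny references: Delta can emit a symlink-format manifest that Presto- and Athena-style engines read to discover which Parquet files make up the current table version. A sketch with a hypothetical path:

```python
from delta.tables import DeltaTable

# Assumes a Delta-enabled `spark` session; the path is hypothetical.
table = DeltaTable.forPath(spark, "s3://my-lake/silver/claims")

# Writes _symlink_format_manifest/ under the table root; the Athena
# external table is then defined over the manifest, not the raw Parquet.
table.generate("symlink_format_manifest")
```

The manifest goes stale after every write unless it is regenerated (or the table property delta.compatibility.symlinkFormatManifest.enabled is set), which is why the crawler-plus-manifest setup could never get close to real time.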
Denny: You know, in fact, your own blog post actually does call out that Noritaka Sekiyama went ahead and pointed that out, and the reason I like calling that out is because we're actually working with Noritaka quite a bit. We actually have a separate D3L2 session with him.
Robert: Definitely, yeah. Funny enough, the day I published, I think it was the second article, was the first day of re:Invent, and they immediately announced that Glue now natively supports Delta.
Denny: That'll come off quite well with him. Okay, but let's talk a little bit more about this: you were working with Glue, you had some of those issues, and that's ultimately why you ended up moving away from Glue, basically because of that at the time? Yeah, exactly. All right, and sometimes that's just how it happens in terms of timing.
Robert: So it was nothing to do with the Spark version itself; we wanted to use the latest version of Delta Lake, and because of the compatibility issues, or the compatibility matrix, it wasn't possible for us to use the latest version on top of Glue 3. So it wasn't about the Spark version, rather the Delta Lake version.

Denny: Gotcha.
Denny: So in your journey of switching over to this, what are the efficiencies or the improvements to your own processes that you've been able to attain, going from what you were originally working with, back in the beginning when we were talking about Redshift, to now, working with Delta Lake? I'm just curious what the inherent efficiencies, improvements, and process gains were. Is data processed faster? Is it more real time? I'm just curious, because if anybody asks, the objection goes: well, it's just a freaking storage format on Parquet that happens to have a transaction log, so it doesn't really help. So what's your context behind that statement, or that question, excuse me?
Robert: First of all, the biggest upside we saw is cost reduction. Just not having a Redshift cluster up and running is a breeze. You now have increased costs for Athena and also on top of S3, but that is negligible in comparison to the cost reduction we achieved by turning off Redshift, so this is probably the biggest achievement. But the data staleness also got reduced: we have less writing, since the data can stay in S3, and this really cut down our ETL times, because the load into Redshift was always the slowest part of our pipelines. And reading data from our gold and aggregated tables, the loading times improved drastically, especially for machine learning purposes, because back then we had to use a JDBC connection to Redshift, and now we can read directly on top of S3, which is great.
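The before and after Robert describes looks roughly like this: instead of pulling aggregated features through a JDBC connection into Spark, the ML jobs read the gold tables straight off S3. A sketch with hypothetical connection details and paths (assumes an active `spark` session; Redshift JDBC driver setup omitted):

```python
# Before: feature extraction bottlenecked on a single JDBC pull from Redshift.
features_old = (
    spark.read.format("jdbc")
    .option("url", "jdbc:redshift://example-cluster:5439/analytics")  # hypothetical
    .option("dbtable", "gold.claim_features")
    .load()
)

# After: the same gold table is a Delta table read directly from S3, so the
# Spark workers fetch Parquet files in parallel with no warehouse hop.
features_new = spark.read.format("delta").load("s3://my-lake/gold/claim_features")
```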
Robert: And I love the Delta Lake API; it's very easy to interact with a Delta table, and up until now it has just been great, so I cannot complain. It basically exceeded our expectations of what we wanted to improve. We're still a small startup, right, and we don't have the data that a big Fortune 500 company has, but it seems like we can scale indefinitely with our current setup, at least for now, and we are now completely future-proof. It's great.
Robert: But what I really like is that we can operationalize the data in our Delta lake, or lakehouse, because with frameworks like the Delta Standalone Reader or delta-rs you can really quickly interact with the Parquet files, which allows you to essentially write back to your data lake. We weren't doing that on top of Redshift before, obviously, and this is really cool: we can write applications that write back to our data lake based on, yeah, data visualizations, for example.
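delta-rs ships Python bindings (the deltalake package) that read and write Delta tables without any Spark cluster, which is what makes this kind of lightweight write-back service practical. A sketch under assumed table names; note that concurrent S3 writers need delta-rs's locking setup, omitted here:

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Read the current state of a (hypothetical) Delta table, no Spark needed.
table = DeltaTable("s3://my-lake/gold/claim_features")
df = table.to_pandas()

# A small app computes something, e.g. user annotations coming back from a
# visualization tool, and appends the result straight into the lake.
annotations = pd.DataFrame({"claim_id": ["abc-123"], "flagged": [True]})
write_deltalake("s3://my-lake/gold/claim_annotations", annotations, mode="append")
```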
Denny: That's amazing. Actually, I'd love to talk a little bit more about that, but I just realized something. Don't worry, even though I'm from Databricks, I'm not actually trying to pitch Databricks here, or at least not that much, I should put it that way; there are obvious biases here, come on, guys. But I'm just curious, because it's interesting: you have the Databricks environment running a lot of your stuff, but you also have Athena. In other words, you have this mixture where some of the stuff is sitting in Databricks and some of the stuff is sitting in Athena, so you're basically picking and choosing what you like from either world, as opposed to necessarily doing all Databricks or all AWS native services. I'm just curious what led to that, or is it just because that's the way it was built and that's good enough?
Robert: So, first of all, what enables this, and I think this is the most important thing, is that we bet on open source software. Our core, the storage format, the table format, is Delta Lake, and our processing engine is Spark, and this allows us to choose whatever we want to use to interact with the data. As for why we picked Athena instead of, for example, Databricks SQL: again, just historical reasons. We also had Athena already up and running and configured. But the goal eventually is to move over to Databricks SQL, to leverage all the other cool features. That's the reason why.
Denny: But it actually makes sense, right? For the sake of argument, if Databricks SQL sucked, and it doesn't, by the way, but pretend it did, you'd have no problem switching back to Athena, no problem switching to all these other services. You have the flexibility, and that's more or less the whole point behind this.
Denny: I've actually answered a bunch of the questions off to the side, so I'm going to give this opportunity to let other folks go ahead and chime in if they have any other questions. And like I said, I've already posted Robert's blogs, and we'll update the YouTube video by the way, so YouTube is actually where we'll have the final recording.
Robert: And what really helped us during the migration from Redshift to Delta Lake is all the great open source libraries that live next to Delta Lake and next to Spark. I want to give a shout-out to Matthew Powers there; he's maintaining a lot of great libraries that helped us in the transition period, so definitely check them out. There's mack, which I'm using now, and definitely chispa, which is one of the must-use libraries for every developer out there. But yeah, mack, definitely using that, and I'm also looking forward to contributing there. It's great.
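For the curious: chispa provides DataFrame equality assertions for PySpark tests, and mack adds Delta Lake helpers such as deduplication and type-2 upserts. A tiny chispa usage sketch; the transformation under test is made up:

```python
from pyspark.sql import functions as F
from chispa import assert_df_equality

def test_claim_id_normalization(spark):
    # Hypothetical transformation under test: trim and lowercase claim IDs.
    source = spark.createDataFrame([("ABC-123 ",)], ["claim_id"])
    actual = source.select(F.lower(F.trim("claim_id")).alias("claim_id"))

    expected = spark.createDataFrame([("abc-123",)], ["claim_id"])

    # Fails with a readable row-by-row diff when the frames differ.
    assert_df_equality(actual, expected)
```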
Denny: Oh yeah, definitely. Actually, that reminds me, I did see your contribution to one of the libraries; I forgot which one it was, but yes, that's right, there's a reason I brought that up. You made a contribution there, and by the way, I believe we're moving that entire library to the Delta repo.
Denny: He's been going nuts, as you can tell, so we're going to be moving a bunch of stuff to various locations, just because we've got tons of contributions. But sorry, I completely went sideways on you. Go on: any advice for anybody who's trying to make that transition to lakehouses?
Robert: My biggest advice: don't reinvent the wheel. A lot of things have been done already. And then, let me think. For us, everything went so smoothly that I can't really tell any stories of pitfalls and caveats, because it went really well. I'm trying hard to think of some advice I could give, but just do it, essentially, because, at least for us, it has been one of the best decisions we've made. If anyone's thinking about it, at least try it out; try using Delta.
Denny: I could have sworn there were problems, but it looks like, and this is of course a little bit of a tagline here, Delta makes things so simple that you actually didn't run into pitfalls. Or I guess you got very lucky. No, no, yes, there's usually an amount of both. It's always a little bit of luck, a little bit of good tech, a little bit of Spark, and all three mix together to turn out that way.
Denny: So yes, I completely agree with you, man. Perfect. Well, thank you very much, Robert, I really appreciate your time. Anybody, if you've got more questions, as Robert and I were already discussing, join us in the delta-users GitHub, because we're actually doing lots of contributions in that arena. Also, don't forget to join us on the Delta users Slack; that's basically go.delta.io/slack. All of us are there, actively asking and answering questions. I will call myself out:
Denny: I disappeared for a couple weeks due to last week's Databricks CKO, and I'm still recovering from it. But yes, we are normally there, normally answering questions. And then one small final callout: I'll actually be in London at the end of the month, and we're going to have a meetup there to talk about lakehouses, so come join us there in person.