Description
We're re-igniting the Spark Online Meetup! In this live meetup, Denny Lee (Engineer and Developer Advocate at Databricks) interviews Delta Lake engineer Burak Yavuz.
Read more here: https://delta.io/
Learn more about Delta Lake Connectors: https://github.com/delta-io/connectors
Join the Delta Community Slack: https://dbricks.co/DeltaSlack
Databricks is proud to announce that Gartner has named us a Leader in both the 2021 Magic Quadrant for Cloud Database Management Systems and the 2021 Magic Quadrant for Data Science and Machine Learning Platforms. Download the reports here: https://databricks.com/databricks-named-leader-by-gartner
B
Thanks very much, Karen. Hi everybody, my name is Denny Lee. I'm a developer advocate here at Databricks, based out of Seattle, Washington. That's why I'm actually sitting with a weird espresso background while Burak moves chairs and pops on over; he's based out of San Francisco, California. So, without further ado: you are currently watching our interview with Burak on the genesis of Delta Lake. But before we go into it, the heart of this type of online meetup is that we get to interview the people behind the technology.
C
Hi Denny, and hello everyone. My name is Burak, I'm a software engineer here at Databricks, and I'm also a Spark committer. Basically, I work in a team called the stream team at Databricks. Our goal is to make the lives of data engineers much simpler, and our team motto is "we make your dreams come true." That's...
C
I don't know, we were joking about it. We had versions like "we make your streams come true." We were joking around: we called ourselves the stream team, and once we started working on Delta we switched that over to "dream team." Whatever it was, it was just joking around, right. So.
C
So originally I'm a mechanical engineer, but I've been programming since early high school. I was really interested in robots and bionic arms and things like that, so I wanted to get a full view of engineering. In mechanical engineering you build things that move and work, you know, control systems; on the other hand, you have to program them. I really enjoyed that whole area of making something work, making something move, so I studied mechanical engineering. But then I came to the U.S. and did something called management science and engineering, which is kind of like industrial engineering, worked on large-scale optimization problems, and that's how I got into the world of big data.
C
So I came to Stanford University for grad school; that's where I did management science and engineering. I was planning on doing a full Ph.D. program, like six years, to learn everything about optimization and things like that. But that kind of introduced me to the world of big data and machine learning, because every machine learning algorithm in the end uses some optimization algorithm, some optimization routine, to actually get a result.
B
Okay, so before you got introduced to Spark (because, as you know, you're a Spark committer), I'm just curious: what were the types of libraries or machine learning tools you were using? Was it, you know, old-school Java Mallet? Was this pre-Python, or were you using Python pandas? Just curious about your progression through the machine learning cycles before you actually switched over to the data engineering cycles. Yeah.
C
I mean, honestly, we were doing very academic research in that sense. We had a lot of code in MATLAB, a lot of code in C, and some code in Java. We were doing pretty well-known routines like stochastic gradient descent, and it turned out I was using all these tools that people had built on top of C, or even Fortran, right. You know, you have the very optimized matrix-matrix multiplication routines and whatnot.
B
Gotcha, so we're going back to almost old-school Fortran 77 types. Yeah, got it, got it. Okay, cool, fair enough; that's even what I did in the past. Completely okay. So then you finished off your degree at Stanford. But then how did you progress into Databricks, for that matter? How did you even progress into Spark in the first place? Yeah.
C
When I first joined, it was kind of like building the tools, so I started off as an intern. Back then Spark 1.0 had just been released, and we were trying to build all these tools to figure out regressions in Spark. We were adding all kinds of code to Spark, and we just wanted to make sure that we didn't regress in performance. So one of the first things I worked on was spark-perf, which was this library that allowed us to run benchmarks on Spark.
C
Maybe we should come up with streaming DataFrames as well, and so we were wondering how that would be. Because we had Spark Streaming, which had this DStream API; we had Spark Core, which was the RDD API; and then we had Spark SQL, which had the DataFrame APIs. How do we connect all these two or three? Are we going to have a DStream of DataFrames? Are we going to have some other concept? That was what we were thinking about early on.
B
Okay, so let's hold on that. So basically, what you're telling us here is that the progression into streaming DataFrames, which we'll talk about shortly, actually started off from that. Can you, without obviously diving into too many details, provide some context on why this was such a big problem, or the types of problems that you were trying to solve? Yeah.
C
I mean, it was a humongous problem. Early on, we had this very simple batch data pipeline that took one day's worth of data, processed it, and put it into a nice table. We needed to do something else, and that's when Databricks decided that, hey, a new college grad should build our new...
C
Your team, precisely. And the idea was, you know, we were streaming data, which was a completely different API, in Spark Streaming with DStreams. But then we had our batch processing, which was in DataFrames, and managing two completely different code paths started becoming a hassle. We started thinking about how we could unify these APIs and come up with logic where you wouldn't have to change too much code. You just give us the transformations that you want to do. Doing transformations on DataFrames was super declarative, super easy, and it was kind of a SQL-like API that people were used to. So we were like: oh, can we start doing this in a streaming fashion as well?
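The unification Burak is describing can be sketched in plain Python. This is a toy model of the idea only (one declarative transformation, executed either over a whole data set at once or over arriving micro-batches), not the actual Spark API:

```python
# Toy model of batch/streaming unification: one declarative
# transformation definition, two execution modes.

def transform(records):
    """The user's declarative logic: filter, then project."""
    return [{"user": r["user"], "amount": r["amount"] * 2}
            for r in records if r["amount"] > 0]

def run_batch(dataset):
    # Batch mode: apply the transformation to the whole data set at once.
    return transform(dataset)

def run_streaming(micro_batches):
    # "Streaming" mode: apply the SAME transformation to each
    # micro-batch as it arrives, accumulating results.
    out = []
    for batch in micro_batches:
        out.extend(transform(batch))
    return out

data = [{"user": "a", "amount": 3}, {"user": "b", "amount": -1},
        {"user": "c", "amount": 5}]

batch_result = run_batch(data)
stream_result = run_streaming([data[:2], data[2:]])  # two micro-batches
```

For a map/filter-style transformation like this, both modes produce the same result, which is exactly the property the unified API was after: the user writes the logic once and does not care how it is executed.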
B
I got it. So then, from that progression: you had all this data, and you wanted to do a lambda architecture, where basically you're doing batch queries, whether it was machine learning or BI or whatever else, on that data, but you also needed to look at the data live, right, via streaming. And because you're looking at the data live, you want to be able to use that same declarative nature from batch and apply it to streaming. So that was basically part of the reason why the Spark community itself was saying: hey, we see the experience, and we want to go ahead and build streaming on top of the Spark SQL engine, using Spark SQL syntax, excuse me, versus DStreams. I'm presuming that's the progression. How did that communication go with the community? How many other folks were bugging you about the same problem? Yeah.
C
I mean, in the stream team we had many ideas, but basically people like Matei Zaharia and Michael Armbrust put their heads together and asked: how can we get this to work nicely? And then they were like: yeah, we could just unify this within this API. And then we started talking about it within the developer community as well, and people were like...
B
By building the streaming DataFrames... sorry, by building the streaming DataFrames and by building the batch DataFrames, now you have your lambda architecture. What were some of the issues that you and your team were running into in that case? Because it seemed like you had a solution: you had the lambda architecture, which was the popular concept of the era, per se. So what were some of the issues that you ran into?
C
There were so many, I can't begin to count. Basically, we would get alerts every other day on our pipelines: something's failing. And it was just kind of the issues that come with working on large-scale distributed systems. We had many cases where data would arrive late and we would forget to process that data; we didn't know how far back to look when new data came in. So we just said: okay, we expect data to come in within three days, so let's just reprocess our entire data set over the last three days. And as we were doing all this streaming work alongside our batch pipeline, the latest data, the streaming data that we wanted to query, was always very slow. The reason for that was this concept of small files, having a lot of small files. The idea there is that we were initially up in AWS, working with Amazon S3, and these kinds of storage systems, data lakes as we call them, are just key-value blob storage systems. They're great for storing insane amounts of data, but they're not great at telling you what data is there, or which version of the data is there; it's just very hard for them to give you very consistent semantics. So we would hit so many issues around Amazon S3's eventual consistency. We would write out a file, but before writing out the file you would have to check whether the file is there, just so that you don't overwrite it or write garbage data, and that check would prime a negative cache. You would write the file, try to read it back, and then you'd have these issues of: oh, the file doesn't exist. And you're like: well, I just wrote it there, how does it not exist? Those were the kinds of problems that we had to deal with. Listing all those files was super expensive, because S3 just wasn't built for listing things; it's very hard for those kinds of systems to give you a list of what's there. And then just reading all those small files, opening so many HTTP connections, would be super expensive.
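The read-after-write surprise Burak describes can be modeled in a few lines of plain Python. This is a simplified simulation of the behavior he recounts (an existence check priming a "not found" cache), and it assumes nothing about S3's actual internals:

```python
# Toy simulation of the eventual-consistency bug described above:
# a safety check before the write primes a "negative cache" (the
# store remembers "not found"), so a read right after the write
# can still report the freshly written file as missing.

class EventuallyConsistentStore:
    def __init__(self):
        self.objects = {}
        self.negative_cache = set()   # keys recently observed as absent

    def exists(self, key):
        if key not in self.objects:
            self.negative_cache.add(key)  # remember the miss
            return False
        return True

    def put(self, key, data):
        self.objects[key] = data          # the write itself succeeds...

    def get(self, key):
        # ...but a cached miss can shadow the new object for a while.
        if key in self.negative_cache:
            return None                   # stale "does not exist"
        return self.objects.get(key)

    def cache_expires(self, key):
        self.negative_cache.discard(key)

store = EventuallyConsistentStore()
store.exists("part-0001")        # safety check primes the negative cache
store.put("part-0001", b"rows")
stale = store.get("part-0001")   # "I just wrote it there, how does it not exist?"
store.cache_expires("part-0001")
fresh = store.get("part-0001")   # eventually consistent: now visible
```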
B
I got it. So the heart of the matter, at least when you were doing the lambda architecture for log analytics, was that the underlying file system, in this case the cloud storage system itself, was not reliable, right? And I'm just curious: obviously this was your experience, but did you see the same thing happen with lots of Databricks customers? Yeah.
C
I mean, like I said, everyone was building the same thing at the time with these architectures. And it's not that the storage system is unreliable; it's that everyone had to build their own database semantics on top of this storage system. People were just used to working with things like MySQL, or data warehouses, these kinds of storage systems that were very easy to deal with; you didn't have to think about a lot of the problems that you might face. But then suddenly, when you came into this data lake architecture, which all our customers were also in, you had to deal with all kinds of questions: how do I deal with files? Can I delete files? How do I optimize my I/O patterns with these files? Which file format do I save them in, and which file sizes do I want to have?
B
Right. So that transition from on-premise to the cloud, that transition from a single box to distributed systems, the fact that you actually had to deal with a distributed file system: this basically introduced a whole set of issues that not only you were suffering from, in terms of doing the analysis of the data, but that many of the Databricks customers themselves were suffering from as well. Yeah.
C
I mean, we were trying to solve all these problems in different ways. We would get all kinds of support tickets saying: oh, my queries are slow, or my listing is super slow, can I make this faster? We were getting all kinds of support tickets around: oh, two people tried to change the same table at the same time, but now I have this totally inconsistent garbage state of my table. People would be like: oh, I have duplicate records here, why do I have duplicate records? Well, you know, you had partial failures. Those were the kinds of issues. And then, with Spark for example, a lot of things were built with the Hadoop Distributed File System in mind, where the idea was that with an on-prem HDFS you could just write to a temporary location and rename, and renames are super fast, a constant-time operation. Whereas with cloud storage systems, it could either be a very quick rename or it could be a server-side copy of the entire file. As people were moving from on-prem to the cloud, they just had so much trouble dealing with all these kinds of inconsistencies and all these kinds of performance issues. So that really led to the problem... like, we came up with intermediate solutions. Yeah.
B
Actually, with that, I'd love to dive a little bit into some of the intermediate solutions you had to put in place before you actually had the solution. So, just to give a little heads-up to the audience here: we are going to be talking about the Delta Lake transaction log, which ultimately solved some of these problems. But before we talk about that, I'd love to understand, as you hinted at, Burak, a little bit more.
C
Yeah, yeah. So, for example, with Structured Streaming, once we released streaming DataFrames, Structured Streaming, we came up with this file sink implementation, where you could take your data from anywhere (Kafka, Kinesis, Azure Event Hubs, whatever, or files) and then store it in some file storage system. And what this file sink did was actually kind of the initial implementation of Delta's transaction log.
C
It would write out all the files with unique names. These unique names ensured that you wouldn't ever hit an eventual consistency problem; you would never hit a failed task writing out a file and then a second task, a retry, writing out the same file. You would just get new sets of files every time, and once all the files were complete, it would take the set of files that it wrote and store it in a manifest file that said: okay, in this micro-batch I wrote all these files. And when Spark would actually query this table, it would go directly to this manifest; it wouldn't have to list any of the directories, it wouldn't have to list anything. This manifest was kind of the source of truth about which files Spark had to read to actually have a full view of the table. So that was one of our initial solutions for avoiding listing, and for kind of having an atomic operation: if the write fails, then Spark is not going to read those files; Spark is still going to look at the transaction log, or manifest file, to see what's the source of truth. So that was a very early implementation of what was getting us there.
B
Got it. So basically, prior to the transaction log, in essence you have a manifest file, basically just a file which lists all the names. And because, if you have a lot of files, listing the files from S3 became relatively slow, it was actually faster to read that one manifest file, which itself contained the list of, let's just say, 25 files. Even though there may have been 50 files, due to failures on write or for whatever reason, it would only grab the 25 files that you needed, exactly.
C
Exactly. And it wasn't just a single file; it was actually an ordered operation. So it was kind of like: the first batch wrote these files, the second batch wrote these files, the third batch wrote these files. And once we saw this directory, we would have to read, you know, 1, 2, 3: we would list that directory, read all the files within 1, 2, 3, and then answer a query based on the file list generated from those.
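That ordered-manifest scheme can be sketched as a tiny append-only log. Again, this is a toy model rather than the real implementation: replaying the per-batch manifests in order yields the table's full file list.

```python
# Sketch of the ordered-manifest log: batch N's manifest lists the
# files that batch N wrote; a reader replays manifests 1..N in order
# to assemble the complete file list for the table.

log = {}  # batch number -> list of files written by that micro-batch

def commit_batch(batch_id, files):
    log[batch_id] = files

def snapshot():
    """The table's current file list: all manifests, in batch order."""
    files = []
    for batch_id in sorted(log):
        files.extend(log[batch_id])
    return files

commit_batch(1, ["part-a", "part-b"])
commit_batch(2, ["part-c"])
commit_batch(3, ["part-d", "part-e"])

current = snapshot()
```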
B
Gotcha. So this is the precursor to the transaction log. So the idea then is that basically you had a file. I'm just curious: was there any discussion on why, for the sake of argument, that manifest would in fact be a file, versus, say, some other SQL or NoSQL store or something like that? Were there discussions about that, looking at it from the standpoint of a queuing or in-memory system instead? Yeah.
C
No, that's a great question. I mean, with Spark, every time, the biggest question is scalability: how can we build something scalable? And one other thing was, we don't want to depend on external systems; avoid dependencies on external systems, because it just adds more problems onto the users. So we were like: oh, they're trying to write to a storage system, why not have our source of truth along with all the data files within the storage system? We didn't want to have them set up a connection to some other database; we didn't want to have them set up a connection to some other key-value store. We already have permissions; they've set everything up so that the right people can write to that directory or read from that directory. Why not just store all the information that we need there?
B
Got it, cool. So the manifest file basically did solve a bunch of things, especially the file-writes issue. But I'm just curious: you started off with the lambda architecture, so you were really talking about streaming, right? So was it just the manifest itself that resolved the streaming issues, or what else did you have to do in order to be able to resolve these things? Yeah.
C
I mean, the manifest file kind of resolved the issues around distributed failures, partial failures, and file listing with streaming, but it didn't get rid of the problem of having a lot of small files. The manifest is the source of truth, it tells us which files we have to read, but it only worked with streaming writes. So how do you actually read from this table and, like, compact your files at the end of the day? Do you have one table that's just streaming and then you compact into a separate table? Or some customers would just ignore that transaction log and just overwrite everything and blow everything out at the end of the day, even though we told them: hey, please don't do this, you're not getting any guarantees this way. But, you know, some people were okay with that solution.
C
But still, a lot of issues existed, because there was no unification with batch yet, especially with this new streaming file sink. So in the end, we noticed all these problems, people were starting to use Structured Streaming a lot more, and we were like: well, maybe we should start thinking about this again and come up with this v2 concept of streaming, because, you know, people...
B
Okay, well, this is, I guess, the love of doing live sessions, where sometimes we have technical difficulties. I don't know if it's on my end or yours, but nevertheless, okay, let's progress, because we actually only have a few minutes left on the interview portion. We did want to try to time these interviews to be about the average length of time it takes for somebody to commute in San Francisco, which is about 32 minutes. So nevertheless, all right: you've gone ahead and told us a little bit about how that transaction log worked, the streaming sink. So then, what were some of the other issues that you ran into as well, especially with your customers? Because you progressed with the file sink, you progressed with what ultimately turned into a transaction log; what were the other things? For example, I'm presuming one of the problems was, as time changed... oh sorry, as time progressed, excuse me.
C
No, it's fine, your face said it, yeah. So yeah, to repeat your question: what kind of business need came up, along with time, that the file sink manifest did not support? Another big thing that came up was GDPR, you know, all these issues around data protection and data privacy, and the requirements for data subject requests, where people could ask specifically: what is my data?
C
Or: can you update my data, or just delete my data entirely? And people had to build very complex systems and data pipelines, or odd architectures, to kind of solve those issues. But we were like: normally, you should be able to write an UPDATE statement in SQL and be able to update your table; that's what people are generally used to. Or you should be able to write a DELETE statement on your table and delete all the records for a user. That's what our users were used to from on-premise data warehouses or databases. So we saw these new, more complex workloads emerging from all these new requirements around the data world as well, and our transaction log, which only supported streaming writes, was never going to support those use cases. So we had to come up with this new protocol that was actually able to understand what changes were being made to the table.
C
People knew SQL across different roles, and the SQL syntax allowed them to transition a lot more easily into this world without having to know any Spark APIs or DataFrame APIs. Maybe data scientists knew about DataFrames from pandas or R, but not necessarily a data analyst who was using BI tools and writing dashboards. We needed to empower all these people to actually build such things as well, when required.
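A DELETE on a table made of immutable data files is typically implemented copy-on-write style: rewrite only the affected files without the matching records, then atomically swap them into the table's committed file list. The sketch below is a simplified plain-Python illustration of that general idea, not Delta's actual implementation:

```python
# Simplified copy-on-write sketch of a GDPR-style DELETE on immutable
# files: find the files containing the user's records, rewrite them
# without those records, and atomically swap old files for new ones
# in the table's committed file list.

files = {  # file name -> rows, each row is (user, value)
    "part-1": [("alice", 10), ("bob", 20)],
    "part-2": [("carol", 30)],
}
table_files = ["part-1", "part-2"]  # the committed view of the table

def delete_user(user):
    global table_files
    new_list = []
    for name in table_files:
        rows = files[name]
        if any(u == user for u, _ in rows):
            kept = [(u, v) for u, v in rows if u != user]
            rewritten = name + ".rewritten"
            files[rewritten] = kept        # write a new file...
            new_list.append(rewritten)     # ...and reference it instead
        else:
            new_list.append(name)          # untouched files are reused as-is
    table_files = new_list                 # atomic swap of the file list

delete_user("bob")
visible = [row for name in table_files for row in files[name]]
```

Note that the old file is not physically erased by the swap; it simply stops being referenced, which is also what makes looking at older versions possible.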
C
Time travel, yeah. So when we came up with Delta, with all the intermediate solutions behind it, Delta, with its transaction log and its protocol, honestly solved a lot of the support tickets that we would get; it eased a lot of issues. Another issue that came up with Delta was this:
C
We tried to enforce kind of like best practices on users, and here's what would happen. With Hive there's this concept of dynamic partition overwrites. What that does is: you have a data set, you try to write it out in overwrite mode, so it's going to overwrite some amount of data, and it overwrites only the partitions that it writes new data to. It was kind of a lazy way of saying: I have this entire new data set, just overwrite whatever I need to overwrite. So the initial users of Delta, who were used to that kind of mode, started overwriting their entire tables, which meant deleting all their data and actually just overwrite-inserting a very small subset of data. And when they asked, oh, why did this happen, they would create a lot of support tickets: oh, Delta lost all my data. And we're like: so, here's the history log, and here's the operation that you wrote, and we have this operation called replaceWhere, which you can use to actually guarantee that you're overwriting the right data, so that you're not accidentally deleting data you meant to keep. But also, here you go: here's time travel. That will allow you to actually get whatever data was in a previous version, and you can merge all this data back into your current version. In the first week we released time travel, actually, six or seven customers saved all the data that they had accidentally deleted.
B
Wow.
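The failure mode and the guard Burak describes can be contrasted in a small toy model. The `guarded_overwrite` function below is written in the spirit of Delta's replaceWhere option, but it is an illustrative sketch, not the real API:

```python
# Toy contrast between a blind full-table overwrite (the failure mode
# described above: users wipe the whole table to insert a small slice)
# and a guarded overwrite that only replaces the partitions named by
# the predicate, refusing data that falls outside them.

table = {"2019-01-01": [1, 2], "2019-01-02": [3, 4], "2019-01-03": [5]}

def blind_overwrite(new_data):
    """Overwrite mode on the whole table: everything else is deleted."""
    return dict(new_data)

def guarded_overwrite(table, new_data, partitions):
    """Replace only the named partitions; reject data outside them."""
    assert set(new_data) <= set(partitions), "data outside replaced range"
    kept = {k: v for k, v in table.items() if k not in partitions}
    kept.update(new_data)
    return kept

wiped = blind_overwrite({"2019-01-03": [9]})          # oops: two days of data gone
safe = guarded_overwrite(table, {"2019-01-03": [9]},
                         partitions=["2019-01-03"])   # only one day replaced
```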
C
Yeah, it came from the idea that people make mistakes. If there's a way that we can prevent those mistakes, or roll back from them much more easily, then we should provide that feature to users. And from how the transaction log and the concepts of multi-version concurrency control worked, it was really easy for us to actually go back to the state of a table at any given time. So why not just empower users: if they want to query the differences, do that; if you accidentally deleted data, add it back. Just provide all that functionality very easily.
B
Perfect. Well, okay, this has been an awesome interview. I'm glad you spent the time with us here today, Burak, on the genesis of Delta Lake, and also telling us a little bit about yourself. It's a really interesting journey that you went from basically mechanical engineering to machine learning to being a hardcore engineer. I did want to leave a few minutes to ask some questions before we wrap this up.
C
Yeah. I mean, this concept of having these two tables, a streaming workload and a batch workload: what we wanted to do with the Delta architecture was, you can stream into your table, and you can additionally do all kinds of operations on that table. And that came from the biggest power of Delta, which was ACID transactions, which, funnily, we didn't get to in this interview yet. But through these ACID transactions you basically had all the power to append new data, delete existing data, compact existing data, without causing any transaction conflicts and whatnot. So what we wanted to do was propose this new architecture style, called the Delta architecture, where you would incrementally improve the quality of your data. What that meant was: you would have data coming in from this centralized message queue, which has a very short retention period, maybe seven days' worth of data, or two weeks of data max. And, you know, people don't always realize mistakes within that short period of time; you need a longer retention period. So the first step would be to take all that data and store it in cheap storage, leaving it untouched. And from there, you do one more layer of refinement, where you say: okay, take this raw data, let me just parse it out and move it into nicely cleaned tables, where I have my source of truth for all the events that I need. And then just add one more layer, and one more layer: combine all these event sources back again into tables that, you know, a data analyst can query optimally, or very quickly. So that was the idea. These kinds of operations, and this architecture, really took off with a lot of our customers, because they understood the pain points of making unfixable mistakes, and this architecture kind of gave them the flexibility to actually fix those mistakes, right.
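The incremental-refinement pipeline Burak outlines (raw data landed untouched, then parsed into clean tables, then aggregated for analysts) can be sketched end to end in a few lines. The three hops below are a toy model of that layered approach; the bronze/silver/gold labels in the comments are common names for these layers, used here only as an assumption for illustration:

```python
# Toy sketch of the multi-hop "Delta architecture" described above:
# land raw events untouched in cheap storage, refine them into a
# clean parsed table, then aggregate into tables analysts can query.

import json

raw = ['{"user":"a","amt":"3"}', 'not-json', '{"user":"b","amt":"4"}',
       '{"user":"a","amt":"2"}']

# Hop 1 ("bronze"): store raw data as-is, so later hops can be replayed
# if a bug is found downstream.
bronze = list(raw)

# Hop 2 ("silver"): parse and clean; bad records are dropped here
# (a real pipeline might quarantine them instead).
silver = []
for line in bronze:
    try:
        rec = json.loads(line)
        silver.append({"user": rec["user"], "amt": int(rec["amt"])})
    except (ValueError, KeyError):
        pass

# Hop 3 ("gold"): aggregate into a table optimized for analyst queries.
gold = {}
for rec in silver:
    gold[rec["user"]] = gold.get(rec["user"], 0) + rec["amt"]
```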
B
And so, actually, you called out a really good point, which, again, you're right, we should have brought up a little bit earlier, which is the context of transactions. Why was this concept so important that it ultimately led to the creation of that transaction log, or ultimately allowed you to be able to provide the reliability within Delta Lake in the first place? Yeah.
C
I want to make sure that, if I have two concurrent writers doing things to my table, they are consistent with each other. I don't want two people trying to delete the same file, or to delete the records from the same file and update it with new values. Or, you know, if you're doing compaction: compaction means you're going to have a second copy of the data within the same table. You don't want to break anything that was running at the time. A query that was started before you start your compaction process, you needed to give it isolation, so that that query can run for two or three days, if it's kind of a deep learning algorithm, for example, but the next time it runs...
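The isolation described above is commonly implemented with optimistic concurrency over a versioned log: each writer commits against the version it read, conflicting commits are rejected and must retry, and long-running readers keep the snapshot they started with. The following is a simplified plain-Python model of that general technique, not Delta's actual conflict-resolution rules:

```python
# Sketch of optimistic concurrency + snapshot isolation: writers commit
# against the table version they read; a commit is rejected if another
# writer got there first; readers keep using the version they pinned.

class Table:
    def __init__(self, contents):
        self.versions = [list(contents)]

    def latest_version(self):
        return len(self.versions) - 1

    def snapshot(self, version):
        return list(self.versions[version])   # long queries pin a version

    def try_commit(self, read_version, new_contents):
        if read_version != self.latest_version():
            return False                      # conflict: someone committed first
        self.versions.append(list(new_contents))
        return True

t = Table([1, 2, 3])
v = t.latest_version()

pinned = t.snapshot(v)             # a long-running query pins its snapshot

ok1 = t.try_commit(v, [1, 2, 3, 4])  # writer 1 (say, a compaction) wins
ok2 = t.try_commit(v, [0])           # writer 2 also read v: rejected, must retry

still_pinned = t.snapshot(v)       # the old query still sees its snapshot
```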
B
Perfect. Okay, so I actually only want to ask one more question, just because we have a lot of other good questions, but I figured some of them are a little bit more detailed, so they'd probably take longer than just a quick Q&A. But I did want to ask, excuse me, the final question, which is: when do we expect the full Hive integration with Delta?
C
Oh, full Hive integration; that's a great question. So we do have a Hive connector right now that we're hoping people are going to try out and help give us feedback on. Delta is an open source project; it's under a repository called delta-io/delta, and we have a GitHub repository called delta-io/connectors, and there we wish to have connectors for other kinds of analytics engines, where Hive is one of them.
B
Perfect. Okay, well, hey Burak, thanks very much for this great session; I really appreciate your time. I did want to do a couple of things just to do a wrap-up. So, Karen, I don't know if you're going to do a wrap-up, but I at least want to call out some quick things for wrapping up. Number one is that Burak and myself and a few other members of the stream team will be doing a three-part webinar series called Diving Deep into Delta Lake, where we're going to be talking about the transaction log.