Delta Lake Getting Started with Delta Lake, 12 Mar 2020

From YouTube: Simplify and Scale Data Engineering Pipelines with Delta Lake

Description

Online Tech Talk with Denny Lee, Developer Advocate @ Databricks

A common data engineering pipeline architecture uses tables that correspond to different quality levels, progressively adding structure to the data: data ingestion (“Bronze” tables), transformation/feature engineering (“Silver” tables), and machine learning training or prediction (“Gold” tables). Combined, we refer to these tables as a “multi-hop” architecture. It allows data engineers to build a pipeline that begins with raw data as a “single source of truth” from which everything flows. In this session, we will show how to build a scalable data engineering pipeline using Delta Lake.

Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.

In this session you will learn about:
- The data engineering pipeline architecture
- Data engineering pipeline scenarios
- Data engineering pipeline best practices
- How Delta Lake enhances data engineering pipelines
- The ease of adopting Delta Lake for building your data engineering pipelines

See full Getting Started with Delta Lake tutorial series here:
https://databricks.com/getting-started-with-delta-lake-tutorial-series/

A

Hi everybody, welcome to Simplify and Scale Data Engineering Pipelines with Delta Lake. My name is Denny Lee; I'm a developer advocate here at Databricks. I want to let you know that we're also live streaming this on our Databricks YouTube channel, so you can go to dbricks.co/youtube if you want to listen to the live stream there, if you have to jump off a little early, or if you want to listen a little later on today.

A

This and the other Delta Lake online tech talks are going to be available on the Databricks YouTube channel.

A

Allow me to introduce myself a little before we dive in. My name is Denny Lee; I'm a developer advocate at Databricks. I'm a hands-on distributed systems and data science engineer with experience in internet-scale infrastructure, data platforms, and predictive analytics systems. I used to work at Microsoft, where I helped build what is currently known as HDInsight and worked with SQL Server customers, and I've been working with Apache Spark since 0.5. So that gives you a little background on my data engineering experience.

A

Some quick logistics: this recording and the slides will be available after this tech talk. As noted, it will also be posted to the Databricks YouTube channel. Since everybody is muted, please put your questions in the Q&A panel, not in the chat panel; I'll be looking for questions there. We will also provide a link to anybody who saved a spot through Zoom or through the YouTube channel.

A

That will give you all the information you need. This includes, by the way, the notebooks that we're using for this demo. The link is included in this presentation, and we're also going to include it in the YouTube channel.

A

Okay, so let's get started. If some of you have been attending the previous sessions, we did talk a little about the data engineering journey, or the data engineering pipeline. I'm going to do just a quick call-out to provide context. If you want to dive deeper, last week we had Beyond Lambda: Introducing Delta Architecture, and the week before we had Getting Data Ready for Data Science with Delta Lake and MLflow, where we discussed some of this.

A

The quick context is this. On the left side of this diagram you see the events going into Apache Kafka or some other streaming mechanism, whether it's Kinesis, Azure Event Hubs, or Cosmos DB. Irrespective of which approach, typically you have to build a Lambda architecture. The Lambda architecture allows you to deal with real-time processing in the top part and batch processing at the bottom. At the top is Apache Spark Structured Streaming, processing the data from Kafka and pushing it into the unified view on the right, the database, which is what feeds AI and reporting. That's the path for what you're constantly streaming.

A

Then there's also the batch data, which is to your left here. The data on the left streams down and gets written continuously into a table that you can run normal batch processing against. To validate this data, you need to build a unified view that does the validation between the stream at the top and the stream to the left: between the processing you're doing in Structured Streaming and the processing that ultimately goes into the table.

A

That way, there's a reconciliation process to ensure that the data is actually the same. Once your data is being continuously written to this table, you run batch processing. This batch processing lets you reprocess data into yet another table, which gets compacted every hour: the stream generates a lot of small files, so we compact those files together.
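As a rough illustration of the compaction step described above, the following sketch plans how many small files could be grouped into larger ones. It is not from the talk's notebook: the function name, the file names, and the 128 MB target size are all assumptions chosen for the example, and it only plans the groups rather than rewriting any files.

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedily group small files so that each group stays within target_mb.

    file_sizes_mb maps a (hypothetical) file name to its size in MB.
    Returns a list of groups; each group is a list of file names that
    would be rewritten together as one larger file.
    """
    groups, current, current_size = [], [], 0
    for name, size in sorted(file_sizes_mb.items()):
        # Close the current group if adding this file would exceed the target.
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Twenty 16 MB files, as a stream writing every few seconds might produce.
small_files = {f"part-{i:04d}.parquet": 16 for i in range(20)}
plan = plan_compaction(small_files, target_mb=128)
group_sizes = [len(group) for group in plan]
```

Fewer, larger files mean fewer objects to list and open per query, which is the performance concern the talk raises about small files.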

A

That way we can ensure better performance, because having too many small files in any distributed system slows the system down. It's here that you're able to do updates and merges of that data, and if there's any late-arriving data, you're able to reprocess it at this bottom table, right in the middle. Yet another Spark batch process then takes the data from that table and places it into the unified view, so that you can do your AI and reporting.

A

That's what a Lambda architecture would traditionally look like, and the usual question you want to ask is: can this be simplified? This is basically the data engineer's dream. You want to be able to process data continuously and incrementally as new data arrives, in a cost-efficient way, without having to choose between batch or streaming. This goes back to when we created Spark Streaming and Structured Streaming.

A

A few years back, when we called things static DataFrames versus dynamic DataFrames, the whole premise of creating DataFrames was that you didn't have to think about a difference between how to deal with low-latency data and batch data. In other words, you would write and run the queries exactly the same way, whether you were running against a streaming dataset or a batch dataset. That's what you really want: the same approach either way.

A

It's the same APIs or the same SQL syntax that you're applying, whether it is a streaming dataset or a batch dataset. That significantly simplifies not just the maintenance of the code but, just as important, the type of thinking involved: you don't have to change your thought process just because you're doing batch or streaming. And so that dream starts on the left side of the diagram.

A

On the left you have Kafka, Kinesis, Event Hubs, or your data lake; it doesn't matter if it's a streaming source or a batch source. Spark takes it and drops it into some store, and this is where Delta ultimately gets involved, just a little hint there. Then Spark processes it again, and you have your AI and reporting.

A

That's what you want to do. So what's missing in order to be able to do this? Spark is pretty close to solving all these problems, but there are a bunch of things missing to actually achieve this goal, so let's talk about those issues. The first one is the ability to read consistent data while data is being written. As noted in some of our previous sessions, we're really talking about the concept of ACID transactions.

A

ACID stands for atomicity, consistency, isolation, and durability. The concept of transactions is that you want to be able to trust that when data is written to the store, to disk, or wherever it is being written, you are sure that it actually happened and there is no chance of corruption.

A

If you look at traditional distributed storage, whether it's Azure Blob Storage, ADLS, Google Cloud Storage, or S3, it's the same concept: these are BASE in comparison to ACID. So there's a chemistry joke in here, but BASE stands for basically available, soft state, eventually consistent.

A

The most important aspect here is “eventually consistent.” Traditionally, with cloud storage or even with a Hadoop system, there are by default three copies of the data written to disk or storage. So potentially client 1 hits node 1 and client 2 hits node 2 of the three copies. Eventually consistent in this environment basically means that I've written the data to node 1 but haven't written it to node 2 yet, because there's some delay.

A

It's possible that client 1 and client 2 hit their nodes at exactly the same time. Client 1, which hits node 1, sees the data; client 2, which hits node 2, doesn't. That's the concept of eventual consistency: you're not consistent in whether the data exists or not.
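The two-client scenario can be simulated in a few lines. This is a toy model, not how any particular storage service is implemented: the class names, node names, and file name are hypothetical, and replication delay is modeled as an explicit `sync()` call.

```python
class Replica:
    """One copy of the data, holding whatever writes have reached it so far."""

    def __init__(self, name):
        self.name = name
        self.files = set()


class EventuallyConsistentStore:
    """A write lands on one replica at once and on the others only after sync()."""

    def __init__(self, replica_names):
        self.replicas = {name: Replica(name) for name in replica_names}
        self.pending = []  # (replica name, file name) pairs not yet propagated

    def write(self, via, file_name):
        self.replicas[via].files.add(file_name)
        for name in self.replicas:
            if name != via:
                self.pending.append((name, file_name))

    def read(self, via):
        return sorted(self.replicas[via].files)

    def sync(self):
        # Replication catches up: every pending write reaches its replica.
        for name, file_name in self.pending:
            self.replicas[name].files.add(file_name)
        self.pending.clear()


store = EventuallyConsistentStore(["node1", "node2", "node3"])
store.write(via="node1", file_name="part-0001.parquet")

client1_view = store.read(via="node1")   # sees the new file
client2_view = store.read(via="node2")   # does not see it yet

store.sync()
client2_view_after_sync = store.read(via="node2")  # now consistent
```

Between the write and the sync, the two clients would make different decisions about the same table, which is the inconsistency the talk describes.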

A

This concept of consistency is extremely important if you want to be able to look at streaming and batch data concurrently. If you're streaming data, you're constantly writing to disk extremely quickly, with all these small files. Whether the distributed store you're writing to is on-prem Hadoop, your own environment, or cloud storage doesn't really matter: there are multiple copies of that data sitting somewhere else. Back to the example I used, client 1 is hitting node 1 and client 2 is hitting node 2.

A

Client 1 sees the data and client 2 does not, and those clients are about to perform some action on inconsistent data. For example, if I'm going to do an update or a delete, client 1 sees the data and client 2 does not, so their actions, or the results of those actions, are going to be quite different. So this ability to have consistent data is extremely important. The second call-out is the ability to read incrementally from large tables with good throughput.

A

That's a super important concept: the larger the table, the more resources it takes, so you want some mechanism for consistently processing that data. You also want the ability to roll back in case of bad writes. For example, I attempt to write data to disk, and there are, let's say, five tasks that have to write as part of this one job. The fourth task fails. What happens normally in this case?

A

Well, the other three, and possibly the fifth one, will write to disk, and the fourth one will roll back and fail. That means four of the five tasks have written to storage, while one of the five ends up not being able to write. So it fails writing and you get the error, except now your data is left in a bad state.

A

In other words, you're not sure what change just happened to your data. That's already bad enough if you're just inserting data. If you're just inserting, maybe you could say: I built a checkpoint or I put in a queue, so I know that I ran this with this batch ID or at this time, and I'll just roll everything back and delete everything manually myself. That's fine, but it gets more complicated.

A

Firstly, it's already a pain just to do all that maintenance. But even if you had no problem doing it, consider running updates or deletes against that data. Once you've done that, and somewhere halfway through the system breaks, how do I roll back that update? How do I roll back that delete? So you want that ability to roll back.

A

So in case there is an error, whether it's an error at the disk level, in the source system, or in the business logic, it doesn't matter where the error is: in case there's a bad write of some type, you're able to roll back.
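The five-task example can be sketched to contrast a plain write with an all-or-nothing write. This is only an illustration of the atomicity idea, not Delta Lake's actual mechanism; the helper names are hypothetical, tasks run one after another for simplicity, and the "table" is just a Python list.

```python
def make_task(task_id, fail=False):
    """Build a task that returns the rows it wants to write, or raises."""
    def task():
        if fail:
            raise IOError(f"task {task_id} failed")
        return [{"task": task_id}]
    return task


def write_without_transaction(table, tasks):
    # Each task publishes its rows immediately, so rows from earlier
    # tasks stay in the table even if a later task raises.
    for task in tasks:
        table.extend(task())


def write_all_or_nothing(table, tasks):
    # Stage every task's rows first; publish only if all tasks succeed.
    staged = []
    for task in tasks:
        staged.extend(task())  # an exception here leaves the table untouched
    table.extend(staged)


# Five tasks in one job; the fourth one fails.
tasks = [make_task(i, fail=(i == 4)) for i in range(1, 6)]

dirty_table = []
try:
    write_without_transaction(dirty_table, tasks)
except IOError:
    pass  # the job reports an error, but three tasks' rows were already written

clean_table = []
try:
    write_all_or_nothing(clean_table, tasks)
except IOError:
    pass  # the job reports an error and the table is exactly as it was before
```

With the second approach there is nothing to clean up manually after a failure, which is the rollback property the talk is asking for.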

A

You also want to be able to replay historical data along with new data that arrives. In other words, have a single table that looks at both the new data as it's coming in and the historical data at the exact same time. So you, as the data engineer, are looking at the data from the standpoint of “here's all of my data at the point of the query, whether it's brand new or old,” as opposed to “let me take the new data, let me take the old data, let me merge them together, and let me write a query against that.”

A

Just as important, with all those concepts we're talking about, is the ability to handle late-arriving data without having to delay downstream processing. So what's the answer to this? The answer is basically combining Structured Streaming and Delta Lake together in order to create this Delta architecture.

A

We talked a lot about this last week, but as this session is about simplifying and scaling data engineering pipelines, I just want to provide this quick context first before we go into how to do that. The real call-out here is: unified batch and streaming with a continuous data flow; infinite retention to replay and reprocess historical events as needed; and independent, elastic compute and storage that can scale independently from each other, so that you can balance costs.

A

That's actually how we want to do this, so let's do this with Delta Lake instead. What we implied with the data engineering journey is broken out into this concept of Bronze, Silver, and Gold data quality levels. In the generic diagram we showed a single database icon, but really it's broken down into different data quality levels. Now some people are going to ask: do I actually need to have three separate tables?

A

Are these three physical implementations? The quick answer is that in some cases yes and in some cases no; it depends on the cleanliness of your source data. If you have already processed and cleansed the data upstream, you probably don't need these three levels; you can probably just go right to Gold. But suppose you're taking a raw data source where you need to ingest and keep the data because the store or the REST API you're calling is non-persistent; in other words, it only holds the data for a short time, like Kafka.

A

Let's say you set up your Kafka, Kinesis, or Event Hubs to only hold data for about a day or a few hours. You need somewhere to have that data sit. That's what this concept of a Bronze table is: it's the raw ingestion, making sure you have a source where that data resides. Then you go to the concept of Silver. Now that you're sure you've written the data to storage, you can go ahead and filter it.

A

I don't need all these login calls, IP traces, or whatever else. Then clean it: there's data that needs simple business logic applied, such as removing a description, changing an ID, or filtering out everything from a particular geo region. That's the cleansing concept, and that's what Silver data does. Oh, and also augment it: if I have other sources of data that I want to join with this data, this is where you would do that.

A

Finally, you're at the Gold level. Now that I've filtered, cleansed, and augmented the data, I will possibly aggregate it as well. In other words, I don't need to look at the 8 million transactions that just happened per hour; I can look at them by minute as opposed to by second, using that as an example, because I only need minute latency for the purpose of my reporting.

A

Or, the other way around, if I only need hourly reporting as opposed to minute reporting, I can shrink the minutes down to hours. Irrespective of what the business logic requires, the idea is that this Gold table is smaller and more compact, and has exactly what you need for your streaming analytics or your AI and reporting to piggyback off.

A

So these data quality levels, which is what you see in the Delta architecture, allow you to incrementally improve the quality of your data until it is ready for consumption. That's the important call-out.
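The three quality levels can be sketched with plain Python lists standing in for tables. The event fields, the region lookup, and the function names below are made up for illustration; in the talk these steps would be Spark jobs reading from and writing to Delta tables.

```python
from collections import defaultdict
from datetime import datetime

# Bronze: raw events kept exactly as they arrived.
bronze = [
    {"ts": "2020-03-12T10:01:05", "type": "purchase", "region": "US", "amount": 20.0},
    {"ts": "2020-03-12T10:01:40", "type": "login", "region": "US", "amount": 0.0},
    {"ts": "2020-03-12T10:47:10", "type": "purchase", "region": "EU", "amount": 5.0},
    {"ts": "2020-03-12T11:15:00", "type": "purchase", "region": "US", "amount": 7.5},
]

# A second (hypothetical) data source used to augment the events.
region_names = {"US": "United States", "EU": "Europe"}


def to_silver(rows):
    """Filter out events the reports do not need and join in reference data."""
    return [
        dict(row, region_name=region_names[row["region"]])
        for row in rows
        if row["type"] == "purchase"
    ]


def to_gold(rows):
    """Aggregate the cleaned events down to one total per hour."""
    totals = defaultdict(float)
    for row in rows:
        hour = datetime.fromisoformat(row["ts"]).strftime("%Y-%m-%d %H:00")
        totals[hour] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
```

Because Bronze keeps the raw events, Silver and Gold can be rebuilt from it at any time if the filtering or aggregation logic changes.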

A

If you're a data engineer, I would love to be able to ask you online to raise your hands: how many of you are data engineers, and how many of you are enterprise data warehousing folks? What does this remind you of? This concept reminds you of your standard data lifecycle. The Bronze table is more like your data lake: you just dump all the data in and you're good to go.

A

Then you build a staging database or staging table, which you use to filter, clean, and augment the data. And then you build a data mart that has those business-level aggregates. So this concept is very similar to the idea of a data lifecycle. It's just that with the Delta architecture, it's not just a single table in a single database.

A

The idea is that we're talking about different tables applied in a distributed manner in a distributed system. That's what the architecture is about, and it also allows you to handle both streaming and batch data concurrently. So how do we transition from this traditional data lifecycle to the Delta Lake lifecycle? Like we said, for raw ingestion, Bronze is this idea of a dumping ground for raw data.

A

You will often have a very long retention, often in the years. You want to avoid error-prone parsing; in other words, it's really about just storing that data. That's really the concept of Bronze. With Silver, you've got intermediate data with some cleanup applied. It's queryable for easy debugging.

A

So, for example, if you are a data scientist and you don't necessarily want to work with the business-level aggregates but want to go one level deeper or into more detail, you can potentially run your analysis on the Silver data. This is also common for debugging. As an example, say I'm an airline, and I want to be able to debug or provide customer support for somebody who logged into my website to order a ticket.

A

This is that filtered, cleansed, and augmented data. There's usually enough information in there that you can debug your way out. And then once you're good to go, at Gold you've got business-level aggregates and clean data ready for consumption. It's now ready to read with Spark or Presto.

A

Actually, there's a star there that I should remove, but the call-out is that as of Delta Lake 0.5.0, we've included the ability to create manifest files, and with those manifest files both Athena and Presto are able to read a Delta Lake table as well. So Delta Lake is not just for Spark. It certainly started with Spark, and I'm going to show you an example using it, but other systems are able to make use of Delta Lake because they can read the manifest file.

A

As we're noting here, the streams move data through the Delta Lake, whether low latency or manually triggered, which eliminates the management of schedules and jobs. I forget whether I included this particular slide in this deck.

A

But one of the cool examples I like talking about, which was covered at Spark + AI Summit in San Francisco last year, was Comcast. Comcast went from 80-some jobs down to three. Even though they were doing a sessionization process that did not necessarily require everything to be streaming, because they were able to run low-latency streaming jobs, they were able to replace all their 80-plus batch jobs and shrink them down to three jobs.

A

The combination of ACID transactions in Delta Lake and being able to look at the problem in the same manner, whether it was streaming or batch, allowed them to decrease the complexity of their job structure, and they're now able to maintain that system much more easily. This is an important aspect: even if you cannot take advantage of streaming per se, because you don't have a streaming job in the traditional sense, you can certainly break the work down into small micro-batches.

A

That way, you have a job that's consistently running, which reduces the complexity of what you're building. Now, in order to provide you all this capability, Delta Lake provides you DML: the ability to do inserts, updates, deletes, merges, and overwrites. You can certainly do inserts in Gold and overwrites in Bronze, so this isn't meant as a catch-all.

A

It's simply stating that traditionally at the Bronze level you're inserting or updating, traditionally in Silver you're deleting data so you can shrink it down, and traditionally in Gold you're merging or overwriting data. But those DML (data manipulation language) statements can be applied all throughout. They allow you to do retention and corrections.

A

They even allow for GDPR, the General Data Protection Regulation. This is an important concept for newer systems where you're holding a lot of data. CCPA and GDPR fall under GRC: governance, risk management, and compliance. It's an important aspect of ensuring not just that your data is safe but also the privacy of the individuals behind that data. That's something we're actually going to cover next week, by the way.

A

In next week's tech talk, we are going to be talking about how to address GDPR and CCPA by utilizing Delta Lake, so please do join us next week for that session as well. A little shout-out to that session. But with Delta Lake, the concept is that if I need to recompute because the business logic changed, I can clear the Silver, I can clear the Gold (I could just run deletes), and then I can restart the streams or reprocess the data.

A

I could scale the environment or the systems to process more data, and then you're good to go. So now, let's talk about demos. The remaining 20 to 25 minutes or so are purely demo. Now that you've been patient enough to let me give you some context, let's go right into the demo.

A

Like I noted, this notebook is available for you to download and use. I'm only going to be using a small portion of it, just to give you some context about how to build these scalable pipelines. I'm going to stick to just streaming and processing; that's all I'm going to do. This notebook is currently running in Databricks Community Edition, and because we're running it here, the dataset is in Databricks datasets.

A

But you can run this yourself in a Jupyter notebook if you want to, and we include a link to where you can get the data. So you can certainly run this in your own environment; you do not have to run it on Databricks. But if you do, you can run it on Databricks Community Edition, and it's free. So, right now, I just want to start off with what we have.

A

We've got some data with a schema like this: you have a loan ID, funded amount, paid amount, and address state. This is loan data that we downloaded from Kaggle; we include the link there. How many records does it have right now? This particular one has only about 14,000 rows. It'll be important to call this out later, so that's a quick forward call-out. I'm going to go ahead and run this particular function here.

B

All right, there we go. Okay.

A

I'm going to run this function now. I'm not going to dive too deep into it, but the point is that I'm going to create this generate-and-append data stream. The purpose of the stream is to take data based on the loan data that I have and insert it into the same location, and I'm going to do it in Parquet. Let me scroll back near the top to give you some context.

A

I created this Parquet path, this particular folder, so my loans Parquet table is based on the data residing in there, in this /tmp SAIS EU 19 demo location. So I'm going to go ahead and run this generate function, generate and append data stream, which is going to put data into that Parquet location.

A

Now, when I run it, a Structured Streaming job is going to kick off, so we'll wait a few seconds for it to start. We're initializing the stream. The reason I'm showing this in Databricks is that it has a cool little feature where I'm able to show you the input and processing rates and the batch duration. Okay, we're processing about 20.6 records per second, so we're putting data in. So far, so good.

A

Let's see if any data is being added. I'm going to run a quick count against that same Parquet path, where the loans Parquet table is. I'll run it directly off of here, and when I run this you'll notice that there are a hundred or so rows in here. That number seems a little off, and I'll talk to you about that in a second.

A

But what happens if I try to run a second stream? This is an important aspect if I'm trying to scale my data engineering pipelines: I've got more than one source kicking in.

A

Sorry, I hope you did not hear that; my apologies. So I have more than one source, because it's multiple REST APIs, or a distributed source, or whatever, and I want to run a second stream writing to that exact same table. What happens if I try to do that against a Parquet table? It kicks in and tries to start writing to that same location.

A

What you'll notice is that, sure, the batch duration is about a second, but there are zero records per second; nothing's going in. So the second stream can't write to the same location as the first stream. Data is still going in from the first stream: the count did jump up to 570, as you can tell from here. So the data is still going in, as you can see from this one, but it's not going in here.

A

It just doesn't happen. Why? Because when you're writing data to disk, there is no concept of ACID transactions, in which every single write is protected and there's a transaction around it protecting this information. What's happening, if you want to think in traditional data warehouse terms, is that in essence there's a lock on the table.

A

The first stream is able to write to disk, but the second stream just can't. So you can't have multiple streams writing without creating multiple tables and then merging those tables together, which increases the operational complexity of your system.

A

But wait, remember how I started off: there were 14,705 rows, and right now there are certainly fewer than that. I'm streaming data in, so it's probably a little higher now; we'll just run it: 870, but certainly not the 14,705 that we talked about.

A

If I look at the data, you'll notice that in addition to loan ID, funded amount, paid amount, and address state, I also have timestamp and value in here. So the schema changed, and that's our problem. I'm going to stop the streams as I explain this concept.

A

What's happening is that we had the streaming query (one of them, because the second didn't work) writing into the Parquet table. Because it's a Structured Streaming job, the job automatically included the additional timestamp and value columns. For example, if I go back and look at the code here, my stream data is right here: I'm reading this data and asking for the loan ID, funded amount, paid amount, and address state.

A

Then I do a write stream: you'll notice stream data, write stream, format, table format, which was one of the input parameters and was Parquet. The option is the checkpoint location; that's just a location so we can ensure that our stream runs correctly, so we can skip that for now. We set a trigger, which is every 10 seconds: every 10 seconds we process the data, i.e., write the data, and then it goes into that table path.
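The write just described might look roughly like the following PySpark sketch. The function and parameter names are hypothetical, since the notebook's actual code does not appear in the transcript; `writeStream`, `format`, `option("checkpointLocation", ...)`, `trigger(processingTime=...)`, and `start(path)` are standard Structured Streaming calls. `stream_df` is assumed to be a streaming DataFrame created elsewhere.

```python
def start_append_stream(stream_df, table_format, table_path, checkpoint_path):
    """Start a streaming write of stream_df into table_path.

    table_format is the storage format name, for example "parquet" or "delta".
    Returns the streaming query handle that start() produces.
    """
    return (
        stream_df.writeStream
        .format(table_format)                           # storage format of the target
        .option("checkpointLocation", checkpoint_path)  # where stream progress is tracked
        .trigger(processingTime="10 seconds")           # write a micro-batch every 10 seconds
        .start(table_path)                              # target table location
    )
```

If the stream were built on Spark's built-in `rate` source (an assumption; the transcript does not say), the extra `timestamp` and `value` columns would come from that source's output schema, which would explain why they appear without being requested.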

A

The table path is the same Parquet path that we listed originally. So that's great, but you'll notice that I never actually specified timestamp and value; they were added automatically as part of the streaming process. Because these columns were added automatically, I have two different schemas: the old one that only has four columns, and a new schema with six columns. And because I have six columns, in essence I overwrote my original table.
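What schema enforcement would have prevented here can be shown with a small stand-in table. This is a conceptual sketch, not Delta Lake's implementation: the class, the exception, and the column names are hypothetical, chosen to mirror the four loan columns and the two extra streaming columns mentioned in the talk.

```python
class SchemaMismatchError(ValueError):
    """Raised when incoming rows do not match the table's columns."""


class EnforcedTable:
    """A toy table that rejects any append whose columns differ from its own."""

    def __init__(self, columns):
        self.columns = tuple(columns)
        self.rows = []

    def append(self, rows):
        rows = list(rows)
        # Validate every row before writing anything.
        for row in rows:
            if set(row) != set(self.columns):
                raise SchemaMismatchError(
                    f"expected columns {sorted(self.columns)}, got {sorted(row)}"
                )
        self.rows.extend(rows)  # only reached when every row matched


loans = EnforcedTable(["loan_id", "funded_amnt", "paid_amnt", "addr_state"])
loans.append([
    {"loan_id": 1, "funded_amnt": 1000, "paid_amnt": 200.0, "addr_state": "CA"},
])

# A row from the stream carrying the two extra columns.
streamed_row = {
    "loan_id": 2, "funded_amnt": 500, "paid_amnt": 50.0, "addr_state": "WA",
    "timestamp": "2020-03-12T10:00:00", "value": 0,
}
try:
    loans.append([streamed_row])
    rejected = False
except SchemaMismatchError:
    rejected = True
```

The mismatched write fails loudly and the existing rows stay intact, instead of the six-column data silently taking over the table.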

A

So that's why, when I query the data, I'm only seeing the 850 or the last thousand; I'm not seeing my original 14,000. Okay, so this is a problem with Parquet. All right, there's no concept of schema enforcement to ensure that the data coming in actually matches the schema that already exists inside the table. And you can't... I mean, not just batch and streaming

A

workloads: I can't actually even have two streaming workloads write into the same table concurrently, right? Which sort of sucks. Okay, so let me just restart this process over again. I'm just gonna go ahead and clear out the data and repopulate the data, so I'm just gonna run that process. And once this is finished, I'm gonna go ahead and run this step, where we just basically create this as a Delta table. So in other words, here's the original Parquet file.

A

Sorry, the Parquet file that we have here. Okay, I'm gonna go ahead and run this. Note, instead of the Parquet path, I'm gonna go ahead and do it as a Delta path. Okay, so I'm gonna go ahead and store this as a Delta table. Okay, that's what this line is.

A

Okay, here. Okay, so as you can tell, it's pretty easy to switch from Parquet to Delta, right? In other words, I read from Parquet and then I write with Delta. So it's the exact same pattern: read format Parquet, write format Delta. If I want to read a Delta table, it would be read dot format Delta, right? Pretty straightforward. Okay, so now I've created this Delta table. I've also created a temporary view as well, just in case. All right.
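
A minimal sketch of that switch, assuming a Spark session with Delta Lake available. The function name and path arguments are hypothetical; `read.format(...).load(...)` and `write.format(...).save(...)` are the standard DataFrame reader and writer calls.

```python
def convert_parquet_to_delta(spark, parquet_path, delta_path):
    """Read an existing Parquet table and store the same rows as a Delta table."""
    df = spark.read.format("parquet").load(parquet_path)  # read format Parquet
    df.write.format("delta").save(delta_path)             # write format Delta
    return df
```

Reading the result back is the same pattern with the format changed: `spark.read.format("delta").load(delta_path)`.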

A

So let's see the data. Okay, and we should be back to 14,705, as you see here. And we want to look at the schema real quick. All right, so the schema is in fact four tables... those are four columns, apologies. All right, so I'm back to where I used to be. But the only difference is that I've got a Delta table as opposed to a Parquet table. And one thing I'm gonna go do is that I'm also gonna create a loans Delta stream.

A

Okay, so I have a loans Delta table, which is for batch queries. I'm also gonna create a loans Delta stream, which is for streaming queries, but they both go to the exact same location, this Delta path. Okay, so it's the same file system whether I'm running a batch query or a streaming query. Same thing, okay.
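
The two views over one location might be set up along these lines. This is a sketch under assumptions: the function name is hypothetical, and the view names `loans_delta` and `loans_delta_readstream` are inferred from how the views are referred to in the talk.

```python
def open_batch_and_stream(spark, delta_path):
    """Expose one Delta location as both a batch view and a streaming view."""
    batch_df = spark.read.format("delta").load(delta_path)         # batch reads
    stream_df = spark.readStream.format("delta").load(delta_path)  # streaming reads
    batch_df.createOrReplaceTempView("loans_delta")
    stream_df.createOrReplaceTempView("loans_delta_readstream")
    return batch_df, stream_df
```

Both DataFrames point at the same files and the same transaction log; only the way they are read differs.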

B

A

Right, so now that I've got that, okay, this is giving my quick count: woo-hoo, 14,705. Let's go ahead and try to run this query again. All right, so same one again. But remember, I actually had six columns, not four columns. Okay, so with this one I have six columns, not four columns. But guess what? Because I'm using a Delta table, I'm seeing a schema mismatch now, right? It tells you right here. All right.

A

If I want to merge the schemas together, I can use dot option mergeSchema true. So in other words, if I wanted those additional columns to be included, all I had to do was change the code that I was using to include that option, and I could have allowed those six columns, or the two additional columns, to go into my data. But I don't, right? For now, I don't want to do that. But it's a good call-out to notice if I have a good business justification or good business reasons to do so.

A

There you go. But it's calling you out right here: here's your table schema, these four columns, loan ID, funded amount, paid amount, address state, and here's my current data schema, right? I've got the timestamp and value. So it's warning me that there's a problem. All right, so let's instead just simply go fix this. Okay, all right.

A

So in this case, what I'm doing is that with the stream data, because I know the stream data automatically includes the timestamp and value because it's streaming, let me just go ahead and specifically specify a dot select where I'm only going to include the four columns. Now that I've done that, when I write this data down to my Delta path location, it'll only write it with four columns, not six. Okay.

A

So that's a great thing about Delta Lake, because there's both this concept of schema enforcement, i.e., we prevent data from going in that could potentially corrupt the data you have, and we also allow schema evolution. So if you have a good reason that you want the schema to change over time, we can merge the new schema. For the sake of argument, if you wanted to include timestamp and value, we can merge the new schema with the old schema, but it won't corrupt the data; the old schema will still be there. Okay, so that's what's pretty cool. All right.
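
The difference between enforcement and evolution can be illustrated with a toy model in plain Python. This is not Delta Lake's implementation: it only compares column names, whereas the real check also covers types and other details. The function, the exception class, and the column names are assumptions made for illustration; in Spark the opt-in corresponds to the `mergeSchema` writer option mentioned earlier.

```python
class SchemaMismatchError(Exception):
    """Raised when incoming data has columns the table does not know about."""


def resolve_write_schema(table_columns, incoming_columns, merge_schema=False):
    """Return the table's column list after a write, or reject the write."""
    extra = [c for c in incoming_columns if c not in table_columns]
    if not extra:
        return list(table_columns)           # schemas agree: write proceeds
    if merge_schema:
        return list(table_columns) + extra   # evolution: new columns are appended
    raise SchemaMismatchError(f"unexpected columns: {extra}")  # enforcement


# Hypothetical column names for the loans table in the demo.
table = ["loan_id", "funded_amnt", "paid_amnt", "addr_state"]
incoming = table + ["timestamp", "value"]

try:
    resolve_write_schema(table, incoming)
except SchemaMismatchError as err:
    print("write rejected:", err)

print(resolve_write_schema(table, incoming, merge_schema=True))
```

The rejected write leaves the four-column table untouched, which is the behaviour that protects the original 14,705 rows in the demo.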

A

So now let's go ahead and run this. It'll take a couple of seconds for it to kick in, but now we have the streaming query again. All right, so let's go ahead and kick this off. All right, boom. So now we have a dashboard; it's starting, it's trying to process records. So now it's trying to put data inside it. It's right now running about 54 or 55 records per second, okay, which is cool. All right, let's go back up here, and you'll notice right away

A

the number changed, to over 15,000. It's ticking up; it's actually able to put more data in. Okay, it's obviously increasing. But what's also cool about it is that because I'm using a Delta Lake table, I'm actually able to go ahead and run this again. All right, so in other words, I have a second stream. All right, so this is my stream query 3: same concept, same code. I'm just gonna be running it, and for the sake of argument,

A

this would be representative of two different sources, with the same schema and the same data, that are all trying to write to this table at the same time. And sure enough, right away stream query 3 is jumping up to 147 records per second, and if you go back up, you'll notice the numbers keep on steadily increasing here. Okay, so this is what we mean by simplifying: it's simplifying your data engineering pipelines.

A

The fact is that, whether it's a single source, multiple sources, whatever, if I decide that I need to put everything down into a single table, which is what this example is, I can, because I have multiple streaming jobs that can hit the table concurrently, and because I have ACID transactions, I'm protecting the data underneath this the entire time. All right, and let's go ahead and run this. Remember, I said that we were looking at loans underscore delta underscore readstream.

A

That's actually telling me what the read stream looks like, but I can also look at the batch. All right, so in other words, loans Delta: this is a batch table that's looking at the exact same source, right? And again, now I have a batch query read, which is telling me right now it's 23,455, and I've got two streaming writes, all happening at the same time, right? So this is how powerful it is, right? Because I can have multiple streams read and write, and multiple batches read and write, now

A

I can actually simplify everything, because everything's stored in just a single table, right? I can then organize my jobs to be streaming jobs, as opposed to just a bunch of batch jobs. Instead of having 20 batch jobs, I can potentially run a single micro-batch or streaming job to actually simplify things. Okay. So let's take a look at the file system underneath the covers, right?

A

Remember, so we're gonna stop all the streams, right? And then let's take a look at it. And so this is the Delta path that we were talking about before, right? What you'll have here is basically, if you'll notice, just a bunch of Parquet files. All right, so here we go. So it's the same thing you're used to working with in Parquet. The main difference is that there is this Delta log folder. Okay, and all of this is the original Parquet.

A

All these little small Parquet files are from the stream, and you can probably see this bigger, larger one: that's the original Parquet file that had the original data. All right, so if you look at the log, what is the log? Basically, it's showing you all the different versions of the data. So every single insertion has a JSON around it; that JSON describes and tells you exactly what happened inside that transaction. Sort of cool that way. Okay, in fact, let me go ahead and just choose this one.

B

A

You open up the JSON file and take a look at it. It tells you the timestamp, the commit information, who it was that did it (for example, it's me), what type of operation it was (like, it was a streaming update), and all the transactions, right? This is a transaction that happened, and all the values associated with that. Okay, all that's stuffed inside here.
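
To make the contents of such a commit file concrete, here is a small sketch that parses one. It assumes the log format of one JSON action per line, with action keys such as `commitInfo` and `add`; every value in `SAMPLE_COMMIT` (timestamp, user, file name, stream id) is invented for illustration and is not taken from the demo.

```python
import json

# A made-up commit with three actions, one JSON object per line.
SAMPLE_COMMIT = "\n".join([
    json.dumps({"commitInfo": {"timestamp": 1584000000000,
                               "operation": "STREAMING UPDATE",
                               "userName": "someone@example.com"}}),
    json.dumps({"txn": {"appId": "hypothetical-stream-id", "version": 12}}),
    json.dumps({"add": {"path": "part-00000-hypothetical.snappy.parquet",
                        "size": 1024, "dataChange": True}}),
])


def summarize_commit(text):
    """Collect who did what, and which data files the commit added."""
    summary = {"operation": None, "user": None, "files_added": []}
    for line in text.splitlines():
        if not line.strip():
            continue
        action = json.loads(line)
        if "commitInfo" in action:
            info = action["commitInfo"]
            summary["operation"] = info.get("operation")
            summary["user"] = info.get("userName")
        elif "add" in action:
            summary["files_added"].append(action["add"]["path"])
    return summary


print(summarize_commit(SAMPLE_COMMIT))
```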

A

That's actually what's inside that JSON. All right, so underneath the covers we're creating this ACID transaction to basically protect the data. Okay, and actually, to make it a little nicer, we create this history, so you just run DESCRIBE HISTORY and then you see all the versions of the data, so all the streaming versions, all of it, right here. Okay, all right. So before I jump into any more of this, let me go back to the slides. Okay, so let me do that. Oops, sorry.
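
From PySpark, the history lookup might be issued like this. It is a sketch that assumes an environment where Delta's SQL commands are available (Databricks, as in the demo, or an open-source release with SQL support); the function name and the path are hypothetical.

```python
def table_history(spark, delta_path):
    """Return a DataFrame with one row per version of the Delta table at delta_path."""
    return spark.sql(f"DESCRIBE HISTORY delta.`{delta_path}`")
```

Each returned row corresponds to one commit in the transaction log, so the streaming updates show up as separate versions.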

A

Okay, so let's connect those dots back together. Okay, since we've got a few minutes left for the presentation and I still want to leave some time for Q&A, let me connect the dots, right? Am I able to read consistent data with Delta Lake? I am able to because of the snapshot isolation between the writers and the readers, right? Just as you saw, I was able to go ahead and run multiple stream writes and reads all at the same time. All right, am I able to incrementally read from a large table?

A

Yes, I can, actually: I have an optimized file source with scalable metadata handling. Okay, this scalable metadata handling, actually, let me go back and just show you this real quick. Sorry, I gotta get this right. All right, you'll notice that there's this checkpoint Parquet here. Okay, so every tenth JSON inside here actually has a dot parquet beside it. All right.

A

So there's a question that just came in asking me, you know, what is that Parquet? They say they saw the JSON, but what is the Parquet? So let me answer that right now.

A

The Parquet here basically is to go ahead and say, all right, after every ten transactions I'm going to convert this to Parquet, so instead of trying to read each individual JSON file, Spark will be able to read the Parquet file, such that later on, if I shut down the cluster and then relaunch the cluster at a later date, instead of trying to read every single JSON file, I'm just gonna read the Parquet files. Okay, and so that'll make things a lot easier.

A

It's already in the format; it will stream into memory and, boom, you're okay. And so basically, instead of reading 31 JSON files, I'm gonna read three Parquet files, and then I'm gonna read the one additional JSON to be up to speed on what the current transaction is. Okay.
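
The saving can be sketched with a toy calculation of which log files a reader has to open. This is a simplified model under stated assumptions: a checkpoint is written every ten commits, log file names are 20-digit zero-padded version numbers, and a reader needs only the most recent checkpoint plus the JSON commits after it. The function name is hypothetical.

```python
def log_files_to_read(latest_version, checkpoint_interval=10):
    """List the log files needed to reconstruct the table at latest_version."""
    last_checkpoint = (latest_version // checkpoint_interval) * checkpoint_interval
    files = []
    if last_checkpoint > 0:
        # The checkpoint summarizes every commit up to and including this version.
        files.append(f"{last_checkpoint:020d}.checkpoint.parquet")
        first_json = last_checkpoint + 1
    else:
        first_json = 0  # no checkpoint yet: replay every JSON commit
    files.extend(f"{v:020d}.json" for v in range(first_json, latest_version + 1))
    return files


print(log_files_to_read(31))
```

With the latest version at 31, the model reads the checkpoint for version 30 and the single JSON commit after it, rather than replaying every commit from version 0.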

B

A

Right, perfect. Okay, and so am I able to roll back? Yes, you can. I actually have this concept of time travel. I don't have time in today's session to actually go ahead and show you time travel, but if you look at the previous one, it's on YouTube, right, excuse me, the previous tech talk, on getting data ready for data science, we actually show a lot about rollback and time travel. The idea is that, because you are able to see the history of the table, it's not just that you see the history.

A

The data of that table still resides there, so that means I can actually roll back to that previous view of the data which is pretty cool right.

A

So, for example, I can go ahead and roll back to, like, the nineteenth version of that data, right? So for the sake of argument, I could do something like...

B

A

I forgot how to write the version statement. This is what happens when you try to do stuff live. So, fortunately, I actually have the version statement down here, so I'm just gonna copy it.

B

A

A

So the good news of what I'm showing here is I'm actually showing you the full notebook. That full notebook actually shows you the rollbacks and everything inside here. So it's pretty cool, but let me go ahead and actually get the time travel correct.

A

A

Okay, you know what, with the time that's left, I'm probably gonna skip this, but let me go ahead and finish off what I was talking about and then I'll get back to the demo. Okay, cool, so back to presenting here. Sorry for that.

A

All right, so we can also replay historical data, right? Stream the backfilled historical data through the same pipeline, which is sort of nice, right? So, in other words, whether it's historical data or streaming data, I can actually merge the two concepts together without any problem whatsoever. And this is important because I want to be able to look at the data irrespective of whether it's new data or old data.

A

I want to be able to look at it without having to build a separate table for streaming and a separate table for batch. No. Instead, I can just go ahead and look at the same table that has both my batch data and my streaming data at the same time, which is sort of nice, right? So that's an important concept as well.

A

Okay, all right, and then, just as important, I want to be able to stream any late-arriving data and add it to the table as it arrives, right? This is basically this concept of, okay, I have data that's coming in, and when the data is coming in, I want to be able to go ahead and make updates to, or I wanna be able to do deletes on, that table.

A

Well, guess what? With Delta Lake I'm actually able to do that now, because I have ACID transactions protecting this data the entire time. Because I'm able to protect the data without any problem, then sure enough, I'm able to go ahead and reprocess it. And this goes back to why we talked about this bronze, silver, gold concept, because that way I can go ahead and reprocess the data from the original bronze table: delete it and reprocess it all over again.

A

Okay, all right, so altogether, this is how we can build this Delta architecture: Delta Lake allows us to do all of these things, to simplify our data engineering pipelines. So, some quick questions. Who's using Delta Lake? All right, well, there are tons of organizations, tons of customers. Last month there were actually two exabytes processed, not one, two exabytes processed in the last month alone, right? So I talked about the Comcast use case.

A

There are many other customer scenarios that you can check out from the Spark + AI Summit, which actually has that information. Okay, oh, here we go, I did include the slide. So, for example, Comcast was doing sessionization with Delta Lake for petascale jobs: in order to improve reliability, they had these petascale jobs that they needed to make more reliable by using Delta Lake.

A

Not only did they get the 84 jobs down to three, okay, they also halved the data latency, and just as important, if not more important, they actually had 10x lower compute: instead of 640 instances, they were able to knock it down to 64, which is pretty cool, right? All right, so how do you use Delta Lake? To use it, you can basically add a Spark package, or you can use Maven, and, like I said, with DataFrames, instead of dot format Parquet, just switch to dot format Delta, and bam, now you're using Delta. Okay.

A

So you can build your own Delta Lake right now if you want to. All right, and then I actually wanted to just do the time travel little trick, now that I remember what the versioning is. It helps if I actually go ahead and...

A

So, hopefully it helps if it's Delta.

A

Okay, so this is actually supposed to run against a Delta table, so I guess in this case I screwed it up. So my bad on this one. I'll have to give you a different version of the code base in a second, when I actually have a chance to go ahead and switch to the API; it looks like the SQL context actually isn't working. So my bad on that one, ha ha. All right, nevertheless, okay, we'll go back to the session here. All right, so you want to use this notebook?

A

Well, this notebook actually is relatively straightforward. You just download the notebook at dbricks.co/sais-eu19-delta. All right, we're actually gonna go ahead and put this link inside the YouTube channel, and also, anybody who signed up for the session will get to go ahead and log into it, so either way, I think you'll be good to go. By doing that, I think you'll be able to go ahead and try it out and give it a shot. Okay.

B

A

I did want to call out some quick other things, right? Delta Lake, as you can see here, is rapidly becoming a standard, okay? And because it's rapidly becoming a standard, there's a question here: will it work with Hive? Yes, in fact, we're working with the community right now to go ahead and figure out how to work with Hive. So this is currently in private preview, and now that it's in private preview, we'll make it to public preview shortly.

A

But the idea is that we'll be able to go ahead and make it available for Hive to work. Okay, give me one second here. Like I noted in the Delta Lake 0.5.0 blog, we're actually gonna be able to go ahead and showcase it working with Presto and Athena. We're also working with Redshift and also Snowflake. And this is just the current number of connectors that we have right now.

A

Okay, so we will have more shortly, but as you can tell here, there are actually a great number of connectors already able to work with Delta Lake, which is pretty sweet. Okay, also Delta Lake providers: those partners and providers are already working with Delta Lake, as you can see here.

A

Down below, you can also work with Privacera, Attunity, Talend, StreamSets, Qlik, WANdisco, and Informatica, and in addition to that, Google Dataproc. Google Dataproc recently made an announcement that it will be able to work with Delta Lake as well. We're expecting more and more announcements to come out shortly, so basically stay tuned and we'll be able to showcase that.

A

Okay, and then users of Delta Lake: this is just a small sample of that, but there are a lot of cool examples of customers that are currently using Delta Lake. Whether they're using Databricks or not using Databricks doesn't matter; they're all using Delta Lake. So it's a pretty cool thing, right? So, saying that, I'm gonna leave a few minutes to go ahead and answer some questions.

A

But if you want to listen to this session again, you can subscribe today by using dbricks.co/youtube to re-listen. And in between all of that, I did want to go ahead and rerun that demo, because I kept on screwing up. So you'll notice originally I have, let's see, here we go, at least 23,000 inside my loans Delta table, so I'm gonna go ahead and run this little code snippet.

A

Instead, this will show what version 19 looks like, so now I'm gonna go run that. Okay, and this is at the nineteenth version of the stream here. Okay, so right here I'm actually down to 22,455. What did it look like when it was version 0? I'm gonna run that again. So this is when the table was initially created the first time, okay, and it's 14,705, right? So this is this concept of time travel that I was talking about.
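
A version-pinned read like the one in the demo might look as follows in PySpark. This is a sketch rather than the notebook's snippet: the function name and path are hypothetical, while `versionAsOf` is the Delta reader option for selecting a table version.

```python
def read_version(spark, delta_path, version):
    """Load the Delta table at delta_path as it looked at the given version."""
    return (
        spark.read.format("delta")
        .option("versionAsOf", version)  # e.g. 19, or 0 for the initial table
        .load(delta_path)
    )
```

Calling `read_version(spark, delta_path, 0).count()` against the demo table would correspond to the 14,705-row result for the freshly created table.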

A

So my apologies for the delay here, but that's it: the concept that you're able to run time travel to roll back. It's also very handy for GDPR purposes, and that's also, like I said, a great segue for us to talk about next week: we actually have a session on how to address GDPR and CCPA using Delta Lake. All right, so a few questions left, so let me go ahead and try to answer them. If you do have questions, please place them in the Q&A.

A

The first question I want to answer is: can I use Delta Lake for OLTP workloads? And the quick answer is that it's not designed for OLTP purposes. It's really designed for data warehousing or BI-type queries, right? The data warehouse and data lake workloads. If you want to use it for OLTP, I would actually just suggest an actual OLTP system. We're not trying to simplify that part of the process; we're trying to simplify the process where you get to analyze the data and go ahead and run machine learning against it.

A

Okay, and I believe there was also another question: can we use this with Hive? As noted, the Hive connector is currently in private preview; it will become publicly available shortly. All right, and can I use Delta Lake on-premises? That's a great question, and you absolutely can use Delta Lake on-prem. Okay, Delta Lake itself is basically, for all intents and purposes, just a JAR that you add to your Spark process.

A

Okay, so, for example, you use dash-dash packages, so you can basically include the JAR or include the Maven coordinates, or basically use Maven if you're compiling, or SBT, or whatever else. But the point is that once you've done that, the Delta JAR is included. So if you've got an on-premises Spark environment, you absolutely can go do that: just include the JAR, the latest Delta Lake JAR that matches your Spark version and your Scala version, and then you're pretty much good to go.
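
As a small illustration of lining up those versions, the sketch below builds the Maven coordinate passed to `--packages`. It assumes the `io.delta:delta-core_<scala version>:<delta version>` naming used by Delta Lake releases around the time of this talk; the helper names and the example version numbers are illustrative, so check the release notes for the combination that matches a given Spark build.

```python
def delta_package(scala_version, delta_version):
    """Build the Maven coordinate for a Delta Lake release (assumed naming)."""
    return f"io.delta:delta-core_{scala_version}:{delta_version}"


def packages_args(scala_version, delta_version):
    """Arguments to append to a pyspark, spark-shell, or spark-submit command."""
    return ["--packages", delta_package(scala_version, delta_version)]


print(" ".join(["pyspark"] + packages_args("2.11", "0.5.0")))
```

The Scala suffix has to match the Scala version the Spark distribution was built with, which is the matching requirement described above.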

A

Okay, all right, and I'm gonna finish this up with this question: how likely is it that we can use this in production? And the fact is that this project, Delta Lake, has already been in production for the last two years. Okay, when we originally created Delta, the backstory was that we were trying to address some of the goofs, like "I made a mistake," errors that people made in their data, and we were trying to address the issues that we had with streaming.

A

Okay, so because we were running into those issues, we actually built Databricks Delta as, originally, a Databricks project, which is sort of an add-on to Spark. But due to its popularity, due to all the questions asked, due to everything that people really wanted to see, we realized that this project made a lot of sense not just for the Spark community or, for that matter, Databricks, but for the data engineering community as a whole. That's why we open-sourced the project last year.

A

Actually, so it's an open source project: you can use it on-prem, you can use it in your own environment, EMR, HDInsight. You just have to make sure everything's matching. As long as you're using the correct versions of Spark, lining up the Spark version and the Scala version, you will be able to use it. Okay, now, saying all this,

A

if you want to learn more about how we got here, in terms of how we can use it in production, things like that, the other thing I would ask you to do is go to dbricks.co/youtube, the Databricks YouTube channel, and check out The Genesis of Delta Lake. This is where I interviewed Burak Yavuz, senior software engineer at Databricks, who was part of the creation of this project, right? So you can learn a little bit more about the backstory behind it as well. Okay, all right. Well, that's it

A

for today. I realize that there are other questions; I apologize I couldn't get to all of them, but I want to be cognizant of the timing here. So I do thank you for your time attending the session. Please do, number one, be patient with me for my mistakes in getting, in particular, the time travel query done for you, but at least I got it done, so thank you

A

for that. Number two, if you've got comments, go ahead and ping me directly at my Twitter handle, @dennylee, or just as likely go ahead and go to the Databricks YouTube channel and chime in: put your comments directly on the YouTube channel. I will regularly log in and answer those questions. And finally, also go to delta.io. That's the Delta Lake page, okay? So the latest videos, the latest notebooks, tutorials, all of them are actually there. So please go ahead and attend.