Description
In this session, Yeshwanth Vijaykumar, Senior Engineering Manager and Architect at Adobe, and our host Denny Lee discuss how the data lakehouse architecture at Adobe Experience Platform combines with the Real-time Customer Profile architecture to increase Apache Spark batch workload throughput and reduce costs while maintaining functionality with Delta Lake.
Quick Links
Read Our Newest Blog Post: https://delta.io/blog
Yeshwanth Vijaykumar: https://www.linkedin.com/in/yeshwanth-vijayakumar-75599431/
Denny Lee: https://www.linkedin.com/in/dennyglee/
Join us on Slack: https://go.delta.io/slack
Join the Google Group: https://groups.google.com/forum/#!forum/delta-users
A: So we're currently live streaming now. Just give us a couple of minutes. If you're wondering what this session is, and what this random set of people is doing here: you're actually currently logged into Massive Data Processing in Adobe Experience Platform Using Delta Lake. So, like I said, give us a couple of minutes to get ourselves all set up and established. We're live casting to both LinkedIn and YouTube, so that's being set up as we speak.

A: Meanwhile, why don't you tell us where you're based out of, where you're from? My name is Denny and I'm based out of the Evergreen City, i.e. Seattle. I'm super excited that Geno Smith got re-signed, so yes for Seahawks football; I had to toss that in a little bit. Yesh, do you want to go ahead and kick us off here?
A: Actually, how cold is it right now in San Jose?

A: Why don't you tell us where you're based out of? You hear Yesh is from San Jose, myself from Seattle. Why don't you chime in while we finalize logging into LinkedIn and YouTube; it looks like it's being set up right now.
A: All right, I think we're good to go. Excellent. All right then, let's start the show. Oh, we've got somebody from Spain: Pedro from Spain, that's awesome. We have Goro from Brussels; oh, I love Brussels, by the way, a huge fan of that town. And hey, Pedro, you said you're from Spain, but whereabouts? Just curious. Oh, Valencia, nice. Okay, that's pretty sweet! Okay, let's start off the show, so, just to provide context.
A: We are currently live casting on the Linux Foundation Zoom; we're also on the Delta Lake LinkedIn and the Delta Lake YouTube. So today's session is Massive Data Processing with the Adobe Experience Platform Using Delta Lake. Saying that, I want to have Yesh introduce himself and provide a little background. So tell the audience who you are, and actually, how did you even get into the data engineering space? You can even go back to college if you want to.
B: Sure. So I'm Yeshwanth Vijaykumar, currently an engineering manager at Adobe. I'll start all the way back at college and so on. I started off as a machine learning engineer, doing churn prediction and such at Ericsson, then was building machine learning models for ads at Yelp, and then I moved to Adobe. Then I became all things distributed systems: distributed databases, distributed computation frameworks. So I kind of became a generalist, having started off from the machine learning side, because how do we get data into the models, right?
B: For us to build the models, a lot of the time, most of the time, is probably spent trying to clean up the data or trying to get the data into the right shape. Building the model is much easier; the time you spend on the modeling pipeline is probably much less than what you spend on the data side. So that's how I got into the data space.
A: Yeah, let's talk a little bit about that. I actually want to go backwards a little bit, because I don't know this about you. So, for starters: I completely relate on the ads-model side, because I was actually doing a very similar thing. This is going to age me now, but I actually helped with some of the display ads at Bing, initially, when we were writing Perl scripts to go ahead and process data.

A: So exactly to your point: we were building models, we were using God knows how many different programs at the time; it's even worse than what you think it was. We would use Perl to actually try to do our data pipelines. And exactly to your point: you start off trying to do machine learning models, and then you end up spending the vast majority of your time on data cleansing and data pipelines, just making sure the data is even at a point where you can actually generate the actual model. So I'm just curious: when you joined Yelp, were you actively fighting that process?
B: No, actually, Yelp had a very good infrastructure setup; I think a lot of smart folks had set up a lot of things. So, I hated Java.

B: And I still love Scala, by the way, before any pitchforks come out. Yelp had this amazing thing. At Ericsson, we'd had a problem where we had to set up our own clusters; we literally set up a mini data center within the office to try to simulate some stuff. Yelp, at the other end, was full AWS, so there was a lot of EMR going on, but you still had to write the goddamn job.
B: You got to write Python, right? All the stuff was in Python, so that was amazing. You were still doing all the Hadoop-y stuff, but it was in Python. Of course, overhead-wise it's going to cost you, but when you have so many thousands of cores to throw at it, who cares, in a way? Of course, the AWS bill will care, but it was good from that point onwards.
B: But then, as you go out, it's not as structured; data teams were not as invested in, I would say, even eight years back. Now you see a much larger focus on that area. But getting to the question: at least I didn't have to do too much; a lot of these things were set up for me. But then I got out of Yelp, and that's when I noticed: oh, okay, this is not something that's common everywhere.
B: So that's what got me into it, so to say: okay, I need to actually understand what's happening at the infra level, at the storage level. Somebody's not going to solve this for me; I need to go and understand exactly how the data is stored, what the data is, and how I can get it in the most efficient way. So that's what you could say was my inception into the whole data engineering space.
B: I would say not the first, but it's a completely organically built platform focused on getting the user data, the marketing data, first-party data and second-party data, from the various customers that are there. For us, for example, Target could be a customer, or Nike could be a customer, and then they bring their data into our platform.

B: And then there are even more critical experiences, like the emails that you get, say, from your airline saying, hey, I'm targeting you for some points, or discounts, etc. A good amount of these daily interactions flow through the Adobe systems, particularly the Adobe Experience Platform. It's all the way from data collection and assimilation to, you know, weaponizing your first-party data into customer experiences. That's what the Adobe Experience Platform is about.
A: So basically it's all about personalization and recommendation: being able to take that information and combine it with other pieces of information, so that whichever customer you've got can personalize specifically for their users. Gotcha, okay. Exactly. Well then, why don't we start with what it was like when you first started with the Adobe Experience Platform, because it didn't start with Delta Lake. I mean, we got there, which is great; obviously, I want to talk about that.
B: Quite a few challenges. When you join an already mature product or team, it's much different: a lot of the groundwork has been put in for you. In the case of AEP, this was being built from the ground up, so a lot of the best practices, even infrastructural best practices, had to be laid down from the beginning.
B: With respect to the challenges that I personally faced: even getting a proper Spark cluster was hard. We had our own homegrown thing, which was built on top of... I don't know if everyone's on the Kubernetes train right now, but I don't know how many people remember Marathon, DC/OS, and Mesos.
B: If you've tried working with that... I self-hosted a DC/OS cluster for one of the teams I worked with, and guess what, it's not good for your hairline; you can see the remnants of that. So trying to even get a simple Spark cluster set up was a problem. And then, we are primarily on Azure, so that was another thing: ADLS Gen 1, the Data Lake Gen 1, wasn't as scalable as what Gen 2 is right now.
B: So we had to go discover the issues that were there. But in terms of the actual data storage problem: my team, the Unified Profile team, is responsible for ingesting this fire hose of data from everybody. Every single click, app link, whatever you do, is getting fed into the system in real time. We were initially trying to model it with HBase, and that meant HBase on Azure.
B: At least, whatever was touted to be a managed service... the HDInsight version was not exactly something that you could go to war with. So we were trying to build that on top of Azure ourselves, and that didn't work out. Then we moved to a managed NoSQL store.
B: Okay, yeah. So tears and blood were spent on those things, but all those were learning lessons: how far can we push something, what works, what does not work. Because for every lesson where we found something was not working, we also found some good things that we should probably carry forward. But that was basically the state of it when we were trying to build and explore.
B: At the same time, there were a lot of issues on that front. Storage-wise, we then moved to our own managed NoSQL store, and that worked out great; in terms of scaling out and all of that, it worked out pretty well. But managed NoSQL stores, on any cloud, are going to be very, very expensive; that's where they bleed you the most. And then there's the way you have to model the data.

B: That's another thing. In the case of our product offering, we are in a very weird space, where we also do real-time transactional operations: you have hundreds of thousands of requests per second, either ingesting data or querying data, so point lookups. But at the same time, the majority of our workloads are the Spark-based scan workloads.

B: So you could maintain two different ways of doing it, but there was no Delta back then, and whatever we would have had to build analytics on top of would not have worked out for us in terms of latencies, with respect to our data packing. So we ended up consolidating, at least in the build phase: let's have one store, let's try, to the best of our abilities, to build a Spark connector on top of that same store, and we'll also do the transactional loads on top of it. Now, that obviously has limitations, but I'll stop at that point. That's basically how it was when we were building out this entire thing, and the challenges we faced.
A: Got it. So, in a nutshell, from this perspective, you were about to shift into a solution in which a sizable chunk is your traditional batch-processing type, where a standard Spark cluster could actually query and get you: okay, this number of people were recommended X, or saw these ads, or these events occurred, yada yada yada. Standard stuff, basically that approach, okay.
B: Yeah, in terms of the functionality: you can classify it as a scan-based system, an analytical store with a lot of use cases catering to that side, versus a single-transaction, point-lookup kind of scenario.
A: Right, okay. So then, before we talk about the actual solution, how did you go about reasoning what the solution to the problem would be? I think a lot of people here would like to understand how you got to that point and what the gotchas were, because this is actually, unfortunately, quite common.

A: This idea that you have this scanning query access pattern, à la Spark, and then there's a hot-store type of access where you actually have to run these point-in-time queries. So how did you go about reasoning that? Were you prioritizing one over the other? Did you basically have to treat both as the same priority? I'm just curious.
B: We were kind of putting it in that CDP ecosystem; we were kind of there before a lot of the remarketing of that category happened, so there was a lot of, I would say, push for us to get there quickly too. We couldn't wait for the perfect solution; we needed to make things work. Our priority was: we know we want this, and we know we want this, and we cannot trade one off and say, oh, we're not going to give this part for the next one.
B: So we were willing to take on that architectural debt: let's make this work, and as long as we are spending less money than what we're bringing in, we're kind of okay with it. Let's evolve the architecture, and let's make sure we have enough escape hatches built in so that we can get out of this architecture as the tech improves. And that kind of paid off.
A: Perfect, okay, this is excellent. I love your flow. So basically you had to solve both the hot store and the batch store at the exact same time, knowing that, as long as whatever you do ends up lowering the cost, you're basically not losing money running the system. It's a migration path; it's not a flip-the-switch, it's very much a pathway, where you're not even sure what the end goal looks like yet.
A: You just know that you're going to iterate through this. This is excellent; I love that piece of advice, because a lot of people are often looking for "I know exactly what the solution is," and my usual response is: I can't tell you what the technology is able to do six months from now, let alone a year from now. That's the wrong way of looking at the problem.
B: I would love to say yes; it would make me look like a genius, but no, it was the entire team, not just me; there are a lot of smart folks on the team. We all came to the same conclusion: look, we don't know what might be there in five years. When we were designing this, there was no Delta, there was no Delta Lake, so we didn't know something like that would come along. In terms of our plan B and plan C, it was: okay, how can we make our representation of the data more lightweight?

B: Can we optimize on that? We knew we could make some headway there. But then you have literally groundbreaking stuff coming out that changes the equation completely, and at that time it makes you look really good; you're like, oh, that's great.
A: Two years from now, and you would magically be able to make use of it? No. This actually is a really good lesson for a lot of people, because this is what I constantly try to remind folks: you design things so you can plan for a future that's different, as opposed to designing for things that you think are going to happen, because more times than not, whatever you think will happen isn't going to happen anyway. True. Okay! So let's talk about those lessons learned. So now you've joined.
A: You have this awesome scenario where you have to actually do two competing things at once. Because if we go into the über details, this will be a five-hour-long conversation, and we only have about 30 more minutes. So why don't we break down some major milestones of how you got there?
B: Right. So, actually, this ties back into the previous point. One of the escape hatches we were talking about: very early on, right from the V1 of our system, we took an event-sourcing design, which was meant to feed into other parts of the platform, because we are kind of the hub at the center of the Experience Platform, powering a lot of the applications, depending on what app it is.
B: So what we thought was: okay, let's be a bit event-driven, so that we can at least lessen a bit of the load on our system, and that decision the team took probably paid off the most in terms of the migration. What I mean by event sourcing here is: for any database, any NoSQL DB that we provisioned, we basically built our own CDC, or change data capture. So every time a mutation happens to this primary store, this single store that we have...
B: ...it would emit a notification of the change to our centralized Kafka topic. So basically, even if you have 1,000 customers or 1,000 DBs, all the notifications are fanned in to a single fire hose. Anybody listening downstream of us knows when something has changed and what has changed, specifically on a per-row-level basis. Think of this as a row-level change feed on any of the databases that we have, like logical replication on the Postgres side.
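The event-sourcing design described above, where every provisioned database emits row-level change notifications that fan in to one centralized feed, can be sketched in plain Python. This is a minimal illustration, not Adobe's implementation; all class and field names here are invented:

```python
from dataclasses import dataclass

@dataclass
class ChangeEvent:
    # One row-level change notification, akin to a CDC record
    db: str        # which provisioned database emitted the change
    key: str       # primary key of the mutated row
    op: str        # "insert" | "update" | "delete"
    payload: dict  # new row contents (empty for deletes)

class ChangeFeed:
    """Single fan-in 'fire hose': every DB publishes here, and
    downstream consumers read one ordered stream, Kafka-style."""
    def __init__(self) -> None:
        self.events: list[ChangeEvent] = []

    def publish(self, event: ChangeEvent) -> None:
        self.events.append(event)

    def consume_since(self, offset: int) -> list[ChangeEvent]:
        # Each downstream system tracks its own read offset
        return self.events[offset:]

feed = ChangeFeed()
feed.publish(ChangeEvent("db-customer-a", "user-1", "insert", {"email": "a@b.c"}))
feed.publish(ChangeEvent("db-customer-b", "user-9", "update", {"tier": "gold"}))

# A consumer starting from offset 0 sees changes from every database
assert len(feed.consume_since(0)) == 2
```

The key property is the fan-in: no matter how many source databases there are, a downstream consumer subscribes to one stream and sees every row-level mutation.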
B: So we have a bit of agnosticity here: I have a particular data model, I have modeled all the changes, and I have a way to consume those changes in one particular way. So that was the first part. Before we decided to migrate or move or replicate anything, we hardened that system to verify that the replication behaves the same. And then now comes the milestones part.
B: So now we have the transport layer built in: oh, something changed in the current source of truth. Then we started building the layers on top: step one, let's just duplicate everything row by row, a mirror of the data that we have. The advantage of that is that it makes our verification process very easy; we could have chosen to have a different data model, but...
B: Right, so our first milestone was to simplify that entire process: if I have a hundred rows on my left-hand side, which is the source of truth, I'm going to have 100 rows on my right-hand side. The guarantee that this replication system gives is very simple: within a particular time bound that we can measure, we're going to replicate all the rows from A to B. And now this becomes a much more easily verifiable thing, through hashes and Jaccard similarity and things like that.
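The hash-based verification mentioned here can be illustrated with a toy check that the replica holds exactly the same rows as the source of truth. The content-hashing scheme below is an assumption for illustration, not the actual Adobe scheme:

```python
import hashlib
import json

def row_hash(row: dict) -> str:
    # Stable content hash: serialize with sorted keys so that
    # field order does not change the digest
    blob = json.dumps(row, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

# Left-hand side: source of truth; right-hand side: the replica
source = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
replica = [{"id": 2, "v": "b"}, {"id": 1, "v": "a"}, {"id": 3, "v": "c"}]

lhs = {row_hash(r) for r in source}
rhs = {row_hash(r) for r in replica}

# 100 rows on the left means 100 matching rows on the right:
# an exact set match says every row arrived and nothing extra did
assert lhs == rhs
```

Comparing sets of content hashes, rather than the rows themselves, keeps the check cheap and order-independent.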
A: So, just to reiterate the point: this replication is of the changes. In other words, it's not that, for the sake of argument, if there were a billion rows in the source of truth, you're replicating a billion rows; what you're doing is saying: no, no, I'm focusing on the CDC, the change-data-capture side of the house, and I have the verification. You've basically hardened the replication so that you never have to rebuild the whole system; you only need to grab the 100 rows, or whatever it is, at that point in time.

A: That time window, that batch window, basically ensures that those 100 rows came across, and whatever you do downstream doesn't really matter at that point, because you've validated that the 100 rows that came through are exactly what you were expecting. Exactly, exactly. Okay.
B: Okay, so that was our first milestone, because we wanted to narrow down the problem. Now for the second milestone: the first thing was the replication, and the second milestone was to get the actual...
B: ...workloads that were dependent on the hot store to move on top of this. But there was a big advantage now, because we had narrowed the scope to a one-to-one match: every row on the LHS equal to every row on the RHS. Now for the Spark workload, for those familiar with Spark (and I'm assuming everyone here is), in terms of our reader, whatever df.read we do...
B: ...in terms of migration: if you have 10 different workloads that are dependent on this hot store, we are not rewriting 10 different workloads. One team writes the mapping function to make sure the rest of the job thinks it's just talking to the hot store, because the job doesn't care; it just cares about what input schema it's getting at the top of the pipeline. The rest of the flow remains unchanged.
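A thin mapping function of the kind described, one that makes rows read from the Delta Lake replica look like the struct schema downstream jobs expect from the hot store, might look roughly like this; every field name here is hypothetical:

```python
def map_delta_row_to_hotstore_schema(delta_row: dict) -> dict:
    """Adapt a row read from the Delta Lake replica into the
    struct schema downstream jobs expect from the hot store.
    Only this one function changes; the rest of each pipeline
    is unaware the storage layer was swapped underneath it."""
    return {
        "profileId": delta_row["profile_id"],       # rename field
        "attributes": delta_row.get("attrs", {}),   # default when absent
        "lastModified": delta_row["_commit_ts"],    # remap metadata column
    }

delta_row = {"profile_id": "u-42", "attrs": {"tier": "gold"},
             "_commit_ts": "2022-01-01T00:00:00Z"}
hot_row = map_delta_row_to_hotstore_schema(delta_row)
assert hot_row["profileId"] == "u-42"
assert hot_row["attributes"]["tier"] == "gold"
```

In a real Spark job, the same idea would be applied as a projection over the DataFrame rather than per-row Python, but the contract is identical: one adapter, zero changes to the consumers.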
A: So, for all intents and purposes, as opposed to 10 different replicas, you're basically multicasting it in memory, so each of those 10 jobs (using the magical number of ten) is able to access that data in memory, as opposed to it being written in storage, with the 100 rows or whatever the number of rows is.
B: How do you validate it... yeah, sorry, it works slightly differently. Think of it like this: you don't have to go into 10 jobs; just think of a single job which reads, let's say, a particular schema of interest. That was from the hot store: whatever struct schema I was reading, it was reading from the hot store. Now, when we switched this job over to use the Delta Lake replica as the source, we need to ensure... well, we have two options now.
B: The other option we have is to just write a very thin mapping function to make sure that the data coming from Delta Lake looks exactly the same as whatever the job was reading from the hot store. From a milestone point of view, that reduces the amount of friction; otherwise we would need to get multiple teams to align on using the Delta Lake store instead.
B
I
would
say
blockages
that
you
have,
one
is
I
can
say:
oh
look,
Delta
lake
is
10
times
faster
than
reading
from
the
hot
stone
right
people
will
be
very
happy,
but
then
there's
also
the
human
element
or
the
engineering
element.
Oh
by
the
way,
it's
going
to
take
you
six
months
to
Port
your
workload
to
work
on
top
of
the
Delta
Lake.
Sure
sure
right.
A
Right
so,
in
other
words,
you've
abstract
for
the
downstream
systems,
you've
abstracted
away
the
problem,
they
don't
know
they're
occurring
whatever
it
doesn't
really
matter
to
them.
Correct
right,
okay,
got
it.
Okay,
so
basically,
you've
got
this
hot
store.
You've
placed
it
into
Delta,
Lake,
you've
built
this
mapping
functions,
so
the
downstream
systems
doesn't
actually
have
to
care,
which
is
great.
Is
everything
solved
or
no.
A
So
why
don't
we
talk
a
little
bit
more
so
so
this
this
gives
us
the
idea
of
being
the
extractor
way
and
minimize
the
downstream
Engineers
what
they
actually
had
to
do
in
order
to
be
able
to
query
the
same
data
right
okay,
so
this
is
great
right,
so
you've
basically
you've
stabilized
it,
but
then
and
you've
made
it
easy
for
Downstream
systems,
but
obviously
there's
a
whole
bunch
of
other
problems.
So
let's
get
into
that
now
so.
B: So now we have to go into the NoSQL itself. Take any NoSQL store that's there; it has a concept of multiple partitions. Looking at the status quo: when our Spark jobs go talk to the NoSQL store, the parallelism of the read is going to be proportional to the number of partitions that the NoSQL store has, because if you have only 10 partitions, I can throw 100 cores at it...
B: So now you have twice as much, or n times as much, the more sub-partitions you create, which means you're doing more things in parallel, which means you're going to incur more cost in terms of the higher concurrent operations that are going to happen. So we solved the time problem by parallelizing it, but guess what, we had to pay through the nose in order to actually get the functionality that we needed. So right now it was like, okay, fine.
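The point about read parallelism being capped by the store's partition count can be stated as a one-liner (a simplification that ignores sub-partitioning):

```python
def effective_read_parallelism(executor_cores: int, store_partitions: int) -> int:
    # No matter how many Spark cores you throw at the scan, each
    # store partition is a single read stream, so the partition
    # count is the ceiling on useful parallelism.
    return min(executor_cores, store_partitions)

assert effective_read_parallelism(100, 10) == 10  # 90 cores sit idle
assert effective_read_parallelism(8, 10) == 8     # cores are the limit
```

Sub-partitioning raises the ceiling, but, as noted above, every extra concurrent reader is billed as extra provisioned throughput on a managed NoSQL store.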
A: Yeah, the data...

B: So when some action, a personalization action, is happening on me, the data has to be combined from across both of those individual fragments before a decision can be made, because if you don't include all of it, you're going to have the wrong logic getting fired in the downstream system. So the problem was shuffle; if we were to characterize it, shuffle was the problem. Those were the two main problems that we had. So now comes one of the points that you were mentioning, with the CDC approach.
B
We
basically
had
an
incremental
handle
on
top
of
what
changed
now
with
just
the
raw
Delta
Lake
mirroring.
We
basically
increased
our
read
throughput
right
because
the
data
stored
in
Delta
Lake
in
terms
of
compression,
we
see
anywhere
between
10
to
15
x,
compression
from
what
we
store
in
the
this.
This
one,
so
our
read
performance
has
gone
up
because
more
partitions,
more
core
usage,
but
then
another
important
part
was
our
main
workflow,
which
is
assimilating
these
profiles.
So
now
we
are
doing
this
on
an
incremental
basis.
B
So
now
Delta,
because
you
have
the
change,
feed
and
all
that
stuff
that
you're
going
to
put
in
in
terms
of
like
only
when
the
comment
happens
in
a
particular
table.
Is
it
going
to
the
change
fee
is
going
to
be
there,
which
basically
serves
as
a
notification
for
another
system
that
we
have
to
say
like
okay,
incrementally
materialize,
whatever
you
have
right
right,
it's
not
possible
right,
as
in
I,
wouldn't
say
not
possible.
B
If
you
throw
enough
money
at
something
you,
it
is
possible
even
with
the
hotstar,
but
then
the
the
problem
trying
to
do
a
materialization
like
that
is
because
there
are
like
multi-level
joints
that
are
happening
across
spark
and
a
nosql
store.
So
now
you
have
exactly
plan
a
query
like
this
right,
but
then
now
what
happens?
The
materialization,
whatever
the
incremental
materialization,
that
you're
doing,
is
within
the
same
Delta,
Lake,
Universe
right
right,
so
like
parque
parque.
B
So
in
terms
of
spark,
it
knows
exactly
how
to
optimize
this
incremental
materialization
pretty
well
so
right,
the
change
feeds
so
Delta.
Of
course,
thanks
to
parquet,
we
get
the
compression
and
all
the
statistics
and
everything
we
are
able
to
do
the
replication
much
more
efficiently,
but
then
because
of
change
feeds,
it
enables
us
to
do
incremental
materialization,
so
that
solves
the
second
problem.
That
I
was
talking
about.
Okay,.
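Incremental materialization driven by a change feed, as described, can be sketched without Spark: fold only the rows that changed since the last processed commit into the materialized view, instead of rescanning the whole table. The record layout below is illustrative, not Delta's actual change-data-feed schema:

```python
def apply_changes(materialized: dict, changes: list[dict]) -> dict:
    """Fold one batch of change-feed records into a materialized
    view keyed by row id: upserts overwrite, deletes remove."""
    for c in changes:
        if c["op"] == "delete":
            materialized.pop(c["id"], None)
        else:  # insert or update
            materialized[c["id"]] = c["row"]
    return materialized

view = {"u1": {"score": 1}}
# Only the delta since the last commit is read, not the full table
changes = [
    {"op": "update", "id": "u1", "row": {"score": 5}},
    {"op": "insert", "id": "u2", "row": {"score": 2}},
]
view = apply_changes(view, changes)
assert view == {"u1": {"score": 5}, "u2": {"score": 2}}
```

The cost of refreshing the view scales with the number of changed rows per commit, not with the total table size, which is the whole point of the change feed.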
A: No, that's actually really cool. So basically you're able to leverage CDF, the change data feed, to solve that second problem, and then it also technically solves part of the first problem, because the read hits are no longer on the NoSQL store; they're on Delta Lake, and because it's all a bunch of Spark queries anyway, it's not that big of a deal for you. Spark can basically cache it in memory on its own anyway.
A
If
you
have
to
query,
run
multiple
concurrent
queries,
it
doesn't
really
matter
it's
hidden
off
the
Delta,
just
like
you
said,
you're
actually
able
to
leverage
the
calm
stats,
the
compression
algorithms,
the
snappy
code.
So
all
that's
basically
put
in
play.
So
this
is
this,
makes
your
life
easier
and
your
Reliance
on
the
nosql
store
has
reduced
not
so
much
on
functionality,
but
so
much
on
cost
like
we're
able
to
reduce
the
cost.
That's
that
because
you
no
longer
have
to
throw
that
much
concurrency
on
the
nosql
store
exactly.
B
We
were
kind
of
hacking
it
or
trying
to
use
it
for
our
own
needs,
and
what
do
you
say
our
own
timeline
we're
trying
to
fit
our
timeline
with
the
solution
right?
So,
but
now
we
don't,
we
can
we
have
two
systems
which
can
be
kept
in
sync.
We
know
that
right
and
each
of
them
do
what
they
do
best.
So
that's
kind
of
right.
It's
done
something
yeah.
A
Like
choosing
the
right
tool
for
the
job
right
and
right
right,
there's
no
SQL
stores
I've
tried
to
basically
do
operational
type,
queries
on
that
and
like
for
the
occasional
one.
It
actually
makes
a
lot
of
sense
too,
because
if
you
need
it
immediately
because
like
for
example,
you're
debugging
something-
and
you
need
it's
a
customer
scenario
where,
like
like,
we
haven't
even
transformed
the
data
yet
we're
just
simply
trying
to
like
hey.
Oh
no,
we
need
to
deal
with
it
right
away
because
we
got
a
debug
call.
A: The concurrency cost alone is going to basically kill you. Oh, actually, related to this, there's a great question on LinkedIn, and I'll pose that question right now. I believe I'm saying your name correctly; I apologize if I'm not. The question is: how can we do hash comparison for finding the new changes? In this case, they're trying to better understand the new changes, and I think there are two concepts that we want to talk about.
A
There's
the
new
changes
that
went
from
the
hot
store
into
Delta
Lake,
which
is
the
hash
comparison,
so
we'll
talk
about
that
a
little
bit,
but
then
there's
also
the
downstream
systems
which
are
actually
piggybacking
off
the
CDF
for
from
Delta
lake.
So
I'll
answer
the
latter
one
just
because
that's
straightforward
like
for
the
latter
part,
basically
change
that
you
feed
is
an
option
directly
within
Delta
Lake
Delta
Lake
itself
has
its
own
transaction
log
anyway.
A
So,
by
definition,
all
we
did
was
really
make
it
available,
and
so
that
way
it
became
really
easy
for
you
to
go
ahead
and
understand
what
were
the
changes
that
happened
to
the
Delta
Lake
directly.
That's
why
yes
was
able
to
talk
about
like
the
incremental
views
or
things
of
that
nature
like
because
you're
able
to
or
the
notifications
go
with
it,
because
it's
you,
you
basically
have
that
information
right
directly
in
the
CDF.
A
The
question
basically,
then,
is
more
a
matter
that
I'll
have
you
answer.
You
is
about
like
the
hash
replica
of
between
it
looks
like
my
videos
Frozen.
So
that's
pretty
cool
yeah
like
wow
I,
look
really
weird
right
now,
but
okay
is
to
basically
go
ahead
and
do
the
replica
between
the
hot
store
and
the.
B
Right,
so
no
that's
a
good
question
so,
and
this
is
something
that
we
have
spent
a
good
amount
of
time
and
we
have
an
injury,
an
individual
work
stream
dedicated
collaborating
with
Adobe
research
just
on
this.
So
there
are
multiple
answers
to
it,
but
I'll
kind
of
give
you
the
more
simple
version
of
what
we're
doing
the
naive
version.
So.
B
So, if I were to make a query to the hot store to say, okay, give me all the documents that changed for this DB in the last, let's say, 15 minutes, that is kind of a cheap query. And then we also have a pseudo document hash that is computed to represent the content of the document itself.
A
B
That lives in the hot store itself, so now that is also getting replicated, right? So now, at regular intervals, we are able to query the hot store to say: give me all the hashes that have occurred in the past 15 minutes, for our comparison. Now you can look at Delta Lake and see, okay, how many of these hashes have actually made it into the system?
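The check Yesh describes can be sketched in plain Python. The hashing scheme and function names here are hypothetical illustrations, not AEP's actual code: each document gets a canonical content hash, and the hashes seen in the hot store's window are compared against the hashes that have landed in the lake.

```python
import hashlib
import json

def doc_hash(doc: dict) -> str:
    # Canonicalize the document (sorted keys, fixed separators) so the
    # same logical content always yields the same hash.
    canonical = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def not_yet_landed(hot_window: set, lake_window: set) -> set:
    # Hashes seen in the hot store's window that have not yet
    # shown up in Delta Lake.
    return hot_window - lake_window
```

Because the hash is computed over a canonical form, the same document produces the same hash regardless of key order, which is what makes the set comparison meaningful.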
B
Got it. You can think of this as an exact set match, or you can think of it like Jaccard similarity. Your imagination, and your budget, is the limit on this thing, right, because you can throw a lot of computation at it. You can do exact matches too, depending.
B
But we are right now at kind of a terabytes-a-day, petabytes-in-total scale. So we try to optimize this into more probabilistic techniques, to say: okay, we're probably okay with, say, a five percent diff between this and this, but we need to make sure that all the data does progress into the system. So that's the verification process that we're doing, right.
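The Jaccard-style tolerance check mentioned above can be sketched like this. The function names and the five percent default are illustrative assumptions, not AEP's actual implementation:

```python
def jaccard(a: set, b: set) -> float:
    # Jaccard similarity: |intersection| / |union|; define two empty
    # sets as identical (1.0).
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def window_ok(hot_hashes: set, lake_hashes: set, max_diff: float = 0.05) -> bool:
    # Accept the replication window if the two hash sets differ by at
    # most max_diff (the "five percent" tolerance from the talk).
    return 1.0 - jaccard(hot_hashes, lake_hashes) <= max_diff
```

The tolerance makes the check cheap at scale; anything falling outside it would trigger a closer (exact) comparison.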
B
But apart from that, for any data that gets into this CDC approach, we maintain elaborate metadata at a source level so that we can track it. This entire replication process keeps track at every source; every database has multiple sources hydrating into it. So on a per-source level, we are maintaining counts of how many things are happening. So we see, okay, you've got an input from, say, Source One, let's say Adobe Analytics, that sent in 500 million, or no, 5 million, records in the last 15 minutes.
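The per-source bookkeeping could look something like the following. This is a hypothetical sketch (the class and source names are made up for illustration), just to show the shape of a ledger that downstream counts can be reconciled against:

```python
from collections import defaultdict

class SourceLedger:
    # Per-source ingest ledger: each replication window records how many
    # documents each upstream source hydrated in, so the replication
    # system's output can be reconciled against it.
    def __init__(self):
        self._counts = defaultdict(int)

    def record(self, source: str, n: int) -> None:
        self._counts[source] += n

    def count(self, source: str) -> int:
        return self._counts[source]

ledger = SourceLedger()
ledger.record("adobe_analytics", 5_000_000)
ledger.record("adobe_analytics", 250_000)
```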
B
So now we'll be able to know, okay, how much did the processor, the replication system, also do. That's more from a systems-engineering point of view, to keep track of, or give some transparency on, how much data flowed, so we'll be able to measure it that way. But then, assuming the entire system is a black box, the other thing is there too: we built this mechanism so that we can actually check the, what do you say, the strength of the replication process through this hash comparison.
A
B
A
Okay, cool, now this is awesome, but I'm going to switch gears, because I just realized we only have about eight minutes left. Okay. So up to this point we've talked about the replication; it works great. We've talked about how Delta Lake has actually helped solve a lot of the problems, because it simplified how the downstream systems were able to do it.
A
You had the mapping function that made things easier, so that the downstream systems didn't actually have to make many changes, if any at all, to be able to go ahead and still query the data. That's awesome, but there have got to be some problems. I mean, up to this point you've basically made almost a great ad for Delta Lake: okay, it definitely solves everything, we're good to go, run away. We know that's not true, and this is a D3L2 session.
A
B
Okay, cool. So, in terms of problems, right, there were a lot of problems. There are still some problems which we work around. But if you look at the bigger ones, I think it's the underlying infrastructure. Delta Lake gives you a good paradigm in terms of doing your upserts, deletes, and all of that on top of Parquet data.
B
But what people forget is that you are bound by the performance of the underlying HDFS-compatible storage that you have, whether it's S3 or ADLS, as in Azure Data Lake Storage. In the case of ADLS, and I think I did say this before, we were earlier on Gen 1, and Gen 1 did not scale. I think even Microsoft acknowledged that, because there were a lot of limitations in terms of, you know, the number of metadata files, as in the number of nodes that you could have within a single folder, etc.
B
B
Gen 2 changed the game completely. In fact, unpopular opinion, right, but Gen 2, in terms of Delta Lake support, is better than S3, because on S3 you don't have multi-cluster writes out of the box: you actually need a coordinating database to make sure that you can write from multiple clusters. But on ADLS Gen 2 you don't need to do that. So thanks. That's right, we didn't have to, yeah.
A
B
There was not yet another component that we needed to add to our infrastructure stack. But scaling out Gen 2 was a problem for sure, because people forget: it's not "oh, I just did an operation on Delta, it just works." No, no. You definitely have to monitor your underlying HDFS-compatible storage to see what throttling limits are there, what bandwidth limits are there, so that you can work around them, or work with your cloud provider to make sure that you have some exceptions in place ahead of time.
A
So specifically for ADLS Gen 2, just to provide some context for folks: ADLS Gen 2, in all seriousness, is basically Azure Blob storage with the option of a hierarchical namespace on top of it, and that actually often makes a lot of sense. That's why we pretty much tell you, yeah, yeah, go do it. It provides an excessive amount of scalability, but, and I'm pretty sure this is where the "but" is going to kick in...
A
I take it that you had to basically work with the cloud provider, in this case Azure, to make sure that... and when I use the word partitioning here, I'm talking about partitioning from the infrastructure perspective, not from the data perspective, right, right. That you had to basically pre-allocate the partitioning of ADLS Gen 2, or did they speed that up?
A
Because in the past, and I'm literally using numbers from years ago, it was something like: you had to hit a threshold of something like five thousand or ten thousand IOPS over a 24-hour period before it auto-partitioned underneath the covers. Is this something where you actually waited that long, or did you go ahead and create...
B
No, no. The good thing about AEP is it's a big enough team with a lot of different expertise, I would say. We have a dedicated team who is managing a lot of the data on the lake, as in the data lake, and they were able to use the relationship with Microsoft to make sure all these pre-allocations happened ahead of time. Gotcha.
B
They took care of it. So there is a dedicated team, kind of, you know, working behind the scenes to make sure that all these exact infra details that you're talking about are nailed, so that the rest of the system does not need to worry about them. This is one of the advantages of being in a platform that was grown from the bottom up. Yeah, yeah.
A
No, this is also good learning, because a lot of folks don't know this, right? And so the fact that you have a platform team that focuses on that is analogous to what I've talked about in the past, where, when you're dealing with database systems, and this was before the cloud became such a big thing, you actually had to care about the idea of random IOPS, right? How random IOPS, especially on spinning disks, would screw you over.
A
So you actually had to understand the underlying infrastructure and recognize, and that's why we told you all to use SSDs, because they allowed you to have random IOPS without any problem. So with cloud object stores, yes, we abstracted that away. We said yes, it's all just REST API calls, for all intents and purposes, against the cloud.
A
B
A
I realized that we only have a few minutes left, okay, so I did want to call out, though, and I'm going to call myself out on this one: Delta Lake being awesome, you had the infrastructure figured out, but there was a problem. Again, I'm going to call this out myself: we hadn't open-sourced enough of Delta Lake fast enough. So why don't you talk a little bit about that, because I...
A
B
B
So it was amazing, because of a lot of our workloads. I just talked about replication, right, and people think, okay, you just replicated A to B, fine. No: this replication consists of inserts, updates, and deletes, right? You need to know how to do that efficiently, so that the time bound that we're giving is not like hours. It can be minutes, but it cannot be hours. The only way we were able to do that was, I think, two important features that kind of made the thing work.
B
Z-ordering was the one super important one; another one was auto-optimize. But Z-ordering made a huge difference, because we were reordering based on a particular primary key, which we knew all our update statements, our upsert statements, were based on. Now, the problem is: the tests came out really well, the performance was really good, but then in terms of selling it to the rest of the platform and upper management, it's like, oh, so the entire business functionality, the performance, hinges on a proprietary feature.
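For readers unfamiliar with the feature being discussed: Z-ordering is based on a space-filling (Morton) curve, interleaving the bits of the clustering columns so rows with nearby key values land in the same files, which is what lets data skipping prune files on any of those columns. A toy illustration of the bit interleaving, not Delta Lake's actual implementation:

```python
def z_value(x: int, y: int, bits: int = 16) -> int:
    # Interleave the bits of x and y: x occupies the even bit
    # positions, y the odd ones. Sorting rows by this key keeps
    # records with nearby (x, y) pairs physically close together.
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z
```

In Delta Lake itself this is simply `OPTIMIZE table ZORDER BY (col1, col2)`; the sketch only shows why sorting by an interleaved key clusters on both columns at once.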
B
That was a huge pain point, right? Because, yes, you could say all the yada yada yada about Delta Lake, and then, oh, by the way, it's open source; oh, but by the way, we're dependent on this proprietary feature from Databricks, which we love, by the way, but it's a risk that we need to go ahead with, right? Yeah. Because any organization has a combination of risk-averseness versus risk-taking. So how do you rationalize it? That was a problem.
B
So then, after a lot of talking with Databricks, a lot of discussions and trying to rationalize it, I was super happy that you guys actually just said, okay, fine, let's go for the open source; let's just open source this. And I would say that was a big thing, mainly because, one, internally we had the results and everything showing up, but then there was this thing that the code could not change, right?
B
A
No, no, that's awesome. And, incidentally, we were actually trying to figure out if we could do it even earlier than 2.0, but as you can tell, we actually had a bunch of the column stats stuff, and the clustering, that we had to do first before we could get Z-order out. That's why it took us until Delta 2.0 to do it. And to be very clear...
A
It wasn't just Adobe asking. I do want to give Yesh a ton of callouts here, but this was definitely an ask by the community. So, in typical fashion, this is where customer and community were extremely aligned. There are times, obviously, when the two won't be aligned, right, but in many cases, especially when it comes to Delta Lake, the customers and community are extremely aligned.
A
So we're like, okay, cool, we heard you, we listened, and we just went ahead and did it. And I think you also called out that you needed open source, if for no other reason, for local development, right? If I want to compile and run this code base on my local machine, you're not running a Databricks cluster on your MacBook.
B
That will never sell, right? There are so many hundreds of developers. It'll be like, okay, fine: oh, by the way, you want to test something, or you want to build something? Oh, it has Z-order in it, which will now be a part of every single thing, because this is in the hub. Oh, by the way, go do it in Databricks? No, that's not going to happen. No, no, you'll have to pry IntelliJ and VS Code, or whatever, from...
A
B
A
We've tried to make it easier, which is, by the way, just as a shout-out, where Apache Spark is going forward, including Spark Connect. Spark Connect and Databricks Connect are basically the same thing; Databricks Connect just has authentication and authorization stuff that's specific to Databricks. Otherwise it's the exact same code base. And so the idea is that you could presumably, from your IDE, from your VS Code, write the code and actually submit it directly to a remote cluster.
A
But even with me saying that, you and I are on the same page: no, you're going to pry local development from my cold, dead hands, right? No, I need to run this stuff locally first before I do anything. Yeah. So, we only have a few minutes left, but I did want to leave off with a cool tidbit. You had told me this one when we chatted before: exactly how much easier it is from a manageability perspective, in terms of the number of tables and the number of tenants that your team actually has to manage.
A
B
So if you think about it, and I'm not going to give exact tenant counts, but in terms of sheer numbers, I'd say we have close to 5,000 to 6,000 Delta tables right now, and actively growing. Usually when people talk about a migration, it's like you have one big fat table which you basically split into probably two or three to make a good alignment.
B
But in this case we have internal clients, external clients, and all that, and when you add up the total number, we have close to five to six thousand Delta tables that we're actively replicating to, fanning out to, on a regular, really a real-time, basis. So now, maintaining all of that has actually become pretty easy, contrary to what people might think, because we just have one huge maintenance job that runs, and all it does is take care of orchestrating the vacuums, optimizes, and everything.
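A sketch of what such a maintenance pass might loop over. The table names and the `run_sql` hook are hypothetical stand-ins for however your environment submits Spark SQL, not AEP's actual orchestration; `OPTIMIZE` and `VACUUM` are standard Delta Lake commands:

```python
def maintenance_pass(tables, run_sql):
    # One pass of the maintenance job: compact small files, then clean
    # up files older than a 7-day (168-hour) retention window.
    for table in tables:
        run_sql(f"OPTIMIZE {table}")
        run_sql(f"VACUUM {table} RETAIN 168 HOURS")

# For illustration, collect the statements instead of submitting them.
issued = []
maintenance_pass(["profiles.tenant_a", "profiles.tenant_b"], issued.append)
```

With thousands of tables, the value of this shape is that the table list is data: one job, parallelized however your orchestrator allows, instead of one scheduled job per table.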
B
So everything is broken down, right? Even if you had a traditional DB, you still had to sometimes schedule your vacuums and stuff like that, yeah. What we did is basically acknowledge that we're running a database, compute on demand for a database; that is what we are managing, right, right. That is the acknowledgment that we have to make: that we have to take care of all these maintenance operations ourselves.
B
From our side, we already have a good orchestration system that can parallelize, which, again, is just compute on Databricks and the Spark cluster, so we've just managed that. And for the rest of the operations, in terms of the vacuum, optimize, and everything: as long as you have awareness of what it's doing, you schedule it and you're good to go.
B
Your tables are fine. Of course, I say this very easily, but as long as you read the docs and you understand the concurrency conflicts and everything that happens, which is a separate talk of its own, as long as you do that, you should be golden to manage however many databases you want. We have learned a lot of lessons from maintaining and managing multiple databases before, so a lot of them came in, what do you say, came in handy this time.
A
Perfect. Yes, this is excellent. I think this is a great way for us to end today's session. If you do have any questions or want to chat a little bit more, we'll be updating this, by the way; there will be a follow-up blog that we'll be writing together on this topic as well, based on basically all this conversation that we're having today. This video is already being live-streamed on both LinkedIn and YouTube, so you can already see us there. But just as importantly, go to go.delta.io/slack.
A
A bunch of us are already there answering questions as well. So that's really it. Yesh, anything else that you want to add?
A
Yeah, I figured as much, which is why I wanted to type it. But no, really, Yesh, I really appreciate your time. This was super insightful, super helpful. I'm really glad there are a lot of really cool lessons learned for the community. But then, like I said, if you want to continue chatting, or you have specifics that maybe would be worthwhile for us to cover in a follow-up session or follow-up blog, just join us at go.delta.io/slack. So with that: Yesh, thank you very much. I really appreciate your time.