Description
Combining SageMaker Studio and Delta Lake brings state-of-the-art machine learning to your data lake. In this session, we show how you can train ML models and how you can take advantage of the capabilities offered by Delta Lake using Amazon SageMaker Studio.
Quick Links
Vedant Jain: https://www.linkedin.com/in/vedantjain/
Denny Lee: https://www.linkedin.com/in/dennyglee/
Join us on Slack: https://go.delta.io/slack
GitHub: https://github.com/delta-io
Join Google Groups: https://groups.google.com/forum/#!forum/delta-users
A: [Live stream] Hi everybody, we're just waiting for a few minutes to get ourselves set up on LinkedIn and YouTube. So take your time, go back, get some coffee, you know, get yourself all ready, and then we'll be ready to start the show. And by the way, if you're wondering, this particular session happens to be about ingesting data from Delta Lake into SageMaker Studio, so yeah, you're on the right channel. So just give us a few minutes to get all set up. Okay.
A: And meanwhile, while we're waiting for everybody to chime in, and while LinkedIn and YouTube are almost ready, why don't you go ahead and tell us where you're based out of. For example, I am based out of, you're not gonna believe me, sunny Seattle. So that's where I'm based. How are you?
A: Right, so man, we're all sunny today. This is pretty sweet. All right, well, okay, I think we're good to go. We have the LinkedIn live stream, that's up and running, we have the YouTube live stream up and running, and we have Zoom, which a bunch of you guys are on here as well. So let's start the show; we're gonna start off and kick it off with Carly.
A
You've
got
a
bunch
of
announcements
to
make,
so
let's
go
ahead
and
do
it
and
by
the
way,
if
you're
not
familiar,
Carly's
are
go-to
person
when
it
comes
to
All,
Things,
Delta,
Lake,
social
and
marketing,
wise
events
wise
so
happy
to
have
here
and
have
her
actually
join
and
join
the
conversation
more
often,
as
opposed
to
just
you
know,
hiding
behind
the
scenes
so
we'll
take
it
away.
Carly.
C: So one week from today, Wednesday, May 24th at 9:00 a.m. Pacific, we will be going live on the Databricks LinkedIn and YouTube. Matei Zaharia, co-founder and chief technologist at Databricks, our lovely Denny Lee, and Martin Grund, senior software engineer at Databricks, are coming together for an online meetup: use Spark from anywhere. This is going to be a very exciting conversation; I will drop the RSVP link in this chat. After this, we also have, Thursday, May 25th at 9:00 a.m.
C: Pacific, a wonderful D3L2 session with Andy Grove, who has been specializing in query engines and distributed systems. I will also include that RSVP link. And, lastly, June 26th to the 29th we are coming together for Data and AI Summit: top experts, researchers, and open source contributors across the data and AI community are coming together, June 26th to the 29th.
C: If you use the code ETLINUX400, you can save $400 off the regular price of a full conference pass; that code expires June 2nd. And now I'm going to flip it over to Denny for today's discussion with Vedant Jain, and they're going to show you how to train ML models and how you can take advantage of the capabilities offered by Delta Lake using Amazon SageMaker Studio.
A: Perfect, Carly, thank you very much, really appreciate you diving in like that. So hey, without further ado, Vedant's gonna start presenting the slides, because he's gonna be the guy to run things. Meanwhile, I wanted to call out that, hey, Vedant and I are old friends.
A
We
used
to
work
together
at
databricks,
and
so
when
we
found
an
opportunity
for
him
and
I
to
go
ahead
and
speak
together,
it
seemed
to
make
all
the
sense
in
the
world
a
lot
of
the
stuff
that
he's
been
working
on
is
very
much
related
within
the
realm
of
the
sagemaker
world,
which
is
awesome
and
sweet,
and
myself
I'm
very
well.
You
probably
know
me
already
a
little
on
the
bias
on
the
Delta
Lake
Side.
A
So
that's
why
we
figured
we
have
this
session
together
and
talk
about
stage
meeting
studio
for
machine
learning
on
Delta
lake.
So,
but
not
you
want
to
take
it
away,
or
would
you
like
me
to
go
ahead
and
continue
on
didn't
know?
We
didn't
prep
very
well
wow.
A
B: I mean, I can just introduce myself real quick, and then, yeah, you can start talking about Delta Lake and, you know, do a brief overview of the agenda.
B
Great
hello,
everyone,
my
name
is
vidhan
Jain
and
I'm,
currently
working
at
Amazon
web
services.
B
I
am
a
senior
AI
ml,
specialist
and
I
work
directly
with
all
of
the
AI
ml
products
within
within
AWS
and
I
work
on
strategic
Partnerships.
So
there
are
multiple,
very
specific
companies
that
I
work
with,
and
my
prior
in
my
prior
lifetime,
I
I
used
to
be
at
databricks.
I
was
a
Solutions
architect
at
databricks
and
that's
how
I
got
to
know
about
Delta
Lake
and
got
to
work
on
some
very
interesting
projects
regarding
Delta
lake.
B
So
you
know
when
we
got
this
opportunity,
it
was
a
great
time
to
bring
sagemaker
studio
and
Delta
Lake
together
because
of
all
the
great
Innovation
that's
going
on
in
the
open
source
community.
So
that's
exactly
what
we're
going
to
talk
about
today,
talk
about
today!
So
back
to
you
today,.
A
Awesome
thanks
very
much
all
right
next
slide,
please
so
there
we
go
so
our
agenda
today.
A
Basically,
is
we're
going
to
talk
about
the
Delta
Lake
fundamentals:
we're
going
to
start
with
that,
and
so
just
in
case
you
all
are
not
as
familiar
with
Delta
Lake
I'm
sure
most
of
you
are
so
we're
going
to
Breeze
through
it
quickly,
but
at
least
if
you
are
not
as
familiar
the
slides
will
be
available
and
for
you
to
go
ahead
and
actually
access
them
and
refer
to
then
we're
going
to
talk
about
the
benefits
of
machine
learning
on
Delta
Lake,
because
that's
always
a
good
thing
and
from
there
we'll
switch
over
to
the
DOT
to
talk
about
an
introduction
to
sagemaker
Studio
sagemaker,
build
ml
models,
a
demo
and
call
to
action.
A
Now,
if
you
have
any
questions,
if
you're
on
our
Zoom
more
than
welcome
to
go
ahead
and
chime
in
on
the
Q
a
panel
to
ask
your
questions
now,
please
keep
your
questions
related
to
sagemaker
studio
and
Delta
Lake,
please,
okay,
we
do
want
them
to
be
around
these
two
concepts:
okay
and
if
you're
on
LinkedIn,
again
you're
more
than
welcome
to
chat
on
the
comments
section
and
also
YouTube
same
idea:
go
ahead
and
chat
there
so
I'll
be
monitoring.
A
All
three
I'm
gonna
be
a
little
slow
in
the
beginning,
just
because
I'm
going
to
be
talking
first,
but
yeah,
let's
go
ahead
and
switch
the
next
slide
and
talk
about
the
Delta,
Lake
fundamentals,
and
so
for
those
of
you
that
may
or
may
not
be
familiar
with
Delta
Lake
I,
don't
like
Delta
lake
is
an
open
storage
format
that
brings
acid
transactions
to
big
data
workloads
on
cloud
object
stores.
This
is
the
key
ingredient
to
this
concept
of
building
lake
houses.
A
Lake
houses,
there's
this
culmination
of
of
the
data
warehousing
World
well
from
terms
of
business
data
and
the
data
Lake
World.
Bringing
these
two
concepts
together.
Now
next
slide,
please.
A
Okay,
so
we're
going
to
do
Delta,
Lake
foundation
or
lake
house
yeah
just
go
ahead
and
skip
to
the
next
slide
here
all
right.
So
the
context
that
we
want
to
talk
about
is
that
when
we
we
talk
about
lake
houses
in
the
first
place,
it's
to
say
that
you
have
the
manageability
and
the
Simplicity
and
the
transactional
reliability
of
a
data
warehouse
plus
the
scalability
and
the
flexibility
of
a
data.
Lake
take
the
two
concepts:
the
bust
of
Both
Worlds
together
and
then
that's
what
you
have
with
the
lake
house.
A
The
lake
house
is
basically
taking
advantage
of
your
Cloud
optic
storage.
It
has
the
scalability
and
flexibility
of
it
yet
at
the
same
time,
having
the
transactional
protections
around
that
data,
I
could
go
on
for
hours
about
the
pain
of
a
data.
Lake
about
schema
on
reads:
I:
won't
because
then
you're
all
going
to
get
bored
real
fast,
but
that's
the
context.
The
idea
is
that
you
take
the
best
of
these
two
worlds.
So
that
way,
you
have
the
flexibility
to
make
sense
process.
A
Read,
do
AI
make
do
run
machine
learning,
algorithms
AI
against
all
of
your
data.
Next
slide,
please,
and
so
the
key
features
of
Delta
Lake
are
as
to
transactions
again
to
protect
your
data,
the
context
of
scalable
metadata,
the
ability
to
go
often
more
times
than
not
the
metadata
of
your
data
Lake
ends
up
becoming
more
slowing,
processing
or
slowing
querying
down
than
the
actual
data
itself,
just
because
it
actually
has
to
identify
what
files
it's
supposed
to
read
or
which
files
it
can
skip.
A
So
the
metadata
process
actually
becomes
extremely
painful
and
so
Delta
lake
has
the
capability
to
handle
petabyte
scale
of
tables
with
billions
of
partitions
and
files
and
because
we
have
very
scalable
metadata
time,
travel
the
ability
to
revert
to
old
files.
Oh
so,
oh
sorry,
to
old
versions
of
the
data,
so
you
can
audit
roll
back
anything
else.
It's
definitely
open
source
where
this
is
coming
from
databricks.
A
So
we've
got
a
lot
of
the
open
source
pedigree
around
Apache,
spark,
mlflow
and,
of
course,
Delta
like
so
Open
Standards,
Community
Driven,
always
a
fan
of
that
United
unifying
the
concepts
of
batching
and
streaming.
This
is
super
important
because,
as
you
look
at
more
and
more
data
processing
streaming
is
not
just
for
real-time
applications,
so
that's
super
important
by
the
way
I'm
not
trying
to
disregard
that.
A
The
idea
is
that
you
should
be
able
to
apply
your
business
logic
I.E
the
what
logic
you
apply
to
your
data,
irrelevant
of
what
that
latency
is
whether
it's
supposed
to
be
super
fast
or
you
haven't
latency
of
four
hours
or
one
day,
and
so
the
business
logic
shouldn't
change.
Delta
lake
has
the
ability
to
handle
extremely
fast
queries
and
processing
of
both
your
streaming
and
your
batch
data.
A
So
that
way
you
can
go
ahead
and
separate
the
business
logic
from
the
latency
of
your
processing
and
with
advancements
in
structure
streaming
or
for
them,
or
the
Flint
connector,
or
you
want
to
go
ahead
and
go
old
school.
Well,
maybe
new
school
rust
and
actually
use
a
rust
apis
to
go
ahead
and
write
directly
to
Delta
we're
good
yeah
absolutely
so
this
is
the
context,
schema
Evolution
enforcement.
Why?
Because
the
schemas
might
change
so
you're
going
to
allow
them
to
evolve
yet
at
the
same
time
you
want
to
enforce
them.
A: So, if I leave you with nothing else in terms of the key differentiators of this community, it's about performance, community, and reliability. In terms of the performance, some of the concepts I like talking about: now, of course, this number, 1.7 exabytes, is from Databricks, because that's the only place we can actually get the numbers from, okay, but in the context of Databricks, there's 1.7-plus exabytes processed a day. Not stored, processed.
A
Storage
is
actually
much
higher,
but
we
like
using
the
process
numbers,
because
this
is
showcasing
just
the
fact
that
we're
talking
a
ton
of
data.
This
is
based
off
of
7000
plus
companies
in
production.
So
that's
actually
how
databricks
is
able
to
go
ahead
and
provide
a
reliable
service,
because
it
runs
on
Delta,
Lake
and
then
over
the
last
three
years.
A
This
is
last
year's
number,
so
I
don't
I'm,
actually
not
sure
what
this
year's
numbers
looks
like,
but
last
year,
numbers
there's
an
increase
in
contributor
strength
over
the
last
three
years
of
663
percent
I,
that
is,
there
are
more
and
more
open
source
contributors
contributing
to
the
Delta
Lake
Project,
which
is
really
really
cool.
I'm
super
happy
about
that
next
slide.
Please,
and
so
this
is
a
also
from
last
year's
number
I'm,
not
sure
why
I
accidentally
blocked
out
the
data
and
AI
Summit
to
2022
logo.
A
So
that's
my
bad,
that's
on
me,
but
this
is
from
last
year's
and
so
basically
there's.
If
we
go
to
November
last
year,
basically
is
1.1
million
downloads
a
month.
A
The
numbers
are
great,
but
that
the
key
things
I
wanted
to
call
out
here
is
that
every
month
there
are
multiple
releases,
whether
it's
still
to
spark
whether
it's
the
Delta
rust
python
apis,
whether
it's
Flink
Delta
sharing,
whatever
else
there's
the
community's
Super
Active
and
so
we'd
love
you
to
join
us.
If
you
actually
have
some
cool
ideas,
for
example,
there
is
already
in
the
alpha
stages
a
a
go
Delta
Lake
API
as
well.
There
are
apis
for
Delta,
dask
and
Delta
array.
A
So
again,
pretty
cool
stuff,
so
come
join
us
all
right
next
slide.
Please
now
give
you
the
context
of
why
Delta
Lake
that's
great,
but
for
those
who
are
into
machine
learning,
which
is
the
bulk
of
the
folks
I'm
sure
here,
who
are
here
today
like
you're,
wondering
why?
Why
do
you
care
about
Delta
Lake?
And
why
are
the
benefits
of
machine
learning?
So
next
slide?
Please,
and
so
the
key
thing
I
sort
of
like
to
remind
folks
is
that
the
data
science
life
cycle
isn't
just
about
the
part
where
you're
serving
models.
A
It
isn't
just
the
part
about
your
even
going
and
training
the
most.
Obviously,
they're
super
important,
so
don't
get
me
wrong,
but
there's
a
lot
of
other
work
that
has
to
be
done.
You
have
to
take
that
raw
data.
You
have
to
be
able
to
scale
that
up
somehow
to
and
prep
it
and
process
it.
So
there's
all
sorts
of
different
tools
that
are
out
there,
we're
obviously
I'm
leaning,
more
towards
spark
or
or
rust,
but
you
know
there's
other
systems
like
Flink
like
trino
like
Presto.
A
This
is
all
good
like
we're
we're
not
trying
to
tell
you
which
one
to
use.
You
have
your
reasons
for
doing
this.
The
point
we're
trying
to
get
as
that,
but
it
involves
taking
a
lot
of
raw
data
and
prepping
it
and
processing
it
and
making
sense
of
it
and
filtering
it
and
so
forth
and
so
forth.
Right
and
then,
when
you
do
the
training
you're
going
to
go
ahead
and
use
things
like
you
know:
Pi
torch
or
r
or
tensorflow
or
Barker
she
boosts
or
whatever
it
is
that
you
like
using
out
there.
A
Okay,
because
your
data
scientists
have
their
particular
tools
of
choice.
Okay,
they're
going
to
tune
it
and
you
need
to
be
able
to
scale
that
process
as
well
and
so
I'm
more
than
sure.
But
now
it's
going
to
cover
about
how
sagemaker
helps
with
a
lot
of
that
stuff,
which
is
exactly
the
point.
That's
why
we
have
today's
session,
but
that's
an
important
aspect.
A
You
have
to
have
these
systems
that
can
scale
for
your
data
prep
for
your
raw
data
scale
for
your
tuning
scale,
for
your
training
and
then
you
have
to
deploy
all
those
models.
Okay,
there's
a
model
we're
going
to
push
those
models
out
that
could
be
Docker.
It
could
be
sagemaker,
it
could
be
MFL
I,
don't
care
again
right,
you're,
going
to
choose
a
your
tool
of
choice,
again
vadod's
going
to
talk
about
sagemaker,
because
that's
today's
session.
But
the
point
is
that
you
have
to
be
able
to
scale
that
too
right.
A
Guess
what
you
have
to
know
what
data
you're
working
with
you
have
to
ensure
the
reliability
of
the
schema.
You
have
to
ensure
that
their
scale
relating
to
all
of
the
stuff
right,
so
putting
all
this
together
and,
of
course,
hey
wanted
to
just
give
a
call
out
to
to
kunmi.
Hey,
don't
forget
Mojo,
yes,
there's
a
ton
of
Mojo
of
around
all
of
these
things.
Well
guess
what
Delta
like
allows
you
to
have
all
of
that
next
slide.
A
Please,
and
so
that's
the
context,
data
the
role
of
day
engineering
is
super
important.
It
allows
and
enables
your
data
size
and
your
analytics.
It
allows
you
to
develop
test,
maintain
your
pipelines
and
allows
you
to
productionize
those
data
science
models.
There's
that
yin
and
yang
of
basically
data
engineering
cannot
exist
without
data
science
and
data
science
can't
exist
without
data
engineering.
It
used
to
be
back
in
the
old
days
about
10
years
ago,
I
think
of
a
DOT
when
I
want
to
say
old
days.
A
In
this
case
yeah,
it
literally
was
a
matter
of
like
it
was
the
same
person
that
did
both
well.
As
we
start
applying
actual
software
engineering
practices
we're
recognizing.
There
is
a
fundamental
differences
between
how
you
run
do
run
the
practices
around
data
engineering
and
run
the
practices
around
data
science
they're
super
related
together,
but
they're.
A
It
is
definitely
more
than
just
one
person's
job
or
if
you're
one
person,
then
you
maybe
you
can
go
ahead
and
justify
going
and
get
a
race,
but
that's
a
that's
a
whole
other
conversation
all
right
next
slide.
Please
I
just
finish
it
up
since
yeah.
There
you
go,
and
so,
when
you
look
at
these
architectures,
basically
the
big
data
structures
and
data
Lakes.
This
is
what
opponent's
done.
A
You've
got
these
input
sources,
whether
it's
Batcher
streaming,
you're
going
to
put
this
into
a
data
Lake
and
you're,
going
to
go
ahead
and
have
data
consumers
do
the
AI
and
Reporting
well.
Delta
Lake
covers
that
concept
from
the
input
sources
to
the
data
Lake
and
prepping
you
for
the
data
consumer
such
that
you
can
store
the
data
in
structured
and
semi-structured
pull
data
from
various
input
sources.
You
have
a
single
central
location,
so
you're
not
actually
going
and
having
these
data
silos
anymore.
A: So in the end, because of this, you could presumably say: let's go ahead and build ourselves a super complex pipeline to basically be able to handle the streaming of your data and the batch of data. I'm not going to go through all the details, I've done this before, but the point is that this picture is an accurate view of what happens when you try to handle streaming and batch data before you ever get to AI and reporting.
A
But
if
you
flip
to
the
next
slide,
what
you'll
notice
is
that
we're
actually
we're
going
to
ask
that
question?
Can
it
be
simplified
and
that's
what
it
comes
down
to
Delta
like
allows
us
to
do
all
that,
because
it
handles
the
batching
the
streaming,
the
updates
deletes
the
reads:
the
ability,
the
rollbacks
optimize,
live
blue
optimize
file,
layout
formats,
next
slide,
please
and
so
yeah.
Let's
skip
that!
That's
fine!
Every
I
think
I've
already
beaten
that
one
so
perfect.
So
this
is
a
great
segue.
A
Now
we're
going
to
have
go
ahead
and
cover
about
sagemaker
studio,
just
in
case
you're,
not
aware
of
that.
B
Yeah
thanks
Denny,
so
you
know
quickly,
walk
through
some
slides
and
then
and
introduce
everyone
to
sagemaker
studio,
so
Denny
talked
about.
You
know
the
whole
data
science
life
cycle
and
the
importance
of
having
high
quality,
reliable
data
for
building
your
machine
learning
models
right.
So
that's
that's
a
separate
pipeline.
B
It's
a
data
pipeline
involving
data
engineers
and
data
stewards
and
then,
once
you
add
that
data
in
the
right
format
you
want
to
derive,
derive
value
out
of
that
data
right
and
that's
where
all
these
different
machine
learning
tools
come
into
place
and
sagemaker
studio
is,
is
a
state-of-the-art,
fully
integrated
development
environment
designed
specifically
for
that
purpose.
It's
just
designed
specifically
for
building
machine
learning
pipelines
and
accommodating
all
these
different
personas,
not
from
the
data
world,
but
now
from
the
machine
learning
world.
B
So
you
have
data
scientists,
machine
learning,
Engineers
ml,
Ops
experts
Etc
right.
So
all
these
different
users
need
one
environment
and
one
Unified
visual
interface.
So
that's
exactly
what
sagemaker
studio
provides.
It
provides
you
purpose-built
tools
for
every
step
of
machine
learning
development,
including
labeling
data
data,
preparation,
feature
engineering,
biased
detection,
explainability
and
then
all
the
way
down
to
hosting
these
models
in
a
very
efficient
way
and
and
doing
model
monitoring
Etc.
B
You
can
write
code
track
experiments,
visualize
the
data
debug
and
monitor
all
of
that
within
a
single
environment,
and
all
of
these
different
steps
of
your
machine
learning
workflow
are
tracked
within
that
same
environment,
as
we
will
show
you
in
the
demo.
So
we
discussed
earlier
the
benefits
of
using
Delta
Lake
for
machine
learning.
Well,
sagemaker
Studio
makes
for
an
ideal
user
interface
and
offers
an
underlying
compute
platform
for
building
machine
learning.
Applications
on
that
data,
reliability,
layer
built
on
top
of
balcony.
B
So
that's
because
when
we
combine
sagemaker
studio
with
Delta
Lake,
you
get
this
optimized
storage
alongside
data
governance
and
data
reliability
capabilities,
and
then
you
also
get
the
end-to-end
machine,
learning,
capabilities
and
model
governance
through
sagemaker,
along
with
access
to
state-of-the-art
machine
learning
models
and
solution
templates.
B
So
what
are
these
state-of-the-art
machine
learning
models?
Well,
sagemaker,
Studio,
being
this
machine
learning
platform
gives
you
access
to
all
these
latest
and
greatest
built-in
algorithms.
B
So
you
know
for
all
the
different
kinds
of
applications
that
you
may
want
to
build
different
modalities
of
data
that
you
may
have
and
being
able
to
perform
machine
learning,
training
at
scale.
Sagemaker
Studio
provides
all
these
different
kinds
of
algorithms
right,
so
you
have
supervised
machine
learning,
algorithms,
you
have
computer
vision,
algorithms,
you
know
Advanced
computer
vision,
algorithms,
like
semantic
segmentation,
and
then
you
have
text-based
algorithm.
You
know.
Large
language
models
is
a
big
deal
today.
B
You
know
these
days
and
we
will
talk
about
that
in
a
little
bit
as
well,
and
then
you
have
purpose-built
algorithms
such
as
forecasting
Etc
right,
so
we
have
unsupervised
supervised
and
also
semi
semi-supervised
algorithms.
So
all
of
those
are
packaged
within
sagemaker
studio.
You
can
also
use
the
API
to
call
these
models
which,
which
I
would
show
you
in
a
little
bit,
and
then
we
have
the
service
called
sagemaker
jumpstart.
B
So
all
these
models
are
pre-packaged
within
containers
and
there
are
some
you
know:
proprietary
models
as
well
as
publicly
available
models.
These
are
basically
targeting
generative
AI
use
cases,
so
sagemaker
Studio
makes
it
very
easy
for
end
users
to
access
all
these
different
models
within
one
location,
fine-tune
them
train
them
and
and
deploy
them
behind
endpoints
within
studio
as
well,
and
all
of
that
is
happening
obviously
in
the
AWS
Cloud.
B
So
you
can
get
access
to
all
these
different
features.
If
you
have
the
AWS
Cloud
already
and
now
in
order
to
actually
get
access,
you
need
an
environment
and
that's
where
stage
maker
studio
notebooks
come
into
play,
so
sagemaker
Studio
allows
you
to
do
data
pre-processing,
analytics
and
building
machine
learning
workflows
all
within
one
notebook.
There
are
built-in
Integrations
with
spark.
B
We
talked
about
the
importance
of
spark
in
building
these
data
pipelines,
so
there's
built-in
integration
with
spark
and
also
other
open
source
projects
such
as
Hive
and
Presto
that
are
basically
running
behind
what
we
call
Amazon
elastic
mapreduce
clusters,
and
then
you
have
data
residing
on
S3
and
if
you
have
other
data
source,
this
is
you
know.
B
Studio
has
built-in
connectors
for
those
data
sources
as
well,
and
you
can
browse
and
query
these
different
data
sources,
explore
metadata
the
schemas
and
run
analytics
jobs
as
well
as
run
end-to-end
machine
learning
workflows,
depending
on
the
kind
of
framework
that
you're
using
if
you're,
using
pytorch,
tensorflow
and
others.
There's
built-in
support
for
that
as
well.
Using
our
deep
learning
containers
and
you
don't
have
to
leave
our
notebook
environment
in
order
to
build
these
workflows.
B: But the point is that these notebooks are fully managed and they run on elastic compute resources, taking full advantage of the scalability of the AWS cloud, along with the economies of scale. So you can pick all these different algorithms from within these notebooks. There are 15 built-in algorithms at this moment, probably more now; we keep adding more and more new algorithms based on, you know, the latest and greatest innovations that are happening in the open source
B
Community
as
well
as
algorithms
that
will
be
built
within
Amazon,
you
know,
Amazon
has
been
in
the
business
for
machine
learning
for
the
past
more
than
20
years,
I
believe
now,
so
there's
plenty
of
knowledge
that
is
being
transferred
from
Amazon,
also
intersection
request
Studio.
So
you
can
run
these
models
at
a
small
scale.
You
can
run
these
models
at
a
large
scale
in
a
distributed
fashion.
You
know
we
provide
the
controls
to
the
end
users
and
then
you
have
some
pre-built
solution
templates.
Also.
B
So
these
are
you
know,
cloud
formation,
templates
service,
catalog
templates
that
allow
you
to
you
know,
take
end-to-end
use
cases
such
as
you
know:
fraud,
detection
for
for
the
banking
industry
or
visual
inspection,
automation
for
the
manufacturing
industry.
It
brings
in
different
AWS
components
and
with
a
with
one
click,
you
can
deploy
these
machine
learning,
Solutions
prompt
data,
all
the
way
to
inferencing
endpoints,
and
then
we
also
have
Automated
machine
learning
capabilities.
We
all
know
the
importance
of
automl
in
in
what
we
do
as
machine
learning
practitioners.
B
So
we
have
that
capability
built
in
using
autopilot,
Within,
sagemaker
and
then
sagemaker
using
our.
What
I
talked
about
earlier
are
deep
learning
containers.
We
have
optimized
these
different
popular
open
source,
Frameworks
such
as
tensorflow,
pytorch,
mxnet
and
even
hunting
face
now
to
to
be
able
to
run
these
model
training
jobs
at
scale
and
with
minimal.
You
know
with
minimal
modification
of
code
and
so
on.
B
So
today,
data
scientists
can
use
sagemaker
Studio
to
spin
up
these
notebooks
and
start
building
these
machine
learning
models,
and
you
know
machine
learning,
Engineers,
as
well
as
data
scientists
and
ml
Ops
exports,
can
all
collaborate
and
come
together
within
this
one
environment
to
build
these
intimate
workflows.
B
So
that
being
said,
the
focus
is
going
to
be
on
on
Delta,
Lake
and
Studio
integration
for
today.
So
they're.
Really
these
two
concepts
within
the
sagemaker
studio.
One
is
you:
can
you
know
you
can
run
these
data
prep
jobs,
analytical
jobs,
data
explore
exploratory
jobs
locally
within
the
studio
and
notebook
environment,
but
then
there's
also
the
option
like
if
you
are
already
looking
for
that
scale,
to
prepare
your
data
and
to
be
able
to
run
those
data
preparation,
jobs
at
scale.
B
You
can
also
connect
to
these
remote
EMR
clusters
with
a
click
of
a
button,
okay,
and
that
that
gives
you
the
ability
to
bring
in
the
data
you
know,
use
Delta
Lake
to
have
you
know
the
data
reliability
layer
to
read
the
data
from
Delta
Lake
into
EMR
and
run
your
analytical
pre-processing
workloads
at
scale
in
an
optimized
fashion.
B
Before
you
really
get
to
the
point
of
building,
you
know
having
the
stream
test,
trained,
validation,
test,
split
and
really
start
building
these
machine
learning
models
and
doing
everything
else
that
comes
after
right,
and
so
we
also
enable
you
know:
fine-grained
access,
that
is
a
credential
push
down
capabilities.
B
So
if
you
want
fine-grained
security
permissions,
where
you
have
multiple
users
in
a
single
environment,
sagemaker
Studio
gives
you
that
capability
and
then,
finally,
if
you
have
to
automate
all
these
different
machine
learning
components
from
the
data
exploratory
and
data
pre-processing
stage,
all
the
way
to
the
inferencing
and
model
monitoring
stage,
Studio
gives
you
the
ability
to
automate
all
of
that
as
well
via
the
API
and
also
we
also
have
now
a
scheduling
feature
within
sagemaker
studio.
That
allows
you
to
do
that
so
I'm,
going
to
walk
through
the
demo.
B
I
have
a
couple
of
studio
notebooks
to
show
you.
So
this
is
the
sagemaker
studio
environment
in
order
to
access
sagemaker
Studio
number
one.
You
have
to
be
an
AWS
customer,
which
means
that
you
have
to
have
an
account
on
on
Amazon
web
services,
and
you
should
be
able
to
log
in
and
have
the
required
permissions
to
be
able
to
access
sagemaker
Studio,
which
is
all
proactively
managed
through
our
IAM
roles.
B
And
so
once
you
have
an
AWS
account,
you
can
just
go
into
Amazon
sagemaker.
You
can
search
from
here.
You
can
go
into
Amazon
sagemaker,
and
this
is
the
the
grand
stage
maker
console
right,
which
which
has
all
these
different
machine
learning
capabilities
from
data
labeling,
using
what
we
call
a
round
truth
Service
to
to
model
inferencing.
We
have
many
different
kinds
of
endpoints.
We
have
batch
influencing
Service
as
well
and
and
then
there
are
other
capabilities
around
machine
learning.
You
can
also
build.
B
You
know
for
every
model
that
you
create,
there's
model
governance
capabilities,
so
you
can
build
these
model
cards
and
have
metadata
around
these
different
models
and
see
how
these
different
models
evolve.
How
the
data
sets
that
were
used
to
create
these
models.
You
know
you
can
get
access
to
those
data
sets
and
look
at
those
as
well
in
order
to
access
sagemaker
Studio,
specifically,
you
have
to
go
into
the
siege
maker
domains
section
you
can
have
you
know
multiple
domains
here.
We
also
have
the
ability
to
build
collaborative
workspaces.
B
So
you
know
that's
when
you
have
multiple
users
and
you
want
to
track
changes
on
the
notebooks
Etc.
So
once
you
have
a
domain
yeah,
you
know
it
takes
a
few
minutes
to
create
a
studio
domain.
This
is
backed
by
an
imro.
So
in
order
to
be
able
to
log
into
a
domain,
you
need
to
have
access.
B
You
know
you
need
to
have
IAM
access
to
that
particular
Studio
domain
and
you
can
have
multiple
domains
based
on
the
kind
of
use
cases
that
you're
working
on
and
also
within
within
each
domain,
you
can
have
you
can
have
multiple
users
so,
depending
on
you
know,
you
can
have
multiple
data
scientists.
You
can
you
know
again,
depending
on
the
job
profile,
you
can
have
ml
Ops
Engineers
Etc
collaborating
with
one
another
in
a
single
collab
collaborative
space.
B
So
once
your
domain
is
spun
up,
you
can
access
sagemaker
Studio
from
there
directly,
and
this
is
the
UI
for
sagemaker
Studio.
So
when
you
spin
up
sagemaker
studio
for
the
first
time,
you're
going
to
land
up
on
this
home
you're
going
to
land
on
this
home
page-
and
here
there
are
many
different
options,
so
I
talked
about
you
know
ingesting
and
preparing
data.
There
are
multiple
different
sources
where
you
can
ingest
data
from
their.
You
know.
B
Built-In
Amazon
data
services,
like
redshift
S3,
obviously,
which
allow
you
to
you
know,
bring
your
data
in
directly
with
a
single
point-and-click,
and
then
we
have
other
third-party
Integrations
as
well.
As
you
can
see
here,
and
then
there
are,
you
know:
I
talked
about
jumpstart
models.
B
You
have
access
to
all
these
different
models
here
with
a
point
and
click
and
then
there's
a
whole
Suite
of
all
these
different
kinds
of
models,
depending
on
the
kind
of
use
case
that
you're
dealing
with,
and
then
there
is
the
automl
capability
that
I
talked
about
earlier
by
the
way.
All
of
this
is
also
accessible
through
the
sagemaker
SDK,
so
yeah.
If
you
don't
want
to
necessarily
use
the
UI,
you
want
to
use
your
own,
you
know
Visual
Studio
or
any
other
IDE.
B
You
can
access
all
of
these
features
through
or
most
of
these
features
through
SDK
as
well.
And
then,
as
you
build
these
machine
learning
models,
there
is
the
ability
to
track
these
models.
The
evolution
of
these
models
through
a
single
UI
through
this
experiment
tracking
feature
you
can
schedule
notebook
jobs,
so
I'll
show
you
that
when
you
know
once
you
have
your
notebook
and
I
see
prepped
and
ready,
you
can
even
use.
B
You
know
the
notebook
scheduler
in
order
to
schedule
these
jobs
on
a
particular
schedule,
and
then
we
have
sagemaker
pipelines
which
allows
you
to
you
know
plug
different
components
of
the
of
the
pipeline.
You
know
your
machine
learning
pipeline
from
you
know:
data
pre-processing
to
model
training
to
you,
know
automl
to
model
deployment.
All
of
that
can
be
done
automated
through
sagemaker
Pipelines
and
then
finally,
we
have
our
own
model
registry,
and
then
this
is
the
deployment
section
where
you
can
see.
B
You
know
all
the
different
models
that
are
that
are
deployed
there
again,
like
I
said
before
there
are
multiple
different
kinds
of
endpoints
that
you
can
use
to
deploy
your
models
and
so
on
and
so
forth.
So
talking
about
Delta
Lake.
B: you know, here's a cluster feature within, you know, this very familiar JupyterLab kind of environment. You know, you have this cluster feature, so here, for example, I have these two clusters running. One is, let's say, a pre-processing cluster, which is a, you know, standard optimized Spark cluster that allows you to run these Spark workloads, and then I have a very specific machine learning cluster as well.
B
So
I
can
connect
directly
to
these
clusters
from
within
the
notebook
itself,
and
then
we
have
these
concepts
of
these
kernels.
You
know,
depending
on
what
you're
trying
to
do.
You
know
we
have
many
purpose
built
and
optimized
container,
kernels
and
and
images
for
that
purpose.
So
if
you
wanted
to
use
mxnet
latest
and
greatest
version
of
mxnet
or
Pi
torch,
tensorflow
Etc,
all
of
that
is
available
to
you
in
a
fully
managed
fashion
at
your
fingertips.
B
Without
you
having
to
install
that
manage
all
those
dependencies
yourself
Etc
right
and
then
you
have
access
to
all
these
different
instance
types.
We
also
enable
fast
there
certain
fast
launch
instances
as
well
there's
support
for
spot
instances.
You
know
for
for
better
economies
of
scale
Etc
right,
so
you
know
as
Delta
Lake
users.
B
You
might
already
be
familiar
with
a
lot
of
these
Concepts
that
I'll
show
you
in
the
first
notebook
I
have
two
notebooks
that
I'll
walk
you
through,
but
you
know
the
first
notebook
is
where
we're
going
to
take
this,
this
data
that
is
openly
available.
It's
the
Lending
Club
data
and
it's
you
know
it's.
It's
got
this
loan
risk
data
right,
so
it's
like
basically
certain
feature
columns
that
show
the
all
the
different
loans
given
out
in
these
different
states
to
different
users.
B
Along
with
their
you
know,
some
information
about
the
users
you
know,
such
as
their
FICO
score
and
then
information
about
the
loan
term
and
then
also
information,
whether
or
not
the
user
was
approved
for
the
loan
right.
So
you
can
get
this
data
in
in
Delta
Lake.
You
can,
you
know,
manipulate
the
data
run
analytics
at
scale,
get
that
data,
reliability
layer
and
then,
once
you
have
that
you
can
build
and
manage
a
machine
learning
life
cycle.
B
You
know
which,
which
determines
whether
or
not
a
user
will
be
approved
for
a
loan
right.
So
that's
the
idea,
so
here
Forest,
once
we
have
this
cluster
access
to
this
cluster,
you
can
see
this
is
the
cluster
ID.
You
can
get
this
cluster
ID
from
the
EMR
console,
The
Classy
mapreduce
console
with
an
AWS,
and
here
we
can
see
we
are
connected
to
that
cluster.
B
We
have
access
to
the
spark
UI
from
here
from
within
the
notebook
and
then
in
order
to
be
able
to
read
Delta
Lake
data
into
EMR.
We
need
to
run
some.
You
know
we
need
to
configure
the
cluster
to
grab
those
dependencies
grab
the
open
source
Library.
You
know
the
Delta
core,
Library
and
and
passive
external
configurations,
and
once
we
do
that,
it's
going
to
restart
this
work
application,
and
then
you,
you
know.
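The exact configuration cell isn't shown on screen, but a minimal sketch of wiring the open-source Delta Lake library into a Spark session looks roughly like this; the package version and coordinates are assumptions and must match your EMR release, and on an EMR-connected Studio notebook the same settings usually go into the kernel's configure cell rather than a local builder:

```python
from pyspark.sql import SparkSession

# Minimal sketch, assuming open-source Delta Lake 2.x on a Spark 3.x cluster.
spark = (
    SparkSession.builder
    .appName("delta-on-emr-sketch")
    # Pull the Delta core library; the version must match the cluster's Spark/Scala build.
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.1.0")
    # Enable Delta's SQL extensions and catalog so DELETE/UPDATE/MERGE,
    # DESCRIBE HISTORY, and time travel work through Spark SQL.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)
```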
B
Essentially,
you
have
your
Delta
Lake
Library
loaded
into
your
EMR
cluster,
for
you
to
be
able
to
run
all
these
processing
steps
for
your
for
your
data,
so
of
course
we're
going
to
import
the
data
and
we're
going
to
create
create
a
raw
table,
so
the
data
that
we
have
is
in
in
parquet
format.
By
the
way
this
is
data
that
is
openly
available.
This
notebook.
B
This
is
going
to
be
on
GitHub
or
it's
already
on
GitHub,
so
you
can
just
download
or
clone
the
GitHub
repository
and
run
this
notebook
and
yeah.
So
the
data
set
is
in
a
parquet
format.
It's
I
think
it's
a
few
hundred
gigabytes
of
parquet
Finance
taken.
So
you
know
we.
We
read
this
data
set
into
using
spark
into
the
EMR
cluster
that
we
are
connected
to
and
that's
exactly
what
we
are
doing
here
and
once
we
do
that
you
know
we
can.
B
We
can
run
some
pre-processing
steps
on
the
data
right
now.
We
will.
We
want
to
kind
of
highlight
the
differences
between
open
source
parquet
and
the
open
source
Delta
Lake
format.
You
know,
Delta
lake
is
built
on
top
of
parquet.
It
does
use
parquet
as
a
default
storage
and
then
it
has
an
additional
meta
data
layer,
for
you
know
the
data,
reliability
and
schema
enforcement,
and
things
of
that
nature
that
Denny
talked
about
previously
in
this
slides.
B: You know, once we have read this Parquet data into this Spark DataFrame called data, we're going to write it out in Delta Lake format, and it's very easy to do that with, you know, Spark; and, you know, we're going to partition by addr_state, the state where the loan was given out, and then we're going to write that out back into S3. And then we ran some pre-processing steps, and, if you're familiar with the bronze/silver/gold, you know, stages within Delta Lake, we're going to write the processed Delta Lake table,
B
You
know
that's
the
silver
table
in
a
separate
S3
location,
we're
going
to
create
two
separate
Delta
Lake
tables
for
those
for
those
data
sets,
and
so
one
is
for
the
raw
data,
and
then
one
is
for
the
for
the
cleansed
data
right
and
this
is
we
can
see
that
here
within
Studio.
You
know
you
can
call
and
you
can
use
a
SQL.
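A hedged sketch of those two steps, reading the public Lending Club Parquet files and writing the raw (bronze) and cleansed (silver) Delta tables; the S3 paths and the cleansing logic are illustrative assumptions, not the exact cells from the demo notebook:

```python
from pyspark.sql import functions as F

raw_parquet = "s3://my-bucket/lending-club/parquet/"        # hypothetical path
bronze_path = "s3://my-bucket/lending-club/delta/bronze/"   # hypothetical path
silver_path = "s3://my-bucket/lending-club/delta/silver/"   # hypothetical path

# Read the openly available loan-risk data, which ships as Parquet.
data = spark.read.parquet(raw_parquet)

# Raw/bronze Delta table, partitioned by the state the loan was issued in.
(data.write.format("delta")
     .mode("overwrite")
     .partitionBy("addr_state")
     .save(bronze_path))

# Token cleansing step for the silver table (assumed column names).
silver = (spark.read.format("delta").load(bronze_path)
          .dropna(subset=["loan_status"])
          .withColumn("int_rate",
                      F.regexp_replace("int_rate", "%", "").cast("double")))

(silver.write.format("delta")
       .mode("overwrite")
       .partitionBy("addr_state")
       .save(silver_path))

# Register the paths as tables so later SQL cells can refer to them by name.
spark.sql(f"CREATE TABLE IF NOT EXISTS loans_bronze USING DELTA LOCATION '{bronze_path}'")
spark.sql(f"CREATE TABLE IF NOT EXISTS loans_silver USING DELTA LOCATION '{silver_path}'")
```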
B
You
know
here
is
SQL
extension
to
you
know
analyze
your
data
there's
also
built-in
visualization
within
studio
notebooks.
So
here,
if
I
wanted
to
create
a
bar
chart
or
a
pie,
chart
I
can
I
can
do
that
from
within
the
notebook
itself.
B
I
can
do
a
describe
table
here.
You
can
see
the
schema
of
the
table
and
then,
if
I
look
at
the
S3
location,
where
the
data
Delta
Lake
data
is
stored.
This
is
the
metadata
all
of
that
is
being
stored
in
a
Json
format.
Right,
so
you
know,
that's
that's
kind
of
the
difference
between
Park,
a
and
Delta
lake.
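For example, a couple of hedged inspection cells (using the hypothetical table and path from the sketch above) make that metadata layer visible:

```python
# Table schema, as surfaced by the notebook's SQL support.
spark.sql("DESCRIBE TABLE loans_silver").show(truncate=False)

# The _delta_log directory under the table path holds the JSON commit files
# that form Delta Lake's metadata layer on top of plain Parquet data files.
log = spark.read.json("s3://my-bucket/lending-club/delta/silver/_delta_log/*.json")
log.printSchema()
log.show(truncate=False)
```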
B
Is
that
Delta
Lake
ads
that
external
metadata
storage,
for
you
know
for
giving
end
users
data
reliability
on
top
of
on
top
of
Open
Source
per
k,
and
then
Delta
Lake
also
provides
a
full
DML
support
on
on
top
of
that
storage.
So,
if
you
wanted
to,
you
know,
Run
update
operations.
If
you
wanted
to
delete
your,
you
know
certain
rows
in
your
data
based
on
some
logic
or
you
wanted
to
merge.
B: for machine learning, you know, Delta Lake gives you this full DML support for that purpose, right? So that's exactly what we're going to show here. First, we're gonna try to run these kinds of operations on Parquet; so, you know, for example, if you run this DELETE operation on top of Parquet, you can see that it errors out.
B
It
won't
work
on
parquet,
but
if
we
were
to
run
that
exact
same
delete
operation
on
Delta
Lake,
it
does
work
right,
and
so
so
that's
the
that's
the
concept
we
can
see
that
those
rows
regarding
you
know
with
the
loans
that
were
giving
up
given
out
in
the
state
of
Iowa,
for
example,
were
deleted
it
from
from
this
table
Delta
link
table,
whereas
the
same
internet
operation
failed
on
Arcade,
same
concept
here
with
update,
in
this
case
I'm,
going
to
update
the
count
of
the
number
of
loans
in
the
given
out
in
the
state
of
Washington,
and
you
know
the
same
thing:
it'll
fail
on
parquet,
but
it'll
go
through
on
Delta
Lake
and
that's
exactly
what
happened
here.
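A minimal sketch of that comparison, using the hypothetical table names registered earlier:

```python
# On a plain Parquet-backed table this DELETE raises an AnalysisException;
# it is shown commented out so the cell runs cleanly.
# spark.sql("DELETE FROM loans_parquet WHERE addr_state = 'IA'")

# On the Delta table the same statement succeeds: drop all Iowa loans.
spark.sql("DELETE FROM loans_silver WHERE addr_state = 'IA'")

# Likewise, an UPDATE against the Delta table goes through (illustrative logic).
spark.sql("""
    UPDATE loans_silver
    SET int_rate = round(int_rate, 2)
    WHERE addr_state = 'WA'
""")
```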
B: And yeah, in full disclosure, this is originally, I think, Denny's notebook that I just imported into SageMaker Studio, and basically, you know, you can take other notebooks, as long as they're in .ipynb format; you can just import them, choose the right kernel within Studio, and just, you know, make some slight modifications, like the storage, etc., and just run through these notebooks.
B
So
it's
very
straightforward
so
in
similarly
for
merging
I
think
everyone
is
very
well
aware
of
this
concept
within
Delta
Lake,
but
if
you
wanted
to
do
In-Place
merges
within
within
these
Delta
Lake
tables,
you
can
do
that
as
well
without
having
to
create
all
these
separate
views
that
we
had
to
do
back.
You
know
10
10
years
ago
or
prior
to
the
Delta
Lake
days.
B
You
know
and
use
spark
And,
Hive
and
cons.
You
know
other
other
open
source
projects.
In
order
to
have
this,
you
know
merge
capability.
You
can
do
these
in
page
Place
merges
with
with
a
simple
SQL
statement
on
top
of
your
public
storage.
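For instance, a hedged upsert sketch; the source view name and the join key are assumptions made for illustration:

```python
# `loan_updates` is assumed to be a temp view of new or corrected rows keyed by `id`.
spark.sql("""
    MERGE INTO loans_silver AS target
    USING loan_updates AS source
    ON target.id = source.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```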
B
And
then
you
also
have
schema
Evolution
capabilities.
You
know
we
I
showed
you
the
metadata
layer,
which
was
you
know,
also
stored
on
S3
in
Json
format.
So
now
you
can
use
that
to
make
changes.
You
know
you
can
use
spark
to
make
changes
to
to
your
schema
on
on
your
Delta
Lake
storage
right.
So
that's
exactly
what
we
are
going
to
do
here.
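A small sketch of that schema-evolution step, appending rows that carry an extra column and letting Delta merge it into the table schema; the path and the new column are the hypothetical ones used above:

```python
from pyspark.sql import functions as F

silver_path = "s3://my-bucket/lending-club/delta/silver/"   # hypothetical path

# Rows carrying a column the table doesn't have yet.
new_rows = (spark.read.format("delta").load(silver_path)
            .limit(100)
            .withColumn("risk_flag", F.lit("unknown")))

# mergeSchema=true evolves the table schema on write instead of failing.
(new_rows.write.format("delta")
         .mode("append")
         .option("mergeSchema", "true")
         .save(silver_path))
```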
B
We
you
know
once
we've
created
that
merge
table,
we
merge
the
schema
and
append
to
that
Delta
leak
path
and
then
eventually
we'll
end
up
making
a
gold
table
right
which
we
can
then
use
for
other
purposes,
such
as
reporting
and
parameters.
Etc,
there's
also
time
travel
capabilities.
So,
if
you
wanted
to
you
know,
let's
say
you
inject
it
and
you
got
some
bogus
data
you
injected
that
into
your
silver
table,
and
you
wanted
to
clean
that.
B
You
know
you
can
you
can
do
that
very
easily
with
with
time
travel
within
Delta
Lake
there's,
you
know,
there's
a
described
history
capability
and
then
you
can
also
see
you
know.
All
of
that
is
being
versioned
or
the
data,
as
new
versions
of
the
tables
are
created.
Those
are
versioned
within
the
Delta
Lake
metadata.
So
and
you
can
you
can
roll
back
and
forward
on
on
different
versions
right.
So
you
know
you
can
see
a
very
robust
data,
reliability
here
for
running
these
machine
learning
workloads.
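A hedged time-travel sketch; the RESTORE command is available in recent open-source Delta Lake releases, and on older versions you would overwrite from the old snapshot instead:

```python
# Inspect the commit history of the table.
spark.sql("DESCRIBE HISTORY loans_silver").show(truncate=False)

# Read the table as it looked at version 0 (timestampAsOf works similarly).
v0 = (spark.read.format("delta")
      .option("versionAsOf", 0)
      .load("s3://my-bucket/lending-club/delta/silver/"))   # hypothetical path
print(v0.count())

# Roll the live table back to that version if bad data slipped in.
spark.sql("RESTORE TABLE loans_silver TO VERSION AS OF 0")
```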
B
So
once
you
have
that
you
know
you
can
give
using
stagemaker
Studio,
you
know
you
can
give
you
can
bring
in
your
data
scientists,
ml
Ops,
Engineers
Etc
to
come
and
now
start
building
machine
learning
models
on
top
of
that
Delta
data
sitting
within
your
S3
right.
So
let's,
let's
look
at
that
now,
real
quickly.
B
It's
the
same
concept
here:
I
just
switched
notebooks
and
here
I'm,
going
to
take
the
data
that
we
just
downloaded
and
created
a
Delta
lake
table
out
of
to
run
machine
learning,
yeah
and
run
machine
learning
at
scale
and
basically
take
a
very
simple
task.
I
mean
we
I
showed
you.
We
have
these
really.
B
You
know
state-of-the-art
machine
learning
models
for,
depending
on
the
kind
of
use
cases
generative
AI.
You
know
semantic
segmentation
Etc,
but
in
this
case
we're
going
to
do
a
simple
binary
classification
just
for
demo
purposes
right
so
same
concept.
Here,
I'm
going
to
connect
to
an
existing
cluster
I
showed
how
you
can
do
that
with
the
cluster
tab.
Previously,
here
I'm
going
to
collect,
connect
to
the
ml
cluster
and
then
I'm
going
to
read
the
deltaic
files
right.
So
now
you
can
have
two
things
one
you
can
there
is.
B
You
know:
there's
glue
integration
and
Athena
integration
with
Delta
lake,
so
you
can
have
a
glue
catalog
of
all
your
Delta
Lake
data
and
there
is
sagemaker
Studio
integration
with
glue
as
well.
So
you
can
read
that
data
directly
from
the
blue
catalog
right.
These
are.
These
are
permanent
tables
that
are
registered,
they're,
managed
tables
that
are
registered
with
the
glue
catalog.
You
can
read
them
directly
into
Studio
using
spark
apis
right.
B
In
this
case,
though,
you
know
I'm
not
connected
to
Google
catalog,
so
I'm
going
to
go
directly
to
S3
and
load
those
files
directly
into
a
spark
data
frame
by
simply
saying
format,
Delta
and
load
those
files
into
a
data
frame.
I'm
going
to
do
some
data
munging,
for
you
know
specifically
for
machine
learning.
We
are
trying
to
create
this
column.
B
You
know
whether
you
know
certain
feature
columns
right,
we're
trying
to
do
some
feature:
engineering
such
as
if
the
individual
defaulted
on
a
loan.
If
there
was
a
charge
off,
if
there
were
any
if
there
were
any
late
payments,
you
know
we
have
kind
of
this
transactional
data
in
our
raw
storage
and
I'm,
going
to
create
some
feature,
columns
that
are
necessary
for
machine
learning.
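A sketch of that step, loading the Delta files straight from S3 and deriving a binary label plus a few feature columns; the column names follow the public Lending Club schema and are assumptions about what the demo notebook actually uses:

```python
from pyspark.sql import functions as F

loans = (spark.read.format("delta")
         .load("s3://my-bucket/lending-club/delta/silver/"))   # hypothetical path

features = (loans
    # Label: 1 if the loan ended badly (default, charge-off, or late payments).
    .withColumn("bad_loan",
                F.when(F.col("loan_status").isin(
                    "Default", "Charged Off",
                    "Late (31-120 days)", "Late (16-30 days)"), 1).otherwise(0))
    # Illustrative engineered feature: midpoint of the reported FICO range.
    .withColumn("fico_avg",
                (F.col("fico_range_low") + F.col("fico_range_high")) / 2)
    .select("bad_loan", "loan_amnt", "term", "int_rate", "grade",
            "annual_inc", "dti", "addr_state", "fico_avg"))

features.createOrReplaceTempView("loan_features")
```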
B
So
that's
exactly
what
I'm
doing
I'm
doing
here
and
then
I
create
a
view
right
and
so,
like
I
mentioned
here
earlier,
you
can
use
the
SQL
view
to
view
your
data
and
the
other
great
thing
about
our
studio.
Sagemaker
studio
and
using
this
master
integration
is
that
it's
very
easy
for
you
to
move
back
and
forth
between.
B
You
know
using
a
studio
instance
versus
using
a
cluster,
so
you
can
prototype
on
a
sagemaker
studio
instance
right
on
a
smaller
data
set
to
test
out
your
code
and
then
once
you
need
to
really
run
that
at
scale.
You
can
simply
just
change
this
little
extension
here
and
then
and
run
that
on
on
the
cluster
itself.
Okay,
so
it's
and
that's
exactly
what
I'm
doing
here,
if
I
just
say,
percent
local
I
can
see
you
know
this
data
frame
essentially
gets
surfaced.
B
The
spark
data
frame
gets
surfaced
into
a
pandas
data
frame
now
on
my
local
Studio
instance.
So
it's
running
on
a
single
node
and
that's
exactly
what's
going
on
here,
but
if
I
wanted
to
run
this
as
a
push
down
SQL
job
and
visualize,
the
data
you
know
on
a
larger
scale,
I
can
I
can
do
that
by
simply
pointing
to
this.
You
know
by
changing
this
little
line
here
within
the
jupyter
notebook.
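The magics below are a sketch of that workflow as it typically looks with SparkMagic-style kernels used for EMR-connected Studio notebooks; the comment lines simply mark the two separate cells, and the view name is the hypothetical one created above:

```python
# Cell 1, runs on the EMR cluster: push-down SQL; -o pulls the result back to
# the local Studio instance as a pandas DataFrame named state_counts.
%%sql -o state_counts
SELECT addr_state, count(*) AS loan_count
FROM loan_features
GROUP BY addr_state
ORDER BY loan_count DESC

# Cell 2, runs locally on the single-node Studio instance: plain pandas.
%%local
state_counts.head(10)
```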
B
Okay.
So
once
we
have
that,
you
know
we're
gonna
start
with
the
actual
machine
learning
process
where
we,
you
know,
create
the
Target
and
the
the
Target
and
the
feature
variables.
So
you
know
we
have
some
categorical
variables,
some
numerical
variables,
that's
exactly
what
we
Define
here.
We
will
need
this
when
we
actually
run
the
particular
model
training
job,
because
you
need
to
specify
to
sagemaker
to
build
an
algorithms
within
sagemaker
studio
which
variables
are
categorical
and
which
variables
are
numerical.
B
Now,
if,
when
you're
running
an
automated
tuning
job
or
using
autopilot
or
Auto
ml,
there
are
also
capabilities
for
automatically
understanding
which
variables
are
categorical
and
which
variables
are
are
numerical,
so
that
capability
is
already
is
also
available
to
full
users.
B
So
that's
exactly
what
we're
doing
here
we
are
going
to
first
convert.
You
know
the
categorical
columns
as
category
type
for
built-in
algorithms
and
we're
going
to
create
these
two
data
frames.
There's
a
training
data
frame-
and
you
know
which
contains
about
70
of
your
data
set
and
then
there's
the
validation
data
frame
right,
which,
once
the
model
is
trained
it
or
once
the
model
is
going
through
the
training
process.
B
It
needs
to
validate
the
output
from
the
training
process
on
on
a
data
set
to
make
sure
that
it's
improving
its
accuracy
as
it
goes
through.
Various
steps
of
training,
so
you
know
once
we
have
these
two
spark
data
frames.
Now
built-in
algorithms
won't
directly
read
data
from
Delta
lake,
so
there
is
a
another
step
involved
here.
So
these
training
and
validation
steps,
a
validation
data
sets
or
the
data
frames
that
we
just
created
need
to
be
written
out
into
a
separate
storage.
B
Now
built-in
algorithms
will
not
read
from
Delta
Lake,
but
they
will
read
from
open
source
parquet,
okay.
So
that's
exactly
what
we're
going
to
do
here,
we're
using
parquet,
because
our
data
set
is
fairly
large.
You
know
built-in
algorithms
also
support
other
formats
like
Json
and
CSV.
C
B
Others,
but
in
this
case,
because
we
have
a
large
data
set,
we
are
using
parquet
and,
like
I
mentioned
earlier,
you
know
this
is
a
you
know,
a
match
made
in
heaven
in
a
way:
Studio
spark
and
Delta
Lake,
because
you
can
take.
You
know
this.
These
spark
data
frames
and
and
write
them
out
into
parquet
back
into
S3
and
and
built-in
algorithms
will
read
that
in
those
part
that
parquet
data
from
S3
and
and
spin
up
these
distributed
training
jobs
in
in
sagemaker
studio.
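A hedged sketch of that hand-off, splitting the engineered DataFrame from the sketch above roughly 70/30 and writing both splits out as plain Parquet on S3 for the built-in algorithm to read; the bucket and prefixes are hypothetical:

```python
train_df, validation_df = features.randomSplit([0.7, 0.3], seed=42)

train_path = "s3://my-sagemaker-bucket/loan-risk/train/"            # hypothetical
validation_path = "s3://my-sagemaker-bucket/loan-risk/validation/"  # hypothetical

# Built-in algorithms do not read Delta directly, but they do read Parquet.
train_df.write.mode("overwrite").parquet(train_path)
validation_df.write.mode("overwrite").parquet(validation_path)
```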
B
Okay,
and
that's
exactly
what
we're
going
to
do
here.
So
we're
going
to
Define
some
parameters
for
the
training
jobs.
We
have
to
define
the
S3
bucket,
and
this
is
the
S3
bucket,
where
our
train
and
training
and
validation
data
sets
are
stored,
which
we
just
created
in
this
step,
and
then
we
have
to
Define
some
other
configurations
for
the
built-in
algorithm
as
well,
by
the
way
we're
using
extreme
boost.
Here
you
have
access
like
I
mentioned
earlier,
to
a
plethora
of
different.
B
You
know
libraries
and
algorithms
such
as
pytorch
tensorflow,
XG
boost
mxnet
and
then
proprietary,
algorithms
and
some
other
open
source
and
many
other
open
source
algorithms
as
well,
and
then,
once
you
know,
we,
we
have
to
give
this
training
job
a
name
and
you'll
see
why
that
why?
That
is
the
case
in
a
minute
and
then
we'll
Define,
some
hyper
parameters
right.
So
there
are
many
different
tuning
strategies.
B
We're
using
you
know,
automl,
so
there's
different
tuning
strategies
that
you
can
use
to
get
to
the
best
machine
learning
model
and
for
those
tuning
strategies.
We
are
essentially
hinting
to
the
to
sagemaker
to
take
certain
parameter
ranges
into
account
when
tuning
the
machine
learning
job.
Okay
and
then
we
have
resource
limits.
We
can
Define
the
number
of
training
jobs
and
the
parallel
training
jobs
that
can
run
at
the
same
time.
B
This
is
this
is
important
because
you
want
to
get
to
a
model
as
quickly
as
possible,
but
you
also
want
to
want
to
do
it
in
with
the
you
know,
in
an
economical
fashion.
So
this
this
can
be.
You
know,
as
distributed
as
you
wanted
to
be.
These
clusters
can
be
very
large
depending
on
your
data
set
and
depending
on
your
machine
learning
problem,
but
you
know
so
this
is
where
you
can
basically
hard
code,
those
parameters,
and
then
here
you
can
Define
the
strategy
right.
B
So
there
are
genetic
algorithms
that
are
there
this
in
this
case
we're
going
to
use
and
then
there's
grid
search,
random
search,
et
cetera.
In
this
case,
we
are
going
to
use
the
Bayesian
search.
So
that's
exactly
what
we're
doing
here
and
then
we
provide
these
training
job.
We
pass
these
specifications
to
a
variable
in
Python
and
then
we
create
a
hyper
parameter
tuning
job
and
and
we
go
ahead
and
launch
this
tuning
job.
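A condensed sketch of what that SageMaker SDK code typically looks like for the built-in XGBoost algorithm with Bayesian hyperparameter tuning; the bucket, metric, ranges, and job sizes are illustrative assumptions rather than the demo's exact values:

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = "my-sagemaker-bucket"                     # hypothetical bucket

# Built-in XGBoost container for the current region.
image_uri = sagemaker.image_uris.retrieve(
    "xgboost", region=session.boto_region_name, version="1.5-1")

xgb = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    output_path=f"s3://{bucket}/loan-risk/output/",
    sagemaker_session=session,
    base_job_name="loan-risk-xgb",                 # the training job name mentioned above
)
xgb.set_hyperparameters(objective="binary:logistic", num_round=200)

tuner = HyperparameterTuner(
    estimator=xgb,
    objective_metric_name="validation:auc",
    hyperparameter_ranges={                        # hinted parameter ranges
        "eta": ContinuousParameter(0.01, 0.3),
        "max_depth": IntegerParameter(3, 10),
        "min_child_weight": ContinuousParameter(1, 10),
    },
    max_jobs=10,                                   # resource limits
    max_parallel_jobs=2,
    strategy="Bayesian",                           # the tuning strategy chosen here
)

tuner.fit({
    "train": TrainingInput(f"s3://{bucket}/loan-risk/train/",
                           content_type="application/x-parquet"),
    "validation": TrainingInput(f"s3://{bucket}/loan-risk/validation/",
                                content_type="application/x-parquet"),
})
```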
B
So
as
this
tuning
job,
you
know
starts
to
kick
off,
you
know
in
this
case
it
you
know
for
this
demo.
It
takes
about
seven
to
ten
minutes
to
complete,
you
can
go
to
your
stagemaker
console
and
if
you
go
into
the
training
section
first
of
all
there
are
you
know,
list
of
algorithms
that
you
can
view
here.
You
can
also
bring
your
own
algorithm.
So
if
you
have
developed
an
algorithm
in-house,
you
have
a
container.
B
You
can
you
can
just
import
your
existing
algorithm
to
H
maker
studio
and
then
you
know.
Obviously
there
are
algorithms
that
are
already
there
that
we
are
using,
for
example,
in
this
demo,
and
here
you
can
see
you
know,
you'll
see
in
progress.
You
know
when
your
training
job
gets,
kicked
off.
B
You'll
actually
see
that
in
progress
here
and
you
can
go
into
your
sagemaker
training
console
and
you
can
go
to
each
stage
of
your
training
job
and
you
can
look
at
your
or
you
know
you
can
look
at
different
log
information
right.
You
can
look
at
the
location
of
the
training
data
set.
You
can
look
at
the
parameters,
the
hyper
parameters
and
you
can
look
at
the
different
metrics
and
then
you
can
also.
B: so this took about seven to ten minutes to complete. Obviously, you know, you can speed that up; if you throw more compute at your job, you can speed that up.
B
But
you
know
once
once
that
is
done.
You
know
I
register
I
go
ahead
and
I
have
the
data
set.
You
know
the
final
you
know,
I
have
the
data
set
in
S3
and
then
the
final
model
artifacts
or
the
model
metadata
is
also
stored
on
S3
once
the
training
job
is
completed,
so
I
can
do
one
of
two
things
so
I
can
either
call
the
sagemaker
or
SDK
to
register
the
final
model.
B
If
it's
to
my
satisfaction
to
the
built-in
model
registry
or
I
can
take
the
location
and
there's
a
UI
right
here
and
I
can
create
create
a
model
version.
So
once
the
training
job
is
run,
I
can
point
to
the
artifact,
which
is
a
zipped
file
and
point.
You
know
the
model
registration,
stagemaker
model
registry
to
that
S3,
location
and
and
create
that
model
in
the
registry
which
we
can
then
use
for
tracking.
As
we
you
know,
train
new
models
on
new
versions
of
the
data
set
so
essentially
with
Delta
Lake
you're.
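A hedged SDK sketch of that first option, registering the tuner's best model into a model package group in the SageMaker Model Registry; the group name and content types are assumptions:

```python
best = tuner.best_estimator()   # estimator behind the best training job

model_package = best.register(
    model_package_group_name="loan-risk-xgboost",   # hypothetical group name
    content_types=["application/x-parquet"],
    response_types=["text/csv"],
    inference_instances=["ml.m5.large"],
    transform_instances=["ml.m5.large"],
    approval_status="PendingManualApproval",        # approve/reject later in the UI
)
print(model_package.model_package_arn)
```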
B: So, essentially, with Delta Lake you're getting this data versioning capability, and with the SageMaker model registry you're getting this model versioning and registry as well, right? And each version of the model also comes with the appropriate training and validation datasets, so you can also tie back to which dataset you used to build these particular models. You can also create endpoints from this UI itself; of course, you can do all of that with the SageMaker SDK, but you can also create these different kinds of endpoints.
B
There
are
asynchronous
synchronous,
real-time
endpoints,
you
know
different
kinds
of
modalities,
depending
on
your
use
case.
That
sagemaker
provides,
of
course,
they're
all
scalable.
You
know
they
Scale
based
on
traffic
Etc,
and
so
all
of
that
can
be
done
through
the
UI
as
well.
And
here,
if
I
go
into
this,
is
this
is
exactly
the
job
that
I
just
ran.
This
is
my
my
training
job.
If
I
go
into
this
I
can
see
the
different
versions
of
the
model
right
and
then
I
can
reject.
B
Some
I
can
approve
some,
and
then
we
can
have
some
have
some
actions
taken
once
these
these
models
have
been
approved
or
rejected
or
appended
Etc.
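And, as a final hedged sketch continuing from the registration example above, deploying an approved model behind a real-time endpoint via the SDK; the instance type and endpoint name are illustrative:

```python
predictor = best.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    endpoint_name="loan-risk-xgb-endpoint",   # hypothetical endpoint name
)
# The endpoint can then be invoked with predictor.predict(...).
```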
A: Cool. So, for starters, we're not going to answer all questions, because we ran a little long, and that's on us, primarily me. So, with that, join us at go.delta.io/slack; Vedant and myself are there to answer questions, number one. Number two, I just posted to everybody the GitHub repo which these notebooks are going to be posted to, so we're gonna resolve and merge that PR later today. I think the question I want to leave it with is one Harry asked:
A: a great question, which is, what are the differences between Apache Spark, Delta Lake, Databricks, and EMR, and, and this is what I also want to call out, SageMaker Studio. So, for starters, Apache Spark is a big data processing engine, Delta Lake is a storage format, okay, and then Databricks, EMR, and SageMaker are services. Databricks I can talk about, and I'll let Vedant talk about EMR and SageMaker. Databricks is basically your lakehouse platform service.
A
We
both
work
well
together
and
slightly
compete
against
EMR
sagemaker
Studio,
but
this
is
great,
we're
all
friends
too.
So
that's
why
we're
on
the
session
together
and
but
not
I'll
leave
you
with
the
last
words
on
EMR
and
sagemaker.
B
Yeah
so
yeah
EMR
is
is
a
big
data
platform.
That
is
it's
a
service.
That
is
it's
a
first
party
service
within
within
AWS.
So
you
know
it
now
supports
their
colleague
as
well,
and
it
offers
not
only
spark
but
they're
a
plethora
of
different
Big
Data
open
source
projects
that
it
offers
so
I
think
there
is
high,
there's
Presto
and
multiple
others
that
are
built
in
which
are
really
for
building
these
big
data
workloads
and
then
sagemaker
studio
is
really
like.
B
You
know
we
talked
about
in
slides,
it's
like
it's
that
Landing.
You
know
that
course,
party
platform,
first
party
service
within
AWS,
for
building
your
machine
learning
workloads
end
to
end
right.
So
there
are
places
where
you
know
you
like
I
show
like
EMR
and
sagemaker
can
play
really
well
together.
There
are
places
where
sagemaker
databricks
can
play
really
well
together
as
well,
which
we
won't
cover
in
the
session.
We
can
have
a
separate
session
for
that,
but
you
know
yeah.
B
There
are
overlaps
between
the
different
services,
but
there
are
also
places
where
we
can
be
complement
a
complimentary
to
each
other.
A
Perfect,
okay,
we're
actually
at
the
top
of
the
R,
so
I'm
gonna.
Unfortunately,
we're
gonna
have
to
wrap
it
up.
So
I
wanted
to
say
thank
you
very
much
for
everybody
for
attending
today's
session,
the
recordings
on
LinkedIn
and
also
on
YouTube
and
as
well.
A
If
you
have
any
questions
again,
but
not
myself
are
both
on
the
Delta
users,
slack
go.io
slack
and
oh
yes,
that's
right
and
Spotify.
The
session
will
also
be
on
Spotify
very
soon
so
again,
but
not
thank
you
very
much
for
attending
today
for
speaking
at
today's
session
and
to
everybody
else
again.
Thank
you
very
much
for
attending.