Description
Combining SageMaker Studio and Delta Lake brings state-of-the-art machine learning to your data lake. In this session, we show how you can train ML models and how you can take advantage of the capabilities offered by Delta Lake using Amazon SageMaker Studio.
Quick Links
Vedant Jain: https://www.linkedin.com/in/vedantjain/
Denny Lee: https://www.linkedin.com/in/dennyglee/
Join us on Slack: https://go.delta.io/slack
GitHub: https://github.com/delta-io
Join Google Groups: https://groups.google.com/forum/#!forum/delta-users
A: [Live stream] Hi everybody, we're just waiting for a few minutes to get ourselves set up on LinkedIn and YouTube. So take your time, go back, get some coffee, you know, get yourself all ready, and then we'll be ready to start the show. And by the way, if you're wondering, this particular session happens to be about ingesting data from Delta Lake into SageMaker Studio, so yeah, you're on the right channel. So just give us a few minutes to get all set up. Okay.
A: And meanwhile, while we're waiting for everybody to chime in, and while LinkedIn and YouTube are almost ready, why don't you go ahead and tell us where you're based out of. For example, I am based out of, you're not gonna believe me, sunny Seattle. So that's where I'm based. How are you?
A: Right, so man, we're all sunny today. This is pretty sweet. All right, well, okay, I think we're good to go. We have the LinkedIn live stream, that's up and running, we have the YouTube live stream up and running, and we have Zoom, which a bunch of you guys are on here as well. So let's start the show; we're gonna start off and kick it off with Carly.
A
You've
got
a
bunch
of
announcements
to
make,
so
let's
go
ahead
and
do
it
and
by
the
way,
if
you're
not
familiar,
Carly's
are
go-to
person
when
it
comes
to
All,
Things,
Delta,
Lake,
social
and
marketing,
wise
events
wise
so
happy
to
have
here
and
have
her
actually
join
and
join
the
conversation
more
often,
as
opposed
to
just
you
know,
hiding
behind
the
scenes
so
we'll
take
it
away.
Carly.
C: So one week from today, Wednesday, May 24th at 9:00 a.m. Pacific, we will be going live on the Databricks LinkedIn and YouTube. Matei Zaharia, co-founder and chief technologist at Databricks, our lovely Denny Lee, and Martin Grund, senior software engineer at Databricks, are coming together for an online meetup: use Spark from anywhere. This is going to be a very exciting conversation; I will drop the RSVP link in this chat. After this, we also have, Thursday, May 25th at 9:00 a.m.
C: Pacific, a wonderful D3L2 session with Andy Grove, who has been specializing in query engines and distributed systems. I will also include that RSVP link. And, lastly, June 26th to the 29th we are coming together for Data and AI Summit: top experts, researchers, and open source contributors across the data and AI community are coming together, June 26th to the 29th.
C: If you use the code ETLINUX400, you can save $400 off the regular price of a full conference pass; that code expires June 2nd. And now I'm going to flip it over to Denny for today's discussion with Vedant Jain, and they're going to show you how to train ML models and how you can take advantage of the capabilities offered by Delta Lake using Amazon SageMaker Studio.
A: Perfect, Carly, thank you very much, really appreciate you diving in like that. So hey, without further ado, Vedant's gonna start presenting the slides, because he's gonna be the guy to run things. Meanwhile, I wanted to call out that, hey, Vedant and I are old friends.
A
We
used
to
work
together
at
databricks,
and
so
when
we
found
an
opportunity
for
him
and
I
to
go
ahead
and
speak
together,
it
seemed
to
make
all
the
sense
in
the
world
a
lot
of
the
stuff
that
he's
been
working
on
is
very
much
related
within
the
realm
of
the
sagemaker
world,
which
is
awesome
and
sweet,
and
myself
I'm
very
well.
You
probably
know
me
already
a
little
on
the
bias
on
the
Delta
Lake
Side.
A
So
that's
why
we
figured
we
have
this
session
together
and
talk
about
stage
meeting
studio
for
machine
learning
on
Delta
lake.
So,
but
not
you
want
to
take
it
away,
or
would
you
like
me
to
go
ahead
and
continue
on
didn't
know?
We
didn't
prep
very
well
wow.
A
B: I mean, I can just introduce myself real quick, and then, yeah, you can start talking about Delta Lake and, you know, do a brief overview of the agenda.
B
Great
hello,
everyone,
my
name
is
vidhan
Jain
and
I'm,
currently
working
at
Amazon
web
services.
B
I
am
a
senior
AI
ml,
specialist
and
I
work
directly
with
all
of
the
AI
ml
products
within
within
AWS
and
I
work
on
strategic
Partnerships.
So
there
are
multiple,
very
specific
companies
that
I
work
with,
and
my
prior
in
my
prior
lifetime,
I
I
used
to
be
at
databricks.
I
was
a
Solutions
architect
at
databricks
and
that's
how
I
got
to
know
about
Delta
Lake
and
got
to
work
on
some
very
interesting
projects
regarding
Delta
lake.
B
So
you
know
when
we
got
this
opportunity,
it
was
a
great
time
to
bring
sagemaker
studio
and
Delta
Lake
together
because
of
all
the
great
Innovation
that's
going
on
in
the
open
source
community.
So
that's
exactly
what
we're
going
to
talk
about
today,
talk
about
today!
So
back
to
you
today,.
A
Awesome
thanks
very
much
all
right
next
slide,
please
so
there
we
go
so
our
agenda
today.
A
Basically,
is
we're
going
to
talk
about
the
Delta
Lake
fundamentals:
we're
going
to
start
with
that,
and
so
just
in
case
you
all
are
not
as
familiar
with
Delta
Lake
I'm
sure
most
of
you
are
so
we're
going
to
Breeze
through
it
quickly,
but
at
least
if
you
are
not
as
familiar
the
slides
will
be
available
and
for
you
to
go
ahead
and
actually
access
them
and
refer
to
then
we're
going
to
talk
about
the
benefits
of
machine
learning
on
Delta
Lake,
because
that's
always
a
good
thing
and
from
there
we'll
switch
over
to
the
DOT
to
talk
about
an
introduction
to
sagemaker
Studio
sagemaker,
build
ml
models,
a
demo
and
call
to
action.
A
Now,
if
you
have
any
questions,
if
you're
on
our
Zoom
more
than
welcome
to
go
ahead
and
chime
in
on
the
Q
a
panel
to
ask
your
questions
now,
please
keep
your
questions
related
to
sagemaker
studio
and
Delta
Lake,
please,
okay,
we
do
want
them
to
be
around
these
two
concepts:
okay
and
if
you're
on
LinkedIn,
again
you're
more
than
welcome
to
chat
on
the
comments
section
and
also
YouTube
same
idea:
go
ahead
and
chat
there
so
I'll
be
monitoring.
A
All
three
I'm
gonna
be
a
little
slow
in
the
beginning,
just
because
I'm
going
to
be
talking
first,
but
yeah,
let's
go
ahead
and
switch
the
next
slide
and
talk
about
the
Delta,
Lake
fundamentals,
and
so
for
those
of
you
that
may
or
may
not
be
familiar
with
Delta
Lake
I,
don't
like
Delta
lake
is
an
open
storage
format
that
brings
acid
transactions
to
big
data
workloads
on
cloud
object
stores.
This
is
the
key
ingredient
to
this
concept
of
building
lake
houses.
A
Lake
houses,
there's
this
culmination
of
of
the
data
warehousing
World
well
from
terms
of
business
data
and
the
data
Lake
World.
Bringing
these
two
concepts
together.
Now
next
slide,
please.
A
Okay,
so
we're
going
to
do
Delta,
Lake
foundation
or
lake
house
yeah
just
go
ahead
and
skip
to
the
next
slide
here
all
right.
So
the
context
that
we
want
to
talk
about
is
that
when
we
we
talk
about
lake
houses
in
the
first
place,
it's
to
say
that
you
have
the
manageability
and
the
Simplicity
and
the
transactional
reliability
of
a
data
warehouse
plus
the
scalability
and
the
flexibility
of
a
data.
Lake
take
the
two
concepts:
the
bust
of
Both
Worlds
together
and
then
that's
what
you
have
with
the
lake
house.
A
The
lake
house
is
basically
taking
advantage
of
your
Cloud
optic
storage.
It
has
the
scalability
and
flexibility
of
it
yet
at
the
same
time,
having
the
transactional
protections
around
that
data,
I
could
go
on
for
hours
about
the
pain
of
a
data.
Lake
about
schema
on
reads:
I:
won't
because
then
you're
all
going
to
get
bored
real
fast,
but
that's
the
context.
The
idea
is
that
you
take
the
best
of
these
two
worlds.
So
that
way,
you
have
the
flexibility
to
make
sense
process.
A
Read,
do
AI
make
do
run
machine
learning,
algorithms
AI
against
all
of
your
data.
Next
slide,
please,
and
so
the
key
features
of
Delta
Lake
are
as
to
transactions
again
to
protect
your
data,
the
context
of
scalable
metadata,
the
ability
to
go
often
more
times
than
not
the
metadata
of
your
data
Lake
ends
up
becoming
more
slowing,
processing
or
slowing
querying
down
than
the
actual
data
itself,
just
because
it
actually
has
to
identify
what
files
it's
supposed
to
read
or
which
files
it
can
skip.
A
So
the
metadata
process
actually
becomes
extremely
painful
and
so
Delta
lake
has
the
capability
to
handle
petabyte
scale
of
tables
with
billions
of
partitions
and
files
and
because
we
have
very
scalable
metadata
time,
travel
the
ability
to
revert
to
old
files.
Oh
so,
oh
sorry,
to
old
versions
of
the
data,
so
you
can
audit
roll
back
anything
else.
It's
definitely
open
source
where
this
is
coming
from
databricks.
A
So
we've
got
a
lot
of
the
open
source
pedigree
around
Apache,
spark,
mlflow
and,
of
course,
Delta
like
so
Open
Standards,
Community
Driven,
always
a
fan
of
that
United
unifying
the
concepts
of
batching
and
streaming.
This
is
super
important
because,
as
you
look
at
more
and
more
data
processing
streaming
is
not
just
for
real-time
applications,
so
that's
super
important
by
the
way
I'm
not
trying
to
disregard
that.
A
The
idea
is
that
you
should
be
able
to
apply
your
business
logic
I.E
the
what
logic
you
apply
to
your
data,
irrelevant
of
what
that
latency
is
whether
it's
supposed
to
be
super
fast
or
you
haven't
latency
of
four
hours
or
one
day,
and
so
the
business
logic
shouldn't
change.
Delta
lake
has
the
ability
to
handle
extremely
fast
queries
and
processing
of
both
your
streaming
and
your
batch
data.
A
So
that
way
you
can
go
ahead
and
separate
the
business
logic
from
the
latency
of
your
processing
and
with
advancements
in
structure
streaming
or
for
them,
or
the
Flint
connector,
or
you
want
to
go
ahead
and
go
old
school.
Well,
maybe
new
school
rust
and
actually
use
a
rust
apis
to
go
ahead
and
write
directly
to
Delta
we're
good
yeah
absolutely
so
this
is
the
context,
schema
Evolution
enforcement.
Why?
Because
the
schemas
might
change
so
you're
going
to
allow
them
to
evolve
yet
at
the
same
time
you
want
to
enforce
them.
A: So, if I leave you with nothing else in terms of the key differentiators of this community, it's about performance, community, and reliability. In terms of the performance, some of the concepts I like talking about: now, of course, this number, 1.7 exabytes, is from Databricks, because that's the only place we can actually get the numbers from, okay, but in the context of Databricks, there's 1.7-plus exabytes processed a day. Not stored, processed.
A
Storage
is
actually
much
higher,
but
we
like
using
the
process
numbers,
because
this
is
showcasing
just
the
fact
that
we're
talking
a
ton
of
data.
This
is
based
off
of
7000
plus
companies
in
production.
So
that's
actually
how
databricks
is
able
to
go
ahead
and
provide
a
reliable
service,
because
it
runs
on
Delta,
Lake
and
then
over
the
last
three
years.
A
This
is
last
year's
number,
so
I
don't
I'm,
actually
not
sure
what
this
year's
numbers
looks
like,
but
last
year,
numbers
there's
an
increase
in
contributor
strength
over
the
last
three
years
of
663
percent
I,
that
is,
there
are
more
and
more
open
source
contributors
contributing
to
the
Delta
Lake
Project,
which
is
really
really
cool.
I'm
super
happy
about
that
next
slide.
Please,
and
so
this
is
a
also
from
last
year's
number
I'm,
not
sure
why
I
accidentally
blocked
out
the
data
and
AI
Summit
to
2022
logo.
A
So
that's
my
bad,
that's
on
me,
but
this
is
from
last
year's
and
so
basically
there's.
If
we
go
to
November
last
year,
basically
is
1.1
million
downloads
a
month.
A
The
numbers
are
great,
but
that
the
key
things
I
wanted
to
call
out
here
is
that
every
month
there
are
multiple
releases,
whether
it's
still
to
spark
whether
it's
the
Delta
rust
python
apis,
whether
it's
Flink
Delta
sharing,
whatever
else
there's
the
community's
Super
Active
and
so
we'd
love
you
to
join
us.
If
you
actually
have
some
cool
ideas,
for
example,
there
is
already
in
the
alpha
stages
a
a
go
Delta
Lake
API
as
well.
There
are
apis
for
Delta,
dask
and
Delta
array.
A
So
again,
pretty
cool
stuff,
so
come
join
us
all
right
next
slide.
Please
now
give
you
the
context
of
why
Delta
Lake
that's
great,
but
for
those
who
are
into
machine
learning,
which
is
the
bulk
of
the
folks
I'm
sure
here,
who
are
here
today
like
you're,
wondering
why?
Why
do
you
care
about
Delta
Lake?
And
why
are
the
benefits
of
machine
learning?
So
next
slide?
Please,
and
so
the
key
thing
I
sort
of
like
to
remind
folks
is
that
the
data
science
life
cycle
isn't
just
about
the
part
where
you're
serving
models.
A
It
isn't
just
the
part
about
your
even
going
and
training
the
most.
Obviously,
they're
super
important,
so
don't
get
me
wrong,
but
there's
a
lot
of
other
work
that
has
to
be
done.
You
have
to
take
that
raw
data.
You
have
to
be
able
to
scale
that
up
somehow
to
and
prep
it
and
process
it.
So
there's
all
sorts
of
different
tools
that
are
out
there,
we're
obviously
I'm
leaning,
more
towards
spark
or
or
rust,
but
you
know
there's
other
systems
like
Flink
like
trino
like
Presto.
A
This
is
all
good
like
we're
we're
not
trying
to
tell
you
which
one
to
use.
You
have
your
reasons
for
doing
this.
The
point
we're
trying
to
get
as
that,
but
it
involves
taking
a
lot
of
raw
data
and
prepping
it
and
processing
it
and
making
sense
of
it
and
filtering
it
and
so
forth
and
so
forth.
Right
and
then,
when
you
do
the
training
you're
going
to
go
ahead
and
use
things
like
you
know:
Pi
torch
or
r
or
tensorflow
or
Barker
she
boosts
or
whatever
it
is
that
you
like
using
out
there.
A
Okay,
because
your
data
scientists
have
their
particular
tools
of
choice.
Okay,
they're
going
to
tune
it
and
you
need
to
be
able
to
scale
that
process
as
well
and
so
I'm
more
than
sure.
But
now
it's
going
to
cover
about
how
sagemaker
helps
with
a
lot
of
that
stuff,
which
is
exactly
the
point.
That's
why
we
have
today's
session,
but
that's
an
important
aspect.
A
You
have
to
have
these
systems
that
can
scale
for
your
data
prep
for
your
raw
data
scale
for
your
tuning
scale,
for
your
training
and
then
you
have
to
deploy
all
those
models.
Okay,
there's
a
model
we're
going
to
push
those
models
out
that
could
be
Docker.
It
could
be
sagemaker,
it
could
be
MFL
I,
don't
care
again
right,
you're,
going
to
choose
a
your
tool
of
choice,
again
vadod's
going
to
talk
about
sagemaker,
because
that's
today's
session.
But
the
point
is
that
you
have
to
be
able
to
scale
that
too
right.
A
Guess
what
you
have
to
know
what
data
you're
working
with
you
have
to
ensure
the
reliability
of
the
schema.
You
have
to
ensure
that
their
scale
relating
to
all
of
the
stuff
right,
so
putting
all
this
together
and,
of
course,
hey
wanted
to
just
give
a
call
out
to
to
kunmi.
Hey,
don't
forget
Mojo,
yes,
there's
a
ton
of
Mojo
of
around
all
of
these
things.
Well
guess
what
Delta
like
allows
you
to
have
all
of
that
next
slide.
A
Please,
and
so
that's
the
context,
data
the
role
of
day
engineering
is
super
important.
It
allows
and
enables
your
data
size
and
your
analytics.
It
allows
you
to
develop
test,
maintain
your
pipelines
and
allows
you
to
productionize
those
data
science
models.
There's
that
yin
and
yang
of
basically
data
engineering
cannot
exist
without
data
science
and
data
science
can't
exist
without
data
engineering.
It
used
to
be
back
in
the
old
days
about
10
years
ago,
I
think
of
a
DOT
when
I
want
to
say
old
days.
A
In
this
case
yeah,
it
literally
was
a
matter
of
like
it
was
the
same
person
that
did
both
well.
As
we
start
applying
actual
software
engineering
practices
we're
recognizing.
There
is
a
fundamental
differences
between
how
you
run
do
run
the
practices
around
data
engineering
and
run
the
practices
around
data
science
they're
super
related
together,
but
they're.
A
It
is
definitely
more
than
just
one
person's
job
or
if
you're
one
person,
then
you
maybe
you
can
go
ahead
and
justify
going
and
get
a
race,
but
that's
a
that's
a
whole
other
conversation
all
right
next
slide.
Please
I
just
finish
it
up
since
yeah.
There
you
go,
and
so,
when
you
look
at
these
architectures,
basically
the
big
data
structures
and
data
Lakes.
This
is
what
opponent's
done.
A
You've
got
these
input
sources,
whether
it's
Batcher
streaming,
you're
going
to
put
this
into
a
data
Lake
and
you're,
going
to
go
ahead
and
have
data
consumers
do
the
AI
and
Reporting
well.
Delta
Lake
covers
that
concept
from
the
input
sources
to
the
data
Lake
and
prepping
you
for
the
data
consumer
such
that
you
can
store
the
data
in
structured
and
semi-structured
pull
data
from
various
input
sources.
You
have
a
single
central
location,
so
you're
not
actually
going
and
having
these
data
silos
anymore.
A: So in the end, because of this, you could presumably say: let's go ahead and build ourselves a super complex pipeline to basically be able to handle the streaming of your data and the batch of data. I'm not going to go through all the details, I've done this before, but the point is that this picture is an accurate view of what happens when you try to handle streaming and batch data before you ever get to AI and reporting.
A
But
if
you
flip
to
the
next
slide,
what
you'll
notice
is
that
we're
actually
we're
going
to
ask
that
question?
Can
it
be
simplified
and
that's
what
it
comes
down
to
Delta
like
allows
us
to
do
all
that,
because
it
handles
the
batching
the
streaming,
the
updates
deletes
the
reads:
the
ability,
the
rollbacks
optimize,
live
blue
optimize
file,
layout
formats,
next
slide,
please
and
so
yeah.
Let's
skip
that!
That's
fine!
Every
I
think
I've
already
beaten
that
one
so
perfect.
So
this
is
a
great
segue.
A
Now
we're
going
to
have
go
ahead
and
cover
about
sagemaker
studio,
just
in
case
you're,
not
aware
of
that.
B
Yeah
thanks
Denny,
so
you
know
quickly,
walk
through
some
slides
and
then
and
introduce
everyone
to
sagemaker
studio,
so
Denny
talked
about.
You
know
the
whole
data
science
life
cycle
and
the
importance
of
having
high
quality,
reliable
data
for
building
your
machine
learning
models
right.
So
that's
that's
a
separate
pipeline.
B
It's
a
data
pipeline
involving
data
engineers
and
data
stewards
and
then,
once
you
add
that
data
in
the
right
format
you
want
to
derive,
derive
value
out
of
that
data
right
and
that's
where
all
these
different
machine
learning
tools
come
into
place
and
sagemaker
studio
is,
is
a
state-of-the-art,
fully
integrated
development
environment
designed
specifically
for
that
purpose.
It's
just
designed
specifically
for
building
machine
learning
pipelines
and
accommodating
all
these
different
personas,
not
from
the
data
world,
but
now
from
the
machine
learning
world.
B
So
you
have
data
scientists,
machine
learning,
Engineers
ml,
Ops
experts
Etc
right.
So
all
these
different
users
need
one
environment
and
one
Unified
visual
interface.
So
that's
exactly
what
sagemaker
studio
provides.
It
provides
you
purpose-built
tools
for
every
step
of
machine
learning
development,
including
labeling
data
data,
preparation,
feature
engineering,
biased
detection,
explainability
and
then
all
the
way
down
to
hosting
these
models
in
a
very
efficient
way
and
and
doing
model
monitoring
Etc.
B
You
can
write
code
track
experiments,
visualize
the
data
debug
and
monitor
all
of
that
within
a
single
environment,
and
all
of
these
different
steps
of
your
machine
learning
workflow
are
tracked
within
that
same
environment,
as
we
will
show
you
in
the
demo.
So
we
discussed
earlier
the
benefits
of
using
Delta
Lake
for
machine
learning.
Well,
sagemaker
Studio
makes
for
an
ideal
user
interface
and
offers
an
underlying
compute
platform
for
building
machine
learning.
Applications
on
that
data,
reliability,
layer
built
on
top
of
balcony.
B
So
that's
because
when
we
combine
sagemaker
studio
with
Delta
Lake,
you
get
this
optimized
storage
alongside
data
governance
and
data
reliability
capabilities,
and
then
you
also
get
the
end-to-end
machine,
learning,
capabilities
and
model
governance
through
sagemaker,
along
with
access
to
state-of-the-art
machine
learning
models
and
solution
templates.
B
So
what
are
these
state-of-the-art
machine
learning
models?
Well,
sagemaker,
Studio,
being
this
machine
learning
platform
gives
you
access
to
all
these
latest
and
greatest
built-in
algorithms.
B
So
you
know
for
all
the
different
kinds
of
applications
that
you
may
want
to
build
different
modalities
of
data
that
you
may
have
and
being
able
to
perform
machine
learning,
training
at
scale.
Sagemaker
Studio
provides
all
these
different
kinds
of
algorithms
right,
so
you
have
supervised
machine
learning,
algorithms,
you
have
computer
vision,
algorithms,
you
know
Advanced
computer
vision,
algorithms,
like
semantic
segmentation,
and
then
you
have
text-based
algorithm.
You
know.
Large
language
models
is
a
big
deal
today.
B
You
know
these
days
and
we
will
talk
about
that
in
a
little
bit
as
well,
and
then
you
have
purpose-built
algorithms
such
as
forecasting
Etc
right,
so
we
have
unsupervised
supervised
and
also
semi
semi-supervised
algorithms.
So
all
of
those
are
packaged
within
sagemaker
studio.
You
can
also
use
the
API
to
call
these
models
which,
which
I
would
show
you
in
a
little
bit,
and
then
we
have
the
service
called
sagemaker
jumpstart.
B
So
all
these
models
are
pre-packaged
within
containers
and
there
are
some
you
know:
proprietary
models
as
well
as
publicly
available
models.
These
are
basically
targeting
generative
AI
use
cases,
so
sagemaker
Studio
makes
it
very
easy
for
end
users
to
access
all
these
different
models
within
one
location,
fine-tune
them
train
them
and
and
deploy
them
behind
endpoints
within
studio
as
well,
and
all
of
that
is
happening
obviously
in
the
AWS
Cloud.
B
So
you
can
get
access
to
all
these
different
features.
If
you
have
the
AWS
Cloud
already
and
now
in
order
to
actually
get
access,
you
need
an
environment
and
that's
where
stage
maker
studio
notebooks
come
into
play,
so
sagemaker
Studio
allows
you
to
do
data
pre-processing,
analytics
and
building
machine
learning
workflows
all
within
one
notebook.
There
are
built-in
Integrations
with
spark.
B
We
talked
about
the
importance
of
spark
in
building
these
data
pipelines,
so
there's
built-in
integration
with
spark
and
also
other
open
source
projects
such
as
Hive
and
Presto
that
are
basically
running
behind
what
we
call
Amazon
elastic
mapreduce
clusters,
and
then
you
have
data
residing
on
S3
and
if
you
have
other
data
source,
this
is
you
know.
B
Studio
has
built-in
connectors
for
those
data
sources
as
well,
and
you
can
browse
and
query
these
different
data
sources,
explore
metadata
the
schemas
and
run
analytics
jobs
as
well
as
run
end-to-end
machine
learning
workflows,
depending
on
the
kind
of
framework
that
you're
using
if
you're,
using
pytorch,
tensorflow
and
others.
There's
built-in
support
for
that
as
well.
Using
our
deep
learning
containers
and
you
don't
have
to
leave
our
notebook
environment
in
order
to
build
these
workflows.
B: But the point is that these notebooks are fully managed and they run on elastic compute resources, taking full advantage of the scalability of the AWS cloud, along with the economies of scale. So you can pick all these different algorithms from within these notebooks. There are 15 built-in algorithms at this moment, probably more now; we keep adding more and more new algorithms based on, you know, the latest and greatest innovations that are happening in the open source
B
Community
as
well
as
algorithms
that
will
be
built
within
Amazon,
you
know,
Amazon
has
been
in
the
business
for
machine
learning
for
the
past
more
than
20
years,
I
believe
now,
so
there's
plenty
of
knowledge
that
is
being
transferred
from
Amazon,
also
intersection
request
Studio.
So
you
can
run
these
models
at
a
small
scale.
You
can
run
these
models
at
a
large
scale
in
a
distributed
fashion.
You
know
we
provide
the
controls
to
the
end
users
and
then
you
have
some
pre-built
solution
templates.
Also.
B
So
these
are
you
know,
cloud
formation,
templates
service,
catalog
templates
that
allow
you
to
you
know,
take
end-to-end
use
cases
such
as
you
know:
fraud,
detection
for
for
the
banking
industry
or
visual
inspection,
automation
for
the
manufacturing
industry.
It
brings
in
different
AWS
components
and
with
a
with
one
click,
you
can
deploy
these
machine
learning,
Solutions
prompt
data,
all
the
way
to
inferencing
endpoints,
and
then
we
also
have
Automated
machine
learning
capabilities.
We
all
know
the
importance
of
automl
in
in
what
we
do
as
machine
learning
practitioners.
B
So
we
have
that
capability
built
in
using
autopilot,
Within,
sagemaker
and
then
sagemaker
using
our.
What
I
talked
about
earlier
are
deep
learning
containers.
We
have
optimized
these
different
popular
open
source,
Frameworks
such
as
tensorflow,
pytorch,
mxnet
and
even
hunting
face
now
to
to
be
able
to
run
these
model
training
jobs
at
scale
and
with
minimal.
You
know
with
minimal
modification
of
code
and
so
on.
B
So
today,
data
scientists
can
use
sagemaker
Studio
to
spin
up
these
notebooks
and
start
building
these
machine
learning
models,
and
you
know
machine
learning,
Engineers,
as
well
as
data
scientists
and
ml
Ops
exports,
can
all
collaborate
and
come
together
within
this
one
environment
to
build
these
intimate
workflows.
B
So
that
being
said,
the
focus
is
going
to
be
on
on
Delta,
Lake
and
Studio
integration
for
today.
So
they're.
Really
these
two
concepts
within
the
sagemaker
studio.
One
is
you:
can
you
know
you
can
run
these
data
prep
jobs,
analytical
jobs,
data
explore
exploratory
jobs
locally
within
the
studio
and
notebook
environment,
but
then
there's
also
the
option
like
if
you
are
already
looking
for
that
scale,
to
prepare
your
data
and
to
be
able
to
run
those
data
preparation,
jobs
at
scale.
B
You
can
also
connect
to
these
remote
EMR
clusters
with
a
click
of
a
button,
okay,
and
that
that
gives
you
the
ability
to
bring
in
the
data
you
know,
use
Delta
Lake
to
have
you
know
the
data
reliability
layer
to
read
the
data
from
Delta
Lake
into
EMR
and
run
your
analytical
pre-processing
workloads
at
scale
in
an
optimized
fashion.
B
Before
you
really
get
to
the
point
of
building,
you
know
having
the
stream
test,
trained,
validation,
test,
split
and
really
start
building
these
machine
learning
models
and
doing
everything
else
that
comes
after
right,
and
so
we
also
enable
you
know:
fine-grained
access,
that
is
a
credential
push
down
capabilities.
B
So
if
you
want
fine-grained
security
permissions,
where
you
have
multiple
users
in
a
single
environment,
sagemaker
Studio
gives
you
that
capability
and
then,
finally,
if
you
have
to
automate
all
these
different
machine
learning
components
from
the
data
exploratory
and
data
pre-processing
stage,
all
the
way
to
the
inferencing
and
model
monitoring
stage,
Studio
gives
you
the
ability
to
automate
all
of
that
as
well
via
the
API
and
also
we
also
have
now
a
scheduling
feature
within
sagemaker
studio.
That
allows
you
to
do
that
so
I'm,
going
to
walk
through
the
demo.
B
I
have
a
couple
of
studio
notebooks
to
show
you.
So
this
is
the
sagemaker
studio
environment
in
order
to
access
sagemaker
Studio
number
one.
You
have
to
be
an
AWS
customer,
which
means
that
you
have
to
have
an
account
on
on
Amazon
web
services,
and
you
should
be
able
to
log
in
and
have
the
required
permissions
to
be
able
to
access
sagemaker
Studio,
which
is
all
proactively
managed
through
our
IAM
roles.
B
And
so
once
you
have
an
AWS
account,
you
can
just
go
into
Amazon
sagemaker.
You
can
search
from
here.
You
can
go
into
Amazon
sagemaker,
and
this
is
the
the
grand
stage
maker
console
right,
which
which
has
all
these
different
machine
learning
capabilities
from
data
labeling,
using
what
we
call
a
round
truth
Service
to
to
model
inferencing.
We
have
many
different
kinds
of
endpoints.
We
have
batch
influencing
Service
as
well
and
and
then
there
are
other
capabilities
around
machine
learning.
You
can
also
build.
B
You
know
for
every
model
that
you
create,
there's
model
governance
capabilities,
so
you
can
build
these
model
cards
and
have
metadata
around
these
different
models
and
see
how
these
different
models
evolve.
How
the
data
sets
that
were
used
to
create
these
models.
You
know
you
can
get
access
to
those
data
sets
and
look
at
those
as
well
in
order
to
access
sagemaker
Studio,
specifically,
you
have
to
go
into
the
siege
maker
domains
section
you
can
have
you
know
multiple
domains
here.
We
also
have
the
ability
to
build
collaborative
workspaces.
B
So
you
know
that's
when
you
have
multiple
users
and
you
want
to
track
changes
on
the
notebooks
Etc.
So
once
you
have
a
domain
yeah,
you
know
it
takes
a
few
minutes
to
create
a
studio
domain.
This
is
backed
by
an
imro.
So
in
order
to
be
able
to
log
into
a
domain,
you
need
to
have
access.
B
You
know
you
need
to
have
IAM
access
to
that
particular
Studio
domain
and
you
can
have
multiple
domains
based
on
the
kind
of
use
cases
that
you're
working
on
and
also
within
within
each
domain,
you
can
have
you
can
have
multiple
users
so,
depending
on
you
know,
you
can
have
multiple
data
scientists.
You
can
you
know
again,
depending
on
the
job
profile,
you
can
have
ml
Ops
Engineers
Etc
collaborating
with
one
another
in
a
single
collab
collaborative
space.
B
So
once
your
domain
is
spun
up,
you
can
access
sagemaker
Studio
from
there
directly,
and
this
is
the
UI
for
sagemaker
Studio.
So
when
you
spin
up
sagemaker
studio
for
the
first
time,
you're
going
to
land
up
on
this
home
you're
going
to
land
on
this
home
page-
and
here
there
are
many
different
options,
so
I
talked
about
you
know
ingesting
and
preparing
data.
There
are
multiple
different
sources
where
you
can
ingest
data
from
their.
You
know.
B
Built-In
Amazon
data
services,
like
redshift
S3,
obviously,
which
allow
you
to
you
know,
bring
your
data
in
directly
with
a
single
point-and-click,
and
then
we
have
other
third-party
Integrations
as
well.
As
you
can
see
here,
and
then
there
are,
you
know:
I
talked
about
jumpstart
models.
B
You
have
access
to
all
these
different
models
here
with
a
point
and
click
and
then
there's
a
whole
Suite
of
all
these
different
kinds
of
models,
depending
on
the
kind
of
use
case
that
you're
dealing
with,
and
then
there
is
the
automl
capability
that
I
talked
about
earlier
by
the
way.
All
of
this
is
also
accessible
through
the
sagemaker
SDK,
so
yeah.
If
you
don't
want
to
necessarily
use
the
UI,
you
want
to
use
your
own,
you
know
Visual
Studio
or
any
other
IDE.
B
You
can
access
all
of
these
features
through
or
most
of
these
features
through
SDK
as
well.
And
then,
as
you
build
these
machine
learning
models,
there
is
the
ability
to
track
these
models.
The
evolution
of
these
models
through
a
single
UI
through
this
experiment
tracking
feature
you
can
schedule
notebook
jobs,
so
I'll
show
you
that
when
you
know
once
you
have
your
notebook
and
I
see
prepped
and
ready,
you
can
even
use.
B
You
know
the
notebook
scheduler
in
order
to
schedule
these
jobs
on
a
particular
schedule,
and
then
we
have
sagemaker
pipelines
which
allows
you
to
you
know
plug
different
components
of
the
of
the
pipeline.
You
know
your
machine
learning
pipeline
from
you
know:
data
pre-processing
to
model
training
to
you,
know
automl
to
model
deployment.
All
of
that
can
be
done
automated
through
sagemaker
Pipelines
and
then
finally,
we
have
our
own
model
registry,
and
then
this
is
the
deployment
section
where
you
can
see.
B
You
know
all
the
different
models
that
are
that
are
deployed
there
again,
like
I
said
before
there
are
multiple
different
kinds
of
endpoints
that
you
can
use
to
deploy
your
models
and
so
on
and
so
forth.
So
talking
about
Delta
Lake.
B: you know, here's a cluster feature within, you know, this very familiar JupyterLab kind of environment. You know, you have this cluster feature, so here, for example, I have these two clusters running. One is, let's say, a pre-processing cluster, which is a, you know, standard optimized Spark cluster that allows you to run these Spark workloads, and then I have a very specific machine learning cluster as well.
B
So
I
can
connect
directly
to
these
clusters
from
within
the
notebook
itself,
and
then
we
have
these
concepts
of
these
kernels.
You
know,
depending
on
what
you're
trying
to
do.
You
know
we
have
many
purpose
built
and
optimized
container,
kernels
and
and
images
for
that
purpose.
So
if
you
wanted
to
use
mxnet
latest
and
greatest
version
of
mxnet
or
Pi
torch,
tensorflow
Etc,
all
of
that
is
available
to
you
in
a
fully
managed
fashion
at
your
fingertips.
B
Without
you
having
to
install
that
manage
all
those
dependencies
yourself
Etc
right
and
then
you
have
access
to
all
these
different
instance
types.
We
also
enable
fast
there
certain
fast
launch
instances
as
well
there's
support
for
spot
instances.
You
know
for
for
better
economies
of
scale
Etc
right,
so
you
know
as
Delta
Lake
users.
B
You
might
already
be
familiar
with
a
lot
of
these
Concepts
that
I'll
show
you
in
the
first
notebook
I
have
two
notebooks
that
I'll
walk
you
through,
but
you
know
the
first
notebook
is
where
we're
going
to
take
this,
this
data
that
is
openly
available.
It's
the
Lending
Club
data
and
it's
you
know
it's.
It's
got
this
loan
risk
data
right,
so
it's
like
basically
certain
feature
columns
that
show
the
all
the
different
loans
given
out
in
these
different
states
to
different
users.
B
Along
with
their
you
know,
some
information
about
the
users
you
know,
such
as
their
FICO
score
and
then
information
about
the
loan
term
and
then
also
information,
whether
or
not
the
user
was
approved
for
the
loan
right.
So
you
can
get
this
data
in
in
Delta
Lake.
You
can,
you
know,
manipulate
the
data
run
analytics
at
scale,
get
that
data,
reliability
layer
and
then,
once
you
have
that
you
can
build
and
manage
a
machine
learning
life
cycle.
B
You
know
which,
which
determines
whether
or
not
a
user
will
be
approved
for
a
loan
right.
So
that's
the
idea,
so
here
Forest,
once
we
have
this
cluster
access
to
this
cluster,
you
can
see
this
is
the
cluster
ID.
You
can
get
this
cluster
ID
from
the
EMR
console,
The
Classy
mapreduce
console
with
an
AWS,
and
here
we
can
see
we
are
connected
to
that
cluster.
B
We
have
access
to
the
spark
UI
from
here
from
within
the
notebook
and
then
in
order
to
be
able
to
read
Delta
Lake
data
into
EMR.
We
need
to
run
some.
You
know
we
need
to
configure
the
cluster
to
grab
those
dependencies
grab
the
open
source
Library.
You
know
the
Delta
core,
Library
and
and
passive
external
configurations,
and
once
we
do
that,
it's
going
to
restart
this
work
application,
and
then
you,
you
know.
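The exact configuration cell isn't shown on screen, but a minimal sketch of wiring the open-source Delta Lake library into a Spark session looks roughly like this; the package version and coordinates are assumptions and must match your EMR release, and on an EMR-connected Studio notebook the same settings usually go into the kernel's configure cell rather than a local builder:

```python
from pyspark.sql import SparkSession

# Minimal sketch, assuming open-source Delta Lake 2.x on a Spark 3.x cluster.
spark = (
    SparkSession.builder
    .appName("delta-on-emr-sketch")
    # Pull the Delta core library; the version must match the cluster's Spark/Scala build.
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.1.0")
    # Enable Delta's SQL extensions and catalog so DELETE/UPDATE/MERGE,
    # DESCRIBE HISTORY, and time travel work through Spark SQL.
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)
```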
B
Essentially,
you
have
your
Delta
Lake
Library
loaded
into
your
EMR
cluster,
for
you
to
be
able
to
run
all
these
processing
steps
for
your
for
your
data,
so
of
course
we're
going
to
import
the
data
and
we're
going
to
create
create
a
raw
table,
so
the
data
that
we
have
is
in
in
parquet
format.
By
the
way
this
is
data
that
is
openly
available.
This
notebook.
B
This
is
going
to
be
on
GitHub
or
it's
already
on
GitHub,
so
you
can
just
download
or
clone
the
GitHub
repository
and
run
this
notebook
and
yeah.
So
the
data
set
is
in
a
parquet
format.
It's
I
think
it's
a
few
hundred
gigabytes
of
parquet
Finance
taken.
So
you
know
we.
We
read
this
data
set
into
using
spark
into
the
EMR
cluster
that
we
are
connected
to
and
that's
exactly
what
we
are
doing
here
and
once
we
do
that
you
know
we
can.
B
We
can
run
some
pre-processing
steps
on
the
data
right
now.
We
will.
We
want
to
kind
of
highlight
the
differences
between
open
source
parquet
and
the
open
source
Delta
Lake
format.
You
know,
Delta
lake
is
built
on
top
of
parquet.
It
does
use
parquet
as
a
default
storage
and
then
it
has
an
additional
meta
data
layer,
for
you
know
the
data,
reliability
and
schema
enforcement,
and
things
of
that
nature
that
Denny
talked
about
previously
in
this
slides.
B: You know, once we have read this Parquet data into this Spark DataFrame called data, we're going to write it out in Delta Lake format, and it's very easy to do that with, you know, Spark; and, you know, we're going to partition by addr_state, the state where the loan was given out, and then we're going to write that out back into S3. And then we ran some pre-processing steps, and, if you're familiar with the bronze/silver/gold, you know, stages within Delta Lake, we're going to write the processed Delta Lake table,
B
You
know
that's
the
silver
table
in
a
separate
S3
location,
we're
going
to
create
two
separate
Delta
Lake
tables
for
those
for
those
data
sets,
and
so
one
is
for
the
raw
data,
and
then
one
is
for
the
for
the
cleansed
data
right
and
this
is
we
can
see
that
here
within
Studio.
You
know
you
can
call
and
you
can
use
a
SQL.
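A hedged sketch of those two steps, reading the public Lending Club Parquet files and writing the raw (bronze) and cleansed (silver) Delta tables; the S3 paths and the cleansing logic are illustrative assumptions, not the exact cells from the demo notebook:

```python
from pyspark.sql import functions as F

raw_parquet = "s3://my-bucket/lending-club/parquet/"        # hypothetical path
bronze_path = "s3://my-bucket/lending-club/delta/bronze/"   # hypothetical path
silver_path = "s3://my-bucket/lending-club/delta/silver/"   # hypothetical path

# Read the openly available loan-risk data, which ships as Parquet.
data = spark.read.parquet(raw_parquet)

# Raw/bronze Delta table, partitioned by the state the loan was issued in.
(data.write.format("delta")
     .mode("overwrite")
     .partitionBy("addr_state")
     .save(bronze_path))

# Token cleansing step for the silver table (assumed column names).
silver = (spark.read.format("delta").load(bronze_path)
          .dropna(subset=["loan_status"])
          .withColumn("int_rate",
                      F.regexp_replace("int_rate", "%", "").cast("double")))

(silver.write.format("delta")
       .mode("overwrite")
       .partitionBy("addr_state")
       .save(silver_path))

# Register the paths as tables so later SQL cells can refer to them by name.
spark.sql(f"CREATE TABLE IF NOT EXISTS loans_bronze USING DELTA LOCATION '{bronze_path}'")
spark.sql(f"CREATE TABLE IF NOT EXISTS loans_silver USING DELTA LOCATION '{silver_path}'")
```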
B
You
know
here
is
SQL
extension
to
you
know
analyze
your
data
there's
also
built-in
visualization
within
studio
notebooks.
So
here,
if
I
wanted
to
create
a
bar
chart
or
a
pie,
chart
I
can
I
can
do
that
from
within
the
notebook
itself.
B
I
can
do
a
describe
table
here.
You
can
see
the
schema
of
the
table
and
then,
if
I
look
at
the
S3
location,
where
the
data
Delta
Lake
data
is
stored.
This
is
the
metadata
all
of
that
is
being
stored
in
a
Json
format.
Right,
so
you
know,
that's
that's
kind
of
the
difference
between
Park,
a
and
Delta
lake.
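For example, a couple of hedged inspection cells (using the hypothetical table and path from the sketch above) make that metadata layer visible:

```python
# Table schema, as surfaced by the notebook's SQL support.
spark.sql("DESCRIBE TABLE loans_silver").show(truncate=False)

# The _delta_log directory under the table path holds the JSON commit files
# that form Delta Lake's metadata layer on top of plain Parquet data files.
log = spark.read.json("s3://my-bucket/lending-club/delta/silver/_delta_log/*.json")
log.printSchema()
log.show(truncate=False)
```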
B
Is
that
Delta
Lake
ads
that
external
metadata
storage,
for
you
know
for
giving
end
users
data
reliability
on
top
of
on
top
of
Open
Source
per
k,
and
then
Delta
Lake
also
provides
a
full
DML
support
on
on
top
of
that
storage.
So,
if
you
wanted
to,
you
know,
Run
update
operations.
If
you
wanted
to
delete
your,
you
know
certain
rows
in
your
data
based
on
some
logic
or
you
wanted
to
merge.
B: for machine learning, you know, Delta Lake gives you this full DML support for that purpose, right? So that's exactly what we're going to show here. First, we're gonna try to run these kinds of operations on Parquet; so, you know, for example, if you run this DELETE operation on top of Parquet, you can see that it errors out.
B
It
won't
work
on
parquet,
but
if
we
were
to
run
that
exact
same
delete
operation
on
Delta
Lake,
it
does
work
right,
and
so
so
that's
the
that's
the
concept
we
can
see
that
those
rows
regarding
you
know
with
the
loans
that
were
giving
up
given
out
in
the
state
of
Iowa,
for
example,
were
deleted
it
from
from
this
table
Delta
link
table,
whereas
the
same
internet
operation
failed
on
Arcade,
same
concept
here
with
update,
in
this
case
I'm,
going
to
update
the
count
of
the
number
of
loans
in
the
given
out
in
the
state
of
Washington,
and
you
know
the
same
thing:
it'll
fail
on
parquet,
but
it'll
go
through
on
Delta
Lake
and
that's
exactly
what
happened
here.
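A minimal sketch of that comparison, using the hypothetical table names registered earlier:

```python
# On a plain Parquet-backed table this DELETE raises an AnalysisException;
# it is shown commented out so the cell runs cleanly.
# spark.sql("DELETE FROM loans_parquet WHERE addr_state = 'IA'")

# On the Delta table the same statement succeeds: drop all Iowa loans.
spark.sql("DELETE FROM loans_silver WHERE addr_state = 'IA'")

# Likewise, an UPDATE against the Delta table goes through (illustrative logic).
spark.sql("""
    UPDATE loans_silver
    SET int_rate = round(int_rate, 2)
    WHERE addr_state = 'WA'
""")
```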
B: And yeah, in full disclosure, this is originally, I think, Denny's notebook that I just imported into SageMaker Studio, and basically, you know, you can take other notebooks, as long as they're in .ipynb format; you can just import them, choose the right kernel within Studio, and just, you know, make some slight modifications, like the storage, etc., and just run through these notebooks.
B
So
it's
very
straightforward
so
in
similarly
for
merging
I
think
everyone
is
very
well
aware
of
this
concept
within
Delta
Lake,
but
if
you
wanted
to
do
In-Place
merges
within
within
these
Delta
Lake
tables,
you
can
do
that
as
well
without
having
to
create
all
these
separate
views
that
we
had
to
do
back.
You
know
10
10
years
ago
or
prior
to
the
Delta
Lake
days.
B
You
know
and
use
spark
And,
Hive
and
cons.
You
know
other
other
open
source
projects.
In
order
to
have
this,
you
know
merge
capability.
You
can
do
these
in
page
Place
merges
with
with
a
simple
SQL
statement
on
top
of
your
public
storage.
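For instance, a hedged upsert sketch; the source view name and the join key are assumptions made for illustration:

```python
# `loan_updates` is assumed to be a temp view of new or corrected rows keyed by `id`.
spark.sql("""
    MERGE INTO loans_silver AS target
    USING loan_updates AS source
    ON target.id = source.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```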
B
And
then
you
also
have
schema
Evolution
capabilities.
You
know
we
I
showed
you
the
metadata
layer,
which
was
you
know,
also
stored
on
S3
in
Json
format.
So
now
you
can
use
that
to
make
changes.
You
know
you
can
use
spark
to
make
changes
to
to
your
schema
on
on
your
Delta
Lake
storage
right.
So
that's
exactly
what
we
are
going
to
do
here.
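A small sketch of that schema-evolution step, appending rows that carry an extra column and letting Delta merge it into the table schema; the path and the new column are the hypothetical ones used above:

```python
from pyspark.sql import functions as F

silver_path = "s3://my-bucket/lending-club/delta/silver/"   # hypothetical path

# Rows carrying a column the table doesn't have yet.
new_rows = (spark.read.format("delta").load(silver_path)
            .limit(100)
            .withColumn("risk_flag", F.lit("unknown")))

# mergeSchema=true evolves the table schema on write instead of failing.
(new_rows.write.format("delta")
         .mode("append")
         .option("mergeSchema", "true")
         .save(silver_path))
```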
B
We
you
know
once
we've
created
that
merge
table,
we
merge
the
schema
and
append
to
that
Delta
leak
path
and
then
eventually
we'll
end
up
making
a
gold
table
right
which
we
can
then
use
for
other
purposes,
such
as
reporting
and
parameters.
Etc,
there's
also
time
travel
capabilities.
So,
if
you
wanted
to
you
know,
let's
say
you
inject
it
and
you
got
some
bogus
data
you
injected
that
into
your
silver
table,
and
you
wanted
to
clean
that.
B
You
know
you
can
you
can
do
that
very
easily
with
with
time
travel
within
Delta
Lake
there's,
you
know,
there's
a
described
history
capability
and
then
you
can
also
see
you
know.
All
of
that
is
being
versioned
or
the
data,
as
new
versions
of
the
tables
are
created.
Those
are
versioned
within
the
Delta
Lake
metadata.
So
and
you
can
you
can
roll
back
and
forward
on
on
different
versions
right.
So
you
know
you
can
see
a
very
robust
data,
reliability
here
for
running
these
machine
learning
workloads.
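A hedged time-travel sketch; the RESTORE command is available in recent open-source Delta Lake releases, and on older versions you would overwrite from the old snapshot instead:

```python
# Inspect the commit history of the table.
spark.sql("DESCRIBE HISTORY loans_silver").show(truncate=False)

# Read the table as it looked at version 0 (timestampAsOf works similarly).
v0 = (spark.read.format("delta")
      .option("versionAsOf", 0)
      .load("s3://my-bucket/lending-club/delta/silver/"))   # hypothetical path
print(v0.count())

# Roll the live table back to that version if bad data slipped in.
spark.sql("RESTORE TABLE loans_silver TO VERSION AS OF 0")
```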
B
So
once
you
have
that
you
know
you
can
give
using
stagemaker
Studio,
you
know
you
can
give
you
can
bring
in
your
data
scientists,
ml
Ops,
Engineers
Etc
to
come
and
now
start
building
machine
learning
models
on
top
of
that
Delta
data
sitting
within
your
S3
right.
So
let's,
let's
look
at
that
now,
real
quickly.
B
It's
the
same
concept
here:
I
just
switched
notebooks
and
here
I'm,
going
to
take
the
data
that
we
just
downloaded
and
created
a
Delta
lake
table
out
of
to
run
machine
learning,
yeah
and
run
machine
learning
at
scale
and
basically
take
a
very
simple
task.
I
mean
we
I
showed
you.
We
have
these
really.
B
You
know
state-of-the-art
machine
learning
models
for,
depending
on
the
kind
of
use
cases
generative
AI.
You
know
semantic
segmentation
Etc,
but
in
this
case
we're
going
to
do
a
simple
binary
classification
just
for
demo
purposes
right
so
same
concept.
Here,
I'm
going
to
connect
to
an
existing
cluster
I
showed
how
you
can
do
that
with
the
cluster
tab.
Previously,
here
I'm
going
to
collect,
connect
to
the
ml
cluster
and
then
I'm
going
to
read
the
deltaic
files
right.
So
now
you
can
have
two
things
one
you
can
there
is.
B
You
know:
there's
glue
integration
and
Athena
integration
with
Delta
lake,
so
you
can
have
a
glue
catalog
of
all
your
Delta
Lake
data
and
there
is
sagemaker
Studio
integration
with
glue
as
well.
So
you
can
read
that
data
directly
from
the
blue
catalog
right.
These
are.
These
are
permanent
tables
that
are
registered,
they're,
managed
tables
that
are
registered
with
the
glue
catalog.
You
can
read
them
directly
into
Studio
using
spark
apis
right.
B
In
this
case,
though,
you
know
I'm
not
connected
to
Google
catalog,
so
I'm
going
to
go
directly
to
S3
and
load
those
files
directly
into
a
spark
data
frame
by
simply
saying
format,
Delta
and
load
those
files
into
a
data
frame.
I'm
going
to
do
some
data
munging,
for
you
know
specifically
for
machine
learning.
We
are
trying
to
create
this
column.
B
You
know
whether
you
know
certain
feature
columns
right,
we're
trying
to
do
some
feature:
engineering
such
as
if
the
individual
defaulted
on
a
loan.
If
there
was
a
charge
off,
if
there
were
any
if
there
were
any
late
payments,
you
know
we
have
kind
of
this
transactional
data
in
our
raw
storage
and
I'm,
going
to
create
some
feature,
columns
that
are
necessary
for
machine
learning.
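A sketch of that step, loading the Delta files straight from S3 and deriving a binary label plus a few feature columns; the column names follow the public Lending Club schema and are assumptions about what the demo notebook actually uses:

```python
from pyspark.sql import functions as F

loans = (spark.read.format("delta")
         .load("s3://my-bucket/lending-club/delta/silver/"))   # hypothetical path

features = (loans
    # Label: 1 if the loan ended badly (default, charge-off, or late payments).
    .withColumn("bad_loan",
                F.when(F.col("loan_status").isin(
                    "Default", "Charged Off",
                    "Late (31-120 days)", "Late (16-30 days)"), 1).otherwise(0))
    # Illustrative engineered feature: midpoint of the reported FICO range.
    .withColumn("fico_avg",
                (F.col("fico_range_low") + F.col("fico_range_high")) / 2)
    .select("bad_loan", "loan_amnt", "term", "int_rate", "grade",
            "annual_inc", "dti", "addr_state", "fico_avg"))

features.createOrReplaceTempView("loan_features")
```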
B
So
that's
exactly
what
I'm
doing
I'm
doing
here
and
then
I
create
a
view
right
and
so,
like
I
mentioned
here
earlier,
you
can
use
the
SQL
view
to
view
your
data
and
the
other
great
thing
about
our
studio.
Sagemaker
studio
and
using
this
master
integration
is
that
it's
very
easy
for
you
to
move
back
and
forth
between.
B
You
know
using
a
studio
instance
versus
using
a
cluster,
so
you
can
prototype
on
a
sagemaker
studio
instance
right
on
a
smaller
data
set
to
test
out
your
code
and
then
once
you
need
to
really
run
that
at
scale.
You
can
simply
just
change
this
little
extension
here
and
then
and
run
that
on
on
the
cluster
itself.
Okay,
so
it's
and
that's
exactly
what
I'm
doing
here,
if
I
just
say,
percent
local
I
can
see
you
know
this
data
frame
essentially
gets
surfaced.
B
The
spark
data
frame
gets
surfaced
into
a
pandas
data
frame
now
on
my
local
Studio
instance.
So
it's
running
on
a
single
node
and
that's
exactly
what's
going
on
here,
but
if
I
wanted
to
run
this
as
a
push
down
SQL
job
and
visualize,
the
data
you
know
on
a
larger
scale,
I
can
I
can
do
that
by
simply
pointing
to
this.
You
know
by
changing
this
little
line
here
within
the
jupyter
notebook.
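The magics below are a sketch of that workflow as it typically looks with SparkMagic-style kernels used for EMR-connected Studio notebooks; the comment lines simply mark the two separate cells, and the view name is the hypothetical one created above:

```python
# Cell 1, runs on the EMR cluster: push-down SQL; -o pulls the result back to
# the local Studio instance as a pandas DataFrame named state_counts.
%%sql -o state_counts
SELECT addr_state, count(*) AS loan_count
FROM loan_features
GROUP BY addr_state
ORDER BY loan_count DESC

# Cell 2, runs locally on the single-node Studio instance: plain pandas.
%%local
state_counts.head(10)
```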
B
Okay.
So
once
we
have
that,
you
know
we're
gonna
start
with
the
actual
machine
learning
process
where
we,
you
know,
create
the
Target
and
the
the
Target
and
the
feature
variables.
So
you
know
we
have
some
categorical
variables,
some
numerical
variables,
that's
exactly
what
we
Define
here.
We
will
need
this
when
we
actually
run
the
particular
model
training
job,
because
you
need
to
specify
to
sagemaker
to
build
an
algorithms
within
sagemaker
studio
which
variables
are
categorical
and
which
variables
are
numerical.
B
Now,
if,
when
you're
running
an
automated
tuning
job
or
using
autopilot
or
Auto
ml,
there
are
also
capabilities
for
automatically
understanding
which
variables
are
categorical
and
which
variables
are
are
numerical,
so
that
capability
is
already
is
also
available
to
full
users.
B
So
that's
exactly
what
we're
doing
here
we
are
going
to
first
convert.
You
know
the
categorical
columns
as
category
type
for
built-in
algorithms
and
we're
going
to
create
these
two
data
frames.
There's
a
training
data
frame-
and
you
know
which
contains
about
70
of
your
data
set
and
then
there's
the
validation
data
frame
right,
which,
once
the
model
is
trained
it
or
once
the
model
is
going
through
the
training
process.
B
It
needs
to
validate
the
output
from
the
training
process
on
on
a
data
set
to
make
sure
that
it's
improving
its
accuracy
as
it
goes
through.
Various
steps
of
training,
so
you
know
once
we
have
these
two
spark
data
frames.
Now
built-in
algorithms
won't
directly
read
data
from
Delta
lake,
so
there
is
a
another
step
involved
here.
So
these
training
and
validation
steps,
a
validation
data
sets
or
the
data
frames
that
we
just
created
need
to
be
written
out
into
a
separate
storage.
B
Now
built-in
algorithms
will
not
read
from
Delta
Lake,
but
they
will
read
from
open
source
parquet,
okay.
So
that's
exactly
what
we're
going
to
do
here,
we're
using
parquet,
because
our
data
set
is
fairly
large.
You
know
built-in
algorithms
also
support
other
formats
like
Json
and
CSV.
C
B
Others,
but
in
this
case,
because
we
have
a
large
data
set,
we
are
using
parquet
and,
like
I
mentioned
earlier,
you
know
this
is
a
you
know,
a
match
made
in
heaven
in
a
way:
Studio
spark
and
Delta
Lake,
because
you
can
take.
You
know
this.
These
spark
data
frames
and
and
write
them
out
into
parquet
back
into
S3
and
and
built-in
algorithms
will
read
that
in
those
part
that
parquet
data
from
S3
and
and
spin
up
these
distributed
training
jobs
in
in
sagemaker
studio.
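A hedged sketch of that hand-off, splitting the engineered DataFrame from the sketch above roughly 70/30 and writing both splits out as plain Parquet on S3 for the built-in algorithm to read; the bucket and prefixes are hypothetical:

```python
train_df, validation_df = features.randomSplit([0.7, 0.3], seed=42)

train_path = "s3://my-sagemaker-bucket/loan-risk/train/"            # hypothetical
validation_path = "s3://my-sagemaker-bucket/loan-risk/validation/"  # hypothetical

# Built-in algorithms do not read Delta directly, but they do read Parquet.
train_df.write.mode("overwrite").parquet(train_path)
validation_df.write.mode("overwrite").parquet(validation_path)
```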
B
Okay,
and
that's
exactly
what
we're
going
to
do
here.
So
we're
going
to
Define
some
parameters
for
the
training
jobs.
We
have
to
define
the
S3
bucket,
and
this
is
the
S3
bucket,
where
our
train
and
training
and
validation
data
sets
are
stored,
which
we
just
created
in
this
step,
and
then
we
have
to
Define
some
other
configurations
for
the
built-in
algorithm
as
well,
by
the
way
we're
using
extreme
boost.
Here
you
have
access
like
I
mentioned
earlier,
to
a
plethora
of
different.
B
You
know
libraries
and
algorithms
such
as
pytorch
tensorflow,
XG
boost
mxnet
and
then
proprietary,
algorithms
and
some
other
open
source
and
many
other
open
source
algorithms
as
well,
and
then,
once
you
know,
we,
we
have
to
give
this
training
job
a
name
and
you'll
see
why
that
why?
That
is
the
case
in
a
minute
and
then
we'll
Define,
some
hyper
parameters
right.
So
there
are
many
different
tuning
strategies.
B
We're
using
you
know,
automl,
so
there's
different
tuning
strategies
that
you
can
use
to
get
to
the
best
machine
learning
model
and
for
those
tuning
strategies.
We
are
essentially
hinting
to
the
to
sagemaker
to
take
certain
parameter
ranges
into
account
when
tuning
the
machine
learning
job.
Okay
and
then
we
have
resource
limits.
We
can
Define
the
number
of
training
jobs
and
the
parallel
training
jobs
that
can
run
at
the
same
time.
B
This
is
this
is
important
because
you
want
to
get
to
a
model
as
quickly
as
possible,
but
you
also
want
to
want
to
do
it
in
with
the
you
know,
in
an
economical
fashion.
So
this
this
can
be.
You
know,
as
distributed
as
you
wanted
to
be.
These
clusters
can
be
very
large
depending
on
your
data
set
and
depending
on
your
machine
learning
problem,
but
you
know
so
this
is
where
you
can
basically
hard
code,
those
parameters,
and
then
here
you
can
Define
the
strategy
right.
B
So
there
are
genetic
algorithms
that
are
there
this
in
this
case
we're
going
to
use
and
then
there's
grid
search,
random
search,
et
cetera.
In
this
case,
we
are
going
to
use
the
Bayesian
search.
So
that's
exactly
what
we're
doing
here
and
then
we
provide
these
training
job.
We
pass
these
specifications
to
a
variable
in
Python
and
then
we
create
a
hyper
parameter
tuning
job
and
and
we
go
ahead
and
launch
this
tuning
job.
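A condensed sketch of what that SageMaker SDK code typically looks like for the built-in XGBoost algorithm with Bayesian hyperparameter tuning; the bucket, metric, ranges, and job sizes are illustrative assumptions rather than the demo's exact values:

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner, IntegerParameter

session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = "my-sagemaker-bucket"                     # hypothetical bucket

# Built-in XGBoost container for the current region.
image_uri = sagemaker.image_uris.retrieve(
    "xgboost", region=session.boto_region_name, version="1.5-1")

xgb = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.2xlarge",
    output_path=f"s3://{bucket}/loan-risk/output/",
    sagemaker_session=session,
    base_job_name="loan-risk-xgb",                 # the training job name mentioned above
)
xgb.set_hyperparameters(objective="binary:logistic", num_round=200)

tuner = HyperparameterTuner(
    estimator=xgb,
    objective_metric_name="validation:auc",
    hyperparameter_ranges={                        # hinted parameter ranges
        "eta": ContinuousParameter(0.01, 0.3),
        "max_depth": IntegerParameter(3, 10),
        "min_child_weight": ContinuousParameter(1, 10),
    },
    max_jobs=10,                                   # resource limits
    max_parallel_jobs=2,
    strategy="Bayesian",                           # the tuning strategy chosen here
)

tuner.fit({
    "train": TrainingInput(f"s3://{bucket}/loan-risk/train/",
                           content_type="application/x-parquet"),
    "validation": TrainingInput(f"s3://{bucket}/loan-risk/validation/",
                                content_type="application/x-parquet"),
})
```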
B
So
as
this
tuning
job,
you
know
starts
to
kick
off,
you
know
in
this
case
it
you
know
for
this
demo.
It
takes
about
seven
to
ten
minutes
to
complete,
you
can
go
to
your
stagemaker
console
and
if
you
go
into
the
training
section
first
of
all
there
are
you
know,
list
of
algorithms
that
you
can
view
here.
You
can
also
bring
your
own
algorithm.
So
if
you
have
developed
an
algorithm
in-house,
you
have
a
container.
B
You
can
you
can
just
import
your
existing
algorithm
to
H
maker
studio
and
then
you
know.
Obviously
there
are
algorithms
that
are
already
there
that
we
are
using,
for
example,
in
this
demo,
and
here
you
can
see
you
know,
you'll
see
in
progress.
You
know
when
your
training
job
gets,
kicked
off.
B
You'll
actually
see
that
in
progress
here
and
you
can
go
into
your
sagemaker
training
console
and
you
can
go
to
each
stage
of
your
training
job
and
you
can
look
at
your
or
you
know
you
can
look
at
different
log
information
right.
You
can
look
at
the
location
of
the
training
data
set.
You
can
look
at
the
parameters,
the
hyper
parameters
and
you
can
look
at
the
different
metrics
and
then
you
can
also.
B: so this took about seven to ten minutes to complete. Obviously, you know, you can speed that up; if you throw more compute at your job, you can speed that up.
B
But
you
know
once
once
that
is
done.
You
know
I
register
I
go
ahead
and
I
have
the
data
set.
You
know
the
final
you
know,
I
have
the
data
set
in
S3
and
then
the
final
model
artifacts
or
the
model
metadata
is
also
stored
on
S3
once
the
training
job
is
completed,
so
I
can
do
one
of
two
things
so
I
can
either
call
the
sagemaker
or
SDK
to
register
the
final
model.
B
If
it's
to
my
satisfaction
to
the
built-in
model
registry
or
I
can
take
the
location
and
there's
a
UI
right
here
and
I
can
create
create
a
model
version.
So
once
the
training
job
is
run,
I
can
point
to
the
artifact,
which
is
a
zipped
file
and
point.
You
know
the
model
registration,
stagemaker
model
registry
to
that
S3,
location
and
and
create
that
model
in
the
registry
which
we
can
then
use
for
tracking.
As
we
you
know,
train
new
models
on
new
versions
of
the
data
set
so
essentially
with
Delta
Lake
you're.
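A hedged SDK sketch of that first option, registering the tuner's best model into a model package group in the SageMaker Model Registry; the group name and content types are assumptions:

```python
best = tuner.best_estimator()   # estimator behind the best training job

model_package = best.register(
    model_package_group_name="loan-risk-xgboost",   # hypothetical group name
    content_types=["application/x-parquet"],
    response_types=["text/csv"],
    inference_instances=["ml.m5.large"],
    transform_instances=["ml.m5.large"],
    approval_status="PendingManualApproval",        # approve/reject later in the UI
)
print(model_package.model_package_arn)
```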
B: So, essentially, with Delta Lake you're getting this data versioning capability, and with the SageMaker model registry you're getting this model versioning and registry as well, right? And each version of the model also comes with the appropriate training and validation datasets, so you can also tie back to which dataset you used to build these particular models. You can also create endpoints from this UI itself; of course, you can do all of that with the SageMaker SDK, but you can also create these different kinds of endpoints.
B
There
are
asynchronous
synchronous,
real-time
endpoints,
you
know
different
kinds
of
modalities,
depending
on
your
use
case.
That
sagemaker
provides,
of
course,
they're
all
scalable.
You
know
they
Scale
based
on
traffic
Etc,
and
so
all
of
that
can
be
done
through
the
UI
as
well.
And
here,
if
I
go
into
this,
is
this
is
exactly
the
job
that
I
just
ran.
This
is
my
my
training
job.
If
I
go
into
this
I
can
see
the
different
versions
of
the
model
right
and
then
I
can
reject.
B
Some
I
can
approve
some,
and
then
we
can
have
some
have
some
actions
taken
once
these
these
models
have
been
approved
or
rejected
or
appended
Etc.
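And, as a final hedged sketch continuing from the registration example above, deploying an approved model behind a real-time endpoint via the SDK; the instance type and endpoint name are illustrative:

```python
predictor = best.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    endpoint_name="loan-risk-xgb-endpoint",   # hypothetical endpoint name
)
# The endpoint can then be invoked with predictor.predict(...).
```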
A: Cool. So, for starters, we're not going to answer all questions, because we ran a little long, and that's on us, primarily me. So, with that, join us at go.delta.io/slack; Vedant and myself are there to answer questions, number one. Number two, I just posted to everybody the GitHub repo which these notebooks are going to be posted to, so we're gonna resolve and merge that PR later today. I think the question I want to leave it with is one Harry asked:
A: a great question, which is, what are the differences between Apache Spark, Delta Lake, Databricks, and EMR, and, and this is what I also want to call out, SageMaker Studio. So, for starters, Apache Spark is a big data processing engine, Delta Lake is a storage format, okay, and then Databricks, EMR, and SageMaker are services. Databricks I can talk about, and I'll let Vedant talk about EMR and SageMaker. Databricks is basically your lakehouse platform service.
A
We
both
work
well
together
and
slightly
compete
against
EMR
sagemaker
Studio,
but
this
is
great,
we're
all
friends
too.
So
that's
why
we're
on
the
session
together
and
but
not
I'll
leave
you
with
the
last
words
on
EMR
and
sagemaker.
B
Yeah
so
yeah
EMR
is
is
a
big
data
platform.
That
is
it's
a
service.
That
is
it's
a
first
party
service
within
within
AWS.
So
you
know
it
now
supports
their
colleague
as
well,
and
it
offers
not
only
spark
but
they're
a
plethora
of
different
Big
Data
open
source
projects
that
it
offers
so
I
think
there
is
high,
there's
Presto
and
multiple
others
that
are
built
in
which
are
really
for
building
these
big
data
workloads
and
then
sagemaker
studio
is
really
like.
B
You
know
we
talked
about
in
slides,
it's
like
it's
that
Landing.
You
know
that
course,
party
platform,
first
party
service
within
AWS,
for
building
your
machine
learning
workloads
end
to
end
right.
So
there
are
places
where
you
know
you
like
I
show
like
EMR
and
sagemaker
can
play
really
well
together.
There
are
places
where
sagemaker
databricks
can
play
really
well
together
as
well,
which
we
won't
cover
in
the
session.
We
can
have
a
separate
session
for
that,
but
you
know
yeah.
B
There
are
overlaps
between
the
different
services,
but
there
are
also
places
where
we
can
be
complement
a
complimentary
to
each
other.
A
Perfect,
okay,
we're
actually
at
the
top
of
the
R,
so
I'm
gonna.
Unfortunately,
we're
gonna
have
to
wrap
it
up.
So
I
wanted
to
say
thank
you
very
much
for
everybody
for
attending
today's
session,
the
recordings
on
LinkedIn
and
also
on
YouTube
and
as
well.
A
If
you
have
any
questions
again,
but
not
myself
are
both
on
the
Delta
users,
slack
go.io
slack
and
oh
yes,
that's
right
and
Spotify.
The
session
will
also
be
on
Spotify
very
soon
so
again,
but
not
thank
you
very
much
for
attending
today
for
speaking
at
today's
session
and
to
everybody
else
again.
Thank
you
very
much
for
attending.