From YouTube: New Feature! Redshift Table/View Lineage
Description
Tamás Németh (Acryl Data) talks about how DataHub can now automatically extract Table, View, and S3 lineage using Redshift system tables.
This functionality is available as of v0.8.18
A
Moving on to the Redshift lineage. I'm not sure how much you know about Redshift, so I'll try to guide you through the pain points of building out Redshift lineage. When we talk about Redshift lineage here, we are talking about table lineage. That is what we want to get, and unfortunately with Redshift we are not as lucky as with Snowflake, where we have everything in one place.
A
So here, the only thing you can do is use some nice tricks with the Redshift system tables to get this information. That is the approach I went with, and it's implemented now: we support multiple lineage collectors, depending on the need and use case, and there are pros and cons to each of them. One is the stl_scan-based collector, and there are two system tables we use for it.
A
One is stl_scan, which shows whether a scan happened on a table. If you run a SELECT query and insert the result into a table, the table you insert into will be in the stl_insert table, and the tables you query should show up in the stl_scan table. This is the approach Zynga is using, and I think Typeform as well, and they are already using it with DataHub to get lineage. It works pretty well: it's fast, and it's reliable because it uses Redshift system tables.

But there are a few problems with it. stl_scan only works with Redshift tables, so if you have Redshift Spectrum tables, which are external tables, those won't show up.
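To make the scan-based idea concrete, here is a minimal Python sketch of pairing "inserted into" records with "scanned" records by query id. The row shapes and table names are hypothetical simplifications, not the real system table columns, and this is not the DataHub implementation.

```python
from collections import defaultdict

# Hypothetical rows, shaped like simplified extracts of the two system tables.
# Each row is (query_id, table_name); the real tables carry many more columns.
insert_rows = [(101, "demo.sales_by_city")]
scan_rows = [
    (101, "demo.sales"),
    (101, "demo.users"),
    (102, "demo.events"),  # a plain SELECT: scanned, but nothing was inserted
]


def lineage_from_scans(insert_rows, scan_rows):
    """Map each inserted-into table to the tables scanned by the same query."""
    scanned_by_query = defaultdict(set)
    for query_id, table in scan_rows:
        scanned_by_query[query_id].add(table)

    upstreams = defaultdict(set)
    for query_id, target in insert_rows:
        # A table read and written by the same query is not its own upstream.
        upstreams[target] |= scanned_by_query[query_id] - {target}
    return dict(upstreams)


lineage = lineage_from_scans(insert_rows, scan_rows)
print(lineage)
```

Query 102 produces no lineage edge because it never wrote to a table, and an external table would never appear in `scan_rows` at all, which is the limitation described above.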
A
So I tried to come up with a solution that can capture that lineage as well. Another thing you should know: if a query touches a view, the view itself won't show up in stl_scan, but all the tables underneath the view will show up as dependencies.
A
So there is another approach that you can set up for the Redshift lineage: SQL parsing. We're using a library called sqllineage, which is a Python library, and it works pretty well. This approach works with external tables as well, like Redshift Spectrum.

You might ask, okay, then why not just go that way and use only that one? There are a few problems with it as well. First, it's slow, because it does actual SQL parsing. Second, it's not as precise as using the system tables. One known issue I saw: if a table name is a keyword, say you have a table called date, then it won't show that table, only the alias if you have an alias on top of it.
A
It just skips the table because the parser thinks, I guess, that it's a reserved word. So these are the two approaches you can set up. By default we use the stl_scan-based one, but you can switch to the SQL-parsing-based one. There is also a third, mixed one, which first runs the SQL-parser-based approach, which can capture all of the tables. Then, if there is any issue with a table name, for example the parser failed to get it, we also run the stl_scan-based approach to fix those issues. That can be set up as well.
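As a rough illustration of the mixed mode, the sketch below combines two per-table upstream maps, one from a SQL parser and one from the scan tables. Taking the union of both is an assumption made here for simplicity; the function name, the inputs, and the table names are all hypothetical.

```python
def mixed_lineage(parsed, scanned):
    """Union the upstream sets found by the SQL parser and by the scan tables."""
    combined = {}
    for source in (parsed, scanned):
        for table, upstreams in source.items():
            combined.setdefault(table, set()).update(upstreams)
    return combined


# The parser sees the external (Spectrum) table but skipped a table named "date".
parsed = {"demo.report": {"spectrum.events"}}
# The scan tables see the local table but know nothing about external tables.
scanned = {"demo.report": {"demo.date"}}

mixed = mixed_lineage(parsed, scanned)
print(mixed)
```

Each collector covers the other's blind spot in this example, which is the motivation given for offering the mixed mode.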
All of that is just the table lineage. The other thing is views. There is another system table where you can get all the views and their dependent objects. That's easy and it works pretty well, with one exception: late-binding views.
A
Basically, you can create a view where the schema is only checked when you run an actual query. For those views, the columns and tables won't show up in this system table. The only thing we can do is detect late-binding views and run SQL parsing on the view creation query to get the tables. And then there is one last thing.
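In Redshift, a late-binding view is created with the WITH NO SCHEMA BINDING clause, so one simple way to detect it is to look for that clause in the view definition. The helper below is a minimal sketch with a hypothetical name; it is a plain text match and would also fire if the phrase appeared inside a comment or string literal.

```python
import re

# Matches the Redshift clause that marks a view as late-binding.
_NO_SCHEMA_BINDING = re.compile(r"\bwith\s+no\s+schema\s+binding\b", re.IGNORECASE)


def is_late_binding_view(view_ddl: str) -> bool:
    """Return True if the view definition carries the late-binding clause."""
    return bool(_NO_SCHEMA_BINDING.search(view_ddl))


late = "CREATE VIEW demo.v_sales AS SELECT * FROM demo.sales WITH NO SCHEMA BINDING;"
regular = "CREATE VIEW demo.v_sales AS SELECT * FROM demo.sales;"

print(is_late_binding_view(late), is_late_binding_view(regular))
```

Views flagged this way would then go through SQL parsing of their creation query, since the dependency system table has nothing for them.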
A
What we wanted to support is the case where you are loading your data into Redshift. There is a COPY command for that, and there is another system table where we can get this information as well, including the files that were loaded into a table.
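A small sketch of how load records could be turned into S3 lineage: given (table, loaded file) pairs, group the files by table and keep the S3 folder as the upstream. The row shape, bucket name, and the choice to collapse files to their parent folder are assumptions for illustration only.

```python
from collections import defaultdict


def s3_upstreams(load_rows):
    """Map each loaded table to the S3 folders its files came from."""
    upstreams = defaultdict(set)
    for table, filename in load_rows:
        if not filename.startswith("s3://"):
            continue  # ignore loads that did not come from S3
        folder = filename.rsplit("/", 1)[0]  # drop the file name, keep the folder
        upstreams[table].add(folder)
    return dict(upstreams)


# Hypothetical load records: (target table, file that was loaded).
load_rows = [
    ("demo.sales", "s3://example-bucket/demo-data/sales/part-000.gz"),
    ("demo.sales", "s3://example-bucket/demo-data/sales/part-001.gz"),
    ("demo.users", "s3://example-bucket/demo-data/users.csv"),
]

copy_lineage = s3_upstreams(load_rows)
print(copy_lineage)
```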
And this is what the configuration looks like now. As you can see here, there is this table lineage mode, where you can set whether you want to use the SQL-based one, the mixed one, or the default one, which is stl_scan-based. If you are using the SQL-based parser, there is another issue: if the queries don't specify the schema for the tables, then we have to know somehow what the default schema is.
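The sketch below shows roughly what such a source configuration could look like, written as a Python dict, plus the default-schema fallback for unqualified table names. The option names (include_table_lineage, table_lineage_mode, default_schema) and the three mode values are recalled from the DataHub Redshift source and should be treated as assumptions to verify against the docs; the connection values are placeholders and both helper functions are hypothetical.

```python
# Mode values as recalled from the DataHub Redshift source (assumption, verify).
LINEAGE_MODES = {"stl_scan_based", "sql_based", "mixed"}

source_config = {
    "host_port": "example.redshift.amazonaws.com:5439",  # placeholder
    "database": "dev",                                    # placeholder
    "include_table_lineage": True,
    "table_lineage_mode": "mixed",
    "default_schema": "public",
}


def check_lineage_mode(config):
    """Return the configured mode, falling back to the scan-based default."""
    mode = config.get("table_lineage_mode", "stl_scan_based")
    if mode not in LINEAGE_MODES:
        raise ValueError(f"unknown table_lineage_mode: {mode!r}")
    return mode


def qualify(table_name, default_schema):
    """Prefix the default schema when a parsed table name has no schema."""
    return table_name if "." in table_name else f"{default_schema}.{table_name}"


print(check_lineage_mode(source_config))
print(qualify("sales", source_config["default_schema"]))
```

The `qualify` helper is the reason a default schema is needed at all: a parser only sees `sales` in the query text, while lineage needs the fully qualified name.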
B
This is very near and dear to my heart, because I used to work extensively in Redshift and tried to figure out the lineage of thousands of tables and views. First of all, I'm very excited about it. Second of all, yeah, I'm also very excited about it. Anyway, for the stl_scan tables and also the query parsing, do we have an idea of how much history is available by default? Because I know that's flushed, right?
A
Yeah, that's right. The retention, according to the Redshift documentation and as far as I can remember, is I think two to five days, based on how many queries are running on the Redshift cluster. So if it's a busy cluster, it will be shorter, like two days, but more than a day for sure. So roughly two to five days.
B
Okay. As far as I know, that's different from BigQuery and Snowflake. Not only are they doing a better job of just storing table lineage as a system table, but I think they're retaining a longer history, if not all of it. So I think that's something we should definitely call out in the docs, if we haven't already, just to help people understand the best practices and the limitations, which hopefully they already know since they're working in it.
A
Yeah. Actually, the Redshift recommendation is to take a backup of that on your own if you want longer retention, for the query history and, I guess, for the other tables as well. If this is how our customers are using it, if they are backing these tables up, maybe we can support specifying your own backup table or something like that.
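To illustrate why a periodic backup extends the usable history, here is a minimal sketch that merges each fresh extract of system table rows into a longer-lived store. The in-memory set stands in for whatever backup table or bucket is actually used, and the row shape is hypothetical.

```python
def snapshot(backup, current_rows):
    """Add rows not seen before to the backup and return how many were new."""
    before = len(backup)
    backup.update(current_rows)
    return len(backup) - before


backup = set()

# Hypothetical (query_id, table) rows visible on two different days.
day_1 = [(101, "demo.sales"), (102, "demo.users")]
day_3 = [(102, "demo.users"), (103, "demo.events")]  # query 101 has aged out

added_first = snapshot(backup, day_1)
added_second = snapshot(backup, day_3)
print(added_first, added_second, sorted(backup))
```

Query 101 is gone from the system table by the second run but is still in the backup, so lineage built from the backup keeps it.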
B
Yeah, or dumping it out. I know that at the last two companies I was at, we were dumping it into an S3 bucket and just storing it indefinitely. So maybe we need to think through how you point it to either S3 or Redshift.
A
So I have this. I'm not sure how well you can see it. For the demo data, I will create a schema and a couple of tables. These are the Redshift demo tables. Then I'm copying, so loading data from S3 into these tables, then creating a table, which is sales by city, by running this example query, and also creating a view on top of that data. Let's see how it goes.
A
Hopefully this all runs. It seems like it worked. Just to show you I'm not cheating, I'm showing you my DataHub, and it's currently empty, as you can see. Oh, it's Zoom. Okay, so no cheating, as you can see. This is my ingestion job, which will generate the lineage. I'm going with the default lineage generation for now, and I'm using demo 3, because that is the schema where I loaded the data.
B
This is awesome. Extracting those S3 buckets is so great. It's like a nice cherry on top of just basic Redshift lineage.
A
Sure. So basically, that's the lineage. One more thing I put in there as well, for the SQL parser: if the query parser throws an exception, we add a property to that table saying, hey, these are the queries that failed. This is disabled by default. I don't know if the default should be enabled or disabled, but if you enable it, then in the properties you can find the queries that failed during parsing.
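The failure-capture idea can be sketched as a wrapper that keeps going when the parser raises and records the offending queries as a property. Everything here is hypothetical: the property name, the helper names, and the toy parser, which only stands in for a real SQL lineage parser.

```python
def collect_upstreams(queries, parse_tables):
    """Parse each query; record the ones that fail instead of aborting."""
    upstreams = set()
    failed = []
    for sql in queries:
        try:
            upstreams |= set(parse_tables(sql))
        except Exception as exc:  # any parser error becomes a recorded failure
            failed.append(f"{sql} ({exc})")
    properties = {}
    if failed:
        properties["failed_queries"] = "; ".join(failed)  # hypothetical name
    return upstreams, properties


def toy_parser(sql):
    """Stand-in parser: returns the single word that follows FROM."""
    words = sql.split()
    upper = [w.upper() for w in words]
    if "FROM" not in upper or upper.index("FROM") + 1 >= len(words):
        raise ValueError("could not find a table after FROM")
    return [words[upper.index("FROM") + 1]]


upstreams, properties = collect_upstreams(
    ["SELECT * FROM demo.sales", "SELEC oops"], toy_parser
)
print(upstreams, properties)
```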
A
I tried out all three of these: the mixed, the stl_scan-based, and the SQL-based. I would assume that most of our customers would use the stl_scan-based one, and I think that should be the most reliable as well. But if it turns out they have, I don't know, external tables, we can still say: okay, if you want to support external tables, then you should use the SQL parser.