Description
Session number 5: Chris Shar, Radovan Bacovic and Dennis Van Rooijen - pillars of successful DBT sharding
A
What is the agenda for today? I haven't had time, Chris, sorry, to check why this Airflow pod running locally is not spinning up properly, but we'll check it probably tomorrow; on a Friday it will be much better with my free time. And it also secures us time to try to cover, let's say, this exploration. He wants to share his experience about the late arriving dimensions principles.
A
Opportunity data, yeah. If I'm not wrong, we pull the data every six hours in the extraction part, and we also want to get closer to that when it comes to transformation, but also to open some doors to do it more frequently. And, as I said, Chris did a great job analyzing the models and the approximate execution time for this set of models. We want to focus and actually do the extraction from the main DAG and put it into a separate DAG. For that reason, also, we discovered a couple of good options.
A
I already mentioned the selectors option earlier today to Dennis: instead of having a long sausage of commands you want to use, you can say dbt run --selector and the selector name, right, and in that case it can be a good general approach for the entire transformation part, or all dbt jobs, but...
A
...to do whatever you want, right. Yeah, and actually, Dennis, we need your help here about what we talked about, this late arriving dimensions principle, and how we should implement it. Because the main idea is to try to convert each model from full to incremental load where it's suitable and makes sense, and not over-complicate everything; but that's the cherry on top. From my point of view, we need these principles together with late arriving dimensions, I imagine, so we want to hear from you.
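For reference, a minimal sketch of what such a selector could look like in dbt's selectors.yml; the selector and tag names here are illustrative, not the actual ones in this project:

```yaml
# selectors.yml -- illustrative names, not the project's actual selectors
selectors:
  - name: six_hourly_salesforce_opportunity
    description: One named entry point instead of a long chain of dbt flags.
    definition:
      method: tag
      value: six_hourly_salesforce

# Invoked as:
#   dbt run --selector six_hourly_salesforce_opportunity
```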
C
The idea behind late arriving dimensions is that facts, most of the time, have higher-frequency data. I'd have to take a look at the same opportunity data, and I don't know what other dimensions we have there, sorry. And one more thing: it only applies to facts and dimensions, not to mart tables, not to report tables, not to prep tables. So really, facts.
C
Okay, the idea behind this is that facts are updated more frequently than dimensions. Let's say you have a fact table, opportunity, out of salesforce.com, and it also has a dimension table; I don't know exactly which dimensions apply to this fact table, but say a dim table called client, for example. The client itself, all its attributes, will not change frequently, yeah. The name of the client will not change frequently. The contact person of the client will not change frequently. Maybe they do, but that's why they're also called slowly changing dimensions.
C
Fact data, most of the time, especially if you have an event-driven fact table, is updated more frequently. So there are a lot of opportunities, hopefully, put into Salesforce, so you see a lot of new opportunities. From that perspective, it makes sense to update the fact table more frequently, rather than updating the dimension more frequently. If you want to see a high-level number, a high-level KPI, the number of opportunities as of now, most likely you don't need all the attributes that are in the dimension.
C
Maybe you want to divide or aggregate or group by on some of those dimensions, on those attributes, but because they will not change that frequently, it makes perfect sense to only focus on the fact table. The problem is: as soon as you put in a new fact line, so in this case a new opportunity into the fact table opportunity, and you don't update the dimension table first, you cannot create a join between the fact and the dimension table, and then you get missing data if you do an inner join between the fact table and the dimension table.
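As a sketch of that join problem, with invented table and column names: an inner join silently drops any new fact whose dimension row has not arrived yet, while a left join at least keeps the fact visible:

```sql
-- Invented names. An inner join loses opportunities whose client
-- has not been loaded into the dimension yet:
select f.opportunity_id, f.amount, d.client_name
from fct_opportunity f
join dim_client d on d.client_id = f.client_id;

-- A left join keeps the fact row and exposes the gap instead:
select f.opportunity_id, f.amount,
       coalesce(d.client_name, 'Unknown (late arriving)') as client_name
from fct_opportunity f
left join dim_client d on d.client_id = f.client_id;
```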
C
So the principle behind late arriving dimensions is that you already put a placeholder in the dimension table with, if you have surrogate keys, the surrogate key and the natural key, but none of the attributes. And then, let's say once per night, you update all the attributes following the normal frequency we already have right now. So that means you don't have a heavy loading process where you need to capture changing dimensions.
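A rough dbt sketch of that placeholder idea, assuming hypothetical model and column names; it shows only the high-frequency piece that inserts skeleton rows, while the nightly job that fills in the real attributes is omitted:

```sql
-- Hypothetical sketch: dim_client kept incremental so the frequent run
-- can append placeholder rows for natural keys that arrive with the facts first.
{{ config(materialized='incremental', unique_key='client_id') }}

select distinct
    f.client_id,                        -- natural key arriving with the fact
    'placeholder' as client_name,       -- real attributes come with the nightly refresh
    cast(null as varchar) as contact_person
from {{ ref('stg_salesforce__opportunity') }} as f
{% if is_incremental() %}
where f.client_id not in (select client_id from {{ this }})
{% endif %}
```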
C
Yeah, I have, but not in dbt. I used, for example, Talend, Informatica PowerCenter, IBM DataStage for doing this, but not via dbt, yeah.
B
Yeah, all the models that we've looked at for the Salesforce opportunity, they are all full refreshes, and, given that they're short, they run pretty quickly, because there are only like 250,000 rows or something, so yeah.
A
Because we just want to create some kind of exercise, if we wanted to play and pretend we want to switch from, let's say, full refresh to incremental. But you also spoke with Pete Rimpy, right, Chris, about one model we found that is kind of the most complex; in that case, if we try to reorganize it, it will be very, very... yeah.
B
Right, yeah. We looked at making some of these incremental, and, I mean, incremental models are really useful for large models, but it does increase the complexity of the code that goes into them. A lot of these have many different inputs, so we sort of felt that it was more trouble than it was going to be worth: it would have increased the complexity, and I don't think you would have seen much improvement in runtimes.
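For context, the conversion being weighed up is roughly this shape; a minimal sketch with invented column and model names, not the actual Salesforce models:

```sql
-- Hypothetical sketch of a full-refresh model converted to incremental.
{{ config(materialized='incremental', unique_key='opportunity_id') }}

select opportunity_id, stage_name, amount, updated_at
from {{ ref('stg_salesforce__opportunity') }}
{% if is_incremental() %}
-- Only rows changed since the last run. Repeating this kind of predicate
-- across many upstream inputs is where the extra complexity comes from.
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```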
C
That's true. On the other hand, if we want to roll this out further, it can also be a showcase to show that this is a good way to work in the future. So when, if we don't do it right now, when do we want to do it? That's basically the...
A
...other question, yeah. That's true, that's true, but we started optimistic, like, okay, let's pick one model and try to reorganize it to be incremental instead of full load, but, as Chris said, it's super complex. So when you try to touch it, you actually need to write it from scratch, probably, or most of the things will not be used, or will be discarded. So...
A
This late arriving data goes against a complex model under the hood. For now, Chris labeled everything needed for the new DAG and also excluded everything from the existing DAG, so now we have the material to create a completely separate DAG. I just need to check why the dbt model run is not spinning up on the cluster in the testing environment. And the main question is, okay: how do we approach this, how do we optimize this? That is, what do you think?
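What that labeling could look like, as a sketch with an illustrative tag name (not necessarily the tags Chris used): the moved models carry a tag, the new DAG selects on it, and the existing DAG excludes it:

```sql
-- Illustrative tag name. In each model moved to the new DAG:
{{ config(tags=['six_hourly_salesforce']) }}

-- The new DAG then runs only the tagged models,
-- and the existing DAG excludes them:
--   dbt run --select  tag:six_hourly_salesforce
--   dbt run --exclude tag:six_hourly_salesforce
```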
B
I don't think so. Looking into some of these, the prep CRM opportunity model has many different sources, so it would be quite complex to update it to be incremental.
C
Another thing I'm asking, because what I hear right now, basically, is that we exclude a certain flow from some source models across both layers, so, let's say, from raw to prep: we exclude it from the regular one, we do it in a new one, and we schedule that four times a day. So I'm trying to see where we can raise the bar a...
B
So this is actually the model, and it uses a macro. I wasn't involved enough to speak to Michelle about how this was put together, but...
B
This macro... I've never made a macro into an incremental model, but I mean, there would be ways of doing it. What was the name of the actual...?
A
Also,
the
pro
one
of
the
problem
with
this
model,
then,
is
because
many
Avengers
are
hardcoded
here
not
exposed
to
the
table
but
expose
in
hard
code
manner
as
a
table
like
one
two,
three,
four,
five,
the
meaning
of
four
all
these
facts.
So
that's
the
catch
here.
One
of
the
issue,
I,
would
say-
and
this
is
built
based
on
marker
right,
Chris
and
I-
think
it's
using
two
times.
C2
live
and
one
another
parameter
right.
Yeah.
B
That's right, if it's...
A
Yeah, in this case, theoretically, probably. No, but what Dennis pointed out makes perfect sense to me, because sometimes you can get late arriving data from some other side and you miss that information. And if you apply the late arriving data principles and fill the facts completely, because you have the facts, but let's say you miss a new user, new player, new customer, whatever, then you need to put this placeholder record in the dimension.
B
Yeah, I'm not sure. I think, you know, if we were looking at a larger, longer-running model, then definitely, and we do generally make those into incrementals, but for these small, very quick-running ones, it's often better for the simplicity of a full refresh.
B
So we've got a selector called six hourly Salesforce opportunity, and then the DAG itself.
A
Yeah, this is what we implement then, for information, to make everything flexible. Let's say we want to run it: there was a discussion, okay, how to change something to run hourly, but actually our need is to have six-hourly in a first iteration, and later on to be able to decrease the interval down to five minutes.
A
Half
an
hour
one
hour
whatever
and
yeah
Chris
found
this
very
nice
and
elegant
solution
with
these
selectors,
so
you
can
easily
just
find
and
replace
tags
like
six
hours,
so
we
combine
Salesforce
six
hours
tomorrow
we
can
combine
Salesforce
with
five
minutes
run.
If
you
know
what
I
mean,
so
it's
very
easy
to
keep
everything
in
control,
so
that
part
is
also
covered
and
I
think
it's
also
a
very
nice
feature
to
help,
because
we
will
run
this.
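One hedged way to get that find-and-replace flexibility, again with invented names: keep the cadence as its own tag inside the selector definition, so switching Salesforce from a six-hours to a five-minutes run is a one-line change:

```yaml
# Illustrative selectors.yml fragment
selectors:
  - name: salesforce_six_hours
    definition:
      intersection:
        - method: tag
          value: salesforce
        - method: tag
          value: six_hours   # swap for a 'five_minutes' tag to change cadence
```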
A
But the main question for us is how to pick one showcase for converting a full load to an incremental load. That's the question mark for us, because...
A
...every model here, all of them, are, I'd say, fast and quick in execution, even with a full load. But we want to expose one example: okay, we know how to do this, yeah. But yeah, as you said, the main concern here is the over-complexity for this showcase, and also that it contains several sources, which can create...
A
It will pick up the latest data it has, yeah. Also, one thing we're considering is to come up with some sensor to check whether the extraction is done or not. I know from previous projects I was working on that there was a catch: theoretically speaking, in order not to be blocked, you will just load the data from some point in time, actually whatever you find in the raw layer in Snowflake, because that's how dbt works.
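Since the orchestration discussed here is Airflow, one possible shape for that "is extraction done?" check is a SqlSensor; the connection ID, table, and freshness rule below are invented for illustration:

```python
# Hypothetical Airflow sketch: block the dbt task until the raw layer in
# Snowflake looks fresh. Connection, schema, and threshold are invented.
from airflow.providers.common.sql.sensors.sql import SqlSensor

wait_for_extraction = SqlSensor(
    task_id="wait_for_salesforce_extraction",
    conn_id="snowflake_raw",
    # Truthy (count > 0) once the newest raw row is under six hours old.
    sql="""
        select count(*)
        from raw.salesforce.opportunity
        where _uploaded_at > dateadd(hour, -6, current_timestamp())
    """,
    poke_interval=300,    # re-check every five minutes
    timeout=6 * 60 * 60,  # fail the run after six hours of waiting
)
```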
C
Exactly. So, in my opinion, what I would like... well, my personal vision here is that a downstream data model...
C
...should run properly regardless of what happens upstream. So it needs to be, let's say, flexible; can I call it flexible, or adaptive to what happens? Because anything can happen. What also can happen is that the extraction for the opportunity table goes well and there is an error on the user extraction, right. So if we now have the philosophy that we do a full refresh because we don't know exactly what happened, for some hocus-pocus reason, I think that's not the right thing to do, and I think the impact is less.
A
Actually, when you define the Stitch integration, you can say which way you want to do it. As Dennis said, in some other sources it's called SCD-1, SCD-2; here it's called something like not tracked, key-based incremental, full refresh, something like that. But usually... I'm looking now; I can share my screen, if you want, just to show you. They just call it by a different name, baptize it with a different method. See, say, the account table: it's key-based incremental, so...
A
And then on our side we do a full load, as Dennis said, because of course of these hocus-pocus blockers; we don't want to mess with this. Maybe it's a good time to challenge that approach and try to do it somehow differently. As Dennis explained nicely, it can happen that something arrives late or something is screwed up, with low probability, of course, but we need to be prepared for that. That's just me thinking about the best use case, how to raise the bar and expose...
C
In this case, where we do an exploratory thing on dbt sharding, maybe for this one we can do it super boring and just schedule it four times a day, every six hours. But I'd also see this as a showcase for rolling this out any further. So from that respect I would say, yeah, indeed, don't over-engineer it, but...
C
I
think
if
we
want
to
do
TBT,
sharding
or
if
you
want
to
shout
out
the
Big
Deck,
what
we
have
right
now
run
it.
The
second
use
case
run
it
also
multiple
times
per
day,
to
give
our
consumers
data
more
frequently
I
think,
then
we
have
to
come
up
with
something.
What
we
don't
have
right
now,
because
a
a
a
a
more
frequent
load
on
the
models
that
we
have
right
now.
I
think
that's
not
doable,
because
it
already
runs
for
eight
or
nine
hours
and
if.
C
Do
it
that's
not
possible
so
from
a
shorty
perspective,
I
think
that
selector
mechanism
is
fantastic
because
it
gives
a
lot
of
flexibility
to
short
out
decks,
but
to
make
this
next
step
to
to
raise
the
bar
a
little
bit.
I
also
would
like
that's
my
opinion
here.
If
we
also
can
implement
something
that
leads
to
more
efficient
loading,
a
more
robust
loading
to
increase
the
frequency,
I
think
that
would
be
great
as.
A
For now the sharding plan is: how to shard one DAG for any provider. The second takeaway is to switch to the selector option and make it very, very flexible; then you open up various options to decrease the load, increase the load, do whatever you want, include, exclude. And the first takeaway, what we are actually missing here and want to agree on, is to raise the bar and provide a comprehensive Swiss knife for how to shard everything: that is, incremental load from full, and also late arriving dimensions, right. And this is how I see it. This is...
C
...my vision exactly, and a great example is gitlab.com data. Previously we had just one database instance where all the data was in; now we already have two main instances. If you want to combine those data sets, you're basically combining data from two different data sources, technically. So if one pipeline fails, we have this now: sometimes you also need to prepare for a situation where the models in dbt have to be robust enough to handle these kinds of situations, where a pipeline could fail, and now we have two instances.
C
Hopefully
in
the
future,
we
have
hundreds
of
thousands
of
instances
if
we
go
to
a
portion
post,
sharding
architecture,
where
we
have
thousands
of
database
instances
for
our
soft
platform
and
I
get
guarantee.
If
we
have
hundreds
of
ports
where
we
need
to
extract
the
data
from
some
of
them
will
fail
in
the
end
right.
Every
now
and.
C
...then. The more you have, the more the likelihood that something will break increases as well. So the data landscape will only get more complex, and the question now, of course, is, yeah: to what extent do we want to prepare ourselves for that? Well, that's basically a little bit up to you, I think. Indeed, the selector mechanism is already right; you can do good sharding with that mechanism.
A
I think selectors are kind of ready; we just need to implement this somewhere, but we know the mechanism, it's fairly simple, and we can simplify the process. As I said, the first takeaway is to establish a good process for how to do sharding: you have specific steps, like, we have a method and...
C
Because
that
that
selector
will
make
a
decision
and
putting
that
in
a
yaml
file,
I
think
that's
also
very
beneficial
for
the
analytic
Engineers,
because
right
now,
if
there
are
new
models
created
at
least,
if
I
do
it,
for
example,
for
certain
source
to
provide
the
data
in
a
workspace
model.
For
me,
it
is
unclear
how
I
can
scale
that
one.
A
...of them, also not in sharding but generally speaking. And the first pain point for us at the moment is how to find the optimal way for robust dbt load creation, in case something fails or something rapidly, radically grows, like from 1 to 100 databases, which is a really possible scenario in the next couple of years, right. We spoke with the Robinhood company then, as you remember; they have 30 databases which are fairly small, but you have a lot of components, and that also leads to a high probability that something will fail.
C
Maybe
the
fourth
takeaway
number
D
yeah
find
a
optimal
way
to
provide
high
frequency
loading
or
data
processing
in
DBT
and
optimal
weight,
provide
more
frequent
data
loading
or
data
processing.
I
think
we
should
call.
A
Because
I
I
connected
with
the
first
one,
but
just
my
practical
order-
how
to
put
here
so
I,
will
also
put
this
in
issue
thanks
for
help.
Dennis
you're
welcome,
Chris
I'll
just
put
everything
in
agenda
and
also
in
issue.
So
what's
our
next
steps?
Now
it's
holiday
season,
so
probably
a
few
weeks.
This
will
be
like
a
bit
and
after
that.