From YouTube: 2021-09-16 Paul, Ved, Dennis and Radovan discuss how to explore the data pipeline setup
Description
Discussing the solutions we will explore in order to find the optimal approach for the Postgres extraction part.
A
Yes, this is about exploring the gitlab.com data pipeline setup. As I understand this issue, it's about creating an extraction tool, or finding some other way to do the same thing. As I said yesterday, I also spoke about this, and the proposed solution was maybe to split this work, try each point from here, spend some time on that, and also open an agenda. So put your information here so we can draw some conclusions about how we can move forward with this issue.
B
So from here, what we see is that we have five options, five or six in total. For one of the options, the licensed DB extract, we haven't put a lot of description around it, but for the licensed DB extract they pump the data from Postgres to our bucket, we pick it up from there, and our Snowpipe picks it up and loads the data into Snowflake.
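As a rough illustration of that bucket-to-Snowpipe pattern, here is a minimal Python sketch using snowflake-connector-python; the stage, pipe, table names, and bucket URL are hypothetical, not the actual setup.

    # Minimal sketch of the bucket -> Snowpipe -> Snowflake pattern described
    # above. All object names and the bucket URL are made-up examples.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="example_account", user="loader", password="...",
        warehouse="LOADING", database="RAW", schema="LICENSE_DB",
    )
    cur = conn.cursor()

    # External stage pointing at the bucket the source side dumps into
    # (in practice a storage integration would be needed for auth).
    cur.execute("""
        CREATE STAGE IF NOT EXISTS license_dump_stage
        URL = 'gcs://example-bucket/license-db/'
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
    """)

    # Snowpipe that auto-loads new files from the stage into a raw table.
    cur.execute("""
        CREATE PIPE IF NOT EXISTS license_dump_pipe AUTO_INGEST = TRUE AS
        COPY INTO licenses_raw FROM @license_dump_stage
    """)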
B
So it is an automated process, but the volume of data there is still low. If we need to do a hundred percent refresh of all the SCD tables, I'm not sure whether that would be sufficient, copying GBs of data on every daily run. But yeah, there are six options that we have, and I think the task for us is to decide from here which one would be the best, either in this OKR or in the next OKR.
B
That's the first thing. The second point is whether it is possible at all, what the blockers are and so on, because for most of the options the challenge is getting access to the Postgres logs, access to the production PostgreSQL logs. I am not sure how comfortable the infrastructure team or the DB team would be sharing those openly, or giving access to a third-party tool like Stitch or Fivetran, when, for Stitch for instance, there was a huge concern that someone else is accessing their data.
B
I'm not sure how, but yeah, until we start digging into those areas we will not be able to say where we will end up with this. So what we're trying to finalize in this meeting is what we get out of this issue as a next step: do a PoC and come up with results, for example that out of the six options five are doable and one is not.
B
Then maybe two of those have hard blockers that we can't take further, because the infrastructure team has raised a red flag on top of it, or the security team has raised a red flag on top of it, and we can't bypass that. Then for the remaining three we come up with the time and effort it takes to do it. That's my take on this. Dennis, can you add something?
C
Yeah. The only thing, as I understand it right now, is that executing a PoC for all five options may be a little bit of overkill; that's quite time-consuming.
C
So what I would prefer is to rank the solutions that we now have on the table, decide which one is the most preferred solution for us with all the information that we have right now, and with that get in contact with the infra team, get in contact with the security team, and have a hypothetical, theoretical discussion. If they don't have anything against it, then start to do the PoC. I think that's a little bit more efficient.
B
So is there a way they can expose the log file? Do they see any concern with exposing the log files, so that whatever changes happen, we can pick them up from the logging system? If they say yes, then we have the other four options open, and those four can be worked out. Otherwise those four stop then and there, if they say that they will not share the log file or they can't share it.
C
I think the log files are the most convenient way forward, for two reasons. I think it's the most efficient option, because you don't need to scan a complete table in terms of a full load or incremental loads. You also need to filter, and some tables don't have an index for that, so that's also time-consuming. So I think the log files would be a huge benefit, one, for performance and scalability, but also to detect any deletions: right now we're not going to catch deletions unless we do a complete table scan and check it against our own database.
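To make that cost concrete: without log access, catching deletes means diffing the full key set of a source table against our own copy, along these lines. A minimal sketch assuming psycopg2, with hypothetical connection strings and table names.

    # Sketch of delete detection WITHOUT log access: a full key-set diff.
    # Connection strings and table names are hypothetical examples.
    import psycopg2

    src = psycopg2.connect("dbname=gitlabhq_production host=replica.example")
    copy = psycopg2.connect("dbname=warehouse_copy host=copy.example")

    with src.cursor() as cur:
        cur.execute("SELECT id FROM projects")   # full scan of the source keys
        source_ids = {row[0] for row in cur}

    with copy.cursor() as cur:
        cur.execute("SELECT id FROM projects")   # full scan of our copy
        copied_ids = {row[0] for row in cur}

    # Rows we still hold that no longer exist upstream were deleted at source.
    deleted_ids = copied_ids - source_ids
    print(f"{len(deleted_ids)} rows deleted upstream")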
C
But if we need to do that for all the tables, I think that is too cost-consuming. So I think the log files are the way forward. And then, honestly, I think Fivetran and Stitch are not an option, because we would be attaching a third-party tool to our internal systems. I think that would be a no-go for security, and I think it also costs some money, because we have quite some data to be transferred, and for Fivetran and Stitch the cost model is data-driven: the more data you transfer, the more you pay.
C
So my preference is to investigate to which extent we can connect to the log files and whether that's an option, and then find the way forward: what the best way of implementation is. Is that creating something ourselves based on the log files, or are we going to reuse a Singer tap which is already available, which we can leverage in Meltano or maybe somewhere else?
D
Yeah, it's in the comments: Justin did also do some investigation on this, I'm not sure if everyone saw it, but he does not recommend Meltano for this use case, basically. So yeah, if we were going to do a PoC on Fivetran or Stitch I'd say we should pick one, but at the same time I'm also very reluctant to move there. There are a few reasons that I also put in the comments there; yeah, I would also prefer to keep the process internal.
D
I completely agree that using the log files would probably be the best next step. I also think we can probably include some parts of step one: I think source optimization is always a good idea, and we should probably look at any easy wins that we can get out of that. Although I'm not sure that there are many easy wins left, it's always good to do a small check on that.
B
Logging, so monitoring, is basically the major thing, and the stability of the overall product itself.
D
The last one, right? Yeah, yeah, just a few points up there, this one.
B
Okay, so Stitch and Fivetran would be a very costly process, even if we get access. So I think the very first step is to check with infra whether they can give us access to the logs. Only if they give access to the logs can we do in-house development, using Kafka or something like that: we do Kafka streaming of the bin logs and dump them into Snowflake. That's a good option technically, but it might take time to develop it; the development and all of that comes second.
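A rough sketch of what that in-house streaming path could look like, assuming psycopg2's logical replication support, an existing wal2json-style replication slot, and kafka-python. All names here are hypothetical illustrations, not an agreed design.

    # Sketch: stream Postgres logical-decoding changes into a Kafka topic,
    # from which a separate consumer could load Snowflake. Assumes a wal2json
    # replication slot already exists; all names are hypothetical.
    import psycopg2
    import psycopg2.extras
    from kafka import KafkaProducer

    conn = psycopg2.connect(
        "dbname=gitlabhq_production host=replica.example",
        connection_factory=psycopg2.extras.LogicalReplicationConnection,
    )
    cur = conn.cursor()
    cur.start_replication(slot_name="datateam_slot", decode=True)

    producer = KafkaProducer(bootstrap_servers="kafka.example:9092")

    def forward(msg):
        # Each payload is a JSON change record (insert/update/delete).
        producer.send("postgres.changes", msg.payload.encode("utf-8"))
        msg.cursor.send_feedback(flush_lsn=msg.data_start)  # ack WAL position

    cur.consume_stream(forward)  # blocks, forwarding changes as they arrive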
B
But only if we get access to the log files first. If not, then all of this stops, and then we have to look in a totally different direction: how do we make the current process more optimized? Now that they are doing a decomposition of the main production database and splitting the DB itself, will that reduce the turnaround time for us as well? Because then we would have to extract only ten tables from one DB, which might be quite a bit faster, and there would be less load on the main database.
D
Yeah, I'm not sure, because, are you thinking that it's running too long at the moment? There are definitely a few places. I think we still read the file into a pandas DataFrame from CSV, but for getting the data out of the database we're actually running Postgres COPY commands, which I think are one of the fastest ways you can physically extract the data.
D
It's just that we then parse it into a DataFrame and do some other data parsing. So if you just wanted to speed up the process in general, we could probably look at some of the code that we have that's doing the Python processing. But if you wanted to speed up the actual database extraction, or if we have a particular concern, say that one SCD table is timing out or something like that, then we'd need to look into something on the PostgreSQL side.
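For reference, the pattern described here, COPY out of Postgres and then read the CSV into pandas, looks roughly like this. A minimal sketch with hypothetical table and connection names, assuming psycopg2 and pandas.

    # Sketch of the extract path described above: Postgres COPY -> CSV -> pandas.
    # Table, column, and connection names are hypothetical examples.
    import io
    import pandas as pd
    import psycopg2

    conn = psycopg2.connect("dbname=gitlabhq_production host=replica.example")
    buf = io.StringIO()

    with conn.cursor() as cur:
        # COPY is generally the fastest way to bulk-export rows from Postgres.
        cur.copy_expert(
            "COPY (SELECT * FROM projects "
            "      WHERE updated_at >= now() - interval '1 day') "
            "TO STDOUT WITH CSV HEADER",
            buf,
        )

    buf.seek(0)
    df = pd.read_csv(buf)  # the parsing step that adds overhead on top of COPY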
C
To duplicate or replicate the data out of the Postgres database to our Snowflake system, that would be fantastic. Of course, I don't know which use case you're going to use it for, because dbt runs every day, but it would be cool to say we now have a real-time mirror. So, let's see.
B
Yes, okay, so for us to take away the next steps from what we have discussed, I'm just taking a note. I've got two points as the next main course of action, apart from optimizing the source and looking at the bottleneck there. The first is to check with infra about getting access to the Postgres log files.
B
Whether they can share it, and if yes, can we then get a sample of a log file, to see whether we can process it through the Singer tap, or post-process it through something like Meltano? Just a sample, not a huge file, a small one of a few MBs, to check that this works, that this file can be consumed, and that it will work out if we want to take it forward. So those are the two courses of action we can definitely look into.
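If such a sample arrived in a wal2json-style JSON format, which is an assumption since no file has been shared yet, a first sanity check could be as small as this sketch; the file name is hypothetical.

    # Sketch: sanity-check a sample logical-decoding dump (wal2json-style
    # JSON lines), counting change kinds per table. File name is hypothetical.
    import json
    from collections import Counter

    counts = Counter()
    with open("sample_wal.json") as f:
        for line in f:
            record = json.loads(line)
            for change in record.get("change", []):
                counts[(change["table"], change["kind"])] += 1

    for (table, kind), n in sorted(counts.items()):
        print(f"{table:30s} {kind:7s} {n}")  # e.g. projects insert 1234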
B
I have not explored this licensing DB extraction part. Paul, have you worked on this, the licensing DB extract?
D
It was before, you know, not much, but to be honest, I'm a bit skeptical that it'll improve things. Like I say, the problem is that they'll basically have to run the same COPY commands, and then we'll be picking the files up with the same Python commands that we're running at the moment. I mean, I don't think we'll be able to speed things up in the way we'd want to.
D
But I'm just saying that the command that they'll run is physically the same thing that we're running in the normal one. So yeah, we could also confirm it; I'm not 100% certain that's the case, but I'm pretty skeptical that it will help. I think it's basically just a process that's separated out in this case, and it's all in our space, in the gitlab.com space. Okay, okay.
C
Well, can you give some context regarding the setup in the issue here, describing the current licensed DB extraction solution? Because I think then we have the complete picture, and then indeed we can say to everybody that it is not an improvement, because of this and this.
D
Right. In that case, something to keep in mind with Fivetran and Stitch is that they're actually also just running Singer taps, so anything that you can do with the tap is the same. I mean, to be honest, I'm pretty sure most of them are also just running Python code, so I'm not sure how much faster they'll be than our in-house Python, but it would be good to compare.
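For context, the Singer pattern referenced here is a tap process piped into a target process. A minimal sketch of driving that from Python, assuming tap-postgres and target-snowflake are installed; the config file paths are hypothetical, and flag names vary between tap implementations.

    # Sketch: the Singer tap | target pipeline that services like Stitch wrap.
    # Assumes tap-postgres and target-snowflake are on PATH; some taps use
    # --properties instead of --catalog. All file paths are hypothetical.
    import subprocess

    tap = subprocess.Popen(
        ["tap-postgres", "--config", "tap_config.json",
         "--catalog", "catalog.json"],
        stdout=subprocess.PIPE,
    )
    target = subprocess.Popen(
        ["target-snowflake", "--config", "target_config.json"],
        stdin=tap.stdout,
    )
    tap.stdout.close()  # let the tap receive SIGPIPE if the target dies
    target.wait()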
C
I don't know if it is faster, but it is a completely different way of extracting the data. Now we read the data ourselves out of the database, and because we're reading out of a database, we saw a huge lag in the past over there. So that's why we got provisioned a new copy database, which refreshes only once per day.
A
Also, just to understand the full context: I assume, from what Dennis said, we have that copy of the Postgres database, which is a copy of production, and after that we move the data into Snowflake, right, and then do the calculations there. What stops us from querying the Postgres database directly? Do we need all of this data, or just some sums or, I don't know, aggregated stuff? You know what I mean.
C
Yeah, I don't think there is a limitation. I think there is also a good solution in getting the data pushed directly to Snowflake instead of to that replica; that can also be a solution. And I believe, maybe you know this better than I do, that the replica that's provided to us is also based on WAL-E. I'm not sure about that.
B
So the replica that is provided to us is a clone of the production instance. It's the gitlab.com production Postgres DB; they have given a clone to us. That's the replica we have. But I didn't get Radovan's question, the last part.
A
My assumption, or my question for us, since I don't know the full context, is: does anything stop us from querying and summing up data directly from Postgres and just uploading the results we are interested in to Snowflake, to reduce the space? Because you have...
B
That was happening over there before. I saw a couple of queries over there, yes, but Snowflake is treated as the cloud warehouse, and you can't rely on the source, because the data over there in Postgres keeps changing. You have certain deletes happening, certain things which are not being captured. So if you rely on the source DB, which keeps updating everything, you are not able to maintain the history.
B
That was a blocker for the whole thing; that's why they moved everything from Postgres to Snowflake, in near real time, within a day, so that whenever any change happens, we are able to track the historical content of the data. Those were, I believe, the reasons, because initially I saw a couple of queries sitting somewhere that ran all the calculations on Postgres itself.
B
We were reading directly from production. As I say, now we have got a clone, but before this clone environment we had real-time replication, so we were reading from the slave server of production.
B
So if you put a heavy query over there, on top of everything, it was already throwing a lot of lag, like 48 hours, one month even. I think there were...
B
...performance issues copying from master to slave. I think, mostly because of those things taken into consideration, they push everything towards Snowflake, because there we have scalable compute; we can scale up as much as we want to do the costly computation, and Postgres is simply not the place where we can do the whole model preparation. Okay.
A
Okay, okay, that was interesting for me. I see that limitation because of keeping history: currently we keep all the data, including history, because we have only a snapshot in Postgres. So in case you want to travel back and pick up something from, I don't know, last year, it can be a problem, because you've lost the data, right.
B
This system was in place, but the gut feeling was that the Postgres DB itself would not be able to handle this much compute, and they can't scale it up dynamically like Snowflake can. A few of our dbt models run for five hours in Snowflake, in parallel, running lots of clusters of XL size; we can't scale up that much in Postgres. So it will definitely...
B
We have been talking so much, we have almost run out of time, but shall we pick it up like this: I trigger a discussion with Gary, asking whether we can get access to the log files, whether this is an option, and what his idea is on how we can optimize our pipeline from the infrastructure side. Do they think something like Kafka can work?
B
Because when I spoke to him very briefly in one of the coffee chats, he said that they are planning to use Kafka on the application side as well somewhere, and he said that if our team wants to use Kafka as a streamer to consume the data, then we need to talk to some infrastructure head, someone whose name I didn't get, but that was at a very early stage.
B
So the first thing I should be checking with him is access to the log files, how easy it is for them to give it to us, or whether we have any restrictions. And if he says to raise an issue, I can create an issue, tag all of us, and then we can all discuss it there. Would that be the first next step to be done, or do you feel something different?
C
Cool, thanks for organizing, thanks for that.