Description
Monthly catch-up meeting for the Data Platform, PI, and Infra teams.
A
Okay, so on the agenda: Gary is not here, which is fine; he will watch the async update. Also, we are at the point where the decomposition project is about to go live, I think sometime next month, and there is an issue created for it. I think I have access; if not, I'm just putting it over in the chat. This was created by Dylan for us to start migrating to the new one and to make our changes as well. So, yes.
B
[inaudible]

A
Yeah, that's fine! What we were after, basically, is this: we have another instance, but there are already two databases, one for ci1. Can we get a clone of the ci database as well? It was basically a question to Gary: I go and create an issue with all the information, and he can spin up a clone instance for us, so that we can test that, yes, we are able to connect.
B
I have good news for you: we have already chosen the ZFS replicas, one in staging and one in production, from the ci cluster. Why? Because I want to keep everything running properly, and when we do the switchover you will already have this. What I need to follow up on with Gary is the snapshots, the scripts that he uses for snapshotting; I think they are not installed there yet, but the ZFS replica is there. I can give you the names if you want, and after that we can talk about it.
A
Yeah, so that is what we need, basically: a snapshotted environment so that we can start picking things up. And since we talked about ZFS, there is a question we have about this ZFS snapshot. It is like a thin database, as I understand it. What does it contain? Does it contain only what changed between the last snapshot and this one, that is, the delta changes, or does it contain everything inside it?
B
Internally, that is exactly what ZFS does: it has an initial image and then the deltas of what is changing. That is the internal layout. But from your side you see all the data; you don't see only the delta.
B
Yes. With the snapshots, suppose that you don't have any changes and you take a second snapshot: it will take only a few kilobytes to record the difference, say a few updates. If you have a massive update, it includes that too. For example, we have one database of one terabyte, and suppose you have a hundred gigabytes of changes.
The thin clone, the next clone, will hold those 100 gigabytes. As for how to tell whether those blocks are updates rather than deletes: I don't know, because as far as I know this is recorded as disk blocks, so you can't tell what happened. The best way to do that would be logical decoding; with that you can get what happened and keep a log of it, but not from ZFS.
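To make the space accounting described above concrete, here is a minimal sketch, driven from Python, of how a snapshot and a thin clone behave at the zfs CLI level. The dataset names (tank/pgdata and friends) are hypothetical placeholders.

```python
import subprocess

def zfs(*args):
    """Run a zfs command and return its stdout (requires zfs privileges)."""
    return subprocess.run(["zfs", *args], check=True,
                          capture_output=True, text=True).stdout

# Snapshot the dataset holding the Postgres data directory.
# 'tank/pgdata' is a hypothetical dataset name.
zfs("snapshot", "tank/pgdata@before-migration")

# A clone is a writable, copy-on-write view of the snapshot: it starts
# near zero size and grows only by the blocks that change afterwards.
zfs("clone", "tank/pgdata@before-migration", "tank/pgdata-clone")

# USED on the snapshot/clone shows only the delta blocks, while
# REFERENCED shows the full logical size the dataset points at.
print(zfs("list", "-t", "all", "-o", "name,used,referenced"))
```

On the one-terabyte database above with a hundred gigabytes of changed blocks, the snapshot's USED column would sit near 100 GB while REFERENCED stays near 1 TB.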
B
As far as I understand, ZFS is much more a disk-wise, block-level mapping of changes; it is not smart enough to understand the technology running on top of it. So to know what happened between one snapshot and the other, I believe you will need to run some experiments with logical decoding.
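As a starting point for those experiments, here is a minimal sketch of logical decoding using Postgres's built-in test_decoding plugin. The connection string and slot name are hypothetical; pg_create_logical_replication_slot and pg_logical_slot_get_changes are standard Postgres functions, and the server must run with wal_level = logical.

```python
import psycopg2

# Hypothetical connection string; the role needs REPLICATION privilege.
conn = psycopg2.connect("dbname=ci user=postgres")
conn.autocommit = True
cur = conn.cursor()

# Create a logical replication slot with the built-in test_decoding plugin.
cur.execute(
    "SELECT pg_create_logical_replication_slot(%s, 'test_decoding')",
    ("audit_slot",),
)

# ... run some INSERT/UPDATE/DELETE traffic against the database ...

# Read back the decoded changes: each row describes a logical change
# (INSERT/UPDATE/DELETE) rather than an opaque block-level diff.
cur.execute(
    "SELECT lsn, xid, data FROM pg_logical_slot_get_changes(%s, NULL, NULL)",
    ("audit_slot",),
)
for lsn, xid, data in cur.fetchall():
    print(lsn, xid, data)
```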
C
Yes, and as Jose said, just to be on the same page: what you know from Oracle is more friendly than Postgres here. As Gary said, Postgres is not so friendly in this task, I would say; yes, compared to other database providers.
B
We usually export pg_stat_statements to Prometheus, so you can see this at least on the live database. But I want to mention one quite interesting thing here: we have a statement timeout of 15 seconds in production. So if a transaction runs long enough to hit the timeout, you process a bunch of data and generate a lot of WAL, but it is not committed, so it never shows up in the statements. So watch out for that as well.
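A minimal sketch of what that looks like at the query level, with a hypothetical connection string. Note that pg_stat_statements only records statements that complete, which is exactly why a statement cancelled by the timeout never appears despite the WAL it generated.

```python
import psycopg2

conn = psycopg2.connect("dbname=ci user=postgres")  # hypothetical DSN
cur = conn.cursor()

# The production statement timeout mentioned above (e.g. '15s').
cur.execute("SHOW statement_timeout")
print("statement_timeout =", cur.fetchone()[0])

# Top statements by total execution time. The column is named
# total_time instead of total_exec_time before Postgres 13.
cur.execute("""
    SELECT query, calls, total_exec_time
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 10
""")
for query, calls, total_ms in cur.fetchall():
    print(f"{calls:>8} calls  {total_ms:>12.1f} ms  {query[:60]}")
```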
A
Okay, a good thing to keep in mind as we move forward. But for now we will raise a request to Gary to create the snapshot environment for us and to set it up in the firewall, the VPC, the network and so on, so that we have access from our Kubernetes cluster. Then we test that we have access to both of them with our changes. Because until now our changes are theoretical: we have done all the configuration changes, but we didn't have two databases to connect to.
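The connectivity test itself can be as small as the following sketch; the two DSNs are hypothetical placeholders for the existing instance and the clone.

```python
import psycopg2

# Hypothetical endpoints for the existing instance and the new clone.
DSNS = {
    "ci": "host=ci-db.internal dbname=ci user=app",
    "ci-clone": "host=ci-clone.internal dbname=ci user=app",
}

# A connection plus one trivial round-trip query per instance is enough
# to prove the VPC/firewall rules let the cluster reach both databases.
for name, dsn in DSNS.items():
    try:
        with psycopg2.connect(dsn, connect_timeout=5) as conn:
            with conn.cursor() as cur:
                cur.execute("SELECT 1")
                cur.fetchone()
        print(f"{name}: reachable")
    except psycopg2.OperationalError as exc:
        print(f"{name}: FAILED ({exc})")
```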
C
The bad timestamps, yeah, if I'm not wrong. What happened is that when we retrieve the data, for a couple of records, more than one, we find timestamps from before Christ, easily before our era. We filter that data out; we have a workaround and it's working fine, but we're just curious what's going on. What do you see in production? Do you see an issue because of that or not?
B
In Postgres you can do arithmetic operations with dates, for example a date plus one day, and things like that. It is quite interesting because you need to cast things properly throughout. But if you add, let's say, ten thousand, you can get some strange dates, because you'll end up with something ten thousand years away.
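A minimal sketch of the arithmetic being described, with illustrative values only. Note the cast to text in the second query: Python's datetime, much like Snowflake, cannot represent BC years.

```python
import psycopg2

conn = psycopg2.connect("dbname=ci user=postgres")  # hypothetical DSN
cur = conn.cursor()

# date + integer means "plus that many days" in Postgres, so a number
# that was meant as something else silently lands decades away.
cur.execute("SELECT DATE '2024-01-01' + 10000")
print(cur.fetchone()[0])  # 2051-05-19

# Interval arithmetic with a bad magnitude walks out of the common era
# entirely; Postgres happily returns a BC timestamp.
cur.execute("SELECT (TIMESTAMP '2024-01-01' - INTERVAL '4000 years')::text")
print(cur.fetchone()[0])  # 1977-01-01 00:00:00 BC
```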
C
Probably this is four thousand years before the current moment, so something suspicious happened, and it's a couple of records. But yeah, it creates an issue for us: we expect a timestamp. I reproduced the issue on a Postgres database and table and was able to insert that value into a timestamp column. It is a legal, legitimate value for a timestamp in Postgres, but for us, for Snowflake, it doesn't seem valid. But yeah, we will see. I will prepare everything for you after the meeting. Thanks.
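A minimal sketch of that reproduction, using a hypothetical temp table. Postgres accepts timestamps back to 4713 BC, while Snowflake's documented range starts at year 1, so the same value is legal on one side and rejected on the other.

```python
import psycopg2

conn = psycopg2.connect("dbname=ci user=postgres")  # hypothetical DSN
conn.autocommit = True
cur = conn.cursor()

# Hypothetical table standing in for the affected source table.
cur.execute("CREATE TEMP TABLE events (id int, created_at timestamp)")

# Legal in Postgres: its timestamp range extends back to 4713 BC.
cur.execute("INSERT INTO events VALUES (1, TIMESTAMP '1977-01-01 BC')")

# Read back as text: Python's datetime (and Snowflake's TIMESTAMP) cannot
# represent BC years, which is exactly the mismatch the pipeline hit.
cur.execute("SELECT created_at::text FROM events")
print(cur.fetchone()[0])  # 1977-01-01 00:00:00 BC
```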
A
The last one: we had an incident last weekend where our clone was down for a long time, more than 12, maybe 30, hours, because the recovery took more than 60 minutes and then we didn't have access. What we understood from the on-call people is that this service, this Postgres instance, is not part of on-call support.
How shall we approach bringing this whole clone-instance service into on-call support, so that they monitor it? We don't do overnight, 24/7 support of our data business. By the time we realize that there is a database issue, we have 400 Slack alerts saying that all 400 tasks have failed. So we just want to kick-start that process. We don't have to get an answer now, but how shall we approach it?
A
So I got it. In that incident there is a write-up from you as well.
A
Yeah, so they said that they will support it, but they will not pick up the failure when it happens, because it doesn't send a Slack or pager alert to them. So they do support it: when we ping them, they jump on the call and we get a solution. But by the time we ping, it is like four to six hours later. The failure happens at, for example, 11:45 pm UTC.
A
We pick it up by 3 pm UTC or 4 pm UTC, since I am in India time when I look at it. But if we don't have people in the India time zone, it is close to 6 pm UTC. So we have two batch runs by that time, and we have like 500 to 600 alerts to look after. So we wanted it to be picked up under an SLO or SLA kind of arrangement. So we need to move your...
B
...alerts to the main cluster, and we need to alert infra directly. For example, if your batch fails, we need to raise an alert to SRE at that same moment and notify you as well, right? Because from what I understand, your pipelines are failing and you're getting the notifications three or four hours later, right? Then you're looking for the root cause, and then you're paging an SRE or someone to support you. Right? Yes.
A
So Gary has already set up that alert mechanism. It goes to the alert channel, but it is an alert channel with thousands of alerts, so it is not being picked up. So we see that our pipeline failed, then I go to the alert channel and check whether an alert was triggered for it: yes, an alert has fired, so there is an incident over there.
So we would just like to make it a formal thing: if it fails, instead of us picking it up, the on-call people pick it up and it gets fixed then and there, because if there is a runbook it is a repeatable step. They redo the exercise, and by the time our task kicks in, things are restored.
B
Possibly. When you adjust the alert so that it is under the SRE radar, then we are fine, and we put in the runbook. Please put the runbook there as well, as the reference in the alert. Usually the best way is: you have an alert, and you have the runbook; the SREs can react better because they see what to do.
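As a sketch of the pattern B is describing, here is a hypothetical Prometheus-style alerting rule, generated from Python for illustration, that routes to SRE and carries the runbook reference in its annotations. The alert name, expression, labels, and URL are all placeholders, not the team's actual configuration.

```python
import yaml  # pip install pyyaml

# Hedged sketch of an alerting rule: page SRE directly when the clone
# instance goes down, and link the runbook in the alert itself.
rule = {
    "groups": [{
        "name": "data-platform",
        "rules": [{
            "alert": "CloneInstanceDown",                # hypothetical name
            "expr": 'pg_up{instance="ci-clone"} == 0',   # hypothetical metric
            "for": "5m",
            "labels": {"severity": "page", "team": "sre"},
            "annotations": {
                "summary": "Postgres clone instance is down",
                # Hypothetical runbook location, as discussed above.
                "runbook_url": "https://wiki.internal/runbooks/ci-clone",
            },
        }],
    }],
}

print(yaml.safe_dump(rule, sort_keys=False))
```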
A
So that's all on the agenda from my side today. These are the two main things: one is to get a clone for the decomposition, and the second, the infrastructure thing with the backdated timestamps, already has an issue under way, so he will create an issue for that. I will create two issues: one to bring in the on-call SRE coverage, and one to get a clone created by Gary and mapped to the VPC.