Description
Weekly update for SEG in APM.
Update issue - https://gitlab.com/gitlab-org/incubation-engineering/apm/apm/-/issues/6
Hello there. Joe Shaw, Single Engineer Group for APM. This is my weekly update video, slightly delayed, but better late than never. At the moment we're starting to put issues together for the videos themselves.
I've just sorted my headphones out, that's better. So, I showed you the Datadog agent sandbox a bit last week. That project has come along quite a long way: it allows us to test the agent in isolation, and we've added some things to isolate the network. It's in Docker Compose, and we've added CoreDNS so we can capture any external traffic.
So in that respect we can see if the agent is sending any outbound requests that we wouldn't normally know about, because while we can see the expected requests through the mock servers that we previously set up, we can't be sure, without auditing all the code in depth, that there aren't other requests going on. I'll show you what I mean by that.
So we've got the Datadog agent there in this compose file. You should be able to see this better than usual now, because I've actually got it zoomed in, which I noticed I hadn't before. At the top, as you can see, we've got a DNS proxy set up using CoreDNS, and the agent is using that DNS proxy IP.
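For reference, the pattern described here (a CoreDNS container acting as the DNS server for the agent container, so every lookup gets logged) could be sketched in a compose file like this. The service names, subnet, and pinned IP are illustrative, not the actual sandbox config:

```yaml
services:
  coredns:
    image: coredns/coredns
    command: ["-conf", "/etc/coredns/Corefile"]
    volumes:
      - ./Corefile:/etc/coredns/Corefile:ro
    networks:
      sandbox:
        ipv4_address: 172.28.0.53   # fixed IP so the agent can point at it

  datadog-agent:
    image: gcr.io/datadoghq/agent:7
    dns:
      - 172.28.0.53                 # every DNS lookup goes via CoreDNS
    networks:
      - sandbox

networks:
  sandbox:
    ipam:
      config:
        - subnet: 172.28.0.0/24
```

A minimal Corefile that logs every query before forwarding upstream would be:

```
.:53 {
    log
    forward . 8.8.8.8
}
```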
And in fact, let's ping Datadog in there and see if the request goes through. What you'll see is that the only other requests not being captured by this compose environment and going to one of our services are Datadog requests to NTP (Network Time Protocol) servers, which is a good idea and to be expected: it keeps the agent in synchronization. So even if the host machine has the wrong time settings, it can reach out, grab some NTP settings and synchronize with those servers. So that's fine; I don't think we need to worry about that going through there. What have we got next?
Yes, so we've added into this environment a ClickHouse data store, with a Golang service on top of it to capture the series data. It's a really simple setup.
I'll show you the series Go agent here. It's just in one Golang file, and you'll see that ClickHouse has very much a sort of SQL-like dialect; in most cases it's almost identical.
We create a metrics database on startup here, and a series table in that database. It's really just a flattened representation of the series data we get from Datadog. Not much thought has gone into this; it's something that we will redesign and run some benchmarks against to get a better data model in place, because this is completely denormalized and not a very efficient way of doing it. So you've got timestamp, host, metric, value, and an array of tags there.
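As a rough sketch, the flattened table described would look something like this; the column names are my guess at the shape, not the actual schema:

```sql
CREATE DATABASE IF NOT EXISTS metrics;

-- One row per data point; tags stored as a plain array,
-- no explicit primary key, just ordered by timestamp.
CREATE TABLE IF NOT EXISTS metrics.series (
    timestamp DateTime,
    host      String,
    metric    String,
    value     Float64,
    tags      Array(String)
) ENGINE = MergeTree()
ORDER BY timestamp;
```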
I'm using the MergeTree engine, which is the default engine in ClickHouse and which I need to spend some more time looking into. We don't bother with a primary key; we're just ordering all the records by timestamp, which will create a timestamp index. And this is some code to flatten the incoming requests and store them, as you can see.
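The flattening step could be sketched like this; it's a minimal reconstruction, not the actual agent code, and the JSON field names follow Datadog's public v1 series submission shape, where each point is a `[timestamp, value]` pair:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// seriesPayload mirrors the Datadog v1 series submission body.
type seriesPayload struct {
	Series []struct {
		Metric string       `json:"metric"`
		Host   string       `json:"host"`
		Tags   []string     `json:"tags"`
		Points [][2]float64 `json:"points"` // [unix timestamp, value]
	} `json:"series"`
}

// row mirrors one record in the flattened ClickHouse table.
type row struct {
	Timestamp int64
	Host      string
	Metric    string
	Value     float64
	Tags      []string
}

// flatten turns a series payload into one row per data point.
func flatten(body []byte) ([]row, error) {
	var p seriesPayload
	if err := json.Unmarshal(body, &p); err != nil {
		return nil, err
	}
	var rows []row
	for _, s := range p.Series {
		for _, pt := range s.Points {
			rows = append(rows, row{
				Timestamp: int64(pt[0]),
				Host:      s.Host,
				Metric:    s.Metric,
				Value:     pt[1],
				Tags:      s.Tags,
			})
		}
	}
	return rows, nil
}

func main() {
	body := []byte(`{"series":[{"metric":"system.io.w_s","host":"myhost",` +
		`"tags":["device:sda"],"points":[[1650000000,12.5],[1650000010,13.0]]}]}`)
	rows, err := flatten(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(rows), rows[0].Metric, rows[1].Value)
}
```

Each flattened row then maps directly onto an INSERT into the series table.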
With the ClickHouse data source, which isn't built in (you have to add it as a plugin), you get quite a simple way of setting things up. It will try to detect a timestamp column that it will want to sort by. You don't get much in the way of query editing capability, but it gives you some template variables that do automatic expansion, like this time filter here.
So if we look at the actual query, the time filter is actually converting epoch times to dates and checking that the timestamp is within those. And here what we're doing is filtering this data down to the system disk write time percentage. Let me just refresh this; I've noticed that with this plugin the auto-refresh doesn't seem to be working. So, the last five minutes.
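The kind of query shown could look like this. The macro syntax follows the Grafana ClickHouse data source's `$__timeFilter`, and the metric name is a placeholder, not necessarily the exact one on screen:

```sql
SELECT timestamp, value
FROM metrics.series
WHERE $__timeFilter(timestamp)         -- plugin macro: expands to time bounds
  AND metric = 'system.io.w_time_pct'  -- placeholder metric name
ORDER BY timestamp

-- The macro expansion is roughly:
--   timestamp >= toDateTime(1650000000) AND timestamp <= toDateTime(1650000300)
```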
A
So,
let's
see
if
we
can
actually
run
some
run,
a
stress
ng
test
here,
I'll
keep
that
off
for
30
seconds,
so
I'm
just
going
to
run
a
hard
drive
test
with
32
workers
and
it's
just
doing
sort
of
I
o
sync
buffer
syncs
to
disk.
Hopefully
this
will
get
some
changes
in
those
disk
metrics.
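The command for that sort of run, assuming standard stress-ng flags, would be something like:

```
stress-ng --hdd 32 --timeout 30s
```

`--hdd 32` starts 32 disk write/sync workers and `--timeout 30s` stops everything after 30 seconds; the exact sync behaviour can be tuned further with `--hdd-opts`.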
It's always a bit of an issue when you're doing these sorts of disk tests on a machine with a lot of RAM: often things just get memory-mapped and you don't see any changes to the disk. So we can just wait for that; I set it for 30 seconds, and there it is. You can see it's done some operations there.
If we have a look at this, you can see a big spike there, relatively. If we look at that earlier time frame, they were sort of normal percentages, and then if we look at the last five minutes you can see that big spike. Now, what I don't really understand in this data is why a value that is supposed to be a percentage is higher than 2000.
I don't know why; that is something I need to look into, something to do with the Datadog agent, I'm not quite sure. You can see that big write spike there, and it's coming back down, and there we go: it was roughly, was it 2,036? Yeah, so roughly over 30 seconds to a minute there. So you can see that's working fine, and if we took that spike off, you can see the normal data set there in terms of the query.
We're grouping by host and tags there, so I've concatenated the tags together. You can see for this data set we've got my hostname, and we've got device names for the individual disk devices. And you can see the only ones registering anything are the real disk devices; there are lots of loop logical volumes that aren't doing anything at all, if you look along the bottom there. Right, so that's that.
So that's Grafana configured. We did notice that there are no Docker stats being collected, other than some basic ones. I'm hoping this is just a cgroups v2 issue and that it would otherwise work. I need to test the agent out in Kubernetes, really, just to make sure that it is getting all the metrics that I would expect.
Yes, and configuring the agent: because it's set up with lots of different endpoints, if you want to configure everything (like process logs, the APM stuff, the AppSec stuff) you would expect these environment variable overrides to have some level of consistency. But there are some inconsistent bits, like the logs URL not allowing you to specify a protocol in the actual URL and having that set somewhere else.
A
Things
like
that
also
certain
things
that
aren't
consistent
in
terms
of
this
sort
of
schema
that
you
would
expect-
and
these
are
documented,
but
it
it
just
makes
setting
this
up
a
bit
more
awkward.
We
might
be
able
to
solve
that
by
providing
our
own
image,
for
instance,
some
items
of
interest
that
came
up,
there's
a
an
article
here
about
talking
about
observability
as
a
braid
of
data
instead
of
a
classic
three
pillars,
so
you're
weaving
them
together
with
contacts
that
are
coming
from
that
trace
information.
This is something that a lot of products are starting to offer now, and it's something that we'll definitely look to provide in the future. Also, one thing that keeps coming up is people asking why we aren't using the OpenTelemetry Collector, because it is becoming, I wouldn't say the standard, but it's getting a lot of traction and there are a lot of blogs about it. And I would say the OpenTelemetry Collector does have a Datadog exporter, so in theory, as long as we keep the data model in ClickHouse (or whatever time-series database we use) generic, it won't matter, because this exporter would still work. You could still use OpenTelemetry; you wouldn't need a Datadog agent installed, so that would be great. So, up next: I need to dig further into ClickHouse, and we've started doing that.
We've been looking at some logs and at some of the drivers, and there's a time-series database benchmarking tool that we'll look at a bit further. What I also want to do next week is create an APM architectural design to share in GitLab, so we can get some early feedback on it and start to look at setting up infrastructure, general design assumptions, and things like that.