Description
Weekly incubation engineering video for APM.
https://about.gitlab.com/handbook/engineering/incubation/monitor-apm/
History issue tracker link - https://gitlab.com/gitlab-org/incubation-engineering/apm/apm/-/issues/2
Hello, Joe Shaw here, engineer in the Incubation department for APM.
Another weekly update video. It's been a mixed week: looking at historic information in GitLab, talking to people, and doing a bit of research around potential time-series databases, or general column-storage databases, that we might be looking at using to store all this performance data. I've also been looking specifically at the Datadog agent: how we can set that up, use it as an agent, and create our own version of the Datadog API, a simplified version that we can use in production.
This first page I'm showing you here is one of the issues I've got open. It will stay open and I'll keep adding to it. It's a historic issue, project, and epic tracker for the parts of the organization that I want to keep a tab on. A lot of this stuff is probably over a year old now.
So, whilst I am talking to people in the business, a lot of this information is too old for people to recall easily; they'll only remember the salient points, so it's useful for me to dig into the details. I've managed to search out and track down some epics and issues that are quite pertinent to what we're doing here, an open MR in a runbook, and some more internal docs that I don't feel belong in the general reading list in the handbook.
Still, it's quite useful; there are one or two that are specific to research, for example on Grafana usability and things like that. One example of these issues, an epic actually, was for dogfooding the monitoring. You can see a discussion start where they're talking about it and looking at the actual dashboards we were looking to dogfood into GitLab. This was a year ago; you can see these are all comments from early 2020, and it eventually gets closed off.
But you can see the sort of discussion that was going on there, and likewise, jumping off from that, you can see things like existing issues that have been created around things like Datadog.
A
So
looking
at
agent
metric
collection
here
and
digging
down
into
this
there's
some
interesting
stuff,
I
don't
think
anything
ever
got
fully
completed
here
or
even
started,
but
it's
all
interesting
information
anyway,
and
a
lot
of
this
was
focused
on
being
able
to
monitor
ci
jobs,
for
example
within
different
environments.
So
if
you're
running
your
own
gitlab
runner
for
ci,
how
you
monitor
make
sure
that's
still
performing
well,
and
you
can
do
all
that
with
prometheus
right
now.
But if you've already got a Datadog integration, then you might want to be able to do that directly, so there was some work going on there, and I do believe that Datadog integration for the runner was completed recently, in one of the recent releases of GitLab. So that's great. Before I move on to talk a little bit about why we're looking at the Datadog agent specifically, I'll obviously get people asking me: why not jump straight to using the OpenTelemetry project, which is a great community project?
It's been brought together and stands on a number of other projects that have folded into it; for example, I think OpenMetrics was one that's been folded in, and I'm not sure about OpenTracing. They're working on things like how you standardize the metric data structures, trace data structures, and log data structures, because they all tend to be very similar, and then how you can create an agent and things like that on top. You can see this in the actual documentation for OpenTelemetry.
Where was I going to look? You can see that the kind of reference architecture there would be very similar to something that we would look at using: you've got an OpenTelemetry collector with data being pushed into it, a very simple single-agent architecture.
For example, if you look at the Go language instrumentation, it's at a very, very early stage: release candidates, alpha releases, and no logging implemented there at all.
We want people to be able to get going very, very quickly with this, whereas I've seen, for example, that GitHub started to use OpenTelemetry with Ruby, and based on a recent blog post it looks like they've had to put a lot of work in to get that going. A business of that size and technical capability can do that, whereas a lot of small to medium-sized businesses simply can't.
One of the things we want to keep in mind is trying to pick a solution that might be able to solve all areas of observability, as opposed to just solving for time-series data. By time series I usually just mean metric data, where you've got a floating-point number, tags, and timestamps.
There are a lot of databases that support that. We're also looking at log data, which can be full-text-searchable data, and trace data, which, whilst it's all time-span and timestamp indexed, is a very different data structure. So we're looking to have that level of flexibility.
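To make the distinction concrete, here's a rough sketch of the three data shapes just described. The type and field names are mine, purely for illustration, not from any spec or library:

```python
from dataclasses import dataclass, field
from typing import Optional

# All three signals are timestamp-indexed, but the payloads differ,
# which is why a pure time-series store doesn't automatically fit all of them.

@dataclass
class MetricPoint:
    name: str            # e.g. "system.disk.in_use"
    timestamp: float     # unix seconds
    value: float         # a single floating-point sample
    tags: dict = field(default_factory=dict)

@dataclass
class LogEntry:
    timestamp: float
    message: str         # free text, wants full-text search
    tags: dict = field(default_factory=dict)

@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # spans form a tree, unlike metrics and logs
    start: float
    duration: float           # a time *span*, not a point
    tags: dict = field(default_factory=dict)
```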
Now, I stumbled upon QuestDB here via this blog post looking at benchmarking, and there's a benchmarking suite, the Time Series Benchmark Suite, which I think was created and is run as open source by Timescale, another time-series database built on top of Postgres. This was an interesting one, because in the sort of throughput tests they're doing you can see ClickHouse comes out as a very strong contender versus things like InfluxDB and TimescaleDB, which are very good time-series databases, don't get me wrong.
Timescale, being based on Postgres, has some limitations there in terms of scalability, and InfluxDB is another open-core model: it scales vertically really, really well on single nodes, but when you want to do horizontal scaling and replication you then have to get an enterprise licence. ClickHouse, theoretically (I haven't tested this yet), should be able to scale horizontally.
I had also seen an issue when I was looking to evaluate QuestDB: they haven't actually implemented horizontal, multi-server replication yet. But it is an interesting one, and it looks like a very performant time-series database; it's not clear yet whether it would be useful for the other sorts of metrics as well, so I'll leave that for the time being.
Here's another article, looking at various time-series workloads. One thing that comes out of this, which we need to do some more research around with ClickHouse in particular, is simple queries: ClickHouse actually comes out as a lot less performant than a lot of other solutions when querying small amounts of data; you can see it's one of the lowest contenders there. But when you then switch to very, very heavy queries, because of the nature of the database, it starts to perform much, much better.
So we need to get an idea of what sort of access patterns we're going to get from users building dashboards and visualizing the data. If they are of this sort of nature, then it would be quite useful; and from my instinct, I think it's going to be the case that we're going to have large volumes of writes, which ClickHouse is very good for and can ingest with very high throughput, and a lower volume of complex reads, where there's selection over lots of the columns, merging tags together, and things like that. So my hope at the moment is that this would be a good fit. Another blog that's interesting is Uber's logging blog entry, where they talk about migrating from Elasticsearch to ClickHouse as a solution for their logs.
You can see the architecture diagram there, which is very similar to the sort of architecture we might be looking at for the APM solution: Kafka acting as a set of brokers to handle back pressure and perform message passing, those logs being stored in ClickHouse, and views built on top of that. They go into a lot of detail and talk about the schema they're using and some of the optimizations they then put in place to be able to provide the data at the scale of Uber.
That's a very good sign that they're able to use it for that method. It means that if we can just use ClickHouse for all these different types of observability data in one place, then from an infrastructure point of view it's much less of a headache than having loads of different databases.
So, very quickly: here's a Datadog agent in a docker-compose file that I'm setting up, and I'm using MockServer, which is another project, to essentially mock out HTTP requests: it captures them and allows you to plug in your own responses and things like that. It's good for just being able to analyze and trace HTTP requests, because, as we'll see, the Datadog agent uses a bunch of endpoints, some of which are documented, but some aren't, and we need to look at what it's doing there.
As you can see, we're setting that up with various URL overrides, which is a bit awkward: you've got this main Datadog URL, but we also have to override it separately for different areas that we haven't turned on, otherwise the default is just to go to datadoghq.com if you don't override it. That's a little bit frustrating, but it's not a big deal. You can set these up in config, or as command-line args and environment variables.
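For reference, a compose file along the lines of what's described might look like this. This is my own hypothetical sketch, not the file from the video: the service names and ports are made up, `DD_DD_URL` is the agent's main endpoint override as I understand it, and the per-product override variables aren't shown (check the agent docs for the full list):

```yaml
version: "3"
services:
  mockserver:
    image: mockserver/mockserver
    ports:
      - "1080:1080"
  datadog-agent:
    image: datadog/agent
    environment:
      DD_API_KEY: "dummy-key"              # the agent insists on having one
      DD_DD_URL: "http://mockserver:1080"  # main intake override
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /proc:/host/proc:ro
      - /sys/fs/cgroup:/host/sys/fs/cgroup:ro
```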
We pass it the Docker socket (the Unix socket) so that it can read Docker state information; it automatically detects that and reads it. We also pass the host's /proc filesystem, which is the standard Linux way of interrogating performance counters for the actual hardware, and the cgroup information as well, which relates to the way containers run and organize themselves.
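As an aside on why /proc is needed: CPU counters, for example, live in /proc/stat as cumulative jiffies, and utilisation has to be derived from deltas between reads. A small illustrative sketch (not the agent's actual code):

```python
def parse_cpu_line(line: str) -> dict:
    """Parse an aggregate 'cpu ...' line from /proc/stat into named counters."""
    fields = line.split()
    names = ["user", "nice", "system", "idle", "iowait",
             "irq", "softirq", "steal"]
    return dict(zip(names, (int(v) for v in fields[1:1 + len(names)])))

def busy_fraction(prev: dict, curr: dict) -> float:
    """CPU utilisation between two samples: 1 - (idle delta / total delta)."""
    idle = (curr["idle"] + curr["iowait"]) - (prev["idle"] + prev["iowait"])
    total = sum(curr.values()) - sum(prev.values())
    return 1.0 - idle / total
```

Reading the same file twice a few seconds apart and feeding both samples through `busy_fraction` gives the kind of percentage gauge the agent reports.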
So if we just stand that up, you can see those setting up there, and we should see logs from the agent starting to stream out. If we jump into the MockServer dashboard, we start to see requests coming in. We've got an initial validation request there to validate our API key. All we're doing with all these responses is returning a 200 with a body of "OK", and the agent seems to just not care, in fact.
Initially it was all 404s and it still just carried on, so it doesn't seem to mind much what you send back; it just keeps sending metrics at the endpoint. There are these intake endpoints that I haven't had a chance to break down yet, which are not documented. So there's lots of stuff documented for the Datadog APIs, which you can see here, but that's not documented, and there are these check_run ones that I'm not sure about either.
The main thing we're interested in at the moment is the series posts here. I think it's set to send every 10 to 15 seconds by default, and it's sort of a batch request: everything builds up in memory and then it does a flush out to this series endpoint. Let's wait for another one.
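The buffer-and-flush pattern just described can be sketched as follows. This is an illustration of the behaviour observed, not the agent's code; the 10-15 second figure is what the capture suggests, not a documented value:

```python
import time

class SeriesBuffer:
    """Accumulate metric points in memory; flush them as one batched payload."""

    def __init__(self, flush_interval: float, send):
        self.flush_interval = flush_interval
        self.send = send          # callable that receives the batched payload
        self.points = []
        self.last_flush = time.monotonic()

    def add(self, metric: str, value: float, tags=None):
        self.points.append({
            "metric": metric,
            "points": [[int(time.time()), value]],
            "tags": tags or [],
        })
        self.maybe_flush()

    def maybe_flush(self):
        if time.monotonic() - self.last_flush >= self.flush_interval:
            self.flush()

    def flush(self):
        if self.points:
            self.send({"series": self.points})  # one batched request
            self.points = []
        self.last_flush = time.monotonic()
```

In use, `send` would be an HTTP POST to the series endpoint; here it can be any callable, which also makes the batching easy to test.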
These are tuples it sends over: you've got a Unix timestamp and a value, which in this case would be the percentage. You've got some tags formatted here, and a device name, which is the name of one of the disks on my machine. It says it's a gauge metric type, for this particular device. The interval is not relevant here, because it's a gauge; that would matter if you've got a calculated rate or something like that. And there's the source of the data. So you get these sorts of metrics coming in from the system.
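Putting those fields together, the payload shape seen in the capture can be reconstructed like this. Field names follow Datadog's v1 series API as far as I can tell; treat this as a sketch of what was observed, not a spec:

```python
def gauge_series(metric, timestamp, value, host, device=None, tags=None):
    """Build one series entry as the agent appears to send it."""
    return {
        "metric": metric,
        "points": [[timestamp, value]],  # tuples of (unix timestamp, value)
        "type": "gauge",
        "interval": None,   # only meaningful for rates, not gauges
        "host": host,
        "device": device,
        "tags": tags or [],
    }

payload = {"series": [
    gauge_series(
        metric="system.disk.in_use",
        timestamp=1614700000,
        value=0.42,                 # e.g. 42% of the disk in use
        host="my-laptop",           # hypothetical host name
        device="disk1",             # one of the disks on the machine
        tags=["device_name:disk1"],
    ),
]}
```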
So that's as expected. I haven't got to the point of being able to visualize those, unfortunately; I was hoping to get that done this week. That's one of the first things I'll be looking at next week: getting a basic backend with Grafana. It might not be ClickHouse, because it might be a bit complicated to set that up.
A
Initially,
it
might
be
simpler
time
series
database
and
then
we
can
just
get
click
house
set
up
straight
away,
I'll
get
the
something
like
grafana
set
up
straight
away
on
that
so
anyway,
there
we
go
I'll,
leave
it
there.