Description
Concourse CI 102 - Taylor Silva & Scott Foerster, Pivotal
Concourse CI is a general-purpose workflow automation tool. This talk, a “102” level, is for Concourse users who want to understand more about the inner workings of Concourse. Taylor and Scott will provide an overview of how Concourse components work under the hood, from the worker scheduling logic, to volume and container management on each worker, and what metrics are emitted from Concourse. They will close out with a brief overview of what improvements the Concourse core team is currently working on.
For more info: https://www.cloudfoundry.org/
Scott: We've called our talk Concourse 102. The aim of this talk today is to perform a deeper dive into the underlying mechanics of how Concourse works, to peek under the hood if you will. It's our assumption that you already know what Concourse is: you can probably create pipelines with the fly CLI, and you've interacted with and are familiar with the web UI. You've probably been stuck in situations where you've needed to scale your workers, or even submitted issues or PRs to the Concourse repo. And if not, well, you're stuck with us for 30 minutes today, but we hope this talk is informative for everyone.

What we're going to discuss today will probably resonate most with engineers using or operating Concourse, or planning to use Concourse at scale, who will find themselves in situations where understanding the mechanics behind Concourse is essential.
So before we get going, we just wanted to run through a quick agenda before we dive into things. We're going to start by giving a brief overview of what Concourse is. Then we're going to spend the bulk of our time performing a comprehensive overview of the two main components of Concourse: the web node and the worker. Finally, we'll give some tips and best practices around monitoring Concourse. And I put a question mark under questions: we hope there's time for questions, but probably not, so we're going to loiter in the hallway after this.
Okay, so we'd like to preface this talk by letting you know that we don't have enough time to talk about everything. Plenty of details here have been left out for the sake of time and to keep things simple. Everything is important, but, you know, we had to prioritize. I'd also like to mention that, although a lot of the concepts we're talking about today can be applied to earlier versions of Concourse, our talk is going to be based around the latest OSS version, version 5.5.0.
So what is Concourse? If you go to the Concourse CI homepage, you'll probably see Concourse described as an open-source continuous thing-doer. While this description is completely valid, it may be a wee bit of an oversimplification. To most users, Concourse looks a lot like the image on the screen: a series of resources, tasks, and jobs, created via YAML files, that are sequenced together to create pipelines for automation. The components that make up pipelines are configured through the fly CLI and visualized through the web UI, as seen above.
Concourse consists of three main components: the web node, the worker, and the Postgres database. Given that Concourse is stateless in nature, you can spin up as many web nodes and workers as you want, because state is kept within the database. As I mentioned before, this talk is not going to be focused on "why are my tasks failing?" and "why are my pipelines red?" Those are issues specific to your pipeline. Today we want to talk about how your pipeline is being run by Concourse, and these are the components you'll need to understand.
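For reference, here is a minimal sketch of how those first two components are typically launched from the concourse binary (keys, hosts, and credentials are deliberately elided; check the docs for your version):

    concourse web \
      --postgres-host=... --postgres-user=... --postgres-password=... \
      --session-signing-key=... --tsa-host-key=... --tsa-authorized-keys=...

    concourse worker \
      --work-dir=/opt/concourse/worker \
      --tsa-host=...:2222 --tsa-public-key=... --tsa-worker-private-key=...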
So before we begin, I wanted to take a quick bird's-eye view of the web node and the worker, since they will take up the meat of this talk today. The web node is the brain of Concourse, and it consists of everything you see here in the diagram above, as well as the TSA. We'll be diving further into all of these concepts in the coming slides, but I just wanted to give a quick bird's-eye view first: the web node is the thing that ensures all your pipelines are running and cleans up the workers.
Let's also do a brief overview of what the worker is. The worker consists of Garden, Baggage Claim, Beacon, and a volume and container sweeper. At the highest level, the worker is where container orchestration is managed. So with all of that out of the way, and this preamble being finished, I'm going to hand things over to Taylor to give a more detailed walkthrough of these concepts. Thanks.
Taylor: So the first question we'd like to answer is: what happens when you set a pipeline? You're in your terminal, you run fly -t your-target set-pipeline, and what starts happening in Concourse? Well, when you first create your pipeline, it's actually paused, so at this point Concourse isn't going to commit any resources to your pipeline: no containers get created, no volumes are created. And once the pipeline is actually unpaused, this will never be the case again.
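As a quick sketch, the commands in question look something like this (the target and file names here are made up):

    fly -t my-target set-pipeline -p my-pipeline -c pipeline.yml
    fly -t my-target unpause-pipeline -p my-pipeline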
There will always be some artifacts lying around. But for now, you've just created your pipeline, set it for the first time, and it's paused. So let's unpause it and ask: what happens when you unpause your pipeline? Once your pipeline is unpaused, a routine called the pipeline syncer will notice your pipeline, and it's going to create another web routine; this is all happening in the web node. So the pipeline syncer sees your pipeline, sees it's unpaused, and it creates a pipeline routine. There's always one pipeline routine for each pipeline that exists in your Concourse.
You don't need to try and read it, but just looking at this, we can figure out how many resource configurations there are. A resource configuration is anything that falls under the resources and resource_types sections in your pipeline configuration. We can see here that we have four items under resources and a fifth item under resource_types, so our pipeline has five resource configurations.
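The slide's pipeline isn't reproduced here, but a config with five resource configurations is shaped roughly like this (all names and sources are illustrative):

    resource_types:        # one resource configuration
    - name: slack-notification
      type: docker-image
      source: {repository: cfcommunity/slack-notification-resource}

    resources:             # four more resource configurations
    - name: app
      type: git
      source: {uri: https://github.com/example/app.git}
    - name: golang-image
      type: docker-image
      source: {repository: golang}
    - name: every-morning
      type: time
      source: {interval: 24h}
    - name: notify
      type: slack-notification
      source: {url: ((slack-webhook-url))}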
So how are these check containers made? The scanner is going to create a check container by first creating volumes; the scanner, the scan routine, does this by talking to Baggage Claim. Each check container is going to consist of one resource type volume and one container volume. So we have two volumes right now: one resource type volume and one container volume.
The container volumes come from the worker's file system, so those will contain files like the CA certificates; that's what's going to be in the container volume. The resource type volumes come from two places. Some come bundled with the Concourse binary, so those are volumes based off of the base resource types that ship with Concourse, and others are downloaded, either from Docker Hub or from an external image registry.
Those are typically the volumes that would come from the resource_types section. All right, now back to our check containers. We have those two volumes made and we know where they came from, so now the scanner can tell Garden to create a container based off of those two volumes. The scanner talks to Garden, tells it to create that container, and we get our one check container.
The first reason is that resources are, by default, checked every minute, and Concourse doesn't want to waste time every minute going through this container creation process. So Concourse will create the check container once and leave it around for an hour; that way it can just continue to reuse it over the hour.
The second reason that a check container lasts only one hour is so that one worker doesn't get overloaded by having all these check containers persisted on it forever. The one-hour lifetime gives Concourse a chance, every so often, to rebalance the check containers across the workers based on whatever their current state may be. One last thing about check containers that's important to note is that check containers are essentially your minimum workload: when your pipelines are all unpaused, even if no job is running, the check containers themselves are always running.
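As an aside, if once a minute is more often than a resource needs, the interval can be tuned per resource in the pipeline config (values illustrative):

    resources:
    - name: app
      type: git
      source: {uri: https://github.com/example/app.git}
      check_every: 10m   # default is 1m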
So what happens next, now that we have these new versions? The scheduler starts up. Well, the scheduler was actually always running; the scheduler, again, is another routine that runs in the web node. But the scheduler had nothing to do previously, when there were no versions; it only has stuff to do once new versions have been discovered.
So let's say that a new version was found by our app resource. Our app resource is just checking a GitHub repo looking for new commits, and every time there's a new commit, that becomes a new version in Concourse. So we found a new version, and that means the scheduler can now queue up a build for our first test job there, since now there's a valid input for it. The scheduler does this by starting a build routine.
And again, this is a build routine running just on the web node. A build routine is what actually runs your job: it looks at your job plan and it executes each step in your job plan, and it does this by creating containers and volumes, similar to the scanner. The build routine is what monitors the entire run of whatever build it was told to run, logging all the output that the job is generating and putting it into the database. So that's its main job.
So we can see that we have a get step and we have a task step; these are the two steps in our build plan here, and the build routine will start the job by, of course, running the get step first. The build routine, similar to the scanner, is going to talk to Baggage Claim first and have it create a resource volume. The resource volume is going to be where the data from this get step is output.
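In the pipeline YAML, a build plan with exactly those two steps looks roughly like this (the job and task file names are assumptions for illustration):

    jobs:
    - name: test
      plan:
      - get: app
        trigger: true
      - task: unit-tests
        file: app/ci/test.yml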
All right, and with that we're almost done; we've run the job halfway. There are only two steps in this job, and the last step is the task step. So how is the task step executed? Every task step has its own configuration, and this is the configuration for our task step. So, of course, the build routine is going to have to go through this task configuration to know how to run it.
The goal of this specific task is to just run go test, so run some unit tests that we have inside our source code. The first thing the build routine is going to do is figure out what type of worker platform this task needs to run on, whether that's Linux, Windows, or Darwin. It'll find that worker and then it'll continue going through the task plan here.
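A task configuration along the lines of what's described here, a Linux task that runs go test, might look like this (the image and paths are assumptions, not the exact slide content):

    platform: linux
    image_resource:
      type: docker-image
      source: {repository: golang}
    inputs:
    - name: app          # satisfied by the earlier get step
    run:
      dir: app
      path: go
      args: ["test", "./..."]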
It does cache it, of course, but that means it's taking up disk space on your worker, and every so often the cache will expire or there will be a new image to download. Every time that happens, it means you have to download a large Docker image and save it on your workers. So to keep things speedy and lean, it's generally in your interest to try and keep those Docker images small.
So what does the worker look like now, after fetching that Docker image for this task? After downloading the image from Docker Hub, the worker will now have two resource volumes; it just made a second one based off of that Docker image. Docker images just become resource volumes on your worker. So now our task step has its container image, and the last thing it needs is its input volumes. These are volumes that existed from previous steps in your job.
Typically these are satisfied by get steps, and they can be satisfied by the outputs of other tasks as well. In this case, the only input we want is the app volume, and that came from the previous get step. So on our worker, that means the build routine can now talk to Garden and have it create a container using the Docker image volume and the app volume that contains our source code.
With that, the build routine has finished running the entire job plan, so that's the entire job. Now, if your job was successful, then of course the pipeline will continue running: this process of the scheduler checking what needs to be run and creating build routines simply continues, over and over again, as your entire pipeline runs. Another thing to note about successful jobs is that garbage collection will eventually delete the get and task containers that were created during this job. They don't get deleted immediately; they get deleted shortly after, it depends.
If you're tracking metrics on your cluster, you'll see, as jobs start up, the number of containers and volumes slowly increasing on your workers. It will probably peak for some time while those jobs run, depending on how long the containers need to be around before the task is complete, and then, as they get garbage collected, you should see the numbers going back down. Overall, this is the sign of a healthy Concourse cluster.
So why does Concourse keep the containers and volumes around? Concourse keeps the containers and volumes because it doesn't know when people will be able to debug them. The failure could have happened on a Friday night or Saturday morning, and you're probably not going to be checking your jobs or your pipelines on the weekend; at least we hope that you're not. So Concourse will just leave them around, so that come Monday morning you can come in, check, and see what happened and why your jobs failed.
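Those leftover containers are what makes fly intercept useful for debugging; something like the following (target and job names made up) drops you into a failed step's container, and the other two commands list what's currently on your workers:

    fly -t my-target intercept -j my-pipeline/test
    fly -t my-target containers
    fly -t my-target volumes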
All right. So with this, we're going to leave our example pipeline here, with the one job passed, and we're going to talk about some of the remaining components in the web node. Those last few components are the build tracker, the build reaper, and the garbage collector. First up is the build tracker.
So what does the build tracker do? The build tracker just starts build routines, just like the scheduler. So why do we need the build tracker, then, if it's doing exactly what the scheduler does? The scheduler is already creating build routines; the reason we have the build tracker is so it can save any orphaned builds.
Now, what's an orphaned build? Let's say the scheduler started a build routine, as it normally does, and it's running our test job, as you can see there. But then something happens: there's a networking issue, or the entire web node goes down, the underlying VM is just gone for some reason. Basically, if something happens where the build routine that was tracking that test job loses connection with that worker, this results in an orphaned build, essentially because at this point the build routine that was running would exit.
The build tracker is able to start a new build routine, and it does this by looking in the database for any orphaned builds, because, again, Concourse saves all state in the database. So the build tracker will check the database for any orphaned builds, and when it finds one, it'll simply start a new build routine for that job. And just like the build tracker, the build routine can look in the database to find out where the job left off, and then everything just continues as normal.
So the build reaper's job is to delete logs that come from builds, and by default Concourse actually doesn't delete any build logs. We don't want to delete something that you might want to keep around, so the reaper does nothing; if you want it to do something, you have to set a build log retention policy. When you're deploying Concourse you can say, globally: reap all my logs that are X days old, or after X number of builds, across all pipelines. So that can be set at the web node level.
Users can also have control over their build log retention policy: they can set it in their job configuration when they create their pipelines. Now, when should you set a build log retention policy? If you have a job that runs frequently, that's typically a good candidate for something you'll want to have its build logs reaped every so often. For example, we have one specific job in a pipeline that generates about 600 megabytes of log output, and that all gets saved in the Concourse database.
That's a lot of data to just be saving in the Concourse database itself, because it'll take up a lot of disk space there. So that's one job where we actually only want to keep the most recent builds. Anything that either generates a lot of output frequently, or even something you run just once that generates a lot of data in that one shot: those are both very good reasons to set a build log retention policy.
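As a sketch of both levels (the flag and key names below are per the Concourse docs for your version, and the values are illustrative):

    # Globally, via the web node's environment:
    CONCOURSE_DEFAULT_BUILD_LOGS_TO_RETAIN=50
    CONCOURSE_MAX_BUILD_LOGS_TO_RETAIN=500

    # Per job, in the pipeline YAML:
    jobs:
    - name: noisy-job
      build_log_retention:
        days: 2
        builds: 20
      plan:
      - get: app   # and so on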
So, if you don't want to save all these 600-megabyte logs in your Concourse database, you can instead have Concourse just stream those logs to some syslog endpoint, and you can save them in some other persistent data location that makes much more sense for your logs. And of course, once the build reaper has deleted the logs, this is what you'll end up seeing if you go to a really old build that you don't have the logs for. This is one of the jobs that I was talking about that runs really frequently.
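Roughly, the syslog drain is configured on the web node like this (flag names per your version's docs; the endpoint is made up):

    CONCOURSE_SYSLOG_ADDRESS=logs.example.com:514
    CONCOURSE_SYSLOG_TRANSPORT=tcp
    CONCOURSE_SYSLOG_DRAIN_INTERVAL=30s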
All right, and the last component that we'll talk about is the garbage collector, and garbage collection is a two-step process. The first part happens on the web node. The web node's job in this process is to figure out which containers and volumes can be deleted, and it does this based on this concept of ownership: every container and volume has owners. And who are these owners, who can own a container and volume? Ownership can come from running builds, failed jobs, resource caches, and check containers.
Those are essentially the main sources of ownership for containers and volumes, and generally ownership is released as things finish. So if your builds are successful, that's when ownership will be released on any containers and volumes that that build created. If a job fails, it retains that ownership. It's the same thing with check containers: check containers have that default one-hour minimum lifetime, and once that one hour is up, ownership is simply released. That's how the web node knows to list those containers as ready to be deleted.
This information gets saved in the database and is then used by the garbage collector to figure out what can be deleted. The garbage collector on the web node will read the list of containers and volumes to delete back from the database and serve it as a response to that initial request from the worker. The worker will then delete the containers and volumes, and that's the garbage collection process. I'm going to hand it back to Scott now, who will wrap things up.
Scott: Before we get out of here, I just wanted to briefly touch on monitoring your Concourse, as it's very important. We strongly suggest setting up monitoring around your Concourse deployments, especially as you scale. In our experience troubleshooting and operating large Concourse deployments, we've seen that as your Concourse grows and you begin to add multiple teams, workers, ATCs, etc., it's no longer good enough to judge overall Concourse health by whether your pipeline is red or green.
Selfishly, it also helps us to assist users and customers in troubleshooting when they have relevant metrics present. Within Pivotal, we manage two large-scale Concourse-as-a-service deployments for internal Pivots; the deployments are called Wings and Hush House. We have monitoring set up on both deployments. Now, we're not saying what we currently measure is perfect, but it should give a pretty good overview of what you could set up for monitoring.
For Wings, metrics are consumed via InfluxDB and displayed on a Grafana dashboard. For Hush House, metrics are consumed via Prometheus and displayed on a Grafana dashboard, and we have SLIs and SLOs displayed on monitors through Datadog. We also have alerts and notifications set up to send to Slack when certain thresholds are crossed. There's no particular reason why we use InfluxDB in one deployment and Prometheus in the other; we just think it gives us good insight into the most common ways users and customers are emitting metrics.
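For example, to have the web node emit Prometheus metrics, you'd set something like the following (flag names per the docs for your version; the bind address and port are illustrative):

    CONCOURSE_PROMETHEUS_BIND_IP=0.0.0.0
    CONCOURSE_PROMETHEUS_BIND_PORT=9391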
Our metrics dashboard is actually publicly available, and when you get the link to the slides, you should be able to access it there. Lots of potential disasters can be preemptively solved by having a proactive metrics approach. This is something the Concourse team highly suggests investing in as you scale and begin to understand more of the under-the-hood mechanics that we've mentioned here today. So, just to close up: thanks so much for coming out.