Description
Cloud Tech Thursday explores the full modern open source cloud stack, from hardware to serverless. Learn about new ideas, projects, and releases around Kubernetes, OpenStack, hybrid cloud enablement, and many other topics.
Twitch: https://red.ht/twitch
B: Good morning, good afternoon, good evening. Welcome to another episode of Cloud Tech Thursdays. I am Chris Short, producer, host, and showrunner extraordinaire for this thing we call Red Hat live streaming. I'm joined today by Amy, Josh, and a special guest, Leif Madsen, to talk about STF and how to monitor your OpenStack cluster. Amy, why don't you introduce yourself, talk a little bit more about the topic, and hand it off to the others, that kind of thing.
C: All righty. Hi, my name is Amy Marrich. I am the OpenStack community person here at Red Hat, and we're joined today, as already mentioned, by Josh Berkus, who is the Kubernetes community person. Our guest today is Leif Madsen, who works on STF, which is also known as InfraWatch upstream. Leif, do you want to go ahead and introduce yourself and the project?
D: Sure. My name's Leif Madsen, I'm the cloud service telemetry architect at Red Hat, and Service Telemetry Framework is a project that I've been working on for three-plus years. The idea is that we install a set of microservice applications on top of OpenShift and we monitor our infrastructure as a service, our OpenStack. So today I'll just be going through the architecture of STF, some links to where the open source project components are available for anyone to make use of, and then I'll go through some dashboards and some live environment stuff, and I'm happy to answer any questions.
D: Ta-da! All right, so I'll just place the links at the beginning here, and I can get those over to the show hosts afterwards as well. Basically, github.com/infrawatch is the upstream location for all the source code that I'm going to be going through today, from our overview, and the rendered documentation, which is also written in the upstream source, is at infrawatch.github.io for the documentation.
D: So just a quick overview: Service Telemetry Framework basically receives monitoring data from OpenStack or third-party nodes and is a central location for storage, viewing of dashboards, and alerting for your system. What we make use of is collectd for, sorry, not for storing, for collecting the metrics and events for the infrastructure components.
D: The OpenStack-aware metrics and events come from Ceilometer. We also support multiple clouds going into the same monitoring infrastructure.
D: We also provide availability monitoring, such as container statistics for CPU and memory, and API health checks for the various OpenStack API interfaces, so Glance, Neutron, things like that. We've integrated Ceph metrics with the collectd ceph plugin, so if you happen to be running Ceph within your OpenStack infrastructure, we can also collect that information using the ceph plugin. We can send SNMP traps using Alertmanager.
D: We make use of the Prometheus SNMP webhook implementation for that. We make use of various storage backends that are provided by OpenShift, and so we've tested with OCS and things like that, just to make sure that's all working. Our visualization is done with Grafana, and all of this is operator-driven using the Operator Lifecycle Manager within OpenShift.
D: So this is the high-level overview of the architecture. On the left here we have our actual OpenStack deployment, and we make use of collectd and Ceilometer for collecting the events and the metrics. We then make use of AMQ Interconnect, which is also Apache Qpid Dispatch Router, and that's an AMQP 1.0 protocol-based message bus. We use that without brokers or anything, in order to just stream the telemetry and the event information.
D: So that's our transport protocol coming across the back end. We then make use of the Smart Gateway, which is actually made of two components, the sg-core and the sg-bridge, which I can get into later. But basically that's our middleware that sits in between the data storage and the transport layers. It takes the information from the bus; for metrics, it provides a scrape endpoint for Prometheus to scrape so it can collect the metrics for all your nodes, and likewise for events.
E: Right, we're all curious here. So collectd has been around for a while, and, particularly based on my experience with it, although it may have evolved since then, it precedes having any sort of standards around what its messages look like. Why collectd? How has it been useful as sort of the central collection point?
D: Yeah, mostly collectd because the intent, when the framework was originally being developed, was to make use of something that was, one, small and fast.
D: So it didn't have a lot of overhead, and, two, to be able to run it on things like other network devices. collectd can actually be compiled and run on some routers, some switches, some, you know, infrastructure components, so that was the main reason. It also just had a lot of plugins that were good for infrastructure monitoring.
D: So it had a lot of the information that we required, and the messaging thing hasn't really been too much of a problem for us. When it comes in, we're basically collecting that data with our Smart Gateway anyway, and we're exposing that data via the plugin instance, by the type instance, and there are various labels and fields that we make use of, and that's actually been quite consistent across the plugins.
D: So we haven't really had any issues with having to do any crazy manipulation or anything like that. The other thing was the events: we make use of the VES format, and the VES format is like an encapsulated VNF event standard, or something like that.
D: I can't remember what VES stands for, unfortunately, but we're making use of that for our events, so it has another encapsulation layer on it as well, so that when the events come in from collectd they're also somewhat standardized.
E: You didn't consider trying to push this upstream, so that the collector was actually using standard formats?
D: So everything we're doing is from collectd upstream. So all the VES formats and everything.
All
right
so
so
we'll
just
kind
of
go
through.
I
only
have
a
handful
of
slides
and
then
we'll
just
get
into
kind
of
the
sun's
alive
stuff,
but
the
intent
here
is
to
provide
kind
of
a
near
real
time,
event
performance
system,
and
so
we
can
collect.
You
know
various
pieces
of
information
from
various
events
and
telemetry
systems
we're
making
use
of
collect
the
installometer
currently,
which
is
our
collection
layer,
the
distribution
or
the
transport
layer
is
making
use
of
amq.
So
that's
the
amqp1
protocol
inside
of
openstack.
D: The message bus in use there is RabbitMQ, and that's the AMQP 0.9 protocol format, so they're actually different formats but a similar idea: you transport the information across the bus and get that into the central location. The big thing that provides us is a push gateway model for telemetry for Prometheus. So instead of going and scraping all 500 or 600 nodes of your system, you actually just end up scraping the one Smart Gateway.
D: So all of the data is collected and sent across the bus, and then that's exposed as a single scrape endpoint on effectively the local network for Prometheus. We get everything collected and sent to the central location, and then we just have a single scrape endpoint, so there's no real need to do discovery or anything like that.
D: You just basically start sending the data over and it becomes exposed for Prometheus to scrape. Events are obviously a push model anyway, so we collect those and send them across the bus, just so we have a single transport, and then we write them into our event storage backend, which is Elasticsearch.
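The single-endpoint model described here means the Prometheus side needs only one static target instead of discovering hundreds of nodes. A rough sketch of what that scrape config could look like (the job name, service address, and port are hypothetical illustrations, not taken from STF's actual generated configuration):

```yaml
# Hypothetical Prometheus scrape config illustrating the push-gateway-style
# model: one Smart Gateway scrape endpoint instead of 500+ node targets.
scrape_configs:
  - job_name: smart-gateway
    scrape_interval: 10s          # matches the 10-second interval mentioned later
    static_configs:
      - targets:
          # the single scrape endpoint exposed by the Smart Gateway service
          - cloud1-metrics-smartgateway.service-telemetry.svc:8081
```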
D: So this is a little bit more of a blown-up view, the same kind of idea. Here we have various collectd plugins that we can make use of. A lot of the reason for collectd is that it also has a lot of NFV-specific things, so for telco backends, you know, overlay networks, things like that. We have a lot of information here that we can make use of from an OpenStack perspective.
D: We also make use of rsyslog, and we're in the process of getting our logs potentially across the bus. We're doing a bunch of load testing right now to determine whether that's feasible, and so we're testing out at, you know...
D: ...a hundred thousand, you know, four million logs, things like that, and making sure that we're having as close to 100 percent delivery of those messages as possible. That work is ongoing, so at some point logs may show up here also, in the single transport layer, and then, yeah, it just comes into the bus here.
D: The Smart Gateway is basically the middleware. From a third-party integration perspective, because of the way that we're doing the transport layer and using the message bus in a distributed manner, both the Smart Gateway and other systems can connect to that same bus. So the Smart Gateway can collect that data and store it for you, and other systems can also listen for that data and react to it. Part of the reason this is set up in this way is to allow for closed-loop remediation.
D: So what you could do is have a process living alongside your OpenStack system, listening to the local message bus, able to react to that and do something. Let's say an error showed up, or a warning or something like that, that says I need to go and restart a service, for example. That same service can listen and react to it without actually affecting the data storage, which can happen much further down the line.
D: You don't necessarily always want to be reacting to the information after it's been stored, because that can actually be quite a long time, right? That can be a long, long loop, which we call the northbound loop: going all the way up into the storage domain, sitting there for several cycles to determine that something is actually wrong, then reacting to it, sending an alert, and then going and acting on it. That action may be...
D: ...done by a human, but if you wanted, you could have a remediation system that could react to that. So we have these various loops within the system: a very closed loop here, or a very small loop; then a longer loop that can come up into the actual storage domain; and then an even longer loop, where everything is actually in the storage domain.
D: Yes, so if you're going to do closed-loop remediation, you definitely have to make sure that your system is able to understand that something happened, was then able to react to it and make a change, and then it should technically also report back into the system in order to clear the condition, to say that I have resolved it or it's being resolved. The system that we have set up also allows for multiple clouds. So you may have several different data centers, or you may have one data center...
D: ...that has various small clouds that are broken out, maybe on a tenant basis or just for specific features, or maybe you just have a system with...
D: ...various small clouds, or medium and large clouds, right? So, instead of having multiple different monitoring systems, you can centralize that. Again, we use the transport; when that transport comes in, we actually have various groups of Smart Gateways, one per cloud, and then that can all go into the same storage backend, or different storage backends if you want to configure it that way. Excuse me. So this is just the various components: STF is actually made up of a bunch of different components.
D: I've been talking about the storage domain, our middleware, our transport layers, and our collection layers. On the OpenStack side of things, as I mentioned, collectd and Ceilometer are the data collectors. We're making use of the Apache Qpid Dispatch Router for our transport layer; in that diagram, that's the AMQ Interconnect Operator here, and that AMQ Interconnect Operator is what manages the lifecycle of the AMQ workload.
D: We also make use of cert-manager for the creation of certificates. We use the Elasticsearch and Prometheus Operators in order to deploy and manage the lifecycle of the data storage components. We use the Grafana Operator for managing the actual Grafana deployment, and then we have the Smart Gateway Operator, which is what actually manages the deployment, the stand-up, and the configuration of the various Smart Gateway components. And then we have...
D: Finally, the Service Telemetry Operator. The Service Telemetry Operator is what I call an umbrella operator: it is the thing that you install, and when you want an STF instance, it goes out and creates various objects inside of OpenShift. Those objects are then reacted on by the different operators listed here, and each of those operators then goes and manages its components.
D: So STF just says, I need an Elasticsearch, or I need a Prometheus. It will request that, and then the operator that has all the operational knowledge of how to stand up a Prometheus, how to stand up an Elasticsearch, how to stand up a QDR system, takes over. STF just makes those requests, those operators actually manage the stand-up of all of that, and then, when all of those components are up and running, you now have an STF instance, basically.
D: Yes, so if you were running this in OKD or OpenShift, you would see these as the installed Operators. Once you've gone through the installation process following the documentation for STF, this is what you would see inside of the Operator Lifecycle Manager page.
D: Sorry for that, let's try that again. There we go. So if I get to the proper endpoint here: oc get servicetelemetry.
D: And obviously you get STF. Sorry, I think I just typed it wrong, so: oc get stf default. This is what you would actually create as part of an STF instance. You basically say, inside of alerting, you ask for an Alertmanager and you request its storage backend, and then you can have a receiver, like for SNMP traps and things like that.
D: Here are our backends that define our events backend, so we basically say Elasticsearch is enabled and we're going to use persistent storage. For our metrics, we're going to say we're going to use Prometheus; that's been enabled with a scrape interval of 10 seconds, and then we've again set our storage backend and how long we're going to retain the data. And then there are the various clouds that we're going to monitor; these are basically our collectors.
D: Within the clouds setup, we're saying we have events and metrics collectors, what we're going to listen for, which is the subscription address inside of the Qpid Dispatch Router, and what the collector type is for that Smart Gateway. We basically define that, and we can have a list of various clouds: if I had a multi-cloud setup, I would have another entry down here.
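Putting the pieces just described together, a ServiceTelemetry object along these lines is what the demo walks through. This is a rough sketch reconstructed from the narration; the field names follow the walkthrough but may differ slightly from the released STF CRD, and the cloud name and subscription addresses are illustrative:

```yaml
# Sketch of a ServiceTelemetry object as described in the demo.
apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
metadata:
  name: default
  namespace: service-telemetry
spec:
  alerting:
    alertmanager:
      storage:
        strategy: persistent        # request a storage backend for Alertmanager
  backends:
    events:
      elasticsearch:
        enabled: true               # events backend
    metrics:
      prometheus:
        enabled: true
        scrapeInterval: 10s         # the 10-second scrape interval mentioned above
        storage:
          strategy: persistent
          retention: 24h            # how long to retain the data
  clouds:
    - name: cloud1                  # one group of Smart Gateways per cloud
      metrics:
        collectors:
          - collectorType: collectd
            subscriptionAddress: collectd/telemetry
      events:
        collectors:
          - collectorType: ceilometer
            subscriptionAddress: anycast/ceilometer/event.sample
    # a second cloud entry here would add another group of Smart Gateways
```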
D: So I've defined this as cop04, which is just a short form for the cloud configuration. Again, I can even have overrides. This is an example of a Grafana manifest, and I'm doing an override because I needed to change the base image. So the different objects that STF can manage, I can actually override if I need to. Again, Grafana is enabled, and this just sets some information for it, whether I have high availability, the transports, and things like that.
D: So: oc get prometheus default. You would basically have another object here; this is the object that the Prometheus Operator would react to. Again, it just took the information that I had in my ServiceTelemetry, saying I wanted a Prometheus storage backend, created this Prometheus object, which the Prometheus Operator reacted on, and that resulted in a Prometheus instance for me, and then it created the different storage backends and things like that. So here's my...
D: So this is a picture of the layout of the routers. These routers are the Qpid Dispatch Routers that are collecting the data. You can see this is basically STF, this is what's running on OKD, and then each of these routers is running on each of the nodes inside of my OpenStack environment. Each of these routers runs locally, and then the local clients connect to them.
D: The way Ceilometer works on OpenStack is that there are compute agents that run on the non-controller endpoints, and that information is sent across the RabbitMQ bus to the Ceilometer agents. The Ceilometer agents, via oslo.messaging, are then able to send that information across the AMQP 1.0 connection, which is the Qpid Dispatch Router that we're looking at here, and that information is then centralized back to STF.
D: This is just a picture of one of the dashboards, and I'll go through the various dashboards I've been working on this week. This is the result of all of that information. So I've got my various APIs inside of OpenStack that I'm checking, and I've just created a dashboard here that has various bits of information.
D: So I have these pre-recorded demos, but I'm happy to take any questions, and then I can get into any live environment stuff, if that's interesting to you.
D: So, worst-case scenario, you basically monitor to say, I'm not getting any monitoring data, and an alert is sent saying your monitoring system is not getting any information from the cloud that you're monitoring, right? That could be that the networking went out, services could be overwhelmed, the memory could have run out in the system. But ideally I would have enough information leading up to that event that, even when I got the alarm saying, well, your monitoring system's offline, I could ask: why is it offline?
D: All of the systems are actually running their own collectors, and they're running their own Qpid Dispatch Routers for the transport. So if you have networking, you will still get data; in the worst-case scenario, you just don't get any data, and that is definitely something you'd want to alarm on. What I can also do is make use of things in Prometheus, for example, to do predictive...
D: ...monitoring. So I can say, I've been watching memory going up, or I've been watching network utilization going up over the last hour, and as predicted, three hours from now, at the current rate, you will have overwhelmed the system. Ideally you react to things faster that way, without having to actually have the worst-case scenario where the system's offline and then you have to react to it that way.
C: And it's kind of an advantage versus the OpenStack Telemetry project, in that the telemetry itself, except for Ceilometer, isn't running on OpenStack, so your monitoring is now on another cluster. Therefore it's not actually affecting the system itself, which I think is a great improvement over what we had before.
D: Yeah, we actually don't recommend that. Our documentation will say you should not run your monitoring platform on top of the thing you're monitoring, because again, let's say I have that catastrophic network outage. Well, now the thing that's running on top of it can't actually notify me that it's out, right? The only other way to do that is to then have another monitoring system...
D: ...that's checking to see if your monitoring system's up and running, and then alerts you when the monitoring system goes away, which is kind of silly, right? But it's not totally impossible if you have something really small running somewhere. Ideally, though, I believe a lot in having an infrastructure cluster, where you deploy a very small cluster that's specifically for running things like monitoring. Maybe you run your undercloud, you run your OpenShift...
D: ...installer, you run your ACM, whatever systems you happen to want to be running to actually manage and deploy your clouds; those, I believe, should be in a separate cluster anyway. So that's the idea here: you don't run your monitoring on top of the thing you're monitoring, in an ideal world anyway, yeah.
D: Okay, so, I don't know, let's look at some dashboards, because dashboards are cool.
D: It's pretty straightforward. What I like to do is go to Prometheus first, and I will go and find something that I want to monitor, right? So I can click through this list, you can see this list here.
D: And I can execute that. That'll, you know, show me the amount of free memory across my systems. So then maybe what I would do is take this, go into my dashboards, and go Home, if I can get back to the main screen here.
D: Obviously I can set this name, set this panel, whatever the case may be, but that's pretty much it. Then usually what I end up doing, once I've done that, is I will save the dashboard, then I will export it. That'll save it to a file that gives me a JSON document, and then what I end up doing is: oc get grafanadashboards.
D: And so these are all the dashboards that I've created. The nice thing is that when you load these in, they're managed for you as objects inside of OpenShift, so I don't have to manually import them every time. If I restart Grafana, if it doesn't update or whatever, these dashboards will all be automatically loaded back in. So: oc edit. Let's just look at the Grafana dashboard.
D: 1.3, and so I just have to wrap it in this basic header: which API it is, it's the integreatly API; what kind is it, it's a GrafanaDashboard; a little bit of metadata, and this stuff's actually all created for you, so you wouldn't even have this stuff in here. I just name it, say which namespace it's part of, and basically that would be it, and then I'd have the spec. I say that it's a JSON blob; this bar just means multi-line.
D: So everything after that is indented, and then this is literally exactly what I just exported out of that file. So if I quit out of that, oh, I might have made modifications, so then I would do: oc create -f, you know, new-dashboard.yaml, for example, and then that would create this dashboard for me.
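The wrapper header described above could look roughly like this. The names here are illustrative (the dashboard name is made up, and the JSON body is a placeholder for whatever Grafana exported), but the apiVersion/kind follow the integreatly API the demo mentions:

```yaml
# Sketch of a GrafanaDashboard object: a small header wrapping the
# exported dashboard JSON as a multi-line blob under spec.json.
apiVersion: integreatly.org/v1alpha1
kind: GrafanaDashboard
metadata:
  name: my-dashboard               # illustrative name
  namespace: service-telemetry
spec:
  json: |
    {
      "title": "My Dashboard",
      "panels": []
    }
```

Loading a file like this with oc create -f has the Grafana Operator import the dashboard, and deleting and recreating the object re-imports it, as the demo goes on to show.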
D: So if I go to github.com/infrawatch/dashboards, this is actually where our dashboards live in GitHub. So it would be oc create -f; these are the dashboards here, so oc create -f with the STF dashboard, for example. Once I do that, it would automatically be created for me inside of the dashboard. In fact, if I'm going to be really brave here: oc delete -f...
D: ...and then do a refresh here. This is the one that I created, but you can see my STF dashboard is no longer here. But if I go back to my console, oc create -f with the STF dashboard, and then I refresh this page, there's my STF view that wasn't in there before, and now there's my dashboard, and it's making use of all the data that I've already previously collected. You can see all the information about your STF backend.
D: So that's part of the reason I make use of OKD for this: it's really easy to manage these components. Instead of me going and having to write a whole bunch of stuff to manage and deploy all this for me after the fact, I make use of the operator model to do that. I can show you a little bit of how that works.
D: So, infrawatch/service-telemetry-operator. This is just Ansible in here. There's a lot of boilerplate for creating the actual operator itself, but ultimately it's just Ansible, and part of this is all of these components, or these playbooks, that I've created. So there's one for, you know, the Alertmanager, the certificates, the clouds, the Elasticsearch, the Grafana, things like that. And all that's doing, in fact, I will load it in something so we have some colors.
D: It just looks up a template, sets some defaults, and then it creates an instance of Grafana using the Kubernetes module inside of Ansible. That takes the object inside of that template, loads it into OKD, and then the Grafana Operator reacts to that and results in the creation of the Grafana instance. And then there are other things, anything else it needs, like looking up data sources; what this is doing is also creating the data sources.
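The pattern just described, a template lookup fed into Ansible's Kubernetes module, might look roughly like this. This is a sketch, not the actual operator source; the task and template names are hypothetical:

```yaml
# Hypothetical Ansible task from an operator role: render a Grafana
# object from a Jinja2 template and apply it to the cluster, leaving
# the Grafana Operator to react and create the actual instance.
- name: Create Grafana instance from template
  k8s:
    state: present
    definition: "{{ lookup('template', 'grafana.yaml.j2') | from_yaml }}"
```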
D: So: oc get grafanadatasources, I think, is the object name. There are the default data sources; default-datasources, -o yaml. That creates the data sources inside of my Grafana instance, which are defined here under datasources, so the Elasticsearch Ceilometer one, the Elasticsearch collectd one, the STF Prometheus one. These are all created for me as part of an STF deployment; I don't create any of this. This all results automatically just by enabling dashboards inside of the ServiceTelemetry object.
D: So anything you want to add, you just add to the Service Telemetry Operator, and then that can go off and work with other operators to actually deploy the components that you might need or want. Again, we have overrides that you can also pass into the ServiceTelemetry object. Like that Grafana manifest, for example: I did an override of Grafana, but I still had access to my data sources that were automatically generated for me and created as part of the Service Telemetry deployment.
D: Our documentation is even all auto-generated as well. Any time that I make a change to this infrawatch documentation, so this is our source in AsciiDoc, any time that's changed and it merges into the main branch here, this will actually update and you will get changes into this documentation.
D: So all of that is auto-generated, and you can see that for our upstream here: for the open source deployments, instead of OpenShift you use OKD, there are suggestions of different backends you can use for testing, things like that, and then it just goes through how to create the objects.
D: So when I deploy an STF, I literally just do this: copy and paste that, copy and paste that, copy and paste that, just keep going, and then I'm basically done. Once I get to the end of this, I will have everything that you just saw there, in terms of the operators that are deployed, the Grafana instance that exists, all of that. Then all you have to do is add your alerts; there's a file that we provide with these alerts that you can use.
E: Yeah, I can see putting that into a workflow. One of the other things I want to ask you a question about: a couple of times, when we were talking about things like degraded mode and other stuff, you mentioned predictions.
E: So is there any kind of a predictions feature in STF to say, hey, you're going to run out of memory in your cluster at this point?
D: Yeah, so that's actually part of Prometheus. There are a lot of different functions that you can make use of in Prometheus; this would be like predict_linear. I'm not going to try and create something on the fly, but you basically make use of these various things inside of Prometheus. There are lots of functions: there's just summing, so like adding things up; predict_linear; and then just various things when you write your alerts.
D: You write alerts that say, when something reaches this threshold, send a warning, and when you reach a further threshold, send something like a critical notification. So you can get these different levels when you send alerts, to say this is just a warning, we're getting up to the critical area, and then, when you surpass that critical area, you get your alarms that say, okay, you really need to react to something right now.
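The tiered predict_linear approach described here can be sketched as a PrometheusRule, which is also the object type shown later in the demo. This is an illustration, not a rule shipped with STF; the metric name, thresholds, and namespace are hypothetical examples:

```yaml
# Illustrative PrometheusRule combining predict_linear with tiered severity:
# a warning when extrapolation says memory will run out, a critical alert
# when a nearer threshold is actually crossed.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: memory-exhaustion-rules
  namespace: service-telemetry
spec:
  groups:
    - name: predictive
      rules:
        - alert: MemoryExhaustionPredicted
          # extrapolate the last hour of free memory three hours ahead;
          # fire if the projected value drops below zero
          expr: predict_linear(collectd_memory{memory="free"}[1h], 3 * 3600) < 0
          for: 10m
          labels:
            severity: warning
        - alert: MemoryCriticallyLow
          # the nearer threshold sends a critical notification
          expr: collectd_memory{memory="free"} < 1e9
          for: 10m
          labels:
            severity: critical
```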
D: These ones just happen to show up as an example, but it's just telling us that our virtual machine is still active, basically, inside of this project. If I change to a different OpenStack project, this one doesn't have any VMs in it, this one has two, and we can even see that down here. So for these projects, here are the instances that are living inside of them.
D: Oh, so we're just making use of the Alertmanager that comes from Prometheus, basically. You just load your alert rules in, and I can actually show you that really quick too.
D: Yeah, so, rules, here we go. So here are the rules, the OpenStack rules. This is just loaded into the console: oc get... I can't remember what it's called; I think it's prometheusrules.
D: There we go. So yeah, this is the file that we created, the OpenStack rules, and these are the expressions that we've created, plus any alarms or alerts that you want to create. So if we sit in this position for 10 minutes, then the label is basically severity: warning.
D: I believe the infrastructure node view shows if there are any recent alerts or whatever. These need to be tuned for the environment, obviously; the fact that I'm seeing a whole bunch of current alerts and recent alerts just means that it's flapping, because those queries are too aggressive for this particular environment. This is a demo environment, so it's always heavily overloaded, but that shows that the alarms can show up here. So that's what results in the alarms; they can show up on the dashboard.
D: If you make use of the SNMP trap functionality, then we have a little bit of middleware that sits in and listens for the webhooks being sent, as a receiver, from Alertmanager, and then it can convert that and send it to a system that can accept SNMP traps. But otherwise you just set up your Alertmanager just like you would set up Alertmanager for any other system. STF is just making use of those existing components; it's not doing anything magical.
E: Is it possible to plug third-party services into this somewhere? Like, for example, if somebody wants to use PagerDuty for alert management?
D: So there are only a few things that Alertmanager supports natively, and PagerDuty happens to be one of them. Primarily, if you want to interact with different types of third-party systems for sending alerts or warnings or whatever the case may be, generally you consume a webhook and then you convert and send it to whatever endpoint it is you want. But some of the various built-in ones are email, Slack, a few different ones, and PagerDuty happens to be one of them as well.
D: No, exactly, so that's actually part of Alertmanager, and that's what Alertmanager will do when it receives one. I don't actually have it working right now; I didn't set it up quite right when I was deploying this, because it's actually disabled for some reason, but this is the route. You set the route here, and then you can group by various jobs.
D: You can determine how long you're supposed to wait, how often you're checking, things like that, and then you would have a receiver. Now mine's set to null, because I don't have any receiver set up, but you basically set a receiver here that may be, you know, PagerDuty, it may be an email system, whatever the case may be, and that receiver is what ultimately results in the delivery of the alarms. It will also do the deduplication, suppression, and things like that. So Alertmanager...
D
So with Alertmanager, you can run, you know, two or three of them for high availability, and when it gets an alarm, it will determine if it's already been sent to a receiver by one of the other Alertmanagers, so you won't get it multiple times.
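A minimal Alertmanager route-and-receiver configuration of the kind being shown on screen might look like this; the grouping labels, intervals, and the PagerDuty integration key are placeholders:

```yaml
# Illustrative Alertmanager sketch: group alerts by job and
# deliver them to PagerDuty instead of a null receiver.
route:
  group_by: ['job']
  group_wait: 30s        # how long to wait before the first notification
  group_interval: 5m     # how often to check for new alerts in a group
  repeat_interval: 4h
  receiver: 'pagerduty'
receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: '<your-pagerduty-integration-key>'
```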
E
D
Yeah, so really STF is designed as an infrastructure monitoring tool; it's not really meant to be the application monitoring tool. So you could either run something inside of your virtual machine that sends to another system that is designed specifically for application monitoring, or you can match the same pattern, where your virtual machine can actually run, say, a collectd or something like that. Now, for the virtual machines themselves, if you're just trying to get CPU, memory, I/O, things like that, you don't need to run anything inside the VM. The collectd virt plugin deals with that for you: it will talk to libvirt, and it gets information about all of the virtual machines. So if you're actually just trying to monitor the virtual machines themselves, then you don't need to run anything inside of the virtual machine; that's already dealt with for you using this collectd virt plugin.
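On the hypervisor side, enabling the collectd virt plugin looks roughly like this; this is a sketch for illustration (exact options vary by collectd version, and in an STF deployment TripleO generates this configuration for you):

```
# Sketch of a collectd.conf fragment enabling the virt plugin,
# which reads per-VM CPU, memory, disk, and network stats
# from libvirt on the hypervisor.
LoadPlugin virt

<Plugin virt>
  Connection "qemu:///system"   # local libvirt daemon
  RefreshInterval 60            # re-scan the domain list every 60s
  HostnameFormat "name"         # report metrics under the VM's name
</Plugin>
```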
C
D
So if you wanted to, as an administrator, to allow the workloads inside to also send information, you would just have to run the QDR, or be able to point your collectd at the QDR, and then you would just make use of the collectd plugins that you want to make use of. So, for example, say your example is: you know, I want to know that the Apache running inside of a virtual machine is still active.
D
I'd make use of the apache plugin inside of collectd, and then send that into the system and then basically monitor for that. But that's kind of out of scope, while technically possible, because you just have to send the data; once the data is sent and transported, then you just make use of it, right? But that is kind of at a different layer; that is not really the scope of this monitoring system.
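For the in-guest case described here, a collectd configuration could pair the apache plugin with the amqp1 plugin pointed at a reachable QDR. The hostname, port, and transport address below are assumptions for illustration, and the apache plugin requires `mod_status` enabled on the web server:

```
# Sketch of an in-guest collectd.conf: scrape local Apache status
# and ship the metrics over AMQP 1.0 to a QDR the guest can reach.
LoadPlugin apache
LoadPlugin amqp1

<Plugin apache>
  <Instance "web">
    URL "http://localhost/server-status?auto"
  </Instance>
</Plugin>

<Plugin amqp1>
  <Transport "telemetry">
    Host "qdr.example.com"   # hypothetical reachable QDR
    Port "5672"
    Address "collectd"
    <Instance "telemetry">
      Format JSON
    </Instance>
  </Transport>
</Plugin>
```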
D
Yes, so I can just monitor the virtual machine itself, and I'll know how much memory was allocated. So you can see down here, you know, how much total memory, how much is unused, how much is usable, how much is available to me, all of that kind of stuff. So I can make use of the libvirt stats to determine that a virtual machine that was given, you know, 16 gigs of memory is approaching its limit.
E
I mean, I will say, as a former DBA, the separation of infrastructure monitoring from application monitoring is not a decision I've ever understood. Because, for example, if I'm running a database and the queries start being inexplicably slow, frequently the reason is that the machine that they're on is running out of resources, and it really seems like you actually want that telemetry unified rather than in two separate systems.
D
Yeah, like I said, you just have to run collectd inside of the virtual machine, right? So that is a decision for the infrastructure administrator, to determine if that data is appropriate for their system and what they're monitoring. But if you just run collectd inside of the virtual machine, then there's absolutely no reason you can't collect data from inside of the virtual machine; you just have to configure it that way.
D
Right, like the deployment of the data collectors and the transport inside of the virtual machine is definitely out of scope of the lifecycle management system for your infrastructure, in this case TripleO, right? So I'm running TripleO; what I'm doing inside of the virtual machine is outside the scope of that system, which is what I'm using to deploy the data collection and transport.
D
If you happen to have a virtual machine image that you've created and uploaded to your data store, and when you launch that VM, those VMs are set up to automatically, no matter what you're doing, run a data collector and a transport system, and you configure it in such a way that it can connect, there's absolutely no reason you couldn't centralize all that data with the system.
D
Well, you need to connect it to a QDR somewhere. So you could either understand that there is a VM that might attach to the QDR running on the host, but then you have to make sure that your network routing allows the virtual machine to connect to the host that is running the VM, and that might be a security issue. So you may need to set up another QDR. Every node is running a QDR, which is very light, right? Those QDR routers are running in edge mode.
D
Yeah, in theory you could do it that way. Now, obviously, it just depends if you want to use the local QDR for your virtual machines, and you have to, you know, decide whether the security requirements of your system allow for virtual machines to connect to services running on the host.
C
And there are so many different ways you can monitor, and everyone is very passionate about what they think is best. That's why I was mentioning that if you had an issue, knowing the CPU was high or the memory was low could point you in a direction, but at the same time it is nice to know that things are responding, whether it's in the same system or not. You know: connecting to the database, are you getting a good response time? Connecting to your website, is it up?
D
Yeah, and obviously there's tenant implications and things like that. So if you have multiple customers or tenants inside of your infrastructure, STF's not designed to be a multi-tenant system, right? All of your data is collected in one location; it's not separated out, so there's no way to say customer ABC can only see the information collected by their VMs. That's not in scope for what we're doing.
D
If you are the only tenant, and you're running the cloud for the purposes of, say, running VNFs, and all of the workloads running on top of that infrastructure are specifically in support of your administrators, and you don't need that multi-tenant separation at the data store level, then yeah, I mean, run your virtual machine, run your VNF inside of it, and then have your data collector inside of it that goes into your main monitoring infrastructure, you know, like STF. But if you need some very fine-grained controls of who can see what, then probably STF is not the appropriate solution.
E
It seems like, actually, if that was your situation, you could go the other way, right? If you have an application monitoring system, then you could expose select data from STF to that system, right? Because if I'm in charge of the application, if I'm the DBA, I don't care how I get the correlated information behind what the CPU is doing and what the database is doing.
D
Yeah, so Prometheus is not multi-tenant aware, right? I don't have separate logins where I can say you get this subset of data; that's not how Prometheus works. But what you could do, if you really wanted to, is break those out into separate data stores. You could effectively create two ServiceTelemetry objects in the same namespace, and then have a cloud configuration for just that tenant's application, and then, when they send their data back, it would be listening on a different transport address, right?
D
So you'd have a different set of smart gateways on a different address space inside of the actual transport medium, and the applications would only send to, say, you know, application/dba or something, right? And then the smart gateways would listen to that topic on the transport and put it in its own data store, and then you could create, you know, routes just for that person, and then you could use, you know, OKD's login system to control who can access and log into the various routes, too, right?
D
So it just requires a little more setup than what is just kind of out of the box; you know, it takes more than the 10 minutes it takes to set up STF normally, but in theory it is totally doable.
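The per-tenant split sketched here might look something like a second ServiceTelemetry object whose cloud listens on its own transport address. The field names below follow the upstream CRD loosely and the cloud name and address are hypothetical; treat this as an illustration, not exact syntax:

```yaml
# Illustrative second ServiceTelemetry object in the same
# namespace, with a cloud dedicated to one tenant's application.
apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
metadata:
  name: stf-tenant-dba
  namespace: service-telemetry
spec:
  clouds:
    - name: dba
      metrics:
        collectors:
          - collectorType: collectd
            # Smart gateways for this object would listen only on
            # this per-tenant topic on the transport.
            subscriptionAddress: collectd/application-dba
```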
C
D
No, we haven't really had that so much. So it's pretty much just: we monitor the GitHub system, you can open issues there, and you can always, you know, find us on IRC. I think we have a service-telemetry channel on, I want to say, OFTC. But generally, just go to the GitHub, open an issue, and we're pretty responsive there. Otherwise, yeah, we haven't had a lot of other folks outside of our team really making use of it, so we haven't really had that need for a guide or anything like that, but obviously we would if we started getting lots of contributions; that would be a great problem to have.
C
D
No, I think, I mean, it's definitely a little bit of a different system, and it's not built on top of the, you know, OpenStack services running directly inside of OpenStack, but the idea behind it is to allow information coming in from various different systems. So that's kind of why the architecture looks that way. And if anyone's interested in running it, you know, feel free to reach out, and I'd be happy to help anyone work through anything that they need.
E
Yeah, so thank you. Thank you, Leif. Thank you, everybody, for attending, or for watching this later on YouTube. Cloud Tech Thursdays will be taking a brief hiatus, and we'll be returning in approximately one month, on August 17th. And the reason why it's August 17th is we're actually becoming Cloud Tech Tuesdays.
E
The reason being that we wanted to move this to an earlier time slot, in order to be more friendly to European viewers, who have been complaining that this is way too late in the day for them to attend. So, yep, you'll see us in four weeks minus two days as Cloud Tech Tuesdays, where we will be meeting with the Kubernetes 1.22 release team to talk about how the release went, what's in the release, and all of those other things. So see you then, on Cloud Tech, on CTT, which we still are.
E
UTC time, I believe, is 1300. Yeah, hold on.
E
Yeah, so it's 1400 UTC, and 10 a.m. EDT. Okay, but you'll see more in various places.