From YouTube: Istio September Meetup / How Istio brings excellent observability to HP, by John Zheng and Lisa Xu
Description
For large platforms that support hundreds of projects, it is vital to gain insight into API response codes, API response times, and API call volumes, with monitoring and precise alerting. Istio can easily meet these requirements without any changes to service code. In addition, troubleshooting in a complex service mesh is always troublesome and time-consuming. However, by integrating Istio and Jaeger, you can correlate client requests, traces, Istio logs, and application logs. In this way, when a problem occurs with an API, it is easy to locate. In this session, HP DevOps engineers share their best practices in cluster observability, hoping to offer useful references for everyone.
Lisa: Okay, here is today's agenda. We're talking about our business platform: what kind of features we have enabled on it by leveraging Istio, and what kind of customizations we have made along with those features. In the end, we'll have a Q&A session.
First we want to talk about how our HP Horizon platform is designed with Istio. As you know, HP is a big company; we have lots of projects deployed on the cloud. Some projects share common features, like single sign-on, user management, payment, etc., and they also have project-specific features. John and I work in a software organization, and we decided to provide a common platform called Horizon, which mainly provides two things.
The first one is that we serve the common features as a service, so that the solutions don't have to develop those services on their own; they can leverage our common services in this area. The second one is that the Horizon platform also aims to provide managed infrastructure for the solutions, so the solutions don't have to care about or operate their own infrastructure; they can run their services on our managed infrastructure.
In this way we hope we can maximize the value for the customers, and also help our solutions deliver to the market at lower cost and with higher quality.
So here is what the platform looks like. On one side you can see a cluster we call the core cluster. Inside the core cluster we host all the core services, like the auth service, the IAM service, the payment service, the notification service, etc., and all the core services are exposed through the ingress gateway. On the other side we have the solution clusters. In our platform, each solution is treated as an individual tenant, and each tenant has its own namespace.
Each tenant's namespace is isolated using network policy, and their services are also exposed through the ingress gateway. If a solution service wants to consume our core services, it goes through mutual TLS; the solution clusters and the core cluster are set up as an Istio multi-cluster deployment using replicated control planes. And if a solution cluster is hosted by the solution team itself, not running on our managed infrastructure, it can still consume our core services directly through the same channel.
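[For reference: Istio can enforce the mutual TLS part of this setup declaratively. A minimal sketch, assuming a per-tenant namespace named tenant-a (the name is hypothetical):]

```yaml
# Require mTLS for all workloads in one tenant namespace
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: tenant-a   # hypothetical tenant namespace
spec:
  mtls:
    mode: STRICT        # plaintext connections to these workloads are rejected
```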
Now, considering that we are building a platform and providing this kind of service to our solutions, there are some requirements we want to fulfill. For example, since we are providing the core services to a large audience of solutions, how can we make sure our services perform well, and how can we get alerted when they don't?
We were hoping not to rely heavily on the application teams to provide hundreds of logs to support troubleshooting. Instead, we hoped that, as the infra team, we could provide this kind of functionality for troubleshooting and monitoring ourselves. So kudos to Istio, which provides excellent observability to support this.
So how does Istio give us this? As we all know, on the Istio side a sidecar proxy is added alongside every application deployment. In this way, all the applications become available for traffic management and also for observability: Istio generates detailed telemetry for all service communication within the mesh, and this telemetry provides good insight into service behaviors.
A
So
in
this
way
they
can
empower
us
through
troubleshooting
and
maintain
and
optimize
our
application
and
the
most
important
things
that
we
are
not
adding
any
burden
to
our
service
developer,
so
istio
provider
following
ways
for
the
telemetries,
the
first
one
is
access
lock.
A
So,
as
all
the
traffic
flow
in
to
the
mesh
issue
is
able
to
generate
all
the
full
records
of
them
api
request,
including
the
source
and
this
destination
method,
and
in
this
way
it
will
able
to
tell
you
that,
like
the
how,
when
what
of
the
logins
and
the
second
way
for
the
telemetry
is
for
the
matrix
matrix,
actually
is
able
to
provide
a
way
for
monitoring
and
and
also
helping
you
to
understand
the
service
behaviors
in
aggregate
and
is
still
able
to
generate
the
metrics
in
a
signal
like
the
error,
traffic
and
letters
and
etc,
and
also
is
you're
able
to
provide
a
default
default
monitoring
dashboard
out
of
box.
A
And
so
the
third
thing
is
for
the
distributed
trace.
So
instill
is
able
to
generate
all
the
distributed
trade
spans
for
each
service
and
in
this
way
or
along,
you
have
the
inside
of
the
all
the
call
flows
and
also
the
service
dependencies
in
the
mesh.
So, as I mentioned, since we run a platform, it's key for us to be able to show data like core service utilization and the performance of our services. For example, if I want to know my core services' API access counts, API error rates, and API response times, how should I do it? I don't want to ask the core service developer teams to implement any code for this; I want my infra to tell me this data directly. So here is the solution we came up with.
Let's take a look; this is our solution. Here you can see our cluster, and inside the cluster we leverage Fluentd to collect all the logs; concretely, we collect the logs from the istio-proxy sidecars and also from the Istio ingress gateway.

After collecting, we also parse these logs for the fields we consider important, then we store all the data in Elasticsearch, and we can view the data and generate dashboards in Kibana. So let's take a deeper look at how we implemented this.
By default you are not able to get the Envoy access log from Istio, so in order to enable it, during the installation you have to use istioctl to turn the Envoy access log on. Once you enable it, you are able to view the logs like this: this is a log from the Istio ingress gateway, and this is a log from the istio-proxy container of one of the pods.
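[For reference: a minimal sketch of enabling this at install time with istioctl and the IstioOperator API; the exact flags shown on the slide may differ:]

```yaml
# istioctl install -f access-log.yaml
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    accessLogFile: /dev/stdout   # every Envoy sidecar/gateway writes access logs to stdout
    accessLogEncoding: TEXT      # plain-text format (JSON is also supported)
```

[The same effect can be had with `istioctl install --set meshConfig.accessLogFile=/dev/stdout`.]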
These logs look very similar and contain a lot of information, so let's dig a little deeper into what is contained in such a log. We highlighted some of the key pieces of information in this access log that we consider very useful. The first one is the start time of the API call. The second one is the method of the API; then there is the API path, the response code of the API (whether it succeeded or not), the duration of the call, the x-request-id (which John will also mention later; it is a very important piece of information), and the host.
After understanding the format of the logs, we can do something based on our requirements. In Fluentd we parse the logs based on the keywords we think are important, configured in the Fluentd ConfigMap. Here is one example we are using in our project: we parse out all the important fields, like the response code I mentioned, from the istio-proxy log, and we store each of the values in new fields.
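[For reference: a rough sketch of such a parser filter. The tag pattern, regular expression, and field names are illustrative, not the exact ConfigMap from the talk; the expression depends on your access log format:]

```
# Parse the Envoy access log line into named fields before shipping to Elasticsearch
<filter kubernetes.var.log.containers.**istio-proxy**.log>
  @type parser
  key_name log          # the raw container log line
  reserve_data true     # keep the original record alongside the parsed fields
  <parse>
    @type regexp
    # Illustrative pattern: start time, method, path, and response code,
    # roughly following Envoy's default text access log format
    expression /^\[(?<start_time>[^\]]+)\] "(?<method>\S+) (?<path>\S+) [^"]*" (?<response_code>\d+)/
  </parse>
</filter>
```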
After we parse them into this format, we are able to store them in Elasticsearch, and then we can view the logs in a much more readable format in Kibana. In the left panel, these are all the new keys we generated based on the logs, and these are all the values we extracted from the logs —
as I said, the duration, the method, the path, the response code, et cetera. And as soon as we are able to store this kind of data in this more readable way, we can group the data and generate reports for our needs.
A
What's
the
top
one
there,
I'm
able
to
generate
this
kind
of
report
as
well,
so
all
of
them
is
based
on
each
api
and
also
I
I'm
able
to
space
on
my
duration,
I'm
able
to
see
the
latest
in
my
api,
so
I
can
provide
this
this
kind
of
data
and
ask
a
developer
to
do
some
job
shooting
based
on
that.
Up to now we have a clear view of the API performance of the platform, but I don't think this is enough, because as operators of the platform we also want to predict ahead, so that we know when an issue may be about to happen in the system and can get alerted in time. So we also put some effort into API monitoring and alerting in the platform, and we did meet some challenges as we went. The major challenges we had previously: some issues were not reported in time, or some issues stayed at the warning level, we didn't pay attention, and later on they really became an incident in production.
A
So
one
example
I
can
show
you
is
we
have
one
service
and
this
service
has
several
api
will
be
supposed
to
resolution
to
consume
and
the
one
of
the
api
actually
is
very
stable
and
better
access.
The
access
rate
is
also
very
high
and
is
very
stable,
so
you
can
consider
if
you
are
monitoring
on
this
way.
It
will
always
risk
most,
but
there
is
also
another
api
inside
of
this
service,
which
access
rate
is
very
low,
but
it's
very
critical.
If
this
api
failed,
it
will
trigger
a
critical
instant
for
a
solution.
In this situation, because our alerting was based on the service error rate, this kind of failure would not get attention in time. We wanted to find a way to resolve this kind of situation, and we kept thinking about what we could do to overcome this challenge. We concluded that our monitoring should be more accurate, down at the API level, and also more predictive, so we started drafting the requirements and rules we could put into our monitoring.
C: Thank you.

Lisa: You're welcome. So this is what we came up with: the requirements should be based on the API level, and then we can do the monitoring. First we looked into what Istio originally provides for metrics. Actually, as I mentioned in the beginning, Istio is able to generate a lot of metrics by default, and you can query these metrics from the out-of-the-box Prometheus that ships with Istio.
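[For reference: a sketch of such a query against Istio's standard `istio_requests_total` metric; the service name is hypothetical. Note that the labels describe services, not URL paths, which is exactly the limitation discussed below:]

```promql
# 5xx error ratio of one core service over the last 5 minutes (service level)
sum(rate(istio_requests_total{destination_service="payment.core.svc.cluster.local",
                              response_code="500"}[5m]))
/
sum(rate(istio_requests_total{destination_service="payment.core.svc.cluster.local"}[5m]))
```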
A
You
can
also
view
this
kind
of
mattress
from
the
graph
now,
but
in
this
way
you
can
see
this
is
the
service
and
I'm
able
to
see
my
latest
in
and
also
the
success
rate,
but
this
success
rate
is
still
based
on
the
service
level.
There
also,
you
can
see,
there's
also
the
tracings
in
the
keali
to
see
the
service
dependencies
and
also
in
here
you
are
able
to
see
all
the
other
for
the
api
call,
but
this
is
also
based
on
the
service
level.
A
So
all
of
them
provided
from
the
institute
by
default
is
awesome.
But
still,
as
I
mentioned,
we
want
to
do
this
kind
of
things
in
the
service
level
in
the
api
level,
instead
of
a
service
level.
So
we
have
to
come
out
as
a
solution
by
ourself,
so
I
will
hand
off
to
john
and
he
will
give
in
detailed
introduction
regarding
our
solutions.
John: So, as Lisa said, she has already introduced how we get the Istio and Envoy logs, how we parse the logs in Fluentd, and how we forward them to Elasticsearch. She also introduced why the current Istio metrics cannot fulfill our requirements: they are at the service level, not the API level. Okay. So this is our final solution. With Fluentd we collect the logs, parse them, and save them in Elasticsearch; with that, we already have all the API-level data, API information like response code, response count, and response latency.
B
Based
on
this,
we
do
the
simple
a
lot
with
the
last
a
lot
or
in
most
cases
we
do
more
advanced
a
lot,
because
we
have
more
complex
rules
so
with
first
for
each
minute,
we're
using
a
clone
job
to
do
the
data,
aggregation
and
save
the
aggregation
data
into
the
database,
and
we
use
another
clone
job
to
queries.
Database
find
out
unnormal
data
and
send
a
lot
with
page
duty.
For example: if an API's response code is 500 and it occurs three times in the last minute, okay, you should raise such an alert. But in most cases we want to cover all the APIs; we don't want to list the APIs one by one in a config file. And we also want to alert on conditions like the failure continuing for two minutes. So we need our more complex, advanced alert solution. One thing to mention: although the rules are complex, our configuration is very simple.
B
So
we
say:
okay,
we
get
this
data
from
aggregate
table
here
and
response
code
is
500
and
more
than
three
times
in
last
two
minutes
and
continue
two
minutes.
That
means
both
of
the
minutes
has
such
case,
so
you
can
find
with
this
it
easily
to
implement
our
requirements
and
reserve
our
challenges
so
based
on
the
api
level,
a
lot
okay,
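[For reference: the rule format on the slide isn't reproduced here, but a rule of that shape could be sketched like this; every field name below is hypothetical, purely to illustrate the idea:]

```yaml
# Hypothetical API-level alert rule, evaluated by the second cron job
- name: core-api-500s
  source: api_minute_aggregates   # table written by the per-minute aggregation job
  where:  response_code == 500    # applies to every API, no per-API listing needed
  threshold: count > 3            # more than 3 occurrences...
  window: 2m                      # ...within the last two minutes...
  for: 2m                         # ...and present in both of those minutes
  action: pagerduty
```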
So, based on the same solution, we then enhanced and tested our alerting by implementing the requirements below.
B
So
all
of
them
is
implemented
and
works.
Well,
like
us,
500
error
rate
is
more
than
five
percentage
in
the
last
five
minutes,
like
the
p90
latency
is
greater
than
500
milliseconds
for
the
last
five
minutes,
or
sometimes
the
average
response.
Time
is
not
high,
but
it
continues
to
increase
right.
So
we
should.
We
will
also
a
lot
if
we
increase
almost
20
percentage
in
last
hour
or
we
can
come.
B
We
also
store
the
p90
average
time
in
the
last
week,
so
we
compare
current
average
response
time
if
it
is
two
times
of
previous
last
week's
p90
average
response
time,
yeah
we,
yes,
we
will
also
a
lot.
So
in
such
case,
maybe
this
api
is
very
very
quicker.
It
is
quicker
like
less
than
100
milliseconds.
With all of that, we can identify issues and alert quickly, and it really does bring our cloud platform downtime down, guarantees our SLA, and improves our user experience. And it definitely reduces our DevOps load and improves our work efficiency. Okay, so let's go to another topic for Istio observability: the Jaeger tracing integration.
These were our challenges before we got to our solution. We are using Jaeger, but the spans from the application and the spans from the istio-proxy could not be connected, because their trace formats, the propagation metadata, are different. And also, it was not easy to match a client request with a Jaeger trace: thousands of requests go into Jaeger, and for one specific request it is not easy to search the corresponding trace out.
This is a diagram to introduce our challenge. The client side sends a request to service foo, and service foo sends another request to service bar. Each hop goes through the Envoy proxy, and both the applications and the Envoy proxies send their trace data to Jaeger.
B
The
envoy
policy
said
I
forwarded
the
silicon
chest
formatted
only
and
I
also
forwarded
x,
request
id.
This
is
a
voyages
id,
but
our
application
said
iphone
was
a
jaeger
for
meta
because
we
are
using
jager.
Besides that, we added one additional standard: all the applications, when they print their logs, use the x-request-id as a prefix. The Istio ingress gateway and the istio-proxy already print the x-request-id in the Envoy access log, so with this we can connect them all together.
B
Is
this
solution
complex,
so
neither
neither
developer
cost
a
lot
of
time.
No!
No!
It
is
very
easy.
As
long
as
you
add
this
green
part
code,
you
can
implement
this
so
previously
for
integrate
with
jager.
So
you
can.
You
need
only
add
this
part
when,
when
you
new,
chaser
and
now
in
order
to
compatible
with
silicon
and
the
x
request
id,
you
need
to
add
this
pattern
with
new
zipper
and
with
the
package
package
request
id
and
both
for
the
inkjet
analytics
character.
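[For reference: the slide's code isn't reproduced here, but with the Jaeger Java client the registration described above typically looks like this minimal sketch; the service name is hypothetical:]

```java
import io.jaegertracing.Configuration;
import io.jaegertracing.internal.propagation.B3TextMapCodec;
import io.opentracing.Tracer;
import io.opentracing.propagation.Format;

public final class TracerFactory {
    public static Tracer create() {
        // Zipkin B3 codec, so the app emits/reads the same x-b3-* headers
        // that the Envoy sidecars propagate.
        B3TextMapCodec b3 = new B3TextMapCodec.Builder().build();

        return Configuration.fromEnv("foo-service")   // hypothetical service name
                .getTracerBuilder()
                // the "green part": register the codec for both inject and extract
                .registerInjector(Format.Builtin.HTTP_HEADERS, b3)
                .registerExtractor(Format.Builtin.HTTP_HEADERS, b3)
                .build();
    }
}
```

[With this in place, the application spans and the Envoy spans share one trace context instead of starting separate traces.]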
So these are the benefits; I'll use a sample to show them. Let's say there is some problem with one API request, and we want to get all the traces and the logs for it. We send this API call with a UUID in an HTTP header, the x-request-id header. With this UUID we can then query our traces. Previously it was not easy to query out the trace, but now you just put the tag here in the Jaeger UI, `guid:x-request-id` with the previous UUID, and click the search button. Okay, that's cool!
So the trace is found easily, and you can go into the details of the trace. You can see it goes through the Istio ingress gateway, then to the foo service istio-proxy, and then to the foo service application, as different spans; likewise, there are the spans of the bar service istio-proxy and the bar service application. Great, so you can see all the spans connected together.
What is even more useful: we want to get all the logs together as well. Since the request goes through the ingress gateway, we can find its logs by querying with that UUID: you can see, okay, the Envoy access log shows up. Then we want to find the foo service istio-proxy log — okay, cool, it also shows up. And then we want to see the foo container, which is our application container.
All the service logs related to this API are queried out directly. One thing I need to mention: the application needs to extract the x-request-id from the HTTP header, because the x-request-id is always forwarded by the Istio sidecar to our applications, so we can take it and print it as a prefix for each log line.
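[For reference: a minimal sketch of that extract-and-prefix step as a servlet filter using SLF4J's MDC; the header name is the real x-request-id, everything else is illustrative:]

```java
import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import org.slf4j.MDC;

// Copies Envoy's x-request-id into the logging context, so a log pattern like
// "%X{requestId} %msg" prefixes every log line with the request ID.
public class RequestIdFilter implements Filter {
    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        String requestId = ((HttpServletRequest) req).getHeader("x-request-id");
        MDC.put("requestId", requestId == null ? "-" : requestId);
        try {
            chain.doFilter(req, res);
        } finally {
            MDC.remove("requestId");   // don't leak the ID onto the next request
        }
    }
}
```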
Cool. Okay, so let's go to the final tip: quick debugging with the EnvoyFilter. Here's the challenge: there is some error rate on the production environment for one of our applications; however, the application's logs are not enough. How do we handle it? The general way is that the developer adds some logging, builds a new image, and deploys this new image to the production environment through the release chain. It needs a code change, right, and it takes time, and you even have to change the code back afterwards.
Instead, we use an EnvoyFilter, shown below. With the workloadSelector you can limit which application should print the headers or the body — we don't want every application printing so many headers and logs, right. And there are two Lua functions: one is envoy_on_request and one is envoy_on_response. In envoy_on_request you can print out all the request headers and the body; same for envoy_on_response: you can get all the headers of the response, print all their key-value pairs, and also print out the body.
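[For reference: the exact filter from the slide isn't reproduced here, but a Lua EnvoyFilter of that shape could look like this minimal sketch; the namespace and app label are hypothetical:]

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: EnvoyFilter
metadata:
  name: debug-dump-headers
  namespace: foo            # hypothetical namespace
spec:
  workloadSelector:
    labels:
      app: foo              # only this workload prints the extra debug logs
  configPatches:
  - applyTo: HTTP_FILTER
    match:
      context: SIDECAR_INBOUND
      listener:
        filterChain:
          filter:
            name: envoy.filters.network.http_connection_manager
            subFilter:
              name: envoy.filters.http.router
    patch:
      operation: INSERT_BEFORE
      value:
        name: envoy.lua
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.http.lua.v3.Lua
          inlineCode: |
            -- dump all request headers into the proxy log
            function envoy_on_request(request_handle)
              for key, value in pairs(request_handle:headers()) do
                request_handle:logInfo("REQ " .. key .. ": " .. value)
              end
              -- the request body can be dumped similarly via request_handle:body()
            end
            -- dump all response headers into the proxy log
            function envoy_on_response(response_handle)
              for key, value in pairs(response_handle:headers()) do
                response_handle:logInfo("RESP " .. key .. ": " .. value)
              end
            end
```

[Applying and later deleting this resource needs no image rebuild or redeploy, which is the whole point of the tip.]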