Description
iFood is the largest food delivery company in Brazil and is using Kong to handle all the incoming traffic of its microservices. All of this workload creates a perfect yet challenging environment for SRE work, including developing automation, monitoring the system, and helping improve resilience. In this session, we will show the environment and the techniques we used to improve iFood's stability.
So now I'm going to talk a little bit about iFood. For those who have never heard about iFood, it's a food delivery app.
We are present in Latin America: Brazil, Colombia, and Mexico. In Brazil we are number one, and since the beginning of the quarantine caused by COVID, thousands of restaurants are relying on iFood to survive.
We are playing an important role here, making our systems reliable to ensure the continuity of their business. Okay, talking quickly about the iFood culture: you code it, you own it.
It means that the software engineers not only code the main functionality of their microservice, but also all the pieces that make the microservice work, such as the Terraform code for provisioning infrastructure, or the YAML files that expose the endpoints to the world.
There is another good rule at iFood that says: give as much autonomy to the software engineers as possible. It means we should create tools that make it easier to push code to the production environment. To be able to give this autonomy, our tools must also have some guardrails to help the engineers on their delivery path.
Okay, that's it! That is our agenda. First, we will start talking about how Kong fits into the iFood flow. Then we will give a tiny overview of the microservices behind Kong. After that, we want to show how we could increase the resilience of the platform without touching the microservice code, just using Kong.
Before the requests go through the platform, they are received by the web application firewall (WAF) and the CDN, and then they enter our platform. As you can guess from the icons on the diagram, we are using AWS. So the external load balancer for Kong is our first barrier for those incoming requests, but Kong can also handle the requests that go from one microservice to another. It's not really a need.
We don't need Kong between one microservice and another, but it's really nice to have it, and we're going to talk about why. So the same Kong cluster handles both north-south and east-west requests, and we use two FQDNs on the Kong cluster to handle it: one for the internal requests and the other for the external traffic.
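As a rough illustration of the two-FQDN idea (all hostnames here are made up for the example, not iFood's real ones), the same service can simply get one route per audience in Kong's declarative config:

```yaml
# Hypothetical sketch: one Kong cluster, two FQDNs for the same service.
_format_version: "1.1"
services:
  - name: orders
    url: http://orders.foo.bar
    routes:
      - name: orders-external      # north-south traffic
        hosts: ["api.foo.bar"]
        paths: ["/orders"]
      - name: orders-internal      # east-west traffic
        hosts: ["internal.foo.bar"]
        paths: ["/orders"]
```

Keeping both routes on the same cluster is what lets one set of plugins, health checks, and pipelines cover internal and external traffic alike.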
A single cluster of Kong can have up to 50 instances, more than 300 routes, about 100 services, and 400 plugins configured on it, so it's really massive for us. Of course, all those clusters are handling about 70,000 requests per second, with peaks of more than 100,000 requests per second when people want to buy some dinner.
Okay, and how does it all work? Our EC2 instances have some Prometheus exporters, Filebeat to ship the Kong logs directly to Elasticsearch, and also the Chef client. So we are using Chef as the configuration manager here; our Chef cookbooks install Kong at the latest 2.1 version, plus some custom plugins.
All this freedom may be a little bit challenging for software engineers, who also have to configure the YAML files for Kong. You know, we have a lot of YAML files, so a lot of ways for people to make mistakes or misconfigurations. Thinking about that, we coded two pipelines. The first one is the validation pipeline.
It runs on every pull request and does some basic validation tasks, such as linting, tag enforcement, and checks for installed-versus-declared plugins and for duplicated routes or resource names. We also have some Kong-specific config validations, such as the regex priority. Here at iFood we use the regex priority so that every slash in the path adds plus one to the regex priority.
A
So
if,
if
you
have
like
slash
orders,
slash
order,
you
you
id
it's
going
to
be
rejects
priority
of
two:
it's
not
really
necessary
to
configure
it,
but
we
want
to
to
be
sure
how
the
behaviors
will
be.
So
we
enforce
that.
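The convention above could look like this as a fragment of Kong route config (route names and paths are illustrative, not iFood's real ones):

```yaml
# Hypothetical fragment following the convention:
# regex_priority = number of path segments.
routes:
  - name: orders
    paths: ["/orders"]
    regex_priority: 1
  - name: orders-by-id
    paths: ["/orders/\\d+"]
    regex_priority: 2
```

Making the priority deterministic per depth means two routes can never tie in an ambiguous way, which is exactly what the validation pipeline enforces.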
We also have the preserve_host enforcement; that's a really nice one. Sometimes we put Kong in front of S3 buckets, you know, S3 buckets with websites configured on them.
A
They
must
to
to
have
some
some
host
name
to
to
how
to
the
request
to
it
so
based
on
some
regular
expressions
we
have
here,
we
can
apply
the
hijacks
to
the
fqdn
of
the
kong
service
host
and
figure
it
out
if
the
reject
priority
is
false
or
through.
Okay,
true
is
okay,
but
it
must
be
false
when
you're
talking
about
some
s3
website
buckets
or
some
other
layer,
seven
thing
or
proxy.
That's
handling
the
request
so,
based
on
that
on
the
rejects,
we
enforce
all
their
routes.
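A minimal sketch of that case (bucket name and hostnames are made up): with preserve_host set to false, Kong sends the upstream host, which is the bucket's website endpoint, instead of forwarding the client's Host header.

```yaml
# Hypothetical fragment: Kong in front of an S3 website bucket.
# preserve_host: false so S3 sees its own website endpoint as Host.
services:
  - name: static-site
    url: http://my-site.s3-website-us-east-1.amazonaws.com
    routes:
      - name: static-site-route
        hosts: ["static.foo.bar"]
        preserve_host: false
```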
As you can see in this picture over here, we are using Jenkins for now, and we have highlighted this step: run Chef on only one instance. It works like a canary: we deploy the change to one instance and check if it's okay. If it's okay, the pipeline skips the rollback step, goes to commit and publish, and pushes all the changes through Chef to all instances in the cluster.
Okay, what does an iFood microservice look like? So, in general, microservices which are not just queue processors look like this design: they have endpoint methods to call, and they are written in a variety of languages, such as Go, Rust, Java, and so on.
So there is one flow we will talk more about. When a microservice receives a request, let's say a new incoming order was placed, the microservice, after all the validations, will create a new entry in the database. This information can be queried sometime in the future. No news here. But, as we know, even when we have some read-replica nodes, queries to a database can sometimes be expensive. So the microservice, when it's possible, after writing the new incoming order entry to the database, also writes a file to an S3 bucket.
The request goes to Kong, and Kong matches the request with one of its routes. The route points to the service, Kong does the proxy to orders.foo.bar, the response goes back, and that's awesome: it works. But what if orders.foo.bar goes down? Kong will return some 500 error to the client, and, you know, that's not resilient.
We have the active health checks. Think about the AWS Elastic Load Balancer health checks: the active health checks in Kong work really close to that. Kong keeps probing the service to get its health status: is it alive? Is it alive? It keeps probing the health check endpoint all the time. It's okay, it's one way to do it, but Kong also has the passive health checks, and that's really great stuff: Kong keeps track of the HTTP codes the microservice is responding to the client with.
And yeah, now we can see it in the next picture.
Okay, now we can see we have the health checks in green over there. We are not exposing the whole configuration, because it is a little bit big for the screen, but it is there, and the concept is there. So Kong can now understand when orders.foo.bar goes down, because it receives a lot of 500 errors or some TCP failures and timeouts, and it can shift the traffic to the other target, fallback.foo.bar.
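A sketch of what such an upstream could look like (names, paths, and thresholds are illustrative; as we discuss below, the real thresholds came out of load testing, and the weight scheme here is an assumption about how the fallback target stays mostly idle):

```yaml
# Hypothetical upstream fragment combining active and passive checks.
upstreams:
  - name: orders-upstream
    healthchecks:
      active:
        http_path: /health        # probed in the background
        healthy:
          interval: 5
          successes: 2
        unhealthy:
          interval: 5
          http_failures: 3
          tcp_failures: 3
          timeouts: 3
      passive:                    # watches real traffic
        healthy:
          successes: 5
        unhealthy:
          http_statuses: [500, 503]
          http_failures: 5
          tcp_failures: 3
          timeouts: 3
    targets:
      - target: orders.foo.bar:80
        weight: 100
      - target: fallback.foo.bar:80
        weight: 1   # assumption: tiny weight, so it takes over mainly when the main target is unhealthy
```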
So what I mean is: Kong is handling the traffic and the proxying, but when orders.foo.bar goes down, Kong understands it because of the HTTP codes and stops the communication to that service, or that target, right?
So after that, Kong will, behind the scenes, keep probing the orders.foo.bar service until it gets healthy again, and when it does, Kong restores the communication to orders.foo.bar. So the fallback is going to be active 100 percent of the time while orders.foo.bar is not healthy. We have resilience there; it's really nice, no? Okay, but we still have some room for improvement.
So we have some challenges here. With the passive health checks, Kong also keeps track of TCP failures and timeouts, and figuring out good thresholds for them was a little bit challenging for us: it took a lot of experimentation and load tests to figure out what a good threshold is for each service or target.
Another thing to keep in mind is that in a Kong upstream, the health check path is the same as the service path. This may not hold when using something like Actuator from Spring Boot, or frameworks like that: they may have a different path for the health check endpoint. And there is another challenge.
The health check path may not match the main app path, so this can be a little bit challenging when using circuit breaking in Kong. But for those challenges we may have a solution.
One alternative to these challenges can be using Kong itself as the fallback target. It may seem a little bit crazy, but let's check it out. When using Kong as the fallback target, we can use the request-transformer plugin, or the response-transformer plugin, or any other plugin, so that Kong responds and deals with the request the same way the main app does. So we can use Kong as the fallback app, and duplicated resources are not needed anymore.
So, you know, we said our microservices write data to the database and also to S3 buckets; you could call them the hot data and the cold data. We can use Kong to get this cold data when the main app is not working. So imagine that: we can use Kong and the request transformer to get this cold data.
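One way this could be wired up (bucket name, hostnames, and the object-key layout are all hypothetical, just to make the idea concrete): the fallback hostname is served by Kong itself, which rewrites the request into the key of the cold copy stored in the S3 website bucket.

```yaml
# Hypothetical sketch: Kong as the fallback app, serving cold data from S3.
services:
  - name: orders-fallback
    url: http://orders-cold-data.s3-website-us-east-1.amazonaws.com
    routes:
      - name: orders-fallback-route
        hosts: ["fallback.foo.bar"]
        paths: ["/orders/(?<order_id>\\S+)"]
        preserve_host: false      # S3 website buckets need their own Host
    plugins:
      - name: request-transformer
        config:
          replace:
            # map /orders/{id} onto the bucket's (assumed) object layout
            uri: "/orders/$(uri_captures.order_id).json"
```

The client keeps calling the same endpoint; only freshness degrades, from the hot database copy to the cold S3 copy.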
So, as we mentioned earlier, we are using this GitOps fashion of doing things, and the change flow is described in this image over there. The engineer opens a pull request; the pull request goes through the validation pipeline; it gets validated and gets the reviews; and eventually it gets merged and a git tag is created.
The git tag is used by the deploy pipeline, which does the canary style of deployment, updates Chef, and rolls out all the changes to all Kong instances. But, you know, during an outage it may be a little bit of a slow process to get these changes to the Kong cluster, although this is a really good flow: you have changes that are approved by the pipeline and you have the reviews.
It is easier to roll back, because you just need to deploy the latest tag. You have documented changes, so you have auditing, because you know who changed what and when, and we have this canary we mentioned before. But it is still a little bit of a slow process.
And that's why we would like the safe mode, our plan B. It consists of predefined scenarios, such as enabling rate limiting on a route, or, I don't know, disabling a health check on a route, something like that. These predefined scenarios live in git as well, and the SRE team implemented a single-page application that renders the status of the scenarios, retrieving all the information from Chef and also from git. When a safe mode is enabled, the request goes through the backend of our solution.
Remember, the single-page application, the backend, and the worker were developed by the SRE team, so please do not judge the layout: it's ugly, but it's working. We have all the switches over there; you can see the Kong proxies. We have all these predefined safe modes. So imagine here we have just the "discovery legacy 25" one.
Okay, guys, as good SREs, we should talk about monitoring.
Okay, the frequently asked questions about Kong are: how many 500-class errors is the Kong cluster responding with? Is the fallback app really working? Is it working correctly? To have answers to those questions, we use the Kong Prometheus plugin.
So, talking about these two metrics: the metric kong_http_status can be useful to understand what the percentage of 500-class errors is on a particular route or service that matched. Next, kong_upstream_target_health. I'm talking about this metric because it was coded by a colleague of ours... no, I'm joking.
This metric exposes the health state of a target of an upstream. It could be healthy; unhealthy; dns_error, which means the DNS lookup of the target's FQDN failed; and the last one, healthchecks_off, when the health checks of the upstream are not configured. So the host orders.foo.bar may have two IPs, for example 10.10.10.10 and 10.10.10.20, and each address shows up in the metric.
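Roughly, the two metrics look like this when scraped (labels and values here are illustrative, not real iFood data):

```
kong_http_status{service="orders",route="orders-by-id",code="200"} 4821
kong_http_status{service="orders",route="orders-by-id",code="500"} 12
kong_upstream_target_health{upstream="orders-upstream",target="orders.foo.bar:80",address="10.10.10.10:80",state="healthy"} 1
kong_upstream_target_health{upstream="orders-upstream",target="orders.foo.bar:80",address="10.10.10.10:80",state="unhealthy"} 0
```

The health metric is a set of 0/1 gauges, one per address and state, so "is the fallback taking traffic?" becomes a simple query over these series.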
Okay, this is an example of a Prometheus rule to monitor the percentage of errors on a target.
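A rule in that spirit could be written like this (thresholds, label names, and the alert name are illustrative, not the exact rule from the slide):

```yaml
# Hypothetical Prometheus alerting rule:
# fire when more than 5% of a service's responses are 5xx.
groups:
  - name: kong
    rules:
      - alert: KongHigh5xxRatio
        expr: |
          sum(rate(kong_http_status{code=~"5.."}[5m])) by (service)
            / sum(rate(kong_http_status[5m])) by (service) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "More than 5% of 5xx responses on service {{ $labels.service }}"
```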
Okay, and this is an example of a notification received on Slack. The Prometheus rule fired the alert, and then we receive it on Slack. So it's a simple integration to make the alerts visible to the team.