From YouTube: Reshaping Authentication to Improve Resiliency
Description
Decoupling a legacy application into a microservices architecture is always a challenge. Decoupling an application that has 70 million users and 500,000 requests per minute without impacting uptime is an even bigger challenge. In this talk from Kong Summit 2019, OLX Lead Software Engineer Rodrigo Orofino shares how OLX reshaped its authentication mechanism, making its platform more elastic and resilient.
Hello, thanks for coming. In this talk I'm going to tell you how we improved the resilience of our platform. OLX is one of the biggest sites in Latin America; we get loads of data and a lot of users.
First, let me introduce myself. I'm a software engineer; my name is Rodrigo Orofino and I've been doing this for almost 12 years now. I'm from Brazil, so sorry about my accent. I live in Rio, the host of the 2016 Olympic Games, and I work at OLX.
What is OLX? OLX is a marketplace: a place where you can sell used stuff you don't need anymore, and where you can also buy stuff for a better price, just like Craigslist in the US. OLX is a joint venture of two of the biggest investors in the marketplace business, Naspers and Adevinta. Adevinta is the lead investor of Leboncoin in France and Avito in Russia. Naspers is the lead investor of OLX in Europe, across whole countries and continents, and also letgo in the US. They are normally competitors, competing around the world, but in Brazil they came together to be OLX, and because of that we have the biggest marketplace; we practically don't have competitors.
25% of our items are sold in the first 24 hours, and almost 65% are sold in the first week. We are also the number one place to sell your car in Latin America, and also your phone. We don't trade in our phones the way you guys do here: in Brazil, you don't really buy a new phone without selling your old one on OLX.
It's a little bit late in Brazil right now, closer to 9 p.m., but since I started talking we have probably sold more than 100 items; we sell more than 50 items per minute. And the way we keep our platform so fresh is that we get a lot of new ads every day: a little bit more than half a million new ads daily.
OLX had a proprietary backend written in C by another Adevinta company, a Swedish company called Blocket. They started building it in 1996, so when OLX started, it was almost 15 years old. As you may notice, we were handling sessions pretty badly: they were stored in memory by the PHP app. At the time, that was not a big problem, because most of our features didn't require logging in. We didn't have chat or favorites, premium galleries, recommendations, nothing like that.
You basically just needed an account for selling an item: you would log in or register, put your ad there, and then you could log out; we didn't need the session anymore. But this started to change when we started growing. When Naspers and Adevinta came together to be OLX, they decided to get into the mobile market, so they asked the tech team to build a new mobile solution as fast as they could. So they did it: the tech team put together a new API and built the clients.
This API was talking to the same monolith, so it was using the same session strategy. Did that solve the problem? Nope. It turns out we didn't have features that were good enough for the mobile market, so we started writing new features, got new product managers, and they started thinking outside the box. We kept going and developed a whole new set of features, and this time a lot of the features did require authentication. So when they hit production, we had another problem, and this time the problem was a little bit bigger: scaling was complicated. Why exactly? We were still storing sessions in the PHP app, so using a load balancer was not as easy as it should be, and because of that we kept scaling vertically. We had an on-premise data center, and every time we needed to scale we just bought new hardware and put it there. This was not a good strategy, and at this point we needed to move to the cloud; we needed to keep things going.
We were spending a lot of development effort, and hiring people is as hard in Brazil as it is here, so we were wasting good engineers managing infrastructure. We decided to move to the cloud, and there we couldn't do that anymore. So what did we do? We decided to use sticky sessions, or session pinning.
Well, it's basically a strategy to make sure the load balancer will proxy all requests coming from one specific user to the same server instance. And how did we do that? We basically pin the session to the user when they log in: when the user logs in, we save a cookie in their browser, or something equivalent on mobile web, and now every time the user makes a request and it goes through the load balancer, we redirect it to the same instance again. Easy, right? Nope. We had just traded problems; in a few months we had new ones.
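The pinning described above can be sketched in a few lines: on the first request the balancer picks an instance and records it in a cookie, and later requests with that cookie skip normal balancing. This is a minimal illustration, not OLX's setup; the instance names and cookie key are made up.

```python
import random

INSTANCES = ["app-1", "app-2", "app-3"]
STICKY_COOKIE = "srv"  # hypothetical cookie name


def route(cookies: dict) -> tuple[str, dict]:
    """Return (instance, cookies-to-set) for one request."""
    pinned = cookies.get(STICKY_COOKIE)
    if pinned in INSTANCES:
        return pinned, {}                   # already pinned: same instance
    chosen = random.choice(INSTANCES)       # first request: pick any instance
    return chosen, {STICKY_COOKIE: chosen}  # pin it via a cookie


# First request gets pinned; later requests with the cookie reuse the pin.
inst, set_cookies = route({})
assert route(set_cookies)[0] == inst
```

Note the failure mode the talk describes: if the pinned instance disappears, the cookie points nowhere, the user is re-routed to a fresh instance, and the in-memory session (and login) is gone.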
Sessions used to last a lot longer on mobile; people are not used to logging in every time they need to use the app. But we were still using the same strategy, just sticking the sessions to the users. So in a few months we got a lot of bad reviews on the App Store. It was a problem: business was not happy about that, and product was not happy either. We had just developed new problems.
The problem was that sessions were being lost every time one of the servers was terminated. Users that were attached to that server just got logged out when the server went away, hence a lot of negative reviews. In the cloud it should be easy to scale out and in, so we were always putting new instances into the cluster and removing them, and every time we removed one, everybody pinned to it got logged out. Big problem. Increased traffic was also hard on our servers, because we could not rebalance the requests.
A
When
we
put
a
new
instance
in
production
in
the
closer,
only
new
users
would
be
redirected
today,
so
we
could
not
reach
distribute
the
requests
evenly.
So
it
was
not
really
Alaska
by
the
by
the
way.
We also had a huge codebase. Everybody was working on the same repo, so everybody had to use a dev box. By this time we had a lot of external dependencies, with people using different stuff there, so the developer experience was bad: you couldn't run the tests on your own machine anymore. You had to run them in our Jenkins, and the tests would take like two hours to finish, so a long feedback cycle. You could finish your feature, try to put it in production, open your merge request, and discover two hours later that it would break something. We also had a bad cycle time. Why? We were deploying only twice a week, and normally one of those deploys would be held back or canceled, because one team would find a problem and it would be easier to just take everything out.
They also decided to start using the T-shaped engineer model, so we would have cross-functional teams handling the whole development lifecycle: we would think it, build it, and run it. One of the first teams ever created was the Accounts team, with the mission to decouple the authentication and account endpoints from the monolith. Using an incremental, strangler-style strategy, the team would remove its parts from the monolith, so we would break the monolith apart part by part, and things should keep going.
The approach was: discover what the responsibilities are, take them out, make them work, and make the monolith work using the new microservice. It was pretty important that nothing could be broken, because our end users could not suffer; we already had a lot of bad reviews on the App Store, so it was not a good time for that. This way we could gradually migrate over to microservices without affecting our uptime.
This was our first attempt. We started by creating two new microservices: one for handling authentication and another for account information in general. The auth service, obviously, was responsible for authentication: it was now the new identity provider, and we were using Redis to store the sessions. The keys were the session IDs and the values were the logged-in user information, and things started to look good again.
We could scale horizontally again, no problems with the APIs, everything looked great. Business kept growing; we roughly doubled our number of users from one year to the next, and then we hit the proliferation of microservices. Now we were more than 100 engineers divided into teams, with everybody using whatever technology they wanted, so a lot of different languages. Each team had its own set of microservices, normally two or three. So in just a few months we went from this...
Incidents in production were a lot more common than we were comfortable with, and a lot of the time they were related to that Redis. It was a single point of failure: when things went wrong there, sometimes we just lost all the sessions and everybody got logged out of the platform. And why was that happening?
A
We
had
a
huge
volume
of
requests,
new
officers,
and
sometimes
we
had
requests
duplication,
because
if
two
different
micro-service
needed
to
make
sure
a
user
was
logged
in
they
couldn't
they
couldn't
make
sure
that
the
guy
was
out
in
cadiz,
so
both
would
call
the
officers
for
the
same
request
coming
from
the
end-user.
So
a
lot
of
request,
duplication.
We also had tight coupling: every service was calling our auth microservice directly, so improving anything in the auth service was hard, because we had maybe 50 different services using it, and it's hard to coordinate the effort for everybody to make updates. We also had a lot of code duplication.
A lot. Every microservice had the same authentication logic: when a microservice needed to make sure a user was authenticated, it would make a request to the auth service passing the session token (the first security problem right there), then check the response, and if the session was not valid, it should return 401.
Also, if the request was coming from a web browser, it was the responsibility of each team to make sure the right redirect flow would work, so they would come up with different strategies to do that. We were also not fault-tolerant: any problem in the Redis meant downtime for us, and there was nothing we could do. After seeing all of this, we had to accept that our system design was just bad; it was not going to work for us. So we got back to the drawing board, and this time we had new requirements.
We needed to keep things backward-compatible. We needed better performance. We needed to be more resilient. We needed to avoid code duplication for the other teams, so they could iterate faster. And last but not least, we couldn't lose the ability to revoke access fast. That means we couldn't go fully stateless; it would be a huge safety issue for us. So it was time for a new strategy, and after studying the problem again, we came up with one.
First, we created a new identity provider. This time we would persist the token data in a database, not just in Redis; we would use Redis just as a cache layer, to avoid reading from the database all the time. We also chose to move to a different token strategy: we would use two different kinds of tokens, a short-lived stateless token and a stateful refresh token. And of course, this time we used an API gateway, after two weeks of experimenting with some of them.
Kong has a lot of nice built-in plugins, and we are still using some of them. We tried to use them for our use case, for authentication, but we needed to keep things backward-compatible, and OLX was not using any kind of standard for that; we had proprietary stuff on top of proprietary stuff. So we decided to build our own plugin, making use of one of the good characteristics of Kong: it's easy to extend Kong with plugins.
These are the options of the plugin we built. As you can see, there are a few choices you can make here when adding a route to Kong. The most important options are whether authentication is optional, and which account information you need. So if you need the email of the logged-in user in your microservice, you put that there, along with anything else you need, and we take care of fetching it. Now I'm going to show you the flow.
Here is the flow of a request that passes through Kong using our plugin. First, the request comes from the web browser or mobile web, so it's coming directly from the client. It hits Kong, and Kong checks whether the route being accessed has the plugin enabled on it. If it has, it starts running the plugin. The plugin checks whether the route you are accessing needs any account information; if it needs any information other than the account ID, we make a call to the account service to fetch it. In parallel, we validate the stateless token.
Kong injects this information into the upstream headers, so the upstream gets all the account information it needs right there, and proxies the request to the upstream. Now the upstream can easily read the information it needs about the account from the headers, and that's easy to do in any language, so we don't have problems with that. The microservice does whatever it needs to do and returns to Kong. Kong then attaches the new stateless token, if it got one, to the response and returns it to the client.
But what happens if the user is not logged in? Well, if the user is not logged in, we have a few different cases. Remember this screen: the first thing we check here is whether authentication on the route is optional. If it is, we don't need to do anything; we just don't put any information on the headers, so the microservice knows the user is not authenticated.
If you keep it as required and you did not turn on the redirect feature, it will just return 401. That's what we normally do for the APIs used by mobile: we leave this option off. When we are using it for the web, we turn this feature on. If the redirect feature is turned on, the plugin has another job to do: it will generate a JWT for the request, containing mainly the HTTP method and the body.
It then redirects the user to the login page, but this time passing this JWT in the query string, so after the user logs in, the accounts team can take care of the redirect flow for you. We just reprocess the JWT and then handle the whole redirect flow, so the teams don't need to think about it anymore.
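Putting the three anonymous-user cases together (optional, required with a plain 401, and required with redirect), the plugin's decision could look like this sketch. The option names, the login URL, and the query parameter are all illustrative; the real plugin signs a JWT where this sketch uses a bare base64 blob.

```python
import base64
import json
from urllib.parse import urlencode

LOGIN_URL = "https://example.com/login"  # illustrative


def handle_anonymous(route_opts: dict, request: dict) -> dict:
    """What the gateway plugin does when no valid token is present."""
    if route_opts.get("auth_optional"):
        # Pass through with no identity headers; the upstream sees
        # an anonymous user and decides what to do itself.
        return {"action": "proxy", "headers": {}}
    if not route_opts.get("redirect"):
        # APIs (e.g. mobile clients): a plain 401, no redirect dance.
        return {"action": "respond", "status": 401}
    # Web: capture the original request (method + body) in a token --
    # the real plugin issues a signed JWT here -- and send the user to
    # the login page with it in the query string, so the flow can be
    # replayed after login.
    blob = base64.urlsafe_b64encode(
        json.dumps(
            {"method": request["method"], "body": request.get("body", "")}
        ).encode()
    ).decode()
    return {
        "action": "redirect",
        "location": f"{LOGIN_URL}?{urlencode({'next': blob})}",
    }


assert handle_anonymous({"auth_optional": True}, {"method": "GET"})["action"] == "proxy"
assert handle_anonymous({}, {"method": "GET"})["status"] == 401
```

Centralizing this decision in the gateway is what lets each team stop inventing its own redirect strategy, as the talk describes.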
We reduced a lot of the requests, because we eliminated the duplication, and we are also using the stateless token, so we avoid calling the new identity provider all the time. So we don't need a lot of instances anymore: we are just using six c5.large instances. That's c5.large, not xlarge, a big difference. We are also using Redis only as a cache, so we don't need 60 GB of memory anymore, just 15, so we saved a lot of money.
We also added an RDS database to store the token data for the identity provider, but this RDS was not getting a lot of reads, because Kong is pretty efficient at caching the services and routes, and we are caching most of our requests for refresh tokens in Redis. So it was not a big deal.
We also saw a nice boost in production: performance increased a lot. It takes less than 10 milliseconds for us to authenticate a request now, and only the requests that need a lot of account information get that latency. If you don't need it, only the Kong part runs, and we can normally do that in less than three milliseconds. So we kind of made the whole platform faster.
Uptime also improved a lot, but let me say first: this is not the uptime for OLX in general, just for the accounts and authentication flow. As you can see, it got a lot better, mainly because now we have fallback strategies; when things go wrong, we don't need to upset our users with downtime every time. We could also implement another fallback strategy, a kind of emergency pattern, where we extend the duration of the stateless token if things get really bad. In a really huge emergency, we can just turn on a flag and say, okay, every token is valid for one hour, and that buys us some time to get things back up.
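That emergency flag can be sketched as a grace window applied at verification time, so already-issued stateless tokens stay valid longer without reissuing anything. The flag name and the one-hour window are illustrative, taken from the talk's example rather than from any real configuration.

```python
import time

EMERGENCY_GRACE = 3600   # accept expired tokens for up to one extra hour
emergency_mode = False   # the flag we flip during an incident


def is_valid(claims: dict) -> bool:
    """Expiry check with an optional emergency grace period."""
    deadline = claims["exp"] + (EMERGENCY_GRACE if emergency_mode else 0)
    return time.time() < deadline


expired = {"exp": time.time() - 60}  # expired a minute ago
assert not is_valid(expired)         # normally: rejected
emergency_mode = True                # incident: buy ourselves time
assert is_valid(expired)             # now accepted, within the grace window
```

The design choice here is that the grace lives in the verifier, not the issuer: nothing in the stored tokens changes, so flipping the flag back restores normal behavior instantly.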
Another nice benefit was better API management. Right now we have a little bit more than 50 services registered in Kong using the plugin, but we have hundreds of services, and we are making use of the concept of consumers in Kong. We registered all the teams as consumers in Kong and gave them different API keys.
We know which teams are clients of our services, how long they take to respond, whether they are close to exceeding their rate limit, the response status codes they are giving us, all this good stuff. Debugging became a lot easier, and it was done just using the built-in plugins; we didn't have to do anything other than enable them. So it was awesome.
How do we deploy it? Okay, we are running an AWS ECS cluster. We are managing the cluster ourselves, using Lambda functions to scale in and scale out, and we deploy Kong as a container right now; we have a cluster for the accounts services. Before that, we were experimenting with EC2: the first time we put everything in production, we just used EC2, using Packer from HashiCorp to build the AMI and Terraform to provision it.