From YouTube: Migrating unicorn company to real-time stream processing
Description
Every company is a data company nowadays. Ruslan will present how he migrated Bolt from batch data workloads and synchronous processing to real-time data streaming and asynchronous processing.
He will talk about the problems he faced along the way and the lessons the team learned from this migration. Ruslan will also focus on the value real-time data unleashed for the company.
All of us live in the era of big data, meaning that the volumes of data we have to process, store, and handle only grow over time. But users don't actually care about that. They want access both to the current, fresh data and to all the historical data, and, last but not least, there is the speed of data delivery. Users usually don't want to wait for the data to propagate through the systems or to get delivered to the final system. They want access to it: the faster, the better.
Now that we know which requirements people have towards the data, let's discuss why any company would want to adopt stream processing. Remember, we have talked briefly about the importance of being efficient in today's business, and that's exactly the point of this slide: any company would prefer to adopt stream processing because it puts data into motion. It allows data to flow throughout the systems, and it means the company can develop so-called reactive microservices.
Those are services which react, or respond, to some kind of change in some state, or to some kind of event. Imagine you are buying something on Amazon: when you click the buy button, many actions have to happen, like an invoice has to be generated and sent to your email, and the inventory in the warehouse has to be reserved.
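The buy-button example can be sketched as a simple event fan-out, where independent handlers react to the same event. This is a minimal in-process illustration of the pattern; the handler names and event shape are hypothetical:

```python
from typing import Callable

# A minimal in-process event bus: handlers subscribe to an event type
# and each one reacts independently when an event is published.
class EventBus:
    def __init__(self):
        self.handlers: dict = {}

    def subscribe(self, event_type: str, handler: Callable) -> None:
        self.handlers.setdefault(event_type, []).append(handler)

    def publish(self, event_type: str, event: dict) -> None:
        for handler in self.handlers.get(event_type, []):
            handler(event)  # each reactive service handles the event on its own

bus = EventBus()
actions = []

# Two "reactive microservices" interested in the same purchase event.
bus.subscribe("order_placed", lambda e: actions.append(f"invoice emailed to {e['email']}"))
bus.subscribe("order_placed", lambda e: actions.append(f"reserved item {e['sku']}"))

bus.publish("order_placed", {"email": "user@example.com", "sku": "B00X"})
```

In a real system the "bus" would be a broker like Kafka, and the handlers would be separate deployed services rather than in-process callbacks.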
We all live in the era of microservices, and they have plenty of different benefits, like, for example, ease of deployment, or being decoupled from one another. But they also come at some cost, and usually that's how it looks in your production environment: want it or not, at some point you end up in a zoo where you have plenty of different microservices talking to each other, and good luck deriving any meaning from this mess.
Unfortunately, right now there is no engineer-friendly or easy way to guarantee that some communication between the services has happened. Those of you who have worked with similar setups know that people usually solve it with some kind of distributed transactions or state machines, and you also know how painful that is. So when we at Bolt thought about it: all right, how would we tackle this problem? How can we migrate towards stream processing?
Where do we get events from? We started thinking from the other direction. We use MySQL as the relational database for all our microservices, and those of you who have worked with databases might know that most modern databases have a so-called commit log under the hood. Basically, it is the log of all the changes which the database has applied.
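As a rough mental model (not MySQL's actual binlog format, just an illustration of the idea), a commit log is an ordered, append-only record of every change, and replaying it from the start reconstructs the database state:

```python
# A toy commit log: every write to the "database" is first appended
# to an ordered log of change events, which downstream systems can tail.
commit_log = []
table = {}

def apply_change(op: str, key: str, value=None):
    commit_log.append({"op": op, "key": key, "value": value})  # append-only
    if op == "upsert":
        table[key] = value
    elif op == "delete":
        table.pop(key, None)

apply_change("upsert", "ride:1", {"status": "requested"})
apply_change("upsert", "ride:1", {"status": "accepted"})
apply_change("delete", "ride:1")

# Replaying the log from the start reconstructs the exact same state,
# which is what makes it a natural source of events for streaming.
replayed = {}
for change in commit_log:
    if change["op"] == "upsert":
        replayed[change["key"]] = change["value"]
    else:
        replayed.pop(change["key"], None)
```

Change-data-capture tools work by tailing exactly this kind of log and turning each entry into an event.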
One of the crucial requirements we agreed on when we started developing this whole pipeline was that it has to be scalable, because the company grows year over year. We launch new cities and new countries, we provide rides to more and more customers, and all of us agreed that it has to be scalable.
But any time someone says anything to me about scalability, there is one piece of software, one system, which comes to my mind straight away, and that's Apache Kafka. For those of you who might not be familiar with Apache Kafka: it is a messaging system which was developed at LinkedIn and later open-sourced.
One of the features Apache Kafka provides is write-once, read-many semantics. It means that you can send, or write, one event to it, which can then be processed by any number of independent consumers, at their own pace, at different points in time in the future. There is no need to duplicate the same event so that it can be processed by many consumers: you can write it once and then process it as many times as you want.
now
that
we
understand
from
where
can
we
get?
A
Can
we
get
those
events
from
and
that
we
persist
them
to
apache
kafka
now
comes
the
part
of
stream
processing,
let's
deep
dive
into
it
right
now
on
the
market,
there
are
plenty
of
different
frameworks
and
libraries
which
are
actually
doing
stream
processing.
So
it's
up
to
you
to
decide
whichever
one
of
those
you
would
like
to
to
use,
but
before
maybe
discussing
them,
let's
define
what
actually
a
stream
is,
and
stream
has
two
important
characteristics.
First
of
all,
it
is
unbounded
flow
of
data.
It basically means that you are getting the events right now, you will keep getting them in the future, and there will be no end to that. You cannot say that at some point they will stop: it is unbounded, and you will simply keep getting them continuously. The second important characteristic is that it is happening in real time. You don't have to process it once an hour, once a day, or once a week; it is happening at any given point in time.
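In code, an unbounded stream is naturally modeled as something you iterate over without ever reaching the end. A Python generator makes a convenient stand-in (bounded artificially here with `islice` just so the example terminates):

```python
import itertools

# An "unbounded" stream of events: the generator never ends on its own.
def event_stream():
    for i in itertools.count():
        yield {"ride_id": i, "distance_km": (i % 5) + 1}

# A processor consumes the stream as events arrive; it never sees
# "all the data", only the events so far. We take 4 just to stop the demo.
processed = [e["distance_km"] * 1.5 for e in itertools.islice(event_stream(), 4)]
```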
Whenever we are talking about stream processing as a framework, there are two types of stream processors which you have to very clearly differentiate between. The first ones are so-called stateless processors: they process every single event independently of all the previous events, and they don't have any history whatsoever.
Examples of such stream processors can be, for example, some transformations, or maybe extracting some field from an event, or repacking the event into a different form: something which you can do for every event separately. The second ones are so-called stateful stream processors: whenever they process any new event, they also keep all the history of the previous events which they have processed. Examples here can be, for example, calculating some aggregations.
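A minimal sketch of the distinction (plain Python, not any particular framework): a stateless processor maps each event on its own, while a stateful one carries an accumulator across events:

```python
# Stateless: each event is transformed independently; no history kept.
def extract_distance(event: dict) -> float:
    return event["distance_km"]

# Stateful: a running aggregate depends on every event seen so far.
class RunningTotal:
    def __init__(self):
        self.total = 0.0  # the processor's state

    def process(self, event: dict) -> float:
        self.total += event["distance_km"]
        return self.total

events = [{"distance_km": 2.0}, {"distance_km": 3.5}, {"distance_km": 1.5}]

stateless_out = [extract_distance(e) for e in events]   # each value independent
agg = RunningTotal()
stateful_out = [agg.process(e) for e in events]          # each value depends on history
```

The practical consequence is that stateful processors need their state stored somewhere durable, which is what makes them harder to scale and to restart.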
The second one, KSQL, is a good tool for, let's call it, lightweight stream processing. But sometimes you might want to do some very sophisticated logic; it might be, for example, something related to your business logic, when you need to do some kind of very specific aggregations or transformations, and so on. In this case the Kafka ecosystem provides you with the Kafka Streams library. It allows you to define stream processing tasks as Java applications.
A
So
whatever
you
can
express
in
your
java
code,
you
can
do
with
your
stream
processing.
Very
good
news
is
that
both
of
the
frameworks
are
can
be
actually
extended.
If
you
need
to
implement
something
which
is
very,
very
much
tailored
towards
your
business,
you
can
write
the
so-called
user-defined
functions.
Basically, you define a Java function, and it can be whatever, like an aggregation, or it can do some business logic, and then you can call this function both from KSQL, like any SQL function, or from your Kafka Streams application. And it can be anything: as I already said, something very much related to your business logic, or it can also be some kind of fancy machine learning, used for fraud prevention, anomaly detection, anything.
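To illustrate the shape such a function takes (a hypothetical per-event check sketched in Python, not the actual Java UDF API): a user-defined function is essentially a function the engine calls for each event, optionally keeping some state of its own:

```python
# A hypothetical user-defined function: flag a ride price as anomalous
# if it exceeds the running mean by more than `threshold` times.
class PriceAnomalyUdf:
    def __init__(self, threshold: float = 3.0):
        self.count = 0
        self.mean = 0.0
        self.threshold = threshold

    def __call__(self, price: float) -> bool:
        is_anomaly = self.count > 0 and price > self.mean * self.threshold
        # Update the running mean incrementally, one event at a time.
        self.count += 1
        self.mean += (price - self.mean) / self.count
        return is_anomaly

udf = PriceAnomalyUdf()
flags = [udf(p) for p in [10.0, 12.0, 11.0, 90.0]]  # last price is an outlier
```

A registered UDF of this shape could then be invoked from a SQL statement or a streams topology just like any built-in function.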
So that's the setup we have adopted at Bolt. We ingest data from all the source databases, we persist it into the Kafka brokers, we do the stream processing with the libraries I have mentioned and also persist the results to Kafka, and then those results are consumed by a number of different consumers. We store some events in our data lake, and we also allow backend microservices to consume those events and do the business logic and decision making on top of them. Which problems did we encounter along the way?
That was not actually a problem, but it was one relatively big challenge and obstacle which we had to overcome.
The second problem: whenever you're working with stream processing, you can be getting hundreds of thousands of events per second, and if someone comes to you and says, hey, you know, this number is not actually correct: good luck debugging it, because it's really, really hard to find where this error is actually happening.
Let's say you have released some new logic and, a few days after that, you have realized that there was a mistake there. Now you can replay the history, replay those events, and make some adjustments to your business logic. But you should never think that you have to replay only some specific piece of data, because, like I said, a stream is an unbounded flow of data, so you should always think of it in the following way.
A
And
other
problems
which
we
have
seen
along
the
way
are
the
let's
say:
for
example,
data
deduplication
network
is
not
reliable,
different
services
can
fail,
and
eventually
it
would
lead
to
duplicating
some
of
the
events,
and
so
it
is
important,
so
your
consumers
are
prepared
for
that.
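The usual preparation is to make consumers idempotent, for example by remembering the IDs of events already processed and skipping repeats. A minimal sketch of that pattern (real systems typically keep the seen-set in a database, or lean on the broker's exactly-once features instead of an in-memory set):

```python
# An idempotent consumer: duplicate deliveries of the same event id
# are detected and skipped, so the effect happens only once.
class IdempotentConsumer:
    def __init__(self):
        self.seen_ids = set()
        self.applied = []

    def handle(self, event: dict) -> bool:
        if event["id"] in self.seen_ids:
            return False  # duplicate: already processed, skip
        self.seen_ids.add(event["id"])
        self.applied.append(event["payload"])
        return True

consumer = IdempotentConsumer()
deliveries = [
    {"id": "e1", "payload": "charge ride 1"},
    {"id": "e1", "payload": "charge ride 1"},  # redelivered after a retry
    {"id": "e2", "payload": "charge ride 2"},
]
results = [consumer.handle(e) for e in deliveries]
```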
A
Next,
one
is
types
in
compatibility
whenever
you
integrate
many
different
systems
and
make
them
talk
to
each
other,
you
have
to
be
very,
very
careful
so
that
they
process
all
the
data
and
types
the
same
way
and
last,
but
not
the
least
like
I
said
if
you
want
to
start
sourcing
the
events
from
your
databases.
You
should
also
think
about
how
do
you
handle
database
schema
migrations
like,
for
example,
adding
some
fields
to
the
table
or
maybe
changing
the
type
or
creating
new
tables.
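One common way consumers survive such migrations is the tolerant-reader pattern: read only the fields you need and supply defaults for fields added later. A sketch of the idea (in practice this is usually enforced with a schema registry and Avro or Protobuf compatibility rules rather than hand-written dict handling):

```python
# A tolerant reader: old and new versions of a "ride" record are both
# accepted; fields missing from older records get defaults.
def read_ride(record: dict) -> dict:
    return {
        "ride_id": record["ride_id"],                  # required in all versions
        "distance_km": record.get("distance_km", 0.0),
        "tip": record.get("tip", 0.0),                 # field added in a migration
    }

old_record = {"ride_id": 1, "distance_km": 4.2}              # pre-migration
new_record = {"ride_id": 2, "distance_km": 3.0, "tip": 1.5}  # post-migration

rides = [read_ride(old_record), read_ride(new_record)]
```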