Description
For more Continuous Delivery Foundation content, check out our blog: https://cd.foundation/blog/
Sharding Clouddriver with Stormdriver - Michael Graff, OpsMx
OpsMx's Stormdriver allows Clouddriver instances to be sharded in various ways, including using both the standard Java Clouddriver and the Go implementation for Kubernetes. This is a presentation on the challenges, success stories, and general ideas surrounding how to scale Clouddriver per account, cloud, or any other axis.
Hello, my name is Michael Graff, I work at OpsMx, and I would like to talk about a project I've been working on called Stormdriver. A little about me: once again, I work at OpsMx. I have some title there; I'm not so concerned about that. I love to write Go code these days; it's quite nice, quite easy, and the final product is quite small. I am on the Spinnaker Slack, and also on the CDF Slack under my name. Feel free to reach out to me with any questions or concerns, or ideally pull request topics.
OpsMx provides SaaS solutions for Spinnaker and Argo CD, as well as a number of other platforms. We do custom pipeline development and help you maintain it, and so on. We also do on-prem Spinnaker support, for both the open source version and our own internal version, and we are quite happy to discuss anything with anybody who has questions about that as well.
I want to make certain that I talk about Clouddriver first, to set the stage. What is Clouddriver? Clouddriver is the heart of Spinnaker, in that it is the component that actually speaks to the different clouds: it polls for infrastructure, and it also manipulates infrastructure. It also knows about artifact accounts, even though those are fairly unrelated. For example, there is a Git artifact account type, but there isn't a Git cloud account.
They just happen to be in the same service. Clouddriver knows about all the accounts, so currently, without Stormdriver, you can't shard by account. There is a way to shard Clouddriver, but I'll go into a little detail about why that doesn't really scale so well. It can really consume memory, and it is written in Java. It's a standard part of Spinnaker, it's required, and generally speaking it's automatically configured. You can shard Clouddriver based upon the actions it's going to take.
For example, you can have a caching Clouddriver component whose only job is to query the cloud accounts, update its data, and send it to a cache. The read-only instances then query that cache and get the data from there. So the read-only instances don't actually manipulate or look at the cloud; they only look at the cache. This does separate concerns by type of action, but every one of these instances still has to have the full list of accounts.
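As a rough sketch of that standard action-based split, assuming a Halyard-style per-service profile (the exact keys can vary by Spinnaker version), the caching instances keep cache writes enabled while the read-only instances disable them:

```yaml
# clouddriver-caching profile (illustrative):
# polls the clouds and writes results into the shared cache
caching:
  writeEnabled: true

---
# clouddriver-ro profile (illustrative):
# serves reads from the shared cache only, never polls the clouds
caching:
  writeEnabled: false
```

Note that both profiles still need the full account list configured, which is exactly the limitation discussed here.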
In some cases our customers have 5,000 accounts. If you have 5,000 accounts configured into Spinnaker, it's really difficult to do that with a single instance. It takes a lot of memory.
So how do you scale or shard Clouddriver so that you can split up the accounts? Well, you can't, and that's because every Clouddriver also has to know about every account. Large pods in Kubernetes aren't a good idea; it's better to have many smaller pods than one or two large ones. We've actually had customers who had to reduce features in order to make Clouddriver fit on the nodes that they had in Kubernetes.
Clouddriver can take 32 GB, and maybe the node only has 32 GB, so they'd have to somehow limit that memory down to something smaller. It's also a problem with Kubernetes that if you're the biggest pod on a node and that node runs out of memory, it's very likely going to murder that particular pod. So this division of labor by mode only gets you so far.
What Stormdriver does is allow us to shard per account. We still have a situation where the individual accounts have to be sized appropriately so they'll fit on a particular Clouddriver, but you could have a Clouddriver just for one account if it was a really big one. We've also noticed over time that most people now have small accounts, and large numbers of them.
Stormdriver is written in Go. In my testing, I set up one Clouddriver that had 100 AWS accounts, one that had 100 namespaces in Kubernetes, one that had no cloud accounts, just artifact accounts, and a fourth that had a mixture of AWS and Kubernetes accounts but no artifact accounts. During normal operation, across all my testing, Stormdriver took about 200 MB of RAM.
It's written to be fast, so it's written not to interfere. It doesn't query each of the Clouddrivers serially; it queries them in parallel and combines the responses. So the memory footprint is really based upon the total response size and the number of simultaneous queries going through Stormdriver.
It sits in the middle, between Orca, Gate, and Igor on one side and Clouddriver on the other. The Clouddriver source code is not modified, just the configuration. Clouddriver doesn't care who it's talking to, but Orca, Gate, and Igor's configuration is set up so that instead of talking to a Clouddriver, or an HA set of Clouddriver instances, they talk to Stormdriver, and Stormdriver is configured with the URLs for the various Clouddriver instances.
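As a sketch of that re-pointing (service names and keys here are illustrative, not taken from the talk), the consuming services override their Clouddriver base URL, and Stormdriver is given the real endpoints:

```yaml
# orca/gate/igor side (illustrative): point the clouddriver
# client at stormdriver instead of a clouddriver instance
clouddriver:
  baseUrl: http://stormdriver:7002

---
# stormdriver side (hypothetical config shape): the actual
# clouddriver instances to route and fan out to
clouddrivers:
  - name: java-aws
    url: http://clouddriver-aws:7002
  - name: go-kubernetes
    url: http://go-clouddriver:7002
```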
So why do this? Well, we needed to break it up. When we have 5,000 accounts in a single Clouddriver, it takes a long time to start, and even if we turn off the checking it does at startup, it can still take a significant amount of time to start. A configuration mistake, for example, might cause a very large downtime.
We would also prefer to have smaller pods, for all the reasons I covered before, and the standard sharded configuration isn't necessarily the best configuration: it doesn't necessarily add a lot of resilience, but it does add some complexity.
So how does this thing work? Well, it maintains an internal list of credentials and artifact credentials by polling every one of the Clouddriver instances, and it does that every 10 seconds. This is only for internal routing. It polls using a user that is currently anonymous, but configurable. That user has to be configured such that it has full access to read the list of all accounts.
It doesn't have to be able to modify anything, but the credentials response has to return everything, likewise the artifact credentials, and again, this is only for routing. We send that query to all the different Clouddriver instances at once and maintain an internal routing table. That routing table isn't used beyond internal routing, though; it's not returned to the client at any point.
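A minimal sketch of that internal routing table in Go (the names are mine, not Stormdriver's): each poll records which Clouddriver instance reported which account, and lookups consult only this map.

```go
package main

import (
	"fmt"
	"sync"
)

// routingTable maps account names to the base URL of the Clouddriver
// instance that reported them. It is rebuilt by a background poll
// (every 10 seconds in Stormdriver) and used only for routing; it is
// never returned to clients.
type routingTable struct {
	mu       sync.RWMutex
	accounts map[string]string // account name -> clouddriver base URL
}

func newRoutingTable() *routingTable {
	return &routingTable{accounts: map[string]string{}}
}

// update records the accounts one Clouddriver instance returned from
// its credentials endpoints.
func (t *routingTable) update(instanceURL string, accountNames []string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for _, name := range accountNames {
		t.accounts[name] = instanceURL
	}
}

// lookup returns the instance that owns the account, if known.
func (t *routingTable) lookup(account string) (string, bool) {
	t.mu.RLock()
	defer t.mu.RUnlock()
	url, ok := t.accounts[account]
	return url, ok
}

func main() {
	table := newRoutingTable()
	table.update("http://clouddriver-aws:7002", []string{"aws-prod", "aws-dev"})
	table.update("http://go-clouddriver:7002", []string{"k8s-prod"})
	if url, ok := table.lookup("aws-dev"); ok {
		fmt.Println(url) // http://clouddriver-aws:7002
	}
}
```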
We have POST and PUT requests, which are typically mutating requests, except for the PUT request for the artifact fetch, which is really just a fancy GET, and then we have unknown requests. Similar items are things like credentials and artifact credentials: we send out a query to every Clouddriver that we know about, in parallel, and whenever they respond or time out, we respond with the superset of the data, merged across the returned maps.
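That parallel fan-out and merge can be sketched like this (a simplification with assumed names; the real thing issues HTTP requests with timeouts): query every instance concurrently and return the combined superset.

```go
package main

import (
	"fmt"
	"sync"
)

// fetch retrieves one Clouddriver's answer for a list-style request;
// in Stormdriver this would be an HTTP GET with a timeout, returning
// nil on failure or timeout.
type fetch func() []map[string]any

// scatterGather runs every fetch in parallel and merges the results
// into one combined list: the superset of everyone's data.
func scatterGather(fetches []fetch) []map[string]any {
	var (
		mu       sync.Mutex
		combined []map[string]any
		wg       sync.WaitGroup
	)
	for _, f := range fetches {
		wg.Add(1)
		go func(f fetch) {
			defer wg.Done()
			items := f() // one instance's response
			mu.Lock()
			combined = append(combined, items...)
			mu.Unlock()
		}(f)
	}
	wg.Wait()
	return combined
}

func main() {
	a := func() []map[string]any { return []map[string]any{{"name": "aws-prod"}} }
	b := func() []map[string]any { return []map[string]any{{"name": "k8s-prod"}} }
	fmt.Println(len(scatterGather([]fetch{a, b}))) // 2
}
```

Because the fan-out is concurrent, the caller waits roughly as long as the slowest instance rather than the sum of all of them, which is why the added latency stays small.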
It's the same idea: we scatter-gather, send out the request to everybody, and combine the results. For singletons, we know who to send the request to. This is, for example: in this account, give me the details of this pod; or, in this account, give me the details of this VM instance. Then we know exactly where to send the request, so we can route it directly there. In some cases we don't know, so we ask everybody, and whoever returns interesting data, a non-empty response, for example, is the one we use.
If we get back an empty object, we treat that as a possible response, and if we get back something better, something with actual data, then we return that instead. A common case here is tasks, where we don't track who's operating the task. When we do a POST request, we get back an ID and send that to Orca; Orca will then poll, and we don't know where that ID belongs.
So we just ask everybody. It's fairly quick, so it doesn't really add much latency. Then there's pagination. I'm not going to cover pagination in depth, but the API is a little bit strange, so we just kind of punt on this. What we do is limit the number of responses to 2,000, and if we get more than that, we just stop keeping track of them and send that response back.
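That punt amounts to a hard cap while merging (2,000 here, as in the talk; the helper name is mine). A sketch:

```go
package main

import "fmt"

// maxResults is the cap described in the talk: anything past it is
// dropped rather than paginated.
const maxResults = 2000

// capMerge appends items from each instance's response until the cap
// is reached, then stops keeping track of the rest.
func capMerge(responses [][]string) []string {
	combined := make([]string, 0, maxResults)
	for _, items := range responses {
		for _, item := range items {
			if len(combined) >= maxResults {
				return combined
			}
			combined = append(combined, item)
		}
	}
	return combined
}

func main() {
	big := make([]string, 1500)
	// two instances returning 1500 items each still cap at 2000
	fmt.Println(len(capMerge([][]string{big, big}))) // 2000
}
```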
This works just fine without pagination: we send back 2,000, and if you ask for page two, you get nothing. POSTs and PUTs are handled because we know exactly which artifact account or which cloud account to send the request to, so we send it directly there. And unknown requests, well, we handle those differently.
If it's a GET request and we don't know what the URL is for, we pick one of the instances at random and send it there. That is logged, but past development time I've never seen that log message occur, unless I mistype a URL in a manual curl command; then, of course, I see it. POST and PUT requests, DELETE, and all the other HTTP verbs that can be used to modify state are handled in such a way that we will reject them if we don't know what they're for. This also includes Clouddriver account types that we don't know about right now.
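The unknown-request policy reduces to a check on the HTTP method (a sketch with assumed names; the random pick stands in for Stormdriver's real selection):

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"net/http"
)

// routeUnknown decides what to do with a request whose URL has no
// routing rule: safe GET-style requests go to a random instance (and
// are logged), while anything that could mutate state is rejected.
func routeUnknown(method string, instances []string) (string, error) {
	switch method {
	case http.MethodGet, http.MethodHead:
		target := instances[rand.Intn(len(instances))]
		fmt.Printf("unknown GET routed randomly to %s\n", target)
		return target, nil
	default:
		// POST, PUT, DELETE, PATCH, ...: refuse rather than guess.
		return "", errors.New("unknown mutating request rejected")
	}
}

func main() {
	instances := []string{"http://clouddriver-a:7002", "http://clouddriver-b:7002"}
	if _, err := routeUnknown(http.MethodPost, instances); err != nil {
		fmt.Println(err)
	}
}
```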
Right now we support AWS fully and Kubernetes fully; I believe Azure and GCP work, but I have not tested those fully at this point. Those are the only four that Stormdriver currently supports. It's very easy to add others; I just don't have access to test them.
So, security. Well, the internal routing table that we use is never released: we use it for routing, but we don't use it in any responses. We respect the X-SPINNAKER-USER header, which is set by all the other components when they call us, and we pass that header along when we do the scatter-gather requests or the individual requests. That way the RBAC model is preserved; Stormdriver itself doesn't have to care about the RBAC model.
Another really nice advantage of this is that we can use Go Clouddriver, and there's a little bit of information about it here. I won't go over it in great detail, but it allows us to use a Kubernetes-only Clouddriver, written in Go, that's much leaner than the Java one. It takes much less time to do its tasks, it goes directly to the cluster instead of caching and having a delay, and you can run multiple copies of it.
It shares state through a database, so it's very, very resilient, and overall it's just a much easier Kubernetes-only Clouddriver to work with. But it does have limitations, and therefore we want to mix: we have some things we use the Java side for, and some things we use Go Clouddriver for in production. Stormdriver allows us to mix and match those, and put accounts in whichever place we think is most appropriate.
So, our findings. Well, once again, we haven't really been using the deployment split very much recently, because it doesn't add as much value anymore.
We are able to mix in Go Clouddriver, and possibly other single-purpose Clouddriver implementations in the future, and programming in Go is quite fun. I don't actually have a demo; I have an anti-demo, and the reason I call it an anti-demo is that it's really hard to demo an API, so instead I'm going to show some traces.
This is using Jaeger, and it shows the various responses. The orange responses are the important ones. On the other ones, the durations are valid, but the times are slightly offset for some reason, probably because we're in a different time zone; for whatever reason, the clocks seemed slightly off. You can see that the overall query that Stormdriver issued here, to fetch a list of credentials, was 11.75 milliseconds, and the longest request from Stormdriver to one of the Clouddrivers was 11.42, so the overhead is only about 0.3 milliseconds.
So that's quite good, and that is the internal query that we issue; overall, very little latency is added to the responses. This is a real-life example where Igor is asking us for a list of credentials. We added three milliseconds, and I'm willing to take a three-millisecond hit on a query like this in order to make certain that we combine the responses properly. So where can you get this thing? Well, you can get it from github.com/OpsMx/stormdriver.
It's under an Apache 2.0 license, and I'm happy to take pull requests. Reach out on Slack, send me email, whatever you'd like to do, and I am happy to discuss. Thank you very much.