From YouTube: Cloud Tech Thursdays: Scaling OpenStack at CERN
Description
Cloud Tech Thursday explores the full modern open source cloud stack, from hardware to serverless. Learn about new ideas, projects, and releases around Kubernetes, OpenStack, hybrid cloud enablement, and many other topics.
This episode: How OpenStack is scaled to meet the needs of CERN
B
Good, let me go ahead and introduce my compatriots. First, we have Josh Berkus, who is the Kubernetes community person here at Red Hat. We have Mike Perez, who is the Ceph storage community architect, and myself, Amy Marrich, who is the OpenStack community person here at Red Hat. And we are very pleased to announce that today we have Belmiro Moreira from CERN to talk about scaling the OpenStack cloud at CERN. Belmiro?
C
So hello, my name is Belmiro Moreira. I'm a computer engineer at CERN; I joined around 12 years ago.
It's not that short. Okay, just slides, fair. Enough for the hour; I can start sharing.
So that's why I'm going to go very fast through this one. All right, so yeah, I can start. So this session is about how we scale OpenStack at CERN. Currently we run thousands of nodes and thousands of virtual machines, and we'll go through the steps from the beginning in 2013 to today. But maybe the audience is not familiar with CERN, so in the next couple of slides I'll give you an overview of the organization and the role of the CERN cloud infrastructure in it.
So CERN is the European Organization for Nuclear Research. It was established in 1954, initially with only 12 member states, and this number has been growing over the years; currently there are 23 member states. CERN sits on the border between France and Switzerland, very, very close to Geneva.
The accelerator complex at CERN is a succession of different machines, accelerators that accelerate particle beams to higher and higher energies, very close to the speed of light. The LHC is the one you can see clearly in the satellite picture. It is CERN's largest accelerator; it's also the world's largest accelerator, with a 27-kilometer circumference, and it crosses two countries, France and Switzerland. For comparison, you can see the Geneva airport here, to give an idea of the size of this machine.
So these are the experiments: ATLAS, CMS, LHCb and ALICE. These are the particle detectors where the collisions occur. These machines are huge: they are up to 45 meters long, 25 meters in diameter and more than 12,000 tons, and, of course, everything is 100 meters underground. Mike visited this some time ago.
A detector is basically a digital camera, but one that can take up to 40 million pictures per second. This produces up to one petabyte of raw data every second. Of course, we cannot handle all this data; our storage systems don't support this. So what physicists do is have triggers in the experiments that try to identify the interesting events in real time, and everything else is discarded.
And with all these pictures, physicists can have a representation of the collision events. The analysis of all this data gives physicists insights into how the particles interact. But detectors are not only underground at the CERN site; they are also in space. This is AMS, the Alpha Magnetic Spectrometer, which was installed on the International Space Station in 2011 to measure antimatter and cosmic rays and to search for dark matter.
Over 90% of the compute resources in the data center are provided through CERN's OpenStack private cloud. To understand the motivation for building a private cloud, we need to go back to the beginning, so 2009 to 2011, and then we will see the evolution of the cloud infrastructure over the years and some of our architecture decisions.
This data center has two floors; this is one of the floors. One of the limitations that we have in this data center is the power capacity: currently it has a power capacity of 4 megawatts, and it is not easy to extend the data center. That's why, if you visit the data center now, you'll see that most of the racks are not completely full, where usually they would be all full; power constraints are one of the reasons.
From 2013 to 2019, over six years, we ran only compute nodes there for the OpenStack cloud; all the compute nodes for processing were for the OpenStack cloud. This is in Hungary, and it was a huge challenge for us, because when we launched our cloud infrastructure in production we had two different locations, one at CERN in Geneva and the other in Hungary.
So the challenge was not only to deploy OpenStack at that time, but also to run these different locations transparently, and we're going to talk a little bit more about this later. And these are other locations where we run our cloud infrastructure: these are new compute containers with high density for computing, and you can see some of them when they were installed, and the cooling hardware being installed. Right, so this is one of our dashboards for monitoring, and you can see the size of our current cloud infrastructure.
So we have around 300,000 cores in the cloud, 3,400 users, more than 4,000 projects, around 30,000 virtual machines. This changed a lot: we are in the process of decommissioning a lot of hardware because we are replacing it. That's why you see this big drop in compute nodes and in the number of VMs at the beginning of the month.
We also have a lot of services in the cloud. We have Ironic to provision bare metal; we have around 8,000 bare metal nodes. Magnum clusters: usually most of these clusters are Kubernetes, more than 600 of them. And also volumes from Cinder: you can see that we have a lot of block storage, more than three petabytes.
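For illustration, a minimal sketch of the Magnum workflow mentioned here, using the standard OpenStack CLI; the template and keypair names are placeholders, not CERN's actual values:

    # Create a Kubernetes cluster through Magnum and list existing clusters
    openstack coe cluster create my-k8s \
        --cluster-template kubernetes-template \
        --node-count 3 \
        --keypair mykey
    openstack coe cluster list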
D
So, a quick question from the audience, kind of unrelated, which is: can you direct our audience member to where they can look at jobs that are available at CERN?
C
Yeah, there is a web page for jobs. I think if you search for CERN jobs, you will immediately find it.
All right, so going back to 2011. That was a period of change. We were in a period where our computing requirements were increasing a lot; the LHC was running, there was a need for more computing resources, and we had these power constraints in the computer center in Geneva. So we needed expansion options at a different site.
So that's why CERN opened an international tender to all the member states to have another data center, and the Hungarian bid won. That's why we got that data center in Hungary. The project started in 2011, and the data center was ready in 2013, just in time for the launch of our private OpenStack cloud infrastructure.
It was a time when there was nothing available to manage a data center of our size, back in the early 2000s, so we needed to build all these tools ourselves. However, the reality in 2011 was completely different: there were a lot of open source projects that were definitely doing a better job, with much more functionality than the tools we made in-house. And then the other problem, when an organization builds its own tools, is this:
attracting people to work on those tools is actually very difficult, because they only add value inside the organization. When you arrive, you need to learn them, and if you leave, that knowledge is not interesting for other organizations and companies. So it was time for us to really adopt the open source tools available to manage our data center, and we started looking at all the options available. So why build a cloud infrastructure?
At that time, everything was running on physical machines, and it was a complete shift, not only in the way we managed the data center, but also for all the users. We needed to tell them: well, now we need to transfer these workloads to virtual machines. As you can imagine, a lot of them liked to have their machines at CERN under their own control, and this was a huge cultural change.
It brings a lot of advantages, and something that was quite easy to sell was the improved responsiveness, because at that time, if someone needed a machine, they needed to fill in a lot of forms, and maybe after a few weeks or even months they would get the physical machine to work on. Having a cloud infrastructure with an API and self-service, they could immediately take that machine. So we started identifying the tool chain, and one of the things we clearly needed was a configuration management tool. There were a few options at that time.
At that time we decided on Puppet, and that is what we've been using since then. Puppet is not only used to configure the OpenStack infrastructure; it is used across the organization to configure all the IT services, for example the monitoring tools. And there are a lot of open source projects that were being defined then and that we are using today, such as Kibana, Elasticsearch, Collectd and Fluentd, which we use to manage not only the OpenStack resources but all the resources in the data center, alongside the cloud management tool.
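For illustration, a minimal sketch of how the upstream puppet-openstack modules are typically used; the nova_config provider below ships with the puppet-nova module, while the option values are placeholders, not CERN's actual settings:

    # Manage individual nova.conf options declaratively with the
    # nova_config provider from puppet-nova (placeholder values):
    nova_config {
      'DEFAULT/transport_url': value => 'rabbit://nova:secret@rabbit.example.org/';
      'database/connection':   value => 'mysql+pymysql://nova:secret@db.example.org/nova';
    }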
So it was a time when there was not much available, and when we started really looking into this, in 2009, I started looking at OpenNebula. OpenNebula was an open source tool; actually, we started looking at OpenNebula to virtualize the batch system, and we were quite successful. We did huge scalability tests, because one of the concerns at that time was whether these open source tools were able to scale to our needs, and we were able to create more than 15,000 virtual machines.
That was one of the options. But also, beginning in 2006, CERN was managing a small virtualization service built on top of Microsoft System Center Virtual Machine Manager, where the CERN team basically built a web interface on top of it. It was a basic web interface where CERN users could go and create virtual machines, selecting an image, and that was basically it. There was no API interaction; it was only that web interface.
It was just a virtualization tool, but it was quite popular: in 2011 it had thousands of virtual machines running in that Microsoft infrastructure. But that was the time when OpenStack was released, in 2010, and that was a game changer. With all the industry support behind this new cloud tool, I think it was clear from the beginning that the right choice for us was to invest in OpenStack, to understand this tool and to join the community.
You can find the presentation at this link. It's quite funny now, going back all these years and seeing this presentation.
CloudStack was not there yet; it came only after. Eucalyptus was an option, but there were not a lot of deployments running Eucalyptus, at least at our scale, so it was never considered a good option for us.
We believed that Nova was complicated at that time, with those diagrams; well, when you see it today, it's completely different. It was also a time when there were only two projects, Swift and Nova, nothing else. Glance, I think, only became available in Bexar or Cactus. So everything was Nova; even to create users, you needed to do nova-manage user create, something like that.
All right, so this was 2011, and we needed to get our hands dirty on this. So we created several prototypes, and the goal was to add functionality with each different prototype. We started with what we called Guppy, because it is a very small and fragile animal, and you see that the animals got stronger over time, with more functionality.
Well, the first prototype was deployed with Fedora 16. Why? Because at that time there was the Fedora Cloud SIG team that released the RPMs for Fedora.
Then, in our OpenNebula tests, we had always been using Xen, but that was a pivotal moment for KVM as well: it was integrated in RHEL 6, I believe, and it was the only one supported there.
We started from the beginning using the OpenStack Puppet modules, and actually we helped develop the initial Puppet modules with Puppet Labs; those were fun times. That was just the initial testing. We then went to a different release, and this was always closed, only for us to test, and you see that by this time we had already moved to CentOS 6 and Scientific Linux 6, and also a preview release.
So you see the challenge: a completely new product, trying to scale it and to move our own infrastructure to OpenStack, two data centers and two different virtualization technologies. Keystone LDAP integration was also tried during this version; you can imagine that we have a huge Active Directory.
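For illustration, a minimal sketch of an LDAP identity backend in keystone.conf; the URL and DNs are placeholders, not CERN's directory, while the option names are the standard [ldap] ones:

    [identity]
    driver = ldap

    [ldap]
    # Placeholder directory endpoint and search base
    url = ldap://ldap.example.org
    suffix = dc=example,dc=org
    user_tree_dn = ou=users,dc=example,dc=org
    user_objectclass = inetOrgPerson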
Then, the last prototype we opened to some of our community. We tried to put all of the services into HA, and we had more than 600 compute nodes in this prototype.
This was already the beginning of 2013, and basically we launched our cloud infrastructure in July 2013. From the beginning, it was clear that if we were getting serious about this, we needed to engage with the OpenStack community, because there was no way that we alone would be able to solve all the issues; we would need the help of the community. So you see that from early on we started attending meetups, and also helping the community by organizing meetups. This one was at CERN at the end of 2013, and this was my first OpenStack Summit.
Right, so the CERN cloud infrastructure. We started using Scientific Linux 6; that was in 2013. Later we moved to CentOS 7.
From the beginning we have been using the RDO packaging, and it's great to have all these packages and all the best things that Red Hat does. However, we still have projects where we need some internal packages, so we need to rebuild all of this. And from the beginning we have been using the upstream Puppet modules for OpenStack.
So, some considerations when we started building this. The number of compute nodes: we started very small, with only a few hundred compute nodes, but we knew that we wanted to move the whole data center to OpenStack, so at the end it would be a few thousand compute nodes. Was this tool able to scale to those numbers at that time?
If you look back, there were not a lot of big sites using OpenStack, so that was always a concern. Different locations: the data centers that I just mentioned. Then the number of OpenStack projects: that was a time when a new OpenStack project was popping up every week, and it was very hard to follow all of this. And then there were all these splits of functionality that were also happening, for example nova-volume moving to Cinder and nova-network moving to Quantum.
There is always this constant movement of people at CERN, so we needed to automate all of this: automation of project creation and, when people leave the organization, project removal; all this automation needed to be invented. And then, of course, since we have a large infrastructure, there is all the automation needed to manage the infrastructure itself; all the procedures just needed to be figured out, because everything was new. So, the kind of workloads that we run in the infrastructure: mainly it's physics data analysis.
So all the data from the LHC experiments and many other experiments; IT services; and then much other infrastructure that is required for the organization: services to run the experiments themselves, engineering services to develop different tools for the experiments, and also personal VMs. Any user at CERN has the possibility to have a project and run their desktops, their personal VMs, in the infrastructure.
First we have the cattle: these virtual machines are ephemeral, because they are only processing jobs, so things like live migration are not really interesting for this kind of virtual machine.
Then we have the pets: all the service VMs, where performance is less important, but what is really important is that we can keep these virtual machines running, and live migration is a huge requirement for all of those. It's a mix of operating systems that we have here; we have a lot of Windows VMs as well.
So 2013 is when we finally opened the private cloud infrastructure to our users in the organization. We started with two cells, and cells at that time were a quite new concept; not a lot of people were using them. We decided to use cells instead of regions, even though we had two different data centers, mainly because we wanted it to be as easy as possible for the users to migrate their workloads to the infrastructure. Most of these users are not computer scientists, but they need to manage their applications; physicists that have their projects.
We wanted it to be as easy as possible for them to move their workloads from physical servers to the cloud infrastructure, so that's why we wanted to reduce the number of concepts as much as possible.
We didn't want to have a single point of failure. We tried it at some point; it was a bad idea and we moved back some time later, after Glance got a Ceph backend. But at the beginning, really in 2013, all the images were actually stored on AFS, because the Ceph cluster at that time was not ready.
I think nothing special, a very, very common architecture. So this is the diagram; you can see that we have two cells, Geneva and Wigner, and these are all the services that were running at that time. For Ceilometer we were running MongoDB for the database.
There was no Gnocchi at that time. StackTach: we started running that at that time, and it was very good to have a perspective on the service. And then we kept more or less the same architecture. This is the top cell. The architecture of cells v1 is completely different from cells v2, so you may not recognize the services that we have here from the current architecture of OpenStack.
So this is the VM growth since we launched for our users in July 2013; you can see the cumulative number of VMs that were created in the cloud. This is only until April 2017, but the pattern stays the same after 2017. And this is the number of VMs growing. So you see that this was very well adopted by our community, and also we worked quite hard to basically move all the physical nodes from the data center to the infrastructure.
All the servers that are dedicated to computing were converted. However, not everything in the data center was converted to compute nodes, and not everything runs on top of OpenStack. One example is storage: it doesn't make sense to run the storage service on top of OpenStack, so those continue as bare metal machines, managed by the storage team.
Right, so Ironic is a quite recent product in our cloud: if I remember well, it's been available since 2018 or 2019. Ironic was a requirement for us, because some of the use cases that we had didn't fit well in virtual machines. Some people really needed huge virtual machines, full-node virtual machines, so it didn't make a lot of sense to virtualize that environment for them, because they were losing a little bit of performance.
So we deployed bare metal, the Ironic service. Initially our goal was to have an API for the users to interact with bare metal the same way they interacted with virtual machines. But we had bigger goals for Ironic: not only having a pool of bare metal nodes for people to use, but also changing all the workflows in the data center. Our goal at that time was to manage all the resources in the data center using Ironic, including the compute nodes.
So currently all the compute nodes in the infrastructure are managed by Ironic. We have this kind of inception, and we have a lot of it in our infrastructure.
Okay, so, as I said, scale implies simplicity: if you know from the beginning that you are going to manage thousands of nodes, the architecture needs to be simple. So this is an overview of our architecture. Something that we decided from the beginning was to isolate the different OpenStack services. We don't have just a few physical nodes, like most people, say three physical nodes, and run all the OpenStack control plane on those nodes.
We try to distribute as much as possible. So we have machines that only run Keystone; we have around 16 of them. We have machines that only run the Glance API, Neutron, and so on. Why? Because all this isolation allows us to upgrade all these different OpenStack components independently, and allows us to focus on one problem at a time. Then for Nova,
we also have this kind of architecture: we run all the APIs completely isolated, and then we have the level one, the main control plane with cell zero, where we have the schedulers and the conductors.
We have one independent RabbitMQ instance per cell, and again this is to have that isolation. Then we have the cells themselves, and they are very, very simple: they have their control plane, again isolated, and then all the compute nodes. What this represents is that we have one control plane, only one server acting as the control plane, for around 200 compute nodes, and in total we have around 80 cells.
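For illustration, a minimal sketch of how one such cell is registered in Nova cells v2, with its own message queue and database; the names and URLs are placeholders, not CERN's actual endpoints:

    # Register a new cell pointing at its dedicated RabbitMQ and MySQL,
    # then list the cells the API level knows about
    nova-manage cell_v2 create_cell \
        --name cell42 \
        --transport-url rabbit://nova:secret@rabbit-cell42.example.org:5672/ \
        --database_connection mysql+pymysql://nova:secret@db-cell42.example.org/nova
    nova-manage cell_v2 list_cells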
E
Yes, can you talk about the benefits of taking the cells approach of isolating different services within child cells, like how did you reach that requirement in your infrastructure?
C
Yeah, I think this slide is good for this; let me explain it. So we decided to go the cells route because, at the beginning, we didn't want to expose this region concept to our users, but actually cells have a lot of benefits.
Basically, they can act as failure domains; also, they allow you to configure the servers in a particular cell in a particular way, and we use those advantages basically to deploy the infrastructure. For example, you can see here that the availability zones are basically sets of different cells.
C
Meaning
that
if
this
cell
goes
down,
the
variability
zone
is
not
completely
down
it's
just
graded,
okay,
and
because
we
have
so
many
cells
means
that
the
control
plane
that
we
have
for
each
one.
It's
only
one
server,
meaning
that
if
that
server
goes
down,
the
workloads
continue
to
run,
because
it's
only
the
apis
for
this
particular
cell.
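For illustration, availability zones in Nova are exposed through host aggregates; a minimal sketch with placeholder names, not CERN's actual cells or hosts:

    # Create an aggregate exposed as availability zone "zone-a" and add a
    # cell's hypervisor to it
    openstack aggregate create --zone zone-a agg-cell42
    openstack aggregate add host agg-cell42 compute-42-001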
But what I would like to show with this slide is that we also run all these services that I showed you, Keystone, Glance, Neutron, all the OpenStack control plane, on top of the cloud itself. The control plane doesn't run on physical machines dedicated to the control plane, and it doesn't run in a different cloud that exists only for the control plane: it runs in the cloud that it manages.
So that's why we have this inception again. It's like Ironic, as I mentioned earlier, which also manages the compute nodes, even though it's an OpenStack project.
So you see that on the servers, side by side with the user VMs, we have Keystone VMs, Glance VMs and so on. Keystone, for example, and all the other services are distributed between the different availability zones, the same availability zones that we give to our users, and also distributed between the different cells. That's how we achieve high availability for the control plane.
All right. So between CERN and Wigner it's 1,600 kilometers, and that translates to around 24 milliseconds of latency. At the beginning we were trying not only to figure out how to set up the new data center, but also how to set up OpenStack on top of it, and then we had this latency issue as well.
C
What
you
see
in
this
slide
as
well
is
the
connections
between
vigner
and
cern.
So
we
add,
two
network
links,
100
gigabits
bandwidth
between
the
two
centers
completely
redundant.
So
that's
why
you
see
two
and
after
a
few
years
we
also
added
a
third
one.
So
we
add
a
connection
with
with
a
total
of
300
gigabits
per
second
between
the
two
centers.
C
And
this
is
what
basically
for
us
this
was
like
connecting.
It
was
a
cloud
interconnect
with
the
peering
networks,
because
it
was
the
same
network
for
people
that
are
used
to
public
clouds.
Of course, having the data center there had some architecture implications. For example, the databases: we started by running the databases in Geneva, but the latency was very high, and that was a time when we didn't have nova-conductor, so all the compute nodes were connecting directly to the databases.
Another thing was Ceph, because the Ceph cluster was in Geneva, so at the beginning, because of the latency, it was a very bad experience for users to have block storage in Wigner. So in 2015 the storage team deployed a Ceph cluster in Wigner for these use cases: block storage for the cloud and other things, for example a Glance cache.
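For illustration, a minimal sketch of how a Ceph cluster is exposed to Cinder as a block storage backend; the backend, pool and user names are placeholders, not CERN's actual configuration:

    [ceph-wigner]
    volume_backend_name = ceph-wigner
    volume_driver = cinder.volume.drivers.rbd.RBDDriver
    rbd_pool = volumes
    rbd_user = cinder
    rbd_ceph_conf = /etc/ceph/ceph-wigner.conf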
So the Wigner data center was operational, with CERN using it, between 2013 and 2019. We knew this from the beginning: it was a contract for only four or five years, which was extended by one more year. By the end of 2018 we were running 17 cells there, around three thousand compute nodes, in the availability zones we had there. So in early 2019 we started decommissioning the cloud in the Wigner data center and, as you can imagine, that was another challenge.
So we needed to remove all these cells from the infrastructure, and this was completed in November 2019. One interesting part is that 2,500 servers actually returned to Geneva, because they were a late purchase and were still very good servers; these servers were the ones added to the compute containers that I showed you in the picture at the beginning.
All right, so cells versus regions. I list here some of the advantages of cells and of regions, to try to make it clearer why we went for cells at the beginning. Basically, cells are a way to shard the Nova deployment; they only apply to Nova.
There is no other service that has cells, and that is actually an issue. Cells isolate failure domains, and they are completely transparent to users. One of the nice things is that they are also a logical partition for operators: they allow us to have different configurations for particular cells and to distinguish different cells with different configurations, which is important for us, for example for the batch use case, which has a completely different configuration compared with the services cells.
Regions, on the other hand, give you fault tolerance: a region is a completely different environment that is managed in a completely isolated way. That is the big advantage, and actually now we are running multiple regions; by multiple I mean three. In 2013, for us, it was simpler to manage one small cloud than two small clouds.
However, when the infrastructure grew to this point, it's actually simpler to manage two or three small clouds than one big one. One of the main reasons that we also moved to regions is Neutron: Neutron doesn't have this logical partitioning with cells, and it was a big point of failure. If Neutron was down, or anything was affecting Neutron, it was visible in the whole cloud.
All the users would see it, and partitioning Neutron between two or three different regions allowed us to improve the reliability of the cloud a lot. Neutron agents are quite chatty with the RabbitMQ cluster, so it needs to be a very big RabbitMQ cluster, and that is always an issue to maintain. Also, the regions that we have now are per use case; one region, for example, is focused on the IT services and user VMs.
Well, we were forced to upgrade to Neutron, because nova-network is not supported anymore. However, in the old regions we are still running nova-network: we have six cells where we still run nova-network, because we are still evaluating how we are going to migrate. It is quite scary, migrating from nova-network to Neutron without interruption of the VMs.
Just thinking that the VMs could lose network connectivity for some time is very scary for us, and those cells are running important services for the organization. So that's why we are still trying to figure out how to do it in the right way.
There is no workload interruption; it's just that people are not able to connect to their VMs using the OpenStack APIs, or to do OpenStack operations using the APIs, in that particular cell. However, this simplifies the deployment a lot, because we have around 80 cells: if we had high availability for all these cells' control planes, we would have a lot of control plane to manage. And then, as I already mentioned, we also run the control plane on top of the cloud itself.
RabbitMQ is very challenging to scale and maintain, so we try not to run RabbitMQ clusters at all. What we find is that if we have very small RabbitMQ instances, like we have per cell, RabbitMQ is quite stable, and not having the complication of RabbitMQ clusters simplifies deployment and operations a lot.
Do you want me to go faster? I don't know if we have a hard stop.
A
I mean, we're fine on time. You can go over if you want, no problem.
C
Okay, I'll try to go faster now. So, MySQL databases: again, like RabbitMQ, we don't have a cluster for the MySQL databases. We have independent MySQL instances, and the funny thing is that most of them run on top of the cloud infrastructure.
So, OpenStack. When we have a cloud infrastructure where we want to have a lot of functionality, that translates into a lot of OpenStack projects. These are the OpenStack projects that we currently run, and you can see the versions that we have now in our cloud. They are at different versions, and it is this isolation of the deployments that allows us to do this. Nova is still on Stein; one of the main reasons we are still on Stein is that we still have those cells running nova-network, and upgrading
now is very, very risky. We have a lot of patches for nova-network to continue to run. But you see that most of these services are running very, very recent releases, and that is one thing we always try to do: keep up with the OpenStack release cycle.
Having so many OpenStack projects and managing all of this is always tricky, since most of them manage thousands of resources; for example, Cinder manages thousands of volumes with petabytes of storage behind them. Ironic, for example: we now have around 8,000 nodes managed by Ironic, and we started reaching scalability issues in Ironic.
So again, that is one of those things that you get when you are scaling the infrastructure. Fortunately there is this functionality, conductor groups, which is more or less like Nova cells in Ironic, and now we are taking advantage of it, logically splitting the Ironic deployment.
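For illustration, a minimal sketch of how conductor groups partition an Ironic deployment; the group and node names are placeholders, not CERN's actual layout:

    # ironic.conf on the conductors serving one partition:
    [conductor]
    conductor_group = group-a

    # Pin a node to that conductor group:
    openstack baremetal node set <node-uuid> --conductor-group group-a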
Scale also means staging. Even if we try to upgrade most of the services every six months, the configuration is always changing; the configuration through the Puppet modules is always changing. So having CI/CD, testing
everything before we deploy to production, is quite important. We have a staging process: everything goes first to pre-stage, tested on a small number of nodes, then QA, and then through different maturity levels until it reaches everything in the infrastructure. Teststack is what we call our testing infrastructure: very few nodes, to test upgrades and new configuration options.
Scale also translates into automation. We are using several projects for automation, for example Rally, to probe the infrastructure: every day Rally deploys thousands of virtual machines in the infrastructure, just to make sure that every cell is okay and everything is running as intended.
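For illustration, a minimal sketch of such a probe as a Rally task; NovaServers.boot_and_delete_server is a standard Rally scenario, while the flavor, image and counts are placeholders:

    # probe.yaml, run with: rally task start probe.yaml
    NovaServers.boot_and_delete_server:
      - args:
          flavor:
            name: "m1.small"
          image:
            name: "cirros"
        runner:
          type: constant
          times: 10
          concurrency: 2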
Rundeck is a project that we use a lot for operations. For example, we have different teams, and the repair team, for example, doesn't have access to the OpenStack resources; however, we have all these procedures as Rundeck jobs that they can trigger. For example, when a node needs a repair, they can trigger a job that will basically try to live-migrate all the instances on that node, and notify the users if that is not possible and the node needs a repair intervention. And then we have Mistral.
Mistral is also an OpenStack project, and we use it for workflows: for example, for all the project creations, and for all the project removals when a user leaves the organization, going through all the user's resources and making sure that they are deleted. All of this is automated.
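For illustration, a minimal sketch of a Mistral workflow of that flavor; the workflow itself is hypothetical, while nova.servers_list and std.echo are standard Mistral actions:

    version: '2.0'
    cleanup_user_project:
      input:
        - project_id
      tasks:
        list_servers:
          # Collect the servers that would need to be deleted
          action: nova.servers_list
          publish:
            servers: <% task().result %>
          on-success: report
        report:
          action: std.echo output="servers collected for <% $.project_id %>"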
Scale also means permanent change. Upgrades: the OpenStack release cycle is every six months, and we run 15 OpenStack projects, so, as you can imagine, almost every day is an upgrade day for us. Then we also have the operating system distribution upgrades: we started with Scientific Linux 6, and
there is no easy way to move from 6 to 7; in our case it required reinstallation. Now we are facing again the move from CentOS 7 to CentOS 8 and CentOS Stream, and we are working on this. Hardware decommissioning: around every five years compute nodes need to be decommissioned and, as you can imagine, a lot of live migrations need to happen to try to make this transparent to the users.
Recently we live-migrated around 900 virtual machines because we were decommissioning some cells, and we continue to do it. This is a lot of work; we wrote a recent blog post, and you can follow our work there.
Security: as you all know, Meltdown and Spectre a couple of years ago created a lot of fuss. We needed to actually reboot most of our cloud infrastructure because of this, and also disable hyper-threading, reducing the number of cores available. Operations like these, when you have thousands of nodes, take a lot of planning and a lot of work. Currently, for kernel upgrades,
we are trying to automate this, because these compute nodes run for years, and upgrading the kernel is quite difficult without disrupting the users. So we are trying to automate it: basically having a tool that continuously live-migrates instances in the infrastructure and, when a compute node is empty, just reboots the compute node for the kernel upgrade.
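For illustration, a minimal sketch of the drain-and-reboot step such a tool performs, using standard Nova tooling; the host name is a placeholder:

    # Stop scheduling new VMs to the node, then live-migrate everything away
    openstack compute service set --disable hv042.example.org nova-compute
    nova host-evacuate-live hv042.example.org
    # Once "openstack server list --all-projects --host hv042.example.org"
    # is empty, reboot the node for the kernel upgrade and re-enable it
    openstack compute service set --enable hv042.example.org nova-compute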
And scale is, of course, teamwork. The CERN OpenStack team is six or seven people, but over the years we have added the participation of dozens of different students, fellows and project associates who joined the team for periods of time and contributed a lot to this project.
B
Nah, this was great. You know, for people who didn't know what CERN was, I think they got a really good understanding of it, and of the infrastructure. The fact that you are running different versions of OpenStack depending on the project is something that most people don't do, and you all have really good reasons for why you're doing it.
C
So I'm happy to answer your questions. The audience can also follow me on Twitter and ask me questions there, or send questions through email; I'm happy to answer them.
D
Well, while he's responding, I do have one of my own, since I'm coming here from container land. One of my questions is: are you already running some workloads that are distributed as container images and, if not, are you planning on it?
C
So there are a lot of applications using containers as their deployment method. We also started playing with containers to deploy OpenStack itself.
Recently we have been experimenting in one region with deploying OpenStack on top of Kubernetes using Helm charts, the OpenStack-Helm charts, and we are experimenting with it currently. For example, in production we have half of the Glance requests going through a Glance that is deployed in a Kubernetes cluster.
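For illustration, a minimal sketch of deploying one such service with the OpenStack-Helm charts; the namespace and values file are placeholders, and the chart directory is assumed to come from a checkout of the openstack-helm repository:

    helm upgrade --install glance ./glance \
        --namespace openstack \
        --values glance-values.yaml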
So we are seven core members, but then we have all these fellows and project associates who join our team. Usually they don't do operations; they do more investigative work, like evaluating different OpenStack projects or Kubernetes-related projects, and then, if we think it's worth investing in those projects, we go further and try to implement them, to deploy them in the cloud.
So currently we have some people doing work on GPUs, trying to understand how to have GPUs in the cloud, and other people looking into how to have functions-as-a-service in the cloud, for example; we have all these different projects always going on.
Oh yeah, sure. So at the beginning we collaborated a lot with Nectar, for example, which is a scientific research network in Australia; at that time they were using OpenStack, and they still are, and they were quite big, and we exchanged a lot of ideas on how to deploy OpenStack. More recently we collaborate with SKA, the Square Kilometre Array, which basically is, or will be, the biggest telescope in the world, with sites in South Africa and Australia for observation, and we did interesting projects with them.
So they are not available in OpenStack by default, so we collaborated with SKA to develop this, and also on running Kubernetes clusters on bare metal; there was a lot of work done in collaboration with SKA in this area, for example.
A
Cool, awesome. All right, so I dropped links to both Nectar and SKA in the chat, if folks are curious about what those organizations are all about. If there's anything else, feel free to reach out to me with questions at redhat.com, and I can pass them along to Belmiro and the team here. But without any further questions, I think we'll just wrap up here. So thank you very much; this was an awesome presentation.