Case Study: OpenShift @ UPS
Freddy Montero (Red Hat)
Kevin Chiang (UPS)
at OpenShift Commons Gathering 2019 at Red Hat Summit
Freddy Montero: Hello, my name is Freddy Montero. I'm an architect with the container practice at Red Hat. How many of you buy stuff online, especially from Best Buy? Anyone? And how many of you check the tracking of your packages at ups.com? Of course, that tracking comes through the OpenShift clusters that we have at UPS. What we're going to be talking about today is our journey of doing OpenShift upgrades without any downtime, going from OpenShift 3.4 to OpenShift 3.11 and beyond.
Kevin Chiang: Let's go over the agenda. First, UPS: who we are and what we do. Then the background and, like Freddy mentioned, our journey with OpenShift: how we started and why we picked OpenShift. Then the monthly operating system patching, the upgrade path from 3.4 to 3.9 and now from 3.9 to 3.11, and some of the lessons we learned along the way. And finally, the accomplishments and roadmap.
UPS: what do we do? What can Brown do for you? As you can see on this slide, most of you think that at UPS we just ship packages, but we do a lot more than shipping packages. On the left-hand side we have the global small package service; that's what most people are familiar with. We have the domestic package services and we have the global package services.
Some of the interesting facts that I found along the way: how many package cars do we have? Today we have over 123,000 package cars all around the world. We have about 150,000 storefronts all around the world, and over 200 aircraft.
So these are mission-critical applications: no outage, no downtime for OpenShift. How did we accomplish this? We used some of the Red Hat tools that can migrate the existing infrastructure without any incident, and we also won an Innovation Award in 2018.
Okay, the journey: how we started and why we went down this path. Traditionally, when an application needs some work done, the application teams' developers take their code and develop it on their laptops, and then eventually they hand that code off to the operations teams. The operations teams will then schedule jobs...
I'm sorry: make a change request, go through change controls, and schedule jobs to put the code into a development environment, then eventually into the UAT environment and the stress environment, and eventually into production. During the whole process, if something was to go wrong, operations would go back to the developers, and with all that back and forth a lot of time was wasted. So that's when we started looking at, hey...
The difference is that once their code is done being developed, they ship it off to a Git repository, and with the right configuration, when the code is updated it automatically gets pushed into the development environment and they can start doing their testing. If all is good, with a push of a button that code essentially gets pushed to the next environment, which is stress, and then with another push of a button it gets pushed to production, with testing happening in the middle of the process.
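As an illustration of that flow, here is a minimal sketch of a Git-triggered build in OpenShift 3.x; the application name, namespace, repository URL, and webhook secret are hypothetical, not UPS's actual configuration:

```yaml
# Minimal sketch: a webhook-triggered build in OpenShift 3.x.
# Name, namespace, repo URL, and secret are hypothetical.
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: tracking-app            # hypothetical application
  namespace: dev                # the development environment
spec:
  source:
    git:
      uri: https://git.example.com/team/tracking-app.git
  strategy:
    sourceStrategy:
      from:
        kind: ImageStreamTag
        name: redhat-openjdk18-openshift:1.4
        namespace: openshift
  output:
    to:
      kind: ImageStreamTag
      name: tracking-app:latest
  triggers:
    - type: GitHub              # a push to the repository starts a build
      github:
        secret: changeme        # hypothetical webhook secret
```

A DeploymentConfig with an ImageChange trigger on tracking-app:latest would then roll the freshly built image into dev automatically, while promotion to stress and production stays a deliberate, push-button step (for example, tagging the image into the next project).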
So why did we pick Red Hat OpenShift? Prior to Red Hat OpenShift (actually, we still have this today) we had many different environments. For the .NET applications we have the Windows platform; for the Java environment we have BEA WebLogic or WebSphere. So all of these are separate environments that the operations teams need to support, along with the other applications sitting on those platforms.

The next part is that the application team controls the operations tasks, so essentially the application teams control their own destiny. They can scale up pods when they have more traffic coming into their environment. They can monitor their own CPU and say: hey, do I need more pods for my environment? They can also self-serve persistent storage requests. And lastly, application portability: whatever they develop on their laptop can be ported over to any environment that they want.
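That self-service scaling maps naturally onto a HorizontalPodAutoscaler; a minimal sketch, with hypothetical names and thresholds rather than UPS's actual values:

```yaml
# Hypothetical example: an application team scales its own pods on CPU.
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: tracking-app            # hypothetical app name
  namespace: tracking           # hypothetical team project
spec:
  scaleTargetRef:
    apiVersion: apps.openshift.io/v1
    kind: DeploymentConfig      # OpenShift 3.x deployment object
    name: tracking-app
  minReplicas: 2
  maxReplicas: 10               # grow to ten pods under load
  targetCPUUtilizationPercentage: 75
```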
Okay, monthly OS patching. One of our mandates from our security team is that we have to patch every single month. The patches cover kernel patches, security configuration, and some of the custom configuration that comes from the OS team. So, in our environment:
We have over four hundred servers in the OpenShift environment and 14 OpenShift clusters, and patching requires no outage. How do we do it? What we did was create our own Ansible scripts, and once the Ansible scripts are created, we execute them through Ansible Tower. I'll just give you a high-level view of what the Ansible script and the patch flow look like.
From a single-server perspective, we have pre-tasks. Prior to starting any patch, the first thing you have to do is check to ensure that the environment is fully operational. The script will determine whether the environment is operational, and once it is, we can start. The first task is we take whatever server we're going to be working on and cordon it.
In the middle, and all the way on the right-hand side, is where the patch actually starts. It will do all the configuration changes and the kernel updates, and then finally it will do a server reboot. The last piece is it will check: did the OS patch cause any problems? If all is successful, it will put everything back in the mix.
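A minimal sketch of that single-server patch flow as an Ansible play; the group names, package handling, and health check are assumptions for illustration, not UPS's actual playbook:

```yaml
# Hypothetical sketch: patch one OpenShift node with no cluster outage.
- hosts: "{{ target_node }}"    # one server at a time
  serial: 1
  pre_tasks:
    - name: Ensure the node is Ready before touching it
      command: oc get node {{ inventory_hostname }} -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
      delegate_to: "{{ groups['masters'][0] }}"
      register: ready
      changed_when: false
      failed_when: ready.stdout != "True"
  tasks:
    - name: Cordon and drain the node so workloads move elsewhere
      command: oc adm drain {{ inventory_hostname }} --ignore-daemonsets --delete-local-data
      delegate_to: "{{ groups['masters'][0] }}"
    - name: Apply kernel and security errata
      yum:
        name: "*"
        state: latest
        security: yes
    - name: Reboot and wait for the server to come back
      reboot:
        reboot_timeout: 600
    - name: Put the node back in the mix
      command: oc adm uncordon {{ inventory_hostname }}
      delegate_to: "{{ groups['masters'][0] }}"
```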
So that's the single-server patch view. How does Ansible Tower play into this whole picture? What Ansible Tower does for us is the scheduling. As you may know, with OpenShift everything is broken down into master nodes, infra nodes, and application nodes. So, using Tower's capabilities, we patch the master nodes one server at a time; if anything fails, it stops, and once the issue is fixed it continues from the point where we stopped. It goes the same way for the infra nodes, because they're the front door to OpenShift: we also do them one server at a time, and in the event that there are any failures, manual intervention has to come in, fix the issue, and then move on. Where it gets better is the application nodes: we have multiple application nodes, so there's some concurrency that can happen there.
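In play form, that phasing looks roughly like this; the group names and the 25% batch size are illustrative, and monthly_patch stands in for whatever role does the actual patch work:

```yaml
# Hypothetical sketch: strict serialization for the control plane and
# front door, limited concurrency for the app tier.
- hosts: masters
  serial: 1                     # masters strictly one at a time
  roles: [monthly_patch]

- hosts: infra
  serial: 1                     # infra nodes carry the routers, one at a time
  roles: [monthly_patch]

- hosts: app_nodes
  serial: "25%"                 # patch a quarter of the app nodes at once
  roles: [monthly_patch]
```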
So how did we upgrade from 3.4 to 3.9? With the upgrade from 3.4 to 3.9 there's an etcd upgrade, which means essentially we were forced to have an outage: we have to bring down the etcd database and upgrade it. In order to avoid an outage, what we did was a blue-green deployment, which means:
We have two clusters: one on 3.4, servicing existing customers, and the 3.9 environment, which is the new build. Then we have two pipelines: one is the 3.4 pipeline that can deploy code out to the existing environment, and the other pipeline is for 3.9. Once all the testing is completed for the 3.9 environment, the application team just does a DNS failover. We have multiple application teams, and they each have full control of when they cut over.
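As a sketch of what such a DNS cutover could look like with Ansible's nsupdate module, assuming a DNS server that accepts dynamic updates; every name and key here is hypothetical:

```yaml
# Hypothetical sketch: flip the app's CNAME from the 3.4 cluster's
# router to the 3.9 cluster's router once testing passes.
- name: Point the application alias at the new (3.9) cluster
  community.general.nsupdate:
    server: ns1.example.com         # hypothetical authoritative DNS server
    zone: apps.example.com
    record: tracking                # tracking.apps.example.com
    type: CNAME
    value: router.ocp39.example.com.
    ttl: 60                         # short TTL so the failover takes effect quickly
    key_name: ddns-key              # hypothetical TSIG key
    key_secret: "{{ ddns_secret }}"
```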
From 3.9 to 3.11: we are currently doing that today; over the past weekend we just completed one of our production environments. So how do we do it? Again, we had to create our Ansible scripts, and there's a lot of customization that we have to do to ensure that everything happens correctly.
So what are some of the customizations that we do? For example, the hostname has to be in the right format that OpenShift requires; the proxy settings that we have in the configuration; the log locations. From 3.9 to 3.11 there's a major shift: everything becomes a pod, even the infrastructure. What we had before was RPM-based, and everything becomes a pod format. So to ensure that everything is copacetic, we had to create scripts to make that happen.
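A sketch of what such pre-upgrade validations can look like; the specific checks, proxy host, and log path are assumptions, not UPS's actual list:

```yaml
# Hypothetical pre-upgrade checks before running the 3.11 playbooks.
- hosts: nodes
  tasks:
    - name: Hostname must be the fully qualified name OpenShift expects
      assert:
        that:
          - ansible_fqdn == inventory_hostname
        fail_msg: "{{ inventory_hostname }} does not match its FQDN"

    - name: Proxy settings must be present for image pulls
      lineinfile:
        path: /etc/environment
        regexp: '^HTTPS_PROXY='
        line: "HTTPS_PROXY=http://proxy.example.com:8080"   # hypothetical proxy

    - name: Custom log location must exist
      file:
        path: /var/log/openshift      # hypothetical custom log location
        state: directory
        mode: "0755"
```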
Once that's done, we start using the Red Hat out-of-the-box playbooks. Again, everything is divided up: we have the control plane, which has the master nodes, and then it's also broken down into the infra nodes and the app nodes. Master nodes we do one at a time, infra nodes one at a time, and app nodes we can do concurrently to save a lot of time.
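That phasing matches the openshift-ansible upgrade playbooks, which split the work into upgrade_control_plane.yml and upgrade_nodes.yml. A minimal sketch of the inventory variables that drive the node phase; the label and batch size are illustrative:

```yaml
# Hypothetical group_vars for the 3.11 node-upgrade phase.
# Run playbooks/byo/openshift-cluster/upgrades/v3_11/upgrade_control_plane.yml
# first, then upgrade_nodes.yml with variables like these.
openshift_upgrade_nodes_label: "node-role.kubernetes.io/compute=true"
openshift_upgrade_nodes_serial: "20%"    # upgrade app nodes in batches
```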
And lastly, once the environment is upgraded to 3.11, we have the post-upgrade Ansible playbook that we created, which we use to ensure that everything is running in the exact same manner across the different clusters. A lot of customization again comes into play there: performance customizations, router customizations, application logging, and time sync, for example.
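A minimal sketch of what a post-upgrade verification play can look like; which checks UPS actually runs isn't specified in the talk, so these two are assumptions:

```yaml
# Hypothetical post-upgrade verification run from a master.
- hosts: masters[0]
  tasks:
    - name: Every node should report Ready
      shell: oc get nodes --no-headers | awk '$2 != "Ready"'
      register: not_ready
      changed_when: false
      failed_when: not_ready.stdout != ""

    - name: Router and registry pods should be Running in the default project
      shell: oc get pods -n default --no-headers | grep -v Running
      register: default_pods
      changed_when: false
      failed_when: default_pods.stdout != ""
```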
What are some of the lessons we learned from upgrading from 3.4, or actually from OpenShift in general? The top two are more about the learning curve and skill sets, so I'll just read it off: OpenShift skill gaps to design, implement, and support the infrastructure in a short period of time. As you may know, with OpenShift it's not just, hey, you learn OpenShift and that's it. With OpenShift there are so many different components: you have the networking, you have...
You have Ansible, you have the RHEL OS; there are so many different components, and in order to be successful with deploying OpenShift you have to have expertise in so many different areas. That's one of the challenges we encountered: you have one person that's good in one area but not good in another. So we had to learn a lot; in every different area we kind of had to start learning how everything works, in order that...
...when we encounter problems, we can resolve them in a quick manner. The third one is that resource planning is critical. Today we have one full-time person (myself) and two consultants, and we're looking at increasing our resources to be able to support our existing infrastructure. On the technical side: the HAProxy router. One of the big things we ran into was that, after we deployed the 3.4 environment, we had all...
We had all the application teams on one cluster, and what we got was that periodically an application team would come to us and go: hey, we're getting some drops, we're getting some timeouts, or some requests are taking longer. Later on, we realized that HAProxy is a single-process, single-threaded process. What that means is that, even though we had created multiple physical servers for the HAProxy pods to sit on...
However, being that it's a single process, it will only use one out of the 56 cores that we have allocated. The resolution for that was that we started creating multiple pods on each single server, so that the cores on that particular box can be utilized more efficiently.
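A sketch of that fix; the replica count is illustrative, and running more than one router pod per node assumes the router isn't bound to the host network's ports 80/443, which is a deployment-specific choice:

```yaml
# Hypothetical sketch: scale the default router out to several HAProxy pods
# so more than one core does useful work. Exact counts depend on the cluster.
- hosts: masters[0]
  tasks:
    - name: Scale the router DeploymentConfig out
      command: oc scale dc/router -n default --replicas=6
```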
And custom roles: we set permissions, and with the upgrade to 3.9 or 3.11 there are some custom roles that we had to go back and revisit, working with the app teams to make sure the custom roles fulfill their requirements. Also, the OpenShift uninstall job did not clean up directories. One of the things that we do at UPS, for example when we upgrade to a different version, is reuse the application nodes, and the uninstall...
What we learned was that the uninstall doesn't always clean everything up, so we had to manually go into each server and make sure that everything is cleaned up correctly.
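A sketch of automating that cleanup; which leftover directories actually need removing is installation-specific, so these are just common OpenShift 3.x locations:

```yaml
# Hypothetical sketch: remove directories the uninstall playbook can leave
# behind before reusing a node for a new version.
- hosts: app_nodes
  tasks:
    - name: Remove leftover OpenShift state directories
      file:
        path: "{{ item }}"
        state: absent
      loop:
        - /etc/origin             # node/master configuration
        - /var/lib/origin         # pod volumes and local state
        - /var/lib/etcd           # etcd data (masters only)
```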
And lastly, performance save mode. Essentially it's a flag that you turn on and off; if it's turned on, it means that if the server is not being utilized heavily, it will start shutting things down, and that caused problems for us.
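If that flag corresponds to host power-management tuning, which is an assumption on my part since the talk doesn't name the exact setting, a minimal sketch of pinning nodes to a performance-oriented tuned profile looks like this:

```yaml
# Hypothetical sketch: keep hosts from powering down resources when idle.
- hosts: nodes
  tasks:
    - name: Pin the host to a throughput-oriented tuned profile
      command: tuned-adm profile throughput-performance
```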
Okay: accomplishments and roadmap. In 2017 we started with OpenShift and built out the 3.4 infrastructure. In 2018 we started onboarding a lot more applications, built out more infrastructure, and upgraded from 3.4 to 3.9. Through 2019 into 2020 we're in the process of upgrading from 3.9 to 3.11, again onboarding additional applications coming onto OpenShift, and towards the middle or end of this year...