From YouTube: Parallel GPU Quantum Circuit Simulations on Qiskit Aer
Description
Parallel GPU Quantum Circuit Simulations on Qiskit Aer
Jun Doi
A
Aer is one of the components of the open-source quantum computing platform Qiskit, and Qiskit Aer is the quantum circuit simulator that runs on classical computers. Qiskit Aer supports various simulation methods: state vector, unitary, density matrix, stabilizer, and MPS. Usually we use the state vector simulator, which is the standard simulation method, and the stabilizer and MPS methods are used for large-scale simulation.
A
But the quantum circuits that these simulators can handle are very limited. Qiskit Aer also supports various types of noise models that mimic the behavior of actual quantum computers.
A
So now I'd like to talk about GPU support in Qiskit Aer. Currently, Qiskit Aer supports GPUs for three simulation methods: state vector, unitary, and density matrix. We are planning to add GPU support for the stabilizer simulator, and we are also developing a tensor network simulator, which is an enhanced version of the MPS simulator.
A
So here I'd like to show the performance of GPU acceleration for the state vector simulator in Qiskit Aer. The blue line shows the simulation time on the CPU, and the green line shows the simulation time using the GPU.
A
We have 16 gigabytes of memory on a V100, so we can simulate up to 29 qubits using a single GPU. By using 6 GPUs, we increase the number of qubits to 32. We can store the quantum state on the 6 GPUs, and we can also put part of it in CPU memory. By using all of this memory, we can simulate up to 35 qubits on this machine.
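The qubit counts quoted here follow from the size of a state vector: n qubits need 2^n complex amplitudes. The following is a rough sketch of that arithmetic, not code from the talk. It assumes double-precision complex amplitudes (16 bytes each) and, purely for illustration, that about 15.5 GiB of each 16 GB V100 is usable for the state; both figures are assumptions.

```python
BYTES_PER_AMPLITUDE = 16  # assumption: complex128, two 8-byte floats
GIB = 2**30


def statevector_bytes(num_qubits: int) -> int:
    """Memory needed to hold all 2**num_qubits amplitudes."""
    return BYTES_PER_AMPLITUDE * 2**num_qubits


def max_qubits(memory_bytes: int) -> int:
    """Largest qubit count whose state vector fits in memory_bytes."""
    amplitudes = memory_bytes // BYTES_PER_AMPLITUDE
    if amplitudes < 1:
        raise ValueError("not enough memory for a single amplitude")
    return amplitudes.bit_length() - 1  # floor(log2(amplitudes))


usable_per_gpu = int(15.5 * GIB)  # hypothetical usable memory per GPU

print(max_qubits(usable_per_gpu))      # one GPU
print(max_qubits(6 * usable_per_gpu))  # six GPUs
print(statevector_bytes(35) // GIB)    # GiB needed for 35 qubits
```

Under these assumptions the sketch gives 29 qubits for one GPU and 32 qubits for six. A 35-qubit state needs 512 GiB, which is more than six GPUs hold, so CPU memory has to be used as well.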
A
So let me introduce how to install GPU support for Qiskit Aer, as shown on this slide. First, install Qiskit itself using pip install qiskit. After that, we have to uninstall the existing Qiskit Aer, which is the Qiskit Aer for CPUs, using pip uninstall qiskit-aer. Finally, we install the separate binary of the GPU-supported Qiskit Aer using pip install qiskit-aer-gpu, like this. Now you can run Qiskit Aer with GPU support.
A
To run Qiskit Aer with GPU support, you just add the option device='GPU' in the script, and the simulation then runs on the GPU. This is a simple example that runs a quantum volume circuit using the state vector method on the GPU.
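The slide itself is not part of the transcript, so the following is only a sketch of what such a script might look like. It assumes the qiskit and qiskit-aer-gpu packages are installed and uses the AerSimulator and QuantumVolume classes; the function names, circuit size, and shot count are illustrative choices, and import paths can differ between Qiskit versions.

```python
def gpu_statevector_options():
    """Backend options named in the talk: state vector method on the GPU."""
    return {"method": "statevector", "device": "GPU"}


def run_quantum_volume(num_qubits=20, depth=10, shots=100):
    """Run a quantum volume circuit on the GPU and return the counts.

    Imports are local so this file can be loaded without Qiskit installed.
    """
    from qiskit import transpile
    from qiskit.circuit.library import QuantumVolume
    from qiskit_aer import AerSimulator

    sim = AerSimulator(**gpu_statevector_options())

    circ = QuantumVolume(num_qubits, depth, seed=0)
    circ.measure_all()
    # Decompose the quantum volume blocks into gates the simulator runs.
    circ = transpile(circ, sim)

    return sim.run(circ, shots=shots).result().get_counts()
```

Calling run_quantum_volume() needs a CUDA-capable GPU; on a machine without one, changing "GPU" to "CPU" in the options should run the same circuit on the CPU.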
A
So let me explain the implementation of parallel quantum circuit simulation in Qiskit Aer. To simulate a large number of qubits, the quantum state is distributed across multiple GPUs, or across multiple processes on a cluster using MPI.
A
If we do not divide the state into chunks, we have to prepare a large buffer to receive the state from a different distributed memory space. But by dividing it into small chunks, we only have to prepare a receive buffer for one chunk, so we can save memory by using this technique. We also optimize the data exchange.
A
The data exchange is optimized by a transpilation step applied before running the actual simulation. This is the input quantum circuit, and we divide the state into chunks. If a gate is inside the chunk, we do not have to transfer data between chunks. But some of the gates are outside the chunk: if nc is the chunk size in qubits and the gate operates on a qubit larger than nc, we have to transfer data between chunks.
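The chunk rule described above can be sketched in a few lines. This is a generic illustration of state vector partitioning, not Qiskit Aer's actual code. It assumes zero-based qubit indices, so that qubits 0 to nc-1 live inside a chunk and a gate on qubit index nc or higher crosses chunks; the helper names are hypothetical.

```python
def needs_chunk_exchange(gate_qubits, chunk_qubits):
    """True if any qubit of the gate lies outside the chunk.

    chunk_qubits is nc: each chunk holds 2**nc amplitudes.
    """
    return any(q >= chunk_qubits for q in gate_qubits)


def partner_chunk(chunk_index, qubit, chunk_qubits):
    """Chunk that must be paired with chunk_index for a gate on qubit.

    For a qubit outside the chunk, the two amplitudes a single-qubit
    gate mixes differ only in one bit of the chunk index.
    """
    if qubit < chunk_qubits:
        raise ValueError("qubit is inside the chunk; no exchange needed")
    return chunk_index ^ (1 << (qubit - chunk_qubits))


# Example: 5 qubits, nc = 3, so the state is split into 4 chunks.
nc = 3
print(needs_chunk_exchange([1], nc))     # gate on qubit 1 stays local
print(needs_chunk_exchange([2, 3], nc))  # gate on qubits 2 and 3 crosses chunks
print(partner_chunk(0, 4, nc))           # chunk paired with chunk 0 for qubit 4
```

With nc = 3, a gate on qubit 1 is local, a gate touching qubit 3 needs an exchange, and a gate on qubit 4 pairs chunk 0 with chunk 2.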
A
So this is another example, showing how to use multiple GPUs if you have them on the system. This is also very simple: just add the blocking_qubits option here.
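As a hedged sketch of what that might look like: Qiskit Aer documents the run options blocking_enable and blocking_qubits, where blocking_qubits sets the chunk size in qubits. The helper names and the default chunk size of 20 below are illustrative assumptions, not values from the talk.

```python
def multi_gpu_run_options(blocking_qubits):
    """Run options that split the state into chunks of 2**blocking_qubits."""
    return {"blocking_enable": True, "blocking_qubits": blocking_qubits}


def run_on_multiple_gpus(circuit, blocking_qubits=20, shots=100):
    """Run a circuit with the state distributed over the available GPUs.

    Imports are local so this file can be loaded without Qiskit installed.
    """
    from qiskit import transpile
    from qiskit_aer import AerSimulator

    sim = AerSimulator(method="statevector", device="GPU")
    circuit = transpile(circuit, sim)
    options = multi_gpu_run_options(blocking_qubits)
    return sim.run(circuit, shots=shots, **options).result()
```

The chunk size has to be small enough for a chunk to fit in one GPU's memory, so a suitable value depends on the hardware.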
A
Also, this example shows how we use multiple nodes on a cluster using MPI. Unfortunately, there is no binary distribution with MPI support, so please build from the source code if you want to use MPI. This example is also simple, and the blocking_qubits option is used in the same way as in the multi-GPU case. With MPI, the result is returned to all the processes, but by querying the metadata in the result you can find out which MPI rank you are running on.
A
To run the simulator on multiple nodes, just pass this Python code to the mpirun command.
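The rank query mentioned above might look like the following sketch. It assumes that an MPI-enabled Qiskit Aer build reports the keys mpi_rank and num_mpi_processes in the result metadata; the helper names are hypothetical, and the script name in the launch command is a placeholder.

```python
def mpi_info(result_dict):
    """Return (rank, number of processes) from a result dictionary.

    Falls back to a single-process view if the keys are missing.
    """
    meta = result_dict.get("metadata", {})
    return meta.get("mpi_rank", 0), meta.get("num_mpi_processes", 1)


def report_on_rank_zero(result):
    """Print the counts once, although every process receives the result."""
    rank, size = mpi_info(result.to_dict())
    if rank == 0:
        print(f"{size} MPI processes, counts: {result.get_counts()}")


# Launched, for example, as:  mpirun -np 8 python qv_mpi.py
# (qv_mpi.py is a placeholder name for the script holding this code)
print(mpi_info({"metadata": {"mpi_rank": 3, "num_mpi_processes": 8}}))
```

Printing from rank 0 only avoids getting the same output once per process.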
A
So we are using from one node up to eight nodes here, and we are again using the quantum volume circuit for the test. The left-hand side graph shows the strong scaling and the right-hand side graph shows the weak scaling. The strong scaling uses a circuit with a fixed number of qubits for all the node counts.
A
So in this case the performance of two nodes is not good compared to one node because of the MPI data transfer overhead, but as the number of nodes increases, the simulation time decreases like this. For the weak scaling, ideally the graph would be horizontal, but for large-scale simulation the performance is not so good. Still, it is important that we can simulate a large number of qubits by using some of the nodes on the cluster.
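For readers less familiar with the two kinds of scaling, the usual definitions can be written down as a small sketch. The timings used in the example are made-up placeholders, not measurements from the talk, and the one-extra-qubit-per-doubling rule is an assumption about how a weak scaling run grows the problem.

```python
def strong_scaling_efficiency(time_one_node, time_n_nodes, nodes):
    """Fixed problem size: 1.0 means n nodes run n times faster."""
    return time_one_node / (nodes * time_n_nodes)


def weak_scaling_efficiency(time_one_node, time_n_nodes):
    """Problem grows with the node count: 1.0 is the flat, ideal line."""
    return time_one_node / time_n_nodes


def weak_scaling_qubits(base_qubits, nodes):
    """Qubits that fit when each doubling of nodes adds one qubit.

    Assumes nodes is a power of two and the memory per node is fixed.
    """
    return base_qubits + nodes.bit_length() - 1


# Hypothetical timings in seconds, for illustration only.
print(strong_scaling_efficiency(80.0, 20.0, 8))
print(weak_scaling_efficiency(10.0, 20.0))
print(weak_scaling_qubits(30, 8))
```

A weak scaling curve that rises instead of staying flat shows up here as an efficiency below 1.0.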
A
So Qiskit Aer also supports shot-level parallelization using multiple GPUs. Multi-shot simulation is used for circuits with intermediate measurements or for simulating noise models.
A
So if the simulation has multiple shots, Qiskit Aer automatically distributes the shots across multiple GPUs, if the system has multiple GPUs, like this. But in most cases of multi-shot simulation, the number of qubits is very small. In that case, the overhead of GPU execution is the bottleneck, like this.
A
So in this case the calculation time on the GPU is very small and the overhead dominates the performance, so it is not good. If the problem size is larger, the simulation time itself is larger than the overhead, so we can ignore the overhead. So we implemented a batched shots execution technique on the GPU. This is an example for noise simulation.
A
So by synchronizing the shot executions like this, we calculate each vertical box here in a single GPU kernel, so we can decrease the GPU overhead. For the Kraus noise model, a Kraus operator is originally inserted into the circuit here, so we can synchronize the shots and execute it in a single GPU kernel as well. This is the performance evaluation of the batched shots optimization.
A
One line shows the CPU execution, and the orange line here shows the original implementation on the GPU. The speedup is not so large here because of the large GPU overhead, but using the batched shots execution improves the performance a lot. For comparison we also plotted the density matrix. By using the density matrix simulator, we can simulate the noise model with only one shot, so it is very fast, but it takes much memory and has a large computation overhead.
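A noisy multi-shot run with batched execution might be set up as in the following sketch. It assumes the AerSimulator option batched_shots_gpu and a simple depolarizing noise model built with qiskit_aer.noise; the error probability, gate names, and function names are illustrative choices, not values from the talk.

```python
def batched_shot_options():
    """Backend options for batched multi-shot execution on the GPU."""
    return {"method": "statevector", "device": "GPU", "batched_shots_gpu": True}


def run_noisy(circuit, error_prob=0.01, shots=1000):
    """Run a measured circuit with depolarizing noise and batched shots.

    Imports are local so this file can be loaded without Qiskit installed.
    """
    from qiskit import transpile
    from qiskit_aer import AerSimulator
    from qiskit_aer.noise import NoiseModel, depolarizing_error

    noise = NoiseModel()
    # One-qubit and two-qubit depolarizing errors on the basis gates.
    noise.add_all_qubit_quantum_error(
        depolarizing_error(error_prob, 1), ["rz", "sx", "x"]
    )
    noise.add_all_qubit_quantum_error(depolarizing_error(error_prob, 2), ["cx"])

    sim = AerSimulator(noise_model=noise, **batched_shot_options())
    circuit = transpile(circuit, sim)
    return sim.run(circuit, shots=shots).result().get_counts()
```

Setting method to "density_matrix" instead would give the single-shot alternative mentioned in the comparison, at the cost of much more memory per qubit.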
A
So we also support the cuQuantum API that is provided by NVIDIA.
A
So this is the performance comparison. The green line shows the performance with cuStateVec support, and the orange line shows the original Qiskit Aer GPU implementation. For a large number of qubits, cuStateVec support is about twice as fast as the original one.
A
So if you want to simulate a large number of qubits, please try this option.
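The option itself is not named in the transcript. In Qiskit Aer's documentation the cuStateVec backend is switched on with the option cuStateVec_enable, so the following sketch assumes that this is the option meant and that the installed build includes cuStateVec support; the helper names are hypothetical.

```python
def custatevec_options(enable=True):
    """Backend options selecting the cuStateVec implementation on the GPU."""
    return {"method": "statevector", "device": "GPU", "cuStateVec_enable": enable}


def make_custatevec_simulator():
    """Create a simulator that uses cuStateVec, if the build supports it.

    The import is local so this file can be loaded without Qiskit installed.
    """
    from qiskit_aer import AerSimulator

    return AerSimulator(**custatevec_options())
```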
So let me summarize my talk. I introduced the parallelization in Qiskit Aer. As for the future plan, we are now developing tensor-network-based simulation using cuTensorNet, which is also a component of the cuQuantum SDK from NVIDIA. We are also planning to implement stabilizer simulation with GPU support.
A
Thank you, Jun, for the great talk. Are there any questions for Jun?