From YouTube: GPU-Accelerated Quantum Circuit Simulations with QCLAB++
Description
GPU-Accelerated Quantum Circuit Simulations with QCLAB++
Roel Van Beeumen (LBL)
Hello everyone. Today I'm going to talk about some recent results on our GPU-accelerated quantum circuit simulations with QCLAB++, and that's joint work with Daan Camps from NERSC.
So let's start with: what is QCLAB++? QCLAB++ is an object-oriented and fully templated C++ package that we have developed for creating and representing quantum circuits. Please check it out; it's freely available on GitHub. We have developed it with a numerical linear algebra focus in mind.
There are three key features of QCLAB++. First is efficiency and performance. Second is numerical stability: for example, for a rotation gate we don't store the angle of the gate, but the cosine and the sine of the angle, so that we can do numerically stable operations with it. Third, we try to make it as user-friendly as possible.
The goal is to allow for rapid prototyping and testing of quantum algorithms in C++. We also have a MATLAB component, if you are interested in that. QCLAB++ is the backbone of several other packages developed here at Berkeley Lab, for example F3C++, the Fast Free Fermion Compiler for time-evolution quantum circuits of spin Hamiltonians that can be mapped to free fermions.
So today I'm going to talk about state vector simulation and how we do that efficiently in QCLAB++. How do we apply a quantum gate to a given n-qubit state, which lives in a Hilbert space of dimension 2^n?
Mathematically speaking, this corresponds to the following Kronecker product, where we have the quantum gate U, which is a unitary matrix, and on the left and the right we have Kronecker products with identity matrices. Today I'm going to explain how we can do that very efficiently by calculating indices through bit operations, and for that we will use the big-endian binary convention.
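For a one-qubit gate U acting on qubit q (with qubit 0 the leftmost in the big-endian convention), the Kronecker product in question presumably has the form

```latex
\left( I_{2^{q}} \otimes U \otimes I_{2^{\,n-q-1}} \right) \lvert \psi \rangle ,
```

with identity matrices of dimension 2^q on the left and 2^(n-q-1) on the right.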
So suppose we have just an integer j; then we can represent its binary representation as follows. Note that these monospace characters correspond to bits, which are either zero or one.
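As a small illustration of this big-endian convention (my own sketch, not QCLAB++ code; the name `qubitBit` is mine): the bit belonging to qubit q of an n-qubit basis index sits at position n - q - 1 counted from the least significant bit.

```cpp
#include <cassert>
#include <cstdint>

// Big-endian convention: qubit 0 corresponds to the most significant of
// the n bits, so its bit sits at position n - q - 1 from the LSB.
inline std::uint64_t qubitBit(std::uint64_t j, int n, int q) {
  return (j >> (n - q - 1)) & 1u;
}
```

So for j = 0b110 on three qubits, qubits 0 and 1 read 1 and qubit 2 reads 0.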
So here is an overview of the different types of quantum gates we are considering: first of all the one-qubit gates, then the two-qubit gates, and also the k-qubit gates, that is, a unitary acting on k qubits. Then we have the controlled gates, for example a controlled one-qubit gate; an example is a CNOT gate. And then there are the gates which act on different parts of the qubits, for example a swap gate on non-nearest-neighbor qubits.
So let's start with the simplest case, the one-qubit gates. Every one-qubit gate U here acts on a qubit q and has a unitary 2-by-2 matrix connected to it, and the state vector simulation breaks down into a series of 2-by-2 matrix-vector operations. I know all these indices look very complicated, so let's immediately look at an example.
So here we consider three examples: we have a three-qubit circuit where we apply the one-qubit gate U on the first, the second, or the third qubit.
If we look at the matrix representations of these three examples, we see that if we apply the one-qubit gate U to the last qubit, we just have local 2-by-2 operations. On the other hand, if we apply the one-qubit gate to the first qubit, we still have local 2-by-2 operations, indicated by the different color schemes, but the elements are now separated 2^(n-1) apart.
So if we look at these indices a_j and b_j, we can immediately see that we can split them into three different parts: the bits before qubit q, the bit of qubit q itself, and the bits after that. Without loss of generality, we can select these bits as follows, where the numerical value of j is equal to the following bit representation. Note that this j has one bit less than a_j and b_j.
Basically, they just correspond to removing the bit at position q. We can then define the following left and right masks: if we apply the mask M_L to j, only the first q bits are kept, and if we apply the right mask M_R, the remaining bits are kept. Using these left and right bit masks, we can calculate the index a_j as follows. We mask j with M_R, so we retain the lower part of the bits, and to that we add j masked with M_L and shifted one position to the left, so we keep those q bits but shifted one to the left. That gives us a_j, and the only difference between a_j and b_j is the bit at position q, which we can add to a_j by summing with 2^(n-q-1).
So here is the overall algorithm, and it's a very simple algorithm: it just contains one for loop. We define the masks, and then in every iteration we compute the indices a_j and b_j and do the local 2-by-2 operation.
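As an illustrative sketch of that loop (my own minimal C++ version, not the actual QCLAB++ source; the names `applyOneQubitGate`, `mL`, and `mR` are mine):

```cpp
#include <cassert>
#include <complex>
#include <cstdint>
#include <vector>

// Apply a 2x2 unitary u (row-major: u[0] u[1] / u[2] u[3]) to qubit q of
// an n-qubit state psi, using the big-endian convention from the talk.
void applyOneQubitGate(int n, int q, const std::complex<double> u[4],
                       std::vector<std::complex<double>>& psi) {
  const std::uint64_t nbJ = std::uint64_t(1) << (n - 1);  // number of 2x2 updates
  const std::uint64_t mR = (std::uint64_t(1) << (n - q - 1)) - 1;  // bits after qubit q
  const std::uint64_t mL = ~mR & (nbJ - 1);                        // bits before qubit q
  for (std::uint64_t j = 0; j < nbJ; ++j) {
    const std::uint64_t aj = ((j & mL) << 1) | (j & mR);  // bit of qubit q cleared
    const std::uint64_t bj = aj | (std::uint64_t(1) << (n - q - 1));  // aj + 2^(n-q-1)
    const std::complex<double> xa = psi[aj], xb = psi[bj];
    psi[aj] = u[0] * xa + u[1] * xb;
    psi[bj] = u[2] * xa + u[3] * xb;
  }
}
```

With q = n - 1 the pairs (a_j, b_j) are adjacent, and with q = 0 they are 2^(n-1) apart, matching the examples above.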
Now, this is the algorithm for a general one-qubit gate. Of course, if the gate has a specific structure, we can further simplify lines five and six of this algorithm. For example, if we have a Pauli-X gate, the unitary operation is just a swap operation.
For the Pauli-Y gate, we have a swap operation combined with multiplications by -i and i, and for the Pauli-Z gate, and also for the phase gate, we only have to update half of the vector. Also for the rotation-Z gate we get very efficient, local operations, because we basically just do a diagonal scaling.
So, more importantly, if we have a particular sparsity structure in the gates, we should definitely exploit it. The second class I'm going to talk about today is the controlled one-qubit gates, and here we consider four different cases. In the first two cases, the control qubit is smaller than the target qubit, whereas on the right, the target qubit is smaller than the control qubit. The full dot corresponds to a one-controlled gate and the open dot corresponds to a zero-controlled gate.
So let's look at two examples. In the first example here, we apply the gate on a three-qubit state. The first thing we note is that we only have to update half of the matrix, and we can exploit this: instead of the four 2-by-2 operations we had before, we now only have two 2-by-2 operations. And if you look at the example on the right, we still have only two operations to do, but on different parts of the matrix.
So if you look here, we can again compute them very efficiently through bit operations. Again, the state vector simulation breaks down into a series of 2-by-2 matrix-vector operations, but more importantly, the number of matrix-vector operations is reduced by a factor of two. So here is how we can compute the indices a_j and b_j; we have to distinguish between two different cases.
First, the control qubit is smaller than the target qubit, and vice versa; and the asterisk corresponds to either 0 or 1, for the zero- or one-controlled case respectively.
In order to implement this in a very efficient way, we define q0 and q1 as the minimum and the maximum of the control and target qubits, such that we can always guarantee that q1 is bigger than q0. In this case, we can again see that we can split the index into five different parts, and we choose the bits as follows, such that we have three blocks for the indices j. Since we now in fact have a two-qubit gate (it's a controlled one-qubit gate), we don't have to define two masks, but three masks, as follows.
For the controlled gates, we can then again compute the indices a_j and b_j via very efficient bit-shift and bit operations in general. And if we have a one-controlled gate, so the asterisks are one and not zero, we just have to add some constant.
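A sketch of this index calculation for a one-controlled one-qubit gate (my own illustrative C++, not the QCLAB++ source; the names `controlledIndices`, `mM`, `mR` are mine). The three masks spread the n-2 bits of j around the two removed bit positions, and setting the control bit is the added constant:

```cpp
#include <cassert>
#include <cstdint>

struct Pair { std::uint64_t a, b; };

// Indices of the basis-state pair updated for the j-th 2x2 operation of a
// one-controlled one-qubit gate on n qubits, control c and target t
// (big-endian). j runs over 2^(n-2) values; the removed bit positions are
// hi = n-1-min(c,t) and lo = n-1-max(c,t), with hi > lo.
inline Pair controlledIndices(int n, int c, int t, std::uint64_t j) {
  const int q0 = c < t ? c : t, q1 = c < t ? t : c;
  const int hi = n - 1 - q0, lo = n - 1 - q1;
  const std::uint64_t mR = (std::uint64_t(1) << lo) - 1;               // bits below lo
  const std::uint64_t mM = ((std::uint64_t(1) << (hi - 1)) - 1) & ~mR; // middle bits
  // Spread the bits of j around the two removed positions (both zero).
  const std::uint64_t base =
      ((j & ~(mM | mR)) << 2) | ((j & mM) << 1) | (j & mR);
  const std::uint64_t a = base | (std::uint64_t(1) << (n - 1 - c));  // control = 1
  return {a, a | (std::uint64_t(1) << (n - 1 - t))};                 // target toggled
}
```

For a CNOT on two qubits (control 0, target 1) this yields the single pair (2, 3), i.e. |10> and |11>.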
These techniques can be generalized to multi-qubit gates. For a general two-qubit gate acting on qubits q0 and q1, the state vector simulation breaks down into a series of 4-by-4 matrix-vector operations.
A special case of a two-qubit gate is the swap gate, and since it has a lot of zeros, the operation can also be implemented very efficiently by the following swap operations.
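For instance, a swap of qubits q0 and q1 only exchanges amplitudes of basis states whose two bits differ. A simple sketch (my own, visiting every basis state once rather than using the masked loop; `applySwap` is my name):

```cpp
#include <cassert>
#include <complex>
#include <cstdint>
#include <utility>
#include <vector>

// Swap gate on qubits q0 and q1 of an n-qubit state (big-endian):
// exchange amplitudes of basis states whose q0 and q1 bits differ.
void applySwap(int n, int q0, int q1, std::vector<std::complex<double>>& psi) {
  const int p0 = n - q0 - 1, p1 = n - q1 - 1;  // bit positions from the LSB
  for (std::uint64_t i = 0; i < psi.size(); ++i) {
    // Visit each pair exactly once: bit of q0 clear, bit of q1 set.
    if (((i >> p0) & 1) == 0 && ((i >> p1) & 1) == 1) {
      const std::uint64_t k =
          (i | (std::uint64_t(1) << p0)) & ~(std::uint64_t(1) << p1);
      std::swap(psi[i], psi[k]);
    }
  }
}
```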
For more general multi-qubit gates, the generalization is very straightforward: we only have to add two additional bit masks per extra qubit. This holds especially for multi-controlled multi-qubit gates, such as the Toffoli gate, or the doubly-controlled NOT gate given here on the left.
This technique is very efficient, because the more controls we have, the fewer elements of the vector we have to update. For example, here, although we have a state of length eight, we only have to update two elements of the vector, so we only do one local operation.
So now let's look at the implementation. We have implemented everything in modern C++ with OpenMP offloading, and for this example I'm going to use the Pauli-X gate, which is just a simple swap operation. Let's have a closer look at the CPU implementation. What we take as input is the total number of qubits, the qubit the gate is acting on, and a pointer to the state vector. We first define the maximum number of j values.
Then we define the left and right bit masks, which are just some bit operations, and then we have one loop where, in each iteration, we calculate the indices a_j and b_j and then just perform a simple swap operation. We use the pragma statement here to parallelize the loop.
Now, if we look at the GPU implementation, it's basically exactly the same; the only thing we have changed is the pragma statement. What we do here is GPU offloading through OpenMP: instead of the statement `#pragma omp parallel for`, we now use `#pragma omp target teams distribute parallel for`. So the only difference in the implementation is these three words.
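A minimal sketch of that kernel (my own reconstruction, not the QCLAB++ source; the name `applyPauliX` is mine), with the CPU pragma active and the GPU-offload pragma shown as the one-line change:

```cpp
#include <cassert>
#include <complex>
#include <cstdint>
#include <utility>

// Pauli-X on qubit q of an n-qubit state: swap each amplitude pair.
void applyPauliX(int n, int q, std::complex<double>* psi) {
  const std::uint64_t nbJ = std::uint64_t(1) << (n - 1);
  const std::uint64_t mR = (std::uint64_t(1) << (n - q - 1)) - 1;
  const std::uint64_t mL = ~mR & (nbJ - 1);
  // CPU version; for GPU offloading the pragma becomes:
  //   #pragma omp target teams distribute parallel for
  #pragma omp parallel for
  for (std::int64_t j = 0; j < std::int64_t(nbJ); ++j) {
    const std::uint64_t aj = ((j & mL) << 1) | (j & mR);
    const std::uint64_t bj = aj | (std::uint64_t(1) << (n - q - 1));
    std::swap(psi[aj], psi[bj]);
  }
}
```

The loop body is identical in both versions; only the pragma decides whether the iterations run on CPU threads or are offloaded to the GPU.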
So for the experiments we consider two different types of circuits. The first is the QFT circuit, where we have a lot of controlled operations, sometimes on nearest-neighbor qubits, but sometimes also on qubits located far away from each other. As a second example circuit, we use a constant-depth circuit for Hamiltonian simulation, with only nearest-neighbor connections.
So let's first have a look at the difference between the CPU and the GPU version.
We have tested everything on Perlmutter, so we have one AMD EPYC CPU with 64 cores and 128 threads, and for the GPU we use one NVIDIA A100 GPU with 40 gigabytes of memory. So let's start with this plot: the blue curves correspond to the CPU and the red ones to the GPU; the dotted lines are single precision and the solid lines are double precision. Let's first look at the GPU.
There we see very nice scaling behavior in the number of qubits, whereas for the CPU, at a certain point, it requires more time. We haven't looked into detail why this is happening, but most likely this is due to some caching effects for larger numbers of qubits. Then here on the right, you see the corresponding speedup factors from CPU to GPU, where we get, for example, up to a 40x speedup factor for a large number of qubits.
Now, if we look at the Hamiltonian circuit, it basically has very similar behavior: very nice behavior for the GPU, whereas for the CPU, at a certain point, the caching effects become very important.
We have also benchmarked QCLAB++ on Perlmutter against some other packages out there, again on the AMD EPYC and the NVIDIA A100. The results I'm showing here are just single precision, with QCLAB++ in blue, compared to three other open packages: Cirq (qsim), cuQuantum, and Qiskit. We see that QCLAB++, especially for a low number of qubits, outperforms all the other packages, and one possible explanation, especially for this big gap:
It's more than an order of magnitude, and the reason might be that these packages all use Python wrappers, so there is some overhead; we are currently looking into that. But still, for a larger number of qubits, QCLAB++ outperforms all these packages for the QFT circuit. And if you look at the Hamiltonian simulation circuit, QCLAB++ is still very competitive; only Cirq is slightly better here. So I think I have to wrap up.
So to conclude: what we have developed is a GPU-accelerated quantum circuit simulator with QCLAB++, and we have done that by calculating indices via bit operations and using an OpenMP implementation, which is very portable. This allowed us to do very efficient GPU offloading, giving us speedup factors of up to 40x, and the experiments have shown that QCLAB++ is competitive with the other packages which are out there.