From YouTube: GPU-Accelerated Quantum Circuit Simulations with QCLAB++
Description
GPU-Accelerated Quantum Circuit Simulations with QCLAB++
Roel Van Beeumen (LBL)
Hello everyone. Today I'm going to talk about some recent results on our GPU-accelerated quantum circuit simulations with QCLAB++, and that's joint work with Daan Camps from NERSC.
So let's start with: what is QCLAB++? QCLAB++ is an object-oriented and fully templated C++ package that we have developed for creating and representing quantum circuits. Please check it out; it's freely available on GitHub. We have developed it with a numerical linear algebra focus in mind.
There are three key features of QCLAB++. First is efficiency and performance. Second is numerical stability: for example, for a rotation gate we don't store the angle of the gate, but the cosine and the sine of the angle, so that we can do numerically stable operations with it. Third, we try to make it as user-friendly as possible.
The goal is to allow for rapid prototyping and testing of quantum algorithms in C++. We also have a MATLAB component, if you are interested in that. QCLAB++ is the backbone of several other packages developed here at Berkeley Lab, for example F3C++, the Fast Free Fermion Compiler for time-evolution quantum circuits of spin Hamiltonians that can be mapped to free fermions.
So today I'm going to talk about state vector simulation and how we do that efficiently in QCLAB++. How do we apply a quantum gate to a given n-qubit state, which lives in a Hilbert space of dimension 2^n?
Mathematically speaking, this corresponds to the following Kronecker product, where we have the quantum gate U, which is a unitary matrix, and on the left and the right we have Kronecker products with identity matrices. Today I'm going to explain how we can do that very efficiently by calculating indices through bit operations, and for that we will use the big-endian binary convention.
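For a one-qubit gate U acting on qubit q (with qubit 0 the leftmost in the big-endian convention), the Kronecker product in question presumably has the form

```latex
\left( I_{2^{q}} \otimes U \otimes I_{2^{\,n-q-1}} \right) \lvert \psi \rangle ,
```

with identity matrices of dimension 2^q on the left and 2^(n-q-1) on the right.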
So suppose we have just an integer j; then we can represent its binary representation as follows. Note that these monospace characters correspond to bits, which are either zero or one.
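As a small illustration of this big-endian convention (my own sketch, not QCLAB++ code; the name `qubitBit` is mine): the bit belonging to qubit q of an n-qubit basis index sits at position n - q - 1 counted from the least significant bit.

```cpp
#include <cassert>
#include <cstdint>

// Big-endian convention: qubit 0 corresponds to the most significant of
// the n bits, so its bit sits at position n - q - 1 from the LSB.
inline std::uint64_t qubitBit(std::uint64_t j, int n, int q) {
  return (j >> (n - q - 1)) & 1u;
}
```

So for j = 0b110 on three qubits, qubits 0 and 1 read 1 and qubit 2 reads 0.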
So here is an overview of the different types of quantum gates we are considering: first of all the one-qubit gates, then the two-qubit gates, and also the k-qubit gates, that is, a unitary acting on k qubits. Then we have the controlled gates, for example a controlled one-qubit gate; an example is a CNOT gate. And then there are the gates which act on different parts of the qubits, for example a swap gate on non-nearest-neighbor qubits.
So let's start with the simplest case, the one-qubit gates. Every one-qubit gate U here acts on a qubit q and has a unitary 2-by-2 matrix connected to it, and the state vector simulation breaks down into a series of 2-by-2 matrix-vector operations. I know all these indices look very complicated, so let's immediately look at an example.
So here we consider three examples: we have a three-qubit circuit where we apply the one-qubit gate U on the first, the second, or the third qubit.
If we look at the matrix representations of these three examples, we see that if we apply the one-qubit gate U to the last qubit, we just have local 2-by-2 operations. On the other hand, if we apply the one-qubit gate to the first qubit, we still have local 2-by-2 operations, indicated by the different color schemes, but the elements are now separated 2^(n-1) apart.
So if we look at these indices a_j and b_j, we can immediately see that we can split them into three different parts: the bits before qubit q, the bit of qubit q itself, and the bits after that. Without loss of generality, we can select these bits as follows, where the numerical value of j is equal to the following bit representation. Note that this j has one bit less than a_j and b_j.
Basically, they just correspond to removing the bit at position q. We can then define the following left and right masks: if we apply the mask M_L to j, only the first q bits are kept, and if we apply the right mask M_R, the remaining bits are kept. Using these left and right bit masks, we can calculate the index a_j as follows. We mask j with M_R, so we retain the lower part of the bits, and to that we add j masked with M_L and shifted one position to the left, so we keep those q bits but shifted one to the left. That gives us a_j, and the only difference between a_j and b_j is the bit at position q, which we can add to a_j by summing with 2^(n-q-1).
So here is the overall algorithm, and it's a very simple algorithm: it just contains one for loop. We define the masks, and then in every iteration we compute the indices a_j and b_j and do the local 2-by-2 operation.
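As an illustrative sketch of that loop (my own minimal C++ version, not the actual QCLAB++ source; the names `applyOneQubitGate`, `mL`, and `mR` are mine):

```cpp
#include <cassert>
#include <complex>
#include <cstdint>
#include <vector>

// Apply a 2x2 unitary u (row-major: u[0] u[1] / u[2] u[3]) to qubit q of
// an n-qubit state psi, using the big-endian convention from the talk.
void applyOneQubitGate(int n, int q, const std::complex<double> u[4],
                       std::vector<std::complex<double>>& psi) {
  const std::uint64_t nbJ = std::uint64_t(1) << (n - 1);  // number of 2x2 updates
  const std::uint64_t mR = (std::uint64_t(1) << (n - q - 1)) - 1;  // bits after qubit q
  const std::uint64_t mL = ~mR & (nbJ - 1);                        // bits before qubit q
  for (std::uint64_t j = 0; j < nbJ; ++j) {
    const std::uint64_t aj = ((j & mL) << 1) | (j & mR);  // bit of qubit q cleared
    const std::uint64_t bj = aj | (std::uint64_t(1) << (n - q - 1));  // aj + 2^(n-q-1)
    const std::complex<double> xa = psi[aj], xb = psi[bj];
    psi[aj] = u[0] * xa + u[1] * xb;
    psi[bj] = u[2] * xa + u[3] * xb;
  }
}
```

With q = n - 1 the pairs (a_j, b_j) are adjacent, and with q = 0 they are 2^(n-1) apart, matching the examples above.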
Now, this is the algorithm for a general one-qubit gate. Of course, if the gate has a specific structure, we can further simplify lines five and six of this algorithm. For example, if we have a Pauli-X gate, the unitary operation is just a swap operation.
For the Pauli-Y gate, we have a swap operation combined with multiplications by -i and i, and for the Pauli-Z gate, and also for the phase gate, we only have to update half of the vector. Also for the rotation-Z gate we get very efficient, local operations, because we basically just do a diagonal scaling.
So, more importantly, if we have a particular sparsity structure in the gates, we should definitely exploit it. The second class I'm going to talk about today is the controlled one-qubit gates, and here we consider four different cases. In the first two cases, the control qubit is smaller than the target qubit, whereas on the right, the target qubit is smaller than the control qubit. The full dot corresponds to a one-controlled gate and the open dot corresponds to a zero-controlled gate.
So let's look at two examples. In the first example here, we apply the gate on a three-qubit state. The first thing we note is that we only have to update half of the matrix, and we can exploit this: instead of the four 2-by-2 operations we had before, we now only have two 2-by-2 operations. And if you look at the example on the right, we still have only two operations to do, but on different parts of the matrix.
So if you look here, we can again compute them very efficiently through bit operations. Again, the state vector simulation breaks down into a series of 2-by-2 matrix-vector operations, but more importantly, the number of matrix-vector operations is reduced by a factor of two. So here is how we can compute the indices a_j and b_j; we have to distinguish between two different cases.
First, the control qubit is smaller than the target qubit, and vice versa; and the asterisk corresponds to either 0 or 1, for the zero- or one-controlled case respectively.
In order to implement this in a very efficient way, we define q0 and q1 as the minimum and the maximum of the control and target qubits, such that we can always guarantee that q1 is bigger than q0. In this case, we can again see that we can split the index into five different parts, and we choose the bits as follows, such that we have three blocks for the indices j. Since we now in fact have a two-qubit gate (it's a controlled one-qubit gate), we don't have to define two masks, but three masks, as follows.
For the controlled gates, we can then again compute the indices a_j and b_j via very efficient bit-shift and bit operations in general. And if we have a one-controlled gate, so the asterisks are one and not zero, we just have to add some constant.
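A sketch of this index calculation for a one-controlled one-qubit gate (my own illustrative C++, not the QCLAB++ source; the names `controlledIndices`, `mM`, `mR` are mine). The three masks spread the n-2 bits of j around the two removed bit positions, and setting the control bit is the added constant:

```cpp
#include <cassert>
#include <cstdint>

struct Pair { std::uint64_t a, b; };

// Indices of the basis-state pair updated for the j-th 2x2 operation of a
// one-controlled one-qubit gate on n qubits, control c and target t
// (big-endian). j runs over 2^(n-2) values; the removed bit positions are
// hi = n-1-min(c,t) and lo = n-1-max(c,t), with hi > lo.
inline Pair controlledIndices(int n, int c, int t, std::uint64_t j) {
  const int q0 = c < t ? c : t, q1 = c < t ? t : c;
  const int hi = n - 1 - q0, lo = n - 1 - q1;
  const std::uint64_t mR = (std::uint64_t(1) << lo) - 1;               // bits below lo
  const std::uint64_t mM = ((std::uint64_t(1) << (hi - 1)) - 1) & ~mR; // middle bits
  // Spread the bits of j around the two removed positions (both zero).
  const std::uint64_t base =
      ((j & ~(mM | mR)) << 2) | ((j & mM) << 1) | (j & mR);
  const std::uint64_t a = base | (std::uint64_t(1) << (n - 1 - c));  // control = 1
  return {a, a | (std::uint64_t(1) << (n - 1 - t))};                 // target toggled
}
```

For a CNOT on two qubits (control 0, target 1) this yields the single pair (2, 3), i.e. |10> and |11>.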
These techniques can be generalized to multi-qubit gates. For a general two-qubit gate acting on qubits q0 and q1, the state vector simulation breaks down into a series of 4-by-4 matrix-vector operations.
A special case of a two-qubit gate is the swap gate, and since it has a lot of zeros, the operation can also be implemented very efficiently by the following swap operations.
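For instance, a swap of qubits q0 and q1 only exchanges amplitudes of basis states whose two bits differ. A simple sketch (my own, visiting every basis state once rather than using the masked loop; `applySwap` is my name):

```cpp
#include <cassert>
#include <complex>
#include <cstdint>
#include <utility>
#include <vector>

// Swap gate on qubits q0 and q1 of an n-qubit state (big-endian):
// exchange amplitudes of basis states whose q0 and q1 bits differ.
void applySwap(int n, int q0, int q1, std::vector<std::complex<double>>& psi) {
  const int p0 = n - q0 - 1, p1 = n - q1 - 1;  // bit positions from the LSB
  for (std::uint64_t i = 0; i < psi.size(); ++i) {
    // Visit each pair exactly once: bit of q0 clear, bit of q1 set.
    if (((i >> p0) & 1) == 0 && ((i >> p1) & 1) == 1) {
      const std::uint64_t k =
          (i | (std::uint64_t(1) << p0)) & ~(std::uint64_t(1) << p1);
      std::swap(psi[i], psi[k]);
    }
  }
}
```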
For more general multi-qubit gates, the generalization is very straightforward: we only have to add two additional bit masks per extra qubit. This holds especially for multi-controlled multi-qubit gates, such as the Toffoli gate, or the doubly-controlled NOT gate given here on the left.
This technique is very efficient, because the more controls we have, the fewer elements of the vector we have to update. For example, here, although we have a state of length eight, we only have to update two elements of the vector, so we only do one local operation.
So now let's look at the implementation. We have implemented everything in modern C++ with OpenMP offloading, and for this example I'm going to use the Pauli-X gate, which is just a simple swap operation. Let's have a closer look at the CPU implementation. What we take as input is the total number of qubits, the qubit the gate is acting on, and a pointer to the state vector. We first define the maximum number of j values.
Then we define the left and right bit masks, which are just some bit operations, and then we have one loop where, in each iteration, we calculate the indices a_j and b_j and then just perform a simple swap operation. We use the pragma statement here to parallelize the loop.
Now, if we look at the GPU implementation, it's basically exactly the same; the only thing we have changed is the pragma statement. What we do here is GPU offloading through OpenMP: instead of the statement `#pragma omp parallel for`, we now use `#pragma omp target teams distribute parallel for`. So the only difference in the implementation is these three words.
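A minimal sketch of that kernel (my own reconstruction, not the QCLAB++ source; the name `applyPauliX` is mine), with the CPU pragma active and the GPU-offload pragma shown as the one-line change:

```cpp
#include <cassert>
#include <complex>
#include <cstdint>
#include <utility>

// Pauli-X on qubit q of an n-qubit state: swap each amplitude pair.
void applyPauliX(int n, int q, std::complex<double>* psi) {
  const std::uint64_t nbJ = std::uint64_t(1) << (n - 1);
  const std::uint64_t mR = (std::uint64_t(1) << (n - q - 1)) - 1;
  const std::uint64_t mL = ~mR & (nbJ - 1);
  // CPU version; for GPU offloading the pragma becomes:
  //   #pragma omp target teams distribute parallel for
  #pragma omp parallel for
  for (std::int64_t j = 0; j < std::int64_t(nbJ); ++j) {
    const std::uint64_t aj = ((j & mL) << 1) | (j & mR);
    const std::uint64_t bj = aj | (std::uint64_t(1) << (n - q - 1));
    std::swap(psi[aj], psi[bj]);
  }
}
```

The loop body is identical in both versions; only the pragma decides whether the iterations run on CPU threads or are offloaded to the GPU.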
So for the experiments we consider two different types of circuits. The first is the QFT circuit, where we have a lot of controlled operations, sometimes on nearest-neighbor qubits, but sometimes also on qubits located far away from each other. As a second example circuit, we use a constant-depth circuit for Hamiltonian simulation, with only nearest-neighbor connections.
So let's first have a look at the difference between the CPU and the GPU version.
We have tested everything on Perlmutter, so we have one AMD EPYC CPU with 64 cores and 128 threads, and for the GPU we use one NVIDIA A100 GPU with 40 gigabytes of memory. So let's start with this plot: the blue curves correspond to the CPU and the red ones to the GPU; the dotted lines are single precision and the solid lines are double precision. Let's first look at the GPU.
There we see very nice scaling behavior in the number of qubits, whereas for the CPU, at a certain point, it requires more time. We haven't looked into detail why this is happening, but most likely this is due to some caching effects for larger numbers of qubits. Then here on the right, you see the corresponding speedup factors from CPU to GPU, where we get, for example, up to a 40x speedup factor for a large number of qubits.
Now, if we look at the Hamiltonian circuit, it basically has very similar behavior: very nice behavior for the GPU, whereas for the CPU, at a certain point, the caching effects become very important.
We have also benchmarked QCLAB++ on Perlmutter against some other packages out there, again on the AMD EPYC and the NVIDIA A100. The results I'm showing here are just single precision, with QCLAB++ in blue, compared to three other open packages: Cirq (qsim), cuQuantum, and Qiskit. We see that QCLAB++, especially for a low number of qubits, outperforms all the other packages, and one possible explanation, especially for this big gap:
It's more than an order of magnitude, and the reason might be that these packages all use Python wrappers, so there is some overhead; we are currently looking into that. But still, for a larger number of qubits, QCLAB++ outperforms all these packages for the QFT circuit. And if you look at the Hamiltonian simulation circuit, QCLAB++ is still very competitive; only Cirq is slightly better here. So I think I have to wrap up.
So to conclude: what we have developed is a GPU-accelerated quantum circuit simulator with QCLAB++, and we have done that by calculating indices via bit operations and using an OpenMP implementation, which is very portable. This allowed us to do very efficient GPU offloading, giving us speedup factors of up to 40x, and the experiments have shown that QCLAB++ is competitive with the other packages which are out there.