National Energy Research Scientific Computing Center (NERSC) Quantum for Science Day, October 24, 2022, 4 Nov 2022

From YouTube: Parallel GPU Quantum Circuit Simulations on Qiskit Aer

Description

Parallel GPU Quantum Circuit Simulations on Qiskit Aer
Jun Doi

A

Take it away. Thank you for the introduction. Let me talk about parallel simulation on Qiskit Aer using GPUs. I am Jun Doi from IBM Quantum, and I'm a researcher.

A

Aer is one of the components of Qiskit, the open-source quantum computing platform. Qiskit Aer is a quantum circuit simulator that runs on classical computers, and it supports various simulation methods: state vector, unitary, density matrix, stabilizer, and MPS. We usually use the state vector simulator, which is the standard simulation method. The stabilizer and MPS methods are used for large-scale simulation.

A

But the quantum circuits that can be used with those simulators are very limited. Qiskit Aer also supports various types of noise models that mimic the behavior of actual quantum computers.

A

Now I'd like to talk about GPU support in Qiskit Aer. Currently Qiskit Aer supports GPUs for three simulation methods: state vector, unitary, and density matrix. We are planning to add GPU support for the stabilizer simulator, and we are also developing a tensor network simulator, which is an enhancement of the MPS simulator.

A

Here I'd like to show the performance of GPU acceleration on the state vector simulator in Qiskit Aer. The blue line shows the simulation time on the CPU, and the green line shows the simulation time using the GPU.

A

We are using Quantum Volume, which is a random quantum circuit, and we are running it on NVIDIA Tesla V100 GPUs.

A

We have 16 gigabytes of memory on a V100, so we can simulate up to 29 qubits on a single GPU. By using 6 GPUs, we increase the number of qubits to 32. We can also store the quantum state on the 6 GPUs and put part of it in CPU memory. By using all of this memory, we can simulate up to 35 qubits on this machine.

A

We have 512 gigabytes, so we can simulate this, and with GPU acceleration we get a speedup of about 10 times over the CPU simulation, like this.
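The qubit limits quoted above can be checked with a little arithmetic. This is a rough sketch that assumes double-precision complex amplitudes (16 bytes each), which the talk does not state; the function names are illustrative only.

```python
BYTES_PER_AMPLITUDE = 16  # assumption: complex128, i.e. two 8-byte floats


def state_vector_bytes(num_qubits):
    """Memory needed to hold all 2**n amplitudes of an n-qubit state vector."""
    return (2 ** num_qubits) * BYTES_PER_AMPLITUDE


def gib(num_bytes):
    """Convert a byte count to gibibytes."""
    return num_bytes / 2 ** 30


for n in (29, 30, 32, 35):
    print(f"{n} qubits: {gib(state_vector_bytes(n)):.0f} GiB")
```

Under that assumption, 29 qubits need 8 GiB, which fits on one 16-gigabyte V100, while 30 qubits would need the whole 16 GiB and leave no working space. 32 qubits need 64 GiB, which fits across six such GPUs, and 35 qubits need 512 GiB, which is consistent with combining GPU memory and 512 gigabytes of CPU memory.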

A

Let me introduce how to install GPU support for Qiskit Aer. First, install Qiskit itself with pip install qiskit. After that, we have to uninstall the existing Qiskit Aer, which is the build for CPUs, with pip uninstall qiskit-aer. Finally, we install the separate binary for the GPU-supported Qiskit Aer with pip install qiskit-aer-gpu. Now you can run Qiskit Aer with GPU support.

A

To run on the GPU, you just set the option device='GPU' in your script. Then the simulation goes to the GPU. This is a simple example that runs the Quantum Volume circuit using the state vector method on the GPU.

A

So it's very simple.
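The slide itself is not captured in the transcript, so the following is only a sketch of what such a script might look like. It assumes a recent qiskit-aer, where the simulator is imported from qiskit_aer (at the time of the talk it lived under qiskit.providers.aer). The helper names, qubit count, depth, and shot count are made up for illustration.

```python
def gpu_statevector_options():
    """Simulator options named in the talk: state-vector method on the GPU."""
    return {"method": "statevector", "device": "GPU"}


def run_quantum_volume(num_qubits=20, depth=10, shots=100, options=None):
    """Run a Quantum Volume circuit on Aer and return the measured counts.

    Needs qiskit plus a GPU build of qiskit-aer. The imports are kept inside
    the function so the options helper can be used without them installed.
    """
    from qiskit import transpile
    from qiskit.circuit.library import QuantumVolume
    from qiskit_aer import AerSimulator

    sim = AerSimulator(**(options or gpu_statevector_options()))
    circuit = QuantumVolume(num_qubits, depth, seed=0)
    circuit.measure_all()
    result = sim.run(transpile(circuit, sim), shots=shots).result()
    return result.get_counts()
```

Calling run_quantum_volume() on a machine without a supported GPU is expected to fail; passing options={"method": "statevector", "device": "CPU"} gives a CPU run for comparison.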

A

Let me explain the implementation of parallel quantum circuit simulation in Qiskit Aer. To simulate a large number of qubits, the quantum state is distributed across multiple GPUs, or across multiple processes on a cluster by using MPI.

A

To do that, we divide the state into smaller chunks, and the data exchange itself is done chunk by chunk to save memory, because we have to refer to parts of the state held in a different distributed memory to calculate the simulation.

A

If we did not divide the state into chunks, we would have to prepare a large buffer to receive the state from the other distributed memory space. By dividing it into small chunks, we only have to prepare a receive buffer for one chunk, so this technique saves memory. We also optimize the data exchange.

A

We do this by applying a transpilation technique before running the actual simulation. This is the input quantum circuit, and we divide the state into chunks. If a gate is inside the chunk, we do not have to transfer data between chunks. But some of the gates fall outside the chunk: nc is the chunk size, and if a gate operates on a qubit larger than nc, we have to transfer data between chunks.

A

So this becomes the bottleneck when calculating gates on the high-numbered qubits.
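The chunking rule described above can be sketched with plain index arithmetic. This is an illustration, not Qiskit Aer's code: it assumes 0-based qubit numbering, a chunk of 2**nc amplitudes, and 16-byte amplitudes, and the function names are invented here.

```python
BYTES_PER_AMPLITUDE = 16  # assumption: double-precision complex amplitudes


def receive_buffer_bytes(chunk_qubits):
    """Size of a buffer that holds exactly one chunk of 2**nc amplitudes."""
    return (1 << chunk_qubits) * BYTES_PER_AMPLITUDE


def chunk_of(amplitude_index, chunk_qubits):
    """Chunk that stores a given amplitude: the index bits above the low nc bits."""
    return amplitude_index >> chunk_qubits


def partner_chunk(chunk, gate_qubit, chunk_qubits):
    """Chunk whose data a single-qubit gate on gate_qubit has to read.

    A gate on a qubit below nc pairs amplitudes inside the same chunk, so no
    exchange is needed. A gate on a qubit at or above nc pairs each chunk with
    the chunk whose index differs in bit (gate_qubit - nc).
    """
    if gate_qubit < chunk_qubits:
        return chunk
    return chunk ^ (1 << (gate_qubit - chunk_qubits))
```

With nc = 20, for example, the receive buffer is 16 MiB regardless of the total number of qubits, which is the memory saving the chunked exchange is after.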

A

So before running the actual simulation, we insert swap gates to bring all the gates on the higher-numbered qubits into the chunk, like this.

A

With this transpilation, we do not have to exchange data between chunks to calculate these gates. Data transfer is needed only for the inserted swap gates, like this, so we can decrease the total data exchange and optimize the performance.
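A toy version of this swap insertion is sketched below. It is a simplified greedy pass written for illustration, not the algorithm Qiskit Aer actually uses: gates are (name, qubits) tuples, qubits are 0-based, and a qubit that is swapped into the chunk stays there until it is evicted, so several gates can share one swap.

```python
def insert_swaps(gates, num_qubits, chunk_qubits):
    """Rewrite gates so that only swap gates touch slots >= chunk_qubits.

    Returns the rewritten gate list and the final logical-to-physical layout.
    """
    position = list(range(num_qubits))  # position[logical qubit] = physical slot
    occupant = list(range(num_qubits))  # occupant[physical slot] = logical qubit
    rewritten = []
    for name, qubits in gates:
        for q in qubits:
            if position[q] < chunk_qubits:
                continue  # already inside the chunk
            # Pick the first in-chunk slot not used by this gate's other qubits.
            busy = {position[other] for other in qubits}
            slot = next(s for s in range(chunk_qubits) if s not in busy)
            high = position[q]
            rewritten.append(("swap", (high, slot)))  # the only cross-chunk step
            evicted = occupant[slot]
            occupant[slot], occupant[high] = q, evicted
            position[q], position[evicted] = slot, high
        rewritten.append((name, tuple(position[q] for q in qubits)))
    return rewritten, position


circuit = [("h", (4,)), ("cx", (4, 1)), ("rz", (4,))]
print(insert_swaps(circuit, num_qubits=5, chunk_qubits=3))
```

In this small example, one swap brings qubit 4 into the chunk and the three following gates then run without any exchange between chunks. A production pass would also choose which qubit to evict more carefully and account for the final layout when reading out results.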

A

This is another example, showing how to use multiple GPUs if you have them on the system. It is also very simple: just add the blocking_qubits option here.

A

This sets the chunk size to 20 qubits. In this example, the number of qubits in the circuit is 25, so gates on qubits 21 to 24 are transpiled into the chunk, onto qubits below 20.

A

So by adding this option, we can use multiple GPUs.
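As a sketch of how the option might be passed, assuming a recent qiskit-aer: the talk names only blocking_qubits, while blocking_enable is included here because the Aer documentation pairs the two, as far as I recall. The helper names and default values are illustrative.

```python
def multi_gpu_options(chunk_qubits=20):
    """Aer options that split the state vector into chunks of 2**chunk_qubits."""
    return {
        "method": "statevector",
        "device": "GPU",
        "blocking_enable": True,          # assumption: needed to turn chunking on
        "blocking_qubits": chunk_qubits,  # chunk size in qubits, as in the talk
    }


def run_on_multiple_gpus(circuit, chunk_qubits=20, shots=100):
    """Run a circuit with chunking enabled; needs a GPU build of qiskit-aer."""
    from qiskit import transpile
    from qiskit_aer import AerSimulator

    sim = AerSimulator(**multi_gpu_options(chunk_qubits))
    return sim.run(transpile(circuit, sim), shots=shots).result()
```

The chunk size has to be smaller than the number of qubits in the circuit for the state to be spread over more than one chunk, as in the 25-qubit example with 20-qubit chunks.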

A

This example shows how to use multiple nodes on a cluster by using MPI. Unfortunately, there is no binary distribution with MPI support, so please build from the source code if you want to use MPI. This example is also simple: the blocking_qubits option is used in the same way as in the multi-GPU case. With MPI, the result is returned to all the processes, but by querying the metadata in the result you can find out which MPI rank you are running on. To run the simulator on multiple nodes, just pass this Python code to the mpirun command.
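Reading the rank from the result might look like the sketch below. The metadata key name mpi_rank follows my recollection of the Qiskit Aer documentation and should be treated as an assumption; the helper name is invented.

```python
def mpi_rank_of(result_dict):
    """Return the MPI rank recorded in an Aer result dictionary, or None.

    result_dict is expected to be the output of result.to_dict(); the
    "mpi_rank" metadata key is an assumption, not confirmed by the talk.
    """
    return result_dict.get("metadata", {}).get("mpi_rank")
```

A script would typically call mpi_rank_of(result.to_dict()) and print or save the counts only on rank 0, since every process receives the same result. It would then be launched with something like mpirun -np 4 python script.py, where the process count and file name are placeholders.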

A

This is the performance of the multi-node simulation. We are again using NVIDIA Tesla V100 GPUs, on an IBM Power System AC922.

A

We are using from one node to eight nodes here, and we are again using the Quantum Volume circuit for this test. The left-hand graph shows the strong scaling and the right-hand graph shows the weak scaling. Strong scaling uses a circuit with a fixed number of qubits for every run.

A

In this case, the performance on two nodes is not good compared to one node because of the MPI data transfer overhead, but as the number of nodes increases, the simulation time decreases like this. For weak scaling, ideally the graph would be horizontal, but for large-scale simulation the performance is not so good. Still, it is important that we can simulate a large number of qubits by using several nodes on the cluster.
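The two kinds of scaling can be stated as simple formulas. This sketch uses the standard definitions, with illustrative function names. It also relies on the fact that a state vector doubles in size with each added qubit, so holding the per-node memory fixed means one extra qubit per doubling of nodes.

```python
def strong_scaling_efficiency(time_one_node, time_n_nodes, nodes):
    """Fixed problem size: 1.0 means n nodes run exactly n times faster."""
    return time_one_node / (nodes * time_n_nodes)


def weak_scaling_efficiency(time_one_node, time_n_nodes):
    """Fixed work per node: 1.0 means the time stays flat as nodes are added."""
    return time_one_node / time_n_nodes


def qubits_for_weak_scaling(base_qubits, nodes):
    """Qubit count that keeps per-node state size constant (nodes = power of 2)."""
    if nodes < 1 or nodes & (nodes - 1):
        raise ValueError("nodes must be a power of two")
    return base_qubits + nodes.bit_length() - 1
```

A horizontal weak-scaling curve corresponds to weak_scaling_efficiency staying at 1.0. The two-node slowdown described above corresponds to a strong-scaling efficiency below 0.5.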

A

Qiskit Aer also supports shot-level parallelization using multiple GPUs. Multi-shot simulation is used for circuits with intermediate measurements or for simulating noise models.

A

If the simulation has multiple shots, Qiskit Aer automatically distributes the shots across the GPUs, if the system has multiple GPUs, like this. But in most cases of multi-shot simulation, the number of qubits is very small. In that case, the overhead of GPU execution is the bottleneck, like this.

A

In this case the calculation time on the GPU is very small and the overhead dominates the performance, so it is not good. If the problem size is larger, the simulation time itself is larger than the overhead, so we can ignore the overhead. So we implemented a batched-shot execution technique on the GPU. This is an example for noise simulation.

A

These are some of the shots. The boxes show the original gates, and the red one-qubit boxes show the Pauli noise. Where a shot has no noise, we insert an identity gate to synchronize the execution of the original gates.
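The identity padding described here can be mimicked in a few lines. This is a toy model, not Qiskit Aer's implementation: gates are plain strings, every original gate is followed by one noise slot, and the error probability is a made-up parameter.

```python
import random


def sample_noisy_shots(gates, num_shots, error_probability, seed=0):
    """Return one gate list per shot, all of equal length.

    After every original gate there is a noise slot holding either a sampled
    Pauli error or an identity ("id") placeholder.
    """
    rng = random.Random(seed)
    shots = []
    for _ in range(num_shots):
        layers = []
        for gate in gates:
            layers.append(gate)
            noisy = rng.random() < error_probability
            layers.append(rng.choice(["x", "y", "z"]) if noisy else "id")
        shots.append(layers)
    return shots


def batched_layers(shots):
    """Transpose shots into layers; each layer stands for one batched launch."""
    return list(zip(*shots))
```

Because every shot has the same length, layer k holds the same original gate for all shots, or one noise slot per shot, so a single launch can process that layer for the whole batch instead of one launch per shot.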

A

By synchronizing the shot execution like this, we calculate each vertical box here in a single GPU kernel, so we can decrease the GPU overhead. For the Kraus noise model, a Kraus operator is inserted into the circuit here, so we can synchronize and execute it in a single GPU kernel as well. Next is the performance evaluation of the batched-shot optimization.

A

The graph shows the CPU execution, and the orange line shows the original GPU implementation. The speedup there is not so large because of the large GPU overhead, but batched-shot execution improves the performance a lot. For comparison, we also plotted the density matrix. With the density matrix simulator we can simulate the noise model in only one shot, so it is very fast, but it takes much more memory and has a large computation overhead.

A

So its performance will be worse than the state vector simulator for a large number of qubits.

A

We also support the cuQuantum API provided by NVIDIA.

A

But currently we do not have a binary distribution with cuQuantum support for Qiskit Aer, so please build from the source code if you want to use cuQuantum. It is also simple to enable this support by setting the cuStateVec_enable option at runtime.
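Enabling the option might look like this sketch, assuming a recent qiskit-aer built with cuQuantum support; the helper names are illustrative.

```python
def custatevec_options():
    """Aer options for the state-vector method on GPU with cuStateVec enabled."""
    return {"method": "statevector", "device": "GPU", "cuStateVec_enable": True}


def make_simulator(options):
    """Build an AerSimulator; needs a GPU build of qiskit-aer with cuQuantum."""
    from qiskit_aer import AerSimulator  # imported lazily: optional dependency

    return AerSimulator(**options)
```

Switching cuStateVec_enable between True and False on the same circuit is one way to compare the two GPU back ends.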

A

This is the performance comparison. The green line shows the performance with cuStateVec support, and the orange line shows Qiskit Aer's original GPU implementation. For a large number of qubits, cuStateVec support is about twice as fast as the original one.

A

So if you want to simulate a large number of qubits, please try this option. Let me summarize my talk. I introduced the parallelization in Qiskit Aer. As for the future plan, we are now developing tensor-network-based simulation by using cuTensorNet, which is also a component of the cuQuantum SDK from NVIDIA, and we are also planning to implement stabilizer simulation with GPU support.

A

Thank you, Jun, for the great talk. Are there questions for Jun?