
qiboteam / qibojit


Accelerating Qibo simulation with just-in-time compilation.

Home Page: https://qibo.science

License: Apache License 2.0

Languages: Python 99.11%, Nix 0.89%

Topics: gpu, quantum, quantum-computing, quantum-algorithms, quantum-circuit, numba, cupy, quantum-annealing

qibojit's Introduction

qibojit


This package provides acceleration features for Qibo simulations using just-in-time (JIT) compiled custom kernels with numba, cupy and cuQuantum.

Documentation

The qibojit backend documentation is available at qibo.science.

Citation policy

If you use the package please refer to the documentation for citation instructions.

qibojit's People

Contributors

alecandido, alejandrosopena, andrea-pasquale, brunoliegibastonliegi, dependabot[bot], edoardo-pedicillo, mlazzarin, pre-commit-ci[bot], renatomello, scarrazza, simone-bordoni, stavros11, vodovozovaliza


qibojit's Issues

Slow when using repeated execution

import time
from qibo import Circuit, gates
from qibo.quantum_info.random_ensembles import random_unitary

matrix1 = random_unitary(2)
matrix2 = random_unitary(4)

c = Circuit(2)
c.add(gates.UnitaryChannel([(0,), (0, 1)], [(0.1, matrix1), (0.2, matrix2)]))
c.add(gates.M(0, 1))

start_time = time.time()
result = c(nshots=1000)
final_time = time.time() - start_time

print(final_time)

It takes 12 s with qibojit (numba) and 0.2 s with numpy. I suspect this is because the qibojit overhead for small circuits is multiplied by nshots due to the repeated execution.
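
As a rough check of that hypothesis (not a measurement from this issue), one can time a single repetition of the same circuit and extrapolate linearly to 1000 shots:

import time
from qibo import Circuit, gates
from qibo.quantum_info.random_ensembles import random_unitary

matrix1 = random_unitary(2)
matrix2 = random_unitary(4)

c = Circuit(2)
c.add(gates.UnitaryChannel([(0,), (0, 1)], [(0.1, matrix1), (0.2, matrix2)]))
c.add(gates.M(0, 1))

start = time.time()
c(nshots=1)  # a single repetition pays the qibojit per-execution overhead once
single = time.time() - start
print(f"1 shot: {single:.4f} s, naive extrapolation to 1000 shots: {1000 * single:.1f} s")

If the extrapolated number is close to the 12 s quoted above, the per-execution overhead explanation is consistent.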

Single precision vs double precision performance

We should benchmark the performance in single precision vs that in double precision and ensure that the results make sense / see if there is margin for improvements.

@stavros11 when you have time, could you post your results with multiqubit gates?

CPU fallback for probabilities is not working

Description

The CPU fallback for the computation of probabilities is not working.

How to reproduce the error

Run the following code (replacing 31 with the maximum number of qubits that your GPU supports).

import qibo
from qibo.models import Circuit
from qibo import gates

c = Circuit(31)
for i in range(31):
    c.add(gates.H(i))
    c.add(gates.M(i))
result = c(nshots=1000).frequencies()

Results on my machine:

[Qibo 0.1.7rc1.dev0|INFO|2021-11-18 11:17:16]: Using qibojit backend on /GPU:0
[Qibo 0.1.7rc1.dev0|WARNING|2021-11-18 11:17:19]: Falling back to CPU because the GPU is out-of-memory.
Traceback (most recent call last):
  File ".../qibo/backends/abstract.py", line 110, in cpu_fallback
    return func(*args)
  File ".../qibo/core/gates.py", line 270, in calculate_probs
    probs = state.probabilities(measurement_gate=self)
  File ".../qibo/core/states.py", line 75, in wrapper
    return func(self, qubits=set(qubits)) # pylint: disable=E1102
  File ".../qibo/core/states.py", line 82, in probabilities
    state = K.reshape(K.square(K.abs(self.tensor)), self.nqubits * (2,))
  File ".../qibo/backends/numpy.py", line 152, in abs
    return self.backend.abs(x)
  File "cupy/_core/_kernel.pyx", line 1161, in cupy._core._kernel.ufunc.__call__
  File "cupy/_core/_kernel.pyx", line 586, in cupy._core._kernel._get_out_args
  File "cupy/_core/core.pyx", line 2540, in cupy._core.core._ndarray_init
  File "cupy/_core/core.pyx", line 184, in cupy._core.core.ndarray._init_fast
  File "cupy/cuda/memory.pyx", line 718, in cupy.cuda.memory.alloc
  File "cupy/cuda/memory.pyx", line 1395, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 1416, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 1096, in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 1117, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
  File "cupy/cuda/memory.pyx", line 1355, in cupy.cuda.memory.SingleDeviceMemoryPool._try_malloc
cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 17,179,869,184 bytes (allocated so far: 34,359,759,872 bytes).

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File ".../script.py", line 9, in <module>
    result = c(nshots=1000).frequencies()
  File ".../qibo/abstractions/circuit.py", line 712, in __call__
    return self.execute(initial_state=initial_state, nshots=nshots)
  File ".../qibo/core/circuit.py", line 309, in execute
    state.measure(self.measurement_gate, nshots, self.measurement_tuples)
  File ".../qibo/core/states.py", line 86, in measure
    self.measurements = gate(self, nshots)
  File ".../qibo/core/gates.py", line 294, in __call__
    self.result = self.measure(state, nshots)
  File ".../qibo/core/gates.py", line 275, in measure
    probs = K.cpu_fallback(calculate_probs)
  File ".../qibo/backends/abstract.py", line 116, in cpu_fallback
    return func(*args)
  File ".../qibo/core/gates.py", line 270, in calculate_probs
    probs = state.probabilities(measurement_gate=self)
  File ".../qibo/core/states.py", line 75, in wrapper
    return func(self, qubits=set(qubits)) # pylint: disable=E1102
  File ".../qibo/core/states.py", line 82, in probabilities
    state = K.reshape(K.square(K.abs(self.tensor)), self.nqubits * (2,))
  File ".../qibo/backends/numpy.py", line 152, in abs
    return self.backend.abs(x)
  File "cupy/_core/core.pyx", line 1500, in cupy._core.core.ndarray.__array_ufunc__
  File "cupy/_core/_kernel.pyx", line 1161, in cupy._core._kernel.ufunc.__call__
  File "cupy/_core/_kernel.pyx", line 586, in cupy._core._kernel._get_out_args
  File "cupy/_core/core.pyx", line 2540, in cupy._core.core._ndarray_init
  File "cupy/_core/core.pyx", line 184, in cupy._core.core.ndarray._init_fast
  File "cupy/cuda/memory.pyx", line 718, in cupy.cuda.memory.alloc
  File "cupy/cuda/memory.pyx", line 1395, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 1416, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 1096, in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 1117, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
  File "cupy/cuda/memory.pyx", line 1355, in cupy.cuda.memory.SingleDeviceMemoryPool._try_malloc
cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 17,179,869,184 bytes (allocated so far: 34,359,759,872 bytes).

Additional details

The first OOM error from CuPy is expected: in order to compute the probabilities, we compute and store K.square(K.abs()) of the state vector, so the required memory increases by 50%.
Then the CPU fallback is activated, the engine is changed from CuPy to Numba, and so self.backend is changed from cp to np.
However, K.abs in qibo/core/states.py (line 82, probabilities) still calls CuPy, which raises the OOM error again.
My guess is that K.abs(state) calls CuPy regardless of self.backend being cp or np, because the state (self.tensor in qibo/core/states.py) is still a CuPy array.
However, I tried to remove that behavior by setting NUMPY_EXPERIMENTAL_ARRAY_FUNCTION=0 and it didn't change anything.

A possible solution would be to add a cast operation in qibo/core/states.py:StateVector.probabilities(), but we also need to fix the cast operation in NumbaBackend, and I'm not sure what the best way to do that is.
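
For illustration, a minimal sketch of what such a cast could look like (my guess at a possible fix, with simplified names, not the actual Qibo code):

import numpy as np

def probabilities_on_cpu(state_tensor, nqubits):
    # After the fallback the engine is NumPy, but the state may still live on
    # the GPU as a CuPy array; move it to host memory before any further ops.
    try:
        import cupy as cp
        if isinstance(state_tensor, cp.ndarray):
            state_tensor = cp.asnumpy(state_tensor)
    except ImportError:
        pass
    return np.reshape(np.square(np.abs(state_tensor)), nqubits * (2,))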

`CuQuantum` exception

I am using the cuquantum backend and the following error keeps being raised.

Exception ignored in: <function CuQuantumBackend.__del__ at 0x7ff976e4c160>
Traceback (most recent call last):
  File "/path/to/lib/python3.9/site-packages/qibojit/backends/gpu.py", line 509, in __del__
TypeError: 'NoneType' object is not callable

However, I could not trace the reason for it, nor find a minimal code example that raises it.

Inverse of singular matrix in a quantum-to-classical channel in `numba`

I'm implementing a new noise model class ReadoutErrorChannel. It works properly with Qibo's numpy and tensorflow backends. However, it breaks with qibojit's numba backend due to the computation of inverses of singular matrices in apply_gate_density_matrix(self, gate, state, nqubits, inverse=False).

Here is a minimal code example to reproduce the raised error LinAlgError: Singular matrix.

from qibo import gates
from qibo.quantum_info import random_density_matrix, random_stochastic_matrix
from qibojit.backends import NumbaBackend

backend = NumbaBackend()

nqubits = 1
d = 2**nqubits

rho = random_density_matrix(d, seed=1)
P = random_stochastic_matrix(d, seed=1)

rho_error = gates.ReadoutErrorChannel(0, P).apply_density_matrix(
    backend, rho, 1
)

I would like to get some help with this, because I don't know why these inverses are being computed, which doesn't happen in the Qibo backends.

Multiqubit ops GPU performance

Following our discussion, this is the GPU equivalent of #51 where we can discuss various interesting findings regarding the multiqubit kernel performance on GPU and particularly how it compares to qiskit. Here are some benchmark results:

[Figure: Simulation time qibo/qiskit - RTX A6000 - double precision]

[Figure: Simulation time qibo/qiskit - RTX A6000 - single precision]

In single precision we are much faster while in double precision we are mostly equivalent apart from a specific area. Here are the exact times for this interesting area:

simulation times - nqubits=15
ntargets qibo double (sec) qiskit double (sec) qibo single (sec) qiskit single (sec)
1 0.00068 0.00171 0.00057 0.00115
2 0.00105 0.00177 0.00066 0.00184
3 0.00270 0.00184 0.00112 0.00175
4 0.00695 0.00169 0.00125 0.00242
5 0.02246 0.00259 0.00246 0.00350
6 0.03912 0.01039 0.00538 0.00637
7 0.06972 0.02882 0.00678 0.02115
8 0.12242 0.11884 0.01668 0.10134
9 0.25151 0.60537 0.05127 0.57179
10 0.77184 3.72334 0.30835 3.54705
simulation times - nqubits=16
ntargets qibo double (sec) qiskit double (sec) qibo single (sec) qiskit single (sec)
1 0.00071 0.00142 0.00061 0.00132
2 0.00108 0.00193 0.00069 0.00191
3 0.00286 0.00210 0.00120 0.00183
4 0.00756 0.00758 0.00135 0.00232
5 0.02427 0.00859 0.00269 0.00351
6 0.08424 0.01121 0.00997 0.00782
7 0.15274 0.03605 0.01274 0.02318
8 0.27393 0.11994 0.02289 0.11415
9 0.52561 0.72564 0.06245 0.65283
10 1.15000 4.27385 0.36389 4.22007
simulation times - nqubits=17
ntargets qibo double (sec) qiskit double (sec) qibo single (sec) qiskit single (sec)
1 0.00076 0.00213 0.00065 0.00203
2 0.00121 0.00233 0.00074 0.00170
3 0.00307 0.00792 0.00129 0.00286
4 0.00813 0.00277 0.00146 0.00361
5 0.02625 0.00923 0.00291 0.00575
6 0.09226 0.01316 0.01078 0.01105
7 0.33222 0.04734 0.02657 0.03166
8 0.60661 0.17166 0.04665 0.13848
9 1.09221 0.74349 0.08526 0.75295
10 2.17680 4.90263 0.41825 4.97029
simulation times - nqubits=18
ntargets qibo double (sec) qiskit double (sec) qibo single (sec) qiskit single (sec)
1 0.00092 0.00196 0.00071 0.00226
2 0.00122 0.00227 0.00080 0.00225
3 0.00329 0.00906 0.00134 0.00408
4 0.00871 0.00968 0.00154 0.00623
5 0.02826 0.01136 0.00312 0.01024
6 0.09959 0.02241 0.01180 0.01939
7 0.36258 0.06104 0.02912 0.04782
8 1.32488 0.23768 0.10081 0.17849
9 2.42261 0.87334 0.19749 0.88229
10 4.36327 5.55920 0.57205 5.60554
simulation times - nqubits=19
ntargets qibo double (sec) qiskit double (sec) qibo single (sec) qiskit single (sec)
1 0.00127 0.01038 0.00077 0.00300
2 0.00191 0.00469 0.00089 0.00329
3 0.00357 0.01160 0.00145 0.00656
4 0.00945 0.02012 0.00165 0.01049
5 0.03066 0.01703 0.00339 0.01806
6 0.10760 0.03546 0.01305 0.03311
7 0.39375 0.08286 0.03169 0.07409
8 1.44949 0.23880 0.11003 0.23198
9 5.31427 0.98861 0.39713 1.04869
10 9.69975 6.31365 0.82073 6.39574
simulation times - nqubits=20
ntargets qibo double (sec) qiskit double (sec) qibo single (sec) qiskit single (sec)
1 0.00182 0.01452 0.00123 0.00561
2 0.00312 0.01466 0.00133 0.00602
3 0.00642 0.01459 0.00187 0.01225
4 0.01038 0.03097 0.00184 0.02110
5 0.03337 0.03782 0.00392 0.03531
6 0.11569 0.05954 0.01418 0.06470
7 0.42412 0.12040 0.03445 0.13335
8 1.57073 0.24070 0.11950 0.34332
9 5.79949 1.12296 0.43513 1.31031
10 21.42194 6.96956 1.59364 7.14927
simulation times - nqubits=21
ntargets qibo double (sec) qiskit double (sec) qibo single (sec) qiskit single (sec)
1 0.00297 0.04279 0.00183 0.01087
2 0.00506 0.04736 0.00190 0.00970
3 0.01134 0.04779 0.00257 0.02269
4 0.02009 0.04995 0.00272 0.04079
5 0.03781 0.07061 0.00496 0.06971
6 0.12480 0.11806 0.03517 0.12507
7 0.45471 0.19977 0.05067 0.24671
8 1.69274 0.37102 0.13072 0.56319
9 6.32958 1.37186 0.47240 1.76927
10 23.37317 7.95625 1.75441 8.69835
simulation times - nqubits=22
ntargets qibo double (sec) qiskit double (sec) qibo single (sec) qiskit single (sec)
1 0.00529 0.04282 0.00304 0.02550
2 0.00885 0.04292 0.00307 0.02752
3 0.01958 0.04581 0.00373 0.05389
4 0.03942 0.05852 0.00431 0.08455
5 0.07661 0.08379 0.00905 0.14375
6 0.13771 0.13358 0.07412 0.25818
7 0.48743 0.24396 0.10509 0.48839
8 1.82252 0.55038 0.19315 1.01881
9 6.81780 1.78645 0.52388 2.68159
10 25.32039 9.04470 1.91643 10.59537
simulation times - nqubits=23
ntargets qibo double (sec) qiskit double (sec) qibo single (sec) qiskit single (sec)
1 0.01015 0.08001 0.00550 0.04743
2 0.01712 0.08031 0.00543 0.04986
3 0.03556 0.08724 0.00609 0.10628
4 0.07037 0.11227 0.00700 0.17147
5 0.15528 0.16596 0.01858 0.29656
6 0.29346 0.26994 0.15351 0.53666
7 0.52695 0.48206 0.22443 1.00201
8 1.96243 0.99488 0.40818 1.99313
9 7.36424 2.65864 0.75142 4.56154
10 27.27548 11.00970 2.12359 14.71918
simulation times - nqubits=24
ntargets qibo double (sec) qiskit double (sec) qibo single (sec) qiskit single (sec)
1 0.02022 0.18080 0.01053 0.08798
2 0.03352 0.18370 0.01043 0.09721
3 0.07010 0.19461 0.01097 0.21545
4 0.13568 0.25227 0.01237 0.35247
5 0.28303 0.36481 0.03299 0.61607
6 0.61949 0.58082 0.32582 1.12560
7 1.16325 1.00706 0.47051 2.11143
8 2.11352 1.91901 0.86623 4.03778
9 7.86494 4.39171 1.59513 8.44242
10 29.23039 14.76853 3.00823 22.39044
simulation times - nqubits=25
ntargets qibo double (sec) qiskit double (sec) qibo single (sec) qiskit single (sec)
1 0.04113 0.36355 0.02106 0.18211
2 0.06760 0.37848 0.02062 0.20346
3 0.14073 0.39008 0.02098 0.44870
4 0.27040 0.51104 0.02358 0.73671
5 0.54744 0.75356 0.06350 1.30015
6 1.14574 1.21546 0.67175 2.36441
7 2.47845 2.06826 1.00970 4.39131
8 4.59437 4.11890 1.82457 8.32751
9 8.45306 9.00420 3.41029 16.53761
10 31.44260 24.93261 6.36849 38.09419

It appears that qiskit has an issue here, as in some cases their single precision is significantly slower than their double. On the other hand, qibo's single precision is much faster than its double.

Below are some numbers from the DGX, where the situation is completely different and more in line with what we observed on CPU in #51:

[Figure: Simulation time qibo/qiskit - V100 - double precision]

simulation times - nqubits=24 - V100
ntargets qibo double (sec) qiskit double (sec)
3 0.42339 0.33970
4 0.62654 0.33986
5 1.00678 0.33991
6 2.19785 0.34998
7 3.63109 0.45986
8 6.53215 0.76321
9 11.95559 2.57002
10 22.17884 15.57758
simulation times - nqubits=25 - V100
ntargets qibo double (sec) qiskit double (sec)
3 0.80510 0.67170
4 1.23495 0.67187
5 2.09212 0.67181
6 4.59987 0.67553
7 7.65031 0.88789
8 13.84615 1.25915
9 25.54963 3.41262
10 47.12159 19.35345
simulation times - nqubits=26 - V100
ntargets qibo double (sec) qiskit double (sec)
3 1.65588 1.33589
4 2.56164 1.33560
5 4.35132 1.33077
6 9.55329 1.35739
7 16.05355 1.74887
8 29.17296 2.29261
9 53.89957 5.04608
10 100.10543 26.62849
simulation times - nqubits=27 - V100
ntargets qibo double (sec) qiskit double (sec)
3 3.41055 2.65044
4 5.32581 2.63898
5 9.09922 2.63975
6 20.06714 2.64693
7 33.64388 3.46601
8 61.26683 4.29842
9 113.86667 8.26060
10 212.27658 40.63561
simulation times - nqubits=28 - V100
ntargets qibo double (sec) qiskit double (sec)
3 7.15363 5.24736
4 11.12589 5.22061
5 18.98035 5.25438
6 41.96390 5.26979
7 70.22372 6.85323
8 128.53272 8.41681
9 239.46607 14.80653
10 448.56989 70.21813

NOTE: I just realized that the benchmark script we have been using does not call state.numpy(), so the final state is not transferred from GPU to CPU for qibo, something that probably happens for qiskit, which returns a numpy array. If we include the transfer, qibo's results may be worse (not sure by how much), but I believe some of the above observations will still hold. For example, the strange fact that qiskit single is slower than qiskit double remains, but that is not really our problem to solve.

No version `0.0.11` on pypi

While preparing a tutorial I noticed that, despite having a release for version 0.0.11, this version is not on PyPI.
@scarrazza, for the tutorial I don't think that changing the version matters much. However, if we want the students to use the latest version we should put it on PyPI.

Error when installing from source

Cloning the repo and running

pip install -r requirements.txt

according to the documentation, gives the following error:

ERROR: Could not open requirements file: [Errno 2] No existe el archivo o el directorio: 'requirements.txt'

Documentation

It might be nice to add some docs to Qibojit.

I can add the usual infrastructure myself, and add some docstrings while reviewing the codebase.
I'm not going to commit myself to any more extensive explanation, since I'm not an expert myself.

Backend review

This actually spans both Qibojit and Qibo itself, but since it is specific to backends, I decided to avoid polluting Qibo's tracker.

It is only a proposal and definitely not urgent. The goal is to simplify the code (for maintenance) and potentially also the implementation of new backends.

The main observation is that most of the work done at the level of the backend relies on the usage of a NumPy-compatible API.
This has already been observed since the beginning, and indeed there is a self.np attribute to access the API specific to each backend.
However, NumPy has far more refined approaches to interoperability, and since they are widely adopted by the other similar libraries, in principle some of the tasks performed by Qibo could be delegated to the libraries themselves.

In particular, the main mechanisms are __array_ufunc__ and __array_function__, which allow a NumPy call on a foreign object to be handled by the external library defining that object. They are essentially hooks that the NumPy function calls, passing along all the details of the original call.
These work not only for functions processing existing arrays, but also for the creation routines, through the like argument (see e.g. np.zeros).
Libraries like CuPy already implement this mechanism themselves. In principle, all the backend methods that just use the NumPy API should not be implemented more than once; at most the underlying NumPy operations should be hooked, by providing an __array_function__ implementation ourselves (possibly a wrapper over an existing one, if it is not sufficiently complete).
Essentially, we could act at the level of NumPy functions, filling the gaps, instead of at the level of quantum operations.

E.g. the zero_state method is implemented over and over in each backend, but it should always perform the same operations.
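
As a sketch of the hooking idea described above (assuming the backend provides a reference array of its own type and that the library implements __array_function__, as CuPy does; this is not existing Qibo code):

import numpy as np

def zero_state(nqubits, dtype="complex128", like=None):
    # `like` is any array of the target library (e.g. cp.empty(0) for CuPy);
    # with NEP 35 (NumPy >= 1.20), np.zeros(..., like=like) is dispatched to it.
    if like is None:
        state = np.zeros(2**nqubits, dtype=dtype)
    else:
        state = np.zeros(2**nqubits, dtype=dtype, like=like)
    state[0] = 1  # plain indexing; np.put(state, 0, 1) is the functional spelling
    return state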

In practice, there are many limitations that should be discussed separately:

  • CuPy backend is computing the exponential by bit-shift
    n = 1 << nqubits

    however, this happens purely in Python, so, if it is more efficient, it could simply be adopted by the unique implementation (in other places the same backend uses exponentiation):
    state = self.cp.ones(2**nqubits, dtype=self.dtype)
  • CuPy backend is using a kernel for setting an element:
    kernel = self.gates.get(f"initial_state_kernel_{self.kernel_type}")
    state = self.cp.zeros(n, dtype=self.dtype)
    kernel((1,), (1,), [state])

    where NumPy is using "fancy indexing", i.e. arr[idx] = el.
    However, if indexing is a problem for CuPy (or other backends), and in case it would be problematic to hook on its own, NumPy itself has an equivalent function, i.e. np.put. In the hooking perspective, the kernel implementation can be the np.put replacement (btw, CuPy has the same function, cp.put, and I'm pretty sure it is already hooked - but I also suspect plain indexing works, and I could quickly check, so I might be missing something about the kernel...)
  • CuPy backend requires a further operation to finalize the function - but this could also be embedded in one of two ways: adding it to the np.put replacement, or adding it at the end; however, this choice would become global (while currently it can differ method by method), so we should investigate whether this is a true limitation (most likely whoever implemented the backend has a better understanding of it)
  • the main outlier in this landscape is TensorFlow: NumPy and similar libraries are working together to standardize interfaces, through NumPy interoperability, the array API, DLPack, and NumPy-like namespaces; but while CuPy and PyTorch back almost all of these efforts, TensorFlow is only mentioned in the last two cases, and always as part of its experimental API. I suspect this might have affected past choices, since TensorFlow is one of the main backends (the only one in Qibo, other than NumPy itself); in case this blocks all possible updates, an alternative might be ditching TensorFlow in favor of PyTorch

As I said, the main observation is that the current Qibo backends contain a lot of duplicated operations, at a higher level than required (an even better example would be matrices, which should definitely not be repeated more than once).
However, the update would require some effort and some (possibly deep) refactoring of the backends. The good part is that this would be fully internal; there is no need to break any interface for the Qibo user.

Given all these points, take this as a report about an investigation for possible improvements. There is no hurry to do anything.

Long cupy dry run

As discussed in other threads, it seems that we have a 0.5 s overhead in cupy's dry run when compared to simulation times. I am opening this issue to post some profiling results related to this and discuss potential solutions.

I profiled a script that calls only the initial_state operator, which uses a minimal set of arguments: state = op.initial_state(nqubits, "complex128") and does not require any casting or GPU-CPU transfer. The following table shows the corresponding dry run (first call), simulation (second call) and compile times. The compile time is found by looking at the cumulative time spent in cupy's _compile_with_cache_cuda method, as logged in the profiling output.
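
For reference, a minimal sketch (not the actual benchmark script) of how compilation time can be separated from execution time for a cupy kernel:

import time
import cupy as cp

code = r"""
extern "C" __global__ void init_state(double* state) {
    state[0] = 1.0;
}
"""
kernel = cp.RawKernel(code, "init_state")
state = cp.zeros(1 << 20, dtype="float64")

start = time.time()
kernel((1,), (1,), (state,))   # first call triggers NVRTC compilation (dry run)
cp.cuda.Device().synchronize()
print("dry run:", time.time() - start)

start = time.time()
kernel((1,), (1,), (state,))   # second call reuses the compiled kernel
cp.cuda.Device().synchronize()
print("simulation:", time.time() - start)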

Profiling results
nqubits dry run simulation cupy compile dry run - compile - simulation
3 1.07826 0.00002 1.077 0.00124
4 0.75259 0.00002 0.751 0.00157
5 0.79743 0.00002 0.796 0.00141
6 0.73674 0.00002 0.735 0.00173
7 0.98545 0.00002 0.984 0.00143
8 0.75880 0.00002 0.756 0.00278
9 0.75664 0.00002 0.755 0.00162
10 0.70874 0.00002 0.707 0.00172
11 0.79865 0.00002 0.797 0.00163
12 0.66965 0.00002 0.668 0.00163
13 0.67687 0.00002 0.676 0.00085
14 0.73576 0.00002 0.735 0.00074
15 0.75079 0.00002 0.749 0.00178
16 0.73728 0.00002 0.736 0.00126
17 1.46634 0.00005 1.464 0.00229
18 0.69862 0.00002 0.697 0.00160
19 1.39950 0.00005 1.398 0.00145
20 0.72016 0.00003 0.719 0.00114
21 0.73185 0.00003 0.731 0.00082
22 0.71809 0.00003 0.717 0.00106
23 0.72895 0.00004 0.728 0.00091
24 0.69690 0.00411 0.695 -0.00221
25 0.69932 0.00170 0.697 0.00062
26 0.70868 0.00175 0.707 -0.00007
27 1.91245 0.00025 1.907 0.00520
28 1.29077 0.00018 1.286 0.00459
29 0.90459 0.00057 0.898 0.00602
30 0.77462 0.00292 0.761 0.01069

So it seems that the whole dry run overhead comes from this function call. @scarrazza, do you have other examples where cupy compilation is fast?

Tests in Qibo repository fail with AMD ROCm

When using an AMD GPU, the tests in the Qibo repository fail.
On my setup, I get 92 failed tests, in particular:

  • rocblas_status_not_implemented errors, related to some linear algebra methods e.g. eigh.
  • Assertion errors
  • Some Type errors (but they seem to disappear with PR #33).

The tests in this repository are fine, though.

Improvements for cuQuantum backend

After a chat with the NVIDIA cuQuantum team, they suggested that in order to improve performance we could try to reduce the number of memory allocations. Currently we have the following mechanism:

workspaceSize = self.cusv.apply_matrix_get_workspace_size(
    self.handle,
    data_type,
    nqubits,
    gate_ptr,
    data_type,
    self.cusv.MatrixLayout.ROW,
    adjoint,
    ntarget,
    ncontrols,
    compute_type,
)
# check the size of external workspace
if workspaceSize > 0:
    workspace = self.cp.cuda.memory.alloc(workspaceSize)
    workspace_ptr = workspace.ptr
else:
    workspace_ptr = 0

So, basically, every time a gate requires extra workspace we allocate it.
A possible solution that they proposed consists of the following (see the sketch after this list):

  • Having a workspace member of the class CuQuantumBackend initialized at a fixed value (8 MB or 16 MB)
  • Keeping track of the workspace during the execution in order to allocate extra memory only if we run out of memory. Something like:
if current_workspace + workspace_required > workspace_allocated:
    allocate_memory
  • They also suggested that during the allocation we should allocate a considerable amount of memory to avoid allocating small bits of memory for each gate.
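
A minimal sketch of the proposed bookkeeping, assuming a single growing buffer kept by the backend instance (class and method names are illustrative, not the actual qibojit API):

class WorkspaceManager:
    """Reuse one cupy allocation for all gates, growing it only when needed."""

    def __init__(self, cp, initial_size=8 * 1024**2):  # start with 8 MB
        self.cp = cp
        self.size = initial_size
        self.workspace = cp.cuda.memory.alloc(self.size)

    def get_ptr(self, required_size):
        if required_size == 0:
            return 0
        if required_size > self.size:
            # grow generously so later gates rarely trigger another allocation
            self.size = max(required_size, 2 * self.size)
            self.workspace = self.cp.cuda.memory.alloc(self.size)
        return self.workspace.ptr

With this, the apply_matrix call would receive get_ptr(workspaceSize) instead of allocating a fresh buffer for every gate.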

Other things mentioned during the meeting:

  • check if this block of code is needed
    if isinstance(gate, self.cp.ndarray):
        gate_ptr = gate.data.ptr
    elif isinstance(gate, self.np.ndarray):
        gate_ptr = gate.ctypes.data
    else:
        raise ValueError
  • they should now support multi GPU for cuquantum-python

@scarrazza @stavros11 let me know what you think.
If you agree I will try to implement the solution that they proposed for handling the workspace.

Cuquantum fails using the latest release

I believe a new cuquantum-python version was uploaded in conda-forge yesterday and there are some changes in the API. For example our cuquantum tests both in qibo and qibojit now fail with

AttributeError: module 'cuquantum.custatevec' has no attribute 'apply_matrix_buffer_size'

I believe this method was renamed to apply_matrix_get_workspace_size in the latest version. I have not checked in detail if any other changes have been made.

Dry run overhead is inconsistent between different environments

I've noticed that the dry run overhead is inconsistent between different environments.
For example, these are the results on the same machine with the same version of qibo and qibojit.
As always, I have disabled the compilation during import time.

  • CuPy 9.6.0, cudatoolkit v12.2.0 both from conda-forge: ~3.9 s, ~ 3.3 s with cached kernels
  • CuPy 9.6.0 from pip (cupy-cuda115), system installation of cuda v11.5: ~ 3.7 s, ~ 3.2 s with cached kernels
  • CuPy 9.5.0, cudatoolkit v11.2.2 both from conda-forge: ~2.7 s, ~ 2.3 s with cached kernels
  • CuPy 9.6.0, cudatoolkit v11.1.1 both from conda-forge: ~2.2 s, ~ 1.6 s with cached kernels
  • CuPy 9.5.0, cudatoolkit v11.1.1 both from conda-forge: ~1.4 s, ~ 0.9 s with cached kernels

and so on. Two comments:

  • It seems like the caching system is working, but the effect is limited.
  • We may just leave the compilation during import, but a user with the most up-to-date environment will see ~ 5 s of import time on the machine I'm using. On the other hand, I'm going to open a draft PR soon to discuss a possible way to decrease the compilation times. EDIT (see #45)

Kernel calls and thread synchronization

In CupyBackend the thread synchronization after kernel calls seems inconsistent.
In particular, it is used in one_qubit_base, two_qubit_base and multi_qubit_base, but it is not used in initial_state and collapse_state.

If it is mandatory, we should add it where it's missing, otherwise we should remove it where it's unnecessary. What do you think?
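
For context, a toy example of the pattern in question (not one of the qibojit kernels): kernel launches are asynchronous, and later CuPy calls on the same stream are ordered after the kernel anyway, so the explicit synchronization mainly changes when the host waits, i.e. timing rather than correctness.

import numpy as np
import cupy as cp

code = r"""
extern "C" __global__ void scale(double* x, double a, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) x[i] *= a;
}
"""
kernel = cp.RawKernel(code, "scale")
x = cp.ones(1 << 20, dtype="float64")

kernel((1024,), (1024,), (x, np.float64(2.0), np.int32(x.size)))
cp.cuda.stream.get_current_stream().synchronize()  # explicit sync, as in *_qubit_base
# initial_state and collapse_state skip this call and rely on stream ordering.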

ROCm support

I have tested the code using cupy-rocm-4-0 and it seems that the current code can work with minor adjustments.

Invalid NUMBA_NUM_THREADS while running with Numpy

While debugging a very simple test for qibotn (on the qibotn:pypkg branch) I encountered the error reported below, coming from the Numba backend, even though the NumPy one was selected.

As @stavros11 already noticed correctly, the cupy behavior is perfectly expected and adequate, so the actual issues are:

  • why is Numba complaining about the number of threads?
  • why is Numba involved at all, if Numpy was selected?

Most likely the problem is machine dependent, since my machine has 6 cores (x2 with hyperthreading), so the total number of logical CPUs (i.e. 12) exceeds the maximum of 6 imposed by Numba.
However, I don't get why Numba is imposing a maximum at all, nor why this happens with qibojit but never happened with other Numba projects (for which the answer might actually be more complicated, and I don't expect it to be found here...)
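
For illustration only, a possible guard (not a proposed final fix, and it does not explain where the maximum of 6 comes from): clamp the requested value to Numba's configured limit before calling set_num_threads.

import numba
import psutil
from numba import config

nthreads = len(psutil.Process().cpu_affinity())  # 12 on this machine
limit = config.NUMBA_NUM_THREADS                 # whatever maximum Numba has configured (6 here)
numba.set_num_threads(min(nthreads, limit))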

Pytest run with traceback
❯ pytest
===================================================================== test session starts =====================================================================
platform linux -- Python 3.10.7, pytest-7.2.1, pluggy-1.0.0
rootdir: /media/alessandro/moneybin/Projects/Qibo/qibotn, configfile: pyproject.toml, testpaths: tests/
plugins: env-0.8.1, cov-4.0.0
collected 4 items

tests/test_qasm_quimb_backend.py FFFF                                                                                                                   [100%]

========================================================================== FAILURES ===========================================================================
________________________________________________________________________ test_eval[1] _________________________________________________________________________

[...]

________________________________________________________________________ test_eval[10] ________________________________________________________________________

backend = 'qibojit', platform = 'numpy', runcard = None

    def construct_backend(backend, platform=None, runcard=None):
        if backend == "qibojit":
            from qibojit.backends import CupyBackend, CuQuantumBackend, NumbaBackend

            if platform == "cupy":  # pragma: no cover
                return CupyBackend()
            elif platform == "cuquantum":  # pragma: no cover
                return CuQuantumBackend()
            elif platform == "numba":
                return NumbaBackend()
            else:  # pragma: no cover
                try:
>                   return CupyBackend()

env/lib/python3.10/site-packages/qibo/backends/__init__.py:22:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = numpy

    def __init__(self):
        NumpyBackend.__init__(self)

>       import cupy as cp  # pylint: disable=import-error
E       ModuleNotFoundError: No module named 'cupy'

env/lib/python3.10/site-packages/qibojit/backends/gpu.py:18: ModuleNotFoundError

During handling of the above exception, another exception occurred:

nqubits = 10

    @pytest.mark.parametrize("nqubits", [1, 2, 5, 10])
    def test_eval(nqubits: int):
        print(f"Testing for {nqubits} nqubits")
>       result = qasm_quimb.eval_QI_qft(nqubits)

tests/test_qasm_quimb_backend.py:9:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
env/lib/python3.10/site-packages/qibotn/qasm_quimb.py:187: in eval_QI_qft
    qibo.set_backend(backend=qibo_backend, platform="numpy")
env/lib/python3.10/site-packages/qibo/backends/__init__.py:108: in set_backend
    GlobalBackend.set_backend(backend, platform, runcard)
env/lib/python3.10/site-packages/qibo/backends/__init__.py:84: in set_backend
    cls._instance = construct_backend(backend, platform, runcard)
env/lib/python3.10/site-packages/qibo/backends/__init__.py:24: in construct_backend
    return NumbaBackend()
env/lib/python3.10/site-packages/qibojit/backends/cpu.py:68: in __init__
    self.set_threads(len(psutil.Process().cpu_affinity()))
env/lib/python3.10/site-packages/qibojit/backends/cpu.py:79: in set_threads
    numba.set_num_threads(nthreads)
env/lib/python3.10/site-packages/numba/np/ufunc/parallel.py:607: in set_num_threads
    snt_check(n)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

n = 12

    def snt_check(n):
        if n > NUMBA_NUM_THREADS or n < 1:
>           raise ValueError(msg)
E           ValueError: The number of threads must be between 1 and 6

env/lib/python3.10/site-packages/numba/np/ufunc/parallel.py:569: ValueError
-------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------
Testing for 10 nqubits
====================================================================== warnings summary =======================================================================
env/lib/python3.10/site-packages/quimb/linalg/approx_spectral.py:11
  /media/alessandro/moneybin/Projects/Qibo/qibotn/env/lib/python3.10/site-packages/quimb/linalg/approx_spectral.py:11: DeprecationWarning: Please use `uniform_filter1d` from the `scipy.ndimage` namespace, the `scipy.ndimage.filters` namespace is deprecated.
    from scipy.ndimage.filters import uniform_filter1d

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html

---------- coverage: platform linux, python 3.10.7-final-0 -----------
Coverage XML written to file coverage.xml

=================================================================== short test summary info ===================================================================
FAILED tests/test_qasm_quimb_backend.py::test_eval[1] - ValueError: The number of threads must be between 1 and 6
FAILED tests/test_qasm_quimb_backend.py::test_eval[2] - ValueError: The number of threads must be between 1 and 6
FAILED tests/test_qasm_quimb_backend.py::test_eval[5] - ValueError: The number of threads must be between 1 and 6
FAILED tests/test_qasm_quimb_backend.py::test_eval[10] - ValueError: The number of threads must be between 1 and 6
================================================================ 4 failed, 1 warning in 2.64s =================================================================
CPU details
❯ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         39 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  12
  On-line CPU(s) list:   0-11
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
    CPU family:          6
    Model:               158
    Thread(s) per core:  2
    Core(s) per socket:  6
    Socket(s):           1
    Stepping:            10
    CPU(s) scaling MHz:  82%
    CPU max MHz:         4600,0000
    CPU min MHz:         800,0000
    BogoMIPS:            6399.96
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdp
                         e1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monito
                         r ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rd
                         rand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fs
                         gsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida a
                         rat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities
Virtualization features:
  Virtualization:        VT-x
Caches (sum of all):
  L1d:                   192 KiB (6 instances)
  L1i:                   192 KiB (6 instances)
  L2:                    1,5 MiB (6 instances)
  L3:                    12 MiB (1 instance)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-11
Vulnerabilities:
  Itlb multihit:         KVM: Mitigation: VMX disabled
  L1tf:                  Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
  Mds:                   Mitigation; Clear CPU buffers; SMT vulnerable
  Meltdown:              Mitigation; PTI
  Mmio stale data:       Mitigation; Clear CPU buffers; SMT vulnerable
  Retbleed:              Mitigation; IBRS
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Mitigation; Microcode
  Tsx async abort:       Mitigation; TSX disabled

Cannot run pytest when cuquantum is installed

When I run pytest on the latest main I get

_________________________________________________ ERROR collecting test session __________________________________________________
../../anaconda3/envs/qibojit/lib/python3.9/importlib/__init__.py:127: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
<frozen importlib._bootstrap>:1030: in _gcd_import
    ???
<frozen importlib._bootstrap>:1007: in _find_and_load
    ???
<frozen importlib._bootstrap>:986: in _find_and_load_unlocked
    ???
<frozen importlib._bootstrap>:680: in _load_unlocked
    ???
../../anaconda3/envs/qibojit/lib/python3.9/site-packages/_pytest/assertion/rewrite.py:168: in exec_module
    exec(co, module.__dict__)
src/qibojit/tests/conftest.py:11: in <module>
    BACKENDS.get(backend_name)()
src/qibojit/backends/cpu.py:30: in __init__
    from numba import __version__ as numba_version
../../anaconda3/envs/qibojit/lib/python3.9/site-packages/numba/__init__.py:42: in <module>
    from numba.np.ufunc import (vectorize, guvectorize, threading_layer,
../../anaconda3/envs/qibojit/lib/python3.9/site-packages/numba/np/ufunc/__init__.py:3: in <module>
    from numba.np.ufunc.decorators import Vectorize, GUVectorize, vectorize, guvectorize
../../anaconda3/envs/qibojit/lib/python3.9/site-packages/numba/np/ufunc/decorators.py:3: in <module>
    from numba.np.ufunc import _internal
E   SystemError: initialization of _internal failed without raising an exception

After creating a new conda environment I found that this appears only after installing cuquantum. If I just install the basic qibojit with numba, the tests work. I have not checked whether the issue is really cuquantum or cupy (which is installed together with cuquantum).

If others cannot reproduce it, it may be related to my machine.

FSWAP operator

Following the discussion in qiboteam/qibo#536, we should consider updating the swap kernel to include the sign information, and thus share the implementation with the fermionic swap gate.
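
For reference, the fermionic SWAP differs from the plain SWAP only by a sign on the |11> amplitude, which is why a single kernel with an optional sign flip could cover both:

import numpy as np

SWAP = np.array([[1, 0, 0, 0],
                 [0, 0, 1, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1]], dtype=complex)

FSWAP = SWAP.copy()
FSWAP[3, 3] = -1  # extra minus sign when both modes are occupied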

Multiqubit ops CPU performance

As we saw during our discussion about qiboteam/qibo#505, we are observing some performance issues while incorporating the multiqubit ops in qibo, particularly in comparison to qiskit. Here are some benchmarks on CPU for circuits of the following type:

q0 : ─U───────────────────────────────
q1 : ─U─U─────────────────────────────
q2 : ─U─U─U───────────────────────────
q3 : ─U─U─U─U─────────────────────────
q4 : ─U─U─U─U─U───────────────────────
q5 : ───U─U─U─U─U─────────────────────
q6 : ─────U─U─U─U─U───────────────────
q7 : ───────U─U─U─U─U─────────────────
q8 : ─────────U─U─U─U─U───────────────
q9 : ───────────U─U─U─U─U─────────────
q10: ─────────────U─U─U─U─U───────────
q11: ───────────────U─U─U─U─U─────────
q12: ─────────────────U─U─U─U─U───────
q13: ───────────────────U─U─U─U─U─────
q14: ─────────────────────U─U─U─U─U───
q15: ───────────────────────U─U─U─U─U─
q16: ─────────────────────────U─U─U─U─
q17: ───────────────────────────U─U─U─
q18: ─────────────────────────────U─U─
q19: ───────────────────────────────U─

where U is a multiqubit (here five-qubit) unitary. A sketch of how such a circuit can be built is shown below, followed by the benchmark figures:
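
This is my reconstruction of the benchmark circuit (whether each U is resampled or reused should not affect the timing):

from qibo import Circuit, gates
from qibo.quantum_info.random_ensembles import random_unitary

def ladder_circuit(nqubits=20, ntargets=5):
    """Apply an ntargets-qubit random unitary on every window of consecutive qubits."""
    circuit = Circuit(nqubits)
    for start in range(nqubits - ntargets + 1):
        matrix = random_unitary(2**ntargets)
        circuit.add(gates.Unitary(matrix, *range(start, start + ntargets)))
    return circuit

c = ladder_circuit()
result = c()  # simulated with the currently active backend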

[Figure: multiqubit - qibo/qiskit - simulation time - double]

[Figure: multiqubit - qibo/qiskit - simulation time - single]

Since in previous benchmarks on this repository we were comparing calling the custom operators directly vs qiskit, I made an additional comparison of qibo (with qibojit) vs qibojit:

[Figure: multiqubit - qibo/qibojit - simulation time - double]

[Figure: multiqubit - qibo/qibojit - dry run time - double]

[Figure: multiqubit - qibo/qibojit - simulation time - single]

[Figure: multiqubit - qibo/qibojit - dry run time - single]

From these we see that the qibo overhead is minimal and not enough to explain the difference with qiskit, so the discrepancy most likely comes from the qibojit side. Also, although the absolute qibo/qiskit ratios differ between single and double precision, the behavior is qualitatively the same. What is interesting is that qiskit appears much faster in the region 4 <= ntargets <= 6, nqubits > 20, but becomes much slower for ntargets > 6. Here are some absolute times (not ratios) that clearly show this:
nqubits=23 - simulation times - double
ntargets qibo (sec) qibojit (sec) qiskit (sec)
3 0.09326 0.06830 0.03834
4 0.17647 0.16432 0.05503
5 0.25915 0.24949 0.10277
6 0.45729 0.48950 0.18796
7 0.65344 0.78138 0.74864
8 0.98115 1.19669 2.51820
9 1.54533 1.62834 5.98414
10 3.47243 3.38084 42.43832
nqubits=24 - simulation times - double
ntargets qibo (sec) qibojit (sec) qiskit (sec)
3 0.15120 0.21074 0.08805
4 0.35044 0.30915 0.12071
5 0.58445 0.56411 0.20928
6 1.09115 1.02924 0.34992
7 1.16172 1.22445 1.23091
8 2.01835 1.91831 5.66647
9 3.32717 3.47910 11.75414
10 7.11032 6.52692 73.70552
nqubits=25 - simulation times - double
ntargets qibo (sec) qibojit (sec) qiskit (sec)
3 0.31643 0.36482 0.24589
4 0.59639 0.58661 0.26426
5 1.27955 1.26141 0.38248
6 2.18827 2.08555 0.71935
7 2.47914 2.32125 2.56939
8 4.05328 4.07807 11.53120
9 6.88069 6.86786 22.85148
10 14.79330 13.21936 133.96834

Qibo's times grow as expected with ntargets, while qiskit's jump sharply at ntargets=7. It looks like they have a very good implementation for ntargets < 7 (perhaps based on some decomposition?) and a very bad one for more targets. I think @mlazzarin observed something similar in the past, right?

For all these benchmarks the number of threads was set using multiprocessing.cpu_count(), with all libraries using half of the total threads, and it was verified that the final wavefunctions agree.

Probabilities do not sum to 1

If I run the following circuit with qibojit on GPU:

import qibo
from qibo.models import Circuit
from qibo import gates
qibo.set_precision("single")

c = Circuit(31)
for i in range(31):
    c.add(gates.H(i))
output = c.add(gates.M(30, collapse=True))
for i in range(31):
    c.add(gates.H(i))
result = c()

I obtain the following error:

[Qibo 0.1.7rc1.dev0|INFO|2021-11-18 12:12:15]: Using qibojit backend on /GPU:0
Traceback (most recent call last):
  File ".../circuit.py", line 15, in <module>
    result = c()
  File ".../qibo/abstractions/circuit.py", line 712, in __call__
    return self.execute(initial_state=initial_state, nshots=nshots)
  File ".../qibo/core/circuit.py", line 306, in execute
    state = self._device_execute(initial_state)
  File ".../qibo/core/circuit.py", line 235, in _device_execute
    state = self._execute(initial_state=initial_state)
  File ".../qibo/core/circuit.py", line 221, in _execute
    state = self._eager_execute(state)
  File ".../qibo/core/circuit.py", line 188, in _eager_execute
    state = gate(state)
  File ".../qibo/core/gates.py", line 299, in __call__
    return getattr(self, self._active_call)(state)
  File ".../qibo/core/gates.py", line 288, in _state_vector_call
    return K.state_vector_collapse(self, state, self.result.binary[-1])
  File ".../qibo/core/measurements.py", line 111, in binary
    self._binary = self._convert_to_binary()
  File ".../qibo/core/measurements.py", line 167, in _convert_to_binary
    return K.mod(K.right_shift(self.decimal[:, K.newaxis], _range), 2)
  File ".../qibo/core/measurements.py", line 102, in decimal
    self._decimal = self._sample_shots()
  File ".../qibo/core/measurements.py", line 186, in _sample_shots
    result = K.cpu_fallback(K.sample_shots, self.probabilities, self.nshots)
  File ".../qibo/backends/abstract.py", line 110, in cpu_fallback
    return func(*args)
  File ".../qibo/backends/numpy.py", line 232, in sample_shots
    return self.random.choice(range(len(probs)), size=nshots, p=probs)
  File ".../cupy/random/_sample.py", line 190, in choice
    return rs.choice(a, size, replace, p)
  File ".../cupy/random/_generator.py", line 1042, in choice
    raise ValueError('probabilities do not sum to 1')
ValueError: probabilities do not sum to 1

Actually, probs sums to 0.25.

EDIT: Just wanted to add that, if I run the same circuit with a smaller number of qubits, it works fine.

Reduce compilation impact

When working with programs with tiny circuits, the JIT strategy may introduce a non-negligible time overhead, in particular when measuring the time between circuit allocation and execution. One possibility to remove this overhead is to implement a tracing feature during package import, so that execution times reflect the best performance achievable on a specific system.
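
A minimal sketch of what such an import-time warm-up could look like (the function name and signature here are illustrative, not the actual qibojit API):

import numpy as np

def warmup(apply_one_qubit_kernel):
    # Run the kernel once on a trivial one-qubit state at import time, so numba
    # compiles it before the user's first circuit is executed.
    state = np.zeros(2, dtype="complex128")
    state[0] = 1.0
    gate = np.eye(2, dtype="complex128")
    apply_one_qubit_kernel(state, gate, 1, 0)  # (state, gate, nqubits, target)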

Wrong result when applying unitary matrix using cuquantum

The following fails:

import numpy as np
import qibo
from qibo import models, gates
from scipy.linalg import expm
qibo.set_backend("qibojit", platform="cuquantum")

# generate random unitary matrix
matrix = np.random.random((2, 2)) + 1j * np.random.random((2, 2))
matrix = expm(1j * (matrix + matrix.T.conj()))

initial_state = np.random.random(2) + 1j * np.random.random(2)
initial_state = initial_state / np.sqrt(np.sum(np.abs(initial_state) ** 2))
target_state = matrix.dot(initial_state)

circuit = models.Circuit(1)
circuit.add(gates.Unitary(matrix, 0))
final_state = circuit(initial_state=np.copy(initial_state))

np.testing.assert_allclose(final_state, target_state)

while if I remove the expm line it works.

This issue is captured by test_cirq.py in qibo using QIBOJIT_PLATFORM=cuquantum pytest test_cirq.py. Interestingly, if one uses QIBOJIT_PLATFORM=cuquantum pytest to run all the tests, the issue does not appear. This means that the platform is switched back to cupy somewhere in the previous tests and cuquantum is not tested properly. This issue will be resolved in qiboteam/qibo#539, so it may be a good idea to wait for that before making any releases.

Module not found during installation

As observed today, the approach in https://github.com/qiboteam/qibojit/blob/main/setup.py#L29 only works if the qibo package is already installed; otherwise the code raises a module error, but installs properly.

The problem is that, even if qibo is required, the current pip installation does not register qibo (refresh its environment) before executing this line, thus it raises an error and proceeds with python setup.py install (which now has qibo available).

In principle this is not a big deal, I think we have 3 options:

  1. keep as it is
  2. update the docs explaining that qibo must be installed beforehand
  3. remove the precompilation from setup.py

@stavros11, @mlazzarin, @andrea-pasquale what is your opinion?
