@hunse asked me to open a discussion on this repo about potential improvements.
Firstly, I'd like to say that multi-GPU support may become extremely viable with NVIDIA Pascal's release of NVLink and AMD Polaris's coherent interconnect fabric (I believe that's the name), since it may be much more practical than on previous generations to split the simulation of different ensembles between GPUs. I developed a similar system about 18 months ago for multi-agent clustering: a "pseudo boosting" algorithm ran on the CPU to determine which clusters to group together on each GPU. Since the topology of a Nengo network is known beforehand, it should be possible to implement a similar system with significantly less overhead.
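To make the idea concrete, here is a minimal sketch of a static partition of ensembles across GPUs. All names are hypothetical, and this only balances neuron counts; the CPU-side "pseudo boosting" pass I describe above would refine the assignment using the network's connectivity.

```python
import heapq

def partition_ensembles(sizes, n_gpus):
    """Greedily assign ensembles to GPUs, balancing total neuron count.

    `sizes` maps an ensemble name to its neuron count. Because a Nengo
    network's topology is known before the simulation starts, a static
    partition like this can be computed once, up front.
    """
    # Min-heap of (load, gpu_index) so the least-loaded GPU is popped first.
    heap = [(0, gpu) for gpu in range(n_gpus)]
    heapq.heapify(heap)
    assignment = {}
    # Place the largest ensembles first (classic greedy bin balancing).
    for name in sorted(sizes, key=sizes.get, reverse=True):
        load, gpu = heapq.heappop(heap)
        assignment[name] = gpu
        heapq.heappush(heap, (load + sizes[name], gpu))
    return assignment
```

A real version would weight edges between ensembles so heavily-connected ones land on the same device, but the one-time, ahead-of-time nature of the computation is the point.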
Secondly, I noticed that OCL doesn't appear to reuse memory it already allocated on previous kernel executions, and doesn't avoid branching. While this is fine for now, @hunse mentioned that he may want to implement dynamic parallelism, and when dynamic parallelism is in use you need much tighter control over memory and branching to avoid major bottlenecks.
An example of well-implemented dynamic parallelism is available here. Note that the file has a few syntax errors, but the idea is there: no branch has an uneven number of instructions, no new stack frames are created, and memory is reused as much as possible. Furthermore, I do not recommend trampolining of any sort when doing dynamic parallelism; it isn't designed for that, and you will run into limitations very quickly.
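To illustrate the branch-balancing point with a toy sketch of my own (not code from the linked file): instead of letting the two sides of a conditional execute different instruction counts and cause divergence, both sides can be computed and selected by a predicate, so every thread runs the same instruction stream.

```python
def branchless_select(cond, a, b):
    """Select a when cond is truthy, b otherwise, without an if/else.

    Mirrors the GPU trick of replacing divergent branches with
    predicated arithmetic: the predicate becomes 0 or 1, and both
    operands contribute through the same instructions every time.
    """
    p = int(bool(cond))  # predicate as 0/1 (a comparison on the GPU)
    return p * a + (1 - p) * b
```

On actual hardware the compiler or `select()` builtin often does this for you, but hand-balancing matters once child kernels are being launched from the device.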
Last time I used OpenCL, you couldn't assign more than one thread to an individual GPU; it's possible that Vulkan has changed this. That being said, BEFORE attempting to tackle dynamic parallelism, we should look into optimizing how we batch tasks to the GPU from the CPU. Since the order of instructions is known at all points of the simulation, if we rework how we're managing memory a bit, it's very possible we could well exceed the performance of dynamic parallelism, and there will probably be some performance boost over the existing state regardless. If we find we still need dynamic parallelism, we can always try that too.
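A rough sketch of what "optimizing how we batch" could mean, given that the op order is fixed: fuse consecutive ops of the same kind into one batch before launch, so each batch becomes a single larger kernel invocation instead of many small ones. The op tuples here are hypothetical stand-ins for real kernels.

```python
from itertools import groupby

def batch_ops(ops):
    """Group an ordered op list into batches of consecutive same-kind ops.

    `ops` is a list of (kind, payload) tuples, where kind names a kernel
    (e.g. "dot", "copy"). Because Nengo's op order is known before the
    simulation starts, this grouping can be done once at build time and
    the batches replayed every step.
    """
    return [(kind, [payload for _, payload in group])
            for kind, group in groupby(ops, key=lambda op: op[0])]
```

Only adjacent ops are merged here; a smarter pass could also reorder independent ops to create larger batches, which the known dependency structure would permit.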
There are some issues that need to be brought up. DP is only supported on NVIDIA cards with compute capability 3.5 and up, namely the 750 Ti and the GTX 770 and above. I believe DP is supported on all AMD GPUs from the 6xxx series up (with a few odd and random exceptions). This raises the concern that we can't move all of Nengo to DP, since GPUs of that caliber may not be common among members of the community. Also, I don't believe DP can execute kernels on other GPUs; I believe it is limited to launching kernels on the same device. Note that trampolining between GPUs is possible. Don't do this. Don't. Srsly. You will regret it.
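Given that support matrix, a simple setup-time guard could decide whether a DP code path is even worth attempting and fall back to the plain path otherwise. The function name and argument conventions here are my own; the thresholds are the ones stated above.

```python
def can_use_dynamic_parallelism(vendor, version):
    """Return True if dynamic parallelism is plausibly supported.

    `vendor` is "nvidia" or "amd". For NVIDIA, `version` is the CUDA
    compute capability as a tuple (e.g. (3, 5)); for AMD it is the GPU
    series number (e.g. 6000 for the 6xxx series). A real check would
    also need to handle the odd AMD exceptions mentioned above.
    """
    if vendor == "nvidia":
        return version >= (3, 5)
    if vendor == "amd":
        return version >= 6000
    return False
```

The point is that DP would have to stay an optional fast path, never the only path.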
If we were to add multi-GPU support with limited CPU communication, I believe OpenCL has a way to do P2P GPU-to-GPU communication without CPU intervention; that would probably be the way we'd want to go (CUDA has an equivalent). If we used a UVA system, we wouldn't need to worry about this; however, UVA is mainly useful when you don't have an in-depth understanding of how the GPUs are communicating. Luckily, Nengo has a very well-defined set of rules for this, derived from the topology of the network :P
There was a feature in DirectCompute back in DX11.1 that I only recently noticed also exists in Vulkan: we can pre-build the list of operations the GPU needs to execute, including memory transfers between peers and to the host, and then execute that pre-built list long after the fact (e.g., even between sessions). This may be useful since it has some of the same performance benefits as DP and multithreading without needing either. (Technically it does both slightly worse, but it has much lower hardware requirements for the end user as well as less CPU overhead.)
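As a toy model of that record-then-replay pattern (class and method names are hypothetical, and real Vulkan command buffers record device commands, not Python callables):

```python
class CommandList:
    """Toy model of a pre-recorded command list.

    Operations are recorded once, up front, and can be replayed any
    number of times without redoing the scheduling work -- analogous to
    building a command buffer from Nengo's fixed op order and submitting
    the same buffer every simulation step.
    """

    def __init__(self):
        self._ops = []

    def record(self, fn, *args):
        # Recording stores the call; nothing executes yet.
        self._ops.append((fn, args))
        return self  # allow chaining

    def submit(self):
        # Replay every recorded call in order, as many times as needed.
        return [fn(*args) for fn, args in self._ops]
```

Since Nengo's op schedule never changes during a run, the expensive part (deciding what to launch, in what order, with which transfers) would happen exactly once.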
Finally, something I noticed while going through OCL: OpenCL has APU support, Skylake has Iris Pro APUs, and AMD's Zen will perhaps have equally strong APUs, yet none of OCL is optimized for APUs. Perhaps we should include a mode for this? APUs have the benefit that host-to-device and device-to-host transfers are practically free. I imagine an APU mode would try to offload as much as it could to AVX/SSE while leaving the rest to the APU itself. This would be particularly useful for laptops and/or ARM devices.
Thanks
Louis