Comments (6)
Arrays declared with the __constant prefix can hold at most 64 kB, and they are not shared between different kernels anyway:
__constant char arr[65536] = {0};           // at most 64 kB at program scope
__kernel void hello(__global char * a) { }  // sees its own copy of arr
__kernel void hello2(__global char * a) { } // sees another copy of arr, but a points to the same data
Every kernel has its own compilation unit, so to avoid duplicating big data you should pass it as a kernel parameter backed by a ClArray. Since it is non-changing data, you can copy it only once, or even initialize it on the GPU side, then use it with reads and writes disabled:
array.read = false;
array.write = false;
but make sure the data is initialized before the array is used with these flags in effect.
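For example, GPU-side initialization could be a tiny kernel run once before the flags are disabled (a hypothetical sketch; the kernel name and fill pattern are illustrative, not from the library):

```c
// hypothetical one-time initialization kernel: fills the shared
// lookup table on the GPU so the host never has to upload it
__kernel void initTable(__global char * arr)
{
    int i = get_global_id(0);
    arr[i] = (char)(i & 127); // illustrative fill pattern
}
```

After this runs once, disabling the read/write flags keeps the buffer resident and untouched for every later kernel that takes it as a parameter.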
This makes all kernels use the same parameter data without duplicating it for each kernel, but it will still be duplicated for each GPU, which is a hardware issue. To overcome this, you can use the zero-copy flag:
array.zeroCopy = true;
As you can guess, this is for "streaming"-type calculations where each element is read only once (you're free to read/write multiple times, but performance is bad for that) and data only moves to the GPU when it is needed. This is direct RAM access by all GPUs using it. Reading from the same cell by multiple GPUs is legal, but one GPU writing while another reads the same cell concurrently is illegal.
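A streaming-style kernel, where each work-item touches each element exactly once, is the access pattern this zero-copy mode suits best (an illustrative sketch, not from the library):

```c
// each work-item reads one input element once and writes one output
// element once -- the access pattern zero-copy buffers are suited for
__kernel void scaleStream(__global const float * in, __global float * out)
{
    int i = get_global_id(0);
    out[i] = in[i] * 2.0f;
}
```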
Sorry for the late reply.
from cekirdekler.
Using the read-only flag enables an optimization for cases like:
- on every iteration, the CPU writes data and the GPU reads it
- data always goes through PCI-e, but on a more optimized path than a normal array
This is for hardware performance, not software; hence the "flag" setting.
For the software side, decorating the parameter with const (and maybe restrict too)
__kernel void test(const __global char * arr)
{
    int i = get_global_id(0);
}
should enable Nvidia's fast data-path optimizations, or AMD's equivalent, for the kernel-side data loading.
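With several pointer parameters, restrict additionally promises the compiler that they don't alias, which can help the read-only path on both vendors (a hypothetical sketch; the kernel name is illustrative):

```c
// const marks src read-only; restrict promises src and dst don't alias
__kernel void copyStream(const __global char * restrict src,
                         __global char * restrict dst)
{
    int i = get_global_id(0);
    dst[i] = src[i];
}
```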
No, there is no equivalent of setting constant arrays from a CPU command (like cudaMemcpyToSymbol), I'm sorry.
If there is an "initialize once, use always" scenario, then I'd do this:
- set the read flag to true (the default) and the write flag to false
- initialize its data
- run your real kernel (you can also load it with a dummy kernel if the array size and kernel work-items don't match)
- if kernel work-items and array elements (for loading) don't match, set partialRead to false to force it to load the whole array at once regardless of work-item size
- set the read flag to false
- run your real kernel
That for-loop of yours could just have a flag change in its first step; maybe that's all that's needed. But unique shared variables have to be used as kernel parameters.
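The dummy-kernel trick mentioned in the steps above can be as small as an empty kernel that merely takes the array as a parameter, so the runtime uploads the buffer with a work-item count of your choosing (a hypothetical sketch):

```c
// does no computation; exists only so the array parameter gets
// copied to the GPU independently of the real kernel's work-items
__kernel void loadOnly(__global char * arr) { }
```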
If there is an "initialize frequently, load it always" scenario:
- set the read-only flag
- initialize data on CPU
- run kernel on GPU
- initialize data on CPU
- run kernel on GPU
If GPU data duplication is an issue (because of the shared-distributed architecture running in the background):
- set the zeroCopy flag
- every element access will go through PCI-e lanes (or at least cause page faults), and "concurrent" writing/reading between different kernels is not supported by all GPU architectures; this is not thoroughly tested
- if it's "streaming" work, then this has the best performance
Lastly, only OpenCL 2.0 supports static variables at program (global) scope, and this was not tested. I guess it works, but only within the same kernel (a kernel2 would still see its own copy), and OpenCL 2.0 still limits it to a "constant initializer expression". I think CUDA is much more advanced than OpenCL in this case, since you can even change GPU constant arrays from the host.
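For reference, an OpenCL 2.0 program-scope variable looks like the sketch below; as noted above this is untested here, and the initializer must be a compile-time constant expression:

```c
// OpenCL 2.0 only: program-scope variable in the global address space,
// initialized with a constant expression (requires -cl-std=CL2.0)
global int table[4] = {1, 2, 3, 4};

__kernel void readTable(__global int * out)
{
    int i = get_global_id(0);
    out[i] = table[i & 3];
}
```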
You don't need to worry about an equivalent of cudaMemcpyToSymbol when there is no equivalent of __device__. There are only cl-buffer copies, and cl-buffers are used as kernel parameters; they are a kind of GPU memory handle carrier for the host side. Cekirdekler just adds another layer over them to treat them as C# arrays, so just as C# arrays can be shared between methods, the GPU buffers they point to can be too, without being duplicated inside the same GPU. Also, the 64 kB limit on constant arrays at global scope is the GPU's limitation, and being stuck with the __constant type (for program scope) is OpenCL 1.2's limitation. I wish I had done this in CUDA, but then there wouldn't be the possibility of a CPU/FPGA/GPU mixture.
Thank you for your time.
This is a lot to understand; I will get back to you after going through your explanation.
Thanks
Related Issues (20)
- nonPartialWrite capability for buffers HOT 3
- Enqueue mode with single gpu (and for device to device pipeline) ---- lower latency per command HOT 3
- Read-only and write-only flags for ClArray HOT 2
- ClArray.name to bind an array to a kernel parameter with exact spelling HOT 1
- ClArray.async to make an array copy operation done on another commandQueue(concurrently) HOT 1
- clNumberCruncher.enqueueModeAsyncEnable to enqueue different kernels and arrays concurrently
- single device pipeline: overlapping regions percentage in total latency
- single device pipeline: kernel repeat option
- add "batch mode compute"(pool of devices for pool of kernels) with multiple devices where each compute() is computed by 1 device only, with greedy scheduling
- array.nextParam(array2).task() ---> creates ClTask to compute later in pool, with all the fields set at that time but with the latest array data
- add multiple opencl-kernel instances for different compute-id values, for tiled computing, in task pool, with device pool
- add callback option to ClTask
- add duplicated compute option to device pool / task pool / task for initializing same buffer on all devices
- add task types to control pool behavior (sync, broadcast task, shutdown devices)
- 1D NBODY scores HOT 9
- Can you set pipeline mode for each device separately? HOT 5
- Is there an example of generating a Unity Texture? HOT 4
- Any of the opencl 2 version does not work HOT 38
- Mandelbrot benchmark's or other test's source