tugrul512bit / cekirdekler

Multi-device OpenCL kernel load balancer and pipeliner API for C#. Uses a shared-distributed memory model to keep GPUs updated quickly while running the same kernel on all devices (for simplicity).

License: GNU General Public License v3.0

C# 100.00%

opencl-kernels iterative load-balancer pipelining multi-device gpgpu multi-gpu zero-copy gpu-computing gpu-acceleration

cekirdekler's Issues

add a duplicated-compute option to device pool / task pool / task for initializing the same buffer on all devices

so a "tiled rendering" compute with many tasks can see the input image on every device

Lazy compute

There is no lazy compute for now.

var compute1 = array1.queueCompute()
var compute2 = compute1.nextStep(array2.queueCompute()).compute()

This could be useful because it needs fewer synchronizations.

add built-in matrix multiplication with sizes between 2x2 and 8192x8192

batched 2x2 4x4 16x16 32x32
single 8k x 8k with sub-matrix partitioning to increase load balancing

N-levels of partitioning (4,16,64,256 sub matrices)
or M-levels of batching

Re-creating (and computing with) a C++ array wrapper inside a loop throws CL_INVALID_MEM_OBJECT, but a pre-allocated set of N C++ arrays works

Found the root cause: re-creating the array inside a loop has a chance of getting the same pointer back (C++ / OS memory management), so the USE_HOST_PTR-flagged buffer throws an error because duplicate buffer objects end up using the same pointer.

Todo: in Cores/ClNumberCruncher, release the buffer that is bound to the hash code of the ClArray<T> being destructed

because C# probably generates the same hash after some iterations, which makes the API reuse the same OpenCL buffer with the USE_HOST_PTR flag even though it holds the old/deleted array pointer. USE_HOST_PTR buffers need to be re-checked whenever they are accessed.
or
Parallel.For and buffer read/write (or workers[i].kernelArgument) produce overlapping (or even out-of-bounds) addressing, which throws an AggregateException_ctor_DefaultMessage error plus System.AccessViolationException.
There is no problem with C# arrays.
It probably comes from a USE_HOST_PTR buffer allocation failure, which is not error-checked yet.
or the OpenCL implementation misbehaves when a pointer is deleted while it is still used by an OpenCL buffer created with CL_MEM_USE_HOST_PTR.

Disposing unused buffers with warning message

The API creates a new buffer for each unique array given as a parameter; with enough arrays, it could run out of resources.

LRU cache holding at most N buffers (regardless of individual sizes) with a total size constraint
(default = RAM / 2?)
Save data to disk when a buffer is disposed, read it back from disk when the buffer is re-created.
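
The LRU idea above could be sketched roughly as follows. This is only an illustration under assumptions: `BufferLruCache`, `Touch` and the `onEvict` callback are hypothetical names that do not exist in Cekirdekler, and the callback stands in for "print a warning, save to disk, release the OpenCL buffer".

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch, not part of the Cekirdekler API.
public sealed class BufferLruCache<TKey>
{
    private readonly int maxBuffers;
    private readonly long maxTotalBytes;
    private readonly Action<TKey> onEvict; // assumed hook: warn, spill to disk, release buffer
    private readonly LinkedList<(TKey Key, long Bytes)> order =
        new LinkedList<(TKey Key, long Bytes)>();
    private readonly Dictionary<TKey, LinkedListNode<(TKey Key, long Bytes)>> nodes =
        new Dictionary<TKey, LinkedListNode<(TKey Key, long Bytes)>>();

    public long TotalBytes { get; private set; }
    public int Count => nodes.Count;

    public BufferLruCache(int maxBuffers, long maxTotalBytes, Action<TKey> onEvict)
    {
        this.maxBuffers = maxBuffers;
        this.maxTotalBytes = maxTotalBytes;
        this.onEvict = onEvict;
    }

    public bool Contains(TKey key) => nodes.ContainsKey(key);

    // Marks a buffer as most recently used (registering it if unseen),
    // then evicts least recently used buffers until both limits hold.
    public void Touch(TKey key, long bytes)
    {
        if (nodes.TryGetValue(key, out var node))
        {
            TotalBytes -= node.Value.Bytes;
            order.Remove(node);
        }
        nodes[key] = order.AddFirst((key, bytes));
        TotalBytes += bytes;

        // Evict from the tail, but never the entry that was just touched.
        while (nodes.Count > 1 &&
               (nodes.Count > maxBuffers || TotalBytes > maxTotalBytes))
        {
            var last = order.Last;
            order.RemoveLast();
            nodes.Remove(last.Value.Key);
            TotalBytes -= last.Value.Bytes;
            onEvict(last.Value.Key);
        }
    }
}
```

The key could be whatever the API already uses to identify an array; the sketch leaves that choice open.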

add multiple OpenCL kernel instances for different compute-id values, for tiled computing, in task pool, with device pool

so clSetKernelArg (with dictionary bookkeeping) will also look up kernelName+"cekirdek"+compute_id to register a new kernel name, and multiple kernels will be able to run concurrently for the same kernel code but with different array parameters

add callback option to ClTask

so batch computing can be even greedier and collect results as soon as they are ready

Explicit device selection disposes handles twice, giving error

When a device selection is passed to a "Cores" instance, it needs to refuse to dispose itself, even if "isDeleted" is false.

add built-in image-resizing method for png,gif and jpeg

uses compressor-decompressor methods

nbody(benchmark based) device selection disposes shared platform

make platform copied, not shared

Enqueue mode with single gpu (and for device to device pipeline) ---- lower latency per command

enable enqueue()
read
compute N
compute N/2
compute N+64
compute
write
write
compute
read
write
write
disableenqueue()
sync() ---- all commands at once, lower latency

Add device limits stress testing to have numbers used later in production or alarming when approaching limits.

OpenCL can't report the maximum number of command queues. Add a test that creates up to 1024 command queues until it gets an out-of-memory or out-of-resources error, saves the value for the device so it knows the maximum number of crunchers in flight, then releases all resources.

OpenCL can't report the maximum number of buffers either. Add a test for that too, so the user can get a log of the remaining resources before using more.
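
A probe of this kind might look like the sketch below. It is a minimal sketch under assumptions: `DeviceLimitProbe` and `FindMax` are hypothetical names, and `tryCreate` is an assumed delegate that creates one resource (a command queue or a buffer) and throws when the device refuses.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch, not part of the Cekirdekler API.
public static class DeviceLimitProbe
{
    // Creates resources until creation fails or upperBound is reached,
    // returns how many were created, and always releases them afterwards.
    public static int FindMax(int upperBound, Func<IDisposable> tryCreate)
    {
        var created = new List<IDisposable>();
        try
        {
            while (created.Count < upperBound)
            {
                try
                {
                    created.Add(tryCreate());
                }
                catch (Exception)
                {
                    // Assumed to mean out of memory / out of resources.
                    break;
                }
            }
            return created.Count;
        }
        finally
        {
            foreach (var resource in created)
            {
                resource.Dispose();
            }
        }
    }
}
```

The returned count can be stored per device and compared against the number of resources currently in use to raise a warning near the limit.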

Device to device pipeline: balancing load (kernel names) between neighboring stages

Move kernel names from one stage to a neighboring one to alter the total latencies of the stages, in order to minimize the total latency of the pipeline / increase throughput.

Example:

checks all stages' timings.
picks a random pair of stages
moves one kernel name from one stage to another without breaking total order of kernel names
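
One step of that balancing could be sketched as below. The names (`StageBalancer`, `TryBalance`) are hypothetical, per-kernel timings are assumed to be already measured, and the acceptance rule (keep the move only if the slower of the two stages gets faster) is an assumption of this sketch, not something the issue specifies.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch, not part of the Cekirdekler API.
public static class StageBalancer
{
    // Latency of one stage = sum of its kernels' measured times (assumed known).
    private static double StageTime(List<string> stage,
                                    IReadOnlyDictionary<string, double> kernelMs)
        => stage.Sum(k => kernelMs[k]);

    // Tries to move one kernel across the boundary between stage i and stage i+1.
    // Only the last kernel of the left stage or the first kernel of the right
    // stage may move, so the global order of kernel names is preserved.
    public static bool TryBalance(List<List<string>> stages, int i,
                                  IReadOnlyDictionary<string, double> kernelMs)
    {
        var left = stages[i];
        var right = stages[i + 1];
        double tl = StageTime(left, kernelMs);
        double tr = StageTime(right, kernelMs);
        double before = Math.Max(tl, tr);

        if (tl > tr && left.Count > 1)
        {
            string k = left[left.Count - 1];
            double after = Math.Max(tl - kernelMs[k], tr + kernelMs[k]);
            if (after < before)
            {
                left.RemoveAt(left.Count - 1);
                right.Insert(0, k);
                return true;
            }
        }
        else if (tr > tl && right.Count > 1)
        {
            string k = right[0];
            double after = Math.Max(tl + kernelMs[k], tr - kernelMs[k]);
            if (after < before)
            {
                right.RemoveAt(0);
                left.Add(k);
                return true;
            }
        }
        return false;
    }
}
```

Calling `TryBalance` repeatedly on randomly picked neighboring pairs until no call returns true gives a simple local search.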

Arrays: bounds check before compute.

Just like the work-item checks, but with the "elementsPerWorkItem" value taken into account against the total work size and the array size.

Arrays will be allowed to be bigger than the used range, but not smaller.
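
Such a check might be sketched as follows. The class and method names are hypothetical, and the sketch assumes work item i touches elements i * elementsPerWorkItem up to (i + 1) * elementsPerWorkItem - 1; the real access pattern depends on the kernel.

```csharp
using System;

// Hypothetical sketch, not part of the Cekirdekler API.
public static class ArrayBoundsCheck
{
    // Number of elements the compute call is assumed to touch.
    public static long RequiredLength(long globalOffset, long globalRange,
                                      int elementsPerWorkItem)
    {
        if (elementsPerWorkItem < 1)
            throw new ArgumentOutOfRangeException(nameof(elementsPerWorkItem));
        if (globalOffset < 0 || globalRange < 0)
            throw new ArgumentOutOfRangeException(nameof(globalRange));
        return checked((globalOffset + globalRange) * elementsPerWorkItem);
    }

    // Larger arrays pass; arrays smaller than the used range are rejected
    // with a readable message before anything is enqueued.
    public static void Validate(string arrayName, long arrayLength,
                                long globalOffset, long globalRange,
                                int elementsPerWorkItem)
    {
        long required = RequiredLength(globalOffset, globalRange, elementsPerWorkItem);
        if (arrayLength < required)
        {
            throw new ArgumentException(
                $"Array '{arrayName}' has {arrayLength} elements but the compute range needs at least {required}.");
        }
    }
}
```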

Add speed-ratio indicator between devices after 10-20 iterations

Takes the average of the last 10-20 iterations (or of all iterations), and also compares compute time against buffer copy/access time to give an efficiency percentage.

single device pipeline: kernel repeat option

Sometimes a kernel needs to be repeated with the same global and local range values, for example in a "fluid solver".

Explicit Device to Device Pipelining

So developer can build a compute network using N GPUs

nonPartialWrite capability for buffers

single device pipeline: overlapping regions percentage in total latency

such as a 3-stage pipeline result:

pipeline 1: 3ms, 25% overlapped

pipeline 2: 1ms, totally hidden

pipeline 3: 20ms, 8% overlapped

total overlapping regions: 15%

time saved: 2ms

(will need more event queries on cekirdeklercpp)

Image decode+resize+multiple_encode pipeline

It will consume one image at each step (pushed as data into the pipeline), and all stages (decode, resize, encode) will run concurrently and opportunistically on multiple GPUs.

Rename properties that contain underscores so they have proper names

error_______

can be

deviceError

English language translation of cluster-computing related classes(multi-pc centered-control)

clNumberCruncher.enqueueModeAsyncEnable to enqueue different kernels and arrays concurrently

clNumberCruncher.enqueueMode=true
clNumberCruncher.enqueueModeAsyncEnable=true
compute(kernel1)
compute(kernel2)
compute(kernel3)
clNumberCruncher.enqueueModeAsyncEnable=false
clNumberCruncher.enqueueMode=false

****kernel1************
******kernel2**********
********kernel3********

add task types to control pool behavior (sync, broadcast task, shutdown devices)

add "batch mode compute"(pool of devices for pool of kernels) with multiple devices where each compute() is computed by 1 device only, with greedy scheduling

so GPUs don't stay idle

a pool of compute() jobs is picked up by a pool of GPUs
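
The greedy scheduling described here can be illustrated with a small simulation. It is only a sketch: `GreedyBatchScheduler` is a hypothetical name, task costs and device speeds are assumed inputs, and a real pool would react to actual completion events instead of predicted times.

```csharp
using System;

// Hypothetical sketch, not part of the Cekirdekler API.
public static class GreedyBatchScheduler
{
    // taskCosts[t]: amount of work in task t (arbitrary units).
    // deviceSpeeds[d]: work units per millisecond for device d.
    // Returns, for each task, the index of the device that runs it.
    public static int[] Assign(double[] taskCosts, double[] deviceSpeeds)
    {
        var freeAt = new double[deviceSpeeds.Length]; // time each device becomes idle
        var owner = new int[taskCosts.Length];

        for (int t = 0; t < taskCosts.Length; t++)
        {
            // Each whole task goes to the device that becomes idle first.
            int best = 0;
            for (int d = 1; d < freeAt.Length; d++)
            {
                if (freeAt[d] < freeAt[best]) best = d;
            }
            owner[t] = best;
            freeAt[best] += taskCosts[t] / deviceSpeeds[best];
        }
        return owner;
    }
}
```

With four equal tasks and one device four times faster than the other, the fast device ends up taking three of the four tasks.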

Complete device to device pipeline stage initialization kernel execution

so a stage doesn't start with garbage data, especially in its hidden buffers (hidden buffers also need testing)

Some helper methods into ClNumberCruncher

deviceNames()

normalizedGlobalRangesOfDevices()

normalizedComputePowersOfDevices()

Nbody benchmark-based explicit device selection

benchmark all devices, sort by benchmark performances in decreasing order

devices.getBestNbodyPerforming()[0]

Device to device pipeline: optimize single stage multiple kernel compute with less synchronizations

use Cores class' "single sync multi kernel execution" feature if all stage kernels use same global and local range values

add "single sync multi kernel with multi range values" feature

ClArray.async to make an array copy operation run on another command queue (concurrently)

Async arrays will not be used in kernel executions.

If the kernel has only p1 and p2 as parameters, then with

p3.async=true;

p1.nextParam(p3).nextParam(p2).compute() will still work

For explicit device selection, ClNumberCruncher still expects number of cores and gpus

Those parameters are not used for explicit devices and need to be removed.

Hide Unnecessary Methods and Classes

internal keyword
a wrapper for Cores.cs+Worker.cs and another wrapper for ClDevice, ClPlatform, ClContext, ...

Explicit Pipelining

pipeline1.push(a.nextParam(b).read()).push(c.compute()).push(d.write()).finish()

pipeline1.overlap(pipeline2,pipeline3).finish()

Error handling for every single opencl command.

Maybe less performance, but a better description when something bad happens. There is already a Test class for testing the implementation, but developer mistakes need to be handled too.

For now, it only reports OpenCL kernel compile errors such as "float5 is not defined" and similar.

Added error handling for function calls that return an error code.
Still need to add error handling for buffer creation and buffer mapping (where the error comes back through a parameter, not the returned value).

Device to device pipeline: enable mixed ordering of kernel arrays (in kernel function definition)

Then developers can have any order they want instead of just:

__kernel void test(input1,input2,hidden1,hidden2,hidden3,output1,output2){}

Instead of treating inputs, hiddens and outputs separately in the parameter building part, add them all to a single array, sorted by an "order" value that is incremented by every addInput(), addOutput() or addHidden() call.
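
A minimal sketch of that bookkeeping is shown below. `StageParameterList`, `ArrayRole` and `KernelArgumentOrder` are hypothetical names for illustration; only the idea of a shared, incrementing "order" counter comes from the text above.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch, not part of the Cekirdekler API.
public enum ArrayRole { Input, Hidden, Output }

public sealed class StageParameterList
{
    private readonly List<(string Name, ArrayRole Role, int Order)> entries =
        new List<(string Name, ArrayRole Role, int Order)>();
    private int nextOrder; // shared counter across all three Add* methods

    public StageParameterList AddInput(string name) => Add(name, ArrayRole.Input);
    public StageParameterList AddHidden(string name) => Add(name, ArrayRole.Hidden);
    public StageParameterList AddOutput(string name) => Add(name, ArrayRole.Output);

    private StageParameterList Add(string name, ArrayRole role)
    {
        entries.Add((name, role, nextOrder++));
        return this;
    }

    // Kernel arguments in the order the developer declared them,
    // regardless of whether each one is an input, hidden or output array.
    public IReadOnlyList<string> KernelArgumentOrder()
        => entries.OrderBy(e => e.Order).Select(e => e.Name).ToList();
}
```

With this, a kernel declared as test(input1, hidden1, output1, input2) only needs the Add* calls made in the same order.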

kernel repeat count and repeat-end function name (kernel) with a global size of 64 (auto) for each repeat

so there is no need to write the same string multiple times in the kernel name parameter; string parsing is also slow

ClArray<T> CopyTo CopyFrom for larger and smaller arrays.

For now, it can only copy arrays of the same size. Being able to copy differently sized arrays could help in some cases.

add struct array support with byte-length descriptors for Unity's Vector3-Vector2 arrays

So it won't need to copy to another array, increasing speed by 1000%

ClArray.name to bind an array to a kernel parameter with exact spelling

this way, binding only necessary arrays to a kernel will be possible, instead of all arrays

array.nextParam(array2).task() ---> creates ClTask to compute later in pool, with all the fields set at that time but with the latest array data

Read-only and write-only flags for ClArray

So PCIe bandwidth may be used more efficiently for device-to-device or load-balanced programs.

Load balancer sensitive to OS hiccups, need more resistance against temporary performance peaks.

After the CPU and GPU reach a balance point of a 50% - 50% work share for a simple stream, the load balancer keeps oscillating between 45% and 55% for small workloads.

Needs a load balancing function that is backed by performance history: slower to react, but more stable.
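
One possible history-backed approach is an exponential moving average over each device's measured throughput. This is a swapped-in technique chosen for illustration, not the balancer the library uses; `SmoothedLoadBalancer`, `Update` and the `alpha` parameter are hypothetical.

```csharp
using System;
using System.Linq;

// Hypothetical sketch, not part of the Cekirdekler API.
public sealed class SmoothedLoadBalancer
{
    private readonly double[] smoothedThroughput; // work items per millisecond
    private readonly double alpha;                // smaller = slower, more stable
    private bool primed;

    public SmoothedLoadBalancer(int deviceCount, double alpha = 0.1)
    {
        if (alpha <= 0.0 || alpha > 1.0)
            throw new ArgumentOutOfRangeException(nameof(alpha));
        smoothedThroughput = new double[deviceCount];
        this.alpha = alpha;
    }

    // workItems[d]: items device d computed in the last iteration.
    // elapsedMs[d]: time device d needed (assumed > 0).
    // Returns the share of the next iteration for each device (sums to 1).
    public double[] Update(double[] workItems, double[] elapsedMs)
    {
        for (int d = 0; d < smoothedThroughput.Length; d++)
        {
            double sample = workItems[d] / elapsedMs[d];
            smoothedThroughput[d] = primed
                ? smoothedThroughput[d] + alpha * (sample - smoothedThroughput[d])
                : sample;
        }
        primed = true;

        double total = smoothedThroughput.Sum();
        return smoothedThroughput.Select(s => s / total).ToArray();
    }
}
```

A single slow iteration caused by an OS hiccup then shifts the shares only partially, at the cost of reacting more slowly to real performance changes.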

Force multiple-of-64 for array size when using streaming and C++ arrays (cl_mem_use_host_ptr)

I don't know whether Intel, AMD or Nvidia handle this error internally and fall back to the cl_mem_alloc version.
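
Enforcing the size rule on the API side could be as simple as the sketch below. The names are hypothetical, and the sketch assumes the multiple of 64 refers to the element count; if the requirement is really about bytes, the same rounding applies to the byte length instead.

```csharp
using System;

// Hypothetical sketch, not part of the Cekirdekler API.
public static class StreamAlignment
{
    public const int Multiple = 64;

    public static bool IsAligned(int length) => length % Multiple == 0;

    // Smallest multiple of 64 that is >= length.
    public static int RoundUp(int length)
    {
        if (length < 0) throw new ArgumentOutOfRangeException(nameof(length));
        return checked((length + Multiple - 1) / Multiple * Multiple);
    }
}
```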

Workitems: Grain size - local size - global size: bounds check

To offload the error handling of workitem sizes for different pipelining-load-balancing-multi-device scenarios, from user to API with readable error messages.

Add built-in jpeg,gif,png decompression-recompression methods

so implementing an image-resizer will be faster

Explicit device selection

Can be useful when the developer doesn't need all GPUs at once in OpenCL. Maybe something like a device list in different categories:

  ClDevice.getGpuList()
  ClDevice.getAccList()                     // random order, with device names so the user can choose
  ClDevice.getDeviceWithMaxComputeUnit()    // a 20-thread CPU is not the same as a 20-CU HD7870!
  ClDevice.getDeviceWithBenchmark("nbody"); // gets the top-scoring device
  ClDevice.activateDynamicDeviceSwitching() // switches to another device when performance oscillates too much (GTX_titan 1ms 3ms 2ms 3ms 1ms, then switches to gtx_950 10ms 11ms 10ms 9ms)
