
`split` failing · tf-coriander · 41 comments · CLOSED

hughperkins avatar hughperkins commented on May 29, 2024
`split` failing

from tf-coriander.

Comments (41)

hughperkins avatar hughperkins commented on May 29, 2024 1

Wheee! split working in hughperkins/coriander@98e2526 :-O

c_cpu[0] [-1.08563066  0.99734545  0.2829785  -1.50629473 -0.57860023  1.65143657
 -2.42667913 -0.42891264  1.26593626 -0.86674041]
i 0
  cpu [-1.08563066  0.99734545  0.2829785  -1.50629473 -0.57860023  1.65143657
 -2.42667913 -0.42891264  1.26593626 -0.86674041]
  gpu [-1.08563066  0.99734545  0.2829785  -1.50629473 -0.57860023  1.65143657
 -2.42667913 -0.42891264  1.26593626 -0.86674041]
i 1
  cpu [ 1.00405395  0.38618639  0.73736858  1.49073207 -0.93583387  1.17582905
 -1.25388062 -0.63775152  0.90710521 -1.42868066]
  gpu [ 1.00405395  0.38618639  0.73736858  1.49073207 -0.93583387  1.17582905
 -1.25388062 -0.63775152  0.90710521 -1.42868066]
i 2
  cpu [ 0.00284592  0.68822271 -0.87953633  0.28362733 -0.80536652 -1.72766948
 -0.39089981  0.57380587  0.33858904 -0.01183049]
  gpu [ 0.00284592  0.68822271 -0.87953633  0.28362733 -0.80536652 -1.72766948
 -0.39089981  0.57380587  0.33858904 -0.01183049]
i 3
  cpu [ 0.02968323  1.06931591  0.89070642  1.75488615  1.49564409  1.06939268
 -0.77270871  0.79486269  0.31427199 -1.32626545]
  gpu [ 0.02968323  1.06931591  0.89070642  1.75488615  1.49564409  1.06939268
 -0.77270871  0.79486269  0.31427199 -1.32626545]
PASSED

----------- generated xml file: /Users/hugh2/git-local/tensorflow-dev/test/junit-pytest-report.xml -----------
============================================= 9 tests deselected =============================================
=================================== 1 passed, 9 deselected in 2.09 seconds ==============================

It's a "bit" spammy, and I need to commit the changes to tf-coriander and so on, but seems promising to me :-)

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024 1

https://github.com/hughperkins/tf-coriander/releases/tag/v0.18.2

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

So.... the struct containing the float ** is a gpu-side buffer, and can be read ok on the gpu. However, the pointers inside it are virtual pointers: they arrive as the virtual pointers we fed to the hostside earlier. They are not valid global float * pointers :-O

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

Correction: the struct is hostside, passed in by value. It contains two types of doubly-indirected arrays:

float *f1[8];
float **f2;

The first one, f1, we can probably rewrite on the fly. The second one seems more challenging...

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

There seem to be a few approaches possible. An obvious one would be to modify the tensorflow CudaDeviceArrayStruct. This has the advantage of:

  • relatively easy short-term

but the downsides of:

  • violating the principle of not modifying the client code
  • won't generalize well

Another approach would be to convert the pointers in by-value structs containing float **s, before passing them into the kernel. This will work for bounded arrays, ie float *ptrs0[8], but less well for unbounded ones, ie float **ptrs1.

Lastly, I'm tempted to make a copy of the virtual memory table into the gpu side. This will allow the kernel itself to handle this. I mean, if we feed it code to handle it.

I currently like this last possibility, since it's the most general. In the case of tensorflow, the virtual memory table only contains one entry, for the huge mega allocate-entire-gpu buffer, so the performance cost will be negligible.
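
To make that last option a bit more concrete, here is a minimal sketch of what a gpu-side virtual memory table lookup could look like in OpenCL C. This is only an illustration of the idea, not Coriander code; the names (VmemEntry, resolve_vptr) are mine, and the table would be filled in by the hostside allocator:

struct VmemEntry {
    unsigned long virt_base;     // first virtual address covered by this entry
    unsigned long size;          // number of bytes covered
    unsigned long clmem_offset;  // where that range starts inside the big cl_mem
};

// Resolve a virtual pointer (carried around as an unsigned long) into a real
// global float *. In the tensorflow case the table has a single entry, for the
// one allocate-entire-gpu buffer, so the loop is effectively free.
inline global float *resolve_vptr(unsigned long vptr, global char *clmem0,
                                  constant struct VmemEntry *table, int num_entries) {
    for (int i = 0; i < num_entries; i++) {
        if (vptr >= table[i].virt_base && vptr < table[i].virt_base + table[i].size) {
            return (global float *)(clmem0 + table[i].clmem_offset + (vptr - table[i].virt_base));
        }
    }
    return 0;  // unmapped virtual address
}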

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

Update: if we're going to run a memory manager on the gpu side, we might need to allocate the entire gpu memory in one go, so that we get a single cl_mem that we pass in on each call. Otherwise, whilst tensorflow happens to do that anyway, in the general case we'd need to pass all buffers into any kernel call that has unbounded float ** parameters.

Edit: for now, I might make the simplifying assumption that they're all cut from the same buffer as the float * going into the method call, since this will work for tensorflow, mostly, and then generalize if/when I hit a situation that this no longer works for.

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

Just to add additional challenges, it turns out that, on my own GPU, float *s are only 32-bits. So, if float *ptrs[8] is passed in by value, from hostside, ptrs[0] and ptrs[1] do not in fact hold the values from the hostside. We have to first cast ptrs into unsigned long *, and then we can get the values.

$ clinfo
...
  Device Name                                     AMD Radeon Pro 450 Compute Engine
   ...
   Address bits                                    32, Little-Endian

The good news is, if clinfo can determine the 32-bitness, then it seems plausible that coriander can too? The bad news is, this doesn't look like it will contribute to solving the issue of handling unbounded float ** arrays, passed in by value.
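
On the "coriander can probably detect this too" point: the number clinfo prints as "Address bits" comes from the standard CL_DEVICE_ADDRESS_BITS device query, so the hostside can check it with one call. A minimal sketch (plain OpenCL host API, error handling omitted):

#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif
#include <stdio.h>

// Query how wide device-side pointers are; this is the same value clinfo
// reports as "Address bits" (32 on the Radeon Pro 450 above).
cl_uint device_address_bits(cl_device_id device) {
    cl_uint bits = 0;
    clGetDeviceInfo(device, CL_DEVICE_ADDRESS_BITS, sizeof(bits), &bits, NULL);
    return bits;
}

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    printf("Address bits: %u\n", device_address_bits(device));
    return 0;
}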

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

Looks like it is possible to manually hack the opencl, using COCL_DUMP_CL=1, then hacking on /tmp/0.cl, then using COCL_LOAD_CL=1 to run it ( https://github.com/hughperkins/coriander/blob/master/doc/advanced_usage.md#cocl_dump_cl1 ), to get the gpu buffers to be updated correctly. Using COCL_DUMP_CONFIG.. with the following yaml file:

_ZN10tensorflow12_GLOBAL__N_113SplitOpKernelIfEEvPKT_iiiNS_21CudaDeviceArrayStructIPS2_Li8EEE_0_1:
  - type: float
    clmem: 0
    offsetarg: 0
    count: 18
  - type: int32
    clmem: 0
    offsetarg: 0
    count: 18
  - type: float
    virtualaddress: 1920
    count: 18
  - type: float
    virtualaddress: 2176
    count: 18
  - type: float
    virtualaddress: 2432
    count: 18
  - type: float
    virtualaddress: 2688
    count: 18

.... and the following test python code:

import numpy as np
import tensorflow as tf


def test_split():
    shape = (72, 1)
    graph = tf.Graph()
    np.random.seed(123)
    a = np.random.randn(*shape).astype(np.float32)
    with graph.as_default():
        with tf.device('/gpu:0'):
            a_tf = tf.placeholder(tf.float32, shape)
            c_tf = tf.split(0, 4, a_tf)
            sess = tf.Session()
            with sess.as_default():
                c = sess.run(c_tf, feed_dict={a_tf: a})
                if np.prod(shape) < 20:
                    print('a', a)
                    print('c_gpu', c)
                else:
                    print('c_gpu[0]', c[0])

ie, split into 4 parts.

Then, using the hacked opencl (the hacks are described below), we can get the following output:

  Dumping buffer 0 clmem0 offset=1280 floats:
    -1.08563 0.997345 0.282979 -1.50629 -0.5786 1.65144 -2.42668 -0.428913 
    1.26594 -0.86674 -0.678886 -0.094709 1.49139 -0.638902 -0.443982 -0.434351 
    2.20593 2.18679 
  Dumping buffer 1 clmem0 offset=1280 int32s:
    -1081412110 1065308680 1049682575 -1077883324 -1089200347 1070817862 -1071952202 
    -1092904336 1067584051 -1084366157 -1087517828 -1111361849 1069475291 -1088188651 
    -1092398694 -1092721846 1074605557 1074525262 
  Dumping buffer 2 virtualaddress=1920  offset in buffer 1792 floats:
    -1.08563 0.997345 0.282979 -1.50629 -0.5786 1.65144 -2.42668 -0.428913 
    1.26594 -0.86674 -0.678886 -0.094709 1.49139 -0.638902 -0.443982 -0.434351 
    2.20593 2.18679 
  Dumping buffer 3 virtualaddress=2176  offset in buffer 2048 floats:
    1.00405 0.386186 0.737369 1.49073 -0.935834 1.17583 -1.25388 -0.637752 
    0.907105 -1.42868 -0.140069 -0.861755 -0.255619 -2.79859 -1.77153 -0.699877 
    0.927462 -0.173636 
  Dumping buffer 4 virtualaddress=2432  offset in buffer 2304 floats:
    0.00284592 0.688223 -0.879536 0.283627 -0.805367 -1.72767 -0.3909 0.573806 
    0.338589 -0.0118305 2.39237 0.412912 0.978736 2.23814 -1.29409 -1.03879 
    1.74371 -0.798063 
  Dumping buffer 5 virtualaddress=2688  offset in buffer 2560 floats:
    0.0296832 1.06932 0.890706 1.75489 1.49564 1.06939 -0.772709 0.794863 0.314272 
    -1.32627 1.4173 0.807237 0.0454901 -0.233092 -1.1983 0.199524 0.468439 
    -0.831155 
c_gpu[0] [-1.08563066  0.99734545  0.2829785  -1.50629473 -0.57860023  1.65143657
 -2.42667913 -0.42891264  1.26593626 -0.86674041 -0.67888618 -0.09470897
  1.49138963 -0.63890201 -0.44398195 -0.43435127  2.20592999  2.18678617]

We can see that buffers 2 to 5 (the output gpu buffers) correctly contain np.random.randn()-looking data, and that buffer 2 matches buffer 0 (the start of the input data). So, this looks promising.

The relevant hacks to /tmp/0.cl are:

Change the struct definition to use unsigned longs:

struct tensorflow__CudaDeviceArrayStruct {
    int f0;
    global unsigned long f1[8];
    global unsigned long * f2;
};

... and insert the following lines, at the end of section v4:

    unsigned long offsets[8];
    for(int i = 0; i < 8; i++) {
        offsets[i] = v13[0].f1[i];
        int offsetint = (int)offsets[i];
        v20[i] = (global float *)(offsets[i] + clmem0 - 128);
    }

This at least shows that it appears technically possible to modify the kernel, in the case of num splits < 8, to give the correct output buffers. The - 128 is a hack, since the first virtual address, for the first allocated cl_mem is created at location 128...

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

Created a test case, using the assumption that all gpu buffers are cut from one single allocation: hughperkins/coriander@fcba64f. Now I just have to write some code to make this pass...

Note that this also assumes a bounded array. We will look at unbounded arrays later, or just modify tensorflow to use a value of ~128 or so for MaxInlineValues https://github.com/hughperkins/tf-coriander/blob/master/tensorflow/core/kernels/cuda_device_array_gpu.h#L28 , so that the array is bounded for many typical use-cases.
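
For reference, the shape of the client-side struct being discussed is roughly as below. The field names here are my own labels, inferred from the coriander-translated version earlier in this thread; the real (templated) definition lives in cuda_device_array_gpu.h at the link above. Raising MaxInlineValues just makes the bounded inline path cover more split sizes.

// Rough sketch only; see cuda_device_array_gpu.h for the real definition.
#define MAX_INLINE_VALUES 8  /* the proposal above would raise this to ~128 */

struct CudaDeviceArrayStructSketch {
    int size;                                 // number of output pointers in use
    float *inline_values[MAX_INLINE_VALUES];  // bounded path: used when size <= MAX_INLINE_VALUES
    float **out_of_line_values;               // unbounded path: used when size is larger
};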

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

Oh... idea :-) . We can generalize to multiple allocations, by simply re-allocating a larger buffer each time, copying everything across, and redoing the virtual memory table. It's not ideal, but it does have a few interesting qualities (a rough host-side sketch follows the list):

  • it's general
  • it will fail if its assumptions are violated (rather than silently giving incorrect results, with no explanation, or introspection as to why)
  • it doesn't involve passing potentially zillions of clmem buffers into each kernel call....
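
The sketch, assuming everything lives in one backing cl_mem. Function and variable names are mine; Coriander's actual allocator is more involved (it also has to rebuild the virtual-address-to-offset table after the copy):

#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

// Grow the single backing buffer: allocate a bigger cl_mem, copy the old
// contents across, and release the old one. Virtual addresses stay the same;
// only the hostside vmem -> offset table needs redoing afterwards.
cl_mem grow_backing_buffer(cl_context ctx, cl_command_queue queue,
                           cl_mem old_buf, size_t old_size, size_t new_size) {
    cl_int err = CL_SUCCESS;
    cl_mem new_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, new_size, NULL, &err);
    if (err != CL_SUCCESS) {
        return 0;
    }
    clEnqueueCopyBuffer(queue, old_buf, new_buf, 0, 0, old_size, 0, NULL, NULL);
    clFinish(queue);
    clReleaseMemObject(old_buf);
    return new_buf;
}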

from tf-coriander.

keryell avatar keryell commented on May 29, 2024

You are a courageous man. :-)

We have to think about making this less painful in OpenCL-V... But of course it does not solve your issue now. :-(

Just to be sure: when you create a huge buffer, do you assume that the address on the GPU will always be the same across the various kernel calls?
This is a wrong (according to the specification) but useful :-) assumption I also made in another context,
https://github.com/keryell/ronan/raw/gh-pages/Talks/2016/2016-03-13-PPoPP-SYCL-triSYCL/2016-03-13-PPoPP-SYCL-triSYCL-expose.pdf , to solve a somewhat similar problem...

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

Just to be sure: when you create a huge buffer, do you assume that the address on the GPU will be always the same across the various kernel calls?

I was hoping that was the case, but I tested it yesterday and, as you state, it's not the case. Therefore I don't assume it.

Relevant test code: https://github.com/hughperkins/pub-prototyping/blob/master/opencl/test_globalpointerstability.cpp
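
For reference, the shape of that test, paraphrased from memory rather than copied from the linked file: a kernel writes out the integer value of its buffer argument's global address, the host enqueues it twice with the same cl_mem, and the two recorded values are compared. On the devices tested here they can differ between launches.

// OpenCL C sketch of the idea (not the linked test verbatim): record the raw
// device address that this cl_mem maps to on this particular launch.
kernel void record_address(global float *buf, global ulong *address_out) {
    if (get_global_id(0) == 0) {
        // pointer-to-integer cast; implementation-defined, but fine for a probe
        address_out[0] = (ulong)(uintptr_t)buf;
    }
}
// Host side: run record_address twice with the same cl_mem, read back
// address_out both times, and compare. If the values differ, the gpu-side
// address cannot be cached across kernel calls.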

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

You are a courageous man. :-)

:-)

We have to think about having this less painful in OpenCL-V... But of course it does not solve your issue now. :-(

Not just my issue: anyone who doesn't have a SPIR-V-enabled card. Which is, pretty much anyone :-D . I think?

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

https://github.com/keryell/ronan/raw/gh-pages/Talks/2016/2016-03-13-PPoPP-SYCL-triSYCL/2016-03-13-PPoPP-SYCL-triSYCL-expose.pdf to solve somehow similar problem...

Thanks! Will take a look :-)

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

Update on this, after a ton of pain:

Output from running the test case:

124 125 126 127 
125 126 127 128 
126 127 128 129 

Kernel from the test case: https://gist.github.com/hughperkins/bdefd062f83fc3d2ad59add6480d2819

Kernel from the (still failing) split: https://gist.github.com/hughperkins/ff9db47558c86a7a265341ffb5168bd2

You can see approximately how it works:

  1. We assume everything is cut from a single gpu buffer, and we pass it in, along with its virtual memory address, in the kernel parameters:
kernel void _Z17run_bounded_arra(global char* clmem0, unsigned long clmem_vmem_offset0, ...
  2. We assign these to a global variables struct, which we're going to pass around, by-value-ish:
    struct GlobalVars globalVars = { scratch, clmem0, clmem_vmem_offset0 };
    struct GlobalVars *pGlobalVars = &globalVars;
  3. We receive the virtual memory pointers by value in the struct, and store them as unsigned long, since they're 64-bit, and the gpu pointers are 32-bit...
struct tensorflow__CudaDeviceArrayStruct {
    int f0;
    unsigned long f1[8];
    global float** f2;
};

(Only look at f1 for now. I haven't done f2 yet, ie only handling bounded arrays for now. Unbounded will work very similarly, just slightly more work, with no obvious technical/theoretical obstacles.)

  4. At some point, we call llvm load on a virtual pointer, and ploof! we convert it from a virtual pointer to a global pointer:
   global float* v43_gptrstep = getGlobalPointer((float*)(&((&v20)[(long)(v38 / v25)]))[0], pGlobalVars);

Oh yes, the small inline getGlobalPointer implementation:

inline global float *getGlobalPointer(unsigned long vmemloc, struct GlobalVars *globalVars) {
    return (global float *)(globalVars->clmem0 + vmemloc - globalVars->clmem_vmem_offset0);
}

That's it! The hard bits are done. But there are a bunch of fiddly bits, since there's a bunch of hacks in Coriander, and some of this needs some hacky hacks around the hackiness :-P

Edit: tl;dr: we implement virtual memory inside the kernel :-P
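
Putting the fragments above together, a stripped-down, self-contained version of the bounded-array trick looks roughly like this. This is my own condensed sketch, not the generated kernel: the scratch member of GlobalVars and the unbounded f2 field are dropped, and the split logic is reduced to a plain elementwise copy.

// Condensed OpenCL C sketch of the bounded-array case.
struct GlobalVars {
    global char *clmem0;               // the one big backing buffer
    unsigned long clmem_vmem_offset0;  // virtual address of its start
};

struct CudaDeviceArraySketch {
    int f0;               // number of output pointers in use
    unsigned long f1[8];  // hostside 64-bit virtual pointers, received by value
};

inline global float *getGlobalPointer(unsigned long vmemloc, struct GlobalVars *globalVars) {
    return (global float *)(globalVars->clmem0 + vmemloc - globalVars->clmem_vmem_offset0);
}

kernel void split_sketch(global char *clmem0, unsigned long clmem_vmem_offset0,
                         struct CudaDeviceArraySketch outputs,
                         global float *input, int per_output) {
    struct GlobalVars globalVars = { clmem0, clmem_vmem_offset0 };
    int gid = (int)get_global_id(0);
    int which = gid / per_output;  // which output slice this element lands in
    if (which < outputs.f0) {
        // translate the virtual pointer into a real global pointer at load time
        global float *out = getGlobalPointer(outputs.f1[which], &globalVars);
        out[gid % per_output] = input[gid];
    }
}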

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

On the downside, it seems this is not enough to get recurrent_network.py to run yet:

[LAUNCH] kernelGo() uniqueKernelName: _ZN5Eigen8internal19ReductionInitKernelIfiEEvT_T0_PS2__1
[LAUNCH] clmem0
findMemoryByClmem clmem=0x7ffb0109ba80 memory->clmem 0x7ffb0109ba80
[LAUNCH] clmem1
findMemoryByClmem clmem=0x7ffb0109ba80 memory->clmem 0x7ffb0109ba80
[LAUNCH] i=0 FloatArg
[LAUNCH] i=1 Int32Arg=512
[LAUNCH] i=2 Int64Arg=333824
kernel failed to run, saving to easycl-failedkernel.cl
libc++abi.dylib: terminating with uncaught exception of type std::runtime_error: Invalid work group size, code -54
Abort trap: 6

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

Oh right, because I only implemented split for bounded arrays, maybe that's why...

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

Heheh! recurrent_network.py running :-O
[screenshot: 2017-06-06, 4:28 pm]

^^^ this is on Ubuntu 16.04 by the way. (But not with any official wheel, or even with stuff committed to git for that matter :-P )

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

(merged to master. Here is recurrent_network.py running on Mac:

[screenshot: 2017-06-06, 5:12 pm]

(Yes, I confess there's no way to tell it's a Mac from the screenshot.)

)

from tf-coriander.

cathalgarvey avatar cathalgarvey commented on May 29, 2024

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

Ok. Information update:

  • bidirectional_rnn.py runs too
  • dynamic_rnn.py does not, because it uses more than an 8-way split

Thoughts on the extent to which it is useful as-is, and the extent to which you need a split on more than 8 ways, as per dynamic_rnn.py?

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

(Update: I raised the maximum number of inline splits from 8 to 64 (it's just a number, in the code), so dynamic_rnn.py runs now, but it gives a loss of nan. No diagnosis as to why currently.)

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

(hmmmm, recurrent_network.py is not very speedy though. For 4000 iterations:

  • Mac CPU: 37 seconds
  • Mac Radeon: 63 seconds

I guess that rnns are not really well adapted to GPUs anyway, since they use tons of tiny batches, and using Coriander exacerbates this.

I think some standard CNN might work better.

Edit: the good news is, changing the maximum inline splits makes no difference to recurrent_network.py speed: changing the max splits from 64 back down to 8 leaves the time unchanged. Well, one can see that as good or bad I suppose.
)

from tf-coriander.

cathalgarvey avatar cathalgarvey commented on May 29, 2024

For my part, I'm not sure what effect splits have on work-flows; it's probably not "one split per GPU", but 64 splits sounds like enough to permit multi-GPU or multi-machine?

At this stage, I'm happy enough with being able to split between two GPUs. If 64 splits is enough to allow that, then I'm basically happy. :)

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

from tf-coriander.

cathalgarvey avatar cathalgarvey commented on May 29, 2024

Not sure! I'm not seasoned enough with NNs to know what to expect, performance wise. But could capacity (mobile GPU) be a factor?

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

It could, but that would also affect the cpu of course.

Anyway, I've cleaned up the output a bit, so it doesn't spam build warnings. Hopefully. And set the maximum inline splits to 64. I'll probably build a wheel once it hits 10pm (since aws instances are billed from the start of the hour).

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

Sooo.... I made a tag v0.18.0... but it didn't build on Ubuntu, following a bunch of 'cleaning' I did on coriander for eigen, whilst I was procrastinating about split. But now there is tag v0.18.1, and that builds ok :-) . Here is a piccie of dynamic_rnn.py running on Ubuntu 16.04:
[screenshot: 2017-06-06, 11:55 pm]

That's from a sort of hybrid build though, rather than from the actual pure tag, so I'm going to rebuild that tag, from a clean ./configure etc, then upload that.

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

v0.18.1 failed to build too :-P . Issue with shims; fixed in hughperkins/coriander@3cfbf6f . But then I noticed that test_floatstarstar.cu fails occasionally, so I'm checking this point.

Edit: fixed the test_floatstarstar.cu failure in hughperkins/coriander@31739d9

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

I think the latest wheel might be bugged. At least, it certainly is bugged on Mac; not sure about Ubuntu. On Mac, it freezes my Mac, and I have to reboot :-P . Fixing it now.

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

Using wheel v0.18.3 (to be uploaded), u1604 on the left, mac on the right, autoencoder.py, which was hanging my Mac earlier:

[screenshot: 2017-06-07, 6:37 pm]

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

https://github.com/hughperkins/tf-coriander/releases/tag/v0.18.3

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

@cathalgarvey calculated some timings at https://github.com/hughperkins/tf-coriander/tree/tensorflow-dev#test-results , and compared with NVIDIA® CUDA™ , on the same GPU:

  • for multilayer_perceptron.py, timings are approximately the same, for Coriander vs NVIDIA® CUDA™
  • for the recurrent networks, Coriander is around 4 times slower than NVIDIA® CUDA™

from tf-coriander.

cathalgarvey avatar cathalgarvey commented on May 29, 2024

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

That's approximately what I'd expect: LSTMs, and RNNs in general, tend to have tiny-ish batch sizes, and spend their life shuffling tiny bits of data to and from the GPU. Running on the CPU can be faster, since there's no data shuffling.

Conv nets have huge batch sizes, and need tons of maths, for the convolutional layers, and GPUs are a great fit for them.

In the case of Coriander, launch times are slower than for direct NVIDIA® CUDA™, which exacerbates the slowness of tiny RNN batch sizes.

To accelerate RNNs there are two ways really:

  • kernel fusion, to reduce the number of kernel launches, and
  • larger batch sizes

The second way, larger batch sizes, is by far and away the easier method, though it might reduce the amount of learning that takes place per sample per batch.

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

You've been very quiet in the last few days. I'm guessing you've been using ... HIP?

from tf-coriander.

cathalgarvey avatar cathalgarvey commented on May 29, 2024

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

(Just for info, timings are moved to this page now : https://github.com/hughperkins/tf-coriander/blob/master/doc/execution_speed.md )

from tf-coriander.
