
`split` failing · tf-coriander · 41 comments · CLOSED

hughperkins avatar hughperkins commented on May 29, 2024
`split` failing

from tf-coriander.

Comments (41)

hughperkins avatar hughperkins commented on May 29, 2024 1

Wheee! split working in hughperkins/coriander@98e2526 :-O

c_cpu[0] [-1.08563066  0.99734545  0.2829785  -1.50629473 -0.57860023  1.65143657
 -2.42667913 -0.42891264  1.26593626 -0.86674041]
i 0
  cpu [-1.08563066  0.99734545  0.2829785  -1.50629473 -0.57860023  1.65143657
 -2.42667913 -0.42891264  1.26593626 -0.86674041]
  gpu [-1.08563066  0.99734545  0.2829785  -1.50629473 -0.57860023  1.65143657
 -2.42667913 -0.42891264  1.26593626 -0.86674041]
i 1
  cpu [ 1.00405395  0.38618639  0.73736858  1.49073207 -0.93583387  1.17582905
 -1.25388062 -0.63775152  0.90710521 -1.42868066]
  gpu [ 1.00405395  0.38618639  0.73736858  1.49073207 -0.93583387  1.17582905
 -1.25388062 -0.63775152  0.90710521 -1.42868066]
i 2
  cpu [ 0.00284592  0.68822271 -0.87953633  0.28362733 -0.80536652 -1.72766948
 -0.39089981  0.57380587  0.33858904 -0.01183049]
  gpu [ 0.00284592  0.68822271 -0.87953633  0.28362733 -0.80536652 -1.72766948
 -0.39089981  0.57380587  0.33858904 -0.01183049]
i 3
  cpu [ 0.02968323  1.06931591  0.89070642  1.75488615  1.49564409  1.06939268
 -0.77270871  0.79486269  0.31427199 -1.32626545]
  gpu [ 0.02968323  1.06931591  0.89070642  1.75488615  1.49564409  1.06939268
 -0.77270871  0.79486269  0.31427199 -1.32626545]
PASSED

----------- generated xml file: /Users/hugh2/git-local/tensorflow-dev/test/junit-pytest-report.xml -----------
============================================= 9 tests deselected =============================================
=================================== 1 passed, 9 deselected in 2.09 seconds ==============================

It's a "bit" spammy, and I need to commit the changes to tf-coriander and so on, but seems promising to me :-)

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024 1

https://github.com/hughperkins/tf-coriander/releases/tag/v0.18.2

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

So.... the struct containing the float ** is a gpu-side buffer, and can be read ok on the gpu. However, the pointers inside it are virtual pointers: they arrive as the virtual pointers we fed to the hostside earlier. They are not valid global float * pointers :-O

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

Correction: the struct is hostside, passed in by value. It contains two types of doubly-indirected arrays:

float *f1[8];
float **f2;

The first one, f1, we can probably rewrite on the fly. The second one seems more challenging...

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

There seem to be a few approaches possible. An obvious one would be to modify the tensorflow CudaDeviceArrayStruct. This has the advantage of:

  • relatively easy short-term

but the downsides of:

  • violating the principle of not modifying the client code
  • won't generalize well

Another approach would be to convert the pointers in by-value structs containing float **s, before passing them into the kernel. This will work for bounded arrays, ie float *ptrs0[8], but less well for unbounded ones, ie float **ptrs1.

Lastly, I'm tempted to make a copy of the virtual memory table into the gpu side. This will allow the kernel itself to handle this. I mean, if we feed it code to handle it.

I currently like this last possibility, since it's the most general. In the case of tensorflow, the virtual memory table only contains one entry, for the huge mega allocate-entire-gpu buffer, so the performance cost will be negligible.
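
To make that last option a bit more concrete, here is a minimal sketch of what a gpu-side virtual memory table lookup could look like in OpenCL C. This is only an illustration of the idea, not Coriander code; the names (VmemEntry, resolve_vptr) are mine, and the table would be filled in by the hostside allocator:

struct VmemEntry {
    unsigned long virt_base;     // first virtual address covered by this entry
    unsigned long size;          // number of bytes covered
    unsigned long clmem_offset;  // where that range starts inside the big cl_mem
};

// Resolve a virtual pointer (carried around as an unsigned long) into a real
// global float *. In the tensorflow case the table has a single entry, for the
// one allocate-entire-gpu buffer, so the loop is effectively free.
inline global float *resolve_vptr(unsigned long vptr, global char *clmem0,
                                  constant struct VmemEntry *table, int num_entries) {
    for (int i = 0; i < num_entries; i++) {
        if (vptr >= table[i].virt_base && vptr < table[i].virt_base + table[i].size) {
            return (global float *)(clmem0 + table[i].clmem_offset + (vptr - table[i].virt_base));
        }
    }
    return 0;  // unmapped virtual address
}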

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

Update: if we're going to run a memory manager on the gpu side, we might need to allocate the entire gpu memory in one go, so that we get a single cl_mem that we pass in on each call. Otherwise, whilst tensorflow happens to do that anyway, in the general case we'd need to pass all buffers into any kernel call that has unbounded float ** parameters.

Edit: for now, I might make the simplifying assumption that they're all cut from the same buffer as the float * going into the method call, since this will work for tensorflow, mostly, and then generalize if/when I hit a situation that this no longer works for.

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

Just to add additional challenges, it turns out that, on my own GPU, float *s are only 32-bits. So, if float *ptrs[8] is passed in by value, from hostside, ptrs[0] and ptrs[1] do not in fact hold the values from the hostside. We have to first cast ptrs into unsigned long *, and then we can get the values.

$ clinfo
...
  Device Name                                     AMD Radeon Pro 450 Compute Engine
   ...
   Address bits                                    32, Little-Endian

The good news is, if clinfo can determine the 32-bitness, then it seems plausible that coriander can too? The bad news is, this doesn't look like it will contribute to solving the issue of handling unbounded float ** arrays, passed in by value.
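
On the "coriander can probably detect this too" point: the number clinfo prints as "Address bits" comes from the standard CL_DEVICE_ADDRESS_BITS device query, so the hostside can check it with one call. A minimal sketch (plain OpenCL host API, error handling omitted):

#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif
#include <stdio.h>

// Query how wide device-side pointers are; this is the same value clinfo
// reports as "Address bits" (32 on the Radeon Pro 450 above).
cl_uint device_address_bits(cl_device_id device) {
    cl_uint bits = 0;
    clGetDeviceInfo(device, CL_DEVICE_ADDRESS_BITS, sizeof(bits), &bits, NULL);
    return bits;
}

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    printf("Address bits: %u\n", device_address_bits(device));
    return 0;
}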

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

Looks like it is possible to manually hack the opencl, using COCL_DUMP_CL=1, then hacking on /tmp/0.cl, then using COCL_LOAD_CL=1 to run it ( https://github.com/hughperkins/coriander/blob/master/doc/advanced_usage.md#cocl_dump_cl1 ), to get the gpu buffers to be updated correctly. Using COCL_DUMP_CONFIG.. with the following yaml file:

_ZN10tensorflow12_GLOBAL__N_113SplitOpKernelIfEEvPKT_iiiNS_21CudaDeviceArrayStructIPS2_Li8EEE_0_1:
  - type: float
    clmem: 0
    offsetarg: 0
    count: 18
  - type: int32
    clmem: 0
    offsetarg: 0
    count: 18
  - type: float
    virtualaddress: 1920
    count: 18
  - type: float
    virtualaddress: 2176
    count: 18
  - type: float
    virtualaddress: 2432
    count: 18
  - type: float
    virtualaddress: 2688
    count: 18

.... and the following test python code:

import numpy as np
import tensorflow as tf


def test_split():
    shape = (72, 1)
    graph = tf.Graph()
    np.random.seed(123)
    a = np.random.randn(*shape).astype(np.float32)
    with graph.as_default():
        with tf.device('/gpu:0'):
            a_tf = tf.placeholder(tf.float32, shape)
            c_tf = tf.split(0, 4, a_tf)
            sess = tf.Session()
            with sess.as_default():
                c = sess.run(c_tf, feed_dict={a_tf: a})
                if np.prod(shape) < 20:
                    print('a', a)
                    print('c_gpu', c)
                else:
                    print('c_gpu[0]', c[0])

ie, split into 4 parts.

Then, using the hacked opencl (the hacks are described below), we can get the following output:

  Dumping buffer 0 clmem0 offset=1280 floats:
    -1.08563 0.997345 0.282979 -1.50629 -0.5786 1.65144 -2.42668 -0.428913 
    1.26594 -0.86674 -0.678886 -0.094709 1.49139 -0.638902 -0.443982 -0.434351 
    2.20593 2.18679 
  Dumping buffer 1 clmem0 offset=1280 int32s:
    -1081412110 1065308680 1049682575 -1077883324 -1089200347 1070817862 -1071952202 
    -1092904336 1067584051 -1084366157 -1087517828 -1111361849 1069475291 -1088188651 
    -1092398694 -1092721846 1074605557 1074525262 
  Dumping buffer 2 virtualaddress=1920  offset in buffer 1792 floats:
    -1.08563 0.997345 0.282979 -1.50629 -0.5786 1.65144 -2.42668 -0.428913 
    1.26594 -0.86674 -0.678886 -0.094709 1.49139 -0.638902 -0.443982 -0.434351 
    2.20593 2.18679 
  Dumping buffer 3 virtualaddress=2176  offset in buffer 2048 floats:
    1.00405 0.386186 0.737369 1.49073 -0.935834 1.17583 -1.25388 -0.637752 
    0.907105 -1.42868 -0.140069 -0.861755 -0.255619 -2.79859 -1.77153 -0.699877 
    0.927462 -0.173636 
  Dumping buffer 4 virtualaddress=2432  offset in buffer 2304 floats:
    0.00284592 0.688223 -0.879536 0.283627 -0.805367 -1.72767 -0.3909 0.573806 
    0.338589 -0.0118305 2.39237 0.412912 0.978736 2.23814 -1.29409 -1.03879 
    1.74371 -0.798063 
  Dumping buffer 5 virtualaddress=2688  offset in buffer 2560 floats:
    0.0296832 1.06932 0.890706 1.75489 1.49564 1.06939 -0.772709 0.794863 0.314272 
    -1.32627 1.4173 0.807237 0.0454901 -0.233092 -1.1983 0.199524 0.468439 
    -0.831155 
c_gpu[0] [-1.08563066  0.99734545  0.2829785  -1.50629473 -0.57860023  1.65143657
 -2.42667913 -0.42891264  1.26593626 -0.86674041 -0.67888618 -0.09470897
  1.49138963 -0.63890201 -0.44398195 -0.43435127  2.20592999  2.18678617]

We can see that buffers 2 to 5 (the output gpu buffers) correctly contain np.random.randn()-looking data, and that buffer 2 matches buffer 0 (the start of the input data). So, this looks promising.

The relevant hacks to /tmp/0.cl are:

Change the struct definition to use unsigned longs:

struct tensorflow__CudaDeviceArrayStruct {
    int f0;
    global unsigned long f1[8];
    global unsigned long * f2;
};

... and insert the following lines, at the end of section v4:

    unsigned long offsets[8];
    for(int i = 0; i < 8; i++) {
        offsets[i] = v13[0].f1[i];
        int offsetint = (int)offsets[i];
        v20[i] = (global float *)(offsets[i] + clmem0 - 128);
    }

This at least shows that it appears technically possible to modify the kernel, in the case of num splits < 8, to give the correct output buffers. The - 128 is a hack, since the first virtual address, for the first allocated cl_mem is created at location 128...

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

Created a test case, using the assumption that all gpu buffers are cut from one single allocation: hughperkins/coriander@fcba64f. Now I just have to write some code to make this pass...

Note that this also assumes a bounded array. We will look at unbounded arrays later, or just modify tensorflow to use a value of ~128 or so for MaxInlineValues https://github.com/hughperkins/tf-coriander/blob/master/tensorflow/core/kernels/cuda_device_array_gpu.h#L28 , so that the array is bounded for many typical use-cases.
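
For reference, the shape of the client-side struct being discussed is roughly as below. The field names here are my own labels, inferred from the coriander-translated version earlier in this thread; the real (templated) definition lives in cuda_device_array_gpu.h at the link above. Raising MaxInlineValues just makes the bounded inline path cover more split sizes.

// Rough sketch only; see cuda_device_array_gpu.h for the real definition.
#define MAX_INLINE_VALUES 8  /* the proposal above would raise this to ~128 */

struct CudaDeviceArrayStructSketch {
    int size;                                 // number of output pointers in use
    float *inline_values[MAX_INLINE_VALUES];  // bounded path: used when size <= MAX_INLINE_VALUES
    float **out_of_line_values;               // unbounded path: used when size is larger
};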

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

Oh... idea :-) . We can generalize to multiple allocations, by simply re-allocating a larger buffer each time, copying everything across, and redoing the virtual memory table. It's not ideal, but it does have a few interesting qualities (a rough host-side sketch follows the list):

  • it's general
  • it will fail if its assumptions are violated (rather than silently giving incorrect results, with no explanation, or introspection as to why)
  • it doesn't involve passing potentially zillions of clmem buffers into each kernel call....
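
The sketch, assuming everything lives in one backing cl_mem. Function and variable names are mine; Coriander's actual allocator is more involved (it also has to rebuild the virtual-address-to-offset table after the copy):

#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

// Grow the single backing buffer: allocate a bigger cl_mem, copy the old
// contents across, and release the old one. Virtual addresses stay the same;
// only the hostside vmem -> offset table needs redoing afterwards.
cl_mem grow_backing_buffer(cl_context ctx, cl_command_queue queue,
                           cl_mem old_buf, size_t old_size, size_t new_size) {
    cl_int err = CL_SUCCESS;
    cl_mem new_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, new_size, NULL, &err);
    if (err != CL_SUCCESS) {
        return 0;
    }
    clEnqueueCopyBuffer(queue, old_buf, new_buf, 0, 0, old_size, 0, NULL, NULL);
    clFinish(queue);
    clReleaseMemObject(old_buf);
    return new_buf;
}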

from tf-coriander.

keryell avatar keryell commented on May 29, 2024

You are a courageous man. :-)

We have to think about making this less painful in OpenCL-V... But of course it does not solve your issue now. :-(

Just to be sure: when you create a huge buffer, do you assume that the address on the GPU will always be the same across the various kernel calls?
This is a wrong (according to the specification) but useful :-) assumption I also made in another context,
https://github.com/keryell/ronan/raw/gh-pages/Talks/2016/2016-03-13-PPoPP-SYCL-triSYCL/2016-03-13-PPoPP-SYCL-triSYCL-expose.pdf , to solve a somewhat similar problem...

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

Just to be sure: when you create a huge buffer, do you assume that the address on the GPU will be always the same across the various kernel calls?

I was hoping that was the case, but I tested it yesterday and, as you state, it's not the case. Therefore I don't assume it.

Relevant test code: https://github.com/hughperkins/pub-prototyping/blob/master/opencl/test_globalpointerstability.cpp
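
For reference, the shape of that test, paraphrased from memory rather than copied from the linked file: a kernel writes out the integer value of its buffer argument's global address, the host enqueues it twice with the same cl_mem, and the two recorded values are compared. On the devices tested here they can differ between launches.

// OpenCL C sketch of the idea (not the linked test verbatim): record the raw
// device address that this cl_mem maps to on this particular launch.
kernel void record_address(global float *buf, global ulong *address_out) {
    if (get_global_id(0) == 0) {
        // pointer-to-integer cast; implementation-defined, but fine for a probe
        address_out[0] = (ulong)(uintptr_t)buf;
    }
}
// Host side: run record_address twice with the same cl_mem, read back
// address_out both times, and compare. If the values differ, the gpu-side
// address cannot be cached across kernel calls.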

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

You are a courageous man. :-)

:-)

We have to think about having this less painful in OpenCL-V... But of course it does not solve your issue now. :-(

Not just my issue: anyone who doesn't have a SPIR-V-enabled card. Which is, pretty much anyone :-D . I think?

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

https://github.com/keryell/ronan/raw/gh-pages/Talks/2016/2016-03-13-PPoPP-SYCL-triSYCL/2016-03-13-PPoPP-SYCL-triSYCL-expose.pdf to solve somehow similar problem...

Thanks! Will take a look :-)

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

Update on this, after a ton of pain:

Output from running the test case:

124 125 126 127 
125 126 127 128 
126 127 128 129 

Kernel from the test case: https://gist.github.com/hughperkins/bdefd062f83fc3d2ad59add6480d2819

Kernel from the (still failing) split: https://gist.github.com/hughperkins/ff9db47558c86a7a265341ffb5168bd2

You can see approximately how it works:

  1. We assume everything is cut from a single gpu buffer, and we pass it in, along with its virtual memory address, in the kernel parameters:
kernel void _Z17run_bounded_arra(global char* clmem0, unsigned long clmem_vmem_offset0, ...
  2. We assign these to a global variables struct, which we're going to pass around, by-value-ish:
    struct GlobalVars globalVars = { scratch, clmem0, clmem_vmem_offset0 };
    struct GlobalVars *pGlobalVars = &globalVars;
  3. We receive the virtual memory pointers by value in the struct, and store them as unsigned long, since they're 64-bit, and the gpu pointers are 32-bit...
struct tensorflow__CudaDeviceArrayStruct {
    int f0;
    unsigned long f1[8];
    global float** f2;
};

(Only look at f1 for now. I haven't done f2 yet, ie only handling bounded arrays for now. Unbounded will work very similarly, just slightly more work, with no obvious technical/theoretical obstacles.)

  4. At some point, we call llvm load on a virtual pointer, and ploof! we convert it from a virtual pointer to a global pointer:
   global float* v43_gptrstep = getGlobalPointer((float*)(&((&v20)[(long)(v38 / v25)]))[0], pGlobalVars);

Oh yes, the small inline getGlobalPointer implementation:

inline global float *getGlobalPointer(unsigned long vmemloc, struct GlobalVars *globalVars) {
    return (global float *)(globalVars->clmem0 + vmemloc - globalVars->clmem_vmem_offset0);
}

That's it! The hard bits are done. But there are a bunch of fiddly bits, since there's a bunch of hacks in Coriander, and some of this needs some hacky hacks around the hackiness :-P

Edit: tl;dr: we implement virtual memory inside the kernel :-P
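
Putting the fragments above together, a stripped-down, self-contained version of the bounded-array trick looks roughly like this. This is my own condensed sketch, not the generated kernel: the scratch member of GlobalVars and the unbounded f2 field are dropped, and the split logic is reduced to a plain elementwise copy.

// Condensed OpenCL C sketch of the bounded-array case.
struct GlobalVars {
    global char *clmem0;               // the one big backing buffer
    unsigned long clmem_vmem_offset0;  // virtual address of its start
};

struct CudaDeviceArraySketch {
    int f0;               // number of output pointers in use
    unsigned long f1[8];  // hostside 64-bit virtual pointers, received by value
};

inline global float *getGlobalPointer(unsigned long vmemloc, struct GlobalVars *globalVars) {
    return (global float *)(globalVars->clmem0 + vmemloc - globalVars->clmem_vmem_offset0);
}

kernel void split_sketch(global char *clmem0, unsigned long clmem_vmem_offset0,
                         struct CudaDeviceArraySketch outputs,
                         global float *input, int per_output) {
    struct GlobalVars globalVars = { clmem0, clmem_vmem_offset0 };
    int gid = (int)get_global_id(0);
    int which = gid / per_output;  // which output slice this element lands in
    if (which < outputs.f0) {
        // translate the virtual pointer into a real global pointer at load time
        global float *out = getGlobalPointer(outputs.f1[which], &globalVars);
        out[gid % per_output] = input[gid];
    }
}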

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

On the downside, it seems this is not enough to get recurrent_network.py to run yet:

[LAUNCH] kernelGo() uniqueKernelName: _ZN5Eigen8internal19ReductionInitKernelIfiEEvT_T0_PS2__1
[LAUNCH] clmem0
findMemoryByClmem clmem=0x7ffb0109ba80 memory->clmem 0x7ffb0109ba80
[LAUNCH] clmem1
findMemoryByClmem clmem=0x7ffb0109ba80 memory->clmem 0x7ffb0109ba80
[LAUNCH] i=0 FloatArg
[LAUNCH] i=1 Int32Arg=512
[LAUNCH] i=2 Int64Arg=333824
kernel failed to run, saving to easycl-failedkernel.cl
libc++abi.dylib: terminating with uncaught exception of type std::runtime_error: Invalid work group size, code -54
Abort trap: 6

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

Oh right, because I only implemented split for bounded arrays, maybe that's why...

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

Heheh! recurrent_network.py running :-O
[screenshot: 2017-06-06, 4:28 pm]

^^^ this is on Ubuntu 16.04 by the way. (But not with any official wheel, or even with stuff committed to git for that matter :-P )

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

(merged to master. Here is recurrent_network.py running on Mac:

[screenshot: 2017-06-06, 5:12 pm]

(Yes, I confess there's no way to tell it's a Mac from the screenshot.)

)

from tf-coriander.

cathalgarvey avatar cathalgarvey commented on May 29, 2024

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

Ok. Information update:

  • bidirectional_rnn.py runs too
  • dynamic_rnn.py does not, because it uses more than an 8-way split

Thoughts on the extent to which it is useful as-is, and the extent to which you need a split on more than 8 ways, as per dynamic_rnn.py?

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

(Update: I raised the maximum number of inline splits from 8 to 64 (it's just a number, in the code), so dynamic_rnn.py runs now, but it gives a loss of nan. No diagnosis as to why currently.)

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

(hmmmm, recurrent_network.py is not very speedy though. For 4000 iterations:

  • Mac CPU: 37 seconds
  • Mac Radeon: 63 seconds

I guess that rnns are not really well adapted to GPUs anyway, since they use tons of tiny batches, and using Coriander exacerbates this.

I think some standard CNN might work better.

Edit: the good news is, changing the maximum inline splits makes no difference to recurrent_network.py speed: changing the max splits from 64 back down to 8 leaves the time unchanged. Well, one can see that as good or bad I suppose.
)

from tf-coriander.

cathalgarvey avatar cathalgarvey commented on May 29, 2024

For my part, I'm not sure what effect splits have on work-flows; it's probably not "one split per GPU", but 64 splits sounds like enough to permit multi-GPU or multi-machine?

At this stage, I'm happy enough with being able to split between two GPUs. If 64 splits is enough to allow that, then I'm basically happy. :)

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

from tf-coriander.

cathalgarvey avatar cathalgarvey commented on May 29, 2024

Not sure! I'm not seasoned enough with NNs to know what to expect, performance wise. But could capacity (mobile GPU) be a factor?

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

It could, but that would also affect the cpu of course.

Anyway, I've cleaned up the output a bit, so it doesn't spam build warnings. Hopefully. And set the maximum inline splits to 64. I'll probably build a wheel once it hits 10pm (since aws instances are billed from the start of the hour).

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

Sooo.... I made a tag v0.18.0... but it didn't build on Ubuntu, following a bunch of 'cleaning' I did on coriander for eigen, whilst I was procrastinating about split. But now there is tag v0.18.1, and that builds ok :-) . Here is a piccie of dynamic_rnn.py running on Ubuntu 16.04:
[screenshot: 2017-06-06, 11:55 pm]

That's from a sort of hybrid build though, rather than from the actual pure tag, so I'm going to rebuild that tag, from a clean ./configure etc, then upload that.

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

v0.18.1 failed to build too :-P . Issue with shims; fixed in hughperkins/coriander@3cfbf6f . But then I noticed that test_floatstarstar.cu fails occasionally, so I'm checking this point.

Edit: fixed the test_floatstarstar.cu failure in hughperkins/coriander@31739d9

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

I think the latest wheel might be bugged. At least, it certainly is bugged on Mac; not sure about Ubuntu. On Mac, it freezes my Mac, and I have to reboot :-P . Fixing it now.

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

Using wheel v0.18.3 (to be uploaded), u1604 on the left, mac on the right, autoencoder.py, which was hanging my Mac earlier:

[screenshot: 2017-06-07, 6:37 pm]

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

https://github.com/hughperkins/tf-coriander/releases/tag/v0.18.3

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

@cathalgarvey calculated some timings at https://github.com/hughperkins/tf-coriander/tree/tensorflow-dev#test-results , and compared with NVIDIA® CUDA™ , on the same GPU:

  • for multilayer_perceptron.py, timings are approximately the same, for Coriander vs NVIDIA® CUDA™
  • for the recurrent networks, Coriander is around 4 times slower than NVIDIA® CUDA™

from tf-coriander.

cathalgarvey avatar cathalgarvey commented on May 29, 2024

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

That's approximately what I'd expect: LSTMs, and RNNs in general, tend to have tiny-ish batch sizes, and spend their life shuffling tiny bits of data to and from the GPU. Running on the CPU can be faster, since there's no data shuffling.

Conv nets have huge batch sizes, and need tons of maths, for the convolutional layers, and GPUs are a great fit for them.

In the case of Coriander, launch times are slower than for direct NVIDIA® CUDA™, which exacerbates the slowness of tiny RNN batch sizes.

To accelerate RNNs there are two ways really:

  • kernel fusion, to reduce the number of kernel launches, and
  • larger batch sizes

The second way, larger batch sizes, is by far and away the easier method, though it might reduce the amount of learning that takes place per sample per batch.

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

You've been very quiet in the last few days. I'm guessing you've been using ... HIP?

from tf-coriander.

cathalgarvey avatar cathalgarvey commented on May 29, 2024

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

from tf-coriander.

hughperkins avatar hughperkins commented on May 29, 2024

(Just for info, timings are moved to this page now : https://github.com/hughperkins/tf-coriander/blob/master/doc/execution_speed.md )

from tf-coriander.
