Comments (17)

emjotde avatar emjotde commented on July 28, 2024

Hi,
That looks more like a driver error to me? What machine is this? Also does this happen regularly, only once, only with your branch? Too little information to guess anything.

from marian-dev.

afaji avatar afaji commented on July 28, 2024

this is from master, original code without modifying anything.
Happened once, just today.

running on azure

emjotde avatar emjotde commented on July 28, 2024

Hm, difficult to say. What's the driver version? On some machines the nvidia drivers are highly unstable, not something we can fix really.

afaji avatar afaji commented on July 28, 2024

Tesla M60

command used: `./marian -t corpus.ro corpus.en -d 0 1 2 3 --disp-freq=100 --layer-normalization`

Actually, just happened again right now, with freshly pulled code:

[2017-03-22 16:49:00] Ep. 1 : Up. 5600 : Sen. 358400 : Cost 56.34 : Time 48.07s : 3026.32 words/s
[2017-03-22 16:49:52] Ep. 1 : Up. 5700 : Sen. 364800 : Cost 55.40 : Time 51.16s : 2809.11 words/s
[2017-03-22 16:50:42] Ep. 1 : Up. 5800 : Sen. 371200 : Cost 56.01 : Time 50.92s : 2893.21 words/s
GPUassert: an illegal memory access was encountered /home/alfikri/ori/marian/src/kernels/tensor_operators.cu 567
GPUassert: an illegal memory access was encountered /home/alfikri/ori/marian/src/tensors/tensor.cu 46
Segmentation fault (core dumped)

emjotde avatar emjotde commented on July 28, 2024

afaji avatar afaji commented on July 28, 2024

Still happened after reset.
I just deallocated and restarted the Azure server; not sure if that counts as resetting the machine.

[2017-03-22 18:33:09] Ep. 2 : Up. 7100 : Sen. 81408 : Cost 47.13 : Time 49.01s : 3035.36 words/s
[2017-03-22 18:33:53] Ep. 2 : Up. 7200 : Sen. 87808 : Cost 43.82 : Time 44.38s : 3113.88 words/s
[2017-03-22 18:34:46] Ep. 2 : Up. 7300 : Sen. 94208 : Cost 48.74 : Time 52.93s : 2899.93 words/s
GPUassert: an illegal memory access was encountered /home/alfikri/ori/marian/src/kernels/tensor_operators.cu 567
GPUassert: an illegal memory access was encountered /home/alfikri/ori/marian/src/tensors/tensor.cu 46
Segmentation fault (core dumped)

So this is 3 errors in 3 attempts.

emjotde avatar emjotde commented on July 28, 2024

afaji avatar afaji commented on July 28, 2024

Driver version: 375.39

config: ./marian -t corpus.ro corpus.en -d 0 1 2 3 --disp-freq=100 --layer-normalization --log=data_baseline.txt

data:
I ran this WMT script and used corpus.ro and corpus.en

emjotde avatar emjotde commented on July 28, 2024

I have been having problems with those new driver versions, especially with this version (weird errors, random freezes), and downgraded to 367.48 on all my machines. This made all errors go away. You can try this and I can experiment with the data in the meantime.

emjotde avatar emjotde commented on July 28, 2024

You did not do any preprocessing? Notice that the data is not even tokenized. But I guess it might be empty lines or something, so maybe a good test. Good chance this is a case of garbage in/garbage out.
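
One way to test the empty-line guess before training is to scan both sides of the corpus. This is a minimal sketch: the helper name `find_suspect_lines` is hypothetical and not part of Marian, and it only assumes the two files are line-aligned plain text.

```python
from itertools import zip_longest


def find_suspect_lines(src_lines, trg_lines):
    """Return (line_number, reason) pairs for empty or unpaired lines.

    Both arguments are iterables of lines, e.g. open file handles.
    """
    problems = []
    for n, (src, trg) in enumerate(zip_longest(src_lines, trg_lines), start=1):
        if src is None or trg is None:
            # One file ran out of lines before the other.
            problems.append((n, "missing counterpart"))
        elif not src.strip() or not trg.strip():
            # At least one side is empty or whitespace-only.
            problems.append((n, "empty line"))
    return problems
```

To use it, open corpus.ro and corpus.en as UTF-8 text and pass the two file handles in; an empty result means no blank or unpaired lines were found, which would make bad input a less likely cause.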

afaji avatar afaji commented on July 28, 2024

I always thought Marian did the preprocessing...

but I don't think unprocessed data causes the error. You might get a crappy result, though.

Now that I retry with preprocessed data, I found something interesting:

[2017-03-23 09:22:39] [config] train-sets:
[2017-03-23 09:22:39] [config]   - corpusPrep.ro
[2017-03-23 09:22:39] [config]   - corpusPrep.en
[2017-03-23 09:22:39] [config] type: dl4mt
[2017-03-23 09:22:39] [config] valid-freq: 10000
[2017-03-23 09:22:39] [config] valid-metrics:
[2017-03-23 09:22:39] [config]   - cross-entropy
[2017-03-23 09:22:39] [config] vocabs:
[2017-03-23 09:22:39] [config]   - corpus.ro.yml
[2017-03-23 09:22:39] [config]   - corpus.en.yml
[2017-03-23 09:22:39] [config] workspace: 2048

Note that the vocab file is corpus.ro.yml, even though my training file is corpusPrep.ro. In fact, no matter what my training file is named, the vocab is always corpus.ro.yml. Is this intended? Maybe we load the wrong vocab, which causes the crash.

emjotde avatar emjotde commented on July 28, 2024

No, why should marian do preprocessing? No such magic yet.
I suppose you have a model.npz and a model.npz.yml file in that folder. Marian automatically reloads that.
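
To see whether a run would pick up files from an earlier one, a quick check is to look for those two files in the working directory. This is only a sketch: the helper name `leftover_model_files` is hypothetical, and the file names are simply the ones mentioned above.

```python
from pathlib import Path


def leftover_model_files(directory="."):
    """List which of model.npz / model.npz.yml already exist in `directory`."""
    names = ("model.npz", "model.npz.yml")
    return [name for name in names if (Path(directory) / name).is_file()]
```

If the list is not empty, move the files elsewhere or train in a clean folder before concluding that the vocabulary setting is wrong.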

afaji avatar afaji commented on July 28, 2024

I see..

so that's not the issue. Currently training the preprocessed files.

kpu avatar kpu commented on July 28, 2024

@afaji can you make a minimal test case (tokenized or not) that causes the segfault then check it in, intentionally breaking the build?

emjotde avatar emjotde commented on July 28, 2024

I am running it now on the unprocessed data. So far no problems in the 3rd epoch (different machine though). I still suspect it's the driver version.

emjotde avatar emjotde commented on July 28, 2024

So, six full epochs, no problem. You might want to try downgrading that driver...

emjotde avatar emjotde commented on July 28, 2024

Two times 10 epochs, no segfault.
