Comments (17)
Hi,
That looks more like a driver error to me? What machine is this? Also does this happen regularly, only once, only with your branch? Too little information to guess anything.
from marian-dev.
This is from master, the original code without any modifications.
It happened once, just today.
Running on Azure.
from marian-dev.
Hm, difficult to say. What's the driver version? On some machines the nvidia drivers are highly unstable, not something we can fix really.
from marian-dev.
Tesla M60
command used: `./marian -t corpus.ro corpus.en -d 0 1 2 3 --disp-freq=100 --layer-normalization`
Actually, just happened again right now, with freshly pulled code:
[2017-03-22 16:49:00] Ep. 1 : Up. 5600 : Sen. 358400 : Cost 56.34 : Time 48.07s : 3026.32 words/s
[2017-03-22 16:49:52] Ep. 1 : Up. 5700 : Sen. 364800 : Cost 55.40 : Time 51.16s : 2809.11 words/s
[2017-03-22 16:50:42] Ep. 1 : Up. 5800 : Sen. 371200 : Cost 56.01 : Time 50.92s : 2893.21 words/s
GPUassert: an illegal memory access was encountered /home/alfikri/ori/marian/src/kernels/tensor_operators.cu 567
GPUassert: an illegal memory access was encountered /home/alfikri/ori/marian/src/tensors/tensor.cu 46
Segmentation fault (core dumped)
from marian-dev.
Still happened after reset.
I just deallocated and restarted the Azure server; not sure if that counts as resetting the machine.
[2017-03-22 18:33:09] Ep. 2 : Up. 7100 : Sen. 81408 : Cost 47.13 : Time 49.01s : 3035.36 words/s
[2017-03-22 18:33:53] Ep. 2 : Up. 7200 : Sen. 87808 : Cost 43.82 : Time 44.38s : 3113.88 words/s
[2017-03-22 18:34:46] Ep. 2 : Up. 7300 : Sen. 94208 : Cost 48.74 : Time 52.93s : 2899.93 words/s
GPUassert: an illegal memory access was encountered /home/alfikri/ori/marian/src/kernels/tensor_operators.cu 567
GPUassert: an illegal memory access was encountered /home/alfikri/ori/marian/src/tensors/tensor.cu 46
Segmentation fault (core dumped)
So this is 3 errors in 3 attempts.
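To compare where each run died, the last progress line before the GPUassert can be pulled out of a saved log. A minimal sketch, assuming the log format shown above; `last_progress` and the regex are illustrative helpers, not part of Marian:

```python
import re

# Matches training progress lines such as:
# [2017-03-22 16:50:42] Ep. 1 : Up. 5800 : Sen. 371200 : Cost 56.01 : ...
LOG_RE = re.compile(
    r"\[(?P<time>[^\]]+)\] Ep\. (?P<epoch>\d+) : Up\. (?P<update>\d+) : "
    r"Sen\. (?P<sentences>\d+) : Cost (?P<cost>[\d.]+)"
)

def last_progress(log_text):
    """Return (epoch, update, cost) of the last progress line, or None."""
    last = None
    for line in log_text.splitlines():
        match = LOG_RE.search(line)
        if match:
            last = (
                int(match["epoch"]),
                int(match["update"]),
                float(match["cost"]),
            )
    return last
```

For the two logs above this gives update 5800 in epoch 1 and update 7300 in epoch 2.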
from marian-dev.
Driver version: 375.39
config: ./marian -t corpus.ro corpus.en -d 0 1 2 3 --disp-freq=100 --layer-normalization --log=data_baseline.txt
data:
I ran this WMT script and used corpus.ro and corpus.en.
from marian-dev.
I have been having problems with those new driver versions, especially with this version (weird errors, random freezes), and downgraded to 367.48 on all my machines. This made all errors go away. You can try this and I can experiment with the data in the meantime.
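To confirm which driver each GPU is actually running before and after a downgrade, something like the following could be used. A rough sketch, assuming `nvidia-smi` is on the PATH; the helper names are made up for illustration, and 375.39 is simply the version reported in this thread:

```python
import subprocess

# Driver version reported in this thread (an assumption about the cause, not a confirmed bug).
SUSPECT_VERSION = "375.39"

def parse_driver_versions(smi_output):
    """Parse one driver version per GPU from nvidia-smi CSV output."""
    return [line.strip() for line in smi_output.splitlines() if line.strip()]

def query_driver_versions():
    """Ask nvidia-smi for the driver version of every visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_driver_versions(result.stdout)

def gpus_on_suspect_driver(versions, suspect=SUSPECT_VERSION):
    """Return the indices of GPUs whose driver matches the suspect version."""
    return [i for i, version in enumerate(versions) if version == suspect]
```

Calling `gpus_on_suspect_driver(query_driver_versions())` on the training machine lists the GPUs still on the suspect driver.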
from marian-dev.
You did not do any preprocessing? Notice that the data is not even tokenized. But I guess it might be empty lines or something, so maybe a good test. Good chance this is a case of garbage in/garbage out.
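A quick check for empty lines or mismatched lengths in the two corpus files could look like this. A minimal sketch; `find_bad_pairs` is a hypothetical helper, and whether empty lines actually trigger the crash is not established here:

```python
from itertools import zip_longest

def find_bad_pairs(src_lines, trg_lines):
    """Return (line_number, reason) for every suspicious sentence pair."""
    problems = []
    pairs = zip_longest(src_lines, trg_lines)
    for line_no, (src, trg) in enumerate(pairs, start=1):
        if src is None or trg is None:
            # One file is shorter than the other.
            problems.append((line_no, "missing counterpart"))
        elif not src.strip() or not trg.strip():
            # At least one side contains only whitespace.
            problems.append((line_no, "empty line"))
    return problems
```

It accepts any two iterables of lines, so two open file handles for corpus.ro and corpus.en can be passed in directly.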
from marian-dev.
I always thought Marian did the preprocessing...
But I don't think unprocessed data causes the error. You might get a crappy result, though.
Now that I have retried with preprocessed data, I found something interesting:
[2017-03-23 09:22:39] [config] train-sets:
[2017-03-23 09:22:39] [config] - corpusPrep.ro
[2017-03-23 09:22:39] [config] - corpusPrep.en
[2017-03-23 09:22:39] [config] type: dl4mt
[2017-03-23 09:22:39] [config] valid-freq: 10000
[2017-03-23 09:22:39] [config] valid-metrics:
[2017-03-23 09:22:39] [config] - cross-entropy
[2017-03-23 09:22:39] [config] vocabs:
[2017-03-23 09:22:39] [config] - corpus.ro.yml
[2017-03-23 09:22:39] [config] - corpus.en.yml
[2017-03-23 09:22:39] [config] workspace: 2048
Note that the vocab file is corpus.ro.yml, even though my training file is corpusPrep.ro. In fact, no matter what my training file is called, the vocab is always corpus.ro.yml. Is this intended? Maybe we load the wrong vocab, and that causes the crash.
from marian-dev.
No, why should marian do preprocessing? No such magic yet.
I suppose you have a model.npz and a model.npz.yml file in that folder. Marian automatically reloads that.
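One way to see whether a previous run left such files behind. A small sketch; the file names are the ones mentioned above, and `leftover_model_files` is a hypothetical helper:

```python
from pathlib import Path

def leftover_model_files(directory, model_name="model.npz"):
    """Return the names of an existing model file and its .yml config, if any."""
    base = Path(directory)
    candidates = [base / model_name, base / (model_name + ".yml")]
    return [path.name for path in candidates if path.exists()]
```

A non-empty result for the training folder would explain why the logged config still shows the old vocab names.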
from marian-dev.
I see...
So that's not the issue. Currently training on the preprocessed files.
from marian-dev.
@afaji can you make a minimal test case (tokenized or not) that causes the segfault then check it in, intentionally breaking the build?
from marian-dev.
I am running it now on the unprocessed data. So far no problems in the 3rd epoch (different machine though). I still suspect it's the driver version.
from marian-dev.
So, six full epochs, no problem. You might want to try downgrading that driver...
from marian-dev.
Two times 10 epochs, no segfault.
from marian-dev.