Comments (10)
Not sure. I want to say the dataset was around 40GB, and the machine had maybe 128GB of RAM. It was a long time ago though, so take that with a grain of salt.
from mitie.
@davisking, on a 4GB dataset it looks to me like OpenBLAS is running out of memory. I was monitoring the RAM: swap was full, but memory usage was only about 40%, which is strange :/
I will try to increase the dataset incrementally and see when it fails. I have no idea.
lldb -- ./wordrep -e <path>
(lldb) target create "./wordrep"
Current executable set to '<path>' (arm64).
(lldb) settings set -- target.run-args "-e" "<path>"
(lldb) run
Process 46942 launched: '<path>' (arm64)
number of raw ASCII files found: 76
num words: 200000
saving word counts to top_word_counts.dat
number of raw ASCII files found: 76
Sample 50000000 random context vectors
Now do CCA (left size: 50000000, right size: 50000000).
Process 46942 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x6e01ccc1fc)
frame #0: 0x0000000192905900 libLAPACK.dylib`SLARFT + 400
libLAPACK.dylib`SLARFT:
-> 0x192905900 <+400>: ldr s0, [x27, x23, lsl #2]
0x192905904 <+404>: fcmp s0, #0.0
0x192905908 <+408>: b.ne 0x192905974 ; <+516>
0x19290590c <+412>: sub x23, x23, #0x1
Target 0: (wordrep) stopped.
from mitie.
Hard to say. That shouldn't happen though. Try running the program in gdb and getting a stack trace to see what's going on.
from mitie.
@davisking I will do that (I will post results in approx. 2 hours). Do you roughly remember the size of the dataset that the English or Spanish model was trained on, as well as the RAM of the machine?
from mitie.
Okay, thank you. Is there any formula or approximation I can use to calculate the needed amount of RAM? (I want to train a model for Polish on a 50-80GB dataset.) Do you know how big total_word_feature_extractor.dat would be?
from mitie.
I don't recall. It should be linear though. Try some sizes and see what happens. But I guess you are saying you think you are just running out of RAM? I would normally expect a bus error to be something else, but I'm not sure if OS X just reports the error differently.
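As a rough back-of-the-envelope check (my own sketch, not MITIE's actual allocation scheme): the log above shows wordrep sampling 50,000,000 context vectors per side of the CCA. Assuming single-precision (4-byte) floats and treating the context-vector dimensionality as an unknown parameter, the two input matrices alone need roughly `2 * 50e6 * dim * 4` bytes, before counting any LAPACK workspace, which is consistent with the "linear in dataset size" intuition up to the 50M sample cap.

```python
# Rough estimate of the memory the two CCA input matrices need.
# ASSUMPTIONS (not taken from MITIE's source): single-precision (4-byte)
# floats, and a hypothetical context-vector dimensionality `dim`.

def cca_input_bytes(num_vectors: int, dim: int, bytes_per_float: int = 4) -> int:
    """Bytes for the left and right context-vector matrices combined."""
    return 2 * num_vectors * dim * bytes_per_float

# The log above shows 50,000,000 vectors per side on the full dataset.
for dim in (100, 200, 500):  # hypothetical dimensionalities
    gb = cca_input_bytes(50_000_000, dim) / 1024**3
    print(f"dim={dim}: ~{gb:.0f} GB just for the input matrices")
```

Even at a modest dim=100 that is tens of gigabytes, so running out of physical RAM plus swap on this workload is plausible.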
from mitie.
@davisking I knew from some of the issues that MITIE requires a lot of RAM, but that's quite surprising :/
dataset -> total_word_feature_extractor.dat
1. 50MB -> 335MB
number of raw ASCII files found: 1
num words: 200000
saving word counts to top_word_counts.dat
number of raw ASCII files found: 1
Sample 50000000 random context vectors
Now do CCA (left size: 8582326, right size: 8582326).
correlations: 0.783697 0.495714 0.428661 0.417896 0.399711 0.308799 0.257686 0.241372 0.214332 0.206914 0.180628 0.151268 0.143147 0.135543 0.129035 0.11404 0.104493 0.094976 0.0889702 0.081165 0.0765073 0.0743562 0.0730994 0.0653017 0.0622959 0.0602029 0.0504991 0.0483258 0.0475654 0.0458159 0.0405218 0.0396676 0.0377837 0.035018 0.0321907 0.0302941 0.0294151 0.0272736 0.0250081 0.023913 0.0227566 0.0216142 0.0207634 0.0199015 0.0191722 0.0172601 0.0167749 0.0165069 0.0156701 0.0152914 0.0150729 0.0147807 0.0135261 0.013273 0.0125623 0.0120217 0.0117249 0.0115241 0.010776 0.0105152 0.0102252 0.00982769 0.00967478 0.00906732 0.00888962 0.00882777 0.00853846 0.00810019 0.00803891 0.00766455 0.0073715 0.00711813 0.00686214 0.0067737 0.00648305 0.00637957 0.00621849 0.00609243 0.00578847 0.00560462 0.00551808 0.00540755 0.00527975 0.0051427 0.00495427 0.0048308 0.00470006 0.00460154 0.00457549 0.00444141
CCA done, now build up average word vectors
num words: 200000
num word vectors loaded: 200000
got word vectors, now learn how they correlate with morphological features.
building morphological vectors
L.size(): 200000
R.size(): 200000
Now running CCA on word <-> morphology...
correlations: 0.972561 0.671965 0.612678 0.579442 0.505745 0.410469 0.370399 0.320987 0.303507 0.295264 0.284905 0.272496 0.260294 0.252493 0.247422 0.243564 0.224549 0.215319 0.211788 0.202069 0.198979 0.193592 0.187116 0.179735 0.177608 0.173987 0.167869 0.165495 0.159846 0.157329 0.152932 0.148687 0.146318 0.144891 0.1425 0.140035 0.138354 0.137298 0.13551 0.133037 0.131869 0.129603 0.129255 0.126594 0.125905 0.123318 0.119747 0.116908 0.116507 0.115395 0.114808 0.111849 0.111262 0.108799 0.108121 0.106949 0.105413 0.103576 0.103322 0.102464 0.101616 0.100574 0.100245 0.0993405 0.0986815 0.0981801 0.0973369 0.0969129 0.0962343 0.0956761 0.0950916 0.0949497 0.0937699 0.0930668 0.092798 0.0924761 0.0915229 0.0906338 0.0902752 0.0897365 0.0893135 0.0891064 0.0884818 0.0879273 0.0873452 0.0871056 0.0864901 0.0862346 0.0858451 0.0855675
morphological feature dimensionality: 90
total word feature dimensionality: 271
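If (an assumption on my part, not stated in the logs) the total feature dimensionality is simply the learned word-vector dimensions concatenated with the morphological ones, the numbers above imply 271 - 90 = 181 word-vector dimensions:

```python
# ASSUMPTION: total = word-vector dims + morphological dims (concatenation).
morph_dim = 90    # "morphological feature dimensionality" from the log
total_dim = 271   # "total word feature dimensionality" from the log
word_dim = total_dim - morph_dim
print(word_dim)  # 181, under this concatenation assumption
```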
2. 100MB -> 337MB
number of raw ASCII files found: 2
num words: 200000
saving word counts to top_word_counts.dat
number of raw ASCII files found: 2
Sample 50000000 random context vectors
Now do CCA (left size: 17204027, right size: 17204027).
correlations: 0.779923 0.512237 0.436919 0.423704 0.412156 0.32337 0.260955 0.245038 0.222629 0.220487 0.186529 0.155599 0.145268 0.140225 0.1293 0.11533 0.10416 0.0969533 0.0920494 0.0812887 0.0788473 0.0760371 0.0711883 0.0664675 0.0616015 0.0582108 0.0499312 0.0481921 0.0471149 0.0457483 0.0405427 0.0391695 0.0366974 0.0343238 0.0317766 0.0301706 0.0284668 0.02708 0.024634 0.0230824 0.0221957 0.0211691 0.0201103 0.0192433 0.0177503 0.0170046 0.0165348 0.0160024 0.0154991 0.014842 0.0144806 0.0140849 0.0132977 0.0126606 0.012073 0.0118559 0.0112087 0.0108456 0.0105672 0.0102032 0.0100555 0.0096982 0.00935652 0.00908955 0.00844935 0.00818099 0.00796244 0.00762995 0.00753236 0.0072838 0.00706413 0.00690194 0.00681546 0.00654413 0.00626689 0.00608943 0.00596117 0.00557491 0.00536563 0.0051692 0.00503521 0.00487138 0.00479157 0.00464975 0.0042314 0.00396397 0.00389778 0.00373412 0.0036037 0.00347846
CCA done, now build up average word vectors
num words: 200000
num word vectors loaded: 200000
got word vectors, now learn how they correlate with morphological features.
building morphological vectors
L.size(): 200000
R.size(): 200000
Now running CCA on word <-> morphology...
correlations: 0.978974 0.713797 0.657182 0.632944 0.558261 0.47269 0.425248 0.372064 0.351736 0.339963 0.320005 0.311463 0.30237 0.294466 0.289986 0.280466 0.265051 0.25838 0.247886 0.235087 0.232196 0.225109 0.224001 0.212538 0.204304 0.202062 0.191768 0.184285 0.182187 0.181151 0.174696 0.169313 0.167562 0.16549 0.160309 0.157552 0.154192 0.152316 0.150774 0.146907 0.144155 0.142709 0.141969 0.138627 0.137163 0.134016 0.132602 0.129624 0.12818 0.126129 0.125144 0.12363 0.122081 0.120231 0.118327 0.116443 0.115346 0.114512 0.113029 0.112386 0.111879 0.111077 0.110131 0.108435 0.107854 0.107057 0.106172 0.105093 0.104389 0.103112 0.102088 0.101712 0.100469 0.0994505 0.0989811 0.0984102 0.0982125 0.0972503 0.096731 0.0956087 0.0950048 0.094337 0.0939208 0.0925872 0.0919503 0.0914311 0.0909645 0.0905782 0.0905216 0.0894758
morphological feature dimensionality: 90
total word feature dimensionality: 271
3. 150MB -> failed
number of raw ASCII files found: 3
num words: 200000
saving word counts to top_word_counts.dat
number of raw ASCII files found: 3
Sample 50000000 random context vectors
Now do CCA (left size: 25828623, right size: 25828623).
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x6f1a629238)
frame #0: 0x0000000192905900 libLAPACK.dylib`SLARFT + 400
libLAPACK.dylib`SLARFT:
-> 0x192905900 <+400>: ldr s0, [x27, x23, lsl #2]
0x192905904 <+404>: fcmp s0, #0.0
0x192905908 <+408>: b.ne 0x192905974 ; <+516>
0x19290590c <+412>: sub x23, x23, #0x1
4. 200MB -> failed
number of raw ASCII files found: 4
num words: 200000
saving word counts to top_word_counts.dat
number of raw ASCII files found: 4
Sample 50000000 random context vectors
Now do CCA (left size: 34482189, right size: 34482189).
Process 49420 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x6f1d74b3b0)
frame #0: 0x0000000192905900 libLAPACK.dylib`SLARFT + 400
libLAPACK.dylib`SLARFT:
-> 0x192905900 <+400>: ldr s0, [x27, x23, lsl #2]
0x192905904 <+404>: fcmp s0, #0.0
0x192905908 <+408>: b.ne 0x192905974 ; <+516>
0x19290590c <+412>: sub x23, x23, #0x1
Target 0: (wordrep) stopped.
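The logged left/right CCA sizes grow almost exactly linearly with dataset size (about 172k vectors per MB), and the crash first appears between the 100MB run (17.2M vectors, OK) and the 150MB run (25.8M vectors, crash). A small sketch using only the numbers from the logs above:

```python
# Left/right CCA sizes from the wordrep logs above (dataset MB -> vectors),
# and whether that run survived the CCA step.
runs = {
    50: (8_582_326, True),
    100: (17_204_027, True),
    150: (25_828_623, False),
    200: (34_482_189, False),
}

# Vectors sampled per MB of input, per run -- nearly constant (~172k/MB),
# i.e. the workload grows linearly with dataset size until the 50M cap.
rates = [vectors / mb for mb, (vectors, _ok) in runs.items()]
print(f"vectors per MB: {min(rates):,.0f} .. {max(rates):,.0f}")

# The failure threshold on this machine therefore sits somewhere between
# the largest successful and the smallest failing run.
ok_max = max(v for v, ok in runs.values() if ok)
fail_min = min(v for v, ok in runs.values() if not ok)
print(f"crash threshold between {ok_max:,} and {fail_min:,} vectors")
```

Bisecting with dataset sizes between 100MB and 150MB would narrow the threshold further, which would help distinguish a hard memory limit from some other size-dependent bug.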
I'm not sure what to do next, do you have any tips?
from mitie.
> But I guess you are saying you think you are just running out of RAM? I would normally expect a bus error to be something else, but I'm not sure if OS X just reports the error differently.
RAM was my initial guess (because I see that memory usage is capped at about 40% while swap is drained), but I could be mistaken :c
from mitie.
@davisking do you have any ideas?
from mitie.
No idea. You will have to debug into it and see what the deal is.
from mitie.