Comments (5)
@YanLiang1102 maybe it contains the question length
from eeqa.
good thoughts!
@ll0iecas hey there, did you try to reproduce the results with this codebase? I tried training on eight K80 GPUs, and it always fails silently at a different epoch and step (no exception is thrown). With a single K80 GPU and the batch size set as low as 2, it also exits randomly with no exception. Have you had the same experience, or do you have any thoughts on this? Thank you in advance!
I've attached the training logs here. For example, with a batch size of 32:
!!! Best dev f1_c (lr=4e-05, epoch=0): p_c: 72.33, r_c: 25.56, f1_c: 37.77, p_i: 89.94, r_i: 31.78, f1_i: 46.96
Epoch: 0, Step: 270 / 537, used_time = 799.39s, loss = 0.175028
!!! Best dev f1_c (lr=4e-05, epoch=0): p_c: 68.51, r_c: 27.56, f1_c: 39.30, p_i: 89.50, r_i: 36.00, f1_i: 51.35
Epoch: 0, Step: 276 / 537, used_time = 817.74s, loss = 0.172604
!!! Best dev f1_c (lr=4e-05, epoch=0): p_c: 66.96, r_c: 33.33, f1_c: 44.51, p_i: 82.14, r_i: 40.89, f1_i: 54.60
Epoch: 0, Step: 300 / 537, used_time = 888.44s, loss = 0.162986
!!! Best dev f1_c (lr=4e-05, epoch=0): p_c: 76.12, r_c: 34.00, f1_c: 47.00, p_i: 86.07, r_i: 38.44, f1_i: 53.15
Epoch: 0, Step: 306 / 537, used_time = 906.22s, loss = 0.160711
!!! Best dev f1_c (lr=4e-05, epoch=0): p_c: 76.33, r_c: 35.11, f1_c: 48.10, p_i: 85.51, r_i: 39.33, f1_i: 53.88
Epoch: 0, Step: 318 / 537, used_time = 942.80s, loss = 0.156545
!!! Best dev f1_c (lr=4e-05, epoch=0): p_c: 60.77, r_c: 42.00, f1_c: 49.67, p_i: 73.31, r_i: 50.67, f1_i: 59.92
Epoch: 0, Step: 324 / 537, used_time = 961.29s, loss = 0.154742
!!! Best dev f1_c (lr=4e-05, epoch=0): p_c: 75.11, r_c: 39.56, f1_c: 51.82, p_i: 83.12, r_i: 43.78, f1_i: 57.35
Epoch: 0, Step: 330 / 537, used_time = 979.67s, loss = 0.152926
!!! Best dev f1_c (lr=4e-05, epoch=0): p_c: 64.39, r_c: 50.22, f1_c: 56.43, p_i: 75.21, r_i: 58.67, f1_i: 65.92
Epoch: 0, Step: 378 / 537, used_time = 1118.36s, loss = 0.139119
!!! Best dev f1_c (lr=4e-05, epoch=0): p_c: 70.13, r_c: 48.00, f1_c: 56.99, p_i: 78.90, r_i: 54.00, f1_i: 64.12
Epoch: 0, Step: 384 / 537, used_time = 1136.84s, loss = 0.137941
!!! Best dev f1_c (lr=4e-05, epoch=0): p_c: 67.32, r_c: 53.11, f1_c: 59.38, p_i: 76.62, r_i: 60.44, f1_i: 67.58
Epoch: 0, Step: 390 / 537, used_time = 1155.24s, loss = 0.136606
!!! Best dev f1_c (lr=4e-05, epoch=0): p_c: 70.91, r_c: 52.00, f1_c: 60.00, p_i: 78.79, r_i: 57.78, f1_i: 66.67
Epoch: 0, Step: 408 / 537, used_time = 1207.63s, loss = 0.132690
!!! Best dev f1_c (lr=4e-05, epoch=0): p_c: 69.28, r_c: 53.11, f1_c: 60.13, p_i: 77.10, r_i: 59.11, f1_i: 66.92
Epoch: 0, Step: 420 / 537, used_time = 1242.76s, loss = 0.130141
!!! Best dev f1_c (lr=4e-05, epoch=0): p_c: 65.41, r_c: 58.00, f1_c: 61.48, p_i: 77.94, r_i: 69.11, f1_i: 73.26
Epoch: 0, Step: 426 / 537, used_time = 1260.81s, loss = 0.128809
!!! Best dev f1_c (lr=4e-05, epoch=0): p_c: 73.91, r_c: 56.67, f1_c: 64.15, p_i: 80.58, r_i: 61.78, f1_i: 69.94
Epoch: 0, Step: 438 / 537, used_time = 1295.92s, loss = 0.126653
!!! Best dev f1_c (lr=4e-05, epoch=0): p_c: 67.47, r_c: 62.22, f1_c: 64.74, p_i: 74.22, r_i: 68.44, f1_i: 71.21
Epoch: 0, Step: 474 / 537, used_time = 1400.03s, loss = 0.120188
!!! Best dev f1_c (lr=4e-05, epoch=0): p_c: 67.21, r_c: 63.78, f1_c: 65.45, p_i: 74.24, r_i: 70.44, f1_i: 72.29
Epoch: 0, Step: 480 / 537, used_time = 1418.64s, loss = 0.119126
!!! Best dev f1_c (lr=4e-05, epoch=0): p_c: 70.83, r_c: 64.22, f1_c: 67.37, p_i: 77.45, r_i: 70.22, f1_i: 73.66
Epoch: 0, Step: 498 / 537, used_time = 1470.99s, loss = 0.116169
!!! Best dev f1_c (lr=4e-05, epoch=0): p_c: 72.03, r_c: 64.67, f1_c: 68.15, p_i: 77.23, r_i: 69.33, f1_i: 73.07
Epoch: 0, Step: 516 / 537, used_time = 1523.16s, loss = 0.113306
!!! Best dev f1_c (lr=4e-05, epoch=0): p_c: 71.81, r_c: 66.22, f1_c: 68.90, p_i: 77.83, r_i: 71.78, f1_i: 74.68
Epoch: 0, Step: 522 / 537, used_time = 1541.61s, loss = 0.112382
!!! Best dev f1_c (lr=4e-05, epoch=0): p_c: 72.73, r_c: 65.78, f1_c: 69.08, p_i: 78.62, r_i: 71.11, f1_i: 74.68
Epoch: 0, Step: 534 / 537, used_time = 1577.71s, loss = 0.110587
!!! Best dev f1_c (lr=4e-05, epoch=0): p_c: 69.33, r_c: 71.33, f1_c: 70.32, p_i: 74.73, r_i: 76.89, f1_i: 75.79
Start epoch #1 (lr = 4e-05)...
Epoch: 1, Step: 6 / 537, used_time = 1598.31s, loss = 0.108907
!!! Best dev f1_c (lr=4e-05, epoch=1): p_c: 68.89, r_c: 73.33, f1_c: 71.04, p_i: 73.90, r_i: 78.67, f1_i: 76.21
Epoch: 1, Step: 12 / 537, used_time = 1616.53s, loss = 0.107724
!!! Best dev f1_c (lr=4e-05, epoch=1): p_c: 69.33, r_c: 73.33, f1_c: 71.27, p_i: 74.16, r_i: 78.44, f1_i: 76.24
It just stopped at a random step in epoch 1.
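For anyone comparing several failed runs, here is a minimal sketch that pulls the last logged step out of a log like the one above and sanity-checks a reported F1 against its precision and recall. The regexes are written against the two line shapes pasted here, and the helper names (`last_step`, `best_dev_metrics`, `f1`) are made up for illustration and are not part of the eeqa codebase. It also assumes the logged F1 is the usual harmonic mean of P and R.

```python
import re

# Two lines copied verbatim from the end of the log above.
LOG = """\
Epoch: 1, Step: 12 / 537, used_time = 1616.53s, loss = 0.107724
!!! Best dev f1_c (lr=4e-05, epoch=1): p_c: 69.33, r_c: 73.33, f1_c: 71.27, p_i: 74.16, r_i: 78.44, f1_i: 76.24
"""

# Matches the periodic "Epoch: E, Step: S / T, ..." progress lines.
STEP_RE = re.compile(r"Epoch: (\d+), Step: (\d+) / (\d+),")
# Matches "name: value" pairs such as "p_c: 69.33".
METRIC_RE = re.compile(r"(\w+): ([\d.]+)")


def last_step(log_text):
    """Return (epoch, step, total_steps) from the final progress line, or None."""
    matches = STEP_RE.findall(log_text)
    if not matches:
        return None
    epoch, step, total = matches[-1]
    return int(epoch), int(step), int(total)


def best_dev_metrics(line):
    """Parse the metric pairs that follow the '(lr=..., epoch=...):' prefix."""
    _, _, tail = line.partition("):")
    return {name: float(value) for name, value in METRIC_RE.findall(tail)}


def f1(precision, recall):
    """Harmonic mean of precision and recall (assumed definition of f1_c / f1_i)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


if __name__ == "__main__":
    print("last logged step:", last_step(LOG))
    metrics = best_dev_metrics(LOG.splitlines()[1])
    print("recomputed f1_c:", round(f1(metrics["p_c"], metrics["r_c"]), 2))
```

Running this over each failed run's log shows where the process stopped, which makes it easier to tell whether the exits cluster around a particular step or are spread out.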
solved.
Related Issues (16)
- What does gold_file refer to HOT 2
- RuntimeError: cublas runtime error : the GPU program failed to execute at /pytorch/aten/src/THC/THCBlas.cu:331 HOT 2
- reproduce HOT 2
- Metrics stay low on Chinese data HOT 1
- The paras of BertForQuestionAnswering_withIfTriggerEmbedding used in train is right?
- convert_example.py HOT 5
- How to download SciERC and GENIA datasets ? HOT 1
- Do inference using trained trigger_qa model? HOT 1
- Do triggers in the dataset always occupy only a single token?
- why only one event considered from the pre-processinng step.
- Why can't I reproduce the result in the paper? HOT 2
- can't run pre-processinng code. HOT 7
- Multi-token event triggers in ACE HOT 1
- len(start_token) != 1 in parse_ace_event.py HOT 5
- What are the hardware requirements for the training machine? HOT 1