Comments (10)

shengyumao commented on July 30, 2024

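Hello, when RuntimeError: CUDA error: device-side assert triggered appears, it usually means that the output dimension of the final classification head does not match the number of labels in the test dataset.
Please check whether the task type set in predict.yaml (i.e. whether the task_name parameter is trigger or role) matches the model you trained. The default task type in predict.yaml is role (the eval for trigger is already run during training), whereas the default task type in train.yaml is trigger. If you need to evaluate role, change task_name to role in train.yaml, train a separate model, and then run the evaluation.
If this does not solve your problem, please share your exact run parameters (for example, the config files).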
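
A minimal sketch of the kind of mismatch described above, assuming the label ids of the evaluation set are available as plain lists of integers. The function name, the ignore_index default of -100, and the sample data are illustrative assumptions and are not taken from DeepKE:

```python
from typing import List, Tuple


def find_out_of_range_labels(
    batches: List[List[int]],
    num_labels: int,
    ignore_index: int = -100,  # assumed padding label id; adjust to your setup
) -> List[Tuple[int, int, int]]:
    """Return (row, column, label_id) for every id outside [0, num_labels)."""
    bad = []
    for row, label_ids in enumerate(batches):
        for col, label_id in enumerate(label_ids):
            if label_id == ignore_index:
                continue  # padding positions are skipped
            if not 0 <= label_id < num_labels:
                bad.append((row, col, label_id))
    return bad


if __name__ == "__main__":
    # Hypothetical data: a head with 5 outputs, but one label id is 7.
    print(find_out_of_range_labels([[0, 1, 4], [2, 7, -100]], num_labels=5))
```

Running a check like this on the CPU side before evaluation tends to give a readable message, whereas on the GPU the same mismatch usually surfaces only as a device-side assert.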

from deepke.

zxlzr commented on July 30, 2024

Do you have any other questions?

zhuweigang commented on July 30, 2024

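Thanks for the kind reply. Following the suggestion, I changed task_name in train.yaml from trigger to role, and now I get a different error. This time it is not a CUDA error: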

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin /root/anaconda3/envs/deepke/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
/root/anaconda3/envs/deepke/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /root/anaconda3/envs/deepke did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.8/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /root/anaconda3/envs/deepke/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
/root/anaconda3/envs/deepke/lib/python3.9/site-packages/hydra/plugins/config_source.py:190: UserWarning:
Missing @Package directive train.yaml in file:///root/DeepKE/example/ee/standard/conf.
See https://hydra.cc/docs/next/upgrades/0.11_to_1.0/adding_a_package_directive
warnings.warn(message=msg, category=UserWarning)
Traceback (most recent call last):
File "/root/DeepKE/example/ee/standard/predict.py", line 49, in main
args.dev_trigger_pred_file = os.path.join(args.cwd, args.dev_trigger_pred_file) if args.do_pipeline_predict and args.task_name=="role" else None
File "/root/anaconda3/envs/deepke/lib/python3.9/posixpath.py", line 90, in join
genericpath._check_arg_types('join', a, *p)
File "/root/anaconda3/envs/deepke/lib/python3.9/genericpath.py", line 152, in _check_arg_types
raise TypeError(f'{funcname}() argument must be str, bytes, or '
TypeError: join() argument must be str, bytes, or os.PathLike object, not 'NoneType'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

zhuweigang commented on July 30, 2024

One more thing I would like to confirm: is it OK to download the bert-base-chinese model published under google on Hugging Face and use it for the ee example?

shengyumao commented on July 30, 2024


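Hello, judging from the error message, did you run predict.py directly after changing the parameter? After changing task_name in train.yaml to role, you need to run python run.py again to train an event argument extraction model. As mentioned in the README, the event extraction task requires training models in two stages.
Also, the error message suggests that the dev_trigger_pred_file parameter in predict.yaml was deleted or set to empty. Its default value is ./exp/DuEE/trigger/bert-base-chinese/eval_pred.json, which is the trigger prediction result produced while training the trigger model and is then used for the pipeline event argument extraction. Please check it again.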

Regarding your other question about downloading the bert-base-chinese model from Hugging Face for the ee example: downloading it from Hugging Face is fine.
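
Since the pipeline depends on the outputs of both training stages, a small pre-flight check before running predict.py may help. This is only a sketch: the helper name is hypothetical, and the two paths are the defaults quoted in this thread, so adjust them if your setup differs:

```python
import os
from typing import List


def missing_pipeline_inputs(cwd: str, required_paths: List[str]) -> List[str]:
    """Return the entries of required_paths that do not exist under cwd."""
    missing = []
    for rel_path in required_paths:
        full_path = os.path.normpath(os.path.join(cwd, rel_path))
        if not os.path.exists(full_path):
            missing.append(rel_path)
    return missing


if __name__ == "__main__":
    required = [
        # Trigger predictions written while training the trigger model.
        "./exp/DuEE/trigger/bert-base-chinese/eval_pred.json",
        # Output directory of the separately trained role model.
        "./exp/DuEE/role/bert-base-chinese",
    ]
    for path in missing_pipeline_inputs(os.getcwd(), required):
        print(f"missing: {path}")
```

If either path is reported as missing, the corresponding training stage probably has not been run yet or wrote its output somewhere else.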

zhuweigang commented on July 30, 2024

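Thanks for the kind reply :-)

  1. I changed task_name in train.yaml to role, then ran run.py, and then ran predict.py. That is how I got this error.
  2. I did not touch the dev_trigger_pred_file parameter in predict.yaml. It is still ./exp/DuEE/trigger/bert-base-chinese/eval_pred.json, and that file exists.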

zhuweigang commented on July 30, 2024

Attaching the two config files.


************** train.yaml ******************


data_name: DuEE # [ACE, DuEE]
model_name_or_path: bert-base-chinese # [bert-base-uncased, bert-base-chinese] english for ace, chinese for duee
#task_name: trigger # [trigger, role]
task_name: role
model_type: bertcrf
do_train: True
do_eval: True
do_predict: False # True for ACE, False for DuEE
labels: ""
config_name: ""
tokenizer_name: ""
cache_dir: ""
evaluate_during_training: True
do_lower_case: True
weight_decay: 0.0
learning_rate: 5e-5
adam_epsilon: 1e-8
per_gpu_train_batch_size: 16
per_gpu_eval_batch_size: 16
gradient_accumulation_steps: 1
max_seq_length: 256
max_grad_norm: 1.0
num_train_epochs: 5
max_steps: 500
warmup_steps: 0
logging_steps: 500
save_steps: 500
eval_all_checkpoints: False
no_cuda: False
n_gpu: 0
overwrite_output_dir: True
overwrite_cache: True
seed: 42
fp16: False
fp16_opt_level: "01"
local_rank: -1
data_dir: "" # parsing in run.py
tag_path: "" # parsing in run.py
output_dir: "" # parsing in run.py
dev_trigger_pred_file: null
test_trigger_pred_file: null


*************** predict.yaml ***************


defaults:
  - train

data_name: DuEE # [ACE, DuEE]
model_name_or_path: ./exp/DuEE/role/bert-base-chinese
task_name: role # the trigger prediction is done during the training process.
do_train: False
do_eval: True
do_predict: False # True for ACE, False for DuEE

do_pipeline_predict: True
overwrite_cache: True

dev_trigger_pred_file: ./exp/DuEE/trigger/bert-base-chinese/eval_pred.json # change to your pred file of trigger classification
test_trigger_pred_file: ./exp/DuEE/trigger/bert-base-chinese/test_pred.json

shengyumao commented on July 30, 2024

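Hello, the error here is TypeError: join() argument must be str, bytes, or os.PathLike object, not 'NoneType', which means that one of args.cwd and args.dev_trigger_pred_file is None. You could set a breakpoint or print the relevant variables to check. I reran it in my own environment, and with the default parameters line 49 of predict.py does not raise an error, so please check which variable is wrong in your environment.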
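
One way to see which value is empty is to wrap the join in a small check. This is a sketch only: join_checked is a hypothetical helper and not part of predict.py:

```python
import os
from typing import Optional


def join_checked(name: str, cwd: Optional[str], rel_path: Optional[str]) -> str:
    """Join cwd and rel_path, naming the empty value instead of failing inside os.path.join."""
    if cwd is None:
        raise ValueError(f"cwd is None while resolving {name}")
    if rel_path is None:
        raise ValueError(f"{name} is None; check the composed config")
    return os.path.join(cwd, rel_path)


if __name__ == "__main__":
    # Hypothetical values standing in for args.cwd and args.dev_trigger_pred_file.
    print(join_checked("dev_trigger_pred_file", "/tmp/work", "eval_pred.json"))
    try:
        join_checked("dev_trigger_pred_file", "/tmp/work", None)
    except ValueError as err:
        print(err)
```

Printing args.cwd and args.dev_trigger_pred_file just before line 49 would give the same information with less code.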

zhuweigang commented on July 30, 2024

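Thanks for the guidance. I do not know what was wrong with my yaml files. After fixing the problem of dev_trigger_pred_file and test_trigger_pred_file being empty, I now get the error below. If it is convenient, could you add me on WeChat so we can look at it together? Thank you very much.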

[2024-06-24 17:34:19,321][deepke.event_extraction.standard.bertcrf.processor_ee][INFO] - LOOKING AT /root/DeepKE/example/ee/standard/./data/DuEE/role/dev_with_pred_trigger.tsv train
[2024-06-24 17:34:19,345][run][INFO] - Creating features from dataset file at /root/DeepKE/example/ee/standard/./data/DuEE/role
###############
[2024-06-24 17:34:19,345][deepke.event_extraction.standard.bertcrf.processor_ee][INFO] - Writing example 0 of 2015
###############
[2024-06-24 17:34:23,558][run][INFO] - Saving features into cached file /root/DeepKE/example/ee/standard/./data/DuEE/role/cached_dev_bert-base-chinese_256
[2024-06-24 17:34:24,415][run][INFO] - ***** Running evaluation *****
[2024-06-24 17:34:24,416][run][INFO] - Num examples = 2015
[2024-06-24 17:34:24,416][run][INFO] - Batch size = 16
[2024-06-24 17:34:24,416][run][INFO] - Mode = dev
Evaluating: 0%| | 0/126 [00:00<?, ?it/s]../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [0,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [1,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [2,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [3,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [4,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [5,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [6,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [7,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [8,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [9,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [10,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [11,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [12,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [13,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [14,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [15,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
Evaluating: 0%| | 0/126 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/root/DeepKE/example/ee/standard/predict.py", line 124, in main
result, eval_pred_list = evaluate(args, model, eval_dataset, tokenizer, labels, pad_token_label_id, mode="dev", device=device)
File "/root/DeepKE/example/ee/standard/run.py", line 219, in evaluate
outputs = model(pad_token_label_id=pad_token_label_id, **inputs)
File "/root/anaconda3/envs/deepke/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/anaconda3/envs/deepke/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/root/DeepKE/src/deepke/event_extraction/standard/bertcrf/bert_crf.py", line 89, in forward
loss = self.crf.neg_log_likelihood(crf_logits, crf_mask, crf_labels)
File "/root/DeepKE/src/deepke/event_extraction/standard/bertcrf/crf.py", line 273, in neg_log_likelihood
gold_score = self._score_sentence(scores, mask, tags)
File "/root/DeepKE/src/deepke/event_extraction/standard/bertcrf/crf.py", line 258, in _score_sentence
tg_energy = tg_energy.masked_select(mask.transpose(1, 0))
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

zhuweigang commented on July 30, 2024

The ee problem is solved. The cause was that the installed hydra-core version was not 1.3.1. Thank you!
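
For anyone hitting the same problem, a quick way to compare the installed hydra-core version with 1.3.1 is sketched below. The helper names are illustrative, and the only library call used is the standard importlib.metadata.version:

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def version_mismatch(installed: Optional[str], expected: str) -> bool:
    """True when the installed version is missing or differs from the expected one."""
    return installed != expected


if __name__ == "__main__":
    found = installed_version("hydra-core")
    if version_mismatch(found, "1.3.1"):
        print(f"hydra-core is {found}, expected 1.3.1")
```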
