naver / claf

CLaF: Open-Source Clova Language Framework

Home Page: https://naver.github.io/claf/

License: MIT License

Python 99.84% HTML 0.01% Dockerfile 0.05% Shell 0.10%
nlp pytorch framework language clova natural-language-processing

claf's Introduction

CLaF: Clova Language Framework

Task | Language | Dataset | Model
Multi-Task Learning | English | GLUE Benchmark, SQuAD v1.1 | MT-DNN (BERT)
Natural Language Understanding | English | GLUE Benchmark | BERT, RoBERTa
Named Entity Recognition | English | CoNLL 2003 | BERT
Question Answering | Korean | KorQuAD v1.0 | BiDAF, DocQA, BERT
Question Answering | English | SQuAD v1.1 and v2.0 | v1.1: BiDAF, DrQA, DocQA, DocQA+ELMo, QANet, BERT, RoBERTa; v2.0: BiDAF + No Answer, DocQA + No Answer
Semantic Parsing | English | WikiSQL | SQLNet

Name | Language | Pipeline | Note
KoWiki | Korean | Wiki Dumps -> Document Retrieval -> Reading Comprehension | -
NLU | All | Query -> Intent Classification & Token Classification (Slot) -> Template Matching | -

Overview

CLaF is a language framework built on PyTorch that provides the following two high-level features:

  • Experiment controls the training flow for general NLP tasks and offers various TokenMaker methods.
    • CLaF is inspired by the design principles of AllenNLP, such as higher-level concepts and reusable code, but is mostly based on PyTorch's common modules, so users can easily modify the code to suit their needs.
  • Machine combines various modules to build an NLP Machine in one place.
    • Modules can be knowledge-based components or trained experiments that run inference on a single example.

Features

  • Multilingual modeling support (currently English and Korean).
  • Lightweight systemization and modularization.
  • Easy extension and implementation of models.
  • A wide variety of Experiments with reproducible and comprehensive logging.
  • Service-oriented metrics such as "1-example inference latency" are provided.
  • Easy to build an NLP Machine by combining modules.

Installation

Requirements

  • Python 3.6
  • PyTorch >= 1.3.1
  • MeCab for Korean Tokenizer
    • sh script/install_mecab.sh

Using a virtual environment is recommended.
Conda is the easiest way to set one up.

conda create -n claf python=3.6
conda activate claf

(claf) ✗ pip install -r requirements.txt

Install via pip

Commands to install via pip

pip install claf

Experiment

  • Training Flow

Usage

Training

  1. Arguments only

    python train.py --train_file_path {file_path} --valid_file_path {file_path} --model_name {name} ...
    
  2. BaseConfig only (omit the base_config/ directory prefix)

    python train.py --base_config {base_config}
    
  3. BaseConfig + Arguments

    python train.py --base_config {base_config} --learning_rate 0.002
    
    • Load BaseConfig then overwrite learning_rate to 0.002
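
The "BaseConfig + Arguments" mode can be pictured as a two-step merge: load the declarative config first, then let any argument that was explicitly passed win. The following is a minimal sketch of that idea, not CLaF's actual implementation; load_config, the inline JSON text, and the two arguments registered on the parser are assumptions made for illustration.

```python
import argparse
import json


def load_config(base_config_text, cli_args):
    """Merge a JSON base config with explicitly passed CLI overrides."""
    config = json.loads(base_config_text)

    parser = argparse.ArgumentParser()
    # default=None lets us distinguish "not passed" from "passed".
    parser.add_argument("--learning_rate", type=float, default=None)
    parser.add_argument("--model_name", type=str, default=None)
    args = parser.parse_args(cli_args)

    # Only arguments that were actually given overwrite the base config.
    for key, value in vars(args).items():
        if value is not None:
            config[key] = value
    return config


# Hypothetical base config contents, inlined to keep the sketch self-contained.
base = '{"model_name": "bidaf", "learning_rate": 0.001}'
merged = load_config(base, ["--learning_rate", "0.002"])
print(merged)
```

Keeping the parser defaults at None is the design choice that matters here: a real default value would silently overwrite whatever the base config specified.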

BaseConfig

Declarative experiment config (.json, .yaml)

  • Keys map directly to object parameters
  • Sample configs are available in the base_config/ directory
Defined BaseConfig
Base Config:
  --base_config BASE_CONFIG
    Use pre-defined base_config:
    []


    * CoNLL 2003:
    ['conll2003/bert_large_cased']

    * GLUE:
    ['glue/qqp_roberta_base', 'glue/qnli_bert_base', 'glue/rte_bert_base', 'glue/wnli_roberta_base', 'glue/mnlim_roberta_base', 'glue/wnli_bert_base', 'glue/mnlimm_roberta_base', 'glue/cola_bert_base', 'glue/mrpc_bert_base', 'glue/mnlimm_bert_base', 'glue/stsb_bert_base', 'glue/mnlim_bert_base', 'glue/qqp_bert_base', 'glue/rte_roberta_base', 'glue/qnli_roberta_base', 'glue/sst_bert_base', 'glue/mrpc_roberta_base', 'glue/cola_roberta_base', 'glue/stsb_roberta_base', 'glue/sst_roberta_base']

    * KorQuAD:
    ['korquad/bert_base_multilingual_cased', 'korquad/bidaf', 'korquad/bert_base_multilingual_uncased', 'korquad/docqa']

    * SQuAD:
    ['squad/bert_large_uncased', 'squad/bidaf', 'squad/drqa_paper', 'squad/drqa', 'squad/bert_base_uncased', 'squad/qanet', 'squad/docqa+elmo', 'squad/bidaf_no_answer', 'squad/docqa_no_answer', 'squad/qanet_paper', 'squad/bidaf+elmo', 'squad/docqa']

    * WikiSQL:
    ['wikisql/sqlnet']

Evaluate

python eval.py <data_path> <model_checkpoint_path>
  • Example
✗ python eval.py data/squad/dev-v1.1.json logs/squad/bidaf/checkpoint/model_19.pkl
...
[INFO] - {
    "valid/loss": 2.59111491665019,
    "valid/epoch_time": 60.7434446811676,
    "valid/start_acc": 63.17880794701987,
    "valid/end_acc": 67.19016083254493,
    "valid/span_acc": 54.45600756859035,
    "valid/em": 68.10785241248817,
    "valid/f1": 77.77963381714842
}
# write predictions files (<log_dir>/predictions/predictions-valid-19.json)
  • 1-example Inference Latency (Summary)
✗ python eval.py data/squad/dev-v1.1.json logs/squad/bidaf/checkpoint/model_19.pkl
...
# Evaluate Inference Latency Mode.
...
[INFO] - saved inference_latency results. bidaf-cpu.json  # file_format: {model_name}-{env}.json
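
The "1-example inference latency" summary can be understood as timing one prediction call at a time and aggregating the results. The sketch below is not CLaF's implementation; measure_latency and the summary keys are hypothetical names, and the lambda stands in for a real model's predict function.

```python
import statistics
import time


def measure_latency(predict_fn, examples, warmup=1):
    """Time predict_fn on one example at a time and summarize in milliseconds."""
    # Warm-up calls are excluded so one-off setup costs do not skew the summary.
    for example in examples[:warmup]:
        predict_fn(example)

    timings_ms = []
    for example in examples:
        start = time.perf_counter()
        predict_fn(example)
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "count": len(timings_ms),
        "mean_ms": statistics.mean(timings_ms),
        "max_ms": max(timings_ms),
    }


# Stand-in predictor; a real run would call the loaded model instead.
summary = measure_latency(lambda text: text.upper(), ["a", "b", "c"])
print(summary)
```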

Predict

python predict.py <model_checkpoint_path> --<arguments>
  • Example
✗ python predict.py logs/squad/bidaf/checkpoint/model_19.pkl \
    --question "When was the last Super Bowl in California?" \
    --context "On May 21, 2013, NFL owners at their spring meetings in Boston voted and awarded the game to Levi's Stadium. The $1.2 billion stadium opened in 2014. It is the first Super Bowl held in the San Francisco Bay Area since Super Bowl XIX in 1985, and the first in California since Super Bowl XXXVII took place in San Diego in 2003."

>>> Predict: {'text': '2003', 'score': 4.1640071868896484}

Docker Images

  • Docker Hub
  • Run with Docker Image
    • Pull docker image: docker pull claf/claf:latest
    • Run: docker run --rm -i -t claf/claf:latest /bin/bash

Machine

  • Machine Architecture

Usage

  • Define a config file (.json), similar to BaseConfig, in the machine_config/ directory
  • Run the CLaF Machine (omit the machine_config/ directory prefix)
✗ python machine.py --machine_config {machine_config}
  • The list of pre-defined Machine:
Machine Config:
  --machine_config MACHINE_CONFIG
    Use pre-defined machine_config (.json (.json))

    ['ko_wiki', 'nlu']

Open QA (DrQA Style)

DrQA is a system for reading comprehension applied to open-domain question answering. The system has to combine the challenges of document retrieval (finding the relevant documents) with that of machine comprehension of text (identifying the answers from those documents).
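
The retrieve-then-read flow described above can be sketched in a few lines. This is a deliberately simplified illustration and not the CLaF or DrQA implementation: it assumes whitespace tokenization and a plain TF-IDF score, and tfidf_scores, open_qa, and toy_reader are hypothetical names. The toy reader stands in for a trained reading comprehension model.

```python
import math
from collections import Counter


def tfidf_scores(query, documents):
    """Score each document against the query with a simple TF-IDF sum."""
    tokenized = {title: text.lower().split() for title, text in documents.items()}
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    scores = {}
    for title, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if counts[term]:
                idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
                score += (counts[term] / len(tokens)) * idf
        scores[title] = score
    return scores


def open_qa(query, documents, reader, top_k=2):
    """Step 1: retrieve the top_k documents. Step 2: read each one for an answer."""
    scores = tfidf_scores(query, documents)
    top_titles = sorted(scores, key=scores.get, reverse=True)[:top_k]
    answers = [reader(query, documents[title]) for title in top_titles]
    return sorted(answers, key=lambda answer: answer["score"], reverse=True)


def toy_reader(question, context):
    """Toy stand-in for a reader: last context token, scored by word overlap."""
    overlap = len(set(question.lower().split()) & set(context.lower().split()))
    return {"text": context.split()[-1], "score": float(overlap)}


docs = {
    "super bowl": "the super bowl was held in california in 2003",
    "boston": "boston is a city in massachusetts",
}
answers = open_qa("when was the super bowl", docs, toy_reader)
print(answers[0])
```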

  • ko_wiki: Korean Wiki Version
✗ python machine.py --machine_config ko_wiki
...
Completed!
Question > 동학의 2대 교주 이름은?
--------------------------------------------------
Doc Scores:
 - 교주 : 0.5347289443016052
 - 이교주 : 0.4967213571071625
 - 교주도 : 0.49036136269569397
 - 동학 : 0.4800325632095337
 - 동학중학교 : 0.4352934956550598
--------------------------------------------------
Answer: [
    {
        "text": "최시형",
        "score": 11.073444366455078
    },
    {
        "text": "충주목",
        "score": 9.443866729736328
    },
    {
        "text": "반월동",
        "score": 9.37778091430664
    },
    {
        "text": "환조 이자춘",
        "score": 4.64817476272583
    },
    {
        "text": "합포군",
        "score": 3.3186707496643066
    }
]

NLU (Dialog)

The NLU machine does not return a full response because response generation may require various task-specific post-processing techniques or additional logic (e.g., API calls, template-decision rules, template-filling rules, or a neural response generation model). Therefore, for flexible usage, the NLU machine returns only the NLU result.

✗ python machine.py --machine_config nlu
...
Utterance > "looking for a flight from Boston to Seoul or Incheon"

NLU Result: {
    "intent": "flight",
    "slots": {
        "city.depart": ["Boston"],
        "city.dest": ["Seoul", "Incheon"]
    }
}
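
Because the NLU machine returns only the structured result, the caller decides how to turn it into a response. As one hedged illustration of the template-filling option mentioned above, the sketch below maps an intent to a template and substitutes slot values; render_response, the TEMPLATES table, and the placeholder syntax are assumptions, not part of CLaF.

```python
# Hypothetical template table: one response template per intent.
TEMPLATES = {
    "flight": "Searching flights from {city.depart} to {city.dest}.",
}


def render_response(nlu_result, templates):
    """Fill the template chosen by intent with the extracted slot values."""
    template = templates.get(nlu_result["intent"])
    if template is None:
        return None  # No rule for this intent; the caller must handle it.

    response = template
    for slot_name, values in nlu_result["slots"].items():
        # Plain string replacement is used because slot names contain dots,
        # which str.format would interpret as attribute access.
        response = response.replace("{" + slot_name + "}", " or ".join(values))
    return response


nlu_result = {
    "intent": "flight",
    "slots": {"city.depart": ["Boston"], "city.dest": ["Seoul", "Incheon"]},
}
print(render_response(nlu_result, TEMPLATES))
```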

Contributing

Thanks for your interest in contributing! There are many ways to contribute to this project.
Get started here.

Maintainers

CLaF is currently maintained by

Citing

If you use CLaF for your work, please cite:

@misc{claf,
  author = {Lee, Dongjun and Yang, Sohee and Kim, Minjeong},
  title = {CLaF: Open-Source Clova Language Framework},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/naver/claf}}
}

We will update this bibtex with our paper.

Acknowledgements

The docs/ directory contains documentation generated with Sphinx.

License

MIT license

Copyright (c) 2019-present NAVER Corp.

Permission is hereby granted, free of charge, to any person obtaining a copy 
of this software and associated documentation files (the "Software"), to deal 
in the Software without restriction, including without limitation the rights 
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell 
copies of the Software, and to permit persons to whom the Software is 
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all 
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 
SOFTWARE.

claf's People

Contributors

dongjunlee, marload, onting, sejik, tdh8316

claf's Issues

Segmentation fault in prediction mode with a Korean question and context

Hi,

I'm testing the CLaF BERT model with the KorQuAD dataset.
After training, I got the checkpoint file (model_22000.pkl).

I then tried to predict using predict.py, but a segmentation fault occurred, as shown in the log below.
(With English input, it runs normally.)

Could you advise how to fix this error?

# English case (Okay)
/home/pycharm/.conda/envs/claf/bin/python -u /home/pycharm/workspace/claf_test/predict.py logs/squad_bert/checkpoint/model_22000.pkl --cuda_devices 0 --question test --context test
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
Predict: {'text': 'test', 'score': -0.8118173778057098}

Process finished with exit code 0

# Korean case (Segmentation Fault)
/home/pycharm/.conda/envs/claf/bin/python -u /home/pycharm/workspace/claf_test/predict.py logs/squad_bert/checkpoint/model_22000.pkl --cuda_devices 0 --question 한글 --context 한글
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
bash: line 1: 174065 Segmentation fault (core dumped) env "PYTHONUNBUFFERED"="1" "PYTHONPATH"="/home/pycharm/workspace/2944/claf_test:/home/pycharm/.pycharm_helpers/pycharm_matplotlib_backend" "PYCHARM_HOSTED"="1" "JETBRAINS_REMOTE_RUN"="1" "PYCHARM_MATPLOTLIB_PORT"="64594" "PYTHONIOENCODING"="UTF-8" '/home/pycharm/.conda/envs/claf/bin/python' '-u' '/home/pycharm/workspace/2944/claf_test/predict.py' 'logs/squad_bert/checkpoint/model_22000.pkl' '--cuda_devices' '0' '--question' '한글' '--context' '한글'

Error occurred when I ran docqa

Hi all,
After downloading the Docker image and setting everything up, I got an error when I ran docqa.
See below.
...
Completed!
Ready ..!

Question > 동학의 2대 교주 이름은?
Traceback (most recent call last):
File "machine.py", line 22, in
answer = claf_machine.get_answer(question)
AttributeError: 'OpenQA' object has no attribute 'get_answer'

Could you give me a clue to get through?

Suggestion: update 'trainer.py'

I have a suggestion about how learn/trainer.py uses a checkpoint that is loaded through the _load_exist_checkpoints function in learn/experiment.py.
trainer.py should make use of the loaded state during training;
otherwise, training that has already been completed appears to be repeated.

Specifically,
in trainer.py,

the initialization self.metric_logs = {"best_epoch": 0, "best_global_step": 0, "best": None, "best_score": 0}
should be populated from the loaded checkpoint,

and in functions that run training, such as train,
'for epoch in range(1, self.num_epochs+1)' should be changed to something like
'for epoch in range(self.train_counter.epoch+1, self.epochs +1)'.

This may not be entirely accurate, but I hope it helps.

Also, a typo:
in the Trainer class comment, "maximun" should be "maximum".
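
The resume behavior suggested in this issue can be sketched independently of CLaF's Trainer. The helper names and the "metric_logs" checkpoint key below are hypothetical; the point is only that both the epoch range and the best-score bookkeeping start from the loaded state instead of from scratch.

```python
def remaining_epochs(num_epochs, last_finished_epoch=0):
    """Epochs still to run, skipping those already completed before the checkpoint."""
    return range(last_finished_epoch + 1, num_epochs + 1)


def restore_metric_logs(checkpoint=None):
    """Start from the default metric logs, then overlay values saved in the checkpoint."""
    metric_logs = {"best_epoch": 0, "best_global_step": 0, "best": None, "best_score": 0}
    if checkpoint:
        # "metric_logs" is an assumed key name, used here for illustration only.
        metric_logs.update(checkpoint.get("metric_logs", {}))
    return metric_logs


# A run that stopped after epoch 3 of 5 resumes at epoch 4.
print(list(remaining_epochs(5, last_finished_epoch=3)))
print(restore_metric_logs({"metric_logs": {"best_epoch": 3, "best_score": 77.7}}))
```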

predict issue

After training with the RoBERTa large model,
running predict.py raises an error.
ValueError: '[SEP]' is not in list
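
One possible explanation, offered as an assumption rather than a confirmed diagnosis: this message is what list.index raises when the searched item is missing, and RoBERTa tokenizers mark sequence boundaries with </s> instead of BERT's [SEP]. A lookup that accepts either separator could look like the sketch below; find_separator_index is a hypothetical helper, not a CLaF function.

```python
def find_separator_index(tokens, candidates=("[SEP]", "</s>")):
    """Return the index of the first separator token, whichever convention is used."""
    for index, token in enumerate(tokens):
        if token in candidates:
            return index
    raise ValueError(f"none of {candidates} found in tokens")


bert_tokens = ["[CLS]", "who", "[SEP]", "context", "[SEP]"]
roberta_tokens = ["<s>", "who", "</s>", "</s>", "context", "</s>"]
print(find_separator_index(bert_tokens), find_separator_index(roberta_tokens))
```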

flatten_parameters()

Running the korquad bidaf model produces the following warning: /pytorch/aten/src/ATen/native/cudnn/RNN.cpp:1236: UserWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().

Is there a way to resolve this warning?

train.py cannot use GPU mode

GPU mode does not work. The problem is that cuda_devices does not exist in self.config. It works when run on NSML, but there seems to be a mistake in the code path for running locally. I will keep looking for a fix as well.

How to get multiple answers in predict mode?

Hi,

As far as I know, predict mode returns only the single best answer (BERT for QA model).
If I want to get the 10 best answers (ordered by score) in predict mode, how can I get them?

Is there a parameter like "n_best_size" (as in Google's BERT implementation)?
I would appreciate any advice on how to modify the source code.

Thanks.
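
For reference, n-best extraction for span-based QA is usually done by scoring every valid (start, end) pair and keeping the top few. The sketch below shows that generic technique on plain lists of scores; it is not CLaF code, and n_best_spans, max_answer_length, and the sample scores are assumptions for illustration.

```python
def n_best_spans(start_scores, end_scores, n_best_size=10, max_answer_length=30):
    """Rank (start, end) spans by start score plus end score and keep the best ones."""
    candidates = []
    for start, start_score in enumerate(start_scores):
        # Only spans with end >= start and a bounded length are valid answers.
        last_end = min(len(end_scores), start + max_answer_length)
        for end in range(start, last_end):
            candidates.append(
                {"start": start, "end": end, "score": start_score + end_scores[end]}
            )
    candidates.sort(key=lambda candidate: candidate["score"], reverse=True)
    return candidates[:n_best_size]


# Made-up per-token scores for a three-token context.
top = n_best_spans([0.1, 2.0, 0.3], [0.2, 0.5, 1.5], n_best_size=2)
print(top)
```

Mapping each span back to text then only requires slicing the context tokens with the returned start and end indices.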

Questions about hyperparameters

def _make_features_and_labels(
    self, context_sub_tokens, question_sub_tokens, answer_char_start, answer_char_end
):
    # sub_token, context_stride logic with context_max_length
    context_max_length = (
        self.max_seq_length - len(question_sub_tokens) - 3
    )  # [CLS], [SEP], [SEP]
    start_offset = 0
    context_stride_spans = []
    while start_offset < len(context_sub_tokens):
        strided_context_length = len(context_sub_tokens) - start_offset
        if strided_context_length > context_max_length:
            strided_context_length = context_max_length
        context_stride_spans.append((start_offset, strided_context_length))
        if start_offset + strided_context_length == len(context_sub_tokens):
            break
        start_offset += min(strided_context_length, self.context_stride)
    features, labels = [], []
    for (start_offset, length) in context_stride_spans:
        bert_tokens = [Token(self.cls_token)]
        bert_tokens += question_sub_tokens[: self.max_question_length]
        bert_tokens += [Token(self.sep_token)]
        bert_tokens += context_sub_tokens[start_offset : start_offset + length]
        bert_tokens += [Token(self.sep_token)]
        features.append(bert_tokens)
        if answer_char_start == -1 and answer_char_end == -1:
            answer_start, answer_end = 0, 0
        else:
            answer_start, answer_end = self._get_closest_answer_spans(
                bert_tokens, answer_char_start, answer_char_end
            )
        labels.append((answer_start, answer_end))
    return features, labels

max_seq_length

As the code below shows, context_max_length affects how strided_context_length is set,
which determines the range used to split the context.

        context_stride_spans = []
        while start_offset < len(context_sub_tokens):
            strided_context_length = len(context_sub_tokens) - start_offset
            if strided_context_length > context_max_length:
                strided_context_length = context_max_length

            context_stride_spans.append((start_offset, strided_context_length))
            if start_offset + strided_context_length == len(context_sub_tokens):
                break
            start_offset += min(strided_context_length, self.context_stride)

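However, context_max_length is computed by subtracting the length of question_sub_tokens and 3 from max_seq_length,
and I found that when the length of question_sub_tokens exceeds max_seq_length, the loop never terminates. I would appreciate advice on how to handle this.

Also, context_max_length changes dynamically with the length of question_sub_tokens. I would like to know why
the context is split into smaller pieces when the question is long and into larger pieces when the question is short.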
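
The quoted code builds each input as [CLS] + question + [SEP] + context + [SEP] within a fixed budget of max_seq_length tokens, so a longer question leaves less room for context. When the question alone uses up the budget, context_max_length is zero or negative, start_offset never advances, and the loop cannot end. A minimal guard, sketched here as a standalone function and not as a patch to CLaF, is to truncate the question before computing the budget and to fail fast when no room remains. make_context_spans and its parameter names are hypothetical.

```python
def make_context_spans(num_context_tokens, num_question_tokens,
                       max_seq_length, context_stride, max_question_length):
    """Split a context into overlapping (start_offset, length) spans that fit the budget."""
    # Truncate the question first, mirroring the slice applied when building bert_tokens.
    question_length = min(num_question_tokens, max_question_length)
    context_max_length = max_seq_length - question_length - 3  # [CLS], [SEP], [SEP]
    if context_max_length <= 0 or context_stride <= 0:
        # Without this check the offset below would never advance.
        raise ValueError("no room left for context; check the length settings")

    spans = []
    start_offset = 0
    while start_offset < num_context_tokens:
        length = min(num_context_tokens - start_offset, context_max_length)
        spans.append((start_offset, length))
        if start_offset + length == num_context_tokens:
            break
        start_offset += min(length, context_stride)
    return spans


print(make_context_spans(10, 3, 10, 2, 64))
```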

max_question_length

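When bert_tokens is built, a question longer than max_question_length is truncated.
Should I fix the question length during preprocessing, or is it enough to increase max_question_length?
I would like to know which approach is appropriate.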

        features, labels = [], []
        for (start_offset, length) in context_stride_spans:
            bert_tokens = [Token(self.CLS_TOKEN)]
            bert_tokens += question_sub_tokens[: self.max_question_length]
            bert_tokens += [Token(self.SEP_TOKEN)]
            bert_tokens += context_sub_tokens[start_offset : start_offset + length]
            bert_tokens += [Token(self.SEP_TOKEN)]
            features.append(bert_tokens)

Training fails with an error

python train.py --basic_config korquad/drqa
python train.py --basic_config korquad/bidaf
I am trying to train with the commands above, but a tokenizer error occurs and the process exits.
I have installed MeCab and the requirements. Please help.

(JSON) base_config cannot be loaded.

Hello,

Thank you for releasing this valuable code.
I am studying CLaF step by step, starting with reproducing the KorQuAD 1.0 results.

When I run the following,

python train.py --base_config korquad/docqa

an error occurs at this location:
https://github.com/naver/claf/blob/master/claf/config/args.py#L67

    if use_base_config:
        base_config_path = os.path.join("base_config", config.base_config)
        base_config_path = utils.add_config_extension(base_config_path)
        defined_config = utils.read_config()
        # config.overwrite(defined_config)

Looking at the definition of utils.read_config(), it requires an input argument, but the code above passes none.

def read_config(file_path):

Checking the history of args.py, I found that this part of the code changed while the config handling was being generalized to also support YAML.
79ebc8e#diff-c801a539cf2c3ccab225f69b763cfce7573c3b90cb4349f4d0e6b1b98ad436fe

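Would it be correct to pass the base_config_path value prepared on the line just above?
After the commented-out config.overwrite(defined_config),
it looks like the NestedNamespace is read again,
so the original intent is a bit hard to figure out.

I know you are busy, but if you could take a look, it would be a great help to more people using CLaF.

Thank you.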
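
The calling pattern proposed in this issue, passing the resolved path into read_config, can be illustrated with stand-in helpers. The two functions below are simplified assumptions written for this sketch and are not CLaF's utils implementations; the stand-in reader parses JSON only, and the temporary directory replaces the real base_config/ directory.

```python
import json
import os
import tempfile


def add_config_extension(path, extensions=(".json", ".yaml")):
    """Return the first existing path among path + extension (stand-in helper)."""
    for extension in extensions:
        if os.path.exists(path + extension):
            return path + extension
    raise FileNotFoundError(f"no config file found for {path}")


def read_config(file_path):
    """Read a config file; this stand-in only parses .json."""
    if not file_path.endswith(".json"):
        raise ValueError("this sketch only parses .json files")
    with open(file_path, encoding="utf-8") as config_file:
        return json.load(config_file)


with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "korquad"))
    with open(os.path.join(root, "korquad", "docqa.json"), "w", encoding="utf-8") as f:
        json.dump({"model": {"name": "docqa"}}, f)

    base_config_path = add_config_extension(os.path.join(root, "korquad", "docqa"))
    # The resolved path is passed explicitly, which the reported code omits.
    defined_config = read_config(base_config_path)

print(defined_config)
```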

TypeError: FeatLabelPadCollator.collate: `pad_value` is not present.

Hi,

I have a question. When I tried to reproduce the results of claf, I got the error. Here is the script:

python train.py --base_config korquad/docqa
Traceback (most recent call last):
  File "train.py", line 3, in <module>
    from claf.config import args
  File "/workspace/sean/trial/claf/claf/__init__.py", line 4, in <module>
    from claf.data.reader import *
  File "/workspace/sean/trial/claf/claf/data/reader/__init__.py", line 2, in <module>
    from claf.data.reader.seq_cls import SeqClsReader
  File "/workspace/sean/trial/claf/claf/data/reader/seq_cls.py", line 9, in <module>
    from claf.data.dataset.seq_cls import SeqClsDataset
  File "/workspace/sean/trial/claf/claf/data/dataset/__init__.py", line 2, in <module>
    from claf.data.dataset.squad import SQuADDataset
  File "/workspace/sean/trial/claf/claf/data/dataset/squad.py", line 6, in <module>
    from claf.data.collate import PadCollator
  File "/workspace/sean/trial/claf/claf/data/collate.py", line 63, in <module>
    class FeatLabelPadCollator(PadCollator):
  File "/workspace/sean/trial/claf/claf/data/collate.py", line 85, in FeatLabelPadCollator
    def collate(self, datas, apply_pad=True, apply_pad_labels=(), apply_pad_values=()):
  File "/opt/conda/envs/claf/lib/python3.6/site-packages/overrides/overrides.py", line 88, in overrides
    return _overrides(method, check_signature, check_at_runtime)
  File "/opt/conda/envs/claf/lib/python3.6/site-packages/overrides/overrides.py", line 114, in _overrides
    _validate_method(method, super_class, check_signature)
  File "/opt/conda/envs/claf/lib/python3.6/site-packages/overrides/overrides.py", line 135, in _validate_method
    ensure_signature_is_compatible(super_method, method, is_static)
  File "/opt/conda/envs/claf/lib/python3.6/site-packages/overrides/signature.py", line 95, in ensure_signature_is_compatible
    super_sig, sub_sig, super_type_hints, sub_type_hints, is_static, method_name
  File "/opt/conda/envs/claf/lib/python3.6/site-packages/overrides/signature.py", line 136, in ensure_all_kwargs_defined_in_sub
    raise TypeError(f"{method_name}: `{name}` is not present.")
TypeError: FeatLabelPadCollator.collate: `pad_value` is not present.

It seems to me that the function apply_pad_values at line:
pad_value = apply_pad_values[apply_pad_labels.index(data_name)] did not work.

Maybe I'm wrong here. Could you give me some suggestions to fix it?

Thanks,
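
Reading the traceback, the error is raised at import time by the overrides package while it compares the subclass method signature with the parent's, before any padding code runs. The sketch below reproduces that kind of check with the standard inspect module. The class bodies are hypothetical: only the subclass signature and the missing parameter name pad_value come from the log above, and the parent signature is an assumption.

```python
import inspect


class PadCollator:
    # Assumed parent signature; the log only tells us it has a pad_value parameter.
    def collate(self, datas, pad_value=0):
        return datas


class FeatLabelPadCollator(PadCollator):
    # Signature copied from the traceback.
    def collate(self, datas, apply_pad=True, apply_pad_labels=(), apply_pad_values=()):
        return datas


def missing_parameters(super_method, sub_method):
    """List parent parameters that the overriding method does not accept."""
    super_params = inspect.signature(super_method).parameters
    sub_params = inspect.signature(sub_method).parameters
    # A **kwargs parameter in the subclass would accept any parent keyword.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in sub_params.values()):
        return []
    return [name for name in super_params if name not in sub_params]


missing = missing_parameters(PadCollator.collate, FeatLabelPadCollator.collate)
print(missing)
```

If that reading is right, the options are to make the two signatures compatible or to install a release of overrides that does not perform this check; which release that is should be verified against the package's changelog.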

error!

I downloaded the Docker image, trained, and ran it,
but only got an error.
Could you help figure out what went wrong?

Read Wiki Articles: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 21427/21427 [02:16<00:00, 157.48it/s]

Completed!

Traceback (most recent call last):
  File "machine.py", line 18, in
    claf_machine = registry.get(f"machine:{machine_name}")(config)
  File "/workspace/claf/claf/machine/open_qa.py", line 34, in init
    self.load()
  File "/workspace/claf/claf/machine/open_qa.py", line 55, in load
    self.rc_experiment = self.make_module(reasoning_config.reading_comprehension)
  File "/workspace/claf/claf/machine/base.py", line 64, in make_module
    experiment = Experiment(Mode.PREDICT, experiment_config)
  File "/workspace/claf/claf/learn/experiment.py", line 48, in init
    self.load_setting()
  File "/workspace/claf/claf/learn/experiment.py", line 89, in load_setting
    cuda_devices, checkpoint_path, prev_cuda_device_id=prev_cuda_device_id
  File "/workspace/claf/claf/learn/experiment.py", line 101, in _read_checkpoint
    f"cuda:{prev_cuda_device_id}": f"cuda:{cuda_devices[0]}"
TypeError: 'NoneType' object is not subscriptable
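
The last frame indexes cuda_devices[0] while cuda_devices is None, which suggests the checkpoint is being loaded without any GPU configured. A defensive way to build the device mapping is sketched below. build_map_location is a hypothetical helper, not CLaF code; its return values follow the forms that torch.load accepts for its map_location argument (a device string or a dict of device remappings).

```python
def build_map_location(cuda_devices, prev_cuda_device_id=0):
    """Choose where checkpoint tensors should be loaded."""
    if not cuda_devices:
        # No GPU configured (None or an empty list): load everything onto the CPU.
        return "cpu"
    # Remap tensors from the device used at training time to the first requested GPU.
    return {f"cuda:{prev_cuda_device_id}": f"cuda:{cuda_devices[0]}"}


print(build_map_location(None))
print(build_map_location([1], prev_cuda_device_id=0))
```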

Error occurred while reading data with the bert_basic tokenizer

Hi,

As far as I know, when the lang code is "ko", the tokenizer is hardcoded as "mecab_ko" in the squad_bert reader.
So I changed the tokenizer to "bert_basic" and tried to train again on the "KorQuAD_v1.0_train.json" data, but the error below occurred.

I found that the Unicode characters “\u200e” and “\u00” in the KorQuAD train dataset cause the error in the data reading process when the "bert_basic" tokenizer is used. After I removed those characters from the dataset and tried again, it worked fine.

Could you let me know why this error occurred?

[Error result]
train: 22%|███████▎ | 312/1420 [00:34<02:35, 7.14it/s]Traceback (most recent call last):
File "/home/pycharm/workspace/claf_test/train.py", line 15, in
experiment()
File "/home02/pycharm_work/claf_test/claf/learn/experiment.py", line 122, in call
train_loader, valid_loader, optimizer = self.set_train_mode()
File "/home02/pycharm_work/claf_test/claf/learn/experiment.py", line 176, in set_train_mode
datas, helpers = data_reader.read()
File "/home02/pycharm_work/claf_test/claf/data/reader/base.py", line 55, in read
batch, helper = self._read(file_path, data_type=data_type)
File "/home02/pycharm_work/claf_test/claf/data/reader/bert/squad.py", line 97, in _read
context_text, context_tokens
File "/home02/pycharm_work/claf_test/claf/data/reader/bert/squad.py", line 297, in _convert_to_spans
f"\n{raw_text} \n\n{joined_tokenized_text} \nToken: {token}, Index: {temp}, Current Index: {curr_idx}"
ValueError:

이 영화 제작과는 별개로, 1990년대 중반에, 월트 디즈니 컴퍼니가 야마토의 영어 리메이크 버전인 스타 블레이저의 저작권을 사들였었는데 영화로 제작하지는 못했었다. TBS의 영화 제작을 담당하는 TBS 필름스는 2005년경부터 실사판을 계획하고 있었다. 2009년 7월 17일, 원래 '우주전함 야마토' 텔레비전 시리즈의 제작자이자 스태프인 이시구로 노보루‎는 그의 오타콘 패널에서 우주전함 야마토의 실사영화 버전을 개발 중임을 밝혔다. "꿈의 프로젝트"라 불리며 2010년 12월에 개봉 계획이고 22억 엔의 제작비인 이 프로젝트는, 《리터너》와 《올웨이즈 3번가의 석양》 시리즈로 잘 알려진 야마자키 다카시 감독이 맡기로 했다. 야마토는 개봉 전부터 어마어마한 제작비로 일본 영화계에 화두가 되었다.

이 영화 제작과는 별개로 , 1990년대 중반에 , 월트 디즈니 컴퍼니가 야마토의 영어 리메이크 버전인 스타 블레이저의 저작권을 사들였었는데 영화로 제작하지는 못했었다 . TBS의 영화 제작을 담당하는 TBS 필름스는 2005년경부터 실사판을 계획하고 있었다 . 2009년 7월 17일 , 원래 ' 우주전함 야마토 ' 텔레비전 시리즈의 제작자이자 스태프인 이시구로 노보루는 그의 오타콘 패널에서 우주전함 야마토의 실사영화 버전을 개발 중임을 밝혔다 . " 꿈의 프로젝트 " 라 불리며 2010년 12월에 개봉 계획이고 22억 엔의 제작비인 이 프로젝트는 , 《 리터너 》 와 《 올웨이즈 3번가의 석양 》 시리즈로 잘 알려진 야마자키 다카시 감독이 맡기로 했다 . 야마토는 개봉 전부터 어마어마한 제작비로 일본 영화계에 화두가 되었다 .
Token: 노보루는, Index: -1, Current Index: 196
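
The log is consistent with an invisible character sitting inside the raw text: the token "노보루는" cannot be located (Index: -1) if the raw string actually contains U+200E between "노보루" and "는" and the tokenizer dropped it. U+200E belongs to the Unicode "Cf" (format) category, so one hedged preprocessing option is to strip such characters from the raw text before tokenization. strip_format_characters is a hypothetical helper, and answer offsets in the dataset would need to be recomputed after cleaning.

```python
import unicodedata


def strip_format_characters(text):
    """Remove invisible Unicode format characters (category Cf), such as U+200E."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


raw = "노보루\u200e는"
cleaned = strip_format_characters(raw)
print(len(raw), len(cleaned))
```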
