
monologg / distilkobert


Distillation of KoBERT from SKTBrain (Lightweight KoBERT)

License: Apache License 2.0

Python 95.91% Shell 4.09%
distillation kobert bert pytorch transformers lightweight korean-nlp

distilkobert's Introduction

πŸš€ Things I do

  • NLP Engineer, contributing to Korean NLP through open source!


distilkobert's People

Contributors

monologg


distilkobert's Issues

Question about the Naver NER (F1) code

Hello.

I tried implementing Naver sentiment-analysis code with the DistilKoBERT you released, and ran into the error below. The code is essentially the code from the KoBERT repo.

RuntimeError: Expected object of device type cuda but got device type cpu for argument #1 'self' in call to _th_index_select
for e in range(num_epochs):
    train_acc = 0.0
    test_acc = 0.0
    model.train()
    for batch_id, (token_ids, valid_length, segment_ids, label) in enumerate(tqdm_notebook(train_dataloader)):
        optimizer.zero_grad()
        token_ids = token_ids.long().to(device)
        segment_ids = segment_ids.long().to(device)
        valid_length= valid_length
        label = label.long().to(device)
        out = model(token_ids, valid_length, segment_ids)
        loss = loss_fn(out, label)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        scheduler.step()  # Update learning rate schedule
        train_acc += calc_accuracy(out, label)
        if batch_id % log_interval == 0:
            print("epoch {} batch id {} loss {} train acc {}".format(e+1, batch_id+1, loss.data.cpu().numpy(), train_acc / (batch_id+1)))
    print("epoch {} train acc {}".format(e+1, train_acc / (batch_id+1)))
    model.eval()
    for batch_id, (token_ids, valid_length, segment_ids, label) in enumerate(tqdm_notebook(test_dataloader)):
        token_ids = token_ids.long().to(device)
        segment_ids = segment_ids.long().to(device)
        valid_length= valid_length
        label = label.long().to(device)
        out = model(token_ids, valid_length, segment_ids)
        test_acc += calc_accuracy(out, label)
    print("epoch {} test acc {}".format(e+1, test_acc / (batch_id+1)))
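A likely cause of the error above (an assumption, not something confirmed in this thread) is that the inputs are moved to the GPU with .to(device) while the model's parameters are still on the CPU. A minimal sketch of the device-alignment fix, using a hypothetical stand-in model rather than the actual classifier from the issue:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-in for the classifier in the issue.
# nn.Embedding uses index_select internally, which is the op
# named in the RuntimeError above.
model = torch.nn.Sequential(
    torch.nn.Embedding(num_embeddings=10, embedding_dim=4),
    torch.nn.Flatten(),
    torch.nn.Linear(8, 2),
)

# The crucial step: move the model's parameters to the same
# device the input tensors are sent to inside the training loop.
model = model.to(device)

token_ids = torch.tensor([[1, 2], [3, 4]]).long().to(device)
out = model(token_ids)
print(out.shape)
```

If any tensors are created inside loss_fn or calc_accuracy, they need the same treatment.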

By any chance, do you have plans to release the code you used for Naver NER (F1)?

Thank you.

ν† ν¬λ‚˜μ΄μ € μ—λŸ¬

Hello,

Thanks for the great project.

An error occurs in the following situation.

After loading the tokenizer with tokenizer = KoBertTokenizer.from_pretrained('monologg/kobert'),

tokenizing the sentence "ν•œκ΅­μ–΄λ‘œλŠ” μ•ˆλΌ??γ…‹" with tokenizer.tokenize() gives the following output:
[screenshot: tokenizer output]

However, the last token is a different character from the Hangul consonant 'γ…‹' (it is a "small 'γ…‹'").

In fact, tokenizer.convert_tokens_to_ids('γ…‹') returns 0, the unknown-token id, while passing the "small 'γ…‹'" returns a normal id.
[screenshot: convert_tokens_to_ids results]

Because of this, an error occurs in situations like the following:

[screenshot: downstream error]

Data download is not working

I ran the sample and now want to run it on my own data. It looks like the steps should be run in this order:
download.py
binarize.sh
train_single_gpu_3_layer.sh

However, the link in download.py, https://drive.google.com/uc?id=1-jXAnrHcKzzFiFhri37YOXH2OjP7DwMg, is not accessible.
I'm not sure whether it's because the data is missing, but running binarize.sh also fails with Can't load tokenizer for 'kobert'.

Question about training dataset

Hello.

First of all, thank you so much for sharing the model. I'm finding it really useful.

I'm leaving this question because I have one about the training dataset.

Was DistilKoBERT trained on the same training dataset as KoBERT?

*I'm working on a project that uses a pretrained LM's output for NMT, and I'd like to check whether the performance gap between KoBERT and DistilKoBERT on NLU tasks shows up in a similar pattern in NMT. Since the comparison would be fairer if the training datasets are the same, I'm asking this question.

colabμ—μ„œ AttributeError

μ•ˆλ…•ν•˜μ„Έμš”! νŒŒμ΄μ¬μ„ 곡뢀쀑인 ν•™μƒμž…λ‹ˆλ‹€!
KoBert λͺ¨λΈμ„ μ‚¬μš©ν•˜λ €λ˜ 도쀑 ν† ν¬λ‚˜μ΄μ €λ₯Ό λΆˆλŸ¬μ˜€λŠ”λ° μ—λŸ¬κ°€ λ°œμƒν•˜μ—¬ μ§ˆλ¬Έμ„ λ‚¨κΉλ‹ˆλ‹€!

tokenizer = KoBertTokenizer.from_pretrained('monologg/kobert')

Running this code produces:

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'BertTokenizer'.
The class this function is called from is 'KoBertTokenizer'.

AttributeError Traceback (most recent call last)
in <cell line: 1>()
----> 1 tokenizer = KoBertTokenizer.from_pretrained('monologg/kobert')

5 frames
/content/drive/MyDrive/tokenization_kobert.py in get_vocab(self)
123
124 def get_vocab(self):
--> 125 return dict(self.token2idx, **self.added_tokens_encoder)
126
127 def __getstate__(self):

AttributeError: 'KoBertTokenizer' object has no attribute 'token2idx'

λΌλŠ” μ—λŸ¬κ°€ λ°œμƒν•˜λŠ”λ° ν˜Ήμ‹œ 해결방법이 μžˆλ‚˜μš”?!

Parameters of `BertOnlyMLMHead` missing in `monologg/kobert`

Thanks for releasing the KoBERT model! However, I found that the parameters of the BertOnlyMLMHead layer may be missing from the monologg/kobert model, which I think is a common issue I have also seen in released BERT models for other languages, such as Greek and Russian.
To reproduce this issue:

from transformers import AutoModelWithLMHead

m1 = AutoModelWithLMHead.from_pretrained('monologg/kobert')
print(m1.cls.predictions.transform.dense.weight)
m2 = AutoModelWithLMHead.from_pretrained('monologg/kobert')
print(m2.cls.predictions.transform.dense.weight)  # different from m1: the head is re-initialized on each load

Is it possible to upload the pretrained model with the missing parameters (either in huggingface's transformers or providing a link to the original tf checkpoint)?
