
monologg / distilkobert


Distillation of KoBERT from SKTBrain (Lightweight KoBERT)

License: Apache License 2.0

Python 95.91% Shell 4.09%
distillation kobert bert pytorch transformers lightweight korean-nlp

distilkobert's Introduction

πŸš€ Things I do

  • NLP Engineer, contributing to Korean NLP through open source!


distilkobert's People

Contributors

monologg


distilkobert's Issues

Question about the Naver NER (F1) code

Hello.

I tried implementing Naver sentiment-analysis code with the DistilKoBERT you released, and ran into the error below. The code is essentially the code from the KoBERT repo.

RuntimeError: Expected object of device type cuda but got device type cpu for argument #1 'self' in call to _th_index_select
for e in range(num_epochs):
    train_acc = 0.0
    test_acc = 0.0
    model.train()
    for batch_id, (token_ids, valid_length, segment_ids, label) in enumerate(tqdm_notebook(train_dataloader)):
        optimizer.zero_grad()
        token_ids = token_ids.long().to(device)
        segment_ids = segment_ids.long().to(device)
        valid_length= valid_length
        label = label.long().to(device)
        out = model(token_ids, valid_length, segment_ids)
        loss = loss_fn(out, label)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()
        scheduler.step()  # Update learning rate schedule
        train_acc += calc_accuracy(out, label)
        if batch_id % log_interval == 0:
            print("epoch {} batch id {} loss {} train acc {}".format(e+1, batch_id+1, loss.data.cpu().numpy(), train_acc / (batch_id+1)))
    print("epoch {} train acc {}".format(e+1, train_acc / (batch_id+1)))
    model.eval()
    for batch_id, (token_ids, valid_length, segment_ids, label) in enumerate(tqdm_notebook(test_dataloader)):
        token_ids = token_ids.long().to(device)
        segment_ids = segment_ids.long().to(device)
        valid_length= valid_length
        label = label.long().to(device)
        out = model(token_ids, valid_length, segment_ids)
        test_acc += calc_accuracy(out, label)
    print("epoch {} test acc {}".format(e+1, test_acc / (batch_id+1)))
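A likely cause of the error above (an assumption, not something confirmed in this thread) is that the inputs are moved to the GPU with .to(device) while the model's parameters are still on the CPU. A minimal sketch of the device-alignment fix, using a hypothetical stand-in model rather than the actual classifier from the issue:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-in for the classifier in the issue.
# nn.Embedding uses index_select internally, which is the op
# named in the RuntimeError above.
model = torch.nn.Sequential(
    torch.nn.Embedding(num_embeddings=10, embedding_dim=4),
    torch.nn.Flatten(),
    torch.nn.Linear(8, 2),
)

# The crucial step: move the model's parameters to the same
# device the input tensors are sent to inside the training loop.
model = model.to(device)

token_ids = torch.tensor([[1, 2], [3, 4]]).long().to(device)
out = model(token_ids)
print(out.shape)
```

If any tensors are created inside loss_fn or calc_accuracy, they need the same treatment.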

By any chance, do you have plans to release the code you used for Naver NER (F1)?

Thank you.

ν† ν¬λ‚˜μ΄μ € μ—λŸ¬

Hello,

Thanks for the great project.

An error occurs in the following situation.

After loading the tokenizer with tokenizer = KoBertTokenizer.from_pretrained('monologg/kobert'),

tokenizing the sentence "ν•œκ΅­μ–΄λ‘œλŠ” μ•ˆλΌ??γ…‹" with tokenizer.tokenize() gives the following output:
[screenshot: tokenizer output]

However, the last token is a different character from the Hangul consonant 'γ…‹' (it is a "small 'γ…‹'").

In fact, tokenizer.convert_tokens_to_ids('γ…‹') returns 0, the unknown-token id, while passing the "small 'γ…‹'" returns a normal id.
[screenshot: convert_tokens_to_ids results]

Because of this, an error occurs in situations like the following:

[screenshot: downstream error]

Data download is not working

I ran the sample and now want to run it on my own data. It looks like the steps should be run in this order:
download.py
binarize.sh
train_single_gpu_3_layer.sh

However, the link in download.py, https://drive.google.com/uc?id=1-jXAnrHcKzzFiFhri37YOXH2OjP7DwMg, is not accessible.
I'm not sure whether it's because the data is missing, but running binarize.sh also fails with Can't load tokenizer for 'kobert'.

Question about training dataset

Hello.

First of all, thank you so much for sharing the model. I'm finding it really useful.

I'm leaving this question because I have one about the training dataset.

Was DistilKoBERT trained on the same training dataset as KoBERT?

*I'm working on a project that uses a pretrained LM's output for NMT, and I'd like to check whether the performance gap between KoBERT and DistilKoBERT on NLU tasks shows up in a similar pattern in NMT. Since the comparison would be fairer if the training datasets are the same, I'm asking this question.

colabμ—μ„œ AttributeError

μ•ˆλ…•ν•˜μ„Έμš”! νŒŒμ΄μ¬μ„ 곡뢀쀑인 ν•™μƒμž…λ‹ˆλ‹€!
KoBert λͺ¨λΈμ„ μ‚¬μš©ν•˜λ €λ˜ 도쀑 ν† ν¬λ‚˜μ΄μ €λ₯Ό λΆˆλŸ¬μ˜€λŠ”λ° μ—λŸ¬κ°€ λ°œμƒν•˜μ—¬ μ§ˆλ¬Έμ„ λ‚¨κΉλ‹ˆλ‹€!

tokenizer = KoBertTokenizer.from_pretrained('monologg/kobert')

Running this code produces:

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'BertTokenizer'.
The class this function is called from is 'KoBertTokenizer'.

AttributeError Traceback (most recent call last)
in <cell line: 1>()
----> 1 tokenizer = KoBertTokenizer.from_pretrained('monologg/kobert')

5 frames
/content/drive/MyDrive/tokenization_kobert.py in get_vocab(self)
123
124 def get_vocab(self):
--> 125 return dict(self.token2idx, **self.added_tokens_encoder)
126
127 def __getstate__(self):

AttributeError: 'KoBertTokenizer' object has no attribute 'token2idx'

λΌλŠ” μ—λŸ¬κ°€ λ°œμƒν•˜λŠ”λ° ν˜Ήμ‹œ 해결방법이 μžˆλ‚˜μš”?!

Parameters of `BertOnlyMLMHead` missing in `monologg/kobert`

Thanks for releasing the KoBERT model! However, I found that the parameters of the BertOnlyMLMHead layer may be missing from the monologg/kobert model, which I think is a common issue I have also seen in released BERT models for other languages, such as Greek and Russian.
To reproduce this issue:

from transformers import AutoModelWithLMHead

m1 = AutoModelWithLMHead.from_pretrained('monologg/kobert')
print(m1.cls.predictions.transform.dense.weight)
m2 = AutoModelWithLMHead.from_pretrained('monologg/kobert')
print(m2.cls.predictions.transform.dense.weight)  # different from m1: the head is re-initialized on each load

Is it possible to upload the pretrained model with the missing parameters (either in huggingface's transformers or providing a link to the original tf checkpoint)?
