Comments (6)
Same question: how did you obtain the bilingual Chinese-English vocabulary?

Same question: how did you obtain the Chinese-English vocabulary? Or should BERT's vocabulary be taken as the reference?

The new vocabulary simply reuses the vocabulary of Langboat's Bloom model. Ideally, though, you would build the new vocabulary from your own downstream task: count the frequency of every token in your data and keep the tokens whose frequency lies above the 99th-percentile line. @yqli2420 @CoinCheung @yanguowei316
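A minimal sketch of that frequency-based selection, assuming a plain-text downstream corpus and reading the "99 line" as the 99th percentile of token frequencies; the corpus path and the cutoff are illustrative, not part of LLMPruner:

```python
# Hedged sketch: count token frequencies over a downstream corpus and keep
# the ids above the 99th-percentile frequency. "downstream_corpus.txt" and
# the percentile cutoff are assumptions.
from collections import Counter

import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloomz-560m")

counter = Counter()
with open("downstream_corpus.txt", encoding="utf-8") as f:
    for line in f:
        counter.update(tokenizer(line.strip())["input_ids"])

cutoff = np.percentile(list(counter.values()), 99)
keep_ids = {tid for tid, freq in counter.items() if freq >= cutoff}
keep_ids |= set(tokenizer.all_special_ids)  # never drop the special tokens
print(f"keeping {len(keep_ids)} of {len(tokenizer)} token ids")
```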
Hello,
Thank you for your code, which helped me prune the model embeddings and lm_head down to a selection of token_ids.
I am now trying to prune the tokenizer vocab and merges.
First, I tokenized my data with the original tokenizer (Bloomz-560m).
Then, I extracted the most frequent token_ids and bigrams to build new_vocab and new_merges.
Finally, I try to replace the model state, but it fails.
Here is the code I tried:
```python
import json

from tokenizers import Tokenizer
from tokenizers.models import BPE


def update_tokenizer(tokenizer, new_vocab, new_merges, out_path):
    """new_vocab is a subset of Bloomz token ids converted with convert_ids_to_tokens;
    new_merges is a subset of the Bloomz merges whose items are all in new_vocab."""
    model_state = json.loads(tokenizer.backend_tokenizer.model.__getstate__())
    model_state["vocab"] = {w: i for i, w in enumerate(new_vocab)}
    model_state["merges"] = new_merges
    tokenizer.backend_tokenizer.model = Tokenizer(BPE(**model_state))  # It fails here :(
    tokenizer.save_pretrained(out_path)
```
You will find similar code in
https://github.com/asahi417/lm-vocab-trimmer/blob/main/vocabtrimmer/base_trimmer.py#L244
I also tried to update the tokenizer.json file directly but it failed on loading the object.
How did you reduce the vocabulary (vocab and merges) of the BPE tokenizer?
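For what it's worth, the assignment probably fails because backend_tokenizer.model expects a tokenizers.models.BPE instance rather than a Tokenizer, and __getstate__ includes keys such as "type" that BPE() does not accept (merges would also need to be (str, str) tuples). A sketch in the spirit of the linked vocabtrimmer code, assuming merges are serialized as "a b" strings; shrink_backend_tokenizer and its consistency filter are my own illustration, not the repository's method:

```python
# Sketch (assumption, not LLMPruner's own method): rebuild the fast
# tokenizer from edited JSON instead of assigning a new model object.
import json

from tokenizers import Tokenizer


def shrink_backend_tokenizer(tokenizer, keep_tokens, out_path):
    """keep_tokens: surface forms (convert_ids_to_tokens output) to retain."""
    state = json.loads(tokenizer.backend_tokenizer.to_str())
    new_vocab = {tok: i for i, tok in enumerate(keep_tokens)}
    # A merge "a b" stays valid only if a, b, and a+b all survive the cut.
    new_merges = []
    for merge in state["model"]["merges"]:
        a, b = merge.split(" ")
        if a in new_vocab and b in new_vocab and (a + b) in new_vocab:
            new_merges.append(merge)
    state["model"]["vocab"] = new_vocab
    state["model"]["merges"] = new_merges
    tokenizer._tokenizer = Tokenizer.from_str(json.dumps(state))
    tokenizer.save_pretrained(out_path)
```

Filtering both sides of each merge, not just the merged result, is what should keep the rebuilt model deserializable.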
> I also tried to update the tokenizer.json file directly but it failed on loading the object.

- Generate a new tokenizer.json:
```python
import json

from transformers import BloomTokenizerFast

# model_name = "yuanzhoulvpi/chinese_bloom_560m"
model_name = "bigscience/bloomz-560m"
tokenizer = BloomTokenizerFast.from_pretrained(model_name)
print("Loading model: ", model_name)
tokenizer.save_pretrained("./models/" + model_name)


def is_chinese(string):
    """True if every character is a CJK ideograph in the basic range."""
    for ch in string:
        if not (u'\u4e00' <= ch <= u'\u9fa5'):
            # if not ((u'\u4e00' <= ch <= u'\u9fa5') or (u'\u3400' <= ch <= u'\u4db5')):
            return False
    return True


def is_ascii(string):
    """True if the string is a single ASCII character."""
    if len(string) != 1:
        return False
    for ch in string:
        if not (0 <= ord(ch) < 128):
            return False
    return True


# Keep the first four special tokens, plus every token that decodes to a
# single ASCII character or to pure Chinese text.
chinese_vocab = {}
new_id = 0
for i in range(len(tokenizer.vocab)):
    tks = tokenizer.convert_ids_to_tokens([i])
    tks_str = "".join(tks)
    text = tokenizer.decode([i])
    if i < 4 or is_ascii(text) or is_chinese(text):
        chinese_vocab[tks_str] = new_id
        new_id += 1


def is_exist(tks, vocab):
    """True if the concatenation of tks is a token in vocab."""
    return "".join(tks) in vocab


with open("./models/" + model_name + "/tokenizer.json", 'r', encoding="utf-8") as f0:
    t = json.load(f0)
t["model"]["vocab"] = chinese_vocab

# Keep only the merges whose merged result survives in the new vocab.
chinese_merges = []
for merge in t["model"]["merges"]:
    tks = merge.split(" ")
    text = tokenizer.convert_tokens_to_string(tks)
    # if is_chinese(text) and is_exist(tks, chinese_vocab):
    if len(text) > 1 and is_exist(tks, chinese_vocab):
        chinese_merges.append(merge)
t["model"]["merges"] = chinese_merges
# t["model"]["merges"] = []

with open('./models/pruner_vocab/tokenizer.json', "w", encoding="utf-8") as f1:
    json.dump(t, f1, ensure_ascii=False, indent=1)
print("chinese_vocab_size : ", len(chinese_vocab))
```
- Load error when merges are kept:

```
File "C:\Projects\AIGC2\bloom\LLMPruner\test1.py", line 8, in <module>
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
Exception: data did not match any variant of untagged enum ModelWrapper at line 49961 column 2
```

- Loading succeeds when merges is left empty; running pruner.py then generates the new model, but the check fails:

```
File "C:\Projects\AIGC2\bloom\LLMPruner\pruner.py", line 13, in <module>
    pruner.check(model_name_or_path, save_path, text='长风破浪会有时')
File "C:\Projects\AIGC2\bloom\LLMPruner\pruners\vocabulary_pruner.py", line 25, in check
    new_output = new_model.generate(new_input_ids, max_length=max_length)
IndexError: index -1 is out of bounds for dimension 1 with size 0
```

- Finally, I suspect the new tokenizer.json has to describe a valid, internally consistent model.
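If so, one way to catch this before loading is to check the pruned file's internal consistency. A hedged sketch under those assumptions; validate_bpe_state is a hypothetical helper, not part of LLMPruner, and the contiguity check reflects the need for ids to index the pruned embedding matrix:

```python
# Hypothetical validator for a pruned BPE tokenizer.json; the invariants
# checked here are assumptions about what makes the file loadable.
import json


def validate_bpe_state(path):
    with open(path, encoding="utf-8") as f:
        state = json.load(f)
    vocab = state["model"]["vocab"]
    ids = sorted(vocab.values())
    # Ids should be 0..N-1 so they index the pruned embedding matrix.
    assert ids == list(range(len(ids))), "vocab ids are not contiguous from 0"
    for merge in state["model"]["merges"]:
        a, b = merge.split(" ")
        assert a in vocab and b in vocab and (a + b) in vocab, \
            f"inconsistent merge: {merge!r}"
    print(f"ok: {len(vocab)} tokens, {len(state['model']['merges'])} merges")


validate_bpe_state("./models/pruner_vocab/tokenizer.json")
```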
@yangjianxin1 Thanks, I have no further questions; closing this for now.