Comments (18)
Since this model is XLM-RoBERTa based, the tokenizer can probably be used as-is.
from ailia-models.
Discussion of the original model having disappeared:
UKPLab/sentence-transformers#2306
https://www.reddit.com/r/MachineLearning/comments/1286n6w/d_multilingual_retrieve_rerank_models/
The following page explains the usage clearly.
https://huggingface.co/jeffwan/mmarco-mMiniLMv2-L12-H384-v1
The original multilingual MARCO (mMARCO) project appears to be the following.
https://github.com/unicamp-dl/mMARCO
Model:
https://huggingface.co/unicamp-dl/mMiniLM-L6-v2-mmarco-v2
This model is three years old, and the latest model appears to be this one converted to a cross-encoder architecture.
The English-only MS MARCO models are listed on the page below.
https://www.sbert.net/docs/pretrained-models/ce-msmarco.html
marco : minilm -> crossencoder
mmarco : minilm -> crossencoder
That is the lineage of these models.
For now, exporting the mirrored model listed at the top of this issue should be sufficient.
Example inference code:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
model_name = "corrius/cross-encoder-mmarco-mMiniLMv2-L12-H384-v1"
#model_name = "jeffwan/mmarco-mMiniLMv2-L12-H384-v1"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
features_en = tokenizer(['How many people live in Berlin?', 'How many people live in Berlin?'], ['Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.', 'New York City is famous for the Metropolitan Museum of Art.'], padding=True, truncation=True, return_tensors="pt")
features_ja = tokenizer(['ベルリンには何人が住んでいますか?', 'ベルリンには何人が住んでいますか?'], ['ベルリンの人口は891.82平方キロメートルの地域に登録された住民が3,520,031人います。', 'ニューヨーク市はメトロポリタン美術館で有名です。'], padding=True, truncation=True, return_tensors="pt")
model.eval()
with torch.no_grad():
    scores_en = model(**features_en).logits
    print(scores_en)
    scores_ja = model(**features_ja).logits
    print(scores_ja)
corrius/cross-encoder-mmarco-mMiniLMv2-L12-H384-v1 and jeffwan/mmarco-mMiniLMv2-L12-H384-v1 produce identical scores, so they appear to be the same model.
#corrius/cross-encoder-mmarco-mMiniLMv2-L12-H384-v1
#tensor([[10.7615],
# [-8.1277]])
#tensor([[ 9.3747],
# [-6.4083]])
#jeffwan/mmarco-mMiniLMv2-L12-H384-v1
#tensor([[10.7615],
# [-8.1277]])
#tensor([[ 9.3747],
# [-6.4083]])
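The raw logits above can be turned into a ranking. The following is a minimal sketch in plain Python, using the English scores printed in this thread. The helper names `sigmoid` and `rank_passages` are hypothetical, and applying a sigmoid to get a 0 to 1 relevance score is an assumption here; the thread itself only works with the raw logits.

```python
import math

def sigmoid(x: float) -> float:
    """Map a raw logit to the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-x))

def rank_passages(passages, logits):
    """Sort passages by logit, highest first, and attach a sigmoid score."""
    ordered = sorted(zip(passages, logits), key=lambda pair: pair[1], reverse=True)
    return [(text, logit, sigmoid(logit)) for text, logit in ordered]

# Passages and logits copied from the English example in this thread.
passages = [
    "Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.",
    "New York City is famous for the Metropolitan Museum of Art.",
]
logits_en = [10.7615, -8.1277]

ranked = rank_passages(passages, logits_en)
for text, logit, prob in ranked:
    print(f"{logit:9.4f}  {prob:.6f}  {text}")
```

Sorting by the logit alone gives the same order as sorting by the sigmoid value, since the sigmoid is monotonic; the squashed score is only useful when a threshold is needed.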
Conversion to ONNX:
pip3 install optimum
optimum-cli export onnx --model jeffwan/mmarco-mMiniLMv2-L12-H384-v1 mmarco-mMiniLMv2-L12-H384-v1.onnx
The converted ONNX file is 470.8 MB.
Inference code using the exported ONNX model with ailia:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
model_name = "jeffwan/mmarco-mMiniLMv2-L12-H384-v1"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
features_en = tokenizer(['How many people live in Berlin?', 'How many people live in Berlin?'], ['Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.', 'New York City is famous for the Metropolitan Museum of Art.'], padding=True, truncation=True, return_tensors="pt")
features_ja = tokenizer(['ベルリンには何人が住んでいますか?', 'ベルリンには何人が住んでいますか?'], ['ベルリンの人口は891.82平方キロメートルの地域に登録された住民が3,520,031人います。', 'ニューヨーク市はメトロポリタン美術館で有名です。'], padding=True, truncation=True, return_tensors="pt")
import ailia
net = ailia.Net(weight = "mmarco-mMiniLMv2-L12-H384-v1.onnx/model.onnx")
print(features_en)
score_en = net.run([features_en["input_ids"].numpy(), features_en["attention_mask"].numpy()])
print(score_en)
print(features_ja)
score_ja = net.run([features_ja["input_ids"].numpy(), features_ja["attention_mask"].numpy()])
print(score_ja)
The output matches the torch output.
[array([[10.761542],
[-8.127746]], dtype=float32)]
[array([[ 9.374646],
[-6.408309]], dtype=float32)]
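The match can also be checked numerically rather than by eye. This is a minimal sketch that compares the torch and ONNX scores printed in this thread; the values are copied by hand from the outputs above, and the tolerance of 1e-3 is an assumption chosen because the torch printout is rounded to four decimals.

```python
# Scores copied from the printed outputs in this thread.
torch_scores = {"en": [10.7615, -8.1277], "ja": [9.3747, -6.4083]}
onnx_scores = {"en": [10.761542, -8.127746], "ja": [9.374646, -6.408309]}

def max_abs_diff(a, b):
    """Largest absolute element-wise difference between two score lists."""
    return max(abs(x - y) for x, y in zip(a, b))

TOLERANCE = 1e-3  # assumed tolerance; the torch values are rounded when printed

diffs = {lang: max_abs_diff(torch_scores[lang], onnx_scores[lang]) for lang in torch_scores}
all_match = all(d < TOLERANCE for d in diffs.values())

for lang, d in diffs.items():
    print(f"{lang}: max abs diff = {d:.6f}")
print("outputs match:", all_match)
```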
Expected tokens:
[ 0, 6, 106876, 67540, 2880, 4931, 19758, 6111, 206926,
1894, 32, 2, 2, 6, 106876, 67540, 154, 24008,
342, 18949, 119044, 304, 158580, 189402, 219137, 154, 34604,
160450, 8346, 159616, 281, 92714, 304, 63527, 5016, 487,
12219, 30, 2]
Output of the same tokenizer as e5:
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/paraphrase-multilingual-mpnet-base-v2')
inputs = tokenizer('ベルリンには何人が住んでいますか?', padding=True, truncation=True, return_tensors='np')
print(inputs)
inputs = tokenizer('ベルリンの人口は891.82平方キロメートルの地域に登録された住民が3,520,031人います。', padding=True, truncation=True, return_tensors='np')
print(inputs)
query
[ 0, 6, 106876, 67540, 2880, 4931, 19758, 6111,
206926, 1894, 32, 2]
contents
[ 0, 6, 106876, 67540, 154, 24008, 342, 18949,
119044, 304, 158580, 189402, 219137, 154, 34604, 160450,
8346, 159616, 281, 92714, 304, 63527, 5016, 487,
12219, 30, 2]
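Decoding the expected values shows that the <s> at the join becomes </s>, so it is enough to concatenate query and contents and replace the leading 0 of contents with 2.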
print(tokenizer.decode([ 0, 6, 106876, 67540, 2880, 4931, 19758, 6111, 206926,
1894, 32, 2, 2, 6, 106876, 67540, 154, 24008,
342, 18949, 119044, 304, 158580, 189402, 219137, 154, 34604,
160450, 8346, 159616, 281, 92714, 304, 63527, 5016, 487,
12219, 30, 2]))
<s> ベルリンには何人が住んでいますか?</s></s> ベルリンの人口は891.82平方キロメートルの地域に登録された住民が3,520,031人います。</s>
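The concatenation rule described above can be sketched in plain Python. This is a minimal sketch, not the ailia-models implementation: the function name `build_pair_ids` is hypothetical, the IDs 0 (`<s>`) and 2 (`</s>`) are taken from the token dumps in this thread, and the all-ones attention mask assumes a single unpadded pair.

```python
def build_pair_ids(query_ids, contents_ids, bos_id=0, eos_id=2):
    """Join separately tokenized query and contents into one cross-encoder input.

    The leading <s> (bos_id) of contents is replaced with </s> (eos_id),
    giving <s> query </s></s> contents </s>.
    """
    if not contents_ids or contents_ids[0] != bos_id:
        raise ValueError("contents_ids is expected to start with the bos token")
    return list(query_ids) + [eos_id] + list(contents_ids[1:])

# Token IDs copied from the single-sentence tokenizer outputs in this thread.
query = [0, 6, 106876, 67540, 2880, 4931, 19758, 6111, 206926, 1894, 32, 2]
contents = [0, 6, 106876, 67540, 154, 24008, 342, 18949, 119044, 304,
            158580, 189402, 219137, 154, 34604, 160450, 8346, 159616, 281,
            92714, 304, 63527, 5016, 487, 12219, 30, 2]

# Expected pair encoding copied from the "Expected tokens" dump.
expected = [0, 6, 106876, 67540, 2880, 4931, 19758, 6111, 206926,
            1894, 32, 2, 2, 6, 106876, 67540, 154, 24008,
            342, 18949, 119044, 304, 158580, 189402, 219137, 154, 34604,
            160450, 8346, 159616, 281, 92714, 304, 63527, 5016, 487,
            12219, 30, 2]

pair_ids = build_pair_ids(query, contents)
attention_mask = [1] * len(pair_ids)  # no padding for a single pair

print(pair_ids == expected)
print(len(pair_ids))
```

When several pairs are batched, the shorter sequences would additionally need padding and a matching attention mask, which this sketch does not cover.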
Strictly speaking, the second 2 is sep_token, but because sep_token and eos_token are the same symbol, it is also 2.
{
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": true,
  "cls_token": "<s>",
  "eos_token": "</s>",
  "mask_token": {
    "__type": "AddedToken",
    "content": "<mask>",
    "lstrip": true,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "model_max_length": 512,
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "tokenizer_class": "XLMRobertaTokenizer",
  "unk_token": "<unk>"
}
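The point about shared symbols can be checked directly from the configuration. This is a small sketch that parses a subset of the fields shown above (copied into a string for illustration; the variable name `TOKENIZER_CONFIG` is hypothetical) and confirms that sep_token equals eos_token and cls_token equals bos_token.

```python
import json

# Subset of the tokenizer configuration shown in this thread.
TOKENIZER_CONFIG = """
{
  "bos_token": "<s>",
  "cls_token": "<s>",
  "eos_token": "</s>",
  "sep_token": "</s>",
  "pad_token": "<pad>",
  "unk_token": "<unk>",
  "model_max_length": 512,
  "tokenizer_class": "XLMRobertaTokenizer"
}
"""

config = json.loads(TOKENIZER_CONFIG)

# sep_token and eos_token are the same string, so they map to the same ID.
sep_equals_eos = config["sep_token"] == config["eos_token"]
# Likewise, cls_token and bos_token are the same string.
cls_equals_bos = config["cls_token"] == config["bos_token"]

print("sep_token == eos_token:", sep_equals_eos)
print("cls_token == bos_token:", cls_equals_bos)
```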