ADD cross-encoder-mmarco-mMiniLMv2-L12-H384-v1 about ailia-models HOT 18 CLOSED

kyakuno commented on June 10, 2024

ADD cross-encoder-mmarco-mMiniLMv2-L12-H384-v1

from ailia-models.

Comments (18)

kyakuno commented on June 10, 2024

Since this is XLM-RoBERTa based, the existing tokenizer should be usable as-is.

kyakuno commented on June 10, 2024

Discussion of the original model having disappeared:
UKPLab/sentence-transformers#2306
https://www.reddit.com/r/MachineLearning/comments/1286n6w/d_multilingual_retrieve_rerank_models/

kyakuno commented on June 10, 2024

The following page gives an easy-to-follow usage example.
https://huggingface.co/jeffwan/mmarco-mMiniLMv2-L12-H384-v1

kyakuno commented on June 10, 2024

The original multilingual MARCO (mMARCO) appears to be the following.
https://github.com/unicamp-dl/mMARCO

kyakuno commented on June 10, 2024

The model:
https://huggingface.co/unicamp-dl/mMiniLM-L6-v2-mmarco-v2

kyakuno commented on June 10, 2024

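This model is three years old, and the latest model appears to be the same thing converted to a cross-encoder architecture.
English-only MS MARCO models are available at the following page.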
https://www.sbert.net/docs/pretrained-models/ce-msmarco.html

kyakuno commented on June 10, 2024

The progression is the same for both datasets:
marco : minilm -> crossencoder
mmarco : minilm -> crossencoder

kyakuno commented on June 10, 2024

For now, exporting the mirrored model named at the top of this issue should be enough.

kyakuno commented on June 10, 2024

Example inference code:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "corrius/cross-encoder-mmarco-mMiniLMv2-L12-H384-v1"
#model_name = "jeffwan/mmarco-mMiniLMv2-L12-H384-v1"

model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

features_en = tokenizer(['How many people live in Berlin?', 'How many people live in Berlin?'], ['Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.', 'New York City is famous for the Metropolitan Museum of Art.'],  padding=True, truncation=True, return_tensors="pt")
features_ja = tokenizer(['ベルリンには何人が住んでいますか？', 'ベルリンには何人が住んでいますか？'], ['ベルリンの人口は891.82平方キロメートルの地域に登録された住民が3,520,031人います。', 'ニューヨーク市はメトロポリタン美術館で有名です。'],  padding=True, truncation=True, return_tensors="pt")

model.eval()
with torch.no_grad():
    scores_en = model(**features_en).logits
    print(scores_en)
    scores_ja = model(**features_ja).logits
    print(scores_ja)
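
As a rough sketch of how such logits are typically consumed in a reranker: each (query, passage) pair gets one logit, and the passages are sorted by it in descending order. The helper name `rank_passages` is hypothetical, the sample logits are the English scores reported later in this thread, and the sigmoid is an optional assumption used only to map logits into (0, 1) without changing the order.

```python
import math


def rank_passages(passages, logits):
    """Return (passage, logit, sigmoid(logit)) tuples, best match first.

    Hypothetical helper for illustration; not part of any library.
    """
    scored = [(p, s, 1.0 / (1.0 + math.exp(-s))) for p, s in zip(passages, logits)]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Passages deliberately listed in the "wrong" order to show the sort.
passages = [
    "New York City is famous for the Metropolitan Museum of Art.",
    "Berlin has a population of 3,520,031 registered inhabitants "
    "in an area of 891.82 square kilometers.",
]
# English logits as printed in this thread, reordered to match the list above.
logits = [-8.1277, 10.7615]

ranked = rank_passages(passages, logits)
for passage, logit, prob in ranked:
    print(f"{logit:8.4f}  {prob:.4f}  {passage}")
```

Sorting on the raw logit keeps the ranking independent of whether the sigmoid is applied.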

kyakuno commented on June 10, 2024

corrius/cross-encoder-mmarco-mMiniLMv2-L12-H384-v1 and jeffwan/mmarco-mMiniLMv2-L12-H384-v1 appear to be the same model; the scores below are identical.

#corrius/cross-encoder-mmarco-mMiniLMv2-L12-H384-v1
#tensor([[10.7615],
#        [-8.1277]])
#tensor([[ 9.3747],
#        [-6.4083]])

#jeffwan/mmarco-mMiniLMv2-L12-H384-v1
#tensor([[10.7615],
#        [-8.1277]])
#tensor([[ 9.3747],
#        [-6.4083]])

kyakuno commented on June 10, 2024

Conversion to ONNX:

pip3 install optimum
optimum-cli export onnx --model jeffwan/mmarco-mMiniLMv2-L12-H384-v1 mmarco-mMiniLMv2-L12-H384-v1.onnx

kyakuno commented on June 10, 2024

The converted ONNX file is 470.8 MB.

kyakuno commented on June 10, 2024

Inference code (ailia):

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "jeffwan/mmarco-mMiniLMv2-L12-H384-v1"

model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

features_en = tokenizer(['How many people live in Berlin?', 'How many people live in Berlin?'], ['Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.', 'New York City is famous for the Metropolitan Museum of Art.'],  padding=True, truncation=True, return_tensors="pt")
features_ja = tokenizer(['ベルリンには何人が住んでいますか？', 'ベルリンには何人が住んでいますか？'], ['ベルリンの人口は891.82平方キロメートルの地域に登録された住民が3,520,031人います。', 'ニューヨーク市はメトロポリタン美術館で有名です。'],  padding=True, truncation=True, return_tensors="pt")

import ailia
net = ailia.Net(weight = "mmarco-mMiniLMv2-L12-H384-v1.onnx/model.onnx")
print(features_en)
score_en = net.run([features_en["input_ids"].numpy(), features_en["attention_mask"].numpy()])
print(score_en)
print(features_ja)
score_ja = net.run([features_ja["input_ids"].numpy(), features_ja["attention_mask"].numpy()])
print(score_ja)

kyakuno commented on June 10, 2024

The output matches torch.

[array([[10.761542],
       [-8.127746]], dtype=float32)]
[array([[ 9.374646],
       [-6.408309]], dtype=float32)]
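
The match can also be checked numerically rather than by eye. A minimal sketch using the values printed in this thread; the helper name `outputs_match` and the tolerance of 1e-3 are assumptions, chosen because the torch output is only printed to four decimal places.

```python
import math


def outputs_match(reference, candidate, abs_tol=1e-3):
    """True if both score lists have equal length and agree within abs_tol.

    Hypothetical helper; abs_tol is an assumed tolerance, not a project rule.
    """
    return len(reference) == len(candidate) and all(
        math.isclose(r, c, abs_tol=abs_tol) for r, c in zip(reference, candidate)
    )


# Scores copied from the torch and ailia outputs shown in this thread.
torch_scores = {"en": [10.7615, -8.1277], "ja": [9.3747, -6.4083]}
ailia_scores = {"en": [10.761542, -8.127746], "ja": [9.374646, -6.408309]}

for lang in torch_scores:
    print(lang, outputs_match(torch_scores[lang], ailia_scores[lang]))
```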

kyakuno commented on June 10, 2024

Expected tokens (Japanese query/contents pair):

[     0,      6, 106876,  67540,   2880,   4931,  19758,   6111, 206926,
           1894,     32,      2,      2,      6, 106876,  67540,    154,  24008,
            342,  18949, 119044,    304, 158580, 189402, 219137,    154,  34604,
         160450,   8346, 159616,    281,  92714,    304,  63527,   5016,    487,
          12219,     30,      2]

kyakuno commented on June 10, 2024

Output of the same tokenizer that is used for e5, with query and contents tokenized separately:

tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/paraphrase-multilingual-mpnet-base-v2')
inputs = tokenizer('ベルリンには何人が住んでいますか？', padding=True, truncation=True, return_tensors='np')
print(inputs)
inputs = tokenizer('ベルリンの人口は891.82平方キロメートルの地域に登録された住民が3,520,031人います。', padding=True, truncation=True, return_tensors='np')
print(inputs)

query

[     0,      6, 106876,  67540,   2880,   4931,  19758,   6111,
        206926,   1894,     32,      2]

contents

[     0,      6, 106876,  67540,    154,  24008,    342,  18949,
        119044,    304, 158580, 189402, 219137,    154,  34604, 160450,
          8346, 159616,    281,  92714,    304,  63527,   5016,    487,
         12219,     30,      2]

kyakuno commented on June 10, 2024

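Decoding the expected tokens shows that the <s> at the junction becomes </s>, so it is enough to concatenate query and contents and replace the leading 0 of contents with 2.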

print(tokenizer.decode([     0,      6, 106876,  67540,   2880,   4931,  19758,   6111, 206926,
           1894,     32,      2,      2,      6, 106876,  67540,    154,  24008,
            342,  18949, 119044,    304, 158580, 189402, 219137,    154,  34604,
         160450,   8346, 159616,    281,  92714,    304,  63527,   5016,    487,
          12219,     30,      2]))

<s> ベルリンには何人が住んでいますか?</s></s> ベルリンの人口は891.82平方キロメートルの地域に登録された住民が3,520,031人います。</s>
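
The joining rule can be sketched as follows, using the token IDs from this thread. The function name `build_pair_input_ids` is hypothetical; it assumes both inputs were tokenized separately with special tokens, so each starts with 0 (`<s>`) and ends with 2 (`</s>`).

```python
BOS_ID = 0  # <s>
EOS_ID = 2  # </s>, which also serves as the separator


def build_pair_input_ids(query_ids, contents_ids):
    """Join two separately tokenized sequences into <s> A </s></s> B </s>.

    Hypothetical helper: drops the leading <s> of contents and puts </s> there.
    """
    if not contents_ids or contents_ids[0] != BOS_ID:
        raise ValueError("contents_ids must start with the <s> token")
    return list(query_ids) + [EOS_ID] + list(contents_ids[1:])


# Token IDs copied from the separate tokenizer outputs in this thread.
query_ids = [0, 6, 106876, 67540, 2880, 4931, 19758, 6111, 206926, 1894, 32, 2]
contents_ids = [
    0, 6, 106876, 67540, 154, 24008, 342, 18949, 119044, 304, 158580,
    189402, 219137, 154, 34604, 160450, 8346, 159616, 281, 92714, 304,
    63527, 5016, 487, 12219, 30, 2,
]

input_ids = build_pair_input_ids(query_ids, contents_ids)
attention_mask = [1] * len(input_ids)  # no padding for a single pair
print(input_ids)
```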

kyakuno commented on June 10, 2024

Strictly speaking, the second 2 is sep_token, but since sep_token and eos_token are the same symbol, it also comes out as 2.

{
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": true,
  "cls_token": "<s>",
  "eos_token": "</s>",
  "mask_token": {
    "__type": "AddedToken",
    "content": "<mask>",
    "lstrip": true,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "model_max_length": 512,
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "tokenizer_class": "XLMRobertaTokenizer",
  "unk_token": "<unk>"
}
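
When several pairs of different lengths are batched, the shorter ones need padding and a matching attention mask. A minimal sketch in plain Python: the helper name `pad_batch` is hypothetical, and the pad ID of 1 is an assumption (the usual ID of `<pad>` in XLM-RoBERTa vocabularies, not stated in the config above), so it is worth checking against `tokenizer.pad_token_id`. Sequences longer than `model_max_length` are rejected here rather than truncated.

```python
PAD_ID = 1        # assumed ID of <pad>; verify with tokenizer.pad_token_id
MAX_LENGTH = 512  # model_max_length from the tokenizer config


def pad_batch(sequences, pad_id=PAD_ID, max_length=MAX_LENGTH):
    """Right-pad ID sequences to a common length and build attention masks.

    Hypothetical helper for illustration; real tokenizers also handle truncation.
    """
    longest = max(len(seq) for seq in sequences)
    if longest > max_length:
        raise ValueError(f"sequence of length {longest} exceeds {max_length}")
    input_ids = [list(seq) + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences
    ]
    return input_ids, attention_mask


# Toy ID sequences in the <s> A </s></s> B </s> layout, not real tokenizer output.
batch = [[0, 6, 32, 2, 2, 6, 30, 2], [0, 6, 2, 2, 30, 2]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```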
