Comments (18)

kyakuno commented on June 10, 2024

The model is based on XLM-RoBERTa, so the existing tokenizer should be usable as-is.

from ailia-models.

kyakuno commented on June 10, 2024

Discussions about the original model having been removed:
UKPLab/sentence-transformers#2306
https://www.reddit.com/r/MachineLearning/comments/1286n6w/d_multilingual_retrieve_rerank_models/

kyakuno commented on June 10, 2024

The page below gives a clear explanation of how to use the model.
https://huggingface.co/jeffwan/mmarco-mMiniLMv2-L12-H384-v1

kyakuno commented on June 10, 2024

The original multilingual MARCO (mMARCO) project appears to be the following.
https://github.com/unicamp-dl/mMARCO

kyakuno commented on June 10, 2024

The model:
https://huggingface.co/unicamp-dl/mMiniLM-L6-v2-mmarco-v2

kyakuno commented on June 10, 2024

This model is three years old, and the latest model appears to be the same model converted to a cross-encoder architecture.
The English-only MS MARCO models are available at the page below.
https://www.sbert.net/docs/pretrained-models/ce-msmarco.html

kyakuno commented on June 10, 2024

The lineage is as follows:
marco : minilm -> crossencoder
mmarco : minilm -> crossencoder

kyakuno commented on June 10, 2024

For now, exporting the mirrored model linked at the top of this issue should be sufficient.

kyakuno commented on June 10, 2024

Example inference code:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "corrius/cross-encoder-mmarco-mMiniLMv2-L12-H384-v1"
#model_name = "jeffwan/mmarco-mMiniLMv2-L12-H384-v1"

model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

features_en = tokenizer(['How many people live in Berlin?', 'How many people live in Berlin?'], ['Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.', 'New York City is famous for the Metropolitan Museum of Art.'],  padding=True, truncation=True, return_tensors="pt")
features_ja = tokenizer(['ベルリンには何人が住んでいますか?', 'ベルリンには何人が住んでいますか?'], ['ベルリンの人口は891.82平方キロメートルの地域に登録された住民が3,520,031人います。', 'ニューヨーク市はメトロポリタン美術館で有名です。'],  padding=True, truncation=True, return_tensors="pt")

model.eval()
with torch.no_grad():
    scores_en = model(**features_en).logits
    print(scores_en)
    scores_ja = model(**features_ja).logits
    print(scores_ja)
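
The logits printed by this code are unnormalized relevance scores, one per (query, passage) pair. As a rough sketch of how they could be turned into a ranking, the snippet below applies a sigmoid and sorts the passages by score. The `rank_passages` helper is hypothetical (it is not part of transformers or ailia), the sigmoid is an assumed normalization rather than something this thread prescribes, and the logit values are the English results reported later in this thread.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw logit to the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-x))


def rank_passages(passages, logits):
    """Hypothetical helper: return (passage, probability) pairs, best first."""
    scored = [(p, sigmoid(s)) for p, s in zip(passages, logits)]
    return sorted(scored, key=lambda item: item[1], reverse=True)


passages = [
    "Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.",
    "New York City is famous for the Metropolitan Museum of Art.",
]
# Logits for the two English pairs, copied from the results posted in this thread.
logits_en = [10.7615, -8.1277]

ranking = rank_passages(passages, logits_en)
for passage, prob in ranking:
    print(f"{prob:.4f}  {passage}")
```

Sorting on the raw logits would give the same order; the sigmoid only makes the scores easier to threshold.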

kyakuno commented on June 10, 2024

corrius/cross-encoder-mmarco-mMiniLMv2-L12-H384-v1 and jeffwan/mmarco-mMiniLMv2-L12-H384-v1 appear to be the same model; both produce identical scores.

#corrius/cross-encoder-mmarco-mMiniLMv2-L12-H384-v1
#tensor([[10.7615],
#        [-8.1277]])
#tensor([[ 9.3747],
#        [-6.4083]])

#jeffwan/mmarco-mMiniLMv2-L12-H384-v1
#tensor([[10.7615],
#        [-8.1277]])
#tensor([[ 9.3747],
#        [-6.4083]])

kyakuno commented on June 10, 2024

Conversion to ONNX:

pip3 install optimum
optimum-cli export onnx --model jeffwan/mmarco-mMiniLMv2-L12-H384-v1 mmarco-mMiniLMv2-L12-H384-v1.onnx

kyakuno commented on June 10, 2024

The converted ONNX model is 470.8 MB.

kyakuno commented on June 10, 2024

Inference code using ailia:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "jeffwan/mmarco-mMiniLMv2-L12-H384-v1"

model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

features_en = tokenizer(['How many people live in Berlin?', 'How many people live in Berlin?'], ['Berlin has a population of 3,520,031 registered inhabitants in an area of 891.82 square kilometers.', 'New York City is famous for the Metropolitan Museum of Art.'],  padding=True, truncation=True, return_tensors="pt")
features_ja = tokenizer(['ベルリンには何人が住んでいますか?', 'ベルリンには何人が住んでいますか?'], ['ベルリンの人口は891.82平方キロメートルの地域に登録された住民が3,520,031人います。', 'ニューヨーク市はメトロポリタン美術館で有名です。'],  padding=True, truncation=True, return_tensors="pt")

import ailia
net = ailia.Net(weight = "mmarco-mMiniLMv2-L12-H384-v1.onnx/model.onnx")
print(features_en)
score_en = net.run([features_en["input_ids"].numpy(), features_en["attention_mask"].numpy()])
print(score_en)
print(features_ja)
score_ja = net.run([features_ja["input_ids"].numpy(), features_ja["attention_mask"].numpy()])
print(score_ja)

kyakuno commented on June 10, 2024

The output matches the torch results.

[array([[10.761542],
       [-8.127746]], dtype=float32)]
[array([[ 9.374646],
       [-6.408309]], dtype=float32)]
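
A small sketch of how this agreement could be checked programmatically rather than by eye. The values are copied from the torch and ailia outputs posted in this thread. The `scores_match` helper and the `1e-3` tolerance are assumptions, chosen because the torch tensors were printed with only four decimal places.

```python
import math

# Scores copied from this thread: [English pairs, Japanese pairs].
torch_scores = [[10.7615, -8.1277], [9.3747, -6.4083]]
ailia_scores = [[10.761542, -8.127746], [9.374646, -6.408309]]


def scores_match(a, b, abs_tol=1e-3):
    """Return True if every corresponding pair of scores agrees within abs_tol."""
    return all(
        math.isclose(x, y, abs_tol=abs_tol)
        for row_a, row_b in zip(a, b)
        for x, y in zip(row_a, row_b)
    )


print(scores_match(torch_scores, ailia_scores))  # prints True
```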

kyakuno commented on June 10, 2024

Expected tokens (input_ids for the Japanese query and passage pair):

[     0,      6, 106876,  67540,   2880,   4931,  19758,   6111, 206926,
           1894,     32,      2,      2,      6, 106876,  67540,    154,  24008,
            342,  18949, 119044,    304, 158580, 189402, 219137,    154,  34604,
         160450,   8346, 159616,    281,  92714,    304,  63527,   5016,    487,
          12219,     30,      2]

kyakuno commented on June 10, 2024

Output of the same tokenizer as the one used for e5:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/paraphrase-multilingual-mpnet-base-v2')
inputs = tokenizer('ベルリンには何人が住んでいますか?', padding=True, truncation=True, return_tensors='np')
print(inputs)
inputs = tokenizer('ベルリンの人口は891.82平方キロメートルの地域に登録された住民が3,520,031人います。', padding=True, truncation=True, return_tensors='np')
print(inputs)

query

[     0,      6, 106876,  67540,   2880,   4931,  19758,   6111,
        206926,   1894,     32,      2]

contents

[     0,      6, 106876,  67540,    154,  24008,    342,  18949,
        119044,    304, 158580, 189402, 219137,    154,  34604, 160450,
          8346, 159616,    281,  92714,    304,  63527,   5016,    487,
         12219,     30,      2]

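kyakuno commented on June 10, 2024

Decoding the expected tokens shows that the two sequences are joined by </s></s>, so it is enough to concatenate the query and contents token IDs and replace the leading 0 of contents with 2.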

print(tokenizer.decode([     0,      6, 106876,  67540,   2880,   4931,  19758,   6111, 206926,
           1894,     32,      2,      2,      6, 106876,  67540,    154,  24008,
            342,  18949, 119044,    304, 158580, 189402, 219137,    154,  34604,
         160450,   8346, 159616,    281,  92714,    304,  63527,   5016,    487,
          12219,     30,      2]))

<s> ベルリンには何人が住んでいますか?</s></s> ベルリンの人口は891.82平方キロメートルの地域に登録された住民が3,520,031人います。</s>
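
The joining rule described in this comment can be sketched as follows. The token IDs are copied from the query, contents, and expected outputs posted in this thread. The `join_pair` helper is hypothetical, and the all-ones attention mask assumes a single unpadded pair. Truncation to the model's maximum length is not handled here.

```python
BOS_ID = 0  # <s>
EOS_ID = 2  # </s>, which also serves as the separator


def join_pair(query_ids, contents_ids):
    """Hypothetical helper: build pair input IDs from two single-sentence encodings."""
    if not contents_ids or contents_ids[0] != BOS_ID:
        raise ValueError("contents_ids must start with the <s> token (0)")
    # The query already ends with </s>; swap the leading <s> of contents for </s>.
    return list(query_ids) + [EOS_ID] + list(contents_ids[1:])


query_ids = [0, 6, 106876, 67540, 2880, 4931, 19758, 6111, 206926, 1894, 32, 2]
contents_ids = [
    0, 6, 106876, 67540, 154, 24008, 342, 18949, 119044, 304, 158580,
    189402, 219137, 154, 34604, 160450, 8346, 159616, 281, 92714, 304,
    63527, 5016, 487, 12219, 30, 2,
]
expected_ids = [
    0, 6, 106876, 67540, 2880, 4931, 19758, 6111, 206926, 1894, 32, 2,
    2, 6, 106876, 67540, 154, 24008, 342, 18949, 119044, 304, 158580,
    189402, 219137, 154, 34604, 160450, 8346, 159616, 281, 92714, 304,
    63527, 5016, 487, 12219, 30, 2,
]

pair_ids = join_pair(query_ids, contents_ids)
attention_mask = [1] * len(pair_ids)  # no padding for a single pair
print(pair_ids == expected_ids)  # prints True
```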

kyakuno commented on June 10, 2024

Strictly speaking, the second 2 is sep_token, but since sep_token and eos_token are the same symbol (</s>), it is also 2.

{
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": true,
  "cls_token": "<s>",
  "eos_token": "</s>",
  "mask_token": {
    "__type": "AddedToken",
    "content": "<mask>",
    "lstrip": true,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "model_max_length": 512,
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "tokenizer_class": "XLMRobertaTokenizer",
  "unk_token": "<unk>"
}
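
As a quick illustration of the point above, the sketch below groups the special-token roles by the symbol they map to, using a subset of the config shown here (the `mask_token` entry and non-token settings are omitted for brevity). It only inspects the JSON and does not load the tokenizer.

```python
import json

# Subset of the tokenizer config posted in this thread.
config_text = """
{
  "bos_token": "<s>",
  "cls_token": "<s>",
  "eos_token": "</s>",
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "unk_token": "<unk>"
}
"""
config = json.loads(config_text)

# Group role names by symbol to see which roles share a token.
roles_by_symbol = {}
for role, symbol in config.items():
    roles_by_symbol.setdefault(symbol, []).append(role)

for symbol, roles in roles_by_symbol.items():
    print(symbol, roles)
```

Because `eos_token` and `sep_token` share `</s>`, they necessarily share a token ID as well, which is why 2 appears twice in a row at the join.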
