Hi there, This is a very beautiful work. I want to use this API for languages other th

Different Language Support about txtai HOT 5 CLOSED

neuml commented on May 18, 2024

Different Language Support

Comments (5)

davidmezzetti commented on May 18, 2024

Thank you for the support.

The best place to start is this notebook: https://colab.research.google.com/github/neuml/txtai/blob/master/examples/01_Introducing_txtai.ipynb

The index in the notebook above uses sentence-transformers. This link has a list of all the sentence transformer models available: https://huggingface.co/models?search=sentence-transformers

The following is an example modification using a multilingual model:

from txtai.embeddings import Embeddings

# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens"})

If this doesn't work well, another model to try: sentence-transformers/LaBSE

Then change the sections text below to a couple of examples in the target language you want to experiment with.

sections = ["US tops 5 million confirmed virus cases",
            "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
            "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
            "The National Park Service warns against sacrificing slower friends in a bear attack",
            "Maine man wins $1M from $25 lottery ticket",
            "Make huge profits without work, earn up to $100,000 a day"]

Finally, change the queries to the target language as well:

for query in ("feel good story", "climate change", "health", "war", "wildlife", "asia", "north america", "dishonest junk"):
    ...  # loop body stays the same as in the notebook

I would just iterate over a couple of different models until you find one that works well. Let me know how it works out.
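A minimal sketch of that iteration, for anyone who wants to compare models side by side. The helper names (best_matches, txtai_scorer, compare_models) are made up for this example and are not part of txtai. It also assumes that embeddings.similarity() returns (index, score) pairs, which is the case in recent txtai releases; older releases may return a plain array of scores, so adjust the adapter if needed.

```python
def best_matches(score, queries, sections):
    """Map each query to the section with the highest similarity score.

    score(query, sections) must return one numeric value per section.
    """
    results = {}
    for query in queries:
        scores = score(query, sections)
        # Index of the highest-scoring section for this query
        best = max(range(len(sections)), key=lambda i: scores[i])
        results[query] = sections[best]
    return results


def txtai_scorer(path):
    """Hypothetical adapter: build a scoring function from a txtai model path."""
    # Imported here so the rest of the sketch works without txtai installed
    from txtai.embeddings import Embeddings

    embeddings = Embeddings({"method": "transformers", "path": path})

    def score(query, sections):
        # Assumption: similarity() yields (index, score) pairs
        scores = [0.0] * len(sections)
        for index, value in embeddings.similarity(query, sections):
            scores[index] = value
        return scores

    return score


def compare_models(paths, queries, sections, make_scorer=txtai_scorer):
    """Run the same queries against each model and collect the best matches."""
    return {path: best_matches(make_scorer(path), queries, sections) for path in paths}
```

Calling compare_models with the model paths mentioned in this thread, plus sections and queries written in the target language, returns one query-to-section mapping per model, which makes it easier to see which model gives sensible matches.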

ByUnal commented on May 18, 2024

I appreciate your feedback and concern. I will try your recommendation as soon as I'm available and let you know.

ByUnal commented on May 18, 2024

Hello again. I have now tried the transformer models you suggested, both of them for Turkish. The first one worked in some cases, but it is not efficient. The second one didn't work. I need Turkish language support for the transformer models.

davidmezzetti commented on May 18, 2024

Another one to try: sentence-transformers/distilbert-multilingual-nli-stsb-quora-ranking

Otherwise you can also try any of the generic Turkish transformer models: https://huggingface.co/models?search=turkish

Those multilingual models are the ones that should support multiple languages. I suspect a model trained specifically for Turkish, on an NLI/STSB-like task, would work best.

txtai uses the sentence-transformers library to build transformer-based sentence embeddings. I would suggest trying as many of the models as you can to see if any of them work at an acceptable level for your task.

davidmezzetti commented on May 18, 2024

Issue should now be resolved. Tokenization can be disabled by setting the config option:

Embeddings({"method": "transformers", "path": "/path/to/model", "tokenize": False})
