Different Language Support (txtai issue, closed)

neuml commented on May 18, 2024
Different Language Support

from txtai.

Comments (5)

davidmezzetti commented on May 18, 2024

Thank you for the support.

The best place to start is this notebook: https://colab.research.google.com/github/neuml/txtai/blob/master/examples/01_Introducing_txtai.ipynb

The index in the notebook above uses sentence-transformers. This link lists the available sentence-transformers models: https://huggingface.co/models?search=sentence-transformers

The following is an example modification using a multilingual model:

from txtai.embeddings import Embeddings

# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings({"method": "transformers", "path": "sentence-transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens"})

If this doesn't work well, another model to try is sentence-transformers/LaBSE.

Then change the sections text below to a couple of examples in the target language you want to experiment with:

sections = ["US tops 5 million confirmed virus cases",
            "Canada's last fully intact ice shelf has suddenly collapsed, forming a Manhattan-sized iceberg",
            "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
            "The National Park Service warns against sacrificing slower friends in a bear attack",
            "Maine man wins $1M from $25 lottery ticket",
            "Make huge profits without work, earn up to $100,000 a day"]

Finally, change the queries to the target language as well:

for query in ("feel good story", "climate change", "health", "war", "wildlife", "asia", "north america", "dishonest junk"):
    ...  # loop body stays the same as in the notebook

I would just iterate over a few different models until you find one that works well. Let me know how it works out.
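
One way to make that model-by-model comparison less ad hoc is to score each candidate on a handful of queries whose correct section is known. The sketch below is only an illustration and not part of txtai: `accuracy`, `rank_models` and the `ScoreFn` type are hypothetical names, and each scorer is assumed to return one similarity score per section. The wiring to an actual `Embeddings` instance is left to the caller, since the return shape of its similarity call may differ between txtai versions.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical scorer type (an assumption for this sketch): takes a query and
# the candidate sections, returns one score per section, higher meaning closer.
ScoreFn = Callable[[str, Sequence[str]], Sequence[float]]


def accuracy(score_fn: ScoreFn, sections: Sequence[str], expected: Dict[str, int]) -> float:
    """Fraction of queries whose top-scoring section is the expected one.

    `expected` maps each query to the index of the section it should match.
    """
    hits = 0
    for query, want in expected.items():
        scores = score_fn(query, sections)
        # Index of the highest-scoring section for this query
        best = max(range(len(sections)), key=lambda i: scores[i])
        hits += best == want
    return hits / len(expected)


def rank_models(
    scorers: Dict[str, ScoreFn],
    sections: Sequence[str],
    expected: Dict[str, int],
) -> List[Tuple[str, float]]:
    """Return (model name, accuracy) pairs, best model first."""
    results = [(name, accuracy(fn, sections, expected)) for name, fn in scorers.items()]
    return sorted(results, key=lambda r: r[1], reverse=True)
```

To use it, build one scorer per candidate model name, write the sections and queries in the target language, and fill `expected` by hand with the section index each query should retrieve.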

ByUnal commented on May 18, 2024

I appreciate your feedback and concern. I will try your recommendation as soon as I can and let you know.

ByUnal commented on May 18, 2024

Hello again. I tried the two transformer models you suggested, both on Turkish. The first one worked in some cases, but it is not efficient. The second one didn't work. I need transformers with Turkish language support.

davidmezzetti commented on May 18, 2024

Another one to try: sentence-transformers/distilbert-multilingual-nli-stsb-quora-ranking

Otherwise you can also try any of the generic Turkish transformer models: https://huggingface.co/models?search=turkish

Those multilingual models are the ones that should support multiple languages. I suspect a model trained specifically for Turkish, on an NLI/STSB-like task in Turkish, would work best.

txtai uses the sentence-transformers library to build transformer-based sentence embeddings. I would suggest trying as many of the models as you can to see if any of them work at an acceptable level for your task.

davidmezzetti commented on May 18, 2024

Issue should now be resolved. Tokenization can be disabled by setting the config option:

Embeddings({"method": "transformers", "path": "/path/to/model", "tokenize": False})
