argosopentech / argos-translate

Open-source offline translation library written in Python

Home Page: https://www.argosopentech.com

License: MIT License

Languages: Python 93.21%, Shell 6.79%
Topics: python, machine-translation, transformers, translation, language-models, linux, nlp, open-source

argos-translate's People

Contributors

aabur, aleufroy, andrewkdinh, andriyor, argosopentech, ddorian, dingedi, ederin, guillaumekln, hollorol, jakeroggenbuck, jonmagon, jorgesumle, kolserdav, lynxpda, mikemoritz, milahu, mmacu, mmokhi, mwip, pierotofy, pirhoo, pj-finlay, rushter, technologyclassroom, tpcgold, vuizur, yogeshwaran01

argos-translate's Issues

ERROR: Can't pickle ctranslate2.translator.Translator objects

Hi,

Thank you for the great project!

I am using Argos Translate in my project. I have created a custom sklearn transformer in which I call the Argos models for translation. The transformer is part of an sklearn pipeline. However, when I set the pipeline hyperparameter n_jobs to a value higher than 1, I receive the error:

TypeError: can't pickle ctranslate2.translator.Translator objects

Any ideas/advice how I can solve this issue? Are you planning to make ctranslate2 objects picklable?

Thanks again!
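
One possible workaround, sketched under assumptions: with n_jobs greater than 1 the pipeline is pickled to be sent to worker processes, so the live translator can be kept out of the pickled state and rebuilt lazily in each worker. LazyTranslator, load_translator and _FakeTranslator are hypothetical names used only for illustration; the real model-loading code would go inside load_translator.

```python
import pickle


class LazyTranslator:
    """Wraps an unpicklable translator and rebuilds it on demand.

    `factory` is a hypothetical zero-argument, module-level callable that
    returns an object with a `translate(text)` method.
    """

    def __init__(self, factory):
        self.factory = factory
        self._translator = None  # created on first use, never pickled

    def __getstate__(self):
        # Drop the live translator so pickle only sees the factory.
        state = self.__dict__.copy()
        state["_translator"] = None
        return state

    def translate(self, text):
        if self._translator is None:
            self._translator = self.factory()
        return self._translator.translate(text)


class _FakeTranslator:
    """Stand-in for the real model; refuses to be pickled."""

    def __reduce__(self):
        raise TypeError("can't pickle _FakeTranslator objects")

    def translate(self, text):
        return text.upper()


def load_translator():
    # Assumption: replace this body with the code that loads the Argos model.
    return _FakeTranslator()


wrapper = LazyTranslator(load_translator)
wrapper.translate("hello")                   # forces the translator to load
clone = pickle.loads(pickle.dumps(wrapper))  # pickling still succeeds
```

The custom transformer would hold a LazyTranslator instead of the translator itself. The cost is that each worker process loads its own copy of the model.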

Failed to install with pip

Trying to install argostranslate on Debian unstable with pip I get the error message:

Package sentencepiece was not found in the pkg-config search path.
Perhaps you should add the directory containing `sentencepiece.pc'
to the PKG_CONFIG_PATH environment variable
No package 'sentencepiece' found
Failed to find sentencepiece pkgconfig

Is there anything I can do?

Emoji translations

Emojis within texts at best get dropped, and in some cases change the translation to something else.

I know this is a training matter...
But it occurred to me (after some trial and error) that by using something like .encode("unicode_escape")* we could make them pass through unchanged (as they often do, as far as I have tested) and then decode them back afterwards...

Basically, since we never have to "translate" those characters, I'm thinking maybe we could filter/keep them...

*P.S. not exactly this encode statement, but to be figured out 😅

Importing argostranslate can fail on snap package dirs

This is probably mostly a test environment issue, but this can happen:

../../../venv/lib/python3.8/site-packages/argostranslate/package.py:6: in <module>
    from argostranslate import settings
../../../venv/lib/python3.8/site-packages/argostranslate/settings.py:17: in <module>
    for package_dir in content_snap_packages.iterdir():
/home/mike/.pyenv/versions/3.8.6/lib/python3.8/pathlib.py:1121: in iterdir
    for name in self._accessor.listdir(self):
E   FileNotFoundError: [Errno 2] No such file or directory: '/snap/pycharm-community/223/snap_custom/content_snap_packages'

PR: #19
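
A minimal sketch of the kind of guard that avoids this traceback, assuming the fix is simply to skip a directory that does not exist. This is not necessarily what the PR does, and list_package_dirs is a hypothetical helper name.

```python
from pathlib import Path
from typing import List


def list_package_dirs(parent: Path) -> List[Path]:
    """Return the subdirectories of `parent`, or [] if it does not exist."""
    if not parent.is_dir():
        # Path.iterdir() raises FileNotFoundError on a missing directory,
        # so check first instead of letting the import fail.
        return []
    return [child for child in parent.iterdir() if child.is_dir()]


missing = Path("/nonexistent/snap_custom/content_snap_packages")
package_dirs = list_package_dirs(missing)
```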

Auto Code Formatting

Ideally there would be some sort of auto code formatting and linting. Related to this there is currently an issue with some of the documentation being formatted:
[image]

The goal is to comply with PEP 8 and PEP 257 to the extent possible.

Question about non-deterministic results

While investigating using argos-translate as a library, I have noticed non-deterministic results when translating a short test string "Hello world!" using your pre-trained models. For English -> Russian, it returns "Здравствуй мир!" on some hosts, and "Здравствуй!" on others. The results on a given host are deterministic on repeated runs and environments (at least in my testing so far).

I first tried following the advice in these threads, thinking it could be a random seed issue, to no avail:
OpenNMT/OpenNMT-py#392
pytorch/pytorch#7068 (comment)

I was not able to determine any significant differences between hosts (both running on cpu), and the output of ct2_verbose is identical:

[ct2_verbose] CPU: GenuineIntel (SSE4.1=true, AVX=true, AVX2=true)
[ct2_verbose] Selected CPU ISA: AVX2
[ct2_verbose] Use Intel MKL: true
[ct2_verbose] SGEMM CPU backend: MKL
[ct2_verbose] GEMM_S16 CPU backend: MKL
[ct2_verbose] GEMM_S8 CPU backend: MKL (u8s8 preferred: true)
[ct2_verbose] Use packed GEMM: false

Manually setting num_hypotheses=2 in the ctranslate2 Translator shows that it appears to be a score difference:

Host #1:
	{'tokens': ['▁З', 'д', 'рав', 'ству', 'й', '!'], 'score': -2.7840166091918945}
	{'tokens': ['▁З', 'д', 'рав', 'ству', 'й', '▁мир', '!'], 'score': -2.841048240661621}

Host #2:
	{'tokens': ['▁З', 'д', 'рав', 'ству', 'й', '▁мир', '!'], 'score': -2.7670412063598633}
	{'tokens': ['▁З', 'д', 'рав', 'ству', 'й', '!'], 'score': -2.7944717407226562}

Setting beam_size=1 so it uses greedy search did produce the same result on both hosts, but I don't think that is a valid solution.

I created a gist to provide some debugging output, and didn't notice any difference in the actual argos-translate parsing logic, so it seems to be much deeper: https://gist.github.com/mikemoritz/a5bf76193ccb16d018a1af9ec584fb41

My questions are:

  • Are there other options you would recommend setting to increase the likelihood of deterministic results? If so, could these be surfaced as options within argos-translate?
  • Is it possible that "Hello world!" is a bad test string? If so, do you have any recommendations?
  • Do you think it could still be a random seed issue that may need to be implemented within argos-translate?
  • Is there additional debugging within ctranslate2 and/or torch that you would recommend to highlight differences between the hosts?

Thanks!
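
To make the size of the difference concrete, here is a small sketch that only re-uses the scores quoted above. It computes the gap between the two hypotheses on each host and how far each score moved between hosts. The variable and function names are illustrative, and the token lists are joined into plain strings for readability.

```python
# Scores copied from the num_hypotheses=2 output above.
host_1 = {"Здравствуй!": -2.7840166091918945, "Здравствуй мир!": -2.841048240661621}
host_2 = {"Здравствуй мир!": -2.7670412063598633, "Здравствуй!": -2.7944717407226562}


def best(scores):
    """Hypothesis with the highest (least negative) score."""
    return max(scores, key=scores.get)


def margin(scores):
    """Gap between the best and second-best scores."""
    first, second = sorted(scores.values(), reverse=True)[:2]
    return first - second


for name, scores in (("Host #1", host_1), ("Host #2", host_2)):
    print(f"{name}: best={best(scores)!r} margin={margin(scores):.4f}")

# How far each hypothesis score moved between the two hosts.
for hypothesis in host_1:
    shift = host_2[hypothesis] - host_1[hypothesis]
    print(f"{hypothesis!r}: shift between hosts = {shift:+.4f}")
```

Both margins are only a few hundredths, so a small change in either score is enough to swap the ranking under beam search.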

Support for emojis in text translation

Given the English text: "Well done 👍"

The text itself gets translated perfectly into any language. However, depending on the target language the emoji is translated to "" or "?" or "Benachrichtigung" (in German).

Would it be possible to detect the emoji and leave that character as it is?
Hint: in Unicode 13.0 there are 4 character ranges allocated for emojis: U+1F300 (127744) to U+1FAD6 (129750), U+1F004 (126980) to U+1F251 (127569), U+00A9 (169) to U+00AE (174) and U+200D (8205) to U+3299 (12953)
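
A rough sketch of the "detect and leave as is" idea: split the text into emoji and non-emoji runs and only pass the non-emoji runs to the translator. Two assumptions are made here. First, only the two higher ranges from the hint are used, because the two lower ones (169 to 174 and 8205 to 12953) also contain ordinary characters such as « and Japanese kana. Second, `translate` is a hypothetical callable standing in for the real translation function.

```python
import re

# Assumption: U+1F004..U+1F251 and U+1F300..U+1FAD6 only (see note above).
EMOJI_RUN = re.compile("([\U0001F004-\U0001F251\U0001F300-\U0001FAD6]+)")


def translate_keeping_emoji(text, translate):
    """Translate the non-emoji parts of `text`; emoji pass through untouched."""
    out = []
    # With a capturing group, re.split alternates: text, emoji, text, ...
    for index, part in enumerate(EMOJI_RUN.split(text)):
        is_emoji = index % 2 == 1
        if is_emoji or not part.strip():
            out.append(part)
            continue
        # Keep surrounding whitespace so the pieces join back cleanly.
        lead = part[: len(part) - len(part.lstrip())]
        trail = part[len(part.rstrip()):]
        out.append(lead + translate(part.strip()) + trail)
    return "".join(out)


# str.upper stands in for a real translator in this demonstration.
result = translate_keeping_emoji("Well done 👍", str.upper)
```

A caveat of this approach is that text on either side of an emoji is translated separately, so the model loses context across the split.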

Cursor not themed in snap

Using the snap, the cursor does not follow the system theme. It's more obvious if you change the cursor theme to something that looks different, like "redglass", but even the Ubuntu default Yaru theme and size are not followed. One Qt app snap that this does work in is KeePassXC. It looks like their snapcraft.yaml has some additional plugs for theming.

Command-line Usage

Would it be possible to support command-line usage? I searched the documentation but found nothing. I would like to automate translating texts and also text files into multiple languages.

As an example I suggest the following:
argos-translate -text "Hello World!" -from en -to de
argos-translate -file Novel.txt -from en -to de
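
The suggested syntax could be parsed with Python's argparse roughly as follows. This is only a sketch of the argument handling: the flag names mirror the example above, build_parser and read_input are hypothetical names, and the actual translation call is left out.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="argos-translate")
    # Exactly one of -text / -file must be given.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("-text", help="text to translate")
    source.add_argument("-file", help="path of a text file to translate")
    # `from` is a Python keyword, so store the value under another name.
    parser.add_argument("-from", dest="from_code", required=True,
                        help="source language code, e.g. en")
    parser.add_argument("-to", dest="to_code", required=True,
                        help="target language code, e.g. de")
    return parser


def read_input(args):
    """Return the text to translate, from -text or from the -file path."""
    if args.text is not None:
        return args.text
    with open(args.file, encoding="utf-8") as handle:
        return handle.read()


args = build_parser().parse_args(
    ["-text", "Hello World!", "-from", "en", "-to", "de"]
)
text = read_input(args)
```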

Improve Training scripts

The training scripts have lots of room for improvement. The long-term plan is to rewrite them in OpenNMT for PyTorch in a fully automatable way, but there are other potential improvements:

  • Auto download data from the Opus Parallel Corpus
  • Auto stop training after a set number of epochs
  • Cleaner implementation/better docs

Close in system tray

Please allow Argos Translate to close to the system tray rather than take up room in the panel.

Decrease Snapcraft distribution size

Currently the Snapcraft image is ~1GB, and ~700MB of this is a torch CUDA shared object file. If this could be removed automatically in the Snapcraft build process somehow (or maybe as an option for all Python installs?) then the download and startup time for Snapcraft would greatly improve (these are currently both issues).

List license for language models.

I see that this repository is licensed under the MIT license, but the language models are hosted outside of this repository and can be downloaded over HTTPS, IPFS or torrent.

Does the same MIT license apply to the models as well, or are they distributed under a different license?

This should probably be listed somewhere.

Add Tests

We currently don't have any tests, but it would be nice to. Not being able to include a .argostranslate file in the tests easily will make this more difficult but at least having some tests would be good.

PIP shows conflicting dependencies

$ pip install argostranslate
Collecting argostranslate
  Using cached argostranslate-1.0.5-py3-none-any.whl (13 kB)
  Using cached argostranslate-1.0.3-py3-none-any.whl (12 kB)
  Using cached argostranslate-1.0-py3-none-any.whl (12 kB)
ERROR: Cannot install argostranslate==1.0, argostranslate==1.0.3 and argostranslate==1.0.5 because these package versions have conflicting dependencies.

The conflict is caused by:
    argostranslate 1.0.5 depends on ctranslate2==1.14.0
    argostranslate 1.0.3 depends on ctranslate2==1.14.0
    argostranslate 1.0 depends on ctranslate2

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/user_guide/#fixing-conflicting-dependencies

Configuration/environment:

--------------------------------------------------------------------------------
  Date: Wed Dec 23 15:59:00 2020 CET

                OS : Darwin
            CPU(s) : 12
           Machine : x86_64
      Architecture : 64bit
       Environment : Python

  Python 3.7.7 (default, Mar 23 2020, 17:31:31)  [Clang 4.0.1
  (tags/RELEASE_401/final)]

             numpy : 1.19.4
           IPython : 7.19.0
            scooby : 0.5.6
--------------------------------------------------------------------------------
pip 20.3.3

PyQt signals logic in GUI

Fix packages_changed = pyqtSignal() in gui.py to correctly update all views when the state of packages has changed.

arm64 support (Librem 5 phone etc.)

This won't install on a Librem phone...

$ snap install argos-translate
error: snap "argos-translate" is not available on stable for this architecture (arm64) but exists on other architectures (amd64)

This seems like a very handy library. Why does it not run on ARM computers?

Weird English -> Japanese translations (bad training data?)

I'm using argos-translate via libretranslate, so if this is the wrong place for this, I'll move it.

I'm testing out the English -> Japanese translations and I think some bad data might have gotten into the training data.

"Hello" is being translated as "お問い合わせ" which translates to "Contact Us" (something you'd expect to see at the bottom of a webpage used for training?)

"Goodbye" is being translated as "フィードバック" (feedback), again something you'd expect to see at the bottom of a webpage.

"Help me!" is also being translated as "お問い合わせ".

Not exactly sure how I can help, but I figured I'd point out the issue.

Do Screenshots on a Qt based Distro

The screenshots promoted by Argos are taken in a GNOME desktop environment. GNOME itself does not have the best integration with Qt, since it uses GTK.

Maybe take these screenshots in a desktop environment with better Qt integration, like KDE Plasma.

Localization

Currently there is no custom localization, but this would be nice to have. Qt provides some nice tools for doing this, and the app's strings could be translated using the app itself.

Support Language Detection

The plan for this was to train a model using the existing infrastructure that maps from input text to a language code. This would require adding a way to generate this data in the training scripts and what is hopefully a pretty small code change to support this. I'd be pretty optimistic about this just working pretty well out of the box but it may take some tweaking.
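
As a sketch of what the generated data might look like, assuming the detector is trained like any other model with the input text as the source and the language code as the target. make_detection_pairs and the sample sentences are illustrative only and not part of the existing training scripts.

```python
def make_detection_pairs(sentences_by_language):
    """Flatten {language code: [sentences]} into (sentence, code) pairs."""
    return [
        (sentence, code)
        for code, sentences in sorted(sentences_by_language.items())
        for sentence in sentences
    ]


# Tiny made-up sample; real data would come from the training corpora.
samples = {"en": ["Hello world!"], "de": ["Hallo Welt!"]}
pairs = make_detection_pairs(samples)
```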

OMP: Error #15: Initializing libiomp5.dylib, but found libiomp5.dylib already initialized

Hi,
Thanks for the amazing project. This was a fresh install in a new Python 3.8.3 virtual environment using pip install and launching straight away. The web app launches, but after a few keystrokes, the process crashes with the following error logged to the terminal:
OMP: Error #15: Initializing libiomp5.dylib, but found libiomp5.dylib already initialized.
OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.
Abort trap: 6

Any advice appreciated here, thanks again.
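
For reference, the workaround named in the message itself can be applied from Python as below. As the message warns, it is unsafe and may cause crashes or silently incorrect results, so it is at best a stopgap until only one OpenMP runtime is linked.

```python
import os

# Unsafe workaround quoted from the OpenMP message above. It has to be set
# before the libraries that bundle an OpenMP runtime are imported.
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"
```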

Port to more platforms

The easiest are probably macOS and Windows using py2app and py2exe, but other platforms to consider could be mobile, BSD, Debian, Red Hat, or Flatpak. I'd like to be able to run builds on Linux as much as possible, but this may not be possible for some platforms.

There's also a decision to be made about whether we want to use tools like py2app/py2exe or go all in on pyqtdeploy.

There are probably some challenges for doing local translation on mobile so a better strategy may be to build/port simple mobile apps that connect to the LibreTranslate API.

Better model distribution

Currently models are distributed by Google Drive (not ideal) and a slow BitTorrent, so there's lots of room for improvement:

  • More Torrent seeders
  • Create individual torrent files for each model
  • HTTP or FTP mirrors
  • I avoided git distribution because I was worried about running into GitHub limits but we may want to link to the LibreTranslate Git Mirror
  • Open to other ideas too

The plan was to make a separate repo for storing model distribution information, so let me know if you're interested.

GPU Support

When I first wrote this, CTranslate2, which does inference, didn't support GPU translation when installed from PyPI. This has since changed and GPU support would be a nice feature to have. All this may take is updating the CTranslate2 version in requirements.txt and adding documentation, but if someone with more CUDA knowledge could look into this I would appreciate it. It would also be nice to support open-source alternatives to CUDA.

Argos Translate also prints an error message about torch not being able to connect to CUDA:

/usr/local/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0

Torch is only used by Stanza, which does sentence boundary detection, so GPU inference there isn't as important for performance as CTranslate2 supporting GPUs, but this error message should be suppressed.
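
One way the message could be silenced, sketched with the standard library's warnings module. It assumes the text keeps starting with "CUDA initialization" and is raised as a UserWarning, as in the output above. suppress_cuda_init_warning is a hypothetical helper name.

```python
import warnings


def suppress_cuda_init_warning():
    """Ignore torch's "CUDA initialization: ..." UserWarning.

    The `message` argument is a regular expression matched against the
    start of the warning text.
    """
    warnings.filterwarnings(
        "ignore",
        message="CUDA initialization",
        category=UserWarning,
    )


# Call this before the import that triggers the warning.
suppress_cuda_init_warning()
```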

Filter HTML entities in training scripts data

Hello, I've noticed a bug when translating to French.
Sometimes the HTML entity &apos; appears instead of the apostrophe.
Some examples:
[images: argos-translate1, argos-translate2, argos-translate5]

Though it doesn't happen in these cases:
[images: argos-translate4, argos-translate3]
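
If the entities come from the training data, as the title suggests, one possible filter is to decode them before training with Python's html.unescape. This is a sketch only: clean_training_line is a hypothetical name and where it would hook into the training scripts is left open.

```python
import html


def clean_training_line(line):
    """Decode HTML entities such as &apos; into literal characters."""
    return html.unescape(line)


cleaned = clean_training_line("l&apos;homme")
```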
