argosopentech / argos-translate

Open-source offline translation library written in Python

Home Page: https://www.argosopentech.com

License: MIT License

Python 93.21% Shell 6.79%

python machine-translation transformers translation language-models linux nlp open-source

argos-translate's Introduction

Argos Open Technologies, LLC

argos-translate's Issues

Do you offer setup instructions for OFFLINE translation mode?

My target platforms are:

Hybrid Apps made with: Monaca or Cordova

React Native App made with Expo.io

Thanks

ERROR: Can't pickle ctranslate2.translator.Translator objects

Hi,

Thank you for the great project!

I am using argos translate in my project. I have created a customized sklearn transformer where I call the argos models for translation. My customized transformer is part of sklearn pipeline. However, when I set the pipeline hyperparameter n_jobs to a value higher than 1, I receive the error:

TypeError: can't pickle ctranslate2.translator.Translator objects

Any ideas/advice how I can solve this issue? Are you planning to make ctranslate2 objects picklable?

Thanks again!
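One common way around this kind of pickling error is to keep the translator out of the pickled state and rebuild it lazily inside each worker. The sketch below is a minimal illustration under stated assumptions: `LazyTranslatorTransformer`, `FakeTranslator`, and `make_fake_translator` are hypothetical names, the fake translator stands in for the real ctranslate2-backed object, and the class is not a complete sklearn estimator (it omits `fit` and `BaseEstimator`).

```python
import pickle
import threading


class LazyTranslatorTransformer:
    """Hypothetical transformer that builds its translator on first use."""

    def __init__(self, from_code, to_code, translator_factory):
        self.from_code = from_code
        self.to_code = to_code
        # The factory must itself be picklable, e.g. a module-level function.
        self.translator_factory = translator_factory
        self._translator = None  # created lazily, never pickled

    def _get_translator(self):
        if self._translator is None:
            self._translator = self.translator_factory(self.from_code, self.to_code)
        return self._translator

    def transform(self, texts):
        translator = self._get_translator()
        return [translator.translate(text) for text in texts]

    def __getstate__(self):
        # Drop the unpicklable handle; each worker process rebuilds it on demand.
        state = self.__dict__.copy()
        state["_translator"] = None
        return state


class FakeTranslator:
    """Stand-in for the real translator; the lock makes it unpicklable."""

    def __init__(self, from_code, to_code):
        self.prefix = f"[{from_code}->{to_code}] "
        self._lock = threading.Lock()

    def translate(self, text):
        return self.prefix + text


def make_fake_translator(from_code, to_code):
    return FakeTranslator(from_code, to_code)


if __name__ == "__main__":
    transformer = LazyTranslatorTransformer("en", "ru", make_fake_translator)
    transformer.transform(["Hello."])  # forces the translator to be created
    clone = pickle.loads(pickle.dumps(transformer))  # succeeds despite the lock
    print(clone.transform(["Hello."]))
```

The cost of this design is that every worker loads its own copy of the model, so memory use grows with n_jobs.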

Segmentation fault when running from either cli or Python

[1] 91775 segmentation fault sudo argos-translate-cli --from-lang en --to-lang ru "Hello."

No problems with installation of models. Models enumerate properly, etc.

Failed to install with pip

Trying to install argostranslate on Debian unstable with pip I get the error message:

Package sentencepiece was not found in the pkg-config search path.
Perhaps you should add the directory containing `sentencepiece.pc'
to the PKG_CONFIG_PATH environment variable
No package 'sentencepiece' found
Failed to find sentencepiece pkgconfig

Is there anything I can do?

Chinese translation returning input text / overtraining? / low resource languages

Continuing the discussion from this thread.

My plan is to train a Japanese model next with 10,000 epochs (I've used ~30,000 for all the existing ones). Japanese is somewhat similar to Chinese and has a similar amount of data available so it'll be a good test bed plus we can add a new language in the process.

Add support Persian language

Is it possible to add support for Persian language?

Emoji translations

Emojis within texts get dropped at best, and in some cases change the translation to something else.

I know this is a training matter...
But it occurred to me (after some testing and trial and error) that by using something like .encode("unicode_escape")* we could let them stay the same (as they often do, as far as I have tested) and then decode them back afterwards...

Basically, since we never have to "translate" those characters, I'm thinking maybe we could filter/keep them...

*P.S. not exactly this encode statement, but to be figured out 😅

editing code for GUI Title

I want to change the window title.
I am changing it in the GUI.py file, but the title is not updated.

Remote cloud translations

Support connecting to a remote LibreTranslate server for translations you don't have locally.
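As a rough sketch of what such a remote call could look like, the snippet below builds a POST request for a LibreTranslate-style `/translate` endpoint using only the standard library. The endpoint path, the `q`/`source`/`target`/`format` fields, and the `translatedText` reply key reflect my understanding of the LibreTranslate HTTP API and should be checked against the server you target; the function names and the base URL are hypothetical.

```python
import json
import urllib.request


def build_translate_request(base_url, text, source, target):
    """Build (but do not send) a POST request for a /translate endpoint."""
    payload = {"q": text, "source": source, "target": target, "format": "text"}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/translate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def translate_remotely(base_url, text, source, target, timeout=10):
    """Send the request and return the translated text from the JSON reply."""
    request = build_translate_request(base_url, text, source, target)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["translatedText"]
```

A caller could try the local packages first and fall back to `translate_remotely("http://localhost:5000", "Hello", "en", "es")` only when no local package covers the language pair.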

flatpak for argos-translate

It would be useful to have a flatpak for argos-translate at https://flathub.org/home .

Importing argostranslate can fail on snap package dirs

This is probably mostly a test environment issue, but this can happen:

../../../venv/lib/python3.8/site-packages/argostranslate/package.py:6: in <module>
    from argostranslate import settings
../../../venv/lib/python3.8/site-packages/argostranslate/settings.py:17: in <module>
    for package_dir in content_snap_packages.iterdir():
/home/mike/.pyenv/versions/3.8.6/lib/python3.8/pathlib.py:1121: in iterdir
    for name in self._accessor.listdir(self):
E   FileNotFoundError: [Errno 2] No such file or directory: '/snap/pycharm-community/223/snap_custom/content_snap_packages'

PR: #19
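The traceback comes from calling `iterdir()` on a directory that does not exist. A minimal guard, assuming a missing directory should just mean "no packages here", could look like the sketch below; `list_package_dirs` is a hypothetical helper and is not necessarily how the referenced PR fixes it.

```python
from pathlib import Path


def list_package_dirs(parent):
    """Return the subdirectories of parent, or [] if parent does not exist."""
    parent = Path(parent)
    if not parent.is_dir():
        # A missing directory (as in the traceback above) means no packages.
        return []
    return sorted(p for p in parent.iterdir() if p.is_dir())
```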

Auto Code Formatting

Ideally there would be some sort of auto code formatting and linting. Related to this there is currently an issue with some of the documentation being formatted:

The goal is to comply with PEP 8 and PEP 257 to the extent possible.

Question about non-deterministic results

While investigating using argos-translate as a library, I have noticed non-deterministic results when translating a short test string "Hello world!" using your pre-trained models. For English -> Russian, it returns "Здравствуй мир!" on some hosts, and "Здравствуй!" on others. The results on a given host are deterministic on repeated runs and environments (at least in my testing so far).

I first tried to follow the advice here thinking it could be a random seed issue to no avail:
OpenNMT/OpenNMT-py#392
pytorch/pytorch#7068 (comment)

I was not able to determine any significant differences between hosts (both running on cpu), and the output of ct2_verbose is identical:

[ct2_verbose] CPU: GenuineIntel (SSE4.1=true, AVX=true, AVX2=true)
[ct2_verbose] Selected CPU ISA: AVX2
[ct2_verbose] Use Intel MKL: true
[ct2_verbose] SGEMM CPU backend: MKL
[ct2_verbose] GEMM_S16 CPU backend: MKL
[ct2_verbose] GEMM_S8 CPU backend: MKL (u8s8 preferred: true)
[ct2_verbose] Use packed GEMM: false

Manually setting num_hypotheses=2 in the ctranslate2 Translator shows that it appears to be a score difference:

Host #1:
	{'tokens': ['▁З', 'д', 'рав', 'ству', 'й', '!'], 'score': -2.7840166091918945}
	{'tokens': ['▁З', 'д', 'рав', 'ству', 'й', '▁мир', '!'], 'score': -2.841048240661621}

Host #2:
	{'tokens': ['▁З', 'д', 'рав', 'ству', 'й', '▁мир', '!'], 'score': -2.7670412063598633}
	{'tokens': ['▁З', 'д', 'рав', 'ству', 'й', '!'], 'score': -2.7944717407226562}

Setting beam_size=1 so it uses greedy search did produce the same result on both hosts, but I don't think that is a valid solution.
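The two score lists above are close enough that a small numeric difference between hosts would be sufficient to swap the top hypothesis. The sketch below only re-ranks the reported scores and measures the gap between first and second place; `detokenize`, `rank_hypotheses`, and the `tie_margin` threshold are hypothetical and are not part of argos-translate or ctranslate2.

```python
def detokenize(tokens):
    """Join SentencePiece-style tokens, treating '▁' as a word boundary."""
    return "".join(tokens).replace("▁", " ").strip()


def rank_hypotheses(hypotheses, tie_margin=0.1):
    """Return (best_text, gap_to_runner_up, is_near_tie) for scored hypotheses."""
    ranked = sorted(hypotheses, key=lambda h: h["score"], reverse=True)
    gap = ranked[0]["score"] - ranked[1]["score"] if len(ranked) > 1 else float("inf")
    return detokenize(ranked[0]["tokens"]), gap, gap < tie_margin


# Scores copied from the two hosts in the report above.
host_1 = [
    {"tokens": ["▁З", "д", "рав", "ству", "й", "!"], "score": -2.7840166091918945},
    {"tokens": ["▁З", "д", "рав", "ству", "й", "▁мир", "!"], "score": -2.841048240661621},
]
host_2 = [
    {"tokens": ["▁З", "д", "рав", "ству", "й", "▁мир", "!"], "score": -2.7670412063598633},
    {"tokens": ["▁З", "д", "рав", "ству", "й", "!"], "score": -2.7944717407226562},
]

for name, hypotheses in (("host 1", host_1), ("host 2", host_2)):
    text, gap, near_tie = rank_hypotheses(hypotheses)
    print(name, text, round(gap, 4), near_tie)
```

On both hosts the gap is under 0.06, which suggests that a test string with a clearer winner would be less sensitive to host differences.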

I created a gist to provide some debugging output, and didn't notice any difference in the actual argos-translate parsing logic, so it seems to be much deeper: https://gist.github.com/mikemoritz/a5bf76193ccb16d018a1af9ec584fb41

My questions are:

Are there other options you would recommend setting to increase the likelihood of deterministic results? If so, could these be surfaced as options within argos-translate?
Is it possible that "Hello world!" is a bad test string? If so, do you have any recommendations?
Do you think it could still be a random seed issue that may need to be implemented within argos-translate?
Is there additional debugging within ctranslate2 and/or torch that you would recommend to highlight differences between the hosts?

Thanks!

Name package manager windows

https://youtu.be/1yLSLgeFzYY?t=368

Support for emojis in text translation

Given is the English text: "Well done 👍"

The text itself gets translated perfectly in any language. However, depending on the target language the emoji is translated to "" or "?" or "Benachrichtigung" (in German).

Would it be possible to detect the emoji and leave that character as it is?
Hint: in Unicode 13.0 there are 4 character ranges allocated for emojis: U+1F300 (127744) to U+1FAD6 (129750), 126980 to 127569, 169 to 174 and 8205 to 12953
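One possible way to leave emojis untouched is to split the text on runs of emoji characters and send only the remaining segments to the translator. The sketch below uses the four code point ranges from the hint as given; note that the last two ranges are broad and also cover ordinary characters such as «, …, and curly quotes, so a real implementation would probably want a tighter list. `translate_keeping_emojis` is a hypothetical helper, and `translate` is any callable supplied by the caller.

```python
import re

# Inclusive code point ranges quoted in the issue (assumed, not verified
# against the Unicode emoji data files).
EMOJI_RANGES = [(0x1F300, 0x1FAD6), (126980, 127569), (169, 174), (8205, 12953)]

# One capturing group so re.split keeps the emoji runs in its output.
_EMOJI_RUN = re.compile(
    "([" + "".join(f"{chr(lo)}-{chr(hi)}" for lo, hi in EMOJI_RANGES) + "]+)"
)


def translate_keeping_emojis(text, translate):
    """Translate the non-emoji segments of text and pass emoji runs through."""
    pieces = []
    for part in _EMOJI_RUN.split(text):
        if not part:
            continue
        if _EMOJI_RUN.fullmatch(part) or not part.strip():
            pieces.append(part)  # emoji run or pure whitespace: keep as is
            continue
        # Preserve surrounding whitespace so segments rejoin cleanly.
        lead = part[: len(part) - len(part.lstrip())]
        trail = part[len(part.rstrip()):]
        pieces.append(lead + translate(part.strip()) + trail)
    return "".join(pieces)
```

For example, `translate_keeping_emojis("Well done 👍", str.upper)` keeps the thumbs-up in place while the stand-in "translator" only sees "Well done". The trade-off is that each segment is translated without the context of its neighbours.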

Cursor not themed in snap

Using the snap, the cursor does not follow the system theme. It's more obvious if you change the cursor theme to something that looks different, like "redglass", but even the Ubuntu default Yaru theme and size are not followed. One Qt app snap where this does work is KeePassXC. It looks like their snapcraft.yaml has some additional plugs for theming.

Command-line Usage

Would it be possible to support command-line usage? I searched the documentation but found nothing. I would like to automate translating texts and also text files into multiple languages.

As an example I suggest the following:
argos-translate -text "Hello World!" -from en -to de
argos-translate -file Novel.txt -from en -to de
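A command-line front end along these lines could be sketched with argparse as below. The `--from-lang` and `--to-lang` names follow the argos-translate-cli invocation quoted in the segmentation fault report; the `--file` option, the program name, and the helper functions are hypothetical, and the actual translation call is left to the caller.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical translate command."""
    parser = argparse.ArgumentParser(
        prog="translate-sketch",
        description="Translate a text argument or a text file.",
    )
    parser.add_argument("text", nargs="?", help="text to translate")
    parser.add_argument("--file", help="path of a UTF-8 text file to translate")
    parser.add_argument("--from-lang", required=True, help="source language code")
    parser.add_argument("--to-lang", required=True, help="target language code")
    return parser


def read_input(args):
    """Return the text to translate, preferring --file over the positional text."""
    if args.file:
        with open(args.file, encoding="utf-8") as handle:
            return handle.read()
    if args.text is None:
        raise SystemExit("error: provide text or --file")
    return args.text
```

Translating into several languages then becomes a shell loop over `--to-lang` values.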

Improve Training scripts

The training scripts have lots of room for improvement. The long term plan is to rewrite them in OpenNMT for PyTorch in a fully automatable way but there are other potential improvements:

Auto download data from the Opus Parallel Corpus
Auto stop training after a set number of epochs
Cleaner implementation/better docs

Close in system tray

Please allow for Argos Translate to close into system tray rather than take up room in the panel.

Decrease Snapcraft distribution size

Currently the Snapcraft image is ~1GB; ~700MB of this is a torch CUDA shared object file. If this could be removed automatically in the Snapcraft build process somehow (or maybe as an option for all Python installs?), the download and startup time for Snapcraft would greatly improve (both are currently issues).

Packages window doesn't handle resizing well

Scale package manager window on package delete

When packages are deleted the packages table shrinks but the window doesn't.

Support for Slovak and/or Czech language

Is there any plan to support the above-mentioned languages? If so, how can I help?

List license for language models.

I see that this repository is licensed under the MIT license, but the language models are hosted outside of this repository and can be downloaded over HTTPS, IPFS, or torrent.

Does the same MIT license apply to the models as well, or are they distributed under a different license?

This should probably be listed somewhere.

Add Tests

We currently don't have any tests, but it would be nice to. Not being able to include a .argostranslate file in the tests easily will make this more difficult but at least having some tests would be good.

PIP shows conflicting dependencies

$ pip install argostranslate

Collecting argostranslate
  Using cached argostranslate-1.0.5-py3-none-any.whl (13 kB)
  Using cached argostranslate-1.0.3-py3-none-any.whl (12 kB)
  Using cached argostranslate-1.0-py3-none-any.whl (12 kB)
ERROR: Cannot install argostranslate==1.0, argostranslate==1.0.3 and argostranslate==1.0.5 because these package versions have conflicting dependencies.

The conflict is caused by:
    argostranslate 1.0.5 depends on ctranslate2==1.14.0
    argostranslate 1.0.3 depends on ctranslate2==1.14.0
    argostranslate 1.0 depends on ctranslate2

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/user_guide/#fixing-conflicting-dependencies

Configuration/environment:

--------------------------------------------------------------------------------
  Date: Wed Dec 23 15:59:00 2020 CET

                OS : Darwin
            CPU(s) : 12
           Machine : x86_64
      Architecture : 64bit
       Environment : Python

  Python 3.7.7 (default, Mar 23 2020, 17:31:31)  [Clang 4.0.1
  (tags/RELEASE_401/final)]

             numpy : 1.19.4
           IPython : 7.19.0
            scooby : 0.5.6
--------------------------------------------------------------------------------

pip 20.3.3

Continuous integration

The plan is to use GitLab for CI. The first step is to update scripts/update_to_pypi.sh to automate uploading a new version to PyPI.

PyQt signals logic in GUI

Fix packages_changed = pyqtSignal() in gui.py to correctly update all views when the state of packages has changed.

Zoom in on text view in GUI

https://youtu.be/1yLSLgeFzYY?t=68

Persistent Settings

arm64 support (Librem 5 phone etc.)

This won't install on a Librem phone...

$snap install argos-translate
error: snap "argos-translate" is not available on stable for this architecture (arm64) but exists on other architectures (amd64)

This seems like a very handy library. Why does it not run on ARM computers?

Weird english -> japanese translations (bad training data?)

I'm using argos-translate via libretranslate, so if this is the wrong place for this, I'll move it.

I'm testing out the english -> japanese translations and I think some bad data might have gotten into the training data.

"Hello" is being translated as "お問い合わせ" which translates to "Contact Us" (something you'd expect to see at the bottom of a webpage used for training?)

"Goodbye" is being translated as "フィードバック" (feedback). Again, something you'd expect to see at the bottom of a webpage.

"Help me!" is also being translated as "お問い合わせ".

Not exactly sure how I can help, but I figured I'd point out the issue.

[feature request] copy and paste clipboard in gui

It would be handy to have buttons to:

paste text from clipboard into the gui for translation
copy text from translation into clipboard

Do Screenshots on a Qt based Distro

The screenshots promoted by Argos are taken from a GNOME display environment. GNOME itself does not have the best integration with Qt; GNOME uses GTK.

Maybe take these screenshots on a display environment with a better Qt integration like KDE Plasma.

Better Handle HTML input

Python version doesn't show app icon

The Python version doesn't show the application icon (this does work in Snapcraft because it uses a separate icon file):

argos-translate/argostranslate/gui.py

Lines 194 to 197 in 4f5396a

    # Icon
    icon_path = Path(os.path.dirname(__file__)) / 'img' / 'icon.png'
    icon_path = str(icon_path)
    self.setWindowIcon(QIcon(icon_path))

Localization

Currently there is no custom localization, but this would be nice to have. Qt provides some nice tools for doing this, and the app's strings could be translated using the app itself.

Support Language Detection

The plan for this was to train a model using the existing infrastructure that maps from input text to a language code. This would require adding a way to generate this data in the training scripts and what is hopefully a pretty small code change to support this. I'd be pretty optimistic about this just working pretty well out of the box but it may take some tweaking.

OMP: Error #15: Initializing libiomp5.dylib, but found libiomp5.dylib already initialized

Hi,
Thanks for the amazing project. This was a fresh install on a new python 3.8.3 virtual environment using pip install and launching straight away. The web app launches, but after a few key strokes, the process crashes with the following error logged to the terminal:
OMP: Error #15: Initializing libiomp5.dylib, but found libiomp5.dylib already initialized.
OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.
Abort trap: 6

Any advice appreciated here, thanks again.
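The error message itself names an environment variable as an unsafe workaround. If someone wants to try it from Python while tracking down the duplicate OpenMP runtime, a minimal sketch is to set it before any library that loads OpenMP is imported. This only applies the workaround quoted in the message; it does not remove the underlying duplicate linkage.

```python
import os

# Workaround quoted in the OMP hint above; described there as unsafe and
# unsupported. It must be set before importing libraries that load OpenMP.
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"

# Imports of the affected libraries would go below this line.
```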

Port to more platforms

The easiest are probably macOS and Windows using py2app and py2exe, but other platforms to consider could be mobile, BSD, Debian, Red Hat, or Flatpak. I'd like to be able to run builds on Linux as much as possible, but this may not be possible for some platforms.

There's also a decision to be made if we want to use tools like py2app/py2exe or go all in on pyqtdeploy.

There are probably some challenges for doing local translation on mobile so a better strategy may be to build/port simple mobile apps that connect to the LibreTranslate API.

Add Language: Japanese

Better model distribution

Currently models are distributed by Google Drive (not ideal) and a slow BitTorrent, so there's lots of room for improvement:

More Torrent seeders
Create individual torrent files for each model
HTTP or FTP mirrors
I avoided git distribution because I was worried about running into GitHub limits but we may want to link to the LibreTranslate Git Mirror
Open to other ideas too

The plan was to make a separate repo for storing model distribution information, so let me know if you're interested.

GUI support for downloading packages

There is now a package index that can be updated, and packages can be automatically downloaded from Python. GUI support would be nice.

8ae1c39

GPU Support

When I first wrote this, CTranslate, which does inference, didn't support GPU translation from PyPI. This has since changed, and GPU support would be a nice feature to have. All this may take is updating the CTranslate version in requirements.txt and adding documentation, but if someone with more CUDA knowledge could look into this I would appreciate it. It would also be nice to support open-source alternatives to CUDA.

Argos Translate also prints an error message about torch not being able to connect to CUDA:

/usr/local/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0

Torch is only used by Stanza, which does sentence boundary detection, so GPU inference there isn't as important for performance as CTranslate supporting GPUs, but this error message should be suppressed.
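One way to silence this particular message is a warnings filter installed before torch is first imported. The sketch below matches on the start of the message text and the UserWarning category shown in the output above; whether every torch version emits it with the same wording is an assumption, and `suppress_cuda_init_warning` is a hypothetical helper.

```python
import warnings


def suppress_cuda_init_warning():
    """Ignore the 'CUDA initialization: ...' UserWarning shown above."""
    warnings.filterwarnings(
        "ignore",
        message=r"CUDA initialization",  # regex matched at the start of the message
        category=UserWarning,
    )
```

Because the filter is keyed on the message prefix, other UserWarnings still reach the user.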

deploy in windows

I want to deploy this on Windows.
Can you please guide me in detail?

Automated download of data for training scripts

Training language models is currently very manual. Opus has an API to gather data: https://pypi.org/project/opus-api/

Show loading indicator while installing models

Installing a large number of models requires unzipping them which can take a while. A loading indicator should be shown so that users don't think the program has frozen.

How to participate to translation?

I want to fix some translation issues and add new strings, but I don't understand how I can contribute to the project. Are there any instructions on how to change the dictionary database? As I understand it, I should work with the https://github.com/argosopentech/onmt-models repository?

Give better error message when CLI doesn't have packages installed

#3 (comment)

Filter HTML entities in training scripts data

Hello, I've noticed a bug when translating something to French.
Sometimes the HTML entity ' appears instead of the apostrophe.
Some examples: