
libindic / indic-trans


The project aims to add a state-of-the-art transliteration module for cross-transliteration among all Indian languages, including English.

License: GNU Affero General Public License v3.0

Makefile 0.19% Python 96.92% Cython 2.89%

indic-trans's People

Contributors

ashutoshsingh0223, copyninja, deepthi-chand, gokulnc, irshadbhat, jishnu7, rajiv256, ritwikmishra, stultus



indic-trans's Issues

Embedded Executable

Cython has the --embed option to build a standalone executable (see https://stackoverflow.com/questions/22507592/making-an-executable-in-cython?lq=1 and http://masnun.rocks/2016/10/01/creating-an-executable-file-using-cython/ for more info), so it should be possible to compile the Python files to C modules and then compile the resulting object files into an executable for the local architecture.
All dependencies would be resolved statically, with dynamic linking only for system libraries.

This project uses Cython extensions in setup.py; is there any way to add support for embedding an executable?

Thank you.

Unable to install Indic-Trans

Hi,
I have Python 3.9 and Python 3.12 (only one active at any given time).
I have Microsoft Visual Studio Community 2022 (64-bit), version 17.8.0.
The OS is Windows 11.

I am trying to install Indic-Trans but have not been able to; I keep getting multiple errors that I cannot resolve.

building 'indictrans._decode.beamsearch' extension
"C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.38.33130\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -IC:\Users\localadmin\AppData\Local\Temp\pip-build-env-j5amqm2w\overlay\Lib\site-packages\numpy\core\include -ID:\Python312\include -ID:\Python312\Include "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.38.33130\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.38.33130\ATLMFC\include" "-IC:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Auxiliary\VS\include" /Tcindictrans/_decode/beamsearch.c /Fobuild\temp.win-amd64-cpython-312\Release\indictrans/_decode/beamsearch.obj
beamsearch.c
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.38.33130\include\vadefs.h(61): error C2371: 'uintptr_t': redefinition; different basic types
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.38.33130\include\sys/stdint.h(58): note: see declaration of 'uintptr_t'
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.38.33130\include\vcruntime.h(195): error C2371: 'intptr_t': redefinition; different basic types
C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.38.33130\include\sys/stdint.h(57): note: see declaration of 'intptr_t'

I have tried everything I could think of, but nothing has worked and the installation does not go through smoothly. Any support on this issue would be appreciated.
Thanks

installation error

I'm trying to install it on my Mac but it's giving this error:

[1/4] Cythonizing indictrans/_decode/beamsearch.pyx
[2/4] Cythonizing indictrans/_decode/viterbi.pyx
[3/4] Cythonizing indictrans/_utils/ctranxn.pyx
[4/4] Cythonizing indictrans/_utils/sparseadd.pyx
ERROR:root:Error parsing
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/pbr-3.0.0-py2.7.egg/pbr/core.py", line 111, in pbr
attrs = util.cfg_to_args(path, dist.script_args)
File "/Library/Python/2.7/site-packages/pbr-3.0.0-py2.7.egg/pbr/util.py", line 249, in cfg_to_args
pbr.hooks.setup_hook(config)
File "/Library/Python/2.7/site-packages/pbr-3.0.0-py2.7.egg/pbr/hooks/init.py", line 25, in setup_hook
metadata_config.run()
File "/Library/Python/2.7/site-packages/pbr-3.0.0-py2.7.egg/pbr/hooks/base.py", line 27, in run
self.hook()
File "/Library/Python/2.7/site-packages/pbr-3.0.0-py2.7.egg/pbr/hooks/metadata.py", line 26, in hook
self.config['name'], self.config.get('version', None))
File "/Library/Python/2.7/site-packages/pbr-3.0.0-py2.7.egg/pbr/packaging.py", line 748, in get_version
name=package_name))
Exception: Versioning for this project requires either an sdist tarball, or access to an upstream git repository. It's also possible that there is a mismatch between the package name in setup.cfg and the argument given to pbr.version.VersionInfo. Project name indictrans was given, but was not able to be found.
error in setup command: Error parsing /Users/muhammadsharjeel/Desktop/indic-trans-master/setup.cfg: Exception: Versioning for this project requires either an sdist tarball, or access to an upstream git repository. It's also possible that there is a mismatch between the package name in setup.cfg and the argument given to pbr.version.VersionInfo. Project name indictrans was given, but was not able to be found.

Source text Tokenization

I'm dealing with an Assamese (asm) dataset from a Wikipedia dump, and I have found very long sentences without any word boundaries, like

বৈৰাগীমঠইয়াৰউচ্চমানদণ্ডৰশিক্ষানুষ্ঠানৰবাবেঅসমবিখ্যাতইয়াতপ্ৰাথমিকস্তৰৰপৰাউচ্চস্তৰলৈশিক্ষানুষ্ঠানআছেঅসমতথাউত্তৰপূৰ্বাঞ্চলৰএখনআগশাৰীৰকনিষ্ঠমহাবিদ্যালয়ছল্টব্ৰুকএকাডেমীইয়াতেইঅৱস্থিত

which indic-trans transliterates to Roman English as

bairagimathiaruchimandandrashikshanushthanravabssmavikhyatiatprathamikstarraparauchchastaralishikshanushthanachasmatthauttarapurbanchalraekhanagsharikanishimahavidyalayachaltbrookacademiyateiavasthit

Assuming the asm-eng model did a good job, my question is whether the source text was correct in terms of tokenization, i.e. does the input source text need any tokenization?

Sorry for my ignorance about that specific language; according to Google Translate the text is correct, even though it detects the language as Bengali (bn), while my language detector has a specific label for it. Google's transliteration was

Bairaāgīmaṭha'iẏāra'uccamānadaṇḍaraśikṣānuṣṭhānarabābē'asamabikhyāta'iẏātapraāthamikastararaparaā'uccastaralaiśikṣānuṣṭhāna'āchē'asamatathā'uttarapūrbāñcalara'ēkhana'āgaśāraīrakaniṣṭhamahābidyālaẏachalṭabrauka'ēkāḍēmī'iẏātē'i'arasthita

So I would assume that both are correct and there is no further need for any tokenization.

Thank you very much.

Get better results after human validation

Hi,
we are using indic-trans to transliterate from Hindi to Roman/English.
After applying your model I get good results in general, but there are still some errors, as some Hindi speakers have pointed out to us:

चैन should be chain, not chaiyn
कमल should be kamal, not camel
मिलने should be milne, not milane

Any advice on how to get better results? For example, increasing the training set, or choosing between beam search and Viterbi?

Thanks
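
For reference, a small sketch of how the two decoders can be compared on the problem words (it assumes that beam-search decoding returns a list of candidate strings, as reported in other issues in this tracker):

from indictrans import Transliterator

# Single-best output (Viterbi) vs. k-best candidates (beam search).
vit = Transliterator(source='hin', target='eng', decode='viterbi', build_lookup=True)
beam = Transliterator(source='hin', target='eng', decode='beamsearch')

for word, expected in [(u'चैन', u'chain'), (u'कमल', u'kamal'), (u'मिलने', u'milne')]:
    best = vit.transform(word)
    candidates = beam.transform(word)  # list of hypotheses with beamsearch
    print(word, best, candidates, expected in candidates)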

Tamil transliterator puts 'm' as 's'

Steps to reproduce the issue: Find the file "tamtrial2out.txt" and use

indictrans < /tamtrial2out.txt --s tam --t eng --build-lookup > /tamtrial2outrom.txt

you should get a "tamtrial2outrom.txt" and compare it with the attached file of the same name.

Issue: the 6th and the 2nd-to-last token, 'patus', should actually be 'patum'.

tamtrial2outrom.txt
tamtrial2out.txt

Support for Bhojpuri, Rajasthani

Thanks for this amazing work!
I'm not sure (sorry if I'm confusing some languages) whether the transliteration tool is missing models for the Bhojpuri and Rajasthani languages.
To clarify, I have composed this table of the current model support:

Lang          ISO 639-3  ISO 639-2  TRANS MODEL  NOTES
Hindi         hin        hi         YES
Tamil         tam        ta         YES
Telugu        tel        te         YES
Punjabi       pan        pa         YES          Panjabi
Marathi       mrt        mr         YES
Gujarati      guj        gu         YES
Bengali       ben        be         YES
Kannada       kan        kn         YES
Bhojpuri      bho        bh         NO           Bihari languages: Bhojpuri, Magahi, and Maithili
Malayalam     mal        ml         YES
Urdu          urd        ur         YES
Rajasthani    raj        -          NO
Odia (Oriya)  ori        or         YES
Assamese      asm        as         YES
Nepali        nep        ne         YES
Bodo          brx        -          YES
Konkani       kok        -          YES

Is that correct?
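
One way to double-check the table is to probe the installed package directly, constructing a Transliterator per source code and catching the NotImplementedError raised for unsupported pairs (a sketch only; the codes are copied verbatim from the table above):

from indictrans import Transliterator

codes = ['hin', 'tam', 'tel', 'pan', 'mrt', 'guj', 'ben', 'kan', 'bho',
         'mal', 'urd', 'raj', 'ori', 'asm', 'nep', 'brx', 'kok']

for src in codes:
    try:
        Transliterator(source=src, target='eng')
        status = 'available'
    except NotImplementedError:
        status = 'not implemented'
    except Exception as exc:          # e.g. missing model files
        status = 'error: %s' % type(exc).__name__
    print('%s -> eng: %s' % (src, status))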

initialisation error of Transliterator

trn = Transliterator(source='hin', target='eng', build_lookup=True)


ValueError Traceback (most recent call last)
in
----> 1 trn = Transliterator(source='hin', target='eng', build_lookup=True)

/usr/local/lib/python3.6/dist-packages/indictrans/transliterator.py in __init__(self, source, target, decode, build_lookup, rb)
89 'Language pair %s-%s is not implemented.' %
90 (source, target))
---> 91 i2o = Ind2Target(source, target, decoder, build_lookup)
92 self.transform = _get_trans(i2o, decode)
93 else:

/usr/local/lib/python3.6/dist-packages/indictrans/script_transliterate.py in __init__(self, source, target, decoder, build_lookup)
20 target,
21 decoder,
---> 22 build_lookup)
23 self.letters = set(string.ascii_letters)
24 self.non_alpha = re.compile(r"([^a-zA-Z%s]+)" % (self.esc_ch))

/usr/local/lib/python3.6/dist-packages/indictrans/base.py in __init__(self, source, target, decoder, build_lookup)
67 self.esc_ch = '\x00' # escape-sequence for Roman in WX
68 self.dist_dir = os.path.dirname(os.path.abspath(__file__))
---> 69 self.base_fit()
70
71 def load_models(self):

/usr/local/lib/python3.6/dist-packages/indictrans/base.py in base_fit(self)
115 def base_fit(self):
116 # load models
--> 117 self.load_models()
118 # load mapping tables for Urdu
119 if 'urd' in [self.source, self.target]:

/usr/local/lib/python3.6/dist-packages/indictrans/base.py in load_models(self)
78 '%s/models/%s/classes.npy' %
79 (self.dist_dir, model),
---> 80 encoding='latin1')[0]
81 self.coef_ = np.load(
82 '%s/models/%s/coef.npy' % (self.dist_dir, model),

/usr/local/lib/python3.6/dist-packages/numpy/lib/npyio.py in load(file, mmap_mode, allow_pickle, fix_imports, encoding)
445 else:
446 return format.read_array(fid, allow_pickle=allow_pickle,
--> 447 pickle_kwargs=pickle_kwargs)
448 else:
449 # Try a pickle

/usr/local/lib/python3.6/dist-packages/numpy/lib/format.py in read_array(fp, allow_pickle, pickle_kwargs)
690 # The array contained Python objects. We need to unpickle the data.
691 if not allow_pickle:
--> 692 raise ValueError("Object arrays cannot be loaded when "
693 "allow_pickle=False")
694 if pickle_kwargs is None:

ValueError: Object arrays cannot be loaded when allow_pickle=False
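
This looks like the numpy >= 1.16.3 change that makes np.load default to allow_pickle=False, while the shipped classes.npy files contain object arrays. A possible stop-gap (a sketch only; it globally re-enables pickle loading, which is safe only for model files you trust) is to wrap np.load before the models are read; the later "File Not Found Error" traceback in this tracker suggests newer versions pass allow_pickle=True themselves:

import functools
import numpy as np

# Workaround only: re-enable pickle loading before the models are read.
np.load = functools.partial(np.load, allow_pickle=True)

from indictrans import Transliterator
trn = Transliterator(source='hin', target='eng', build_lookup=True)
print(trn.transform(u'हिंदी'))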

beamsearch arguments gets TypeError: coercing to Unicode: need string or buffer, list found

Hello, I have added beamsearch to the arguments passed to the Transliterator, like so:

# initialize transliterator object
decode = 'viterbi'
if args.beamsearch:
    decode = 'beamsearch'
trn = Transliterator(args.source,
                     args.target,
                     rb=args.rb,
                     build_lookup=args.build_lookup,
                     decode=decode)

but I get an error when using it:

Traceback (most recent call last):
  File "/usr/local/bin/indictrans", line 10, in <module>
    sys.exit(main())
  File "/Library/Python/2.7/site-packages/indictrans/__init__.py", line 138, in main
    process_args(args)
  File "/Library/Python/2.7/site-packages/indictrans/__init__.py", line 129, in process_args
    ofp.write(tline)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 357, in write
    data, consumed = self.encode(object, self.errors)
TypeError: coercing to Unicode: need string or buffer, list found
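
From the trace, ofp.write() receives a list rather than a string, which matches other reports here that beam-search decoding returns a list of k-best candidates. A local workaround sketch (input.txt and output.txt are placeholder file names) is to collapse the list before writing:

# -*- coding: utf-8 -*-
import io
from indictrans import Transliterator

trn = Transliterator(source='hin', target='eng', decode='beamsearch')

with io.open('input.txt', encoding='utf-8') as inp, \
        io.open('output.txt', 'w', encoding='utf-8') as out:
    for line in inp:
        tline = trn.transform(line.strip())
        if isinstance(tline, list):        # beamsearch returns k-best candidates
            tline = u' ||| '.join(tline)   # keep them all, or pick tline[0]
        out.write(tline + u'\n')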

[QUESTION] How to get the exact length of Hindi text without Mark, Nonspacing characters

In my source Indic text (like Hindi) I'm dealing with text tokenization. I'm currently using https://github.com/irshadbhat/polyglot-tokenizer for this task. Along the way I came across the problem of Unicode characters in the Mark, Nonspacing (Mn) category - https://www.fileformat.info/info/unicode/category/Mn/list.htm - and I wonder whether libindic handles these internally in some way.
This was my approach: https://stackoverflow.com/questions/54345897/fastest-way-to-count-non-spacing-chars-in-unicode-text-in-python
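
For reference, the counting approach from that question boils down to skipping code points whose Unicode category is Mn; whether libindic does something equivalent internally is exactly what I would like to know:

# -*- coding: utf-8 -*-
import unicodedata

def visible_length(text):
    """Length of text, ignoring combining marks in category Mn."""
    return sum(1 for ch in text if unicodedata.category(ch) != 'Mn')

word = u'हिन्दी'
print(len(word), visible_length(word))  # raw code points vs. Mn marks excluded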

Thank you.

File Not Found Error


FileNotFoundError Traceback (most recent call last)
d:\NaMo\States\Uttar Pradesh\Codes[CM][UP] - Scoping.ipynb Cell 73 line 2
1 # indictrans setup
----> 2 trn = Transliterator(source = 'eng', target = 'hin', build_lookup = True)

File ~\AppData\Roaming\Python\Python310\site-packages\indictrans\transliterator.py:82, in Transliterator.__init__(self, source, target, decode, build_lookup, rb)
78 raise NotImplementedError(
79 'Language pair %s-%s is not implemented.' %
80 (source, target))
81 if source == 'eng':
---> 82 ru2i = Rom2Target(source, target, decoder, build_lookup)
83 else:
84 ru2i = Urd2Target(source, target, decoder, build_lookup)

File ~\AppData\Roaming\Python\Python310\site-packages\indictrans\script_transliterate.py:60, in Rom2Target.__init__(self, source, target, decoder, build_lookup)
59 def __init__(self, source, target, decoder, build_lookup=False):
---> 60 super(Rom2Target, self).__init__(source,
61 target,
62 decoder,
63 build_lookup)
64 self.non_alpha = re.compile(r"([^a-z]+)")
65 self.letters = set(string.ascii_letters[:26])

File ~\AppData\Roaming\Python\Python310\site-packages\indictrans\base.py:69, in BaseTransliterator.__init__(self, source, target, decoder, build_lookup)
...
79 (self.dist_dir, model),
80 encoding='latin1',
81 allow_pickle=True)[0]

FileNotFoundError: [Errno 2] No such file or directory: 'C:\Users\Admin\AppData\Roaming\Python\Python310\site-packages\indictrans/models/eng-hin/sparse.vec'
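
For what it's worth, the models directory accounts for most of the repository's ~516 MB, so this error usually means the package was installed without the model files (for example from an incomplete copy of the source). A quick check of what actually got installed (a sketch, nothing more):

import os
import indictrans

models_dir = os.path.join(os.path.dirname(indictrans.__file__), 'models')
print(models_dir)
print(sorted(os.listdir(models_dir)))  # should list pairs such as 'eng-hin', 'hin-eng', ...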

Blind Spots in Gujarati and Oriya

Hello!

I am encountering some blind spots where the result is empty.

Examples

  • GU: ૐ
  • OR: ୱ୍ୱି and ଵୀ
from indictrans import Transliterator
trn = Transliterator(source='ori', target='eng', build_lookup=True)
trn.transform('''ୱ୍ୱି''')

I am not sure if the first example is really GU, but I think the second OR example is valid.

Any solution?
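
As a stop-gap, one option (purely an assumption that these characters are missing from the training data, and with romanizations that are only my guesses) is a tiny hand-made fallback map for when transform() comes back empty:

# -*- coding: utf-8 -*-
from indictrans import Transliterator

trn = Transliterator(source='ori', target='eng', build_lookup=True)

# Hand-made guesses for characters the model seems to drop.
FALLBACK = {u'ୱ': u'w', u'ଵ': u'v'}

def transliterate(text):
    out = trn.transform(text)
    if out and out.strip():
        return out
    # Unmapped characters (including matras) pass through unchanged.
    return u''.join(FALLBACK.get(ch, ch) for ch in text)

print(transliterate(u'ୱ୍ୱି'))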

Thanks!

Wrong symbol appears in urdu-to-hindi transliteration

I used your software (via the Python wrapper) to transliterate some Urdu words into their equivalent Hindi written form.
I have found that a wrong symbol, "M" (the Latin letter M), appears during processing of several words, for instance
اورانجنیرنگ
اور ْا ِن ْج ِن ِیر ِن ْگ ْ
ہو.ننگا
which were converted into
औरइMज िन ीयर ि ंग
औरइMज िन ीयर ि ंग
ह ो ंMग ा
respectively.

ValueError: Buffer dtype mismatch, expected 'npy_intp' but got 'long'

While following the model training guide, ran into this error.

ValueError                                Traceback (most recent call last)
Input In [10], in <cell line: 1>()
----> 1 clf = trunk.train_sp(X, y, n_iter=5, verbose=2)

File ~\indic-trans\indictrans\trunk\__init__.py:69, in train_sp(X, y, n_iter, lr_exp, random_state, verbose)
     65 def train_sp(X, y, n_iter=10, lr_exp=0.1,
     66              random_state=37, verbose=0):
     67     clf = StructuredPerceptron(random_state=random_state,
     68                                n_iter=n_iter, verbose=verbose)
---> 69     clf.fit(X, y)
     70     return clf

File ~\indic-trans\indictrans\trunk\perceptron.py:149, in StructuredPerceptron.fit(self, X, y)
    146 Y_diff *= -lr
    147 w_update = Y_diff.T * X_i
--> 149 t_trans = count_tranxn(y_t_i, n_classes)
    150 p_trans = count_tranxn(y_pred, n_classes)
    151 b_trans_update = lr * (p_trans - t_trans)

File ~\indic-trans\indictrans\_utils\ctranxn.pyx:7, in indictrans._utils.ctranxn.count_tranxn()
      5 np.import_array()
      6 
----> 7 @cython.boundscheck(False)
      8 @cython.wraparound(False)
      9 def count_tranxn(np.ndarray[ndim=1, dtype=np.npy_intp] y, n_classes):

ValueError: Buffer dtype mismatch, expected 'npy_intp' but got 'long'

encountering gcc error

This is the error I am getting:
"distutils.errors.CompileError: command 'gcc' failed with exit status 1\n"

Since cython==0.24.0a0 is not available, I am using cython==0.24 and python==3.5.

Transliterate proper nouns from a dictionary.

Is it possible to transliterate proper nouns from a given dictionary rather than taking them from the model's output?

For example, say I want to create an FD.
Can I give a dictionary like {"FD": "ऍफ़डी"} so that the system ignores the model's output for this word?
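
Something like the thin wrapper below would also work for us, in case that is easier than adding it to the library (a sketch only; the names and behaviour are assumptions, not part of indictrans):

# -*- coding: utf-8 -*-
import re
from indictrans import Transliterator

trn = Transliterator(source='eng', target='hin', build_lookup=True)

# Human-curated overrides: the model's output is ignored for these tokens.
OVERRIDES = {u'FD': u'ऍफ़डी'}

def transform_with_dict(text):
    out = []
    for tok in re.split(r'(\s+)', text):   # keep whitespace chunks intact
        if not tok.strip():
            out.append(tok)
        elif tok in OVERRIDES:
            out.append(OVERRIDES[tok])
        else:
            out.append(trn.transform(tok))
    return ''.join(out)

print(transform_with_dict(u'open an FD today'))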

import error: ctranxn

Context

Python 2.7 (also re-created with 3.5)
Darwin Kernel Version 15.6.0
macosx-10.11-x86_64

installed packages (after following installation steps):

$ pip freeze
Cython==0.24.1
future==0.15.2
indictrans==0.0.1.dev285
numpy==1.11.1
pbr==1.10.0
scipy==0.18.0
six==1.10.0

What I did

Installed package by following steps in README.rst. Attempted two things:

$ indictrans --h

and

$ python
>>> from indictrans import Transliterator

Expected behavior

A help message from the command line call and the ability to import within the python interpreter

Actual behavior

Got ImportError.

Python 2.7 version:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "indictrans/__init__.py", line 9, in <module>
    from ._utils import UrduNormalizer, WX
  File "indictrans/_utils/__init__.py", line 6, in <module>
    from .ctranxn import count_tranxn
ImportError: No module named ctranxn

Python 3.5:

Traceback (most recent call last):
  File "/Users/bob/src/indic-trans/env/bin/indictrans", line 6, in <module>
    from indictrans import main
  File "/Users/bob/src/indic-trans/indictrans/__init__.py", line 9, in <module>
    from ._utils import UrduNormalizer, WX
  File "/Users/bob/src/indic-trans/indictrans/_utils/__init__.py", line 6, in <module>
    from .ctranxn import count_tranxn
ImportError: No module named 'indictrans._utils.ctranxn'

I'm also attaching the full backscroll of the compilation, etc, for more info.
indictrans_console.txt

Dataset for training

Hello, reading the paper "IIIT-H System Submission for FIRE2014 Shared Task on Transliterated Search" I read about the datasets used for training, which were:

• Monolingual corpora of English, Hindi and Gujarati in their native scripts.
• Word lists with corpus frequencies for English, Hindi, Bengali and Gujarati.
• Word transliteration pairs for Hindi-English, Bengali-English and Gujarati-English.

plus additional crawled Romanized data.

Would it be possible to provide these datasets in order to train the system from scratch?

Thank you.

About Sanskrit and Hindi

In my findings, the hin-eng model is capable of transliterating input text in Sanskrit even though the selected model was hin (Hindi). This of course does not happen when the input text is, say, urd or pan. Can we therefore say that, since Sanskrit and Hindi share the Devanagari script, the model transcribes it anyway?

Like in the following cases

text transliteration
षट्त्रिंशकात्मकजगद्गगनावभास- shattrinshkatmakajgadgaganavbhas-
संविन्मरीचिचयचुम्बितबिम्बशोभम् sanvinmarichichaychumbitbimbashobham
षट्त्रिंशकं भरतसूत्रमिदं विवृण्व- shattrinshakan bharatsutramidan vivrunv-
न्वन्दे शिवं श्रुतितदर्थविवेकधाम nvande shivan shrutitadarthvivekdham

or

text transliteration
यदेतत्कालिन्दी-तनुतर-तरङ्गाकृति शिवे yadetatkalindi-tanutar-tarangakriti shive
कृशे मध्ये किञ्चिज्जननि तव यद्भाति सुधियाम् krishe madhye kinchijjanani tav yadbhaati sudhiyam
विमर्दा-दन्योन्यं कुचकलशयो-रन्तरगतं vimarda-danyonyan kuchkalashyo-rantaragatan
तनूभूतं व्योम प्रविशदिव नाभिं कुहरिणीम् tanubhutan vyom pravishadiv naabhin kuharinim
स्थिरो गङ्गा वर्तः स्तनमुकुल-रोमावलि-लता sthiro ganga vartah stanmukul-romavali-lata
कलावालं कुण्डं कुसुमशर तेजो-हुतभुजः kalavalan kundan kusumshar tejo-hutbhujah
रते-र्लीलागारं किमपि तव नाभिर्गिरिसुते rate-rlilagaaran kimapi tav nabhirgirisute

Thank you.

About Kannada and Hindi Script to Roman models

Hello, I have a question about Kannada. Given these Roman words

sundara sundaraa sunder
chandiraa chandira chandir
mana man

I get this script

ಸುಂದರಾ ಸುಂದರಾ ಸುಂದರ
ಚಂದೀರಾ ಚಂದೀರಾ ಚಂದಿರ
ಮನಾ ಮನ

So it seems that the model acts in the same (maybe wrong?) way for some words, like

mana -> ಮನಾ, when actually it should be mana -> ಮನ.

The same happens for sundara (sunder) and chandira (chandir) and the related transliterations.
I'm using the kan-eng / eng-kan model in this case.

ValueError: numpy.ufunc size changed

I have cloned the repository and built it successfully, but I am still receiving this error:
Traceback (most recent call last):
File "/Users/asifkhan/Development/testenv/bin/indictrans", line 5, in
from indictrans import main
File "/Users/asifkhan/Development/testenv/lib/python3.6/site-packages/indictrans/init.py", line 9, in
from ._utils import UrduNormalizer, WX
File "/Users/asifkhan/Development/testenv/lib/python3.6/site-packages/indictrans/_utils/init.py", line 6, in
from .ctranxn import count_tranxn
File "indictrans/_utils/ctranxn.pyx", line 1, in init indictrans._utils.ctranxn
ValueError: numpy.ufunc size changed, may indicate binary incompatibility. Expected 216 from C header, got 192 from PyObject

How can one improve urd to eng transliteration

Using the indic-trans module for transliteration of urd to eng is not 99% correct. For example, I passed
urd='''محمد حفیظ پاکستانی کرکٹ ٹیم کے لیے مزید کھیلنا چاہتے تھے: شاہد آفرید''' and it transliterates into English as '''mohammad hafiz pakistani crookat tem ke liye majid khelnaa chahate the: shahid afaredi''', which is not 99% correct; it should be "Muhammad Hafeez pakistani cricket team key liye mazeed khalna chahatey thay: shahid afridi".

Is there a way we can improve the mapping ourselves? Or what is your advice?
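
One pragmatic option until the model improves (a sketch only; the correction table below is hand-made from the example above) would be to post-edit the output with fixes collected from native-speaker review:

# -*- coding: utf-8 -*-

# Corrections gathered from native-speaker review (model output -> preferred form).
CORRECTIONS = {
    u'mohammad hafiz': u'Muhammad Hafeez',
    u'crookat': u'cricket',
    u'tem': u'team',
    u'majid': u'mazeed',
    u'khelnaa': u'khalna',
}

def post_edit(romanized):
    # Naive substring replacement; a token-level pass would be safer.
    for wrong, right in CORRECTIONS.items():
        romanized = romanized.replace(wrong, right)
    return romanized

print(post_edit(u'mohammad hafiz pakistani crookat tem ke liye majid khelnaa chahate the'))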

Is Hindi-Urdu transliteration lossless?

First of all, thanks a lot for this repo!
It's very useful.

Just wanted to know if the PersoArabic to Devanagari transliteration is lossless.
That is, will Urdu->Hindi->Urdu or Hindi->Urdu->Hindi always retain all the same original characters?

Also, is only rule-based transliteration available for Urdu?
@irshadbhat
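
One way to test the losslessness question empirically would be a round-trip check; the sketch below assumes both the urd-hin and hin-urd pairs are available, which is part of what is being asked:

# -*- coding: utf-8 -*-
from indictrans import Transliterator

u2h = Transliterator(source='urd', target='hin', build_lookup=True)
h2u = Transliterator(source='hin', target='urd', build_lookup=True)

for word in [u'محمد', u'کرکٹ']:      # any Urdu words of interest
    hin = u2h.transform(word)
    back = h2u.transform(hin)
    print(word, hin, back, back == word)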

from indictrans import Transliterator

While importing Transliterator, it throws a numpy error:

ImportError Traceback (most recent call last)

in <cell line: 2>()
1 # build_lookup saves time for big corpus. Transliterate hindi text into english
----> 2 from indictrans import Transliterator

2 frames

/usr/local/lib/python3.9/dist-packages/indictrans/__init__.py in
7 import argparse
8
----> 9 from ._utils import UrduNormalizer, WX
10 from .transliterator import Transliterator
11

/usr/local/lib/python3.9/dist-packages/indictrans/_utils/__init__.py in
4
5 from .wx import WX
----> 6 from .ctranxn import count_tranxn
7 from .sparseadd import sparse_add
8 from .one_hot_encoder import OneHotEncoder

/usr/local/lib/python3.9/dist-packages/indictrans/_utils/ctranxn.pyx in init indictrans._utils.ctranxn()

__init__.pxd in numpy.import_array()

ImportError: numpy.core.multiarray failed to import

This is how I installed it:

!git clone https://github.com/libindic/indic-trans.git

%cd indic-trans
!pip install -r requirements.txt
!pip install .
%cd

SOLVED: x86_64-linux-gnu-gcc errors during installation

OS: Ubuntu 18.04
Python: 3.7
Virtualenv: 15.1.0

• Python.h: No such file or directory

copying indictrans/trunk/tests/hin2rom.tnt -> build/lib.linux-x86_64-3.7/indictrans/trunk/tests
running build_ext
building 'indictrans._decode.beamsearch' extension
creating build/temp.linux-x86_64-3.7
creating build/temp.linux-x86_64-3.7/indictrans
creating build/temp.linux-x86_64-3.7/indictrans/_decode
x86_64-linux-gnu-gcc -pthread -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/home/username/.local/lib/python3.7/site-packages/numpy/core/include -I/usr/include/python3.7m -c indictrans/_decode/beamsearch.c -o build/temp.linux-x86_64-3.7/indictrans/_decode/beamsearch.o
indictrans/_decode/beamsearch.c:36:10: fatal error: Python.h: No such file or directory
 #include "Python.h"
          ^~~~~~~~~~
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

Solution

sudo apt-get install python3.7-dev

Change "3.7" with your Python version.

Source

• could not create '/usr/lib/python3.7/site-packages': Permission denied

x86_64-linux-gnu-gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-Bsymbolic-functions -Wl,-z,relro -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 build/temp.linux-x86_64-3.7/indictrans/_utils/sparseadd.o -o build/lib.linux-x86_64-3.7/indictrans/_utils/sparseadd.cpython-37m-x86_64-linux-gnu.so
running install_lib
creating /usr/lib/python3.7/site-packages
error: could not create '/usr/lib/python3.7/site-packages': Permission denied

Solution

Add --user to the installation command.

python3 setup.py install --user

ImportError: No module named ctranxn

When I try to import Transliterator from indictrans I get this error:

from indictrans import Transliterator
Traceback (most recent call last):
File "", line 1, in
File "indictrans/init.py", line 9, in
from ._utils import UrduNormalizer, WX
File "indictrans/_utils/init.py", line 6, in
from .ctranxn import count_tranxn
ImportError: No module named ctranxn

python setup.py throw "ImportError: No module named Cython.Build"

I have Python 3.7,
Cython==0.29.15
scipy==1.4.1

Still, on running python setup.py, I get the error:

PS D:\indic-trans> py setup.py

Traceback (most recent call last):
  File "setup.py", line 7, in <module>
    from Cython.Build import cythonize

ImportError: No module named Cython.Build

Transliteration problem

indictrans <Ta_June_2017.dev.ta --s tam --t eng --build-lookup >tam_dev_rom.txt
Traceback (most recent call last):
File "/home/bharaj/anaconda2/bin/indictrans", line 10, in
sys.exit(main())
File "/home/bharaj/anaconda2/lib/python2.7/site-packages/indictrans/init.py", line 124, in main
process_args(args)
File "/home/bharaj/anaconda2/lib/python2.7/site-packages/indictrans/init.py", line 110, in process_args
build_lookup=args.build_lookup)
File "/home/bharaj/anaconda2/lib/python2.7/site-packages/indictrans/transliterator.py", line 91, in init
i2o = Ind2Target(source, target, decoder, build_lookup)
File "/home/bharaj/anaconda2/lib/python2.7/site-packages/indictrans/script_transliterate.py", line 22, in init
build_lookup)
File "/home/bharaj/anaconda2/lib/python2.7/site-packages/indictrans/base.py", line 69, in init
self.base_fit()
File "/home/bharaj/anaconda2/lib/python2.7/site-packages/indictrans/base.py", line 117, in base_fit
self.load_models()
File "/home/bharaj/anaconda2/lib/python2.7/site-packages/indictrans/base.py", line 75, in load_models
with open('%s/models/%s/sparse.vec' % (self.dist_dir, model)) as jfp:
IOError: [Errno 2] No such file or directory: u'/home/bharaj/anaconda2/lib/python2.7/site-packages/indictrans/models/tam-eng/sparse.vec'

some Bengali letters are probably missing, including ৎ

trn = Transliterator(source='ben', target='eng', decode='beamsearch')
eng = trn.transform(u'ৎ')

this returns ['', '', '', '', '']
eng = trn.transform(u'শরৎ')
this returns ['shar', 'sar', 'sha', 'ther', 'phor'], which should be sarat or sharat.

Also, for eng = trn.transform(u'চট্টোপাধ্যায়')
I get ['chattopadya', 'chattopadhya', 'chattopadyay', 'chattopaya', 'chattopadhyay'], where the last one is the form that should be used. How can I use that or set it as preferred? Can you also explain the training in more detail? I have gone through your blog, but as I am no expert in this I could not find my way. Please help, and thanks for such a useful utility.
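
As far as I can tell, with decode='beamsearch' transform() just returns the k-best list, so picking the preferred form is left to the caller. A sketch that prefers the first candidate found in a known-good lexicon (roman_lexicon.txt is a hypothetical file you would supply) and otherwise falls back to the top hypothesis:

# -*- coding: utf-8 -*-
import io
from indictrans import Transliterator

trn = Transliterator(source='ben', target='eng', decode='beamsearch')

# Hypothetical reference lexicon of acceptable romanizations, one per line.
with io.open('roman_lexicon.txt', encoding='utf-8') as f:
    LEXICON = set(line.strip() for line in f)

def best_candidate(word):
    candidates = trn.transform(word)   # k-best list from beam search
    for cand in candidates:
        if cand in LEXICON:
            return cand                # e.g. 'chattopadhyay' if it is listed
    return candidates[0]               # fall back to the top hypothesis

print(best_candidate(u'চট্টোপাধ্যায়'))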

License file not present

First of all, I would like to thank you for your work. Of all the transliteration libraries, this one gives the most accurate results, and we would like to use it for transliteration. However, there is no license file in the repository; can you please add one to this project?

URDU To ENGLISH Transliteration mapping issue

I have tested and found some mapping problems for Urdu to English. Here is an example:
if I give مزید it converts it as "MAJID", whereas it should convert it as "MAZEED". I have checked that the individual characters "ذ" and "ز" give the correct mapping as "Z"/"ZA".

Please look into it or advise what I can do to get the correct conversion.

Issue while importing the Package after successful installation

Hey, I was following the installation instructions from the repo itself and trying to install it on Google Colab, and did this:

%cd indic-trans
!pip install -r requirements.txt
!pip install .

Output:

remote: Enumerating objects: 1, done.
remote: Counting objects: 100% (1/1), done.
remote: Total 2207 (delta 0), reused 0 (delta 0), pack-reused 2206
Receiving objects: 100% (2207/2207), 516.51 MiB | 21.84 MiB/s, done.
Resolving deltas: 100% (1094/1094), done.
Checking out files: 100% (718/718), done.
/content/indic-trans
Requirement already satisfied: pbr in /usr/local/lib/python3.6/dist-packages (from -r requirements.txt (line 1)) (5.5.1)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from -r requirements.txt (line 2)) (1.15.0)
Requirement already satisfied: future in /usr/local/lib/python3.6/dist-packages (from -r requirements.txt (line 3)) (0.16.0)
Requirement already satisfied: cython>=0.24.0a0 in /usr/local/lib/python3.6/dist-packages (from -r requirements.txt (line 4)) (0.29.21)
Requirement already satisfied: numpy>=1.10.4 in /usr/local/lib/python3.6/dist-packages (from -r requirements.txt (line 5)) (1.19.5)
Requirement already satisfied: scipy>=0.13.3 in /usr/local/lib/python3.6/dist-packages (from -r requirements.txt (line 6)) (1.4.1)
Processing /content/indic-trans
Requirement already satisfied: pbr in /usr/local/lib/python3.6/dist-packages (from indictrans==1.2.3) (5.5.1)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from indictrans==1.2.3) (1.15.0)
Requirement already satisfied: future in /usr/local/lib/python3.6/dist-packages (from indictrans==1.2.3) (0.16.0)
Requirement already satisfied: cython>=0.24.0a0 in /usr/local/lib/python3.6/dist-packages (from indictrans==1.2.3) (0.29.21)
Requirement already satisfied: numpy>=1.10.4 in /usr/local/lib/python3.6/dist-packages (from indictrans==1.2.3) (1.19.5)
Requirement already satisfied: scipy>=0.13.3 in /usr/local/lib/python3.6/dist-packages (from indictrans==1.2.3) (1.4.1)
Building wheels for collected packages: indictrans
 Building wheel for indictrans (setup.py) ... done
 Created wheel for indictrans: filename=indictrans-1.2.3-cp36-cp36m-linux_x86_64.whl size=337575760 sha256=2769cffc8443c57526b8d81c4233453b510c010c2cdebeb021bbdc1d034e9e4c
 Stored in directory: /root/.cache/pip/wheels/fb/b5/0a/2fe0e82a9d815df9ef9224b1c214ad4f0476e9d6f104264bb2
Successfully built indictrans
Installing collected packages: indictrans
 Found existing installation: indictrans 1.2.3
   Uninstalling indictrans-1.2.3:
     Successfully uninstalled indictrans-1.2.3
Successfully installed indictrans-1.2.3

And it got built successfully. Yet, when I'm running:

from indictrans import Transliterator

It's throwing me this error:


ModuleNotFoundError                       Traceback (most recent call last)

<ipython-input-3-617ab5bb8b8d> in <module>()
----> 1 from indictrans import Transliterator
      2 trn = Transliterator(source='hin', target='eng', build_lookup=True)

1 frames

/content/indic-trans/indictrans/__init__.py in <module>()
      7 import argparse
      8 
----> 9 from ._utils import UrduNormalizer, WX
     10 from .transliterator import Transliterator
     11 

/content/indic-trans/indictrans/_utils/__init__.py in <module>()
      4 
      5 from .wx import WX
----> 6 from .ctranxn import count_tranxn
      7 from .sparseadd import sparse_add
      8 from .one_hot_encoder import OneHotEncoder

ModuleNotFoundError: No module named 'indictrans._utils.ctranxn'

Any ideas how to get rid of this?
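
One thing worth checking (an assumption based on the /content/indic-trans paths in the traceback): after %cd indic-trans, Python imports the indictrans package from the source checkout, where the Cython extensions (ctranxn and friends) were never built, instead of from the wheel pip installed. Moving out of the clone before importing should show whether that is the culprit:

import os

os.chdir('/content')          # leave the source checkout so it stops shadowing the installed wheel

import indictrans
print(indictrans.__file__)    # should point into site-packages, not /content/indic-trans

from indictrans import Transliterator
trn = Transliterator(source='hin', target='eng', build_lookup=True)
print(trn.transform(u'हिंदी'))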

Urdu to English transliteration not implemented?

Hi, thank you for your work!
Running this command
indictrans --s urd --t eng --build-lookup
resulted in:
NotImplementedError: Language "eng" is not implemented.

So is Urdu to English transliteration not implemented? I tried transliterating from Hindi to English and that worked.
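
If urd-eng really is not shipped, a possible (untested) workaround would be to chain two pairs that are available, Urdu to Hindi and then Hindi to English, accepting whatever is lost through the intermediate step:

# -*- coding: utf-8 -*-
from indictrans import Transliterator

# Chain two supported pairs as a stand-in for a missing urd-eng model.
u2h = Transliterator(source='urd', target='hin', build_lookup=True)
h2e = Transliterator(source='hin', target='eng', build_lookup=True)

def urd_to_eng(text):
    return h2e.transform(u2h.transform(text))

print(urd_to_eng(u'محمد حفیظ'))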

Throws multiple Traceback errors

I installed everything as described and it all went well without any problem. However, when running it, it throws these errors: (Screenshot)

indictrans error

I'm trying to transliterate this Urdu word-list into Roman (English). If possible, please upload it somewhere if you manage to convert it successfully, or any other list if you have one.

I think you should also add all the useful word-list files you've converted successfully to this or another repo.
