
Comments (12)

jekbradbury commented on August 26, 2024

Does the change in #52 solve your problem? You should be able to try it locally.

from text.

nelson-liu commented on August 26, 2024

It's not on PyPI yet, so you can't pip install it. In your terminal, cd to the location of the torchtext code on your disk (download or clone the repo first), then run: python setup.py install


mambuDL commented on August 26, 2024

Thanks for the reply. It solved my problem. Unfortunately, I now have another one. I tried to use the translation.py file inside the tests folder. To do so, I downloaded the same dataset and put it in a folder. In the code I only changed the path at line 25 so that it points to where I downloaded the data. I got the following error:

$ python translation.py 

    Warning: no model found for 'de'

    Only loading the 'de' tokenizer.


    Warning: no model found for 'en'

    Only loading the 'en' tokenizer.

Traceback (most recent call last):
  File "translation.py", line 27, in <module>
    fields=(DE, EN))
  File "build/bdist.linux-x86_64/egg/torchtext/data/dataset.py", line 56, in splits
  File "build/bdist.linux-x86_64/egg/torchtext/datasets/translation.py", line 35, in __init__
    EN.build_vocab(train.trg, max_size=50000)
  File "build/bdist.linux-x86_64/egg/torchtext/data/example.py", line 44, in fromlist
  File "build/bdist.linux-x86_64/egg/torchtext/data/field.py", line 83, in preprocess
  File "translation.py", line 14, in tokenize_de
    return [tok.text for tok in spacy_de.tokenizer(url.sub('@URL@', text))]
TypeError: Argument 'string' has incorrect type (expected unicode, got str)

This is the modification I made:

train, val = datasets.TranslationDataset.splits(
    path='~/myproject/data/de-en/', train='train.tags.de-en',
    validation='IWSLT16.TED.tst2013.de-en', exts=('.de', '.en'),
    fields=(DE, EN))

When I run ls under ~/myproject/data/de-en, this is the result:

IWSLT16.TED.dev2010.de-en.de.xml
IWSLT16.TED.tst2010.de-en.en.xml
IWSLT16.TED.tst2012.de-en.de.xml
IWSLT16.TED.tst2013.de-en.en.xml
IWSLT16.TEDX.dev2012.de-en.de.xml
IWSLT16.TEDX.tst2013.de-en.en.xml
README
train.tags.de-en.en
IWSLT16.TED.dev2010.de-en.en.xml
IWSLT16.TED.tst2011.de-en.de.xml
IWSLT16.TED.tst2012.de-en.en.xml
IWSLT16.TED.tst2014.de-en.de.xml
IWSLT16.TEDX.dev2012.de-en.en.xml
IWSLT16.TEDX.tst2014.de-en.de.xml
train.en
IWSLT16.TED.tst2010.de-en.de.xml
IWSLT16.TED.tst2011.de-en.en.xml
IWSLT16.TED.tst2013.de-en.de.xml
IWSLT16.TED.tst2014.de-en.en.xml
IWSLT16.TEDX.tst2013.de-en.de.xml
IWSLT16.TEDX.tst2014.de-en.en.xml
train.tags.de-en.de


nelson-liu commented on August 26, 2024

I'm presuming you're running this on Python 2, so you'll need to convert the string to the unicode type before tokenizing it. Either 1) run this on Python 3, or 2) convert the strings to unicode beforehand with six.text_type (you'll want to pip install six to use it; there's an example here).

@jekbradbury, perhaps it'd be worth converting to unicode in the preprocess function?
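
For option 2, a minimal sketch of converting to unicode before tokenizing might look like the following. The helper names to_text and tokenize_text, and the whitespace tokenizer standing in for the spaCy tokenizer, are hypothetical and not part of torchtext. The sketch decodes byte strings explicitly as UTF-8, which is an assumption about how the data files are encoded, and it uses only built-ins rather than six.

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as unicode text, decoding byte strings explicitly."""
    # On Python 2, `bytes` is an alias for `str`, so this branch catches
    # the plain byte strings that a unicode-only tokenizer rejects.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def whitespace_tokenizer(text):
    """Hypothetical stand-in for a tokenizer that requires unicode input."""
    return text.split()


def tokenize_text(raw, tokenizer=whitespace_tokenizer):
    """Convert `raw` to unicode first, then tokenize it."""
    return tokenizer(to_text(raw))
```

Passing an explicit encoding matters here: on Python 2, converting a non-ASCII byte string without one falls back to the ASCII codec and raises UnicodeDecodeError.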


jekbradbury commented on August 26, 2024

Yeah, looks like that's an oversight since we do it in the lower=True case.


nelson-liu commented on August 26, 2024

@mambuDL If you pull from master, rerun python setup.py install, and retry what you did, it should work.


mambuDL commented on August 26, 2024

@nelson-liu @jekbradbury Thanks, I applied what you suggested. That error is gone now, but a new one has appeared :)

This is the traceback I got when I ran python translation.py:

python translation.py 

   Warning: no model found for 'de'

   Only loading the 'de' tokenizer.


   Warning: no model found for 'en'

   Only loading the 'en' tokenizer.

Traceback (most recent call last):
 File "translation.py", line 27, in <module>
   fields=(DE, EN))
 File "build/bdist.linux-x86_64/egg/torchtext/data/dataset.py", line 56, in splits
 File "build/bdist.linux-x86_64/egg/torchtext/datasets/translation.py", line 35, in __init__
   EN.build_vocab(train.trg, max_size=50000)
 File "build/bdist.linux-x86_64/egg/torchtext/data/example.py", line 44, in fromlist
 File "build/bdist.linux-x86_64/egg/torchtext/data/field.py", line 89, in preprocess
 File "build/bdist.linux-x86_64/egg/torchtext/data/pipeline.py", line 13, in __call__
 File "build/bdist.linux-x86_64/egg/torchtext/data/pipeline.py", line 19, in call
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 19: ordinal not in range(128)


nelson-liu commented on August 26, 2024

Hmm, looks like it's necessary to decode the text as UTF-8. I'm not sure whether it's better to do that in Field.preprocess or while reading the translation dataset (with io.open instead of open).
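
As a rough illustration of the second option, the sketch below reads a pair of parallel files with io.open and an explicit encoding, so every line arrives as unicode text on both Python 2 and Python 3. The function name read_parallel_lines is hypothetical, and this is not how torchtext's TranslationDataset is actually implemented; UTF-8 is assumed as the file encoding.

```python
import io


def read_parallel_lines(src_path, trg_path, encoding="utf-8"):
    """Yield (source, target) line pairs as unicode text."""
    # io.open decodes while reading, so no implicit ASCII decoding
    # happens later when the text is lowercased or tokenized.
    with io.open(src_path, encoding=encoding) as src_file, \
            io.open(trg_path, encoding=encoding) as trg_file:
        for src_line, trg_line in zip(src_file, trg_file):
            src_line, trg_line = src_line.strip(), trg_line.strip()
            # Skip pairs where either side is empty.
            if src_line and trg_line:
                yield src_line, trg_line
```

With the files from this thread, it could be called as read_parallel_lines('train.tags.de-en.de', 'train.tags.de-en.en'), with the paths adjusted to wherever the data lives.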


mambuDL commented on August 26, 2024

Does this mean I should give up on using this package and write my own code to read the text dataset for now if I use Python 2, or are you planning to fix it soon?


PetrochukM commented on August 26, 2024

Following up on this. What is the release timeline? When will this project be released to PyPI?


marikgoldstein commented on August 26, 2024

It's up on pip but it's one of the 0.1.x versions. Cloning and running python setup.py install gives the most recent version with many working dataset module features.


marikgoldstein commented on August 26, 2024

@jekbradbury I left a comment in pull #52. Seems like there is still an ascii vs. UTF-8 issue. I commented there because this issue thread is a mix of a few issues.

