Error while training (dual_encoding, OPEN)

danieljf24 commented on July 21, 2024
Error while training

from dual_encoding.

Comments (3)

danieljf24 commented on July 21, 2024

I guess the file named vec500flickr30m.tar.gz (3.0G) has not been downloaded completely.


n-bravo commented on July 21, 2024

Hello.
I have the exact same problem.
First, I got this encoding error when trying to read the id.txt file:

Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/home/dual_encoding-master/venv/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe6 in position 1277060: invalid continuation byte

This happens because my PC uses UTF-8 as the default encoding. I tried ISO-8859-1 instead by changing the line in __init__ in basic/bigfile.py to
self.names = open(id_file, encoding='ISO-8859-1').read().strip().split()
and I could read the file, but now len(self.names) = 1746908 instead of the 1743364 reported in shape.txt, so the encoding I chose must be wrong.
Does anyone know which encoding I should use to read id.txt?

Update: I tried the files from both Google Drive and http://lixirong.net/data/w2vv-tmm2018/word2vec.tar.gz, but the problem persists with both.
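
One possible explanation for the count mismatch (an assumption, not confirmed in this thread): in Python 3, str.split() with no arguments also splits on non-ASCII whitespace such as U+00A0 and U+0085, whereas splitting raw bytes only breaks on ASCII whitespace. Decoding multi-byte names as ISO-8859-1 can therefore turn a byte like 0xA0 into a separator and produce extra tokens. The sample bytes below are hypothetical and are not taken from id.txt.

```python
# Hypothetical sample: three whitespace-separated names stored as UTF-8 bytes.
# The second name starts with the bytes 0xC3 0xA0.
raw = b"caf\xc3\xa9 \xc3\xa0_la_carte ni\xc3\xb1o\n"

# Splitting the raw bytes only breaks on ASCII whitespace.
byte_tokens = raw.strip().split()

# Decoding as ISO-8859-1 maps byte 0xA0 to U+00A0 (no-break space),
# which str.split() treats as whitespace, so one name is cut in two.
text_tokens = raw.decode("ISO-8859-1").strip().split()

print(len(byte_tokens), len(text_tokens))
```

If this is what happens with id.txt, the encoding itself may not be the culprit; the extra entries would come from where the text is split.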


n-bravo commented on July 21, 2024

Found the solution: the problem was that I was running the code with Python 3, but id.txt was written with Python 2.7, which handles strings as raw bytes rather than as decoded text the way Python 3 does.
The solution is either to run the code with Python 2.7, or to do the following:
1.- With Python 2.7, open id.txt and get the list of words with .strip().split()
names = open("id.txt").read().strip().split()
2.- Save the list as JSON with the option ensure_ascii=False (this needs import json), like this
json.dump(names, open("id.json", "w"), ensure_ascii=False)
3.- Run the BigFile code with Python 3, with id_file pointing to the new id.json, replacing
self.names = open(id_file).read().strip().split()
with
self.names = json.load(open(id_file, "r", encoding='latin-1'))
And done: len(self.names) = 1743364 as intended, so the list of vectors is read as in the original.
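
A Python 3-only alternative might look like the sketch below. It is not from this thread and has not been checked against the real id.txt. It assumes that reading the file as bytes and splitting before decoding reproduces what Python 2.7 did, and it decodes each name as latin-1 only to mirror the encoding used in step 3. The function name load_names and the sample bytes are hypothetical.

```python
import os
import tempfile


def load_names(id_file):
    """Read whitespace-separated names, splitting on ASCII whitespace only."""
    with open(id_file, "rb") as f:
        data = f.read()
    # bytes.split() ignores non-ASCII bytes such as 0xA0, so names containing
    # them stay in one piece; latin-1 decodes every byte value without error.
    return [token.decode("latin-1") for token in data.strip().split()]


# Hypothetical stand-in for id.txt: three names stored as UTF-8 bytes.
sample = b"caf\xc3\xa9 \xc3\xa0_la_carte ni\xc3\xb1o\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(sample)
    sample_path = tmp.name

try:
    names = load_names(sample_path)
finally:
    os.remove(sample_path)

print(len(names))
```

In basic/bigfile.py this would presumably replace the line that sets self.names. Comparing len(self.names) with the count in shape.txt, as done above, is a quick way to check whether it worked.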

Hope it helps!

