Hello, I'm trying to train the model and get the following error: <c

Error while training about dual_encoding HOT 3 OPEN

danieljf24 commented on July 21, 2024

Error while training


Comments (3)

danieljf24 commented on July 21, 2024

I guess the file named vec500flickr30m.tar.gz (3.0G) has not been downloaded completely.
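One way to test the incomplete-download guess is to decompress the whole archive and see whether the gzip stream ends cleanly. The following is only a minimal sketch: the helper name gzip_is_complete is hypothetical and not part of dual_encoding, and it checks the gzip layer only, not the contents of the tar inside.

```python
import gzip
import zlib


def gzip_is_complete(path, chunk_size=1 << 20):
    """Return True if the gzip stream at `path` decompresses to its end.

    A truncated download typically raises EOFError; a corrupted one
    raises OSError (or zlib.error) while reading.
    """
    try:
        with gzip.open(path, "rb") as fh:
            # Read and discard the data chunk by chunk to keep memory low.
            while fh.read(chunk_size):
                pass
        return True
    except (EOFError, OSError, zlib.error):
        return False
```

For example, gzip_is_complete("vec500flickr30m.tar.gz") returning False would support the truncated-download explanation; returning True would point to a different cause.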


n-bravo commented on July 21, 2024

Hello.
I have the exact same problem.
First, I got this encoding error when trying to read the id.txt file:

Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/home/dual_encoding-master/venv/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe6 in position 1277060: invalid continuation byte

This happens because my PC uses UTF-8 by default. I tried ISO-8859-1 instead by changing the __init__ in basic/bigfile.py to
self.names = open(id_file, encoding='ISO-8859-1').read().strip().split()
and I could read the file, but now len(self.names) = 1746908 instead of the 1743364 reported in shape.txt, so the encoding I chose must be wrong.
Any idea what encoding I should use to read id.txt?

Update: I tried the files from both Google Drive and http://lixirong.net/data/w2vv-tmm2018/word2vec.tar.gz, but the problem persists with both.
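One plausible explanation for the larger token count, offered as an assumption rather than something verified against the real id.txt: in Python 3, str.split() with no arguments also splits on non-ASCII whitespace such as U+00A0 and U+0085, while splitting raw bytes (as Python 2 does for a plain str) only splits on ASCII whitespace. The sample bytes below are made up for illustration.

```python
# Made-up sample: four whitespace-separated tokens, two of which contain
# bytes (0xA0, 0x85) that become Unicode whitespace once decoded as Latin-1.
raw = b"word1 caf\xe9 na\xa0me tok\x85en\n"

# Splitting the bytes directly: only ASCII whitespace separates tokens.
byte_tokens = raw.strip().split()

# Decoding first, then splitting: U+00A0 and U+0085 also act as separators.
text_tokens = raw.decode("latin-1").strip().split()

print(len(byte_tokens))  # 4
print(len(text_tokens))  # 6
```

If id.txt contains names with such bytes, decoding before splitting would break them apart and inflate len(self.names), which matches the direction of the mismatch reported above.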


n-bravo commented on July 21, 2024

Found the solution: the problem is that I was running the code in Python 3, but "id.txt" was written with Python 2.7, which handles string encoding differently from Python 3.
The solution is either to run the code with Python 2.7, or to do the following:
1.- With Python 2.7, open the file "id.txt" and get the list of words with .strip().split()
names = open("id.txt").read().strip().split()
2.- Save the list as JSON with the option ensure_ascii=False, like this (requires import json)
json.dump(names, open("id.json", "w"), ensure_ascii=False)
3.- Run the BigFile code with Python 3, with id_file pointing to "id.json", by replacing
self.names = open(id_file).read().strip().split()
with
self.names = json.load(open(id_file, "r", encoding='latin-1'))
Done: len(self.names) = 1743364 as intended, so the list of vectors is read the same way as in the original.

Hope it helps!
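A possible Python 3-only alternative to the steps above, which avoids the Python 2.7 round trip: read id.txt as bytes, split on ASCII whitespace, and decode each token afterwards. This is a sketch of a different technique, not the fix described in this comment; the function name read_names is hypothetical, the choice of latin-1 simply mirrors the encoding used above, and it has not been checked against the actual id.txt.

```python
def read_names(id_file):
    """Read whitespace-separated names from `id_file` without decoding first.

    bytes.split() with no arguments splits on ASCII whitespace only, which
    mimics how Python 2 split a plain str. Each token is decoded afterwards,
    so non-ASCII bytes inside a name never act as separators.
    """
    with open(id_file, "rb") as fh:
        tokens = fh.read().strip().split()
    # latin-1 maps every byte to one character, so decoding cannot fail.
    return [token.decode("latin-1") for token in tokens]
```

In basic/bigfile.py this would be used as self.names = read_names(id_file), after which len(self.names) can be compared with the count in shape.txt.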

