Hey, I'm just wondering if it would be possible to save corpora to t

Corpus compression about gensim HOT 11 OPEN

piskvorky commented on May 18, 2024

Corpus compression

from gensim.

Comments (11)

piskvorky commented on May 18, 2024

Hello David, sure, I'm for it! And yes, the stream would need to support seek in write mode for MmCorpus. I also like to compress the .mm files, although I do it from the command line, not from Python. If you find a way to add this functionality, I'll certainly include it in gensim.

I imagine writing the files already compressed from Python could work by overloading the fname parameters in saveCorpus/serialize methods (not writing another method). If fname is a string, treat it as filename, otherwise, assume it's a file-like object (stream). MmCorpus does exactly that, but only for reading, not writing.
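
The overloading described above could look roughly like the sketch below. The helper name `open_for_writing` is hypothetical and not part of gensim; it only illustrates the "string means filename, anything else is treated as a stream" rule.

```python
from contextlib import contextmanager


@contextmanager
def open_for_writing(fname_or_stream):
    """Hypothetical helper: yield a writable binary stream.

    A str is treated as a filename and opened (and closed) here.
    Anything else is assumed to be a file-like object owned by the caller.
    """
    if isinstance(fname_or_stream, str):
        with open(fname_or_stream, "wb") as fout:
            yield fout
    else:
        # The caller created the stream, so the caller closes it.
        yield fname_or_stream
```

A caller could then pass either a path or an already opened stream, for example a `gzip.GzipFile`, without the serializer needing a second method.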

piskvorky commented on May 18, 2024

Another thought: if fname is a string, check its extension -- if it's .gz/.bz2 etc., open the file as the appropriate stream automatically. A little convenience trick to save users from writing serialize(bz2.BZ2File('blabla.bz2')).

This of course still assumes seeking back and rewriting the stream works, at least for MmCorpus.
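
A minimal sketch of the extension check, assuming Python 3 and its standard `gzip` and `bz2` modules. The function name `open_by_extension` and the extension table are made up for illustration; gensim does not necessarily do it this way.

```python
import bz2
import gzip
import os

# Assumed mapping from file extension to an opener; extend as needed.
_OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def open_by_extension(fname, mode="wb"):
    """Hypothetical helper: pick a (de)compressing opener from the extension."""
    ext = os.path.splitext(fname)[1].lower()
    opener = _OPENERS.get(ext, open)  # fall back to a plain file
    return opener(fname, mode)
```

This only chooses the stream type. It does not solve the seek-and-rewrite problem mentioned above.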

DavidNemeskey commented on May 18, 2024

I checked the code for seek(); it seems to be used only in matutils, and there I found only these two lines:

self.fout.seek(len(MmWriter.HEADER_LINE))
fin.seek(0)

I think even with zlib, we should be safe with these. Seeking to just after the first line, let alone to the very beginning, should be very cheap. I was more concerned about being in the middle of a large file and then seeking back 100 bytes, which seems to be impossible in zlib; such a scenario caused a huge performance hit in a C program I saw.

So maybe it would be worth a try? Of .gz and .bz2, I would go with gzip, because it is much faster. I just hope it is implemented in C and not Python. :)

DavidNemeskey commented on May 18, 2024

Oh yes, one more thing -- the same could be implemented for the Dictionary as well. I'll look into it if I can find the time.

piskvorky commented on May 18, 2024

It's not about performance (Python uses zlib), it's about modifying a compressed file (seek in write mode). To my knowledge it cannot be done, but if you find a way, it will be a welcome addition.
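
The limitation can be checked with a small experiment. The sketch below assumes Python 3's `gzip.GzipFile`, which to my knowledge refuses backward seeks in write mode by raising `OSError`; the function name and the sample data are made up.

```python
import gzip
import io


def can_seek_back_while_writing(header=b"HEADER LINE\n"):
    """Return True if a gzip stream opened for writing accepts seek(0)."""
    raw = io.BytesIO()
    with gzip.GzipFile(fileobj=raw, mode="wb") as fout:
        fout.write(header)
        fout.write(b"1 1 1.0\n")
        try:
            fout.seek(0)  # the kind of rewind a header rewrite would need
        except OSError:
            return False
    return True
```

If this returns False, a header that is only known after the last document has been written cannot be patched in place through the gzip stream.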

DavidNemeskey commented on May 18, 2024

I think it should also be about performance. But yes, I've just realized that seeking in the writer might be wishful thinking in a compressed stream. However, I think the dictionary could still benefit from compression.

piskvorky commented on May 18, 2024

Hey David, what's the status here?

I'm preparing for another release (some time next week), so I'm going through old issues :)

piskvorky commented on May 18, 2024

I have an idea: in cases where the header info is known (number of docs/features/non-zeroes), there is no need to seek. In this case, the MM corpus could be written compressed without problems. The case of streamed input (header unknown) would not allow compression.
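
A sketch of the known-header case, assuming the counts are supplied by the caller and that documents are lists of `(term_id, weight)` pairs with 0-based term ids. The function name and signature are illustrative, not gensim's API.

```python
import gzip


def write_mm_known_header(fname, docs, num_docs, num_terms, num_nnz):
    """Write a gzip-compressed Matrix Market file in one forward pass.

    The header counts are passed in, so nothing has to be rewritten later.
    """
    with gzip.open(fname, "wt", encoding="ascii") as fout:
        fout.write("%%MatrixMarket matrix coordinate real general\n")
        fout.write(f"{num_docs} {num_terms} {num_nnz}\n")
        # Matrix Market indices are 1-based.
        for docno, doc in enumerate(docs, start=1):
            for termid, weight in sorted(doc):
                fout.write(f"{docno} {termid + 1} {weight}\n")
```

The function trusts the counts it is given; a streamed corpus whose size is unknown up front still cannot use this path.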

I'll keep this issue open; in case somebody needs this functionality in the future, they can open a pull request.

tmylk commented on May 18, 2024

@DavidNemeskey Is this feature still needed?

DavidNemeskey commented on May 18, 2024

@tmylk To tell you the truth, I haven't used gensim for quite a long time, so I do not need it right now. I think the original use case still stands, though. But I leave the decision up to you.

piskvorky commented on May 18, 2024

Leave this open @tmylk ; it is a "wishlist" item. Being able to serialize streamed MatrixMarket in compressed form would be super useful.

Incidentally, I think bz2 could allow such a thing (unlike gzip). The MM header could be written in a separate "block" from the rest of the corpus, and "rewritten" later, making streamed MM compression possible.
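
One way to approximate the separate-block idea is sketched below. It assumes Python 3.3+, where `bz2.decompress` accepts several concatenated bz2 streams. Instead of rewriting the header in place, this variant compresses the body while counting, then compresses the header as its own stream and puts it in front. The function name is hypothetical, and the body is buffered in memory only to keep the example short.

```python
import bz2
import io


def compress_streamed_mm(docs):
    """Return bz2 bytes for a streamed corpus: header stream + body stream."""
    body = io.BytesIO()
    comp = bz2.BZ2Compressor()
    num_docs = num_nnz = 0
    max_term = -1
    for docno, doc in enumerate(docs, start=1):
        num_docs = docno
        for termid, weight in doc:
            line = f"{docno} {termid + 1} {weight}\n"  # 1-based indices
            body.write(comp.compress(line.encode("ascii")))
            num_nnz += 1
            max_term = max(max_term, termid)
    body.write(comp.flush())
    # The counts are known only now; compress the header as a separate stream.
    header = ("%%MatrixMarket matrix coordinate real general\n"
              f"{num_docs} {max_term + 1} {num_nnz}\n").encode("ascii")
    return bz2.compress(header) + body.getvalue()
```

For a large corpus the body stream would go to a temporary file rather than a `BytesIO`, and the final file would be the header stream followed by a copy of that file.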
