Comments (11)
Hello David, sure, I'm for it! And yes, the stream would need to support seek in write mode for MmCorpus. I also like to compress the .mm files, although I do it from the command line, not from Python. If you find a way to add this functionality, I'll certainly include it in gensim.
I imagine writing the files already compressed from Python could work by overloading the fname parameter in the saveCorpus/serialize methods (rather than writing another method). If fname is a string, treat it as a filename; otherwise, assume it's a file-like object (stream). MmCorpus does exactly that, but only for reading, not writing.
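A minimal sketch of the string-or-stream dispatch described above. The names `open_for_write` and `save_lines` are hypothetical helpers made up for illustration; they are not part of gensim's API.

```python
import io
from contextlib import contextmanager


@contextmanager
def open_for_write(fname):
    """Yield a writable binary stream for `fname` (hypothetical helper)."""
    if isinstance(fname, str):
        # A string is treated as a path: we open the file and close it on exit.
        with open(fname, "wb") as fout:
            yield fout
    else:
        # Anything else is assumed to be a file-like object owned by the caller,
        # so it is passed through and deliberately left open.
        yield fname


def save_lines(fname, lines):
    """Write each text line to `fname`, which may be a path or a stream."""
    with open_for_write(fname) as fout:
        for line in lines:
            fout.write(line.encode("utf8") + b"\n")
```

With this shape, a caller could pass either `"corpus.txt"` or an already opened stream (for example an `io.BytesIO` or a compressed file object) to the same function.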
from gensim.
Another thought: if fname is a string, check its extension -- if it's .gz/.bz2 etc., open the file as the appropriate stream automatically. A little convenience trick to save users from writing serialize(bz2.BZ2File('blabla.bz2')).
This of course still assumes seeking back and rewriting the stream works, at least for MmCorpus.
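The extension check could be sketched roughly as follows. `open_by_extension` is a hypothetical name, and the mapping only covers the two extensions mentioned in this thread; it relies on the standard-library `gzip.open` and `bz2.open`.

```python
import bz2
import gzip
import os

# Map a filename extension to the function that opens such a file.
# Only .gz and .bz2 are handled here; everything else falls back to open().
_OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def open_by_extension(fname, mode="wb"):
    """Open `fname` with a compressor chosen from its extension (hypothetical helper)."""
    ext = os.path.splitext(fname)[1].lower()
    opener = _OPENERS.get(ext, open)
    return opener(fname, mode)
```

Note that this only picks the stream type. It does not make the compressed stream seekable for rewriting, which is the open problem discussed in this thread.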
I checked the code for seek(), and it seems it is only used in matutils, and then I have only found these two lines:
self.fout.seek(len(MmWriter.HEADER_LINE))
fin.seek(0)
I think even with zlib, we should be safe with these. Seeking to just after the first line, let alone to the very beginning, should be very, very cheap. I was more concerned about being in the middle of a large file and then seeking back 100 bytes, which seems to be impossible in zlib; such a scenario caused a huge performance impact in a C program I saw.
So maybe it would be worth a try? Of .gz and .bz2, I would go with gzip, because it is much faster. I just hope it is implemented in C and not Python. :)
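For context, the write-then-seek-back pattern behind those two calls can be sketched on an ordinary seekable stream. This is an illustration only, not gensim's actual MmWriter code: `BANNER` stands in for `MmWriter.HEADER_LINE`, and the placeholder width `PAD` and the function name are assumptions.

```python
import io

# Stand-in for MmWriter.HEADER_LINE (assumed value, for illustration).
BANNER = b"%%MatrixMarket matrix coordinate real general\n"
PAD = 50  # bytes reserved for the "docs terms nnz" line (assumed width)


def write_backpatched(fout, docs):
    """Write docs as coordinate triples, then seek back and fill in the stats line."""
    fout.write(BANNER)
    fout.write(b" " * PAD + b"\n")  # placeholder, overwritten at the end
    num_docs = num_terms = num_nnz = 0
    for docno, doc in enumerate(docs, start=1):
        for termid, weight in doc:
            # Matrix Market indices are 1-based.
            fout.write(f"{docno} {termid + 1} {weight}\n".encode("ascii"))
            num_terms = max(num_terms, termid + 1)
            num_nnz += 1
        num_docs = docno
    stats = f"{num_docs} {num_terms} {num_nnz}".encode("ascii")
    if len(stats) > PAD:
        raise ValueError("stats line does not fit into the reserved space")
    # This is the step that needs a seekable stream in write mode.
    fout.seek(len(BANNER))
    fout.write(stats)
```

The final `seek` is cheap on a plain file, but it is exactly the operation a compressed output stream would have to support.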
Oh yes, one more thing -- the same could be implemented for the Dictionary as well. I'll take a look into it, if I can find the time.
It's not about performance (Python uses zlib), it's about modifying a compressed file (seek in write mode). To my knowledge it cannot be done, but if you find a way, it will be a welcome addition.
I think it should also be about performance. But yes, I've just realized that seeking in the writer might be wishful thinking in a compressed stream. However, I think the dictionary could still benefit from compression.
Hey David, what's the status here?
I'm preparing for another release (some time next week), so I'm going through old issues :)
I have an idea: in cases where the header info is known (number of docs/features/non-zeroes), there is no need to seek. In this case, the MM corpus could be written compressed without problems. The case of streamed input (header unknown) would not allow compression.
I'll keep this issue open; in case somebody needs this functionality in the future, they can open a pull request.
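The known-header case could look roughly like this: when the counts are available up front, the header is written first and nothing is ever rewritten, so a non-seekable compressed stream is enough. The function names and the in-memory gzip usage are assumptions for illustration, not gensim code.

```python
import gzip
import io


def write_mm_known_header(fout, docs, num_docs, num_terms, num_nnz):
    """Write a Matrix Market file front to back; no seek is needed."""
    fout.write(b"%%MatrixMarket matrix coordinate real general\n")
    fout.write(f"{num_docs} {num_terms} {num_nnz}\n".encode("ascii"))
    for docno, doc in enumerate(docs, start=1):
        for termid, weight in doc:
            # Matrix Market indices are 1-based.
            fout.write(f"{docno} {termid + 1} {weight}\n".encode("ascii"))


def mm_to_gzip_bytes(docs, num_docs, num_terms, num_nnz):
    """Serialize docs straight into gzip-compressed bytes held in memory."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        write_mm_known_header(gz, docs, num_docs, num_terms, num_nnz)
    return buf.getvalue()
```

The caller is responsible for passing counts that match the data; the sketch does not verify them.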
@DavidNemeskey Is this feature still needed?
@tmylk To tell you the truth, I haven't used gensim for quite a long time, so I do not need it right now. I think the original use case still stands, though. But I leave the decision up to you.
Leave this open, @tmylk; it is a "wishlist" item. Being able to serialize streamed MatrixMarket in compressed form would be super useful.
Incidentally, I think bz2 could allow such a thing (unlike gzip). The MM header could be written in a separate "block" from the rest of the corpus, and "rewritten" later, making streamed MM compression possible.
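A small sketch of the part of this idea that can be checked with the standard library: the header and the body can be compressed as two independent bz2 streams, and their concatenation decompresses as one file. The function name is hypothetical. The sketch builds the result in memory and does not show rewriting a header in place on disk, which would additionally need the compressed header to occupy a predictable number of bytes.

```python
import bz2


def compress_header_and_body(header, body_chunks):
    """Return bz2 bytes made of a header stream followed by a body stream."""
    # The body is compressed incrementally, as it would be for streamed input.
    compressor = bz2.BZ2Compressor()
    body = b"".join(compressor.compress(chunk) for chunk in body_chunks)
    body += compressor.flush()
    # The header is compressed separately, once the counts are known,
    # and placed in front of the already compressed body.
    return bz2.compress(header) + body
```

Python's `bz2.decompress` and `bz2.open` read such multi-stream data transparently, so a reader would see the header followed by the body.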