Corpus compression (gensim, open)

piskvorky commented on May 18, 2024
Corpus compression

from gensim.

Comments (11)

piskvorky commented on May 18, 2024

Hello David, sure, I'm for it! And yes, the stream would need to support seek in write mode for MmCorpus. I also like to compress the .mm files, although I do it from the command line, not from Python. If you find a way to add this functionality, I'll certainly include it in gensim.

I imagine writing the files already compressed from Python could work by overloading the fname parameters in saveCorpus/serialize methods (not writing another method). If fname is a string, treat it as filename, otherwise, assume it's a file-like object (stream). MmCorpus does exactly that, but only for reading, not writing.
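A minimal sketch of the fname overloading described above, not gensim's actual code. The names open_for_write and save_lines are hypothetical: a string is treated as a filename, anything else is assumed to be an already-open writable binary stream.

```python
def open_for_write(fname_or_stream):
    """Return (stream, owns_stream) for a filename or a file-like object.

    Hypothetical helper for illustration; not part of gensim.
    """
    if isinstance(fname_or_stream, str):
        # A string is treated as a path; we opened it, so we must close it.
        return open(fname_or_stream, "wb"), True
    # Anything else is assumed to be a writable binary stream owned by the caller.
    return fname_or_stream, False


def save_lines(fname_or_stream, lines):
    """Write an iterable of text lines to a path or to a stream."""
    stream, owns_stream = open_for_write(fname_or_stream)
    try:
        for line in lines:
            stream.write((line + "\n").encode("utf8"))
    finally:
        if owns_stream:
            stream.close()
```

With this shape, a caller could pass either "corpus.mm" or an open gzip/bz2 file object to the same function, and a caller-supplied stream is left open for the caller to close.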

piskvorky commented on May 18, 2024

Another thought: if fname is a string, check its extension -- if it's .gz/.bz2 etc., open the file as the appropriate stream automatically. A little convenience trick to save users from writing serialize(bz2.BZ2File('blabla.bz2', 'w')).

This of course still assumes seeking back and rewriting the stream works, at least for MmCorpus.
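The extension check could look roughly like the sketch below. open_by_extension and the _OPENERS table are hypothetical names; only gzip.open, bz2.open and the built-in open are real standard-library calls. It does nothing about the seek problem, it only picks the stream type.

```python
import bz2
import gzip
import os

# Map of file extensions to opener functions; extend as needed.
_OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def open_by_extension(fname, mode="rb"):
    """Open fname, picking a (de)compressing opener from its extension.

    Unknown extensions fall back to the built-in open().
    """
    ext = os.path.splitext(fname)[1].lower()
    opener = _OPENERS.get(ext, open)
    return opener(fname, mode)
```

Usage would then be open_by_extension('corpus.mm.bz2', 'wb') instead of constructing the BZ2File by hand.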

DavidNemeskey commented on May 18, 2024

I checked the code for seek(); it seems to be used only in matutils, and there I found only these two lines:

  • self.fout.seek(len(MmWriter.HEADER_LINE))
  • fin.seek(0)

I think even with zlib, we should be safe with these. Seeking to just after the first line, let alone to the very beginning, should be very, very cheap. I was more concerned about being in the middle of a large file and then seeking back 100 bytes, which seems to be impossible in zlib; such a scenario caused a huge performance impact in a C program I saw.

So maybe it would be worth a try? Of .gz and .bz2, I would go with gzip, because it is much faster. I just hope it is implemented in C and not Python. :)

DavidNemeskey commented on May 18, 2024

Oh yes, one more thing -- the same could be implemented for the Dictionary as well. I'll take a look into it, if I can find the time.

piskvorky commented on May 18, 2024

It's not about performance (Python uses zlib), it's about modifying a compressed file (seek in write mode). To my knowledge it cannot be done, but if you find a way, it will be a welcome addition.
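The write-mode seek limitation can be checked with the standard library alone. The sketch below is an illustration under Python 3, not gensim code, and can_seek_back_while_writing is a hypothetical name; it writes a few bytes to an in-memory compressed stream and then tries the seek(0) that a header rewrite would need.

```python
import bz2
import gzip
import io


def can_seek_back_while_writing(open_compressed):
    """Return True if seeking backwards works on a compressed write stream."""
    raw = io.BytesIO()
    with open_compressed(raw) as fout:
        fout.write(b"placeholder header\n")
        fout.write(b"1 1 0.5\n")
        try:
            fout.seek(0)  # the kind of seek a header rewrite would need
        except OSError:
            # gzip raises OSError; bz2 raises io.UnsupportedOperation,
            # which is a subclass of OSError.
            return False
    return True


gzip_ok = can_seek_back_while_writing(
    lambda raw: gzip.GzipFile(fileobj=raw, mode="wb"))
bz2_ok = can_seek_back_while_writing(
    lambda raw: bz2.BZ2File(raw, mode="wb"))
print("gzip:", gzip_ok, "bz2:", bz2_ok)
```

On CPython 3 both checks should come back False, which matches the concern above: neither file object lets a writer go back and patch bytes it has already compressed.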

DavidNemeskey commented on May 18, 2024

I think it should also be about performance. But yes, I've just realized that seeking in the writer might be wishful thinking in a compressed stream. However, I think the dictionary could still benefit from compression.

piskvorky commented on May 18, 2024

Hey David, what's the status here?

I'm preparing for another release (some time next week), so I'm going through old issues :)

piskvorky commented on May 18, 2024

I have an idea: in cases where the header info is known (number of docs/features/non-zeroes), there is no need to seek. In this case, the MM corpus could be written compressed without problems. The case of streamed input (header unknown) would not allow compression.

I'll keep this issue open; in case somebody needs this functionality in the future, they can open a pull request.
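For the header-known case, a single-pass compressed writer might look like the sketch below. write_mm_gz is a hypothetical function, not gensim's MmWriter; it assumes the corpus is an iterable of documents, each a list of (term_id, weight) pairs with 0-based term ids, and that the caller supplies correct counts.

```python
import gzip


def write_mm_gz(fname, corpus, num_docs, num_terms, num_nnz):
    """Write a corpus as gzip-compressed Matrix Market text in one pass.

    The three counts must be known up front, so the header is written
    once and never has to be revisited.
    """
    with gzip.open(fname, "wt", encoding="utf8") as fout:
        fout.write("%%MatrixMarket matrix coordinate real general\n")
        fout.write("%d %d %d\n" % (num_docs, num_terms, num_nnz))
        for doc_no, doc in enumerate(corpus, start=1):
            for term_id, weight in sorted(doc):
                # Matrix Market indices are 1-based.
                fout.write("%d %d %s\n" % (doc_no, term_id + 1, weight))
```

Because nothing is written before the counts are available, no seek is needed and any write-only compressed stream would do.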

tmylk commented on May 18, 2024

@DavidNemeskey Is this feature still needed?

DavidNemeskey commented on May 18, 2024

@tmylk To tell you the truth, I haven't used gensim for quite a long time, so I do not need it right now. I think the original use case still stands, though. But I leave the decision up to you.

piskvorky commented on May 18, 2024

Leave this open, @tmylk; it is a "wishlist" item. Being able to serialize streamed MatrixMarket in compressed form would be super useful.

Incidentally, I think bz2 could allow such a thing (unlike gzip). The MM header could be written in a separate "block" from the rest of the corpus, and "rewritten" later, making streamed MM compression possible.
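One way to approximate the separate-block idea with Python's bz2 module is sketched below. It is an assumption-laden illustration, not a tested design: compress_mm_streamed is a hypothetical name, the body is kept in memory rather than spilled to a temporary file, and the header is not patched in place but compressed last as its own bz2 stream and placed in front of the body stream. It relies on bz2.decompress accepting concatenated streams (Python 3.3+).

```python
import bz2


def compress_mm_streamed(corpus):
    """Return bz2 bytes for a streamed corpus, header in its own bz2 stream.

    corpus is an iterable of documents, each a list of (term_id, weight)
    pairs with 0-based term ids; it is consumed exactly once.
    """
    body = bz2.BZ2Compressor()
    chunks = []
    num_docs = num_terms = num_nnz = 0
    for doc in corpus:
        num_docs += 1
        for term_id, weight in sorted(doc):
            num_nnz += 1
            num_terms = max(num_terms, term_id + 1)
            line = "%d %d %s\n" % (num_docs, term_id + 1, weight)
            chunks.append(body.compress(line.encode("utf8")))
    chunks.append(body.flush())

    # The counts are only known now, so the header is compressed last,
    # as an independent stream, and placed in front of the body stream.
    header = "%%MatrixMarket matrix coordinate real general\n"
    header += "%d %d %d\n" % (num_docs, num_terms, num_nnz)
    return bz2.compress(header.encode("utf8")) + b"".join(chunks)
```

Decompressing the result yields the header followed by the body as one text. A true in-place rewrite would additionally need the compressed header to keep a fixed size, which this sketch does not attempt.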
