thoughtriver / lmdb-embeddings

Fast word vectors with little memory usage in Python

License: GNU General Public License v3.0

Python 100.00%
word vectors embeddings lmdb gensim memory speed text word2vec fasttext

LMDB Embeddings

Query word vectors (embeddings) quickly, with little per-query overhead and far less memory usage than gensim or equivalent solutions. This is made possible by the Lightning Memory-Mapped Database (LMDB).

Inspired by Delft. As their readme explains, this approach makes the pre-trained embeddings immediately "warm" (no load time), frees memory, and allows any number of embeddings to be used simultaneously with a negligible impact on runtime when using an SSD.

For instance, with a traditional approach glove-840B takes around 2 minutes to load and 4 GB of memory. Managed with LMDB, glove-840B can be accessed immediately and takes only a couple of MB of memory, for a negligible impact on runtime (around 1% slower).
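To make the "no load time" claim concrete, here is a minimal sketch of memory-mapped access using only the standard library. It is not how LMDB lays out its data (LMDB keeps keys in a B+tree); the fixed-size record format, the toy vectors and the `read_vector` helper are all assumptions made for illustration. The point is only that mapping a file is near-instant and that a lookup touches just the bytes it needs.

```python
import mmap
import os
import struct
import tempfile

DIMENSIONS = 4                              # toy vector size (assumption)
RECORD = struct.Struct(f'<{DIMENSIONS}f')   # one little-endian float32 vector

# Toy data: three fixed-size vectors stored back to back in one file.
vectors = [(0.0, 0.5, 1.0, 1.5), (2.0, 2.5, 3.0, 3.5), (4.0, 4.5, 5.0, 5.5)]
path = os.path.join(tempfile.mkdtemp(), 'vectors.bin')
with open(path, 'wb') as handle:
    for vector in vectors:
        handle.write(RECORD.pack(*vector))


def read_vector(mapped, index):
    """Decode only record `index`; the OS pages in just the bytes touched."""
    return RECORD.unpack_from(mapped, index * RECORD.size)


with open(path, 'rb') as handle:
    # Mapping does not read the whole file into process memory up front.
    with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        second = read_vector(mapped, 1)

print(second)  # (2.0, 2.5, 3.0, 3.5)
```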

Installation

pip install lmdb-embeddings

Reading vectors

from lmdb_embeddings.reader import LmdbEmbeddingsReader
from lmdb_embeddings.exceptions import MissingWordError

embeddings = LmdbEmbeddingsReader('/path/to/word/vectors/eg/GoogleNews-vectors-negative300')

try:
    vector = embeddings.get_word_vector('google')
except MissingWordError:
    # 'google' is not in the database.
    pass
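When looking up many words at once, it can be convenient to collect the misses instead of handling each exception inline. The helpers below are a hypothetical sketch, not part of the library: `get_vector_or_default` and `lookup_many` are names invented here, and they only assume a reader with a `get_word_vector` method that raises the exception class you pass in.

```python
def get_vector_or_default(reader, word, missing_error, default=None):
    """Return the vector for `word`, or `default` if `missing_error` is raised."""
    try:
        return reader.get_word_vector(word)
    except missing_error:
        return default


def lookup_many(reader, words, missing_error):
    """Split `words` into a dict of found vectors and a list of missing words."""
    found, missing = {}, []
    for word in words:
        vector = get_vector_or_default(reader, word, missing_error)
        if vector is None:
            missing.append(word)
        else:
            found[word] = vector
    return found, missing


# Usage with the real reader (requires an existing database):
# from lmdb_embeddings.reader import LmdbEmbeddingsReader
# from lmdb_embeddings.exceptions import MissingWordError
# embeddings = LmdbEmbeddingsReader('/path/to/word/vectors/eg/GoogleNews-vectors-negative300')
# found, missing = lookup_many(embeddings, ['google', 'not-a-word'], MissingWordError)
```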

Writing vectors

The example below writes an LMDB vector file from a gensim model. Any iterator that yields word and vector pairs is supported, so if your vectors are in another format, alter the iter_embeddings function below accordingly.

I will be writing a CLI to convert standard formats soon.

from gensim.models.keyedvectors import KeyedVectors
from lmdb_embeddings.writer import LmdbEmbeddingsWriter


GOOGLE_NEWS_PATH = 'GoogleNews-vectors-negative300.bin.gz'
OUTPUT_DATABASE_FOLDER = 'GoogleNews-vectors-negative300'


print('Loading gensim model...')
gensim_model = KeyedVectors.load_word2vec_format(GOOGLE_NEWS_PATH, binary=True)


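def iter_embeddings():
    # gensim >= 4.0 names the vocabulary `key_to_index`; older releases use `vocab`.
    vocabulary = getattr(gensim_model, 'key_to_index', None) or gensim_model.vocab
    for word in vocabulary:
        yield word, gensim_model[word]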

print('Writing vectors to an LMDB database...')

LmdbEmbeddingsWriter(iter_embeddings()).write(OUTPUT_DATABASE_FOLDER)

# These vectors can now be loaded with the LmdbEmbeddingsReader.
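Since any iterator of word and vector pairs is accepted, vectors stored as plain text can be fed in with a small generator. The sketch below assumes a GloVe-style layout (one `word v1 v2 ...` entry per line, space separated, no header line); `iter_text_embeddings` is a hypothetical helper, not part of the library. It yields plain lists to stay dependency-free, so you may want to convert each vector to a numpy array before writing.

```python
import io


def iter_text_embeddings(lines):
    """Yield (word, vector) pairs from lines shaped like 'word 0.1 0.2 ...'."""
    for line in lines:
        parts = line.rstrip().split(' ')
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        word, values = parts[0], parts[1:]
        yield word, [float(value) for value in values]


# A file object opened in text mode works the same way as this in-memory sample.
sample = io.StringIO('cat 0.5 1.0\ndog 1.5 2.0\n')
pairs = list(iter_text_embeddings(sample))
print(pairs)  # [('cat', [0.5, 1.0]), ('dog', [1.5, 2.0])]
```

The resulting generator could then be passed to LmdbEmbeddingsWriter in place of iter_embeddings().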

LRU Cache

A reader with an LRU (Least Recently Used) cache is included. It keeps the embeddings for the 50,000 most recently queried words in memory and returns the cached object instead of querying the database again. Its interface is the same as the standard reader's. See functools.lru_cache in the standard library.

from lmdb_embeddings.reader import LruCachedLmdbEmbeddingsReader
from lmdb_embeddings.exceptions import MissingWordError

embeddings = LruCachedLmdbEmbeddingsReader('/path/to/word/vectors/eg/GoogleNews-vectors-negative300')

try:
    vector = embeddings.get_word_vector('google')
except MissingWordError:
    # 'google' is not in the database.
    pass
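The caching behaviour can be illustrated with functools.lru_cache alone. This is a sketch of the general technique under stated assumptions, not the library's implementation: `CountingReader` and `CachedReader` are hypothetical stand-ins, and a dict plays the role of the database.

```python
from functools import lru_cache


class CountingReader:
    """Hypothetical stand-in for a database-backed reader; counts real lookups."""

    def __init__(self, table):
        self._table = table
        self.database_hits = 0

    def get_word_vector(self, word):
        self.database_hits += 1
        return self._table[word]


class CachedReader(CountingReader):
    """Same interface, but repeat lookups are served from an LRU cache."""

    def __init__(self, table, maxsize=50_000):
        super().__init__(table)
        # Wrap the bound method so that each instance owns its own cache.
        self.get_word_vector = lru_cache(maxsize=maxsize)(super().get_word_vector)


reader = CachedReader({'google': [0.1, 0.2]})
first = reader.get_word_vector('google')
second = reader.get_word_vector('google')
print(first is second, reader.database_hits)  # True 1
```

Because the cache hands back the same object each time, mutating a returned vector in place would affect every later lookup of that word.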

Customisation

By default, LMDB Embeddings uses pickle to serialize the vectors to bytes (optimized and pickled with the highest available protocol). To use an alternative approach, inject the serializer and unserializer as callables into the LmdbEmbeddingsWriter and LmdbEmbeddingsReader.

A msgpack serializer is included and can be used in the same way.

from lmdb_embeddings.writer import LmdbEmbeddingsWriter
from lmdb_embeddings.serializers import MsgpackSerializer

LmdbEmbeddingsWriter(
    iter_embeddings(),
    serializer=MsgpackSerializer().serialize
).write(OUTPUT_DATABASE_FOLDER)

from lmdb_embeddings.reader import LmdbEmbeddingsReader
from lmdb_embeddings.serializers import MsgpackSerializer

reader = LmdbEmbeddingsReader(
    OUTPUT_DATABASE_FOLDER,
    unserializer=MsgpackSerializer().unserialize
)
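A custom serializer only needs a pair of callables. The sketch below is hypothetical (`Float32Serializer` is not shipped with the library) and assumes, as the msgpack example suggests, that the serializer maps a vector to bytes and the unserializer maps bytes back to a vector. It stores raw float32 values in the machine's native byte order, so a database written this way is not portable across architectures with different endianness.

```python
from array import array


class Float32Serializer:
    """Hypothetical serializer: packs a sequence of floats as raw float32 bytes."""

    @staticmethod
    def serialize(vector):
        return array('f', vector).tobytes()

    @staticmethod
    def unserialize(serialized_vector):
        values = array('f')
        values.frombytes(serialized_vector)
        return values.tolist()


# Round trip with values that float32 represents exactly.
payload = Float32Serializer.serialize([0.5, -1.25, 3.0])
restored = Float32Serializer.unserialize(payload)
print(restored)  # [0.5, -1.25, 3.0]
```

The two callables would be passed as serializer=Float32Serializer.serialize and unserializer=Float32Serializer.unserialize, mirroring the msgpack example above.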

Running tests

pytest

Contributing

Contributions, issues and feature requests are welcome!

Show your support

Give a ⭐️ if this project helped you!

License

Copyright © 2019 ThoughtRiver.
This project is GPL-3.0 licensed.

lmdb-embeddings's Issues

mdb_page_search_root Bus error

If the word can be found, there is no bus error. If the word cannot be found, the lookup fails with a bus error.

  • This is mdb_page_search_root:
 /** Finish #mdb_page_search() / #mdb_page_search_lowest().
  *   The cursor is at the root page, set up the rest of it.
  */
 static int
 mdb_page_search_root(MDB_cursor *mc, MDB_val *key, int flags)
 {
      MDB_page        *mp = mc->mc_pg[mc->mc_top];
      int rc;
      DKBUF;

      while (IS_BRANCH(mp)) {
              MDB_node        *node;
              indx_t          i;

              DPRINTF(("branch page %"Yu" has %u keys", mp->mp_pgno, NUMKEYS(mp)));
              /* Don't assert on branch pages in the FreeDB. We can get here
               * while in the process of rebalancing a FreeDB branch page; we must
               * let that proceed. ITS#8336
               */
              mdb_cassert(mc, !mc->mc_dbi || NUMKEYS(mp) > 1);
              DPRINTF(("found index 0 to page %"Yu, NODEPGNO(NODEPTR(mp, 0))));

              if (flags & (MDB_PS_FIRST|MDB_PS_LAST)) {
                      i = 0;
                      if (flags & MDB_PS_LAST) {
                              i = NUMKEYS(mp) - 1;
                              /* if already init'd, see if we're already in right place */
                              if (mc->mc_flags & C_INITIALIZED) {
                                      if (mc->mc_ki[mc->mc_top] == i) {
                                              mc->mc_top = mc->mc_snum++;
                                              mp = mc->mc_pg[mc->mc_top];
                                              goto ready;
                                      }
                              }
                      }
              } else {
                      int      exact;
                      node = mdb_node_search(mc, key, &exact);
                      if (node == NULL)
                              i = NUMKEYS(mp) - 1;
                      else {
                              i = mc->mc_ki[mc->mc_top];
                              if (!exact) {
                                      mdb_cassert(mc, i > 0);
                                      i--;
                              }
                      }
                      DPRINTF(("following index %u for key [%s]", i, DKEY(key)));
              }

              mdb_cassert(mc, i < NUMKEYS(mp));
              node = NODEPTR(mp, i);

              if ((rc = mdb_page_get(mc, NODEPGNO(node), &mp, NULL)) != 0)
                      return rc;

              mc->mc_ki[mc->mc_top] = i;
              if ((rc = mdb_cursor_push(mc, mp)))
                      return rc;

 ready:
              if (flags & MDB_PS_MODIFY) {
                      if ((rc = mdb_page_touch(mc)) != 0)
                              return rc;
                      mp = mc->mc_pg[mc->mc_top];
              }
      }

      if (!IS_LEAF(mp)) {
              DPRINTF(("internal error, index points to a %02X page!?",
                  mp->mp_flags));
              mc->mc_txn->mt_flags |= MDB_TXN_ERROR;
              return MDB_CORRUPTED;
      }

      DPRINTF(("found leaf page %"Yu" for key [%s]", mp->mp_pgno,
          key ? DKEY(key) : "null"));
      mc->mc_flags |= C_INITIALIZED;
      mc->mc_flags &= ~C_EOF;

      return MDB_SUCCESS;
 }

Licensing

Any chance you can add a license file to this? To spell out what people are and are not allowed to do?
