ltgoslo / simple_elmo

Simple library to work with pre-trained ELMo models in TensorFlow

Home Page: https://pypi.org/project/simple-elmo/

License: GNU General Public License v3.0

Python 100.00%
elmo embeddings nlp tensorflow

simple_elmo's Introduction

Simple_elmo is a Python library to work with pre-trained ELMo contextualized language models in TensorFlow.

This is a significantly updated wrapper to the original ELMo implementation. The main changes are:

  • more convenient and transparent data loading (including from compressed files)
  • code adapted to modern TensorFlow versions (including TensorFlow 2).

Installation

pip install --upgrade simple_elmo

Make sure to update the package regularly.

Usage

from simple_elmo import ElmoModel

model = ElmoModel()

Loading

First, let's load a pretrained model:

model.load(PATH_TO_ELMO)

Required arguments

PATH_TO_ELMO is either a ZIP archive downloaded from the NLPL vector repository, OR a directory containing 2 files:

  • *.hdf5, pre-trained ELMo weights in HDF5 format (simple_elmo assumes the file is named model.hdf5; if it is not found, the first existing file with the .hdf5 extension will be used);
  • options.json, a description of the model architecture in JSON.

One can also provide a vocab.txt/vocab.txt.gz file in the same directory: a one-word-per-line vocabulary of words to be cached (as character ID representations) before inference. If this file is absent, ELMo will still process all words normally. However, providing the vocabulary file can slightly increase inference speed on very large corpora by reducing the number of word-to-character-ID conversions.
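The directory lookup rule described above can be sketched as follows. This is only an illustration under stated assumptions, not simple_elmo's own code: resolve_model_files is a hypothetical helper, and taking the alphabetically first .hdf5 file as the fallback is an assumption.

```python
from pathlib import Path


def resolve_model_files(directory):
    """Hypothetical helper: locate weights, options and optional vocabulary."""
    model_dir = Path(directory)
    options = model_dir / "options.json"
    if not options.is_file():
        raise FileNotFoundError(f"options.json not found in {model_dir}")

    weights = model_dir / "model.hdf5"
    if not weights.is_file():
        # Fall back to another .hdf5 file (assumption: alphabetical order).
        candidates = sorted(model_dir.glob("*.hdf5"))
        if not candidates:
            raise FileNotFoundError(f"no .hdf5 file found in {model_dir}")
        weights = candidates[0]

    # The vocabulary is optional and may be gzip-compressed.
    vocab = next(
        (model_dir / name for name in ("vocab.txt", "vocab.txt.gz")
         if (model_dir / name).is_file()),
        None,
    )
    return weights, options, vocab
```

Running such a check before calling model.load() gives an early, readable error when a directory is incomplete.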

Optional arguments

  • max_batch_size: integer, default 32;

    the maximum number of sentences/documents in a batch during inference; your input will be automatically split into chunks of the respective size; if your computational resources allow, you might want to increase this value.

  • limit: integer, default 100;

    the number of words from the vocabulary file to actually cache (counted from the first line). Increase the default value if you are sure these words occur in your data much more often than 1 or 2 times.

  • full: boolean, default False;

    if True, will try to load the full model from TensorFlow checkpoints, together with the vocabulary. Models loaded this way can be used for language modeling.
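To make the effect of max_batch_size concrete, here is a minimal sketch of splitting an input into chunks of at most that size. The function name split_into_batches is hypothetical and the library's internal batching may differ in detail.

```python
def split_into_batches(sentences, max_batch_size=32):
    """Yield consecutive chunks of at most max_batch_size sentences."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be positive")
    for start in range(0, len(sentences), max_batch_size):
        yield sentences[start:start + max_batch_size]


# 70 toy sentences are processed as two full batches and one partial batch.
sentences = [["sentence", str(i)] for i in range(70)]
batch_sizes = [len(batch) for batch in split_into_batches(sentences)]
print(batch_sizes)  # [32, 32, 6]
```

A larger max_batch_size means fewer batches but more memory per batch, which is why the value depends on the available hardware.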

Working with models

Currently, we provide three methods for loaded models (will be expanded in the future):

  • model.get_elmo_vectors(SENTENCES)

  • model.get_elmo_vector_average(SENTENCES)

  • model.get_elmo_substitutes(RAW_SENTENCES)

SENTENCES is a list of input sentences (lists of words). RAW_SENTENCES is a list of input sentences as strings.

The get_elmo_vectors() method produces a tensor of contextualized word embeddings. Its shape is (number of sentences, the length of the longest sentence, ELMo dimensionality).

The get_elmo_vector_average() method produces a tensor with one vector per input sentence, constructed by averaging the individual contextualized word embeddings. Its shape is (number of sentences, ELMo dimensionality).
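The relation between the two output shapes can be illustrated with NumPy. The sketch below assumes that shorter sentences are padded up to the longest one and that padded positions should be left out of the average; average_over_tokens is a hypothetical function, not part of the library.

```python
import numpy as np


def average_over_tokens(word_vectors, lengths):
    """Average word vectors per sentence, skipping padded positions.

    word_vectors: array of shape (n_sentences, max_length, dim)
    lengths: number of real tokens in each sentence
    """
    averaged = np.zeros((word_vectors.shape[0], word_vectors.shape[2]))
    for i, length in enumerate(lengths):
        averaged[i] = word_vectors[i, :length].mean(axis=0)
    return averaged


# Toy data standing in for word-level output: 2 sentences, up to 3 words, 4 dimensions.
toy = np.zeros((2, 3, 4))
toy[0, :3] = [[1.0] * 4, [2.0] * 4, [3.0] * 4]  # 3 real tokens
toy[1, :1] = [[5.0] * 4]                         # 1 real token, 2 padded positions
sentence_vectors = average_over_tokens(toy, lengths=[3, 1])
print(sentence_vectors.shape)  # (2, 4)
```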

Both of these methods accept the layers argument, which takes one of three values:

  • average (default): return the average of all ELMo layers for each word;
  • top: return only the top (last) layer for each word;
  • all: return all ELMo layers for each word (the produced tensor gets an additional dimension whose size equals the number of layers in the model, usually 3).
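The three layers values can be pictured as different reductions over a per-layer array. The following sketch uses random numbers in place of real ELMo output; select_layers is a hypothetical function, and the position of the layer axis (axis 1 here) is an assumption.

```python
import numpy as np


def select_layers(layer_outputs, layers="average"):
    """Reduce per-layer outputs of shape (n_sentences, n_layers, max_length, dim)."""
    if layers == "average":
        return layer_outputs.mean(axis=1)  # mean over the layer axis
    if layers == "top":
        return layer_outputs[:, -1]        # keep only the last layer
    if layers == "all":
        return layer_outputs               # keep the extra layer dimension
    raise ValueError(f"unknown layers value: {layers!r}")


rng = np.random.default_rng(0)
outputs = rng.normal(size=(2, 3, 5, 8))  # 2 sentences, 3 layers, 5 words, 8 dimensions
for mode in ("average", "top", "all"):
    print(mode, select_layers(outputs, mode).shape)
```

With "average" and "top" the result keeps three dimensions; only "all" returns the four-dimensional array.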

Use these tensors for your downstream tasks.

Another argument for these methods is session. It defaults to None, which means that a new TensorFlow session is created automatically when the method is called. This is convenient, since one does not have to worry about initializing the computational graph. However, in some cases you might want to re-use an existing session (for example, to call the method multiple times without the initialization overhead).

For this to work, one must do all the initialization manually before the method is called, for example:

import tensorflow as tf
import simple_elmo.elmo
from simple_elmo import ElmoModel

graph = tf.Graph()
with graph.as_default() as elmo_graph:
    elmo_model = ElmoModel()
    elmo_model.load(PATH_TO_ELMO)
...
with elmo_graph.as_default() as current_graph:
    tf_session = tf.compat.v1.Session(graph=elmo_graph)
    with tf_session.as_default() as sess:
        # Build the ELMo input op and initialize the variables once
        elmo_model.elmo_sentence_input = simple_elmo.elmo.weight_layers("input", elmo_model.sentence_embeddings_op)
        sess.run(tf.compat.v1.global_variables_initializer())
...
# Re-use the same session for every call
elmo_model.get_elmo_vectors(SENTENCES, session=tf_session)
elmo_model.get_elmo_vectors(SENTENCES2, session=tf_session)
...

The get_elmo_substitutes() method currently works only with the models loaded with full=True. For each input sentence, it produces a list of lexical substitutes (LM predictions) for each word token in the sentence, produced by the forward and backward ELMo language models. The substitutes are yielded as dictionaries containing the vocabulary identifiers of the most probable LM predictions, their lexical forms and their logit scores. NB: this method is still experimental!

Example scripts

We provide three example scripts to make it easier to start using simple_elmo right away:

python3 get_elmo_vectors.py -i test.txt -e ~/PATH_TO_ELMO/

This script simply returns contextualized ELMo embeddings for the words in your input sentences.

python3 text_classification.py -i paraphrases_lemm.tsv.gz -e ~/PATH_TO_ELMO/

This script performs document pair classification (as in textual entailment or paraphrase detection). Each document is represented by a simple average of the ELMo embeddings of all its words; the cosine similarity between the two documents is then calculated and used as the single classifier feature. The evaluation uses macro F1 score with 10-fold cross-validation.
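The single feature described above can be sketched with NumPy as follows. This illustrates the idea and is not the code of text_classification.py; cosine_similarity and pair_feature are hypothetical names, and returning 0.0 for all-zero vectors is an assumption.

```python
import numpy as np


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    norm = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    if norm == 0:
        return 0.0
    return float(np.dot(vec_a, vec_b) / norm)


def pair_feature(doc_a_vectors, doc_b_vectors):
    """Single classifier feature for a document pair.

    Each argument is an array of shape (n_words, dim) holding one
    contextualized embedding per word.
    """
    return cosine_similarity(doc_a_vectors.mean(axis=0), doc_b_vectors.mean(axis=0))
```

The resulting scalar per document pair is what a downstream classifier would receive as its only input.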

Example paraphrase dataset for English (adapted from MRPC):

Example paraphrase datasets for Russian (adapted from http://paraphraser.ru/):

python3 wsd_eval.py -i senseval3.tsv -e ~/PATH_TO_ELMO/

This script takes as an input a word sense disambiguation (WSD) dataset and a pre-trained ELMo model. It extracts token embeddings for ambiguous words and trains a simple Logistic Regression classifier to predict word senses. Averaged macro F1 score across all words in the test set is used as the evaluation measure (with 5-fold cross-validation).
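For readers unfamiliar with the evaluation measure, here is a small pure-Python sketch of macro F1 and of averaging it over target words. It is not the code of wsd_eval.py, which presumably relies on a library implementation; the function names are hypothetical, and the cross-validation step is left out.

```python
def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(gold) | set(predicted))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, predicted) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, predicted) if g == label and p != label)
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


def averaged_macro_f1(per_word_results):
    """Average macro F1 over target words.

    per_word_results maps each ambiguous word to a (gold, predicted)
    pair of sense label lists.
    """
    word_scores = [macro_f1(gold, pred) for gold, pred in per_word_results.values()]
    return sum(word_scores) / len(word_scores)
```

Because every sense counts equally in macro F1, rare senses influence the score as much as frequent ones.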

Example WSD datasets for English (adapted from Senseval 3):

Example WSD datasets for Russian (adapted from RUSSE'18):

Frequently Asked Questions

Where can I find pre-trained ELMo models?

Several repositories are available where one can download ELMo models compatible with simple_elmo:

Can I load ELMoForManyLangs models?

Unfortunately not. These models are trained using a slightly different architecture. Therefore, they are compatible neither with AllenNLP nor with simple_elmo. You should use the original ELMoForManyLangs code to work with these models.

I see a lot of warnings about deprecated methods

This is normal. The simple_elmo library is based on the original ELMo implementation which was aimed at the versions of TensorFlow which are very outdated today. We significantly updated the code and fixed many warnings - but not all of them yet. The work continues (and will eventually lead to a complete switch to TensorFlow 2).

Meanwhile, these warnings can be ignored: they do not harm the resulting embeddings in any way.

Can I train my own ELMo with this library?

Currently we provide ELMo training code (updated and improved in the same way compared to the original implementation) in a separate repository. It will be integrated into the simple_elmo package at some point.

simple_elmo's People

Contributors

akutuzov, dayyass

simple_elmo's Issues

KeyError: "There is no item named 'vocab.txt' in the archive"

Thanks for the great work!
I downloaded model 170 from the NLPL word embeddings repository (Russian CoNLL17 corpus, ELMo). It does not contain vocab.txt, and KeyError: "There is no item named 'vocab.txt' in the archive" occurred when I tried to load it into the model, although the docs describe this file as optional. Any hints on how to resolve this?

     64                 )
     65             zf = zipfile.ZipFile(directory)
---> 66             vocab_file = zf.open("vocab.txt")
     67             options_file = zf.open("options.json")
     68             weight_file = zf.open("model.hdf5")

~/.pyenv/versions/3.7.3/lib/python3.7/zipfile.py in open(self, name, mode, pwd, force_zip64)
   1465         else:
   1466             # Get info object for name
-> 1467             zinfo = self.getinfo(name)
   1468 
   1469         if mode == 'w':

~/.pyenv/versions/3.7.3/lib/python3.7/zipfile.py in getinfo(self, name)
   1393         if info is None:
   1394             raise KeyError(
-> 1395                 'There is no item named %r in the archive' % name)
   1396 
   1397         return info

KeyError: "There is no item named 'vocab.txt' in the archive"

Model loading error on Windows

Hello,

when loading an ELMo model (with a slightly modified get_elmo_vectors.py) in Anaconda on Windows with Python 3.7.1, I ran into an error at bilm/data.py:29

Traceback (most recent call last):
  File "run_elmo1.py", line 17, in <module>
    batcher, sentence_character_ids, elmo_sentence_input = load_elmo_embeddings(elmo_dir)
  File "E:\github\simple_elmo\elmo_helpers.py", line 81, in load_elmo_embeddings
    batcher = Batcher(vocab_file, 50)
  File "E:\github\simple_elmo\bilm\data.py", line 207, in __init__
    lm_vocab_file, max_token_length
  File "E:\github\simple_elmo\bilm\data.py", line 118, in __init__
    super(UnicodeCharsVocabulary, self).__init__(filename, **kwargs)
  File "E:\github\simple_elmo\bilm\data.py", line 29, in __init__
    for line in f:
  File "C:\Users\eek\Anaconda3\envs\tf21\lib\encodings\cp1251.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 300: character maps to <undefined>

After I manually added encoding='utf-8' to the open() call on line 27, the model loaded and inference ran successfully.
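The fix reported here amounts to never relying on the platform's default encoding (cp1251 in the traceback above) when reading the vocabulary. A minimal sketch of the idea, with read_vocabulary as a hypothetical function rather than the actual code in bilm/data.py:

```python
import os
import tempfile


def read_vocabulary(path):
    """Read a one-word-per-line vocabulary file, always decoding it as UTF-8."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


# Round trip with non-ASCII words, independent of the system code page.
with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "vocab.txt")
    with open(vocab_path, "w", encoding="utf-8") as f:
        f.write("привет\nмир\n")
    words = read_vocabulary(vocab_path)

print(len(words))  # 2
```

Without the explicit encoding argument, open() falls back to the locale's preferred encoding, which is why the same file can load on Linux and fail on Windows.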

Questions about integrating your models & code

Thanks for this repo. I am interested in integrating some of your ELMo models (eg, ID 162 for Latin [http://vectors.nlpl.eu/repository/20/162.zip]) into the CLTK (an NLP framework for ancient/dead languages).

Would you be the right one to answer a few questions about loading these models? I have some questions that may seem very simple to you :)

  1. You have pinned TensorFlow to 1.15.2. Is there a specific reason for this? If I include TF for my users, I would prefer it to be a little newer. Also, do you have any idea whether Keras could load these files?

  2. What other libraries are available to load these models? I ask because there are multiple Python projects capable of using (and sometimes fine-tuning) ELMo, however their conventions for naming files are different. For example, the ELMo directory from NLPL has (config.json, meta.json, word.dic, char.dic, encoder.pkl, token_embedder.pkl) yet I do not see such file types when looking at the "big" ELMo libraries like https://github.com/allenai/allennlp/ .

Not having any "vocab.txt" doesn't work when loading

In the README file, it says the following:

One can also provide a vocab.txt/vocab.txt.gz file in the same directory: a one-word-per-line vocabulary of words to be cached (as character id representations) before inference. Even if it is not present at all, ELMo will still process all words normally. However, providing the vocabulary file can slightly increase inference speed when working with very large corpora (by reducing the amount of word to char ids conversions).

However, when I tried to load the model with the zip file it shows the following error:

KeyError Traceback (most recent call last)
Cell In[57], line 1
----> 1 model.load('./elmo-english.zip')

File ~\anaconda3\envs\python_old\lib\site-packages\simple_elmo\elmo_helpers.py:84, in ElmoModel.load(self, directory, max_batch_size, limit, full)
80 raise SystemExit(
81 "Error: loading models from ZIP archives requires Python >= 3.7."
82 )
83 zf = zipfile.ZipFile(directory)
---> 84 vocab_file = zf.read("vocab.txt").decode("utf-8")
85 options_file = zf.read("options.json").decode("utf-8")
86 weight_file = zf.open("model.hdf5")

File ~\anaconda3\envs\python_old\lib\zipfile.py:1475, in ZipFile.read(self, name, pwd)
1473 def read(self, name, pwd=None):
1474 """Return file bytes for name."""
-> 1475 with self.open(name, "r", pwd) as fp:
1476 return fp.read()

File ~\anaconda3\envs\python_old\lib\zipfile.py:1514, in ZipFile.open(self, name, mode, pwd, force_zip64)
1511 zinfo._compresslevel = self.compresslevel
1512 else:
1513 # Get info object for name
-> 1514 zinfo = self.getinfo(name)
1516 if mode == 'w':
1517 return self._open_to_write(zinfo, force_zip64=force_zip64)

File ~\anaconda3\envs\python_old\lib\zipfile.py:1441, in ZipFile.getinfo(self, name)
1439 info = self.NameToInfo.get(name)
1440 if info is None:
-> 1441 raise KeyError(
1442 'There is no item named %r in the archive' % name)
1444 return info

KeyError: "There is no item named 'vocab.txt' in the archive"

I used Python 3.8.16 in a Jupyter notebook. The function model.load() doesn't work when there is no 'vocab.txt' in the archive.
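The traceback shows that the archive member is read unconditionally. One way a loader could treat the file as optional is to check the archive listing first. The sketch below only illustrates that idea with the standard zipfile module; read_optional_member is a hypothetical helper and this is not the library's actual fix.

```python
import io
import zipfile


def read_optional_member(archive, name):
    """Return the decoded text of an archive member, or None if it is absent."""
    if name not in archive.namelist():
        return None
    return archive.read(name).decode("utf-8")


# In-memory archive that mimics a model ZIP shipped without vocab.txt.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("options.json", "{}")

with zipfile.ZipFile(buffer) as zf:
    options = read_optional_member(zf, "options.json")
    vocab = read_optional_member(zf, "vocab.txt")

print(options, vocab)  # {} None
```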

FileNotFoundError: [Errno 2] No such file or directory: '*\options.json'

Hi, thank you for your great work. I installed simple-elmo and downloaded the Arabic model. After I run these lines:

from simple_elmo import ElmoModel

model = ElmoModel()
model.load("*/136")
model.get_elmo_vectors(sents)

I get this error :
FileNotFoundError: [Errno 2] No such file or directory: 'C:/Users/admin/Desktop/Data/136\\options.json'

Thank you.

read model data directly from archive file

To avoid the need for users to copy the files out of the NLPL vectors repository, how much effort would be required to make the code read its data directly out of the ZIP archive?
