
License: MIT License

Python 99.87% Dockerfile 0.13%
nlp bert xlnet word2vec glove fasttext ulmfit embedding transformer bert-as-service

embedding-as-service's Introduction

embedding-as-service

One-Stop Solution to encode sentence to fixed length vectors from various embedding techniques
• Inspired by bert-as-service


What is it • Installation • Getting Started • Supported Embeddings • API

What is it

Encoding (embedding) is the upstream task of converting any input, whether text, image, audio, video, or transactional data, into a fixed-length vector. Embeddings are very popular in NLP, and researchers have proposed many embedding models in recent years; some of the best known are bert, xlnet, and word2vec. The goal of this repo is to build a one-stop solution for all available embedding techniques. We are starting with popular text embeddings, and later we aim to add as many techniques as possible for image, audio, and video inputs as well.

embedding-as-service helps you encode any given text into a fixed-length vector using the supported embeddings and models.

💾 Installation

▴ Back to top

You can use embedding-as-service as a module, or run it as a server and handle queries by installing the client package embedding-as-service-client.

Using embedding-as-service as a module

Install embedding-as-service via pip.

$ pip install embedding-as-service

Note that the code must run on Python >= 3.6; the module does not support Python 2.

Using embedding-as-service as a server

Here you also need to install the client module embedding-as-service-client:

$ pip install embedding-as-service # server
$ pip install embedding-as-service-client # client

The client module does not require Python 3.6; it supports both Python 2 and Python 3.

⚡️ Getting Started

▴ Back to top

1. Initialise the encoder using a supported embedding and model from here

If using embedding-as-service as a module

>>> from embedding_as_service.text.encode import Encoder  
>>> en = Encoder(embedding='bert', model='bert_base_cased', max_seq_length=256)  

If using embedding-as-service as a server

# start the server by providing embedding, model, port, max_seq_length [default=256], num_workers [default=4]
$ embedding-as-service-start --embedding bert --model bert_base_cased --port 8080 --max_seq_length 256
>>> from embedding_as_service_client import EmbeddingClient
>>> en = EmbeddingClient(host=<host_server_ip>, port=<host_port>)

2. Get sentence token embeddings

>>> vecs = en.encode(texts=['hello aman', 'how are you?'])  
>>> vecs  
array([[[ 1.7049843 ,  0.        ,  1.3486509 , ..., -1.3647075 ,  
 0.6958289 ,  1.8013777 ], ... [ 0.4913215 ,  0.60877025,  0.73050433, ..., -0.64490885, 0.8525057 ,  0.3080206 ]]], dtype=float32)  
>>> vecs.shape  
(2, 128, 768) # batch x max_sequence_length x embedding_size  

3. Using a pooling strategy (click here for more)

Supported Pooling Methods
Strategy Description
None no pooling at all; useful when you want word embeddings instead of a sentence embedding. This results in a [max_seq_len, embedding_size] encoding matrix for each sequence.
reduce_mean take the average of all token embeddings
reduce_min take the minimum of all token embeddings
reduce_max take the maximum of all token embeddings
reduce_mean_max apply reduce_mean and reduce_max separately, then concatenate the results
first_token take the embedding of the first token of a sentence
last_token take the embedding of the last token of a sentence
>>> vecs = en.encode(texts=['hello aman', 'how are you?'], pooling='reduce_mean')  
>>> vecs  
array([[-0.33547154,  0.34566957,  1.1954105 , ...,  0.33702594,  
 1.0317835 , -0.785943  ], [-0.3439088 ,  0.36881036,  1.0612687 , ...,  0.28851607, 1.1107115 , -0.6253736 ]], dtype=float32)  
  
>>> vecs.shape  
(2, 768) # batch x embedding_size  
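
For intuition, here is a minimal NumPy sketch (not part of the library) of what the pooling strategies in the table above compute over one sentence's [max_seq_len, embedding_size] token matrix; note that reduce_mean_max doubles the embedding size because it concatenates two pooled vectors:

>>> import numpy as np
>>> tokens = np.random.rand(4, 8).astype(np.float32)   # toy example: 4 tokens x 8 dims
>>> reduce_mean = tokens.mean(axis=0)                   # shape (8,)
>>> reduce_max = tokens.max(axis=0)                     # shape (8,)
>>> reduce_min = tokens.min(axis=0)                     # shape (8,)
>>> reduce_mean_max = np.concatenate([reduce_mean, reduce_max])  # shape (16,)
>>> first_token, last_token = tokens[0], tokens[-1]     # shape (8,) each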

4. Show embedding tokens

>>> en.tokenize(texts=['hello aman', 'how are you?'])  
[['_hello', '_aman'], ['_how', '_are', '_you', '?']]  

5. Using your own tokenizer

>>> texts = ['hello aman!', 'how are you']  
  
# a naive whitespace tokenizer  
>>> tokens = [s.split() for s in texts]  
>>> vecs = en.encode(tokens, is_tokenized=True)  

📋 API

▴ Back to top

  1. class embedding_as_service.text.encoder.Encoder
Argument Type Default Description
embedding str Required embedding method to be used; check the Embedding column here
model str Required model to be used for the chosen embedding; check the Model column here
max_seq_length int 128 maximum sequence length; default is 128
  2. def embedding_as_service.text.encoder.Encoder.encode
Argument Type Default Description
texts List[str] or List[List[str]] Required list of sentences, or list of lists of sentence tokens when is_tokenized=True
pooling str (Optional) pooling method to apply; available methods are listed here
is_tokenized bool False set to True when tokens are passed for encoding
batch_size int 128 maximum number of sequences handled by the encoder at once; larger inputs are partitioned into smaller batches
  3. def embedding_as_service.text.encoder.Encoder.tokenize
Argument Type Default Description
texts List[str] Required list of sentences
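
A minimal usage sketch tying the documented arguments together (the parameter values are illustrative, not recommendations):

>>> from embedding_as_service.text.encode import Encoder
>>> en = Encoder(embedding='bert', model='bert_base_cased', max_seq_length=128)
>>> # encode plain sentences with a pooling method and an explicit batch_size
>>> vecs = en.encode(texts=['hello aman', 'how are you?'], pooling='reduce_mean', batch_size=128)
>>> # encode pre-tokenized input
>>> vecs_tok = en.encode([['hello', 'aman'], ['how', 'are', 'you', '?']], is_tokenized=True)
>>> # tokenize a list of sentences into token lists
>>> tokens = en.tokenize(texts=['hello aman', 'how are you?'])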

✅ Supported Embeddings and Models

▴ Back to top

Here is the list of supported embeddings and their respective models.

Embedding Model Embedding dimensions Paper
1️⃣ albert albert_base 768 Read Paper 🔖
albert_large 1024
albert_xlarge 2048
albert_xxlarge 4096
2️⃣ xlnet xlnet_large_cased 1024 Read Paper 🔖
xlnet_base_cased 768
3️⃣ bert bert_base_uncased 768 Read Paper 🔖
bert_base_cased 768
bert_multi_cased 768
bert_large_uncased 1024
bert_large_cased 1024
4️⃣ elmo elmo_bi_lm 512 Read Paper 🔖
5️⃣ ulmfit ulmfit_forward 300 Read Paper 🔖
ulmfit_backward 300
6️⃣ use use_dan 512 Read Paper 🔖
use_transformer_large 512
use_transformer_lite 512
7️⃣ word2vec google_news_300 300 Read Paper 🔖
8️⃣ fasttext wiki_news_300 300 Read Paper 🔖
wiki_news_300_sub 300
common_crawl_300 300
common_crawl_300_sub 300
9️⃣ glove twitter_200 200 Read Paper 🔖
twitter_100 100
twitter_50 50
twitter_25 25
wiki_300 300
wiki_200 200
wiki_100 100
wiki_50 50
crawl_42B_300 300
crawl_840B_300 300

Credits

This software uses a number of open source packages.

Contributors ✨

Thanks goes to these wonderful people (emoji key):


MrPranav101

💻 📖 🚇

Aman Srivastava

💻 📖 🚇

Chirag Jain

💻 📖 🚇

Ashutosh Singh

💻 📖 🚇

Dhaval Taunk

💻 📖 🚇

Alec Koumjian

🐛

Pradeesh

🐛

This project follows the all-contributors specification. Contributions of any kind welcome!

Please read the contribution guidelines first.

Citing

▴ Back to top

If you use embedding-as-service in a scientific publication, we would appreciate citations to the following BibTeX entry:

@misc{aman2019embeddingservice,
  title={embedding-as-service},
  author={Srivastava, Aman},
  howpublished={\url{https://github.com/amansrivastava17/embedding-as-service}},
  year={2019}
}


embedding-as-service's Issues

Performance issue in /server/embedding_as_service (by P3)

Hello! I've found a performance issue in /text/xlnet/models/data_utils.py: dataset.batch(bsz_per_core, drop_remainder=True) (line 571) should be called before dataset.cache().map(parser).repeat() (line 570), which could make your program more efficient.

Here is the TensorFlow documentation to support it.

Besides, you need to check whether the parser function called in .map(parser) is affected, to make sure the changed code still works properly. For example, if parser needed data of shape (x, y, z) before the fix, it will receive data of shape (batch_size, x, y, z) after the fix.

Looking forward to your reply. By the way, I would be glad to create a PR to fix it if you are too busy.
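
A hypothetical sketch of the suggested reordering (the prepare helper and its arguments are assumptions for illustration; this is not the repository's actual code):

def prepare(dataset, parser, bsz_per_core):
    # dataset: a tf.data.Dataset; parser: the per-record parsing function
    # Original order (as described above):
    #   dataset = dataset.cache().map(parser).repeat()               # line 570
    #   dataset = dataset.batch(bsz_per_core, drop_remainder=True)   # line 571
    # Suggested order: batch first, then cache/map/repeat, so parser runs once
    # per batch instead of once per element. parser must then accept tensors
    # with a leading batch_size dimension.
    dataset = dataset.batch(bsz_per_core, drop_remainder=True)
    dataset = dataset.cache().map(parser).repeat()
    return dataset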

Unable to install

Collecting embedding-as-service
  Using cached embedding_as_service-2.0.1-py2.py3-none-any.whl (140 kB)
Collecting smart-open==1.8.4
  Using cached smart_open-1.8.4.tar.gz (63 kB)
Collecting bert-tensorflow==1.0.1
  Using cached bert_tensorflow-1.0.1-py2.py3-none-any.whl (67 kB)
Collecting embedding-as-service
  Using cached embedding_as_service-2.0.0-py3-none-any.whl (140 kB)
  Using cached embedding_as_service-1.6.0-py3-none-any.whl (138 kB)
  Using cached embedding_as_service-1.5.0-py3-none-any.whl (137 kB)
Collecting tqdm==4.32.2
  Using cached tqdm-4.32.2-py2.py3-none-any.whl (50 kB)
Collecting requests==2.21.0
  Using cached requests-2.21.0-py2.py3-none-any.whl (57 kB)
Collecting embedding-as-service
  Using cached embedding_as_service-1.4.0-py3-none-any.whl (137 kB)
  Using cached embedding_as_service-1.3.0-py3-none-any.whl (130 kB)
  Using cached embedding_as_service-1.0.0-py3-none-any.whl (129 kB)
  Using cached embedding_as_service-0.9.0-py3-none-any.whl (129 kB)
  Using cached embedding_as_service-0.8.0-py3-none-any.whl (129 kB)
  Using cached embedding_as_service-0.7.0-py3-none-any.whl (128 kB)
  Using cached embedding_as_service-0.6.0-py3-none-any.whl (128 kB)
  Using cached embedding_as_service-0.5.0-py3-none-any.whl (128 kB)
  Using cached embedding_as_service-0.4.0-py3-none-any.whl (127 kB)
Collecting keras==2.2.4
  Using cached Keras-2.2.4-py2.py3-none-any.whl (312 kB)
Collecting numpy==1.16.4
  Using cached numpy-1.16.4.zip (5.1 MB)
Collecting embedding-as-service
  Using cached embedding_as_service-0.3.0-py3-none-any.whl (127 kB)
  Using cached embedding_as_service-0.2.0-py3-none-any.whl (127 kB)
  Using cached embedding_as_service-0.1.0-py3-none-any.whl (127 kB)
  Using cached embedding_as_service-0.0.9-py3-none-any.whl (127 kB)
  Using cached embedding_as_service-0.0.8-py3-none-any.whl (127 kB)
  Using cached embedding_as_service-0.0.7-py3-none-any.whl (127 kB)
  Using cached embedding_as_service-0.0.6-py3-none-any.whl (125 kB)
  Using cached embedding_as_service-0.0.5-py3-none-any.whl (125 kB)
  Using cached embedding_as_service-0.0.4-py3-none-any.whl (125 kB)
  Using cached embedding_as_service-0.0.3.tar.gz (101 kB)
    ERROR: Command errored out with exit status 1:
     command: /Users/yukun/Documents/test/embedding/bin/python -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/5v/ch_dxl5912d4b01mmx58fspc0000gq/T/pip-install-73u8bfbs/embedding-as-service_b92c8b6f06314513b9c0f503051c3bab/setup.py'"'"'; __file__='"'"'/private/var/folders/5v/ch_dxl5912d4b01mmx58fspc0000gq/T/pip-install-73u8bfbs/embedding-as-service_b92c8b6f06314513b9c0f503051c3bab/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /private/var/folders/5v/ch_dxl5912d4b01mmx58fspc0000gq/T/pip-pip-egg-info-bo11tpx2
         cwd: /private/var/folders/5v/ch_dxl5912d4b01mmx58fspc0000gq/T/pip-install-73u8bfbs/embedding-as-service_b92c8b6f06314513b9c0f503051c3bab/
    Complete output (5 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/5v/ch_dxl5912d4b01mmx58fspc0000gq/T/pip-install-73u8bfbs/embedding-as-service_b92c8b6f06314513b9c0f503051c3bab/setup.py", line 6, in <module>
        with open('requirements.txt') as f:
    FileNotFoundError: [Errno 2] No such file or directory: 'requirements.txt'
    ----------------------------------------
WARNING: Discarding https://files.pythonhosted.org/packages/8e/db/6d4a90013dcaffce09a03ac33377fb7748ecc66bb87b71bf66e012dacb98/embedding_as_service-0.0.3.tar.gz#sha256=1d1acca51129707dee93cb47b43d67297c3ecfb0ae1e2d3f574c92bf0920320f (from https://pypi.org/simple/embedding-as-service/). Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
  Using cached embedding_as_service-0.0.2.tar.gz (1.3 kB)
    ERROR: Command errored out with exit status 1:
     command: /Users/yukun/Documents/test/embedding/bin/python -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/5v/ch_dxl5912d4b01mmx58fspc0000gq/T/pip-install-73u8bfbs/embedding-as-service_13c5a0df862943ec82ce0f3c3d1bcad2/setup.py'"'"'; __file__='"'"'/private/var/folders/5v/ch_dxl5912d4b01mmx58fspc0000gq/T/pip-install-73u8bfbs/embedding-as-service_13c5a0df862943ec82ce0f3c3d1bcad2/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /private/var/folders/5v/ch_dxl5912d4b01mmx58fspc0000gq/T/pip-pip-egg-info-00mb03q5
         cwd: /private/var/folders/5v/ch_dxl5912d4b01mmx58fspc0000gq/T/pip-install-73u8bfbs/embedding-as-service_13c5a0df862943ec82ce0f3c3d1bcad2/
    Complete output (5 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/5v/ch_dxl5912d4b01mmx58fspc0000gq/T/pip-install-73u8bfbs/embedding-as-service_13c5a0df862943ec82ce0f3c3d1bcad2/setup.py", line 6, in <module>
        with open('requirements.txt') as f:
    FileNotFoundError: [Errno 2] No such file or directory: 'requirements.txt'
    ----------------------------------------
WARNING: Discarding https://files.pythonhosted.org/packages/43/8a/6e0726f57546d29a799077d2168803350c54d641068fdc7ba7bacbd1cbed/embedding_as_service-0.0.2.tar.gz#sha256=6286d676d5983a9eefea052dd5fb8d8d75bd6603b50d3e80955bd2135e6e5c34 (from https://pypi.org/simple/embedding-as-service/). Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
  Using cached embedding_as_service-0.0.1.tar.gz (1.3 kB)
    ERROR: Command errored out with exit status 1:
     command: /Users/yukun/Documents/test/embedding/bin/python -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/5v/ch_dxl5912d4b01mmx58fspc0000gq/T/pip-install-73u8bfbs/embedding-as-service_0985b926e11f4b7d828e4a230ed45d7e/setup.py'"'"'; __file__='"'"'/private/var/folders/5v/ch_dxl5912d4b01mmx58fspc0000gq/T/pip-install-73u8bfbs/embedding-as-service_0985b926e11f4b7d828e4a230ed45d7e/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /private/var/folders/5v/ch_dxl5912d4b01mmx58fspc0000gq/T/pip-pip-egg-info-6h629r6s
         cwd: /private/var/folders/5v/ch_dxl5912d4b01mmx58fspc0000gq/T/pip-install-73u8bfbs/embedding-as-service_0985b926e11f4b7d828e4a230ed45d7e/
    Complete output (5 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/5v/ch_dxl5912d4b01mmx58fspc0000gq/T/pip-install-73u8bfbs/embedding-as-service_0985b926e11f4b7d828e4a230ed45d7e/setup.py", line 6, in <module>
        with open('requirements.txt') as f:
    FileNotFoundError: [Errno 2] No such file or directory: 'requirements.txt'
    ----------------------------------------
WARNING: Discarding https://files.pythonhosted.org/packages/ac/fa/691d432d0c2a42929dccfd04b3d898a1ea5210a2aecf5282a1bb9647f443/embedding_as_service-0.0.1.tar.gz#sha256=1091c5db093f1c1c24e850a3f50eb774b20795849d0183e4af49e7d39fbf6cb9 (from https://pypi.org/simple/embedding-as-service/). Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
ERROR: Cannot install embedding-as-service==0.0.4, embedding-as-service==0.0.5, embedding-as-service==0.0.6, embedding-as-service==0.0.7, embedding-as-service==0.0.8, embedding-as-service==0.0.9, embedding-as-service==0.1.0, embedding-as-service==0.2.0, embedding-as-service==0.3.0, embedding-as-service==0.4.0, embedding-as-service==0.5.0, embedding-as-service==0.6.0, embedding-as-service==0.7.0, embedding-as-service==0.8.0, embedding-as-service==0.9.0, embedding-as-service==1.0.0, embedding-as-service==1.3.0, embedding-as-service==1.4.0, embedding-as-service==1.5.0, embedding-as-service==1.6.0, embedding-as-service==2.0.0 and embedding-as-service==2.0.1 because these package versions have conflicting dependencies.

The conflict is caused by:
    embedding-as-service 2.0.1 depends on sentencepiece==0.1.85
    embedding-as-service 2.0.0 depends on sentencepiece==0.1.85
    embedding-as-service 1.6.0 depends on sentencepiece==0.1.85
    embedding-as-service 1.5.0 depends on tensorflow==1.15.0
    embedding-as-service 1.4.0 depends on tensorflow==1.14.0
    embedding-as-service 1.3.0 depends on tensorflow==1.14.0
    embedding-as-service 1.0.0 depends on tensorflow==1.14.0
    embedding-as-service 0.9.0 depends on tensorflow==1.14.0
    embedding-as-service 0.8.0 depends on tensorflow==1.14.0
    embedding-as-service 0.7.0 depends on tensorflow==1.14.0
    embedding-as-service 0.6.0 depends on tensorflow==1.14.0
    embedding-as-service 0.5.0 depends on tensorflow==1.14.0
    embedding-as-service 0.4.0 depends on sentencepiece==0.1.82
    embedding-as-service 0.3.0 depends on sentencepiece==0.1.82
    embedding-as-service 0.2.0 depends on sentencepiece==0.1.82
    embedding-as-service 0.1.0 depends on sentencepiece==0.1.82
    embedding-as-service 0.0.9 depends on sentencepiece==0.1.82
    embedding-as-service 0.0.8 depends on sentencepiece==0.1.82
    embedding-as-service 0.0.7 depends on sentencepiece==0.1.82
    embedding-as-service 0.0.6 depends on sentencepiece==0.1.82
    embedding-as-service 0.0.5 depends on sentencepiece==0.1.82
    embedding-as-service 0.0.4 depends on sentencepiece==0.1.82

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/user_guide/#fixing-conflicting-dependencies
WARNING: You are using pip version 21.2.4; however, version 21.3.1 is available.
You should consider upgrading via the '/Users/yukun/Documents/test/embedding/bin/python -m pip install --upgrade pip' command.
(embedding) yukun@HOME-DEL-C02G23S0Q05N embedding % /Users/yukun/Documents/test/embedding/bin/python -m pip install --upgrade pip
Requirement already satisfied: pip in ./lib/python3.9/site-packages (21.2.4)
Collecting pip
  Using cached pip-21.3.1-py3-none-any.whl (1.7 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 21.2.4
    Uninstalling pip-21.2.4:
      Successfully uninstalled pip-21.2.4
Successfully installed pip-21.3.1
(embedding) yukun@HOME-DEL-C02G23S0Q05N embedding % pip install embedding-as-service
Collecting embedding-as-service
  Using cached embedding_as_service-2.0.1-py2.py3-none-any.whl (140 kB)
  Using cached embedding_as_service-2.0.0-py3-none-any.whl (140 kB)
Collecting requests==2.21.0
  Using cached requests-2.21.0-py2.py3-none-any.whl (57 kB)
Collecting embedding-as-service
  Using cached embedding_as_service-1.6.0-py3-none-any.whl (138 kB)
  Using cached embedding_as_service-1.5.0-py3-none-any.whl (137 kB)
Collecting tensorflow-hub==0.4.0
  Using cached tensorflow_hub-0.4.0-py2.py3-none-any.whl (75 kB)
Collecting bert-tensorflow==1.0.1
  Using cached bert_tensorflow-1.0.1-py2.py3-none-any.whl (67 kB)
Collecting embedding-as-service
  Using cached embedding_as_service-1.4.0-py3-none-any.whl (137 kB)
  Using cached embedding_as_service-1.3.0-py3-none-any.whl (130 kB)
  Using cached embedding_as_service-1.0.0-py3-none-any.whl (129 kB)
  Using cached embedding_as_service-0.9.0-py3-none-any.whl (129 kB)
  Using cached embedding_as_service-0.8.0-py3-none-any.whl (129 kB)
  Using cached embedding_as_service-0.7.0-py3-none-any.whl (128 kB)
  Using cached embedding_as_service-0.6.0-py3-none-any.whl (128 kB)
  Using cached embedding_as_service-0.5.0-py3-none-any.whl (128 kB)
  Using cached embedding_as_service-0.4.0-py3-none-any.whl (127 kB)
  Using cached embedding_as_service-0.3.0-py3-none-any.whl (127 kB)
  Using cached embedding_as_service-0.2.0-py3-none-any.whl (127 kB)
  Using cached embedding_as_service-0.1.0-py3-none-any.whl (127 kB)
  Using cached embedding_as_service-0.0.9-py3-none-any.whl (127 kB)
  Using cached embedding_as_service-0.0.8-py3-none-any.whl (127 kB)
  Using cached embedding_as_service-0.0.7-py3-none-any.whl (127 kB)
  Using cached embedding_as_service-0.0.6-py3-none-any.whl (125 kB)
  Using cached embedding_as_service-0.0.5-py3-none-any.whl (125 kB)
  Using cached embedding_as_service-0.0.4-py3-none-any.whl (125 kB)
  Using cached embedding_as_service-0.0.3.tar.gz (101 kB)
  Preparing metadata (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /Users/yukun/Documents/test/embedding/bin/python -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/5v/ch_dxl5912d4b01mmx58fspc0000gq/T/pip-install-t876_3v6/embedding-as-service_9c3bc610ae584bef9a324574d470b92f/setup.py'"'"'; __file__='"'"'/private/var/folders/5v/ch_dxl5912d4b01mmx58fspc0000gq/T/pip-install-t876_3v6/embedding-as-service_9c3bc610ae584bef9a324574d470b92f/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /private/var/folders/5v/ch_dxl5912d4b01mmx58fspc0000gq/T/pip-pip-egg-info-g730mipy
       cwd: /private/var/folders/5v/ch_dxl5912d4b01mmx58fspc0000gq/T/pip-install-t876_3v6/embedding-as-service_9c3bc610ae584bef9a324574d470b92f/
  Complete output (5 lines):
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "/private/var/folders/5v/ch_dxl5912d4b01mmx58fspc0000gq/T/pip-install-t876_3v6/embedding-as-service_9c3bc610ae584bef9a324574d470b92f/setup.py", line 6, in <module>
      with open('requirements.txt') as f:
  FileNotFoundError: [Errno 2] No such file or directory: 'requirements.txt'
  ----------------------------------------
WARNING: Discarding https://files.pythonhosted.org/packages/8e/db/6d4a90013dcaffce09a03ac33377fb7748ecc66bb87b71bf66e012dacb98/embedding_as_service-0.0.3.tar.gz#sha256=1d1acca51129707dee93cb47b43d67297c3ecfb0ae1e2d3f574c92bf0920320f (from https://pypi.org/simple/embedding-as-service/). Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
  Using cached embedding_as_service-0.0.2.tar.gz (1.3 kB)
  Preparing metadata (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /Users/yukun/Documents/test/embedding/bin/python -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/5v/ch_dxl5912d4b01mmx58fspc0000gq/T/pip-install-t876_3v6/embedding-as-service_024540b978614d36ad90bfed03d50e3e/setup.py'"'"'; __file__='"'"'/private/var/folders/5v/ch_dxl5912d4b01mmx58fspc0000gq/T/pip-install-t876_3v6/embedding-as-service_024540b978614d36ad90bfed03d50e3e/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /private/var/folders/5v/ch_dxl5912d4b01mmx58fspc0000gq/T/pip-pip-egg-info-cizztkhi
       cwd: /private/var/folders/5v/ch_dxl5912d4b01mmx58fspc0000gq/T/pip-install-t876_3v6/embedding-as-service_024540b978614d36ad90bfed03d50e3e/
  Complete output (5 lines):
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "/private/var/folders/5v/ch_dxl5912d4b01mmx58fspc0000gq/T/pip-install-t876_3v6/embedding-as-service_024540b978614d36ad90bfed03d50e3e/setup.py", line 6, in <module>
      with open('requirements.txt') as f:
  FileNotFoundError: [Errno 2] No such file or directory: 'requirements.txt'
  ----------------------------------------
WARNING: Discarding https://files.pythonhosted.org/packages/43/8a/6e0726f57546d29a799077d2168803350c54d641068fdc7ba7bacbd1cbed/embedding_as_service-0.0.2.tar.gz#sha256=6286d676d5983a9eefea052dd5fb8d8d75bd6603b50d3e80955bd2135e6e5c34 (from https://pypi.org/simple/embedding-as-service/). Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
  Using cached embedding_as_service-0.0.1.tar.gz (1.3 kB)
  Preparing metadata (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /Users/yukun/Documents/test/embedding/bin/python -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/5v/ch_dxl5912d4b01mmx58fspc0000gq/T/pip-install-t876_3v6/embedding-as-service_4eccda2f22bd44f49fe99bdf8fd68e18/setup.py'"'"'; __file__='"'"'/private/var/folders/5v/ch_dxl5912d4b01mmx58fspc0000gq/T/pip-install-t876_3v6/embedding-as-service_4eccda2f22bd44f49fe99bdf8fd68e18/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /private/var/folders/5v/ch_dxl5912d4b01mmx58fspc0000gq/T/pip-pip-egg-info-o7y2ick6
       cwd: /private/var/folders/5v/ch_dxl5912d4b01mmx58fspc0000gq/T/pip-install-t876_3v6/embedding-as-service_4eccda2f22bd44f49fe99bdf8fd68e18/
  Complete output (5 lines):
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "/private/var/folders/5v/ch_dxl5912d4b01mmx58fspc0000gq/T/pip-install-t876_3v6/embedding-as-service_4eccda2f22bd44f49fe99bdf8fd68e18/setup.py", line 6, in <module>
      with open('requirements.txt') as f:
  FileNotFoundError: [Errno 2] No such file or directory: 'requirements.txt'
  ----------------------------------------
WARNING: Discarding https://files.pythonhosted.org/packages/ac/fa/691d432d0c2a42929dccfd04b3d898a1ea5210a2aecf5282a1bb9647f443/embedding_as_service-0.0.1.tar.gz#sha256=1091c5db093f1c1c24e850a3f50eb774b20795849d0183e4af49e7d39fbf6cb9 (from https://pypi.org/simple/embedding-as-service/). Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
ERROR: Cannot install embedding-as-service==0.0.4, embedding-as-service==0.0.5, embedding-as-service==0.0.6, embedding-as-service==0.0.7, embedding-as-service==0.0.8, embedding-as-service==0.0.9, embedding-as-service==0.1.0, embedding-as-service==0.2.0, embedding-as-service==0.3.0, embedding-as-service==0.4.0, embedding-as-service==0.5.0, embedding-as-service==0.6.0, embedding-as-service==0.7.0, embedding-as-service==0.8.0, embedding-as-service==0.9.0, embedding-as-service==1.0.0, embedding-as-service==1.3.0, embedding-as-service==1.4.0, embedding-as-service==1.5.0, embedding-as-service==1.6.0, embedding-as-service==2.0.0 and embedding-as-service==2.0.1 because these package versions have conflicting dependencies.

The conflict is caused by:
    embedding-as-service 2.0.1 depends on tensorflow==1.15.2
    embedding-as-service 2.0.0 depends on sentencepiece==0.1.85
    embedding-as-service 1.6.0 depends on sentencepiece==0.1.85
    embedding-as-service 1.5.0 depends on sentencepiece==0.1.82
    embedding-as-service 1.4.0 depends on tensorflow==1.14.0
    embedding-as-service 1.3.0 depends on tensorflow==1.14.0
    embedding-as-service 1.0.0 depends on tensorflow==1.14.0
    embedding-as-service 0.9.0 depends on tensorflow==1.14.0
    embedding-as-service 0.8.0 depends on tensorflow==1.14.0
    embedding-as-service 0.7.0 depends on tensorflow==1.14.0
    embedding-as-service 0.6.0 depends on tensorflow==1.14.0
    embedding-as-service 0.5.0 depends on tensorflow==1.14.0
    embedding-as-service 0.4.0 depends on sentencepiece==0.1.82
    embedding-as-service 0.3.0 depends on sentencepiece==0.1.82
    embedding-as-service 0.2.0 depends on sentencepiece==0.1.82
    embedding-as-service 0.1.0 depends on sentencepiece==0.1.82
    embedding-as-service 0.0.9 depends on sentencepiece==0.1.82
    embedding-as-service 0.0.8 depends on sentencepiece==0.1.82
    embedding-as-service 0.0.7 depends on sentencepiece==0.1.82
    embedding-as-service 0.0.6 depends on sentencepiece==0.1.82
    embedding-as-service 0.0.5 depends on sentencepiece==0.1.82
    embedding-as-service 0.0.4 depends on sentencepiece==0.1.82

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/user_guide/#fixing-conflicting-dependencies

Performance issues in server/embedding_as_service/text/xlnet/models/data_utils.py(P2)

Hello, I found a performance issue in the definition of parse_files_to_dataset in
server/embedding_as_service/text/xlnet/models/data_utils.py:
dataset = dataset.cache().map(parser) is called without num_parallel_calls.
I think it will increase the efficiency of your program if you add it.

Here is the TensorFlow documentation to support this.

Looking forward to your reply. By the way, I would be glad to create a PR to fix it if you are too busy.
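
A hypothetical sketch of the suggested change (the wrapper name and the use of tf.data.experimental.AUTOTUNE are assumptions; any positive integer also works for num_parallel_calls):

import tensorflow as tf

def map_with_parallel_calls(dataset, parser):
    # dataset: a tf.data.Dataset; parser: the per-record parsing function
    # parallelize the per-element parsing instead of running it sequentially
    return dataset.cache().map(parser, num_parallel_calls=tf.data.experimental.AUTOTUNE)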

How to get token level embedding?

Using a pooling strategy I can get the first or the last token. How do I get a token embedding for every token in a sentence?
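
Based on the pooling table in the README (strategy None means no pooling), a minimal sketch: omit the pooling argument and encode() returns one vector per token position.

>>> vecs = en.encode(texts=['how are you?'])  # no pooling
>>> vecs.shape
(1, 128, 768)  # batch x max_seq_length x embedding_size, i.e. one vector per token position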

Unable to install on Python 3.10

Tested on Windows 11 and Python 3.10.11

# pip install embedding-as-service
Collecting embedding-as-service
  Using cached embedding_as_service-3.1.2-py3-none-any.whl (140 kB)
Collecting keras==2.2.4 (from embedding-as-service)
  Using cached Keras-2.2.4-py2.py3-none-any.whl (312 kB)
Collecting tqdm==4.32.2 (from embedding-as-service)
  Using cached tqdm-4.32.2-py2.py3-none-any.whl (50 kB)
Collecting numpy==1.16.4 (from embedding-as-service)
  Using cached numpy-1.16.4.zip (5.1 MB)
  Preparing metadata (setup.py) ... done
Collecting requests==2.21.0 (from embedding-as-service)
  Using cached requests-2.21.0-py2.py3-none-any.whl (57 kB)
Collecting bert-tensorflow==1.0.1 (from embedding-as-service)
  Using cached bert_tensorflow-1.0.1-py2.py3-none-any.whl (67 kB)
Collecting tensorflow-hub==0.4.0 (from embedding-as-service)
  Using cached tensorflow_hub-0.4.0-py2.py3-none-any.whl (75 kB)
Collecting smart-open==6.2.0 (from embedding-as-service)
  Using cached smart_open-6.2.0-py3-none-any.whl (58 kB)
INFO: pip is looking at multiple versions of embedding-as-service to determine which version is compatible with other requirements. This could take a while.
Collecting embedding-as-service
  Using cached embedding_as_service-3.1.1-py3-none-any.whl (140 kB)
Collecting smart-open==1.8.4 (from embedding-as-service)
  Using cached smart_open-1.8.4.tar.gz (63 kB)
  Preparing metadata (setup.py) ... done
Collecting embedding-as-service
  Using cached embedding_as_service-3.1.0-py3-none-any.whl (140 kB)
  Using cached embedding_as_service-3.0.2-py3-none-any.whl (140 kB)
  Using cached embedding_as_service-3.0.1-py3-none-any.whl (140 kB)
  Using cached embedding_as_service-3.0.0-py3-none-any.whl (140 kB)
  Using cached embedding_as_service-2.0.2-py3-none-any.whl (140 kB)
  Using cached embedding_as_service-2.0.1-py2.py3-none-any.whl (140 kB)
INFO: pip is looking at multiple versions of embedding-as-service to determine which version is compatible with other requirements. This could take a while.
  Using cached embedding_as_service-2.0.0-py3-none-any.whl (140 kB)
  Using cached embedding_as_service-1.6.0-py3-none-any.whl (138 kB)
  Using cached embedding_as_service-1.5.0-py3-none-any.whl (137 kB)
  Using cached embedding_as_service-1.4.0-py3-none-any.whl (137 kB)
  Using cached embedding_as_service-1.3.0-py3-none-any.whl (130 kB)
INFO: This is taking longer than usual. You might need to provide the dependency resolver with stricter constraints to reduce runtime. See https://pip.pypa.io/warnings/backtracking for guidance. If you want to abort this run, press Ctrl + C.
  Using cached embedding_as_service-1.0.0-py3-none-any.whl (129 kB)
  Using cached embedding_as_service-0.9.0-py3-none-any.whl (129 kB)
  Using cached embedding_as_service-0.8.0-py3-none-any.whl (129 kB)
  Using cached embedding_as_service-0.7.0-py3-none-any.whl (128 kB)
  Using cached embedding_as_service-0.6.0-py3-none-any.whl (128 kB)
  Using cached embedding_as_service-0.5.0-py3-none-any.whl (128 kB)
  Using cached embedding_as_service-0.4.0-py3-none-any.whl (127 kB)
  Using cached embedding_as_service-0.3.0-py3-none-any.whl (127 kB)
  Using cached embedding_as_service-0.2.0-py3-none-any.whl (127 kB)
  Using cached embedding_as_service-0.1.0-py3-none-any.whl (127 kB)
  Using cached embedding_as_service-0.0.9-py3-none-any.whl (127 kB)
  Using cached embedding_as_service-0.0.8-py3-none-any.whl (127 kB)
  Using cached embedding_as_service-0.0.7-py3-none-any.whl (127 kB)
  Using cached embedding_as_service-0.0.6-py3-none-any.whl (125 kB)
  Using cached embedding_as_service-0.0.5-py3-none-any.whl (125 kB)
  Using cached embedding_as_service-0.0.4-py3-none-any.whl (125 kB)
  Using cached embedding_as_service-0.0.3.tar.gz (101 kB)
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [6 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "C:\Users\Ali\AppData\Local\Temp\pip-install-gt9gcw8y\embedding-as-service_0cc15a3bda0c440294a4eedf992672dc\setup.py", line 6, in <module>
          with open('requirements.txt') as f:
      FileNotFoundError: [Errno 2] No such file or directory: 'requirements.txt'
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

Memory and speed issue.

I was working with about 1,000 sentences, each roughly 200 tokens long. When I feed them to the model, both my computer and Google Colab run out of memory. When I instead feed them one by one in a loop, the process becomes too slow: after 2 hours of running the code it had found the embeddings of only 40-50 sentences. Is there any way to speed up the process?
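
A hypothetical workaround sketch (not an official fix): encode in modest chunks with a pooling strategy so only one vector per sentence is kept in memory; the chunk size of 64 and the model choice are assumptions.

import numpy as np
from embedding_as_service.text.encode import Encoder

en = Encoder(embedding='bert', model='bert_base_uncased', max_seq_length=256)

def encode_in_chunks(texts, chunk=64):
    parts = []
    for i in range(0, len(texts), chunk):
        # reduce_mean keeps one 768-d vector per sentence instead of one per token
        parts.append(en.encode(texts=texts[i:i + chunk], pooling='reduce_mean'))
    return np.vstack(parts)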

Request to add MobileBert

I would like to request adding MobileBERT, since it is smaller and faster. The text below is taken from a similar request in huggingface/transformers#4185:

New model addition
This issue is to request adding the MobileBERT model, which was recently released by Carnegie Mellon University, Google Research, and Google Brain.

Model description
MobileBERT is a more computationally-efficient model for achieving BERT-base level accuracy on a smartphone.

Open source status
The model implementation is available: https://github.com/google-research/google-research/tree/master/mobilebert

Model weights are available: uncased_L-24_H-128_B-512_A-4_F-4_OPT (MobileBERT Optimized Uncased English)

Authors: Zhiqing Sun, Hongkun Yu (@saberkun), Xiaodan Song, Renjie Liu, Yiming Yang, Denny Zhou

Same sentence produces a different vector in XLNet

While running the same sentence using the 'xlnet' embedding with the 'xlnet_base_cased' model, the model produces different embeddings each time. For example, if

vec1 = en.encode(texts=['he is anger'], pooling='reduce_mean')
vec2 = en.encode(texts=['he is anger'], pooling='reduce_mean')

then vec1 and vec2 contain different embedding vectors.

Is it possible to speed it up without printing a certain line

Thanks so much for your code, it's helping me in my project.
But I am facing a small issue and I'm not sure whether it can be fixed:

I am embedding (using XLNet) 3 inputs; each input has 5400 sentences and each sentence has 20 words,
so the total is 3 * 5400 * 20.

Two hours have passed and it still hasn't finished embedding the first input (5400 * 20).
I think I will run out of memory in Google Colab as well before finishing the embedding part [of course this happened 🙈, it didn't even finish embedding the 1st input].

I notice that with each embedding it prints

"Converting texts to features: 100%|██████████| 20/20 [00:00<00:00, 17637.95it/s]INFO:tensorflow:memory input None
INFO:tensorflow:Use float type <dtype: 'float32'>"

Is it possible to avoid printing that line with each embedding, to speed up the process?
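
A hypothetical workaround (a general TensorFlow setting, not specific to this project): silence TF's INFO-level messages before creating the Encoder. This only hides the log lines; it does not by itself make encoding faster.

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'   # suppress C++-side TensorFlow logs

import tensorflow as tf
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)  # suppress Python-side INFO logs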

issue during installation

When I use

pip install embedding-as-service

or

python setup.py

I get this issue:

Traceback (most recent call last):
  File "setup.py", line 4, in <module>
    long_description = fh.read()
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa2 in position 169: illegal multibyte sequence

Installation environment: Windows 10 + Python 3.6
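
A hypothetical fix sketch for the decode error above: open the long-description file in setup.py with an explicit UTF-8 encoding rather than the platform default (the README.md filename is an assumption based on the traceback):

# in setup.py: force UTF-8 so reading does not fall back to the 'gbk' codec on Windows
with open('README.md', encoding='utf-8') as fh:
    long_description = fh.read()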

Does it support RoBERTa now?

Does it support RoBERTa now? I saw a 'roberta' tag but did not find RoBERTa in the README, so does this repo support RoBERTa?

XLNet encoding issue

Please consider the following code.

from embedding_as_service.text.encode import Encoder
en = Encoder(embedding='xlnet', model='xlnet_base_cased', download=True)
vecs = en.encode(texts=["hello, how are you?"])

The above code gives the following error. I'm using Python 3.6 (run from Spyder).

ValueError: Trying to share variable model/transformer/r_w_bias, but specified shape (12, 12, 64) and found shape (24, 16, 64)

By the way, I'm able to generate the embeddings through bert.

Thanks,
Hari

Unable to install embedding as service on Colab.

When installing it on Colab using

!pip install embedding_as_service

I am unable to install it because of the error shown in the attached screenshot. Please have a look and suggest a solution.

Missing files in sdist

It appears that the manifest is missing at least one file necessary to build
from the sdist for version 2.0.0. You're in good company, about 5% of other
projects updated in the last year are also missing files.

+ /tmp/venv/bin/pip3 wheel --no-binary embedding-as-service -w /tmp/ext embedding-as-service==2.0.0
Looking in indexes: http://10.10.0.139:9191/root/pypi/+simple/
Collecting embedding-as-service==2.0.0
  Downloading http://10.10.0.139:9191/root/pypi/%2Bf/ae5/c26b95f36631a/embedding_as_service-2.0.0.tar.gz (119 kB)
    ERROR: Command errored out with exit status 1:
     command: /tmp/venv/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-wheel-3ob4vns9/embedding-as-service/setup.py'"'"'; __file__='"'"'/tmp/pip-wheel-3ob4vns9/embedding-as-service/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-wheel-3ob4vns9/embedding-as-service/pip-egg-info
         cwd: /tmp/pip-wheel-3ob4vns9/embedding-as-service/
    Complete output (5 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-wheel-3ob4vns9/embedding-as-service/setup.py", line 6, in <module>
        with open('requirements.txt') as f:
    FileNotFoundError: [Errno 2] No such file or directory: 'requirements.txt'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

Will be great if we can specify where and how to store embedding weights

A little more flexibility regarding where to store embedding weights would be greatly appreciated, in order to share / reuse weights that have already been downloaded (sorry, I don't have enough time to write the code right now):

from embedding_as_service.text.encode import Encoder
en = Encoder(embedding='bert', model='bert_base_cased', model_path = '/what/ever/I/want', download=False)

Really a great piece of work; I've been playing with it for 2 hours and it works like a charm.
Thx
