as-ideas / headliner

๐Ÿ– Easy training and deployment of seq2seq models.

Home Page: https://as-ideas.github.io/headliner/

License: Other

Python 99.99% Shell 0.01%
neural-network nlp python seq2seq tensorflow

headliner's People

Contributors

cschaefer26, datitran

headliner's Issues

Does headliner support beam search during decoding?

Hello, I want to know whether headliner supports beam search during prediction. If it does, how can I use it? The documentation doesn't mention it, and when I checked the source code it looks like greedy search is used during decoding.
Many thanks.
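
For reference, a minimal generic sketch of what beam search would add over greedy decoding; this is not part of headliner's API, and step_fn is a hypothetical stand-in for the model's next-token distribution:

import math

def beam_search(step_fn, start_token, end_token, beam_width=3, max_len=20):
    # step_fn(prefix) -> list of (token, probability) pairs; hypothetical.
    beams = [([start_token], 0.0)]  # (token sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_token:
                candidates.append((seq, score))
                continue
            for token, prob in step_fn(seq):
                candidates.append((seq + [token], score + math.log(prob)))
        # keep the k best prefixes instead of only the single best one (greedy search)
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == end_token for seq, _ in beams):
            break
    return beams[0][0]

Greedy decoding is the special case beam_width=1.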

Question: Pre-trained BERT for MT

Hi,

Has there been any research on your side comparing the BLEU score of this pre-trained BERT setup for MT against SOTA results?
Or is it merely meant to illustrate the flexibility of the library, e.g., attaching a decoder and re-training on a custom dataset?

Import library

Hi,

Thank you so much for sharing this library. It seems to be very easy to use compared to others.

I tried to import the library to reproduce your example, but I receive an error message:

File "/home/ubuntu/.local/lib/python3.5/site-packages/headliner/model/summarizer.py", line 14
self.vectorizer: Union[Vectorizer, None] = None
^
SyntaxError: invalid syntax

I am running Python 3.5.2.

Am I doing something wrong?

Thanks
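
For context, the failing line uses a PEP 526 variable annotation, which was only introduced in Python 3.6, so Python 3.5 rejects it as a SyntaxError. A minimal illustration (not headliner code):

from typing import Optional

class Example:
    def __init__(self):
        # Valid on Python >= 3.6 only (PEP 526 variable annotation):
        self.vectorizer: Optional[object] = None
        # On Python 3.5 the equivalent would be a type comment:
        # self.vectorizer = None  # type: Optional[object]

Upgrading to Python 3.6 or newer avoids this particular error.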

Training with a custom AmazonFoodReviews dataset for text summarization

Hi,
First of all, thanks for open-sourcing such an easy-to-use sequence-to-sequence library.
I wanted to try Headliner, so as a test I trained a summarization model on a custom AmazonFoodReviews dataset, but training ended up with a loss of 4.662851969401042.

I used TransformerSummarizer to train the custom model; the code is below.

summarizer = TransformerSummarizer(num_heads=1, embedding_size=64, max_prediction_len=20)
trainer = Trainer(batch_size=2, steps_per_epoch=100)
trainer.train(summarizer, training_data, num_epochs=100)
summarizer.save('/tmp/summarizer')

The training data was a list of tuples (only 10 samples shown here):
[('have bought several of the vitality canned dog food products and have found them all to be of good quality the product looks more like stew than processed meat and it smells better my labrador is finicky and she appreciates this product better than most', 'good quality dog food'), ('product arrived labeled as jumbo salted peanuts the peanuts were actually small sized unsalted not sure if this was an error or if the vendor intended to represent the product as jumbo', 'not as advertised'), ('this is confection that has been around few centuries it is light pillowy citrus gelatin with nuts in this case filberts and it is cut into tiny squares and then liberally coated with powdered sugar and it is tiny mouthful of heaven not too chewy and very flavorful highly recommend this yummy treat if you are familiar with the story of lewis the lion the witch and the wardrobe this is the treat that seduces edmund into selling out his brother and sisters to the witch', 'delight says it all'), ('if you are looking for the secret ingredient in robitussin believe have found it got this in addition to the root beer extract ordered and made some cherry soda the flavor is very medicinal', 'cough medicine'), ('great taffy at great price there was wide assortment of yummy taffy delivery was very quick if your taffy lover this is deal', 'great taffy'), ('got wild hair for taffy and ordered this five pound bag the taffy was all very enjoyable with many flavors watermelon root beer melon peppermint grape etc my only complaint is there was bit too much red black licorice flavored pieces between me my kids and my husband this lasted only two weeks would recommend this brand of taffy it was delightful treat', 'nice taffy'), ('this saltwater taffy had great flavors and was very soft and chewy each candy was individually wrapped well none of the candies were stuck together which did happen in the expensive version fralinger would highly recommend this candy served it at beach themed party and everyone loved it', 'great just as good as the expensive brands'), ('this taffy is so good it is very soft and chewy the flavors are amazing would definitely recommend you buying it very satisfying', 'wonderful tasty taffy'), ('right now am mostly just sprouting this so my cats can eat the grass they love it rotate it around with wheatgrass and rye too', 'yay barley'), ('this is very healthy dog food good for their digestion also good for small puppies my dog eats her required amount at every feeding', 'healthy dog food')]

vocab encoder: 18122, vocab decoder: 4439

But the predictions were very poor.

I then gave BertSummarizer a try on the same dataset, which raised the error: TypeError: generator yielded an element that did not match the expected structure. The expected structure was (tf.int32, tf.int32, tf.int32), but the yielded element was ([3, 7293, 1725, 14131, 10785, 16089, 17337, 2220, 4703, 6185, 12287, 574, 7293, 6281, 16104, 414, 16352, 1242, 10785, 6793, 12569, 16089, 12274, 9204, 10148, 9029, 15200, 16066, 12257, 9667, 574, 8317, 14571, 1408, 10321, 8751, 8290, 5960, 574, 14182, 736, 16193, 12274, 1408, 16066, 10171, 2], [3, 1655, 3098, 1136, 1500, 2]).

I would be very thankful if you could help me out.

Thanks.

BERT model prediction giving the same output

Hi @cschaefer26 and @datitran ,

Thank you so much for the headliner library. I have been playing around with it for a while and have really enjoyed it so far.

I was working on a summarization use case using headliner's BERT model. I followed the README code below (with a few tweaks to the summarizer parameters) with my own train_data:

from headliner.preprocessing import Preprocessor

train_data = [('Some inputs.', 'Some outputs.')] * 10

# use BERT-specific start and end token
preprocessor = Preprocessor(start_token='[CLS]',
                            end_token='[SEP]',
                            lower_case=True)
train_prep = [preprocessor(t) for t in train_data]
targets_prep = [t[1] for t in train_prep]


from tensorflow_datasets.core.features.text import SubwordTextEncoder
from transformers import BertTokenizer
from headliner.preprocessing.vectorizer import Vectorizer
from headliner.trainer import Trainer
from headliner.model import SummarizerBert

# Use a pre-trained BERT embedding and BERT tokenizer for the encoder 
tokenizer_input = BertTokenizer.from_pretrained('bert-base-uncased')
tokenizer_target = SubwordTextEncoder.build_from_corpus(
    targets_prep, target_vocab_size=2**13,  reserved_tokens=[preprocessor.start_token, preprocessor.end_token])

vectorizer = Vectorizer(tokenizer_input, tokenizer_target)
summarizer = SummarizerBert(num_heads=4,
                            feed_forward_dim=512,
                            num_layers_encoder=3,
                            num_layers_decoder=3,
                            bert_embedding_encoder='bert-base-uncased',
                            embedding_encoder_trainable=False,
                            embedding_size_encoder=768,
                            embedding_size_decoder=64,
                            dropout_rate=0.1,
                            max_prediction_len=400)
summarizer.init_model(preprocessor, vectorizer)

trainer = Trainer(batch_size=2)
trainer.train(summarizer, train_data, num_epochs=200)

I trained the model for 200 epochs and the logs show the loss steadily decreasing (from around 4 down to 0.69), so training seems to work fine.

After that, when I run prediction on the saved model using the following code, it always returns the same prediction for any test_sentence:

from headliner.model.summarizer_bert import SummarizerBert

summarizer = SummarizerBert.load('/path/to/headliner_bert_model')
summarizer.predict(test_sentence)

Can anyone advise whether I am missing anything in the prediction step for the BERT model?

Improve tokenization

  • use SubwordTokenizer
  • make the fit to the data iterative (currently all data is loaded into memory); see the sketch below
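
A minimal sketch of how an iterative fit could look, assuming the target texts are streamed from a file; the file name and helper are hypothetical, and SubwordTextEncoder.build_from_corpus already accepts a generator, so the corpus does not have to be held in memory:

from tensorflow_datasets.core.features.text import SubwordTextEncoder

def stream_targets(path):
    # Hypothetical helper: yield one target text per line so the whole
    # corpus never has to be materialized in memory.
    with open(path, encoding='utf-8') as f:
        for line in f:
            yield line.strip()

# build_from_corpus consumes the corpus generator lazily
tokenizer = SubwordTextEncoder.build_from_corpus(
    stream_targets('targets.txt'), target_vocab_size=2**13)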

Error while loading trained model

I followed the documentation and got an error with the following code:

Training part

NUM_UNITS = 1024
BATCH_SIZE = 32
STEPS_PER_EPOCH = len(data) // BATCH_SIZE
STEPS_TO_LOG = 100
MAX_OUTPUT_LENGTH = 50
EPOCHS = 20
EMB_SIZE = 128

from headliner.trainer import Trainer
from headliner.model.summarizer_attention import SummarizerAttention

summarizer = SummarizerAttention(lstm_size=NUM_UNITS, embedding_size=EMB_SIZE)
trainer = Trainer(batch_size=BATCH_SIZE, 
                  steps_per_epoch=STEPS_PER_EPOCH, 
                  steps_to_log=STEPS_TO_LOG, 
                  max_output_len=MAX_OUTPUT_LENGTH, 
                  model_save_path=save_path)
trainer.train(summarizer, train, num_epochs=EPOCHS, val_data=test)

Loading the trained model:

summarizer_loaded = SummarizerAttention.load('summarizer')
trainer = Trainer(batch_size=2)
trainer.train(summarizer_loaded, data)
---------------------------------------------------------------------------
UnknownError                              Traceback (most recent call last)
<ipython-input-20-1bcb176df5c9> in <module>()
      1 summarizer_loaded = SummarizerAttention.load('summarizer')
      2 trainer = Trainer(batch_size=2)
----> 3 trainer.train(summarizer_loaded, data)
      4 # summarizer_loaded.save('/tmp/summarizer_retrained')

C:\ProgramData\Anaconda3\lib\site-packages\headliner\trainer.py in train(self, summarizer, train_data, val_data, num_epochs, scorers, callbacks)
    203         train_step = summarizer.new_train_step(self.loss_function, self.batch_size, apply_gradients=True)
    204         while epoch_count < num_epochs:
--> 205             for train_source_seq, train_target_seq in train_dataset.take(-1):
    206                 batch_count += 1
    207                 current_loss = train_step(train_source_seq, train_target_seq)

C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\data\ops\iterator_ops.py in __next__(self)
    620 
    621   def __next__(self):  # For Python 3 compatibility
--> 622     return self.next()
    623 
    624   def _next_internal(self):

C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\data\ops\iterator_ops.py in next(self)
    664     """Returns a nested structure of `Tensor`s containing the next element."""
    665     try:
--> 666       return self._next_internal()
    667     except errors.OutOfRangeError:
    668       raise StopIteration

C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\data\ops\iterator_ops.py in _next_internal(self)
    649             self._iterator_resource,
    650             output_types=self._flat_output_types,
--> 651             output_shapes=self._flat_output_shapes)
    652 
    653       try:

C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\ops\gen_dataset_ops.py in iterator_get_next_sync(iterator, output_types, output_shapes, name)
   2670       else:
   2671         message = e.message
-> 2672       _six.raise_from(_core._status_to_exception(e.code, message), None)
   2673   # Add nodes to the TensorFlow graph.
   2674   if not isinstance(output_types, (list, tuple)):

C:\ProgramData\Anaconda3\lib\site-packages\six.py in raise_from(value, from_value)

UnknownError: AttributeError: 'Vectorizer' object has no attribute 'max_input_len'
Traceback (most recent call last):

  File "C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\ops\script_ops.py", line 221, in __call__
    ret = func(*args)

  File "C:\ProgramData\Anaconda3\lib\site-packages\tensorflow_core\python\data\ops\dataset_ops.py", line 585, in generator_py_func
    values = next(generator_state.get_iterator(iterator_id))

  File "C:\ProgramData\Anaconda3\lib\site-packages\headliner\trainer.py", line 264, in <genexpr>
    data_vectorized = (vectorizer(d) for d in data_preprocessed)

  File "C:\ProgramData\Anaconda3\lib\site-packages\headliner\preprocessing\vectorizer.py", line 42, in __call__
    if self.max_input_len is not None:

AttributeError: 'Vectorizer' object has no attribute 'max_input_len'


	 [[{{node PyFunc}}]] [Op:IteratorGetNextSync]

Training with longer examples leads to `InvalidArgumentError`

When I try to train a SummarizerTransformer on longer training examples, I get the following error in train_step: InvalidArgumentError: Incompatible shapes: [1,11,64] vs. [1,8,64]. It looks like it depends on the length of the targets.

Minimal example:

from headliner.trainer import Trainer
from headliner.model.summarizer_transformer import SummarizerTransformer

data = [
        ('You are the stars, earth and sky for me!', 'I love you I love you I love you.'),
        ('You are the stars, earth and sky for me!', 'I love you.')
]

summarizer = SummarizerTransformer(embedding_size=64, max_prediction_len=20)
trainer = Trainer(batch_size=1, steps_per_epoch=100)
trainer.train(summarizer, data, num_epochs=1)

This leads to an InvalidArgumentError, while the following,

from headliner.trainer import Trainer
from headliner.model.summarizer_transformer import SummarizerTransformer

data = [
        ('You are the stars, earth and sky for me!', 'I love you.'),
        ('You are the stars, earth and sky for me!', 'I love you.')
]

summarizer = SummarizerTransformer(embedding_size=64, max_prediction_len=20)
trainer = Trainer(batch_size=1, steps_per_epoch=100)
trainer.train(summarizer, data, num_epochs=1)

which drops the ('You are the stars, earth and sky for me!', 'I love you I love you I love you.') pair, works fine.

I use:

python==3.6
tensorflow==2.0.0
headliner==0.0.22

It does not depend on whether I run it on GPU or CPU.

Can you reproduce this bug?

Colab hosted notebooks for quick demo.

This is not an issue, more of a suggestion.

Would it be possible to have Google Colab hosted notebooks in the documentation, to play around with the code and run quick demos?
