
sequential_sentence_classification's Introduction

Sequential Sentence Classification

This repo has code and data for our paper "Pretrained Language Models for Sequential Sentence Classification".

How to run

pip install -r requirements.txt
scripts/train.sh tmp_output_dir

Update the scripts/train.sh script with the appropriate hyperparameters and data paths.

CSAbstruct dataset

The train, dev, and test splits of the dataset are in data/CSAbstruct.

CSAbstruct is also available on the Huggingface Hub.
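
Each split is a .jsonl file with one example per line. Below is a minimal loading sketch using the jsonlines package from requirements.txt; the field names "sentences" and "labels" are assumptions, so check the files for the exact schema:

import jsonlines

# Read the training split: one abstract per line, with its sentences and
# (assumed) one label per sentence.
with jsonlines.open("data/CSAbstruct/train.jsonl") as reader:
    for example in reader:
        print(len(example["sentences"]), example["labels"])
        break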

Citing

If you use the data or the model, please cite:

@inproceedings{Cohan2019EMNLP,
  title={Pretrained Language Models for Sequential Sentence Classification},
  author={Arman Cohan and Iz Beltagy and Daniel King and Bhavana Dalvi and Dan Weld},
  year={2019},
  booktitle={EMNLP},
}

sequential_sentence_classification's People

Contributors

armancohan, dakinggg, ibeltagy, soldni


sequential_sentence_classification's Issues

prediction steps

It would be very helpful if you could explain how to use the trained model to predict on new data. There is a prediction.py, but there is no description of the steps to follow to run it.
Thanks.

run on windows

I changed the code in train.sh to run on Windows.
The code is as follows:

set SEED=15270
set /a PYTORCH_SEED=%SEED% / 10
set /a NUMPY_SEED=%PYTORCH_SEED% / 10
set BERT_VOCAB=https://ai2-s2-research.s3-us-west-2.amazonaws.com/scibert/allennlp_files/scivocab_uncased.vocab
set BERT_WEIGHTS=https://ai2-s2-research.s3-us-west-2.amazonaws.com/scibert/allennlp_files/scibert_scivocab_uncased.tar.gz
set TRAIN_PATH=data\CSAbstruct\train.jsonl
set DEV_PATH=data\CSAbstruct\dev.jsonl
set TEST_PATH=data\CSAbstruct\test.jsonl
:: true for our model, false for the baseline
set USE_SEP=true
:: CRF only works for the baseline
set WITH_CRF=false
set cuda_device=0
set BATCH_SIZE=4
set LR=5e-5
set TRAINING_DATA_INSTANCES=1668
set NUM_EPOCHS=2
set MAX_SENT_PER_EXAMPLE=10
set SENT_MAX_LEN=80
set SCI_SUM=false
set USE_ABSTRACT_SCORES=false
:: use fake scores for testing
set SCI_SUM_FAKE_SCORES=false
python -m allennlp.run train sequential_sentence_classification/config.jsonnet --include-package sequential_sentence_classification -s output

But I got the error: json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0).
Can you provide me some instructions for running this on Windows?

Incorrect [SEP] token ID

Hey!
I have noticed that in model.py you obtain the [SEP] mask with this piece of code:

if self.use_sep:
    # The following code collects the vectors of the SEP tokens from all the examples
    # in the batch and arranges them in one list. It does the same for the labels
    # and confidences.
    # TODO: replace 103 with '[SEP]'
    sentences_mask = sentences['bert'] == 103  # mask for all the SEP tokens in the batch
    # given batch_size x num_sentences_per_example x sent_len x vector_len,
    # returns num_sentences_per_batch x vector_len
    embedded_sentences = embedded_sentences[sentences_mask]

Is 103 the correct value? I looked at the vocab.txt from the original BERT model, and it seems that the [SEP] token has an ID of 102 (since the vocab file is 0-indexed). Is this a mistake, or do you use a different vocab file than the original one?
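
For what it's worth, the ID could also be looked up from the vocabulary instead of hard-coded. A minimal sketch, assuming the Huggingface transformers tokenizer and that the checkpoint name below matches the vocab the model was trained with:

from transformers import AutoTokenizer

# Resolve the [SEP] id from the tokenizer's vocab instead of using a magic number.
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
print(tokenizer.convert_tokens_to_ids("[SEP]"))  # 103 for scivocab_uncased; the original BERT vocab uses 102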

Best,
Kacper

Implementation of segmentation embeddings

Hi,
Thanks for releasing this awesome repo.
In your implementation, I found that each input sentence has a separate id in the "bert-type-ids". I wonder: do you use these sentence ids to generate the segmentation embeddings?

BERT uses the sum of the token embeddings, the segment embeddings, and the position embeddings as its input embeddings. It assumes there are at most two input segments and only uses 0 and 1 as segment ids, which means that segment ids greater than 1 never appeared during BERT's pre-training.
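
A quick check that shows the concern, as a sketch assuming the Huggingface transformers package (not this repo's code):

from transformers import BertModel

# bert-base-uncased ships with exactly two segment (token-type) embeddings,
# so token_type_ids greater than 1 have no pretrained embedding to look up.
bert = BertModel.from_pretrained("bert-base-uncased")
print(bert.embeddings.token_type_embeddings.num_embeddings)  # -> 2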

pip install -r requirements.txt doesn't work

OS: Windows
Python version: 3.7

I created a clean environment with conda. After running pip install -r requirements.txt (or pip3 install -r requirements.txt) I got the following error:

λ pip3 install -r requirements.txt
Collecting jsonlines (from -r requirements.txt (line 1))
  Using cached https://files.pythonhosted.org/packages/4f/9a/ab96291470e305504aa4b7a2e0ec132e930da89eb3ca7a82fbe03167c131/jsonlines-1.2.0-py2.py3-none-any.whl
Obtaining allennlp from git+git://github.com/ibeltagy/allennlp@fp16_and_others#egg=allennlp (from -r requirements.txt (line 2))
  Updating c:\lplp-kacper\code_repo\sequential_sentence_classification\src\allennlp clone (to revision fp16_and_others)
  Running command git fetch -q --tags
  Running command git reset --hard -q ac2b21da6008d0e41d31192ea596153988c000a4
Collecting six (from jsonlines->-r requirements.txt (line 1))
  Downloading https://files.pythonhosted.org/packages/65/eb/1f97cb97bfc2390a276969c6fae16075da282f5058082d4cb10c6c5c1dba/six-1.14.0-py2.py3-none-any.whl
Collecting torch>=1.2.0 (from allennlp->-r requirements.txt (line 2))
  ERROR: Could not find a version that satisfies the requirement torch>=1.2.0 (from allennlp->-r requirements.txt (line 2)) (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2)
ERROR: No matching distribution found for torch>=1.2.0 (from allennlp->-r requirements.txt (line 2))
WARNING: You are using pip version 19.2.3, however version 20.0.2 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.

I think this is related to this issue: pytorch/pytorch#19406
By following the advice there I was able to solve the problem.
The following commands, run before installing from requirements.txt, solved the error for me:

conda create -n allennlp python=3.7

pip install torch==1.2.0+cpu torchvision==0.4.0+cpu -f https://download.pytorch.org/whl/torch_stable.html

pip install allennlp

Just posting this here in case anyone runs into the same problem as I did. I am closing this issue for now since it has been solved.

Possibility of bug with batch_size > 1 and long intermediate sentences with bert-base models

Hi,

First of all, thanks for publicly sharing your code. I think there is a bug when batch_size > 1 and an intermediate instance in the batch gets truncated because its number of sub-tokens exceeds 512 (while using a bert-base transformer model).

https://github.com/allenai/sequential_sentence_classification/blob/master/sequential_sentence_classification/model.py#L135

Here, instead of truncating the labels of the intermediate instance, the current code truncates the labels at the end of the batch. This can result in a label mismatch, especially for the subsequent instances in that batch. Can you confirm this? If there is indeed a bug, I would like to volunteer to provide a fix.
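
A tiny sketch of the misalignment I suspect, with hypothetical numbers (not the repo's code):

# Two examples in a batch; truncation drops the last two sentences of example 1.
labels = [[0, 1, 2, 3, 4],  # example 1: labels for 5 sentences
          [5, 6, 7]]        # example 2: labels for 3 sentences
kept = [3, 3]               # sentences surviving sub-token truncation per example

# Per-instance truncation keeps labels aligned with their own example:
per_instance = [ls[:n] for ls, n in zip(labels, kept)]
print(per_instance)  # [[0, 1, 2], [5, 6, 7]]

# Truncating the flattened labels at the end of the batch does not:
flat = [l for ls in labels for l in ls]
print(flat[:sum(kept)])  # [0, 1, 2, 3, 4, 5] -- labels 3 and 4 leak into example 2's slots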

Thanks,

Prediction with CRF for true test data

Hi,

Thank you for sharing the code for your work. I have been able to run the code and train the models for all three cases. When I run predictor.py, it works fine for the BERT and BERT + Transformer models. However, when using predictor.py with a BERT + Transformer + CRF model, since we don't have true labels for the test data, the code below raises an error:

if self.with_crf:
    mask_sentences = (labels != -1)
    best_paths = self.crf.viterbi_tags(label_logits, mask_sentences)

I understand why it raises an error, but I am not sure what to modify to make it work.
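
One thing I considered, just as a sketch (it assumes every sentence position is valid at prediction time, and it is not your code), is building the mask without the labels:

import torch

if self.with_crf:
    if labels is not None:
        mask_sentences = (labels != -1)
    else:
        # No gold labels at prediction time: assume all sentence positions are valid.
        mask_sentences = torch.ones(
            label_logits.shape[:2], dtype=torch.bool, device=label_logits.device
        )
    best_paths = self.crf.viterbi_tags(label_logits, mask_sentences)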

Any suggestion will be really helpful.

Thank you!

Paper Ids

Greetings, where can I find the paper ID or title so that I can search for the full paper online?

Thank you.

problem for running requirements.txt

Hi,

I ran pip install -r requirements.txt, but it returns an error, shown in the picture below.
I have also tried the GitHub link given in requirements.txt, and it returns a 404 error.
I wonder if the link is invalid?

[screenshot of the pip error]

jsonnet not loading even though installed

Running macOS Big Sur (11.2.3). I installed allennlp successfully with pip in my system environment. The installed version is jsonnet-0.17.0. When I try to run the train script, I get this warning and then this error:

WARNING - allennlp.common.params - _jsonnet not loaded, treating sequential_sentence_classification/config.jsonnet as json
Traceback (most recent call last):
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "//sequential_sentence_classification/src/allennlp/allennlp/run.py", line 21, in <module>
    run()
  File "//sequential_sentence_classification/src/allennlp/allennlp/run.py", line 18, in run
    main(prog="allennlp")
  File "//sequential_sentence_classification/src/allennlp/allennlp/commands/__init__.py", line 102, in main
    args.func(args)
  File "//sequential_sentence_classification/src/allennlp/allennlp/commands/train.py", line 124, in train_model_from_args
    args.cache_prefix)
  File "//sequential_sentence_classification/src/allennlp/allennlp/commands/train.py", line 162, in train_model_from_file
    params = Params.from_file(parameter_filename, overrides)
  File "//sequential_sentence_classification/src/allennlp/allennlp/common/params.py", line 477, in from_file
    file_dict = json.loads(evaluate_file(params_file, ext_vars=ext_vars))
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Error while running scripts/train.sh

I am getting this error when trying to run the scripts/train.sh file

Traceback (most recent call last):
  File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/content/sequential_sentence_classification/src/allennlp/allennlp/run.py", line 15, in <module>
    from allennlp.commands import main  # pylint: disable=wrong-import-position
  File "/content/sequential_sentence_classification/src/allennlp/allennlp/commands/__init__.py", line 8, in <module>
    from allennlp.commands.configure import Configure
  File "/content/sequential_sentence_classification/src/allennlp/allennlp/commands/configure.py", line 26, in <module>
    from allennlp.service.config_explorer import make_app
  File "/content/sequential_sentence_classification/src/allennlp/allennlp/service/config_explorer.py", line 24, in <module>
    from allennlp.common.configuration import configure, choices
  File "/content/sequential_sentence_classification/src/allennlp/allennlp/common/configuration.py", line 17, in <module>
    from allennlp.data.dataset_readers import DatasetReader
  File "/content/sequential_sentence_classification/src/allennlp/allennlp/data/__init__.py", line 1, in <module>
    from allennlp.data.dataset_readers.dataset_reader import DatasetReader
  File "/content/sequential_sentence_classification/src/allennlp/allennlp/data/dataset_readers/__init__.py", line 10, in <module>
    from allennlp.data.dataset_readers.ccgbank import CcgBankDatasetReader
  File "/content/sequential_sentence_classification/src/allennlp/allennlp/data/dataset_readers/ccgbank.py", line 9, in <module>
    from allennlp.data.dataset_readers.dataset_reader import DatasetReader
  File "/content/sequential_sentence_classification/src/allennlp/allennlp/data/dataset_readers/dataset_reader.py", line 8, in <module>
    from allennlp.data.instance import Instance
  File "/content/sequential_sentence_classification/src/allennlp/allennlp/data/instance.py", line 3, in <module>
    from allennlp.data.fields.field import DataArray, Field
  File "/content/sequential_sentence_classification/src/allennlp/allennlp/data/fields/__init__.py", line 7, in <module>
    from allennlp.data.fields.array_field import ArrayField
  File "/content/sequential_sentence_classification/src/allennlp/allennlp/data/fields/array_field.py", line 10, in <module>
    class ArrayField(Field[numpy.ndarray]):
  File "/content/sequential_sentence_classification/src/allennlp/allennlp/data/fields/array_field.py", line 49, in ArrayField
    @overrides
  File "/usr/local/lib/python3.7/dist-packages/overrides/overrides.py", line 88, in overrides
    return _overrides(method, check_signature, check_at_runtime)
  File "/usr/local/lib/python3.7/dist-packages/overrides/overrides.py", line 114, in _overrides
    _validate_method(method, super_class, check_signature)
  File "/usr/local/lib/python3.7/dist-packages/overrides/overrides.py", line 135, in _validate_method
    ensure_signature_is_compatible(super_method, method, is_static)
  File "/usr/local/lib/python3.7/dist-packages/overrides/signature.py", line 93, in ensure_signature_is_compatible
    ensure_return_type_compatibility(super_type_hints, sub_type_hints, method_name)
  File "/usr/local/lib/python3.7/dist-packages/overrides/signature.py", line 288, in ensure_return_type_compatibility
    f"{method_name}: return type {sub_return} is not a {super_return}."
TypeError: ArrayField.empty_field: return type None is not a <class 'allennlp.data.fields.field.Field'>.

Benchmark models from the paper

Hi!

I am reproducing the models implemented in the "Pretrained Language Models for Sequential Sentence Classification" paper. It is really great work, and we have managed to implement your novel model successfully in TF. We are, however, unsure about the implementation details of the benchmark models, i.e. BERT + Transformer and BERT + Transformer + CRF. Could you be so kind as to clarify some questions about these models? I couldn't find them in this repo.

In the paper, you say:
“We compare our approach with two strong BERT-based baselines, finetuned for the task. The first baseline, BERT+Transformer, uses the [CLS] token to encode individual sentences as described in Devlin et al. (2018). We add an additional Transformer layer over the [CLS] vectors to contextualize the sentence representations over the entire sequence. The second baseline, BERT+Transformer+CRF, additionally adds a CRF layer. Both baselines split long lists of sentences into splits of length 30 using the method in §2 to fit into the GPU memory.”

Regarding the input shape for these benchmark models: are the sentences also concatenated together with [SEP] tokens, or are they fed separately to BERT, with the batch size being the number of sentences?

Regarding BERT+Transformer, could you explain how exactly it was implemented?
My guess is that you feed each sentence separately to BERT, extract all the [CLS] vectors from the batch, and then pass them through an encoder layer followed by dense and softmax layers. Is my reasoning correct? Was a decoder layer used as well? A rough sketch of my guess follows below.
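
To make the guess concrete (names, sizes, and the number of labels are placeholders, and this is my reading, not your code):

import torch.nn as nn
from transformers import BertModel

class BertTransformerBaselineSketch(nn.Module):
    # Guessed baseline: encode each sentence separately with BERT, take its
    # [CLS] vector, contextualize the [CLS] vectors with one Transformer
    # encoder layer, then classify each sentence.
    def __init__(self, hidden=768, num_labels=5):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        # input_ids: (num_sentences, seq_len), one row per sentence of the abstract
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0, :]    # (num_sentences, hidden)
        ctx = self.encoder(cls.unsqueeze(0))    # contextualize over the sentence sequence
        return self.classifier(ctx.squeeze(0))  # (num_sentences, num_labels) logits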

Regarding BERT+Transformer+CRF, where exactly did you put the CRF layer? Was it straight after the encoder layer, after the dense layer, or maybe after the softmax layer?

Kind regards,
Kacper Kubara
