GEM-benchmark / NL-Augmenter
NL-Augmenter 🦎 → 🐍 A Collaborative Repository of Natural Language Transformations
License: MIT License
Hi all,
If one runs the evaluate.py script against our transformation (#230), the results are very strange. The performance is too good, considering the dramatic changes made by our transformation.
Here is the performance of the model aychang/roberta-base-imdb on the test[:20%] split of the imdb dataset
The accuracy on this subset which has 1000 examples = 96.0
Applying transformation:
100%|██████████| 1000/1000 [00:19<00:00, 51.83it/s]
Finished transformation! 1000 examples generated from 1000 original examples, with 1000 successfully transformed and 0 unchanged (1.0 perturb rate)
Here is the performance of the model on the transformed set
The accuracy on this subset which has 1000 examples = 100.0
On the other hand, if we use non-default models, they produce reasonable results (kudos to @sotwi):
roberta-base-SST-2: 94.0 -> 51.0
bert-base-uncased-QQP: 92.0 -> 67.0
roberta-large-mnli: 91.0 -> 43.0
I speculate that the problem in the default test could be caused by some deficiency in the model aychang/roberta-base-imdb and/or the imdb dataset. But I'm not knowledgeable enough in the inner workings of the model to identify the source of the problem.
How to reproduce the strange results:
Get the writing_system_replacement transformation from #230.
cd to the NL-Augmenter dir.
Run this:
python3 evaluate.py -t WritingSystemReplacement
Expected results:
a massive drop in accuracy, similar to the results by @sotwi on non-default models, as mentioned above.
Observed results:
a perfect accuracy of 100.0.
These issues appear when trying to use this transformation outside of the root NL-Augmenter directory, for example in another sub-directory off the root directory. The fixes needed are the following:
grammaire.py
using the full import path: import transformations.insert_abbreviation.grammaire as grammaire
sys.path.append("./transformations/insert_abbreviation")
file = os.path.join(os.path.dirname(os.path.abspath(__file__)), '<file_name>.txt')
to get a handle on the current path relative to the transformation.py script file. This will allow easy access to the two .txt resource files.
Here is the stack trace when the EnglishInflectionalVariation class is initialised:
File "/Users/saad/Documents/Research Work/GEM/NL-Augmenter/transformations/english_inflectional_variation/__init__.py", line 1, in <module>
from .transformation import *
File "/Users/saad/Documents/Research Work/GEM/NL-Augmenter/transformations/english_inflectional_variation/transformation.py", line 1, in <module>
import random, lemminflect
File "/Users/saad/Documents/Research Work/GEM/NL-Augmenter/venv/lib/python3.9/site-packages/lemminflect/__init__.py", line 49, in <module>
spacy.tokens.Token.set_extension('inflect', method=Inflections().spacyGetInfl)
File "spacy/tokens/token.pyx", line 47, in spacy.tokens.token.Token.set_extension
ValueError: [E090] Extension 'inflect' already exists on Token. To overwrite the existing extension, set `force=True` on `Token.set_extension`.
Some of the transformations/filters use different spacy models (en, es, zh, de). The way they are loaded needs to be standardized. The function initialize_models in initialize.py needs to be rewritten to accommodate a language parameter, and the following transformations/filters should be updated.
Once the changes are done, test the modules individually with pytest using the command below:
pytest -s --t=<module_name>
Transformations:
Filters:
Originally requested by @AbinayaM02 in #126 (review)
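A minimal sketch of what a language-aware loader could look like. The signature, the SPACY_MODEL_NAMES table, and the loader parameter are assumptions made for illustration and do not reflect the current contents of initialize.py; the model names are spaCy's small pipelines for the four languages mentioned.

```python
from typing import Any, Callable, Dict

# Assumed mapping from language code to a small spaCy pipeline name.
SPACY_MODEL_NAMES: Dict[str, str] = {
    "en": "en_core_web_sm",
    "es": "es_core_news_sm",
    "zh": "zh_core_web_sm",
    "de": "de_core_news_sm",
}

# One loaded pipeline per language, shared by all transformations/filters.
_loaded_models: Dict[str, Any] = {}


def _load_with_spacy(model_name: str) -> Any:
    import spacy  # imported lazily so this module imports without spaCy

    return spacy.load(model_name)


def initialize_models(
    lang: str = "en", loader: Callable[[str], Any] = _load_with_spacy
) -> Any:
    """Return the pipeline for `lang`, loading it only on first use."""
    if lang not in SPACY_MODEL_NAMES:
        raise ValueError(f"No spaCy model configured for language {lang!r}")
    if lang not in _loaded_models:
        _loaded_models[lang] = loader(SPACY_MODEL_NAMES[lang])
    return _loaded_models[lang]
```

A transformation would then ask for initialize_models("de") in its constructor instead of calling spacy.load itself, so every module that needs the same language shares one pipeline.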
When running python evaluate.py -t ButterFingersPerturbation -task "TEXT_TO_TEXT_GENERATION" -p 1, the following error occurs:
Here is the performance of the model on the transformed set
Length of Evaluation dataset is 226
Traceback (most recent call last):
File "evaluate.py", line 67, in <module>
if_filter
File "./NL-Augmenter/evaluation/evaluation_engine.py", line 41, in evaluate
percentage_of_examples=percentage_of_examples,
File "./NL-Augmenter/evaluation/evaluation_engine.py", line 115, in execute_model
split=f"test[:{percentage_of_examples}%]",
File "./NL-Augmenter/evaluation/evaluate_text_generation.py", line 44, in evaluate
dataset, summarization_pipeline, transformation=operation
File "./NL-Augmenter/evaluation/evaluate_text_generation.py", line 70, in transformation_performance
pt_dataset, summarization_pipeline
File "./NL-Augmenter/evaluation/evaluate_text_generation.py", line 81, in performance_on_dataset
article, gold_summary = example
File "./NL-Augmenter/dataset.py", line 301, in <genexpr>
yield (datapoint[field] for field in self.fields)
TypeError: string indices must be integers
This line is breaking the package.
self.nlp = ... should go under the __init__() method.
How can we detect which language is used for the evaluation on the fly?
We want to apply the correct transformation in "generate" on the fly according to the current language...
Thanks in advance
It seems Spacy's tokenizer behaves differently when I run pytest -s --t=emojify and pytest -s --t=light --f=light.
For example, I added the following snippet in my generate() function:
print([str(t) for t in self.nlp(sentence)])
With input sentence "Apple is looking at buying U.K. startup for $132 billion."
pytest -s --t=emojify
gives:
['Apple', 'is', 'looking', 'at', 'buying', 'U.K.', 'startup', 'for', '$', '132', 'billion', '.']
However, pytest -s --t=light --f=light
gives:
['Apple', 'is', 'looking', 'at', 'buying', 'U.K.', 'startup', 'for', '$1', '32', 'billion.']
I use the following code to load spacy:
import spacy
from initialize import spacy_nlp
self.nlp = spacy_nlp if spacy_nlp else spacy.load("en_core_web_sm")
It looks very strange. Am I overlooking something?
Hi @raft001,
It seems that in addition to issue #310 there are two other issues that need addressing:
noun_pairs.json is missing. This is needed on line 17.
file = os.path.join(os.path.dirname(os.path.abspath(__file__)), '<file_name>.json')
Then current_path can be used as the absolute path to your resource files.
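The path pattern above can be wrapped in a small helper. This is only a sketch: resource_path and load_json_resource are hypothetical names, not functions that exist in NL-Augmenter.

```python
import json
import os


def resource_path(file_name: str, anchor_file: str) -> str:
    """Absolute path of `file_name` in the same directory as `anchor_file`."""
    return os.path.join(os.path.dirname(os.path.abspath(anchor_file)), file_name)


def load_json_resource(file_name: str, anchor_file: str):
    # Resolved against the script's own location, so the result does not
    # depend on the directory the process was started from.
    with open(resource_path(file_name, anchor_file), encoding="utf-8") as handle:
        return json.load(handle)
```

Inside a transformation.py this would be called as load_json_resource("noun_pairs.json", __file__).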
It does not appear possible to load the NumberToWord transformation after installing nlaugmenter.
This is likely due to number-to-word breaking Python's path loading.
The module number-to-word should be changed to number_to_word.
Solution:
number-to-word to number_to_word
number_to_word in the test/mapper.py file in the appropriate dictionary (either heavy or light transformation depending on the flag heavy)
pytest -s --t=number_to_word
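The reason a hyphenated directory breaks imports is that Python reads the hyphens in an import statement as minus signs, so the name is not a valid identifier. A quick check (is_importable_name is a hypothetical helper written for this illustration):

```python
import keyword


def is_importable_name(name: str) -> bool:
    """True if `name` can be used in a plain `import` statement."""
    return name.isidentifier() and not keyword.iskeyword(name)


for module_name in ("number-to-word", "number_to_word"):
    print(module_name, is_importable_name(module_name))
# number-to-word False
# number_to_word True
```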
When running the disability_transformation there are several unresolved references to the spacy_nlp variable. In particular on lines:
doc = nlp(text)
spacy_nlp if spacy_nlp else spacy.load("en_core_web_sm")
Hi there,
Just wondering - is there any reason spacy is locked to the old version spacy==2.2.4 in the main requirements.txt?
Spacy 3.0 was quite a big upgrade from 2.2.4, and 3.1.0 was just released today, so it might make sense to look forward and make that a requirement instead.
I don't think any current implementations would break by this upgrade but I'm happy to make a PR for it and fix things if needed.
The tense transformation requires a library called pattern. The library is forked, and the forked version is used in requirements, which causes pip to fail. To keep pip from failing, change the requirements to the actual pip package.
Our transformation packages so far have all been in snake_case. @raft001, I believe this is your transformation. Could you change this, please?
I just tried your transformation on some summarization inputs and it returns nothing there. I believe there may be an issue if a document does not include any entities.
Not sure why the module is failing while building. I am using a direct install from git for one library (see the first line of requirements.txt).
It works locally and also on multiple GCS VMs (where I did the development).
reference:
https://github.com/GEM-benchmark/NL-Augmenter/pull/113/checks?check_run_id=3040556102
I've tried running the evaluate.py script in this Colab notebook. I get the following error:
OSError: /usr/local/lib/python3.7/dist-packages/torchtext/_torchtext.so: undefined symbol: _ZNK3c104Type14isSubtypeOfExtESt10shared_ptrIS0_EPSo
The mapper file needs to be updated with the transformation/filter names so that default installation and pytest happens for all the light transformations/filters.
It looks like the system used to build the image is RHEL-based, hence the pycurl installation is failing. Please refer to the solution in this link and install the libraries accordingly.
fail build: https://github.com/GEM-benchmark/NL-Augmenter/pull/113/checks?check_run_id=3026110664
reference link: https://stackoverflow.com/questions/66419978/could-not-install-pycurl-7-43-0-6-on-python-3-8-8-rhel-8-3
Almost all transformations, for example butter_fingers_perturbation or replace_numerical_values, use a seed in their constructor that is set to some value. How are we going to handle the global seed? We could easily set one in initialize.py that gets imported in each transformation and set that as the default, similar to what is currently done for spacy_nlp. Otherwise, we can also set it during evaluation; as far as I could tell that is not currently done, but I think having a global default is a little cleaner.
Happy to make the required changes if that's something we'd want.
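A rough sketch of the global-default idea. GLOBAL_SEED and ExampleTransformation are hypothetical names used only for illustration; in the proposal the constant would live in initialize.py and be imported by each transformation.

```python
import random
from typing import Optional

# Hypothetical shared default; in the proposal this would sit in initialize.py.
GLOBAL_SEED = 0


class ExampleTransformation:
    """Stand-in for a transformation; not an actual NL-Augmenter class."""

    def __init__(self, seed: Optional[int] = None):
        # Fall back to the shared default unless the caller overrides it.
        self.seed = GLOBAL_SEED if seed is None else seed

    def generate(self, sentence: str) -> str:
        rng = random.Random(self.seed)  # local RNG, leaves global state alone
        words = sentence.split()
        rng.shuffle(words)
        return " ".join(words)
```

Using None as the constructor default keeps existing call sites that pass an explicit seed working unchanged.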
I think there might be something broken with the filter tests, at least when I extended the test.json of the TextContainsKeywordsFilter to contain another test case:
{
"type": "keywords",
"test_cases": [
{
"class": "TextContainsKeywordsFilter",
"args": {
"keywords": ["in", "at"]
},
"inputs": {
"sentence": "Andrew played cricket in India"
},
"outputs": true
},
{
"class": "TextContainsKeywordsFilter",
"args": {
"keywords": ["sad"]
},
"inputs": {
"sentence": "Andrew played cricket in India"
},
"outputs": false
}
]
}
And then ran: pytest -s --f=keywords
It fails the test, although from my understanding it should still work properly. In particular, after printing self.keywords in the filter method, it seems like there is no new instance created for the new test case and the old keywords are still used, which causes the second test case to fail.
Am I misusing something here? I ran into this when writing the tests for my addition of a filter.
When running the gender_neutral_rewrite there are several unresolved references to the spacy_nlp variable. In particular on line:
self.nlp = spacy_nlp if spacy_nlp else spacy.load("en_core_web_sm")
Please use from initialize import spacy_nlp to get a handle on the global spacy instance.
There is also an unresolved reference on Line 495: def generate(self, sentence: str) -> List[str]. List[str] is not resolvable. Should this be lower case? e.g. list[str]
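For reference, the capitalised List only resolves once it is imported from typing, while the lower-case list[str] form is accepted only on Python 3.9 and later. A minimal sketch (the function body is a placeholder, not the transformation's logic):

```python
from typing import List  # makes the `List[str]` annotation resolvable


def generate(sentence: str) -> List[str]:
    # Placeholder body: returns the input unchanged as a one-element list.
    return [sentence]
```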
from nlaugmenter.transformations.formality_change.transformation import Formal2Casual
OSError: prithivida/parrot_adequacy_on_BART is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login` and pass `use_auth_token=True`.
The model (prithivida/parrot_adequacy_on_BART) is indeed not available on huggingface anymore. Perhaps an acceptable alternative is to use prithivida/parrot_adequacy_model instead?
This issue concerns the following line in the main test script:
NL-Augmenter/test/test_main.py
Line 26 in 27ab1d7
The zip() builtin (which is used in the above-mentioned line to pair up expected sentences with generated sentences) clips the longer of its two input iterables to the length of the shorter iterable. E.g.:
>>> list(zip([1,2,3], [6,7,8,9,10]))
[(1, 6), (2, 7), (3, 8)]
This means that even if a transformation generates fewer sentences (e.g. 0) than the expected number of sentences, it will still pass and the later expected sentences will not get evaluated. This also makes it impossible to test affirmatively that a transformation does not generate any outputs for a given input.
I would recommend either asserting that the two iterables are of equal length, or replacing zip() with zip_longest().
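Both options can be sketched as follows. The function names are hypothetical and the comparison is reduced to plain equality; the actual check in test_main.py may differ.

```python
from itertools import zip_longest
from typing import List


def assert_outputs_match(expected: List[str], generated: List[str]) -> None:
    # Option 1: fail fast when the number of outputs differs.
    assert len(expected) == len(generated), (
        f"expected {len(expected)} outputs, got {len(generated)}"
    )
    for exp, gen in zip(expected, generated):
        assert exp == gen


def assert_outputs_match_longest(expected: List[str], generated: List[str]) -> None:
    # Option 2: zip_longest pads the shorter side with None, so a missing
    # output is compared against None and the comparison fails.
    for exp, gen in zip_longest(expected, generated):
        assert exp == gen
```

With either variant, an expected list of [] passes only when the transformation really generates nothing.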
The ocr_perturbation package requires trdg==1.6.0. However, under macOS 11.6 with Python 3.9 it will not install due to a dependency on pillow==7.0.0, which generates a RequiredDependencyException: zlib error.
Installing pillow==8.3.2 works fine but is too new for trdg==1.6.0.
Installing trdg==1.7.0 has a dependency conflict with opencv-python:
ERROR: Cannot install opencv-python==4.5.3.56, trdg and trdg==1.7.0 because these package versions have conflicting dependencies.
The conflict is caused by:
trdg 1.7.0 depends on numpy<1.17 and >=1.16.4
opencv-python 4.5.3.56 depends on numpy>=1.19.3
trdg 1.7.0 depends on numpy<1.17 and >=1.16.4
opencv-python 4.5.2.54 depends on numpy>=1.19.3
trdg 1.7.0 depends on numpy<1.17 and >=1.16.4
opencv-python 4.5.2.52 depends on numpy>=1.19.3
trdg 1.7.0 depends on numpy<1.17 and >=1.16.4
opencv-python 4.5.1.48 depends on numpy>=1.19.3
trdg 1.7.0 depends on numpy<1.17 and >=1.16.4
opencv-python 4.4.0.46 depends on numpy>=1.19.3
trdg 1.7.0 depends on numpy<1.17 and >=1.16.4
opencv-python 4.4.0.42 depends on numpy>=1.17.3
trdg 1.7.0 depends on numpy<1.17 and >=1.16.4
opencv-python 4.4.0.40 depends on numpy>=1.17.3
trdg 1.7.0 depends on numpy<1.17 and >=1.16.4
opencv-python 4.3.0.38 depends on numpy>=1.17.3
When run it throws the following error messages:
/Users/saad/Documents/Research Work/GEM/NL-Augmenter/transformations/sentiment_emoji_augmenter/transformation.py:103: SyntaxWarning: "is" with a literal. Did you mean "=="?
if sentiment is "pos":
/Users/saad/Documents/Research Work/GEM/NL-Augmenter/transformations/sentiment_emoji_augmenter/transformation.py:106: SyntaxWarning: "is" with a literal. Did you mean "=="?
elif sentiment is "neg":
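The warning goes away when the comparison uses == instead of is. A minimal sketch of the corrected branch structure (pick_emoji and its return values are hypothetical, not the transformation's code):

```python
def pick_emoji(sentiment: str) -> str:
    # `==` compares values; `is` compares object identity and only appears
    # to work for strings when the interpreter happens to intern them.
    if sentiment == "pos":
        return ":)"
    elif sentiment == "neg":
        return ":("
    return ""
```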
These issues appear when trying to use this transformation outside of the root NL-Augmenter directory, for example in another sub-directory off the root directory. The fixes needed are the following:
spell_corrections = os.path.join(
"transformations", "correct_common_misspellings", "spell_corrections.json"
)
file = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'spell_corrections.json')
to get a handle on the current path relative to the transformation.py script file.
Here are some random ideas, informally put, which could be used for perturbations and augmentations. @vgtomahawk is making a formal list in this branch.
Meanwhile here is an informal list for the benefit of the participants.
Interchange positions of SRL AM arguments for non-overlapping AM arguments:
The ButterFingersPerturbation could be implemented for keyboard types other than English, like Devanagari (Hindi, Marathi, Nepali), Shahmukhi (Urdu, Persian), South Indian languages (Tamil, Telugu, Kannada, Malayalam) or Chinese, etc.
Style transfer approaches could be interesting to look at - Changing formal to informal and vice versa. Check this model.
The above are only related to SentenceOperation. There are other transformation types too which could be looked at.
Hi All,
I am using a Mac OS for my project so I am running into an issue when trying to evaluate my transformations. As I do not have Nvidia GPUs, I would like to use the CPU when working with PyTorch otherwise I would get an "AssertionError: Torch not compiled with CUDA enabled".
Mac OS users that do not have Nvidia GPU will have to set device = -1 to not use GPU:
MacOS: "AssertionError: Torch not compiled with CUDA enabled"
allenai/allennlp#877
This seems to stem from the fact that there is currently no way to change the is_cuda flag, which is set to True by default in the evaluate() method inside evaluate_text_classification.py, to False. (There is code to set the device to 0 or -1 based on the is_cuda flag.)
I am able to run my evaluations by changing the is_cuda flag in the code. It would probably be better to make it an argument, so that future users who want to use CPUs instead of GPUs can do so when running python evaluate.py -t [transformation] -task [task_type]
I will be happy to make the required changes if that's something we'd want.
Thanks,
Tim
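One possible shape for such an argument, as a sketch only: the --no-cuda flag and the helper names are hypothetical and are not existing evaluate.py options. The 0 / -1 values follow the device convention described above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Hypothetical flag; not an existing evaluate.py option.
    parser.add_argument("--no-cuda", action="store_true", help="run on CPU")
    return parser


def device_index(no_cuda: bool) -> int:
    # -1 selects the CPU, 0 selects the first GPU.
    return -1 if no_cuda else 0


args = build_parser().parse_args(["--no-cuda"])
print(device_index(args.no_cuda))  # -1
```

Leaving the flag off keeps the current GPU default, so existing invocations would behave as before.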
Thank you for your great work! It's super useful!
I have a suggestion for improvement -
Some transformations work on a "swap" principle. For example, in GenderSwap, if we had "sister" in the original sentence then it would be transformed to "brother" and vice versa.
There are scenarios where it's important to know which direction the transformation went, female to male or male to female. In my case, for example, I want to compare the performance of my model on female/male sentences at inference time.
I really liked the way TenseTransformation works. You need to specify in the constructor what tense (past/present/future) you want to transform to.
Maybe that could be applicable for other swap transformations?
Thanks again!
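To make the suggestion concrete, here is a sketch of a swap with an explicit target direction chosen in the constructor. DirectedGenderSwap and its two-entry lexicon are hypothetical and much simpler than the real GenderSwap word lists.

```python
from typing import Dict

# Tiny illustrative lexicon; the real word lists are not reproduced here.
FEMALE_TO_MALE: Dict[str, str] = {"sister": "brother", "mother": "father"}
MALE_TO_FEMALE: Dict[str, str] = {m: f for f, m in FEMALE_TO_MALE.items()}


class DirectedGenderSwap:
    """Hypothetical variant of a swap transformation with a fixed direction."""

    def __init__(self, target: str):
        if target not in ("male", "female"):
            raise ValueError("target must be 'male' or 'female'")
        self.mapping = FEMALE_TO_MALE if target == "male" else MALE_TO_FEMALE

    def generate(self, sentence: str) -> str:
        # Only words pointing away from the target are rewritten.
        return " ".join(self.mapping.get(w, w) for w in sentence.split())
```

With target="male", "sister" becomes "brother" while an existing "brother" is left alone, so the direction of every change is known.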
speech-tag to speech_tag
token-amount to token_amount
Setting the random seed for each word leads to jumpy behavior, like the sentence not being transformed at all. The random seed should be set for the whole sentence (outside of the for loop).
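A sketch of the difference, using a hypothetical perturb function that upper-cases words at random instead of the transformation's real edit:

```python
import random


def perturb(sentence: str, prob: float = 0.5, seed: int = 0) -> str:
    random.seed(seed)  # seed once per sentence, outside the loop
    out = []
    for word in sentence.split():
        # Re-seeding here would make random.random() return the same value
        # for every word, so either all words change or none do.
        out.append(word.upper() if random.random() < prob else word)
    return " ".join(out)
```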
Hi everyone, I'm the original author of the STRAP paraphrasers (paper link) which were recently accepted to NL-Augmenter (#227), an effort led by @Filco306. Excited to see these models in NL-Augmenter!
After discussing with @Filco306 and seeing the PR, I saw that 6 different variants of the paraphraser have been provided, a "Basic" style agnostic paraphraser as well as five style-specific paraphrasers (link). While the "Basic" paraphraser is implemented fine, for the style-specific paraphrasers it's recommended to use a two-step pipelined process ---
(1) normalize the text using the "Basic" paraphraser;
(2) pass the output from (1) through the style-specific paraphraser.
This is important since all style-specific paraphrasers were trained on the outputs of "Basic", so any other text is technically out-of-distribution. In an ablation study (-Inf PP. in Table 3 of the paper) we saw a significant drop in style transfer performance without this step. Moreover, the two-step process helps boost output diversity since the "Basic" paraphraser strips input style. This should be fairly simple to implement.
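The two-step process amounts to composing two paraphrasers. A minimal sketch, assuming each paraphraser can be treated as a function from string to string (the Paraphraser alias and two_step_style_transfer are hypothetical names, not the STRAP or NL-Augmenter API):

```python
from typing import Callable

Paraphraser = Callable[[str], str]


def two_step_style_transfer(text: str, basic: Paraphraser, styled: Paraphraser) -> str:
    normalized = basic(text)   # step 1: strip the input style
    return styled(normalized)  # step 2: apply the target style
```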
Another minor point is that the models are fully compatible with the new HuggingFace generate(...) APIs, which provide additional functionality compared to what was originally implemented in my repository (in other words, this import can be avoided). Here's an example of how to do it:
out = gpt2.generate(
input_ids=gpt2_sentences[:, 0:init_context_size],
max_length=gpt2_sentences.shape[1],
return_dict_in_generate=True,
eos_token_id=eos_token_id,
output_scores=True,
do_sample=top_k > 0 or top_p > 0.0,
top_k=top_k,
top_p=top_p,
temperature=temperature,
num_beams=beam_size,
token_type_ids=segments[:, 0:init_context_size]
)
Also CCing the NL-Augmenter reviewers for the style paraphraser to keep them in the loop --- @sebastianGehrmann @Nickeilf @juand-r @kaustubhdhole
Hi @raft001
noun_pairs.json appears in the outermost folder.
https://github.com/GEM-benchmark/NL-Augmenter/blob/main/noun_pairs.json
This needs to be removed and checked if all works fine without it.
When running this transformation there are several unresolved references to the spacy_nlp variable. In particular on line:
self.nlp = spacy_nlp if spacy_nlp else spacy.load("en_core_web_sm", disable=['ner','textcat'])
Please use from initialize import spacy_nlp to get a handle on the global spacy instance.
Many transformations load spacy multiple times and reparse the same utterance. We will need a mechanism to load spacy once and parse once or at least cache the parse for a string so that when running all transformations together, there is no repetition of parsing.
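One possible caching mechanism is to wrap the shared pipeline in functools.lru_cache. This is a sketch under that assumption; make_cached_parser is a hypothetical helper, and nlp stands for whatever callable parses a string.

```python
from functools import lru_cache
from typing import Any, Callable


def make_cached_parser(nlp: Callable[[str], Any], maxsize: int = 1024) -> Callable[[str], Any]:
    """Wrap a parser so that each distinct string is parsed only once."""

    @lru_cache(maxsize=maxsize)
    def parse(text: str) -> Any:
        return nlp(text)

    return parse
```

A caveat with this design is that the cached parse objects are shared between callers, so transformations would need to treat them as read-only.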
The p1_noun_transformation relies on wptools as a dependency. However, wptools depends on pycurl. Unfortunately, pycurl keeps throwing the following message when used:
File "/Users/saad/Documents/Research Work/GEM/NL-Augmenter/transformations/p1_noun_transformation/__init__.py", line 1, in <module>
from .transformation import *
File "/Users/saad/Documents/Research Work/GEM/NL-Augmenter/transformations/p1_noun_transformation/transformation.py", line 9, in <module>
import wptools
File "/Users/saad/Documents/Research Work/GEM/NL-Augmenter/venv/lib/python3.9/site-packages/wptools/__init__.py", line 23, in <module>
from . import core
File "/Users/saad/Documents/Research Work/GEM/NL-Augmenter/venv/lib/python3.9/site-packages/wptools/core.py", line 14, in <module>
from . import request
File "/Users/saad/Documents/Research Work/GEM/NL-Augmenter/venv/lib/python3.9/site-packages/wptools/request.py", line 17, in <module>
import pycurl
ImportError: pycurl: libcurl link-time ssl backends (secure-transport) do not include compile-time ssl backend (openssl)
There should probably be another label called "filter" to quickly check in the PRs which transformations/filters have already been implemented. Both of my PRs are filters and should therefore not have a transformation label.
Hi @Filco306
Thank you for your great work to make the powerful paraphrasing model easily accessible through HuggingFace! Now it is much easier for me to work with it without the hassle of handling complicated dependencies!
But is there any way for us to use a larger batch size and more GPUs to accelerate the paraphrasing process? Right now I can use only one GPU and a small batch size. I read your implementation here, but there does not seem to be an easy way to do either of them.
Thank you. I am looking forward to your reply.
Hi,
While running the evaluate method (for #246), I get an error in my re.sub method for one of the tests, most likely due to a problem with the escape characters. I can replace it with string.replace to solve the problem. However, this branch is already merged. Do you suggest creating a new branch or leaving the corresponding eval columns empty?
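If the failure does come from special characters being read as regex syntax (an assumption, since the failing pattern is not shown), re.sub can be kept by escaping the pattern and passing the replacement as a function. replace_literal is a hypothetical helper:

```python
import re


def replace_literal(text: str, old: str, new: str) -> str:
    # re.escape stops characters such as "(" or "+" in `old` from being read
    # as regex syntax; the lambda stops backslashes in `new` from being read
    # as group references or escapes.
    return re.sub(re.escape(old), lambda _match: new, text)


print(replace_literal("a (b) c", "(b)", r"\d"))  # a \d c
```

For purely literal substitutions this behaves like str.replace, which is the simpler fix mentioned above.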
When adjusting the tests for #146 I noticed that I almost never needed to adjust the first test case in each test.json, but did need to adjust all the others. It almost felt as if the first one was being skipped, since it is so unlikely that all other test cases needed slight adjustments but the first one always perfectly matched. Can someone quickly check if everything works as intended there? Could very well be chance as well but just to make sure.
codespell --ignore-words-list="fro,ist,oder"
./dataset.py:122: relavent ==> relevant
./dataset.py:143: hierachy ==> hierarchy
./notebooks/Write_a_sample_transformation.ipynb:1442: tht ==> the, that
./notebooks/Write_a_sample_transformation.ipynb:1718: exisiting ==> existing
./evaluation/evaluate_text_generation.py:84: upto ==> up to
./transformations/change_two_way_ne/README.md:11: implemetation ==> implementation
Set the heavy flag to True and add them to the test/mapper.py.
Transformations:
neural_question_paraphraser
mr_value_replacement
protaugment_diverse_paraphrase
Filters:
oscillatory_hallucination
Hello!
First of all, thanks for the effort to build such a collaborative framework!
At the moment, the augmentation methods and filters are only provided with a single example per call. Since there are many techniques that need the whole dataset with the class information (to be conditioned on the class, to interpolate instances, etc.), I wanted to ask if there are plans to add this to this framework?
In the __init__ on line 42:
device='cuda'
This transformation always assumes that a CUDA device is available. However, it should first check whether a CUDA device is available using this helper function from PyTorch: https://pytorch.org/docs/stable/generated/torch.cuda.is_available.html
If not, then CPU should be used instead.
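A minimal sketch of that check. select_device is a hypothetical helper; the fallback for a missing PyTorch installation is an extra assumption added so the snippet runs anywhere.

```python
def select_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch at all: nothing to run on the GPU
    return "cuda" if torch.cuda.is_available() else "cpu"


device = select_device()
```

The hard-coded device='cuda' in __init__ would then become device=select_device().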
The URL in https://github.com/GEM-benchmark/NL-Augmenter/tree/main/filters/lang#related-work is incorrect and points to a 404 link.
Can we add nltk.download statements to the code? Should we handle possible exceptions?
It should not disrupt the driver, but just asking to make sure.
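One way to handle the exceptions is to download only when the resource is missing and to report failures instead of raising. This is a sketch: ensure_nltk_resource is a hypothetical helper, and the find and download parameters exist only so it can be exercised without NLTK or a network connection.

```python
from typing import Callable, Optional


def ensure_nltk_resource(
    resource_path: str,
    package: str,
    find: Optional[Callable] = None,
    download: Optional[Callable] = None,
) -> bool:
    """Download `package` only if `resource_path` is missing; never raise."""
    if find is None or download is None:
        import nltk  # imported lazily; only needed for the real lookups

        find = find or nltk.data.find
        download = download or nltk.download
    try:
        find(resource_path)  # raises LookupError when the resource is absent
        return True
    except LookupError:
        pass
    try:
        return bool(download(package, quiet=True))
    except Exception as error:  # e.g. no network access
        print(f"Could not download {package}: {error}")
        return False
```

A transformation could call ensure_nltk_resource("corpora/wordnet", "wordnet") in its constructor and decide for itself what to do when False comes back.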
Hi, I just found an OS error in the PRs' workflow.
Collecting huggingface-hub<0.1.0
Downloading huggingface_hub-0.0.8-py3-none-any.whl (34 kB)
Collecting sacremoses
Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Collecting tokenizers<0.11,>=0.10.1
Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting filelock
Downloading filelock-3.0.12-py3-none-any.whl (7.6 kB)
ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/opt/hostedtoolcache/Python/3.7.10/x64/lib/python3.7/site-packages/importlib_metadata-4.6.0.dist-info/METADATA'
Probably, somebody has some idea about this error that occurred in many PRs recently.
Thanks!