
kennethenevoldsen / augmenty


Augmenty is an augmentation library based on spaCy for augmenting texts.

Home Page: https://kennethenevoldsen.github.io/augmenty/

License: MIT License

Languages: Python 91.69%, TeX 7.68%, Makefile 0.63%

Topics: augmentation, natural-language-processing, nlp, nlproc, python, spacy, spacy-extension, spacy-nlp, text-augmentation, text-classification, training-data

augmenty's Introduction

Kenneth Enevoldsen

Researcher, scholar, teacher

 kennethcenevoldsen


Projects

The following are projects I am actively maintaining or contributing to. More may have been added since this list was written.

Name Description
MTEB The Massive Text Embedding Benchmark for evaluating document embeddings, e.g. for RAG systems.
Scandinavian Embedding Benchmark A Scandinavian benchmark for evaluating document embeddings.
DaCy The state-of-the-art Danish NLP pipeline for spaCy.
tomsup Theory of Mind Simulation using Python; a package that allows for easy agent-based modelling of recursive Theory of Mind agents.
Augmenty A structured augmentation library for augmenting both texts and annotations.
TextDescriptives A Python library for calculating a large variety of metrics from text.
timeseriesflattener A package for converting irregularly spaced time series, such as electronic health records, into statically shaped data frames.
Asent An educational library for performing transparent sentiment analysis.
ScandEval An evaluation benchmark for Scandinavian and Germanic language models, evaluating natural language understanding and generation.
swift-python-cookiecutter The cookiecutter template I actively use for my packages.
UD_Danish-DDT The Danish Universal Dependencies Treebank, a high-quality linguistic resource.

Contributions:

A selection of contributions to open-source libraries, besides those I actively maintain.

Library Contribution
Hugging Face libraries:
datasets Fixes for a minor compatibility issue with numpy >= 2.0.0
transformers Bug fixes for training masked language models using Flax
spaCy core libraries:
spacy-transformers Allow passing arguments to the transformer backend to obtain attention weights
confection Fixed an issue where configs could not be filled
spacy-curated-transformers Added support for ELECTRA tokenizers
curated-transformers Added ELECTRA

augmenty's People

Contributors

actions-user, arfon, dependabot[bot], eford36, kennethenevoldsen, koaning, pre-commit-ci[bot], qeterme


augmenty's Issues

List of potentially new augmenters

The following is a list of potentially new augmenters. If you wish for a specific augmenter to be added before others, please upvote the issue corresponding to the augmenter (if it doesn't have one, feel free to create it).

A variation of existing augmenters:

New augmenters:

Batch augmenters:

A combination of existing augmenters:

  • EDA augmenter following the EDA paper (Wei and Zou, 2019); a rough sketch follows below
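
A rough sketch of approximating EDA with augmenters already in the registry (synonym replacement and token swap; random insertion and deletion are still missing). The augmenty.combine chaining and the argument values here are assumptions, not a finished design:

import augmenty

# approximate two of the four EDA operations with existing augmenters
synonym = augmenty.load("wordnet_synonym.v1", level=0.1, lang="en")
swap = augmenty.load("token_swap.v1", level=0.1)
eda_like = augmenty.combine([synonym, swap])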

More robust tests

Apply all augmenters with a range of arguments in a grid-like fashion to the books.

Spacy 3.5 support

I'm using spaCy 3.5 with the new knowledge-base API. Is there any plan to bump the spaCy version supported by this library?

Back translation augmentation

Augmenting a document using back-translation through various languages, e.g. using Hugging Face translation models: https://huggingface.co/models?pipeline_tag=translation.

Example blog: https://dzlab.github.io/dltips/en/pytorch/text-augmentation/

Example sentence:
Augmenty is an augmentation library based on spaCy for augmenting texts. Augmenty differs from other augmentation libraries in that it corrects (as far as possible) the token, sentence and document labels under the augmentation.

English -> Danish (Google):
Augmenty er et udvidelsesbibliotek baseret på spaCy til forstørrelse af tekster. Augmenty adskiller sig fra andre augmentationsbiblioteker ved, at den korrigerer (så vidt muligt) token-, sætnings- og dokumentetiketterne under augmentationen.

Danish -> English (Google):
Augmenty is an extension library based on spaCy for enlarging texts. Augmenty differs from other augmentation libraries in that it corrects (as far as possible) the token, sentence, and document labels during augmentation.
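
A minimal sketch of such an augmentation using Hugging Face pipelines. The Helsinki-NLP checkpoints are real models, but the wrapper function is hypothetical and does not preserve token-level annotations, which is the hard part for augmenty:

from transformers import pipeline

en_to_da = pipeline("translation", model="Helsinki-NLP/opus-mt-en-da")
da_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-da-en")

def back_translate(text: str) -> str:
    # translate English -> Danish -> English to obtain a paraphrase
    danish = en_to_da(text)[0]["translation_text"]
    return da_to_en(danish)[0]["translation_text"]

print(back_translate("Augmenty is an augmentation library based on spaCy for augmenting texts."))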

IndexError: list index out of range for the documents reconstructed from DocBin

IndexError: list index out of range for documents reconstructed from a DocBin without dependency annotations, in a kedro pipeline.

.../lib/python3.9/site-packages/augmenty/span/entities.py:56 in ent_augmenter_v1          
     53 │   │   │   tok_anno["POS"][i] = ["PROPN"] * len_ent 
     54 │   │   │                                                                             
     55 │   │   │   tok_anno["MORPH"][i] = [""] * len_ent     
❱  56 │   │   │   tok_anno["DEP"][i] = [tok_anno["DEP"][i][0]] + ["flat"] * (len_ent - 1) 
     57 │   │   │    
     58 │   │   │   tok_anno["SENT_START"][i] = [tok_anno["SENT_START"][i][0]] + [0] * (    
     59 │   │   │   │   len_ent - 1   

The augmenter is defined as:

augmenter = augmenty.load(
    "ents_replace_v1",
    level=level,
    ent_dict={"pers": [[s] for s in ents_as_str]},
    replace_consistency=True,  # True or False doesn't change the behaviour
    resolve_dependencies=True,
)
repeated_augmenter = augmenty.repeat(augmenter=augmenter, n=n_repeat)
augmented_docs = augmenty.docs(docs, repeated_augmenter, model)

Your Environment

  • augmenty Version Used: 1.3.2
  • spaCy version: 3.5.1
  • Platform: Linux-5.15.0-69-generic-x86_64-with-glibc2.31
  • Python version: 3.9.16
  • Pipelines: en_core_anno_floret (3.5.0) # Custom model

Word embedding augmentation not registered

How to reproduce the behaviour

import spacy
import augmenty
nlp = spacy.load('en_core_web_lg')
danish_wordemb = augmenty.load("word_embedding.v1", nlp=nlp)

Error message:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/python3.9/site-packages/augmenty/util.py", line 114, in load
    aug = spacy.registry.augmenters.get(augmenter)
  File "/python3.9/site-packages/catalogue/__init__.py", line 96, in get
    raise RegistryError(err.format(name, current_namespace, available))
catalogue.RegistryError: Can't find 'word_embedding.v1' in registry spacy -> augmenters. Available names: char_replace.v1, char_replace_random.v1, char_swap.v1, conditional_token_casing.v1, da_historical_noun_casing.v1, da_æøå_replace.v1, ents_format.v1, ents_replace.v1, grundtvigian_spacing_augmenter.v1, keystroke_error.v1, per_replace.v1, random_casing.v1, random_starting_case.v1, remove_spacing.v1, spacing_insertion.v1, spacy.lower_case.v1, spacy.orth_variants.v1, spongebob.v1, token_replace.v1, token_swap.v1, upper_case.v1, wordnet_synonym.v1
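
One way to check which augmenters the installed version actually registers before loading one (a sketch; it relies on importing augmenty having filled spaCy's augmenter registry, as the traceback above suggests):

import augmenty  # importing augmenty registers its augmenters with spaCy
import spacy

# print the names of all augmenters currently in the registry
for name in sorted(spacy.registry.augmenters.get_all()):
    print(name)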

Your Environment

  • augmenty Version Used: 0.0.7
  • Operating System: Mac OS / Windows
  • Python Version Used: 3.6.5 & 3.9.5
  • spaCy Version Used: spacy==3.1.1
  • NLTK Version Used: nltk==3.6.2
  • Environment Information: conda environment

Current entity augmenters do not handle entity 'links'

Current entity augmenters do not handle entity links as one would use for entity linking.

Ideally, entity formatters should keep the same link, while entity replacers could potentially add a new link.

If you don't care about the links annotation, it can be removed using an augmenter such as:

from typing import Any, Callable, Iterator
from spacy.language import Language
from spacy.training import Example

def create_remove_links_augmenter() -> Callable[..., Iterator[Any]]:
    def remove_links(nlp: Language, example: Example) -> Iterator[Example]:
        example_dict = example.to_dict()
        example_dict["doc_annotation"].pop("links")  # drop the entity-link annotation
        yield Example.from_dict(example.y, example_dict)
    return remove_links
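
Hypothetical usage, chaining it with another augmenter (this assumes augmenty.combine chains augmenters as documented; the upper_case.v1 augmenter and level value are just illustrative):

import augmenty

remove_links = create_remove_links_augmenter()
upper = augmenty.load("upper_case.v1", level=0.5)
combined = augmenty.combine([remove_links, upper])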

implement an oversampling function

Augmentation can be used to oversample a category.

Imagined usage would look something like this:

aug = augmenty.load(...)

def is_positive(example):
    """Return True if the example is labelled positive."""
    return example.y.cats["positive"] == 1

upsampled_corpus = augmenty.oversample(corpus, augmenter=aug, conditional=is_positive, n=1000)
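
A minimal sketch of what such an oversample utility could look like (hypothetical; it is not part of augmenty):

from typing import Callable, Iterable, Iterator
from spacy.training import Example

def oversample(
    corpus: Iterable[Example],
    augmenter: Callable,
    conditional: Callable[[Example], bool],
    n: int,
    nlp=None,
) -> Iterator[Example]:
    # yield the original corpus, plus up to n augmented copies of
    # the examples that satisfy the condition
    produced = 0
    for example in corpus:
        yield example
        if produced < n and conditional(example):
            for augmented in augmenter(nlp, example):
                if produced >= n:
                    break
                yield augmented
                produced += 1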

Random token insertion

Randomly inserts a token. Unsure of how this would be represented in a dependency tree or what POS tag to assign it. Could work for span classification tasks, though. | Usage: Wei and Zou (2019)

Use of augmenty with spacy config files for training

I didn't see any documentation on how to import these augmenters when using spaCy 3.0's config and command-line system for training.
Is it possible to use augmenty in this way?
If so, how?

Upon further review: for the command line to register new augmentations, the flag
--code <code.py>
needs to be set when calling training. I tried pointing to the specific file that contains the keystroke augmenter I wanted, but it complains about not knowing a parent for relative imports. I also tried the various __init__.py files, but it complained as well.
It seems to work when you take the code out, place it in a new file without relative imports, and point to that.
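
For reference, a minimal sketch of what such a standalone code.py could contain. The registered name and the augmenter arguments are illustrative assumptions, not fixed API:

# code.py -- imported via `python -m spacy train config.cfg --code code.py`
import augmenty
import spacy

@spacy.registry.augmenters("my_keystroke_augmenter.v1")
def create_my_augmenter():
    # wrap an augmenty augmenter so the config can reference it by name
    return augmenty.load("keystroke_error.v1", level=0.05)

# in config.cfg:
# [corpora.train]
# @readers = "spacy.Corpus.v1"
# augmenter = {"@augmenters": "my_keystroke_augmenter.v1"}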


Which page or section is this issue related to?

https://spacy.io/usage/training#data-augmentation-custom

https://kennethenevoldsen.github.io/augmenty/tutorials/introduction.html#Applying-the-augmentation

publicity

Announce augmenty:

  • spaCy Universe
  • Twitter
  • LinkedIn

Paragraph subset augmenter

A paragraph subset augmentation which can work on token and sentence level. It samples a random percentage of included coherent tokens/sentences and a random token/sentence start position, ensuring the former constraint is maintained. The augmenter needs to handle annotated entities and avoid breaking them.

Input arguments:
level: how often to apply the augmenter
min_paragraf: minimum percentage of tokens or sentences to include. E.g., with 4 sentences and min_paragraf=0.5, at least 2 sentences are included.
sentence_level: Boolean defining whether to operate at the token or the sentence level

Example - sentence level

import augmenty
import spacy

nlp = spacy.load("en_core_web_sm")

# four sentences
texts = [
    "Augmenty is a wonderful tool for augmentation. Augmentation is a wonderful tool "
    "for obtaining higher performance on limited data. You can also use it to see how "
    "robust your model is to changes. It will sample a subset of the paragraph.",
]

augmenter = augmenty.load("paragraf_subset.v1", level=1.0, min_paragraf=0.5, sentence_level=True)

list(augmenty.texts(texts, augmenter, nlp))

Example outputs:

The first section:

Augmenty is a wonderful tool for augmentation. Augmentation is a wonderful tool
for obtaining higher performance on limited data.

The middle section:

Augmentation is a wonderful tool for obtaining higher performance on limited data.
You can also use it to see how robust your model is to changes.

The last section:

You can also use it to see how robust your model is to changes. It will sample a subset
of the paragraph.
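
The core sentence-level sampling logic could look roughly like this (a hypothetical sketch of the proposal above, reusing the proposed min_paragraf argument name; not an implementation in augmenty):

import random

def sample_paragraph(sentences: list, min_paragraf: float = 0.5) -> list:
    # choose a subset length of at least min_paragraf of the sentences,
    # then a random start position that keeps the subset coherent
    n = len(sentences)
    k = random.randint(max(1, int(n * min_paragraf)), n)
    start = random.randint(0, n - k)
    return sentences[start:start + k]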

Additional thoughts:

Possibly add a reverse augmenter, e.g. removing a coherent section of tokens/sentences.

Improve tests

  • Make tests draw from a single util script of fixtures
  • Time-limit test_all using a timeout
  • Parametrise test_all using pytest.mark.parametrize (see the sketch below)
    • Make a test which checks that all augmenters have a test
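
A minimal sketch of what the parametrised test could look like (the timeout marker assumes the pytest-timeout plugin is installed; the test body is illustrative):

import augmenty  # importing augmenty registers its augmenters
import pytest
import spacy

AUGMENTERS = spacy.registry.augmenters.get_all()

@pytest.mark.timeout(60)  # requires the pytest-timeout plugin
@pytest.mark.parametrize("name", sorted(AUGMENTERS))
def test_all(name):
    # every registered augmenter should at least be constructible
    assert callable(AUGMENTERS[name])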

Add integration for sense2vec

spaCy has a dedicated sense2vec extension, which might produce better word replacements than the word-embedding replacement.
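
For context, a small sketch of querying sense2vec for similar phrases (this assumes the sense2vec package and a downloaded pretrained vector directory; the path is a placeholder):

from sense2vec import Sense2Vec

s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_2015_md")
# sense2vec keys are "phrase|TAG", i.e. a phrase plus its coarse POS tag
most_similar = s2v.most_similar("natural_language_processing|NOUN", n=3)
print(most_similar)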

random token deletion

Random deletion of a token. Unsure of how this would be represented in a dependency tree. Could work for span classification tasks, though. | Usage: Wei and Zou (2019)

Resolve spancat dependencies in entity_replacer_v1

The current function ent_augmenter_v1 resolves all dependencies without resolving the dependency on spancat objects.
When the entities in a sentence are changed, the spancat span groups (e.g. under keys such as "sc", "spans", or a custom key) should be updated as well. This feature would allow running augmenty on a dataset that has both entities and spans.
For example -

"My name is Srijith Srinath. I work in Spacy."

Ents -
PERSON_NAME - Srijith Srinath
WORKPLACE - Spacy

Spans -
PERSON_DESC - My name is Srijith Srinath
WORKPLACE_DESC - I work in Spacy

If we change the ent "Srijith Srinath" to "John Doe", the changes currently occur only on the ents. The spans should be changed as well, as follows -

Ents -
PERSON_NAME - John Doe
WORKPLACE - Spacy

Spans -
PERSON_DESC - My name is John Doe
WORKPLACE_DESC - I work in Spacy
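
Until this is supported, a hypothetical workaround mirroring the remove-links augmenter above would be to strip span-group annotations before augmenting (this assumes the "spans" key is present in the Example's doc annotation in the installed spaCy version):

from typing import Iterator
from spacy.language import Language
from spacy.training import Example

def remove_spans(nlp: Language, example: Example) -> Iterator[Example]:
    # drop span-group annotations so the entity augmenter cannot misalign them
    example_dict = example.to_dict()
    example_dict["doc_annotation"].pop("spans", None)
    yield Example.from_dict(example.y, example_dict)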

Misaligned Token after Data Augmentation

Hi Kenneth,

first of all, I would like to thank you for publishing the amazing augmentation library Augmenty. It provides a wide range of augmentation possibilities in terms of modifications and modification levels.

I am using Augmenty to modify emails, with the goal of testing and increasing the robustness of my custom spaCy NER model. I annotated the named entities (15 labels) of the emails using Prodigy and saved them in the spaCy format (DocBin). Subsequently, I trained a German NER model with spaCy. The data augmentation of the annotated data was quite straightforward (you will find my custom augmenter attached to the email). Here is my code:

nlp = spacy.load("/home/models/model-best/")
corpus = Corpus('data/' + db + '.spacy')
augmented_corpus = [
    e for example in corpus(nlp) for e in augmenter(nlp, example)
]
docs: Dict = {"data": []}
for eg in augmented_corpus:
    doc = eg.reference
    docs["data"].append(doc)

docbin = DocBin(docs=docs["data"],
                attrs=["ENT_IOB", "ENT_TYPE"],
                store_user_data=True)
docbin.to_disk('data/' + db + '_augmented.spacy')

I encountered a token alignment problem in the training process. spaCy (python -m spacy debug data) returned a warning that several tokens are misaligned. I am wondering why this happens. In the example “applying augmentation to examples or a corpus” you use a dataset in the CoNLL (?) format, while I use a DocBin file as the basis. Might this be the reason? Or do I need to change the tokenizer settings?

[nlp]
lang = "de"
pipeline = ["tok2vec","ner"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

I looked into the augmented dataset and I didn’t find a clear pattern for the misalignment. Here is a short example:

corpus = Corpus('data/' + dbs_train[1] + '_augmented.spacy')
examples_aug = []
for example in corpus(nlp):
    examples_aug.append(example)

eg_aug = examples_aug[4]
align_aug = eg_aug.alignment
gold_aug = eg_aug.reference

output_aug = []
for token in gold_aug:
    output_aug.append(str(token) + ' ' + str(align_aug.x2y.lengths[token.i]))

With align.x2y.lengths[token.i] some numbers are greater than 1, which means misalignment. But I don’t understand the output:
['\n\n 1', 'Von 1', ': 3', 'Miller 1']

In this case the ‘:’ has a number of 3, why is that? Can you help me with this issue?

For reproducibility I have sent you the data via email.

Many thanks!

Your Environment

  • augmenty Version Used: 1.4.3
  • Operating System: Red Hat Enterprise Linux 8.8 (Ootpa)
  • Python Version Used: 3.9
  • spaCy Version Used: 3.7.4
  • Environment Information: Posit 2023.03.0

Add a repeat utility function

Add a function which repeats the augmenter n times:

import augmenty

aug = augmenty.load(...)
rep_aug = augmenty.repeat(augmenter=aug, n=3)
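
A minimal sketch of how such a utility could work (augmenty's actual augmenty.repeat may differ):

from typing import Callable, Iterator
from spacy.language import Language
from spacy.training import Example

def repeat(augmenter: Callable, n: int) -> Callable:
    # create an augmenter that applies the wrapped augmenter n times per example
    def repeated(nlp: Language, example: Example) -> Iterator[Example]:
        for _ in range(n):
            yield from augmenter(nlp, example)
    return repeated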

Danish synonym augmentation returns English suggestions

Code:

import spacy
import augmenty
nlp = spacy.load("da_core_news_sm")
danish_synonym_augmenter = augmenty.load("wordnet_synonym.v1", level=1, lang="da")

texts = ["Hej, jeg hedder Person 1 og er fra Lokation 1 og arbejder i Organisation 1, mit cpr er CPR 1, telefon: Telefon 5 og email: Email 1. Person 1 er en 20 årig mand. Person 2 er en person som arbejder i Organisation 2. Person 3 er en mand som bor i Lokation 1 og arbejder i Organisation 4"]

augmented_texts = augmenty.texts(texts, augmenter=danish_synonym_augmenter, nlp=nlp)
for text in augmented_texts:
    diff = [ (x,y) for x,y in zip(text.split(), texts[0].split()) if x != y ]
    print(diff)

Output:

[('federation', 'Organisation'), ('telephone:', 'telefon:'), ('telephone_set', 'Telefon'), ('person', 'Person'), ('adult_male.', 'mand.'), ('individual', 'Person'), ('somebody', 'person'), ('confederation', 'Organisation'), ('man', 'Person'), ('adult_male', 'mand'), ('confederation', 'Organisation')]

Environment information

  • augmenty Version Used: 0.0.3
  • Operating System: Mac OS / Windows
  • Python Version Used: 3.6.5 & 3.9.5
  • spaCy Version Used: spacy==3.1.1
  • NLTK Version Used: nltk==3.6.2
  • Environment Information: conda environment
