
kennethenevoldsen / augmenty


Augmenty is an augmentation library based on spaCy for augmenting texts.

Home Page: https://kennethenevoldsen.github.io/augmenty/

License: MIT License

Languages: Python 91.69%, TeX 7.68%, Makefile 0.63%

Topics: augmentation, natural-language-processing, nlp, nlproc, python, spacy, spacy-extension, spacy-nlp, text-augmentation, text-classification, training-data

augmenty's Introduction

Kenneth Enevoldsen

Researcher, scholar, teacher

 kennethcenevoldsen


Projects

The following are projects I am actively maintaining or contributing to. More may have been added since this list was written.

Name Description
MTEB The Massive Text Embedding Benchmark for evaluating document embeddings, e.g. for RAG systems.
Scandinavian Embedding Benchmark A Scandinavian benchmark for evaluating document embeddings.
DaCy The state-of-the-art Danish NLP pipeline for spaCy.
tomsup Theory of Mind Simulation using Python; a package that allows for easy agent-based modelling of recursive Theory of Mind agents.
Augmenty A structured augmentation library for augmenting both texts and annotations.
TextDescriptives A Python library for calculating a large variety of metrics from text.
timeseriesflattener A package for converting irregularly spaced time series, such as electronic health records, into statically shaped data frames.
Asent An educational library for performing transparent sentiment analysis.
ScandEval An evaluation benchmark for Scandinavian and Germanic language models, evaluating natural language understanding and generation.
swift-python-cookiecutter The cookiecutter template I actively use for my packages.
UD_Danish-DDT The Danish Universal Dependencies Treebank, a high-quality linguistic resource.

Contributions:

A selection of contributions to open-source libraries, besides those I actively maintain.

Library Contribution
Hugging Face libraries:
datasets Fixes for a minor compatibility issue with numpy >= 2.0.0
transformers Bug fixes for training masked language models using Flax
spaCy core libraries:
spacy-transformers Allow passing arguments to the transformer backend to obtain attention weights
confection Fixed an issue where configs could not be filled
spacy-curated-transformers Added support for ELECTRA tokenizers
curated-transformers Added ELECTRA

augmenty's People

Contributors

actions-user, arfon, dependabot[bot], eford36, kennethenevoldsen, koaning, pre-commit-ci[bot], qeterme


augmenty's Issues

List of potentially new augmenters

The following is a list of potentially new augmenters. If you wish for a specific augmenter to be added before others, please upvote the issue corresponding to the augmenter (if it doesn't have one, feel free to create it).

A variation of existing augmenters:

New augmenters:

Batch augmenters:

A combination of existing augmenters:

  • EDA augmenter following the EDA paper (Wei and Zou, 2019); a rough sketch follows below
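
A rough sketch of approximating EDA with augmenters already in the registry (synonym replacement and token swap; random insertion and deletion are still missing). The augmenty.combine chaining and the argument values here are assumptions, not a finished design:

import augmenty

# approximate two of the four EDA operations with existing augmenters
synonym = augmenty.load("wordnet_synonym.v1", level=0.1, lang="en")
swap = augmenty.load("token_swap.v1", level=0.1)
eda_like = augmenty.combine([synonym, swap])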

More robust tests

Apply all augmenters with a range of arguments in a grid-like fashion to the books.

Spacy 3.5 support

I'm using spaCy 3.5 with the new knowledge-base API. Is there any plan to bump the spaCy version supported by this library?

Back translation augmentation

Augmenting a document using back-translation through various languages, e.g. using Hugging Face translation models: https://huggingface.co/models?pipeline_tag=translation.

Example blog: https://dzlab.github.io/dltips/en/pytorch/text-augmentation/

Example sentence:
Augmenty is an augmentation library based on spaCy for augmenting texts. Augmenty differs from other augmentation libraries in that it corrects (as far as possible) the token, sentence and document labels under the augmentation.

English -> Danish (Google):
Augmenty er et udvidelsesbibliotek baseret på spaCy til forstørrelse af tekster. Augmenty adskiller sig fra andre augmentationsbiblioteker ved, at den korrigerer (så vidt muligt) token-, sætnings- og dokumentetiketterne under augmentationen.

Danish -> English (Google):
Augmenty is an extension library based on spaCy for enlarging texts. Augmenty differs from other augmentation libraries in that it corrects (as far as possible) the token, sentence, and document labels during augmentation.
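
A minimal sketch of such an augmentation using Hugging Face pipelines. The Helsinki-NLP checkpoints are real models, but the wrapper function is hypothetical and does not preserve token-level annotations, which is the hard part for augmenty:

from transformers import pipeline

en_to_da = pipeline("translation", model="Helsinki-NLP/opus-mt-en-da")
da_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-da-en")

def back_translate(text: str) -> str:
    # translate English -> Danish -> English to obtain a paraphrase
    danish = en_to_da(text)[0]["translation_text"]
    return da_to_en(danish)[0]["translation_text"]

print(back_translate("Augmenty is an augmentation library based on spaCy for augmenting texts."))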

IndexError: list index out of range for the documents reconstructed from DocBin

IndexError: list index out of range for documents reconstructed from a DocBin without dependency annotations, in a kedro pipeline.

.../lib/python3.9/site-packages/augmenty/span/entities.py:56 in ent_augmenter_v1          
     53 │   │   │   tok_anno["POS"][i] = ["PROPN"] * len_ent 
     54 │   │   │                                                                             
     55 │   │   │   tok_anno["MORPH"][i] = [""] * len_ent     
❱  56 │   │   │   tok_anno["DEP"][i] = [tok_anno["DEP"][i][0]] + ["flat"] * (len_ent - 1) 
     57 │   │   │    
     58 │   │   │   tok_anno["SENT_START"][i] = [tok_anno["SENT_START"][i][0]] + [0] * (    
     59 │   │   │   │   len_ent - 1   

The augmenter is defined as:

augmenter = augmenty.load(
    "ents_replace_v1",
    level=level,
    ent_dict={"pers": [[s] for s in ents_as_str]},
    replace_consistency=True,  # True or False doesn't change the behaviour
    resolve_dependencies=True,
)
repeated_augmenter = augmenty.repeat(augmenter=augmenter, n=n_repeat)
augmented_docs = augmenty.docs(docs, repeated_augmenter, model)

Your Environment

  • augmenty Version Used: 1.3.2
  • spaCy version: 3.5.1
  • Platform: Linux-5.15.0-69-generic-x86_64-with-glibc2.31
  • Python version: 3.9.16
  • Pipelines: en_core_anno_floret (3.5.0) # Custom model

Word embedding augmentation not registered

How to reproduce the behaviour

import spacy
import augmenty
nlp = spacy.load('en_core_web_lg')
danish_wordemb = augmenty.load("word_embedding.v1", nlp=nlp)

Error message:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/python3.9/site-packages/augmenty/util.py", line 114, in load
    aug = spacy.registry.augmenters.get(augmenter)
  File "/python3.9/site-packages/catalogue/__init__.py", line 96, in get
    raise RegistryError(err.format(name, current_namespace, available))
catalogue.RegistryError: Can't find 'word_embedding.v1' in registry spacy -> augmenters. Available names: char_replace.v1, char_replace_random.v1, char_swap.v1, conditional_token_casing.v1, da_historical_noun_casing.v1, da_æøå_replace.v1, ents_format.v1, ents_replace.v1, grundtvigian_spacing_augmenter.v1, keystroke_error.v1, per_replace.v1, random_casing.v1, random_starting_case.v1, remove_spacing.v1, spacing_insertion.v1, spacy.lower_case.v1, spacy.orth_variants.v1, spongebob.v1, token_replace.v1, token_swap.v1, upper_case.v1, wordnet_synonym.v1
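
One way to check which augmenters the installed version actually registers before loading one (a sketch; it relies on importing augmenty having filled spaCy's augmenter registry, as the traceback above suggests):

import augmenty  # importing augmenty registers its augmenters with spaCy
import spacy

# print the names of all augmenters currently in the registry
for name in sorted(spacy.registry.augmenters.get_all()):
    print(name)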

Your Environment

  • augmenty Version Used: 0.0.7
  • Operating System: Mac OS / Windows
  • Python Version Used: 3.6.5 & 3.9.5
  • spaCy Version Used: spacy==3.1.1
  • NLTK Version Used: nltk==3.6.2
  • Environment Information: conda environment

Current entity augmenters do not handle entity 'links'

Current entity augmenters do not handle entity links as one would use for entity linking.

Ideally, entity formatters should keep the same link, while entity replacers could potentially add a new link.

If you don't care about the links annotation, it can be removed using an augmenter such as:

from typing import Any, Callable, Iterator
from spacy.language import Language
from spacy.training import Example

def create_remove_links_augmenter() -> Callable[..., Iterator[Any]]:
    def remove_links(nlp: Language, example: Example) -> Iterator[Example]:
        example_dict = example.to_dict()
        example_dict["doc_annotation"].pop("links")  # drop the entity-link annotation
        yield Example.from_dict(example.y, example_dict)
    return remove_links
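
Hypothetical usage, chaining it with another augmenter (this assumes augmenty.combine chains augmenters as documented; the upper_case.v1 augmenter and level value are just illustrative):

import augmenty

remove_links = create_remove_links_augmenter()
upper = augmenty.load("upper_case.v1", level=0.5)
combined = augmenty.combine([remove_links, upper])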

implement an oversampling function

Augmentation can be used to oversample a category.

Imagined usage would look something like this:

aug = augmenty.load(...)

def is_positive(example):
    """Return True if the example is labelled positive."""
    return example.y.cats["positive"] == 1

upsampled_corpus = augmenty.oversample(corpus, augmenter=aug, conditional=is_positive, n=1000)
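
A minimal sketch of what such an oversample utility could look like (hypothetical; it is not part of augmenty):

from typing import Callable, Iterable, Iterator
from spacy.training import Example

def oversample(
    corpus: Iterable[Example],
    augmenter: Callable,
    conditional: Callable[[Example], bool],
    n: int,
    nlp=None,
) -> Iterator[Example]:
    # yield the original corpus, plus up to n augmented copies of
    # the examples that satisfy the condition
    produced = 0
    for example in corpus:
        yield example
        if produced < n and conditional(example):
            for augmented in augmenter(nlp, example):
                if produced >= n:
                    break
                yield augmented
                produced += 1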

Random token insertion

Randomly inserts a token. Unsure of how this would be represented in a dependency tree or what POS tag to assign it. Could work for span classification tasks, though. | Usage: Wei and Zou (2019)

Use of augmenty with spacy config files for training

I didn't see any documentation on how to import these augmenters when using spaCy 3.0's config and command-line system for training.
Is it possible to use augmenty in this way?
If so, how?

Upon further review: for the command line to register new augmentations, the flag
--code <code.py>
needs to be set when calling training. I tried pointing to the specific file that contains the keystroke augmenter I wanted, but it complains about not knowing a parent for relative imports. I also tried the various __init__.py files, but it complained as well.
It seems to work when you take the code out, place it in a new file without relative imports, and point to that.
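
For reference, a minimal sketch of what such a standalone code.py could contain. The registered name and the augmenter arguments are illustrative assumptions, not fixed API:

# code.py -- imported via `python -m spacy train config.cfg --code code.py`
import augmenty
import spacy

@spacy.registry.augmenters("my_keystroke_augmenter.v1")
def create_my_augmenter():
    # wrap an augmenty augmenter so the config can reference it by name
    return augmenty.load("keystroke_error.v1", level=0.05)

# in config.cfg:
# [corpora.train]
# @readers = "spacy.Corpus.v1"
# augmenter = {"@augmenters": "my_keystroke_augmenter.v1"}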


Which page or section is this issue related to?

https://spacy.io/usage/training#data-augmentation-custom

https://kennethenevoldsen.github.io/augmenty/tutorials/introduction.html#Applying-the-augmentation

publicity

Announce augmenty:

  • spaCy Universe
  • Twitter
  • LinkedIn

Paragraph subset augmenter

A paragraph subset augmentation which can work on token and sentence level. It samples a random percentage of included coherent tokens/sentences and a random token/sentence start position, ensuring the former constraint is maintained. The augmenter needs to handle annotated entities and avoid breaking them.

Input arguments:
level: how often to apply the augmenter
min_paragraf: minimum percentage of tokens or sentences to include. E.g., with 4 sentences and min_paragraf=0.5, at least 2 sentences are included.
sentence_level: Boolean defining whether to operate at the token or the sentence level

Example - sentence level

import augmenty
import spacy

nlp = spacy.load("en_core_web_sm")

# four sentences
texts = [
    "Augmenty is a wonderful tool for augmentation. Augmentation is a wonderful tool "
    "for obtaining higher performance on limited data. You can also use it to see how "
    "robust your model is to changes. It will sample a subset of the paragraph.",
]

augmenter = augmenty.load("paragraf_subset.v1", level=1.0, min_paragraf=0.5, sentence_level=True)

list(augmenty.texts(texts, augmenter, nlp))

Example outputs:

The first section:

Augmenty is a wonderful tool for augmentation. Augmentation is a wonderful tool
for obtaining higher performance on limited data.

The middle section:

Augmentation is a wonderful tool for obtaining higher performance on limited data.
You can also use it to see how robust your model is to changes.

The last section:

You can also use it to see how robust your model is to changes. It will sample a subset
of the paragraph.
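
The core sentence-level sampling logic could look roughly like this (a hypothetical sketch of the proposal above, reusing the proposed min_paragraf argument name; not an implementation in augmenty):

import random

def sample_paragraph(sentences: list, min_paragraf: float = 0.5) -> list:
    # choose a subset length of at least min_paragraf of the sentences,
    # then a random start position that keeps the subset coherent
    n = len(sentences)
    k = random.randint(max(1, int(n * min_paragraf)), n)
    start = random.randint(0, n - k)
    return sentences[start:start + k]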

Additional thoughts:

Possibly add a reverse augmenter, e.g. removing a coherent section of tokens/sentences.

Improve tests

  • Make tests draw from a single util script of fixtures
  • Time-limit test_all using a timeout
  • Parametrise test_all using pytest.mark.parametrize (see the sketch below)
    • Make a test which checks that all augmenters have a test
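
A minimal sketch of what the parametrised test could look like (the timeout marker assumes the pytest-timeout plugin is installed; the test body is illustrative):

import augmenty  # importing augmenty registers its augmenters
import pytest
import spacy

AUGMENTERS = spacy.registry.augmenters.get_all()

@pytest.mark.timeout(60)  # requires the pytest-timeout plugin
@pytest.mark.parametrize("name", sorted(AUGMENTERS))
def test_all(name):
    # every registered augmenter should at least be constructible
    assert callable(AUGMENTERS[name])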

Add integration for sense2vec

spaCy has a dedicated sense2vec extension, which might produce better word replacements than the word-embedding replacement.
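
For context, a small sketch of querying sense2vec for similar phrases (this assumes the sense2vec package and a downloaded pretrained vector directory; the path is a placeholder):

from sense2vec import Sense2Vec

s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_2015_md")
# sense2vec keys are "phrase|TAG", i.e. a phrase plus its coarse POS tag
most_similar = s2v.most_similar("natural_language_processing|NOUN", n=3)
print(most_similar)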

random token deletion

Random deletion of a token. Unsure of how this would be represented in a dependency tree. Could work for span classification tasks, though. | Usage: Wei and Zou (2019)

Resolve spancat dependencies in entity_replacer_v1

The current function ent_augmenter_v1 resolves all dependencies without resolving the dependency on spancat objects.
When the entities in a sentence are changed, the spancat span groups (e.g. under keys such as "sc", "spans", or a custom key) should be updated as well. This feature would allow running augmenty on a dataset that has both entities and spans.
For example -

"My name is Srijith Srinath. I work in Spacy."

Ents -
PERSON_NAME - Srijith Srinath
WORKPLACE - Spacy

Spans -
PERSON_DESC - My name is Srijith Srinath
WORKPLACE_DESC - I work in Spacy

If we change the ent "Srijith Srinath" to "John Doe", the changes currently occur only on the ents. The spans should be changed as well, as follows -

Ents -
PERSON_NAME - John Doe
WORKPLACE - Spacy

Spans -
PERSON_DESC - My name is John Doe
WORKPLACE_DESC - I work in Spacy
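
Until this is supported, a hypothetical workaround mirroring the remove-links augmenter above would be to strip span-group annotations before augmenting (this assumes the "spans" key is present in the Example's doc annotation in the installed spaCy version):

from typing import Iterator
from spacy.language import Language
from spacy.training import Example

def remove_spans(nlp: Language, example: Example) -> Iterator[Example]:
    # drop span-group annotations so the entity augmenter cannot misalign them
    example_dict = example.to_dict()
    example_dict["doc_annotation"].pop("spans", None)
    yield Example.from_dict(example.y, example_dict)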

Misaligned Token after Data Augmentation

Hi Kenneth,

first of all, I would like to thank you for publishing the amazing augmentation library Augmenty. It provides a wide range of augmentation possibilities in terms of modifications and modification levels.

I am using Augmenty to modify emails, with the goal of testing and increasing the robustness of my custom spaCy NER model. I annotated the named entities (15 labels) of the emails using Prodigy and saved them in the spaCy format (DocBin). Subsequently, I trained a German NER model with spaCy. The data augmentation of the annotated data was quite straightforward (you will find my custom augmenter attached to the email). Here is my code:

nlp = spacy.load("/home/models/model-best/")
corpus = Corpus('data/' + db + '.spacy')
augmented_corpus = [
    e for example in corpus(nlp) for e in augmenter(nlp, example)
]
docs: Dict = {"data": []}
for eg in augmented_corpus:
    doc = eg.reference
    docs["data"].append(doc)

docbin = DocBin(docs=docs["data"],
                attrs=["ENT_IOB", "ENT_TYPE"],
                store_user_data=True)
docbin.to_disk('data/' + db + '_augmented.spacy')

I encountered a token alignment problem in the training process. spaCy (python -m spacy debug data) returned a warning that several tokens are misaligned. I am wondering why this happens. In the example “applying augmentation to examples or a corpus” you use a dataset in the CoNLL (?) format, while I use a DocBin file as the basis. Might this be the reason? Or do I need to change the tokenizer settings?

[nlp]
lang = "de"
pipeline = ["tok2vec","ner"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

I looked into the augmented dataset and I didn’t find a clear pattern for the misalignment. Here is a short example:

corpus = Corpus('data/' + dbs_train[1] + '_augmented.spacy')
examples_aug = []
for example in corpus(nlp):
    examples_aug.append(example)

eg_aug = examples_aug[4]
align_aug = eg_aug.alignment
gold_aug = eg_aug.reference

output_aug = []
for token in gold_aug:
    output_aug.append(str(token) + ' ' + str(align_aug.x2y.lengths[token.i]))

With align.x2y.lengths[token.i] some numbers are greater than 1, which means misalignment. But I don’t understand the output:
['\n\n 1', 'Von 1', ': 3', 'Miller 1']

In this case the ‘:’ has a number of 3, why is that? Can you help me with this issue?

For reproducibility I have sent you the data via email.

Many thanks!

Your Environment

  • augmenty Version Used: 1.4.3
  • Operating System: Red Hat Enterprise Linux 8.8 (Ootpa)
  • Python Version Used: 3.9
  • spaCy Version Used: 3.7.4
  • Environment Information: Posit 2023.03.0

Add a repeat utility function

Add a function which repeats the augmenter n times:

import augmenty

aug = augmenty.load(...)
rep_aug = augmenty.repeat(augmenter=aug, n=3)
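
A minimal sketch of how such a utility could work (augmenty's actual augmenty.repeat may differ):

from typing import Callable, Iterator
from spacy.language import Language
from spacy.training import Example

def repeat(augmenter: Callable, n: int) -> Callable:
    # create an augmenter that applies the wrapped augmenter n times per example
    def repeated(nlp: Language, example: Example) -> Iterator[Example]:
        for _ in range(n):
            yield from augmenter(nlp, example)
    return repeated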

Danish synonym augmentation returns English suggestions

Code:

import spacy
import augmenty
nlp = spacy.load("da_core_news_sm")
danish_synonym_augmenter = augmenty.load("wordnet_synonym.v1", level=1, lang="da")

texts = ["Hej, jeg hedder Person 1 og er fra Lokation 1 og arbejder i Organisation 1, mit cpr er CPR 1, telefon: Telefon 5 og email: Email 1. Person 1 er en 20 årig mand. Person 2 er en person som arbejder i Organisation 2. Person 3 er en mand som bor i Lokation 1 og arbejder i Organisation 4"]

augmented_texts = augmenty.texts(texts, augmenter=danish_synonym_augmenter, nlp=nlp)
for text in augmented_texts:
    diff = [ (x,y) for x,y in zip(text.split(), texts[0].split()) if x != y ]
    print(diff)

Output:

[('federation', 'Organisation'), ('telephone:', 'telefon:'), ('telephone_set', 'Telefon'), ('person', 'Person'), ('adult_male.', 'mand.'), ('individual', 'Person'), ('somebody', 'person'), ('confederation', 'Organisation'), ('man', 'Person'), ('adult_male', 'mand'), ('confederation', 'Organisation')]

Environment information

  • augmenty Version Used: 0.0.3
  • Operating System: Mac OS / Windows
  • Python Version Used: 3.6.5 & 3.9.5
  • spaCy Version Used: spacy==3.1.1
  • NLTK Version Used: nltk==3.6.2
  • Environment Information: conda environment
