GithubHelp home page GithubHelp logo

yasufumy / sequence-label Goto Github PK

View Code? Open in Web Editor NEW
0.0 3.0 0.0 40 KB

A Tensor Creation and Label Reconstruction for Sequence Labeling

License: MIT License

Python 100.00%
machine-learning named-entity-recognition natural-language-processing nlp sequence-labeling

sequence-label's Introduction

sequence-label

sequence-label is a Python library that streamlines the process of creating tensors for sequence labels and reconstructing sequence labels data from tensors. Whether you're working on named entity recognition, part-of-speech tagging, or any other sequence labeling task, this library offers a convenient utility to simplify your workflow.

Basic Usage

Import the necessary dependencies:

from transformers import AutoTokenizer

from sequence_label import LabelSet, SequenceLabel
from sequence_label.transformers import create_alignments

Start by creating sequence labels using the SequenceLabel.from_dict method. Define your text and associated labels:

text1 = "Tokyo is the capital of Japan."
label1 = SequenceLabel.from_dict(
    tags=[
        {"start": 0, "end": 5, "label": "LOC"},
        {"start": 24, "end": 29, "label": "LOC"},
    ],
    size=len(text1),
)

text2 = "The Monster Naoya Inoue is the who's who of boxing."
label2 = SequenceLabel.from_dict(
    tags=[{"start": 12, "end": 23, "label": "PER"}],
    size=len(text2),
)

texts = [text1, text2]
labels = [label1, label2]

Next, tokenize your texts and create the alignments using the create_alignments method. alignments is a tuple of instances of LabelAlignment that aligns sequence labels with the tokenized result:

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
batch_encoding = tokenizer(texts)

alignments = create_alignments(
    batch_encoding=batch_encoding,
    lengths=list(map(len, texts)),
    padding_token=tokenizer.pad_token
)

Now, create a label_set that will allow you to create tensors from sequence labels and reconstruct sequence labels from tensors. Use the label_set.encode_to_tag_indices method to create tag_indices:

label_set = LabelSet(
    labels={"ORG", "LOC", "PER", "MISC"},
    padding_index=-1,
)

tag_indices = label_set.encode_to_tag_indices(
    labels=labels,
    alignments=alignments,
)

Finally, use the label_set.decode method to reconstruct the sequence labels from tag_indices and alignments:

labels2 = label_set.decode(
    tag_indices=tag_indices, alignments=alignments,
)

assert labels == labels2

Installation

pip install sequence-label

sequence-label's People

Contributors

yasufumy avatar

Watchers

 avatar  avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.