
doccano / doccano-transformer


The official tool for transforming doccano format into common dataset formats.

License: MIT License

Language: Python (100%)
Topics: conll, annotation, dataset, machine-learning, natural-language-processing, doccano

doccano-transformer's Introduction

doccano


doccano is an open-source text annotation tool for humans. It provides annotation features for text classification, sequence labeling, and sequence-to-sequence tasks. You can create labeled data for sentiment analysis, named entity recognition, text summarization, and so on. Just create a project, upload data, and start annotating. You can build a dataset in hours.

Demo

Try the annotation demo.

Demo image

Documentation

Read the documentation at https://doccano.github.io/doccano/.

Features

  • Collaborative annotation
  • Multi-language support
  • Mobile support
  • Emoji 😄 support
  • Dark theme
  • RESTful API

Usage

There are three options to run doccano:

  • pip (Python 3.8+)
  • Docker
  • Docker Compose

pip

To install doccano, run:

pip install doccano

By default, SQLite 3 is used as the database. If you want to use PostgreSQL instead, install the additional dependencies:

pip install 'doccano[postgresql]'

and set the DATABASE_URL environment variable according to your PostgreSQL credentials:

DATABASE_URL="postgres://${POSTGRES_USER}:${POSTGRES_PASSWORD}@${POSTGRES_HOST}:${POSTGRES_PORT}/${POSTGRES_DB}?sslmode=disable"

After installation, run the following commands:

# Initialize database.
doccano init
# Create a super user.
doccano createuser --username admin --password pass
# Start a web server.
doccano webserver --port 8000

In another terminal, run the command:

# Start the task queue to handle file upload/download.
doccano task

Go to http://127.0.0.1:8000/.

Docker

As a one-time setup, create a Docker container as follows:

docker pull doccano/doccano
docker container create --name doccano \
  -e "ADMIN_USERNAME=admin" \
  -e "[email protected]" \
  -e "ADMIN_PASSWORD=password" \
  -v doccano-db:/data \
  -p 8000:8000 doccano/doccano

Next, start doccano by running the container:

docker container start doccano

Go to http://127.0.0.1:8000/.

To stop the container, run docker container stop doccano -t 5. All data created in the container will persist across restarts.

If you want to use the latest features, specify the nightly tag:

docker pull doccano/doccano:nightly

Docker Compose

You need to install Git and clone the repository:

git clone https://github.com/doccano/doccano.git
cd doccano

Note for Windows developers: be sure to configure Git to handle line endings correctly, or you may encounter status code 127 errors while running the services in later steps. Cloning with the config option below ensures your working copy handles line endings correctly.

git clone https://github.com/doccano/doccano.git --config core.autocrlf=input

Then, create a .env file with variables in the following format (see ./docker/.env.example):

# platform settings
ADMIN_USERNAME=admin
ADMIN_PASSWORD=password
[email protected]

# rabbit mq settings
RABBITMQ_DEFAULT_USER=doccano
RABBITMQ_DEFAULT_PASS=doccano

# database settings
POSTGRES_USER=doccano
POSTGRES_PASSWORD=doccano
POSTGRES_DB=doccano

Run the following command, then access http://127.0.0.1/.

docker-compose -f docker/docker-compose.prod.yml --env-file .env up

One-click Deployment

  • AWS: CloudFormation "Launch Stack" button (see footnote 1)
  • Heroku: "Deploy" button

FAQ

See the documentation for details.

Contribution

As with any software, doccano is under continuous development. If you have requests for features, please file an issue describing your request. Also, if you want to see work toward a specific feature, feel free to contribute by working toward it. The standard procedure is to fork the repository, add a feature or fix a bug, and then file a pull request so that your changes can be merged into the main repository and included in the next release.

Here are some tips that might be helpful: How to Contribute to Doccano Project

Citation

@misc{doccano,
  title={{doccano}: Text Annotation Tool for Human},
  url={https://github.com/doccano/doccano},
  note={Software available from https://github.com/doccano/doccano},
  author={
    Hiroki Nakayama and
    Takahiro Kubo and
    Junya Kamura and
    Yasufumi Taniguchi and
    Xu Liang},
  year={2018},
}

Contact

For help and feedback, feel free to contact the author.

Footnotes

  1. (1) An EC2 KeyPair cannot be created automatically, so make sure you have an existing EC2 KeyPair in one region, or create one yourself. (2) If you want to access doccano via HTTPS on AWS, see these instructions.

doccano-transformer's People

Contributors

codacy-badger, dependabot[bot], hironsan, yasufumy


doccano-transformer's Issues


to_conll2003 puts the "B-" tag in the wrong place

How to reproduce the behaviour

{"id": 19523, "text": "\"颜姑娘。\"易左古不懂颜幼韶所道万福是什么意思。", "meta": {}, "annotation_approver": null, "labels": [[6, 9, "SPEAKER"]]}

When converting with to_conll2003, the B-SPEAKER tag does not correspond to 易; it is placed one position earlier, on the " character.

Suggested fix: change the comparison in the create_bio_tags function in utils.py as follows:

#if i >= n or token_end < labels[i][0]:
if i >= n or token_end <= labels[i][0]:

Your Environment

  • Operating System: MacOS 11.0.1
  • Python Version Used: 3.7.3
  • doccano-transformer Version: 1.0.2

Package is not available on pip

I am not able to pip install doccano-transformer; the package is not on PyPI.
pip search doccano-transformer does not return the package.

Bug in Sentence Offset

Hello, I want to report a bug. I have a sentence like this:

As a administrator, I want to refund sponsorship money that was processed via stripe, so that people get their monies back.

When I try to convert it to CoNLL, the span is not converted correctly. I debugged the library and found that the offsets are wrong. Here is the offset output:

As 0
a 3
administrator, 3
I 20
want 22
to 25
refund 30
sponsorship 37
money 49
that 55
was 60
processed 64
via 74
stripe, 78
so 86
that 89
people 94
get 101
their 103
monies 111
back. 118

As you can see, the second and third lines have the same offset (3 and 3, while it should be 3 and 5). This behavior makes the span undetected in the conversion process.

It seems that the get_offsets function in utils.py decides the offsets by checking character equality in sequence.

def get_offsets(
        text: str,
        tokens: List[str],
        start: Optional[int] = 0) -> List[int]:
    """Calculate char offsets of each tokens.

    Args:
        text (str): The string before tokenized.
        tokens (List[str]): The list of the string. Each string corresponds
            token.
        start (Optional[int]): The start position.
    Returns:
        (List[str]): The list of the offset.
    """
    offsets = []
    i = 0
    for token in tokens:
        for j, char in enumerate(token):
            while char != text[i]:
                i += 1

            if j == 0:
                offsets.append(i + start)
    return offsets

This is a problem whenever the first character of a token is the same as the last character of the previous token, because the index is never advanced past a matched character. I'm still looking for a fix to this problem.
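A possible fix (my own sketch, not a confirmed patch): advance the position index past each matched character, so a token can never re-match the position consumed by the previous token.

from typing import List, Optional

def get_offsets(text: str, tokens: List[str], start: Optional[int] = 0) -> List[int]:
    """Calculate the character offset of each token (sketch of a possible fix)."""
    offsets = []
    i = 0
    for token in tokens:
        for j, char in enumerate(token):
            # Skip characters (e.g. whitespace) until the next match.
            while char != text[i]:
                i += 1
            if j == 0:
                offsets.append(i + start)
            # Advance past the matched character so the next token
            # cannot re-match the same position.
            i += 1
    return offsets

With this change, the sentence above yields offsets 0, 3, 5, ... as expected.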

Cheers

No transformer for RELATION EXTRACTION

Hi,
I was trying to convert relation-annotated data to spaCy binary format, but doccano-transformer supports only NERDataset, not a relation-extraction dataset.

Is there any way to convert the JSONL relation-annotation export to spaCy binary?
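There is no built-in converter, but here is a minimal reader sketch. The field names (entities, relations, from_id, to_id, start_offset, end_offset) are assumptions taken from the relation export shown in the "Token level output" issue below, and producing an actual spaCy binary would still require a custom pipeline.

import json

with open('relations.jsonl', encoding='utf-8') as f:  # hypothetical file name
    for line in f:
        record = json.loads(line)
        entities = {e['id']: e for e in record['entities']}
        for rel in record['relations']:
            head = entities.get(rel['from_id'])
            tail = entities.get(rel['to_id'])
            if head is None or tail is None:
                continue  # Skip relations that point at missing entities.
            head_text = record['text'][head['start_offset']:head['end_offset']]
            tail_text = record['text'][tail['start_offset']:tail['end_offset']]
            print(head_text, '--[', rel['type'], ']->', tail_text)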

Token level output

Hi, my question is related to this one. Is this feature already supported?

I'm using doccano to annotate my files and exporting them in .jsonl format. As an output I get something like this:

{"id":1,"text":"...","entities":[{"id":123,"label":"Invoice Number Token","start_offset":216,"end_offset":226}],"relations":[{"id":6,"from_id": 123,"to_id": 125,"type": "Invoice Number Relation"}]} {"id":2,"text":"...","entities":[{"id":123,"label":"Invoice Number Token","start_offset":216,"end_offset":226}],"relations":[{"id":6,"from_id": 123,"to_id": 125,"type": "Invoice Number Relation"}]}

My code looks like this:

from doccano_transformer.datasets import NERDataset
from doccano_transformer.utils import read_jsonl
with open('trensformed.txt', "w", encoding = "utf-8") as file:
    for entry in read_jsonl(filepath=r'admin.jsonl', dataset=NERDataset, encoding='latin-1').to_conll2003(tokenizer=str.split):
        file.write(entry["data"] + "\n")

I'm getting this error: KeyError: 'The file should includes either "labels" or "annotations".' What changes do I need to make to the doccano output file in order to achieve the desired result?

  • Operating System: Windows 11
  • Python Version Used: 3.10.4
  • doccano-transformer Version: 1.0.2
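A workaround sketch for the error above (an assumption, not an official fix): the newer doccano export uses an entities key, while doccano-transformer expects the older labels format of [start, end, label] triples (as in the JSONL shown in the first issue above). Converting the file first should avoid the KeyError.

import json

# Convert the "entities" export format into the older "labels" format.
with open('admin.jsonl', encoding='utf-8') as src, \
        open('admin_labels.jsonl', 'w', encoding='utf-8') as dst:
    for line in src:
        record = json.loads(line)
        record['labels'] = [
            [e['start_offset'], e['end_offset'], e['label']]
            for e in record.pop('entities', [])
        ]
        record.pop('relations', None)  # Relations are not supported by NERDataset.
        dst.write(json.dumps(record, ensure_ascii=False) + '\n')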

How to save to a file

Thanks for your work.

I want to use the doccano-annotated data with Spark NLP, so I need to convert it to CoNLL format, but I haven't found out how to save the result to a file.

How should I do this?

All suggestions and ideas are welcome.
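Since to_conll2003 returns a generator, one way to save its output to a file (a sketch; the entry['data'] key is taken from the snippet in the "Token level output" issue above):

from doccano_transformer.datasets import NERDataset
from doccano_transformer.utils import read_jsonl

dataset = read_jsonl(filepath='example.jsonl', dataset=NERDataset, encoding='utf-8')
with open('example.conll', 'w', encoding='utf-8') as f:
    for entry in dataset.to_conll2003(tokenizer=str.split):
        # Each yielded entry's 'data' field holds the CoNLL-formatted text.
        f.write(entry['data'] + '\n')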

Not compatible with spacy 3.x

When running the sample code I get the following error:

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-82-777c977a1d80> in <module>
      1 import doccano_transformer
      2 
----> 3 from doccano_transformer.datasets import NERDataset
      4 from doccano_transformer.utils import read_jsonl
      5 

~/miniconda3/envs/base/lib/python3.7/site-packages/doccano_transformer/datasets.py in <module>
      3 from typing import Any, Callable, Iterable, Iterator, List, Optional, TextIO
      4 
----> 5 from doccano_transformer.examples import Example, NERExample
      6 
      7 

~/miniconda3/envs/base/lib/python3.7/site-packages/doccano_transformer/examples.py in <module>
      2 from typing import Callable, Iterator, List, Optional
      3 
----> 4 from spacy.gold import biluo_tags_from_offsets
      5 
      6 from doccano_transformer import utils

ModuleNotFoundError: No module named 'spacy.gold'

It seems this was removed in spaCy v3.x: https://github.com/explosion/spaCy/releases
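One possible patch (a sketch, based on the rename described in the "No module named spacy.gold" issue below): make the import in doccano_transformer/examples.py fall back to the spaCy 3.x location, where the function was renamed to offsets_to_biluo_tags. Alternatively, pin spaCy below 3.0 with pip install 'spacy<3.0'.

try:
    from spacy.gold import biluo_tags_from_offsets  # spaCy 2.x
except ModuleNotFoundError:
    # spaCy 3.x moved the function to spacy.training and renamed it.
    from spacy.training import offsets_to_biluo_tags as biluo_tags_from_offsets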

fastText format for text classification

Example:

__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?
__label__food-safety __label__acidity Dangerous pathogens capable of growing in acidic environments
__label__cast-iron __label__stove How do I cover up the white spots on my cast iron stove?
__label__restaurant Michelin Three Star Restaurant; but if the chef is not there
__label__knife-skills __label__dicing Without knife skills, how can I quickly and accurately dice vegetables?
__label__storage-method __label__equipment __label__bread What’s the purpose of a bread box?
__label__baking __label__food-safety __label__substitutions __label__peanuts how to seperate peanut oil from roasted peanuts at home?
__label__chocolate American equivalent for British chocolate terms
__label__baking __label__oven __label__convection Fan bake vs bake
__label__sauce __label__storage-lifetime __label__acidity __label__mayonnaise Regulation and balancing of readymade packed mayonnaise and other sauces
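A converter sketch (my own; the input format, a JSONL with a "text" field and a "labels" list of strings, is an assumption about doccano's text-classification export):

import json

# Convert doccano classification JSONL to the fastText format shown above.
with open('classification.jsonl', encoding='utf-8') as src, \
        open('train.ft.txt', 'w', encoding='utf-8') as dst:
    for line in src:
        record = json.loads(line)
        prefix = ' '.join(f'__label__{label}' for label in record['labels'])
        text = record['text'].replace('\n', ' ')
        dst.write(f'{prefix} {text}\n')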

No module named spacy.gold

Environment

  • Operating System: Windows
  • Python Version Used: 3.9.13
  • Spacy Version Used: 3.4
  • doccano-transformer Version: module 'doccano_transformer' has no attribute 'version'

When importing NERDataset with from doccano_transformer.datasets import NERDataset, I receive an error:

ModuleNotFoundError                       Traceback (most recent call last)
Cell In [8], line 1
----> 1 from doccano_transformer.datasets import NERDataset
      2 from doccano_transformer.utils import read_jsonl

File c:\Users\yana.stamenova\work-data\RnD_Models\venv_new\lib\site-packages\doccano_transformer\datasets.py:5
      2 import json
      3 from typing import Any, Callable, Iterable, Iterator, List, Optional, TextIO
----> 5 from doccano_transformer.examples import Example, NERExample
      8 class Dataset:
      9     def __init__(
     10         self,
     11         filepath: str,
     12         encoding: Optional[str] = 'utf-8',
     13         transformation_func: Optional[Callable[[TextIO], Iterable[Any]]] = None
     14     ) -> None:

File c:\Users\yana.stamenova\work-data\RnD_Models\venv_new\lib\site-packages\doccano_transformer\examples.py:4
      1 from collections import defaultdict
      2 from typing import Callable, Iterator, List, Optional
----> 4 from spacy.gold import biluo_tags_from_offsets
      6 from doccano_transformer import utils
      9 class Example:

ModuleNotFoundError: No module named 'spacy.gold'

The gold module is part of spaCy 2.x; in spaCy 3.0 it was renamed to training.

I have proposed an edit to examples.py

detailed documentation

Hi there, I tried using this library but I couldn't figure out how to use it. Could you provide detailed documentation?

CoNLL format also labels the word after the annotated part of the text

Hello, sorry for the other issue; I accidentally created it.

After converting to ConLL format I get mislabeled "I-entity" tags. For example: "Obama was the president" results in [B-Person, I-Person, O, O] when it should be [B-Person, O, O, O].

I looked around in the original JSONL data and found that this is due to the space after the entity being included in the tag. So instead of "Obama", the original tag is "Obama ", which results in the next word being included in the entity by this module. Is there any way to get around that without fixing the original JSONL data? I have a lot of data that is already tagged and I can't imagine relabelling each record one by one.

If this can't be addressed within your module, can you give me tips on how to edit the original JSONL file? I could maybe subtract 1 from the end offset of the annotated part of the text. Thanks!
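One pre-processing sketch along those lines (assuming the [start, end, label] format shown in the first issue above): shrink each span's end offset past any trailing whitespace.

import json

with open('annotations.jsonl', encoding='utf-8') as src, \
        open('annotations_trimmed.jsonl', 'w', encoding='utf-8') as dst:
    for line in src:
        record = json.loads(line)
        trimmed = []
        for start, end, label in record.get('labels', []):
            # Shrink the span while it ends on whitespace ("Obama " -> "Obama").
            while end > start and record['text'][end - 1].isspace():
                end -= 1
            trimmed.append([start, end, label])
        record['labels'] = trimmed
        dst.write(json.dumps(record, ensure_ascii=False) + '\n')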

Terrible Documentation

Feature description

Improve the documentation. How is it possible that I cannot find documentation explaining the different classes? The tool can be as good as you like, but I have to read the code directly to understand what features it has or how to save the datasets once they are transformed...

Doccano transformer - Output files CoNLL and spaCy format

Hello guys,

Thank you for your work on Doccano and Doccano Transformer.
It would be great to have Doccano Transformer incorporated into Doccano, especially for CoNLL 2003. : )

Or maybe provide more insight into the following code, with clear output files at the end...

from doccano_transformer.datasets import NERDataset
from doccano_transformer.utils import read_jsonl
dataset = read_jsonl(filepath='example.jsonl', dataset=NERDataset, encoding='utf-8')
dataset.to_conll2003(tokenizer=str.split)
dataset.to_spacy(tokenizer=str.split)

I get two generator objects... but no output files:
<generator object NERDataset.to_conll2003 at 0x000001D0AA7C8A50>
<generator object NERDataset.to_spacy at 0x000001D0AA7CA7A0>

If you want to write in Japanese with details, I could translate them into very simple English for the documentation. I am happy to help as a contributor. : )

Thank you in advance!!!

Akim

to_conll2003 function returns [] when the annotations array is blank

Trying with the following text string:
{"id": 2580, "text": "RUGBY UNION - CUTTITTA BACK FOR ITALY AFTER A YEAR .\nROME 1996-12-06\nItaly recalled Marcello Cuttitta", "annotations": []}

the output is an empty array; it should output 'O' for every token instead.
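A fallback sketch (my own workaround, not library behavior): tag every token 'O' when the annotation list is empty. The single "token O" column is a simplification of the full CoNLL 2003 layout.

import json

def to_all_o_conll(record, tokenizer=str.split):
    """Tag every token 'O' for a record with no annotations."""
    return '\n'.join(f'{token} O' for token in tokenizer(record['text']))

record = json.loads(
    '{"id": 2580, "text": "RUGBY UNION - CUTTITTA BACK FOR ITALY '
    'AFTER A YEAR .\\nROME 1996-12-06\\nItaly recalled Marcello Cuttitta", '
    '"annotations": []}'
)
print(to_all_o_conll(record))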

Tests in tests/test_datasets.py are broken?

How to reproduce the behaviour

  • install pytest, spacy
  • run tests

unittest: first attempt to run the tests

python3 -m unittest discover tests 
EEEE...
======================================================================
ERROR: test_from_labeling_jsonl_to_conll2003 (test_datasets.TestNERDataset)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/mjj/workspace/doccano-transformer/tests/test_datasets.py", line 30, in test_from_labeling_jsonl_to_conll2003
    src_path = self.shared_datadir / 'labeling.jsonl'
AttributeError: 'TestNERDataset' object has no attribute 'shared_datadir'

======================================================================
ERROR: test_from_labeling_jsonl_to_spacy (test_datasets.TestNERDataset)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/mjj/workspace/doccano-transformer/tests/test_datasets.py", line 57, in test_from_labeling_jsonl_to_spacy
    src_path = self.shared_datadir / 'labeling.jsonl'
AttributeError: 'TestNERDataset' object has no attribute 'shared_datadir'

======================================================================
ERROR: test_from_labeling_text_label_jsonl_to_conll2003 (test_datasets.TestNERDataset)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/mjj/workspace/doccano-transformer/tests/test_datasets.py", line 17, in test_from_labeling_text_label_jsonl_to_conll2003
    src_path = self.shared_datadir / 'labeling_text_label.jsonl'
AttributeError: 'TestNERDataset' object has no attribute 'shared_datadir'

======================================================================
ERROR: test_from_labeling_text_label_jsonl_to_spacy (test_datasets.TestNERDataset)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/mjj/workspace/doccano-transformer/tests/test_datasets.py", line 43, in test_from_labeling_text_label_jsonl_to_spacy
    src_path = self.shared_datadir / 'labeling_text_label.jsonl'
AttributeError: 'TestNERDataset' object has no attribute 'shared_datadir'

----------------------------------------------------------------------
Ran 7 tests in 0.001s

FAILED (errors=4)

pytest: second attempt to run the tests

pytest tests
=============================================================================================== test session starts ===============================================================================================
platform linux -- Python 3.8.2, pytest-6.0.1, py-1.9.0, pluggy-0.13.1
rootdir: /home/mjj/workspace/doccano-transformer
collected 7 items                                                                                                                                                                                                 

tests/test_datasets.py EEEE                                                                                                                                                                                 [ 57%]
tests/test_utils.py ...                                                                                                                                                                                     [100%]

===================================================================================================== ERRORS ======================================================================================================
_____________________________________________________________________ ERROR at setup of TestNERDataset.test_from_labeling_jsonl_to_conll2003 ______________________________________________________________________
file /home/mjj/workspace/doccano-transformer/tests/test_datasets.py, line 29
      def test_from_labeling_jsonl_to_conll2003(self):
file /home/mjj/workspace/doccano-transformer/tests/test_datasets.py, line 12
      @pytest.fixture(autouse=True)
      def initdir(self, shared_datadir):
E       fixture 'shared_datadir' not found
>       available fixtures: _UnitTestCase__pytest_class_setup, cache, capfd, capfdbinary, caplog, capsys, capsysbinary, doctest_namespace, initdir, monkeypatch, pytestconfig, record_property, record_testsuite_property, record_xml_attribute, recwarn, tmp_path, tmp_path_factory, tmpdir, tmpdir_factory
>       use 'pytest --fixtures [testpath]' for help on them.

/home/mjj/workspace/doccano-transformer/tests/test_datasets.py:12
_______________________________________________________________________ ERROR at setup of TestNERDataset.test_from_labeling_jsonl_to_spacy ________________________________________________________________________
file /home/mjj/workspace/doccano-transformer/tests/test_datasets.py, line 56
      def test_from_labeling_jsonl_to_spacy(self):
file /home/mjj/workspace/doccano-transformer/tests/test_datasets.py, line 12
      @pytest.fixture(autouse=True)
      def initdir(self, shared_datadir):
E       fixture 'shared_datadir' not found
>       available fixtures: _UnitTestCase__pytest_class_setup, cache, capfd, capfdbinary, caplog, capsys, capsysbinary, doctest_namespace, initdir, monkeypatch, pytestconfig, record_property, record_testsuite_property, record_xml_attribute, recwarn, tmp_path, tmp_path_factory, tmpdir, tmpdir_factory
>       use 'pytest --fixtures [testpath]' for help on them.

/home/mjj/workspace/doccano-transformer/tests/test_datasets.py:12
________________________________________________________________ ERROR at setup of TestNERDataset.test_from_labeling_text_label_jsonl_to_conll2003 ________________________________________________________________
file /home/mjj/workspace/doccano-transformer/tests/test_datasets.py, line 16
      def test_from_labeling_text_label_jsonl_to_conll2003(self):
file /home/mjj/workspace/doccano-transformer/tests/test_datasets.py, line 12
      @pytest.fixture(autouse=True)
      def initdir(self, shared_datadir):
E       fixture 'shared_datadir' not found
>       available fixtures: _UnitTestCase__pytest_class_setup, cache, capfd, capfdbinary, caplog, capsys, capsysbinary, doctest_namespace, initdir, monkeypatch, pytestconfig, record_property, record_testsuite_property, record_xml_attribute, recwarn, tmp_path, tmp_path_factory, tmpdir, tmpdir_factory
>       use 'pytest --fixtures [testpath]' for help on them.

/home/mjj/workspace/doccano-transformer/tests/test_datasets.py:12
__________________________________________________________________ ERROR at setup of TestNERDataset.test_from_labeling_text_label_jsonl_to_spacy __________________________________________________________________
file /home/mjj/workspace/doccano-transformer/tests/test_datasets.py, line 42
      def test_from_labeling_text_label_jsonl_to_spacy(self):
file /home/mjj/workspace/doccano-transformer/tests/test_datasets.py, line 12
      @pytest.fixture(autouse=True)
      def initdir(self, shared_datadir):
E       fixture 'shared_datadir' not found
>       available fixtures: _UnitTestCase__pytest_class_setup, cache, capfd, capfdbinary, caplog, capsys, capsysbinary, doctest_namespace, initdir, monkeypatch, pytestconfig, record_property, record_testsuite_property, record_xml_attribute, recwarn, tmp_path, tmp_path_factory, tmpdir, tmpdir_factory
>       use 'pytest --fixtures [testpath]' for help on them.

/home/mjj/workspace/doccano-transformer/tests/test_datasets.py:12
============================================================================================= short test summary info =============================================================================================
ERROR tests/test_datasets.py::TestNERDataset::test_from_labeling_jsonl_to_conll2003
ERROR tests/test_datasets.py::TestNERDataset::test_from_labeling_jsonl_to_spacy
ERROR tests/test_datasets.py::TestNERDataset::test_from_labeling_text_label_jsonl_to_conll2003
ERROR tests/test_datasets.py::TestNERDataset::test_from_labeling_text_label_jsonl_to_spacy
=========================================================================================== 3 passed, 4 errors in 0.33s ===========================================================================================

Your Environment

  • Operating System:
    Linux desktop 5.4.0-45-generic #49-Ubuntu SMP Wed Aug 26 13:38:52 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

  • Python Version Used:

Python 3.8.2 (default, Jul 16 2020, 14:00:26)
[GCC 9.3.0] on linux

  • doccano-transformer Version:
    master branch
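A likely cause (an assumption, not confirmed in the thread): the shared_datadir fixture comes from the pytest-datadir plugin, and pytest fixtures are not applied when the tests run under unittest discover. Installing the plugin and running pytest should provide the fixture:

pip install pytest pytest-datadir
pytest tests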

Not writing all entities in to_conll2003

How to reproduce the behaviour

I can't share the data because it's confidential, but some entities simply aren't written when using that function over documents!
