
doccano / doccano-transformer


The official tool for transforming doccano format into common dataset formats.

License: MIT License

Language: Python (100%)
Topics: conll, annotation, dataset, machine-learning, natural-language-processing, doccano

doccano-transformer's Introduction

doccano


doccano is an open-source text annotation tool for humans. It provides annotation features for text classification, sequence labeling, and sequence-to-sequence tasks. You can create labeled data for sentiment analysis, named entity recognition, text summarization, and so on. Just create a project, upload data, and start annotating. You can build a dataset in hours.

Demo

Try the annotation demo.

Demo image

Documentation

Read the documentation at https://doccano.github.io/doccano/.

Features

  • Collaborative annotation
  • Multi-language support
  • Mobile support
  • Emoji 😄 support
  • Dark theme
  • RESTful API

Usage

There are three options to run doccano:

  • pip (Python 3.8+)
  • Docker
  • Docker Compose

pip

To install doccano, run:

pip install doccano

By default, SQLite 3 is used as the database. If you want to use PostgreSQL instead, install the additional dependencies:

pip install 'doccano[postgresql]'

and set the DATABASE_URL environment variable according to your PostgreSQL credentials:

DATABASE_URL="postgres://${POSTGRES_USER}:${POSTGRES_PASSWORD}@${POSTGRES_HOST}:${POSTGRES_PORT}/${POSTGRES_DB}?sslmode=disable"

After installation, run the following commands:

# Initialize database.
doccano init
# Create a super user.
doccano createuser --username admin --password pass
# Start a web server.
doccano webserver --port 8000

In another terminal, run the command:

# Start the task queue to handle file upload/download.
doccano task

Go to http://127.0.0.1:8000/.

Docker

As a one-time setup, create a Docker container as follows:

docker pull doccano/doccano
docker container create --name doccano \
  -e "ADMIN_USERNAME=admin" \
  -e "[email protected]" \
  -e "ADMIN_PASSWORD=password" \
  -v doccano-db:/data \
  -p 8000:8000 doccano/doccano

Next, start doccano by running the container:

docker container start doccano

Go to http://127.0.0.1:8000/.

To stop the container, run docker container stop doccano -t 5. All data created in the container will persist across restarts.

If you want to use the latest features, specify the nightly tag:

docker pull doccano/doccano:nightly

Docker Compose

You need to install Git and clone the repository:

git clone https://github.com/doccano/doccano.git
cd doccano

Note for Windows developers: be sure to configure Git to handle line endings correctly, or you may encounter status code 127 errors while running the services in later steps. Cloning with the config option below ensures your working copy handles line endings correctly.

git clone https://github.com/doccano/doccano.git --config core.autocrlf=input

Then, create a .env file with variables in the following format (see ./docker/.env.example):

# platform settings
ADMIN_USERNAME=admin
ADMIN_PASSWORD=password
[email protected]

# rabbit mq settings
RABBITMQ_DEFAULT_USER=doccano
RABBITMQ_DEFAULT_PASS=doccano

# database settings
POSTGRES_USER=doccano
POSTGRES_PASSWORD=doccano
POSTGRES_DB=doccano

Run the following command, then access http://127.0.0.1/.

docker-compose -f docker/docker-compose.prod.yml --env-file .env up

One-click Deployment

  • AWS: CloudFormation "Launch Stack" button (see footnote 1)
  • Heroku: "Deploy" button

FAQ

See the documentation for details.

Contribution

As with any software, doccano is under continuous development. If you have requests for features, please file an issue describing your request. Also, if you want to see work toward a specific feature, feel free to contribute by working toward it. The standard procedure is to fork the repository, add a feature or fix a bug, and then file a pull request so that your changes can be merged into the main repository and included in the next release.

Here are some tips that might be helpful: How to Contribute to Doccano Project

Citation

@misc{doccano,
  title={{doccano}: Text Annotation Tool for Human},
  url={https://github.com/doccano/doccano},
  note={Software available from https://github.com/doccano/doccano},
  author={
    Hiroki Nakayama and
    Takahiro Kubo and
    Junya Kamura and
    Yasufumi Taniguchi and
    Xu Liang},
  year={2018},
}

Contact

For help and feedback, feel free to contact the author.

Footnotes

  1. (1) An EC2 KeyPair cannot be created automatically, so make sure you have an existing EC2 KeyPair in one region, or create one yourself. (2) If you want to access doccano via HTTPS on AWS, see these instructions.

doccano-transformer's People

Contributors

codacy-badger, dependabot[bot], hironsan, yasufumy


doccano-transformer's Issues


to_conll2003 puts the "B-" tag in the wrong place

How to reproduce the behaviour

{"id": 19523, "text": "\"颜姑娘。\"易左古不懂颜幼韶所道万福是什么意思。", "meta": {}, "annotation_approver": null, "labels": [[6, 9, "SPEAKER"]]}

When converting with to_conll2003, the B-SPEAKER tag does not correspond to 易; it is placed one position earlier, on the " character.

Suggested fix: change the comparison in the create_bio_tags function in utils.py as follows:

#if i >= n or token_end < labels[i][0]:
if i >= n or token_end <= labels[i][0]:

Your Environment

  • Operating System: MacOS 11.0.1
  • Python Version Used: 3.7.3
  • doccano-transformer Version: 1.0.2

Package is not available on pip

I am not able to pip install doccano-transformer; the package is not on PyPI.
pip search doccano-transformer does not return the package.

Bug in Sentence Offset

Hello, I want to report a bug. I have a sentence like this:

As a administrator, I want to refund sponsorship money that was processed via stripe, so that people get their monies back.

When I try to convert it to CoNLL, the span is not converted correctly. I debugged the library and found that the offsets are wrong. Here is the offset output:

As 0
a 3
administrator, 3
I 20
want 22
to 25
refund 30
sponsorship 37
money 49
that 55
was 60
processed 64
via 74
stripe, 78
so 86
that 89
people 94
get 101
their 103
monies 111
back. 118

As you can see, the second and third lines have the same offset (3 and 3, while it should be 3 and 5). This behavior makes the span undetected in the conversion process.

It seems that the get_offsets function in utils.py decides the offsets by checking character equality in sequence.

def get_offsets(
        text: str,
        tokens: List[str],
        start: Optional[int] = 0) -> List[int]:
    """Calculate char offsets of each tokens.

    Args:
        text (str): The string before tokenized.
        tokens (List[str]): The list of the string. Each string corresponds
            token.
        start (Optional[int]): The start position.
    Returns:
        (List[str]): The list of the offset.
    """
    offsets = []
    i = 0
    for token in tokens:
        for j, char in enumerate(token):
            while char != text[i]:
                i += 1

            if j == 0:
                offsets.append(i + start)
    return offsets

This is a problem whenever the first character of a token is the same as the last character of the previous token, because the index is never advanced past a matched character. I'm still looking for a fix to this problem.
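A possible fix (my own sketch, not a confirmed patch): advance the position index past each matched character, so a token can never re-match the position consumed by the previous token.

from typing import List, Optional

def get_offsets(text: str, tokens: List[str], start: Optional[int] = 0) -> List[int]:
    """Calculate the character offset of each token (sketch of a possible fix)."""
    offsets = []
    i = 0
    for token in tokens:
        for j, char in enumerate(token):
            # Skip characters (e.g. whitespace) until the next match.
            while char != text[i]:
                i += 1
            if j == 0:
                offsets.append(i + start)
            # Advance past the matched character so the next token
            # cannot re-match the same position.
            i += 1
    return offsets

With this change, the sentence above yields offsets 0, 3, 5, ... as expected.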

Cheers

No transformer for RELATION EXTRACTION

Hi,
I was trying to convert relation-annotated data to spaCy binary format, but doccano-transformer supports only NERDataset, not a relation-extraction dataset.

Is there any way to convert the JSONL relation-annotation export to spaCy binary?
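There is no built-in converter, but here is a minimal reader sketch. The field names (entities, relations, from_id, to_id, start_offset, end_offset) are assumptions taken from the relation export shown in the "Token level output" issue below, and producing an actual spaCy binary would still require a custom pipeline.

import json

with open('relations.jsonl', encoding='utf-8') as f:  # hypothetical file name
    for line in f:
        record = json.loads(line)
        entities = {e['id']: e for e in record['entities']}
        for rel in record['relations']:
            head = entities.get(rel['from_id'])
            tail = entities.get(rel['to_id'])
            if head is None or tail is None:
                continue  # Skip relations that point at missing entities.
            head_text = record['text'][head['start_offset']:head['end_offset']]
            tail_text = record['text'][tail['start_offset']:tail['end_offset']]
            print(head_text, '--[', rel['type'], ']->', tail_text)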

Token level output

Hi, my question is related to this one. Is this feature already supported?

I'm using doccano to annotate my files and exporting them in .jsonl format. As an output I get something like this:

{"id":1,"text":"...","entities":[{"id":123,"label":"Invoice Number Token","start_offset":216,"end_offset":226}],"relations":[{"id":6,"from_id": 123,"to_id": 125,"type": "Invoice Number Relation"}]} {"id":2,"text":"...","entities":[{"id":123,"label":"Invoice Number Token","start_offset":216,"end_offset":226}],"relations":[{"id":6,"from_id": 123,"to_id": 125,"type": "Invoice Number Relation"}]}

My code looks like this:

from doccano_transformer.datasets import NERDataset
from doccano_transformer.utils import read_jsonl
with open('trensformed.txt', "w", encoding = "utf-8") as file:
    for entry in read_jsonl(filepath=r'admin.jsonl', dataset=NERDataset, encoding='latin-1').to_conll2003(tokenizer=str.split):
        file.write(entry["data"] + "\n")

I'm getting this error: KeyError: 'The file should includes either "labels" or "annotations".' What changes do I need to make to the doccano output file in order to achieve the desired result?

  • Operating System: Windows 11
  • Python Version Used: 3.10.4
  • doccano-transformer Version: 1.0.2
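A workaround sketch for the error above (an assumption, not an official fix): the newer doccano export uses an entities key, while doccano-transformer expects the older labels format of [start, end, label] triples (as in the JSONL shown in the first issue above). Converting the file first should avoid the KeyError.

import json

# Convert the "entities" export format into the older "labels" format.
with open('admin.jsonl', encoding='utf-8') as src, \
        open('admin_labels.jsonl', 'w', encoding='utf-8') as dst:
    for line in src:
        record = json.loads(line)
        record['labels'] = [
            [e['start_offset'], e['end_offset'], e['label']]
            for e in record.pop('entities', [])
        ]
        record.pop('relations', None)  # Relations are not supported by NERDataset.
        dst.write(json.dumps(record, ensure_ascii=False) + '\n')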

How to save to a file

Thanks for your work.

I want to use the doccano-annotated data with Spark NLP, so I need to convert it to CoNLL format, but I haven't found out how to save the result to a file.

How should I do this?

All suggestions and ideas are welcome.
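Since to_conll2003 returns a generator, one way to save its output to a file (a sketch; the entry['data'] key is taken from the snippet in the "Token level output" issue above):

from doccano_transformer.datasets import NERDataset
from doccano_transformer.utils import read_jsonl

dataset = read_jsonl(filepath='example.jsonl', dataset=NERDataset, encoding='utf-8')
with open('example.conll', 'w', encoding='utf-8') as f:
    for entry in dataset.to_conll2003(tokenizer=str.split):
        # Each yielded entry's 'data' field holds the CoNLL-formatted text.
        f.write(entry['data'] + '\n')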

Not compatible with spacy 3.x

When running the sample code I get the following error:

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-82-777c977a1d80> in <module>
      1 import doccano_transformer
      2 
----> 3 from doccano_transformer.datasets import NERDataset
      4 from doccano_transformer.utils import read_jsonl
      5 

~/miniconda3/envs/base/lib/python3.7/site-packages/doccano_transformer/datasets.py in <module>
      3 from typing import Any, Callable, Iterable, Iterator, List, Optional, TextIO
      4 
----> 5 from doccano_transformer.examples import Example, NERExample
      6 
      7 

~/miniconda3/envs/base/lib/python3.7/site-packages/doccano_transformer/examples.py in <module>
      2 from typing import Callable, Iterator, List, Optional
      3 
----> 4 from spacy.gold import biluo_tags_from_offsets
      5 
      6 from doccano_transformer import utils

ModuleNotFoundError: No module named 'spacy.gold'

It seems this was removed in spaCy v3.x: https://github.com/explosion/spaCy/releases
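One possible patch (a sketch, based on the rename described in the "No module named spacy.gold" issue below): make the import in doccano_transformer/examples.py fall back to the spaCy 3.x location, where the function was renamed to offsets_to_biluo_tags. Alternatively, pin spaCy below 3.0 with pip install 'spacy<3.0'.

try:
    from spacy.gold import biluo_tags_from_offsets  # spaCy 2.x
except ModuleNotFoundError:
    # spaCy 3.x moved the function to spacy.training and renamed it.
    from spacy.training import offsets_to_biluo_tags as biluo_tags_from_offsets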

fastText format for text classification

Example:

__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?
__label__food-safety __label__acidity Dangerous pathogens capable of growing in acidic environments
__label__cast-iron __label__stove How do I cover up the white spots on my cast iron stove?
__label__restaurant Michelin Three Star Restaurant; but if the chef is not there
__label__knife-skills __label__dicing Without knife skills, how can I quickly and accurately dice vegetables?
__label__storage-method __label__equipment __label__bread What’s the purpose of a bread box?
__label__baking __label__food-safety __label__substitutions __label__peanuts how to seperate peanut oil from roasted peanuts at home?
__label__chocolate American equivalent for British chocolate terms
__label__baking __label__oven __label__convection Fan bake vs bake
__label__sauce __label__storage-lifetime __label__acidity __label__mayonnaise Regulation and balancing of readymade packed mayonnaise and other sauces
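A converter sketch (my own; the input format, a JSONL with a "text" field and a "labels" list of strings, is an assumption about doccano's text-classification export):

import json

# Convert doccano classification JSONL to the fastText format shown above.
with open('classification.jsonl', encoding='utf-8') as src, \
        open('train.ft.txt', 'w', encoding='utf-8') as dst:
    for line in src:
        record = json.loads(line)
        prefix = ' '.join(f'__label__{label}' for label in record['labels'])
        text = record['text'].replace('\n', ' ')
        dst.write(f'{prefix} {text}\n')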

No module named spacy.gold

Environment

  • Operating System: Windows
  • Python Version Used: 3.9.13
  • Spacy Version Used: 3.4
  • doccano-transformer Version: module 'doccano_transformer' has no attribute 'version'

When importing NERDataset with from doccano_transformer.datasets import NERDataset, I receive an error:

ModuleNotFoundError                       Traceback (most recent call last)
Cell In [8], line 1
----> 1 from doccano_transformer.datasets import NERDataset
      2 from doccano_transformer.utils import read_jsonl

File c:\Users\yana.stamenova\work-data\RnD_Models\venv_new\lib\site-packages\doccano_transformer\datasets.py:5
      2 import json
      3 from typing import Any, Callable, Iterable, Iterator, List, Optional, TextIO
----> 5 from doccano_transformer.examples import Example, NERExample
      8 class Dataset:
      9     def __init__(
     10         self,
     11         filepath: str,
     12         encoding: Optional[str] = 'utf-8',
     13         transformation_func: Optional[Callable[[TextIO], Iterable[Any]]] = None
     14     ) -> None:

File c:\Users\yana.stamenova\work-data\RnD_Models\venv_new\lib\site-packages\doccano_transformer\examples.py:4
      1 from collections import defaultdict
      2 from typing import Callable, Iterator, List, Optional
----> 4 from spacy.gold import biluo_tags_from_offsets
      6 from doccano_transformer import utils
      9 class Example:

ModuleNotFoundError: No module named 'spacy.gold'

The gold module is part of spaCy 2.x; in spaCy 3.0 it was renamed to training.

I have proposed an edit to examples.py

detailed documentation

Hi there, I tried using this library but I couldn't figure out how to use it. Could you provide detailed documentation?

CoNLL format also labels the word after the annotated part of the text

Hello, sorry for the other issue; I accidentally created it.

After converting to ConLL format I get mislabeled "I-entity" tags. For example: "Obama was the president" results in [B-Person, I-Person, O, O] when it should be [B-Person, O, O, O].

I looked around in the original JSONL data and found that this is due to the space after the entity being included in the tag. So instead of "Obama", the original tag is "Obama ", which results in the next word being included in the entity by this module. Is there any way to get around that without fixing the original JSONL data? I have a lot of data that is already tagged and I can't imagine relabelling each record one by one.

If this can't be addressed within your module, can you give me tips on how to edit the original JSONL file? I could maybe subtract 1 from the end offset of the annotated part of the text. Thanks!
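One pre-processing sketch along those lines (assuming the [start, end, label] format shown in the first issue above): shrink each span's end offset past any trailing whitespace.

import json

with open('annotations.jsonl', encoding='utf-8') as src, \
        open('annotations_trimmed.jsonl', 'w', encoding='utf-8') as dst:
    for line in src:
        record = json.loads(line)
        trimmed = []
        for start, end, label in record.get('labels', []):
            # Shrink the span while it ends on whitespace ("Obama " -> "Obama").
            while end > start and record['text'][end - 1].isspace():
                end -= 1
            trimmed.append([start, end, label])
        record['labels'] = trimmed
        dst.write(json.dumps(record, ensure_ascii=False) + '\n')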

Terrible Documentation

Feature description

Improve the documentation. How is it possible that I cannot find documentation explaining the different classes? The tool can be as good as you like, but I have to read the code directly to understand what features it has or how to save the datasets once they are transformed...

Doccano transformer - Output files CoNLL and spaCy format

Hello guys,

Thank you for your work on Doccano and Doccano Transformer.
It would be great to have Doccano Transformer incorporated into Doccano, especially for CoNLL 2003. : )

Or maybe provide more insight into the following code, with clear output files at the end...

from doccano_transformer.datasets import NERDataset
from doccano_transformer.utils import read_jsonl
dataset = read_jsonl(filepath='example.jsonl', dataset=NERDataset, encoding='utf-8')
dataset.to_conll2003(tokenizer=str.split)
dataset.to_spacy(tokenizer=str.split)

I get two generator objects... but no output files:
<generator object NERDataset.to_conll2003 at 0x000001D0AA7C8A50>
<generator object NERDataset.to_spacy at 0x000001D0AA7CA7A0>

If you want to write in Japanese with details, I could translate them into very simple English for the documentation. I am happy to help as a contributor. : )

Thank you in advance!!!

Akim

to_conll2003 function returns [] when the annotations array is blank

Trying with the following text string:
{"id": 2580, "text": "RUGBY UNION - CUTTITTA BACK FOR ITALY AFTER A YEAR .\nROME 1996-12-06\nItaly recalled Marcello Cuttitta", "annotations": []}

the output is an empty array; it should output 'O' for every token instead.
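A fallback sketch (my own workaround, not library behavior): tag every token 'O' when the annotation list is empty. The single "token O" column is a simplification of the full CoNLL 2003 layout.

import json

def to_all_o_conll(record, tokenizer=str.split):
    """Tag every token 'O' for a record with no annotations."""
    return '\n'.join(f'{token} O' for token in tokenizer(record['text']))

record = json.loads(
    '{"id": 2580, "text": "RUGBY UNION - CUTTITTA BACK FOR ITALY '
    'AFTER A YEAR .\\nROME 1996-12-06\\nItaly recalled Marcello Cuttitta", '
    '"annotations": []}'
)
print(to_all_o_conll(record))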

Tests in tests/test_datasets.py are broken?

How to reproduce the behaviour

  • install pytest, spacy
  • run tests

unittest: first attempt to run the tests

python3 -m unittest discover tests 
EEEE...
======================================================================
ERROR: test_from_labeling_jsonl_to_conll2003 (test_datasets.TestNERDataset)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/mjj/workspace/doccano-transformer/tests/test_datasets.py", line 30, in test_from_labeling_jsonl_to_conll2003
    src_path = self.shared_datadir / 'labeling.jsonl'
AttributeError: 'TestNERDataset' object has no attribute 'shared_datadir'

======================================================================
ERROR: test_from_labeling_jsonl_to_spacy (test_datasets.TestNERDataset)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/mjj/workspace/doccano-transformer/tests/test_datasets.py", line 57, in test_from_labeling_jsonl_to_spacy
    src_path = self.shared_datadir / 'labeling.jsonl'
AttributeError: 'TestNERDataset' object has no attribute 'shared_datadir'

======================================================================
ERROR: test_from_labeling_text_label_jsonl_to_conll2003 (test_datasets.TestNERDataset)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/mjj/workspace/doccano-transformer/tests/test_datasets.py", line 17, in test_from_labeling_text_label_jsonl_to_conll2003
    src_path = self.shared_datadir / 'labeling_text_label.jsonl'
AttributeError: 'TestNERDataset' object has no attribute 'shared_datadir'

======================================================================
ERROR: test_from_labeling_text_label_jsonl_to_spacy (test_datasets.TestNERDataset)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/mjj/workspace/doccano-transformer/tests/test_datasets.py", line 43, in test_from_labeling_text_label_jsonl_to_spacy
    src_path = self.shared_datadir / 'labeling_text_label.jsonl'
AttributeError: 'TestNERDataset' object has no attribute 'shared_datadir'

----------------------------------------------------------------------
Ran 7 tests in 0.001s

FAILED (errors=4)

pytest: second attempt to run the tests

pytest tests
=============================================================================================== test session starts ===============================================================================================
platform linux -- Python 3.8.2, pytest-6.0.1, py-1.9.0, pluggy-0.13.1
rootdir: /home/mjj/workspace/doccano-transformer
collected 7 items                                                                                                                                                                                                 

tests/test_datasets.py EEEE                                                                                                                                                                                 [ 57%]
tests/test_utils.py ...                                                                                                                                                                                     [100%]

===================================================================================================== ERRORS ======================================================================================================
_____________________________________________________________________ ERROR at setup of TestNERDataset.test_from_labeling_jsonl_to_conll2003 ______________________________________________________________________
file /home/mjj/workspace/doccano-transformer/tests/test_datasets.py, line 29
      def test_from_labeling_jsonl_to_conll2003(self):
file /home/mjj/workspace/doccano-transformer/tests/test_datasets.py, line 12
      @pytest.fixture(autouse=True)
      def initdir(self, shared_datadir):
E       fixture 'shared_datadir' not found
>       available fixtures: _UnitTestCase__pytest_class_setup, cache, capfd, capfdbinary, caplog, capsys, capsysbinary, doctest_namespace, initdir, monkeypatch, pytestconfig, record_property, record_testsuite_property, record_xml_attribute, recwarn, tmp_path, tmp_path_factory, tmpdir, tmpdir_factory
>       use 'pytest --fixtures [testpath]' for help on them.

/home/mjj/workspace/doccano-transformer/tests/test_datasets.py:12
_______________________________________________________________________ ERROR at setup of TestNERDataset.test_from_labeling_jsonl_to_spacy ________________________________________________________________________
file /home/mjj/workspace/doccano-transformer/tests/test_datasets.py, line 56
      def test_from_labeling_jsonl_to_spacy(self):
file /home/mjj/workspace/doccano-transformer/tests/test_datasets.py, line 12
      @pytest.fixture(autouse=True)
      def initdir(self, shared_datadir):
E       fixture 'shared_datadir' not found
>       available fixtures: _UnitTestCase__pytest_class_setup, cache, capfd, capfdbinary, caplog, capsys, capsysbinary, doctest_namespace, initdir, monkeypatch, pytestconfig, record_property, record_testsuite_property, record_xml_attribute, recwarn, tmp_path, tmp_path_factory, tmpdir, tmpdir_factory
>       use 'pytest --fixtures [testpath]' for help on them.

/home/mjj/workspace/doccano-transformer/tests/test_datasets.py:12
________________________________________________________________ ERROR at setup of TestNERDataset.test_from_labeling_text_label_jsonl_to_conll2003 ________________________________________________________________
file /home/mjj/workspace/doccano-transformer/tests/test_datasets.py, line 16
      def test_from_labeling_text_label_jsonl_to_conll2003(self):
file /home/mjj/workspace/doccano-transformer/tests/test_datasets.py, line 12
      @pytest.fixture(autouse=True)
      def initdir(self, shared_datadir):
E       fixture 'shared_datadir' not found
>       available fixtures: _UnitTestCase__pytest_class_setup, cache, capfd, capfdbinary, caplog, capsys, capsysbinary, doctest_namespace, initdir, monkeypatch, pytestconfig, record_property, record_testsuite_property, record_xml_attribute, recwarn, tmp_path, tmp_path_factory, tmpdir, tmpdir_factory
>       use 'pytest --fixtures [testpath]' for help on them.

/home/mjj/workspace/doccano-transformer/tests/test_datasets.py:12
__________________________________________________________________ ERROR at setup of TestNERDataset.test_from_labeling_text_label_jsonl_to_spacy __________________________________________________________________
file /home/mjj/workspace/doccano-transformer/tests/test_datasets.py, line 42
      def test_from_labeling_text_label_jsonl_to_spacy(self):
file /home/mjj/workspace/doccano-transformer/tests/test_datasets.py, line 12
      @pytest.fixture(autouse=True)
      def initdir(self, shared_datadir):
E       fixture 'shared_datadir' not found
>       available fixtures: _UnitTestCase__pytest_class_setup, cache, capfd, capfdbinary, caplog, capsys, capsysbinary, doctest_namespace, initdir, monkeypatch, pytestconfig, record_property, record_testsuite_property, record_xml_attribute, recwarn, tmp_path, tmp_path_factory, tmpdir, tmpdir_factory
>       use 'pytest --fixtures [testpath]' for help on them.

/home/mjj/workspace/doccano-transformer/tests/test_datasets.py:12
============================================================================================= short test summary info =============================================================================================
ERROR tests/test_datasets.py::TestNERDataset::test_from_labeling_jsonl_to_conll2003
ERROR tests/test_datasets.py::TestNERDataset::test_from_labeling_jsonl_to_spacy
ERROR tests/test_datasets.py::TestNERDataset::test_from_labeling_text_label_jsonl_to_conll2003
ERROR tests/test_datasets.py::TestNERDataset::test_from_labeling_text_label_jsonl_to_spacy
=========================================================================================== 3 passed, 4 errors in 0.33s ===========================================================================================

Your Environment

  • Operating System:
    Linux desktop 5.4.0-45-generic #49-Ubuntu SMP Wed Aug 26 13:38:52 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

  • Python Version Used:

Python 3.8.2 (default, Jul 16 2020, 14:00:26)
[GCC 9.3.0] on linux

  • doccano-transformer Version:
    master branch
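A likely cause (an assumption, not confirmed in the thread): the shared_datadir fixture comes from the pytest-datadir plugin, and pytest fixtures are not applied when the tests run under unittest discover. Installing the plugin and running pytest should provide the fixture:

pip install pytest pytest-datadir
pytest tests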

Not writing all entities in to_conll2003

How to reproduce the behaviour

I can't share the data because it's confidential, but some entities simply aren't written when using that function over documents!
