ink-usc / alpacatag

AlpacaTag: An Active Learning-based Crowd Annotation Framework for Sequence Tagging (ACL 2019 Demo)

Home Page: http://inklab.usc.edu/AlpacaTag/

Python 37.85% CSS 8.42% JavaScript 4.83% HTML 45.92% Jupyter Notebook 2.64% Shell 0.10% Dockerfile 0.24%

sequence-labeling crowdsourcing active-learning named-entity-recognition annotation-framework

alpacatag's Issues

Potential mistake in Active Learning (Acquisition.py)

I believe there is a bug in this line in acquisition.py (which is used to rank and fetch samples based on the confidence score of your model).

Let me explain:

Line 41 generates the batches that the model iterates over. Within each batch, the data is reordered by length, putting longer sequences (which have fewer padded tokens) at the top (as shown here).
In order to match scores -> samples, we need to extract the sorting info to undo this reordering. This is done a few lines down via sort_info = data['sort_info']. This returns a tuple describing the reordering from the step above. For example, a tuple of the form (3,0,2,1) tells us that the first element in this batch was in fact the 4th one in the original dataset, the second one was the first, and so on.
Finally, because we want the scores to be in the same order as before, the line I mentioned at the beginning runs probscores.extend(list(norm_scores[np.array(sort_info)])). The goal is to put the probability/confidence scores back into the original ordering rather than the length-based ordering used within each batch.

The issue is that (if I am not missing something obvious) norm_scores[np.array(sort_info)] is not what we want. Let me explain with the example below:

You ask the model to rank sentences=[["Hello", "World"], ["This","is","a","big","sentence"], ["Hello", "World", "."]].
These get reordered by length into ordered_sentences=[["This","is","a","big","sentence"], ["Hello", "World", "."], ["Hello", "World"]], giving us sort_info = (1,2,0).
The model then scores them. Let's say it gives us norm_scores = [0.1, 0.2, 0.3], meaning a score of 0.1 for ["This","is","a","big","sentence"], 0.2 for ["Hello", "World", "."] and 0.3 for ["Hello", "World"].
The current solution, list(norm_scores[np.array(sort_info)]), reorders this to [0.2, 0.3, 0.1]. In the original dataset, this means that we give a score of 0.2 to ["Hello", "World"], 0.3 to ["This","is","a","big","sentence"] and 0.1 to ["Hello", "World", "."], which does not match the scores above.

The root of this problem is that sort_info holds the indices (via argsort) that lead to the sorted array. It does not hold the indices required to undo the sort. In essence, what we need is the inverse permutation. One proposed solution is to use inverse_sorting = [sort_info.index(i) for i in range(len(sort_info))] instead, and then list(norm_scores[np.array(inverse_sorting)]). In the above example, inverse_sorting = [2, 0, 1], which in turn gives scores of [0.3, 0.1, 0.2], which is what we want in the original dataset (0.3 for ["Hello", "World"], 0.1 for ["This","is","a","big","sentence"] and 0.2 for ["Hello", "World", "."]).

I stumbled on this error by noticing that sentences that were exactly the same were given different confidence scores by the model (because of the mistake in undoing the reordering). Nevertheless, the example I gave above should suffice.
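The example above can be reproduced in a few lines. This is a minimal sketch, not the code from acquisition.py: the variable names follow the issue text, the descending-length argsort is an assumption about how sort_info is produced, and np.argsort(sort_info) is used as an equivalent of the list-index inverse proposed above.

```python
import numpy as np

sentences = [
    ["Hello", "World"],
    ["This", "is", "a", "big", "sentence"],
    ["Hello", "World", "."],
]

# Assumed batching step: order sentences by descending length.
lengths = np.array([len(s) for s in sentences])
sort_info = np.argsort(-lengths, kind="stable")  # indices into the original order

# Scores as produced by the model, in the length-sorted order.
norm_scores = np.array([0.1, 0.2, 0.3])

# Current behaviour: applies the forward permutation a second time.
wrong = norm_scores[sort_info]

# Proposed fix: index with the inverse permutation instead.
inverse_sorting = np.argsort(sort_info)
restored = norm_scores[inverse_sorting]

# Equivalent alternative: scatter each score back to its original position.
scattered = np.empty_like(norm_scores)
scattered[sort_info] = norm_scores

for sentence, score in zip(sentences, restored):
    print(score, sentence)
```

With the inverse permutation, identical sentences keep identical scores regardless of where they land inside a batch.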

onlinelearning

Hi,

I've followed the installation procedures (https://github.com/INK-USC/AlpacaTag/wiki/Installation) and everything went fine. However, when I turn on active learning, I get the following error:

[10/Oct/2019 09:02:33] "PATCH /api/projects/27/docs/2162 HTTP/1.1" 200 956
[10/Oct/2019 09:02:33] "GET /api/projects/27/progress/ HTTP/1.1" 200 25
[10/Oct/2019 09:02:33] "GET /api/projects/27/progress/ HTTP/1.1" 200 25
[10/Oct/2019 09:02:45] "GET /api/projects/27/progress/ HTTP/1.1" 200 25
Internal Server Error: /api/projects/27/onlinelearning/
Traceback (most recent call last):
  File "/home/adminuser/anaconda3/lib/python3.7/site-packages/django/core/handlers/exception.py", line 34, in inner
    response = get_response(request)
  File "/home/adminuser/anaconda3/lib/python3.7/site-packages/django/core/handlers/base.py", line 126, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "/home/adminuser/anaconda3/lib/python3.7/site-packages/django/core/handlers/base.py", line 124, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/home/adminuser/anaconda3/lib/python3.7/site-packages/django/views/decorators/csrf.py", line 54, in wrapped_view
    return view_func(*args, **kwargs)
  File "/home/adminuser/anaconda3/lib/python3.7/site-packages/django/views/generic/base.py", line 68, in view
    return self.dispatch(request, *args, **kwargs)
  File "/home/adminuser/anaconda3/lib/python3.7/site-packages/rest_framework/views.py", line 483, in dispatch
    response = self.handle_exception(exc)
  File "/home/adminuser/anaconda3/lib/python3.7/site-packages/rest_framework/views.py", line 443, in handle_exception
    self.raise_uncaught_exception(exc)
  File "/home/adminuser/anaconda3/lib/python3.7/site-packages/rest_framework/views.py", line 480, in dispatch
    response = handler(request, *args, **kwargs)
  File "/home/adminuser/alpaca/AlpacaTag/annotation/AlpacaTag/server/api.py", line 582, in post
    alpaca_online_learning(train_docs, annotations, setting_data['epoch'], setting_data['batch'])
  File "/home/adminuser/alpaca/AlpacaTag/annotation/AlpacaTag/server/api.py", line 69, in alpaca_online_learning
    response = alpaca_client.online_learning(train_docs, annotations, epoch, batch)
AttributeError: 'NoneType' object has no attribute 'online_learning'
[10/Oct/2019 09:02:45] "POST /api/projects/27/onlinelearning/ HTTP/1.1" 500 16777

Database migration failed

I just followed the installation steps. When I run this command:
python manage.py migrate
the program throws an exception:
django.core.exceptions.ImproperlyConfigured: Set the SECRET_KEY environment variable
This is defined in the setting.py file:
SECRET_KEY = env('SECRET_KEY')
I then randomly generated a string of text as the SECRET_KEY:
SECRET_KEY='2ap4_#)wk@5)3lsh6idzxwaouy6)*(5z#w(3atk0%a5!+-29j-
But when I run python manage.py migrate again, the program throws another exception:
django.core.exceptions.ImproperlyConfigured: Set the DATABASE_URL environment variable
I am sorry; I tried to learn the basics of Django, but I don't know anything about web development or databases, and even after looking at some of the Django basics I still don't know how to debug this error.
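Both error messages indicate that the settings read these values from environment variables rather than from hard-coded assignments. The following is a minimal sketch for a local development setup only. The helper name ensure_env is hypothetical, and the SQLite-style DATABASE_URL value is an assumption about what the project accepts, not something taken from the AlpacaTag documentation. Code like this would have to run before Django loads its settings.

```python
import os
import secrets


def ensure_env(defaults):
    """Set each variable only if it is not already defined in the environment."""
    for name, value in defaults.items():
        os.environ.setdefault(name, value)
    # Report the values that are actually in effect after the call.
    return {name: os.environ[name] for name in defaults}


# Assumed development defaults; replace both values for any real deployment.
dev_defaults = {
    "SECRET_KEY": secrets.token_urlsafe(50),   # random throwaway key
    "DATABASE_URL": "sqlite:///db.sqlite3",    # placeholder URL (assumption)
}

applied = ensure_env(dev_defaults)
print(sorted(applied))
```

Exporting the same two variables in the shell before running python manage.py migrate should have the same effect.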

Import already annotated data and handle CoNLL format

Hi,

First of all, thanks a lot for this amazing tool!

I wanted to know whether you are planning to support:

importing an already annotated dataset: this feature would be useful for fixing a dataset or adding a new label
exporting/importing a dataset in CoNLL format: basically a two-column, whitespace-separated file that looks like

-DOCSTART- O

EU B-ORG
rejects O
German B-MISC
call O
to O
boycott O
British B-MISC
lamb O
. O

-DOCSTART- O

Rare O
Hendrix B-PER
song O
draft O
sells O
for O
almost O
$ O
17,000 O
. O

This would be a perfect feature for quickly training/testing a model on an annotated dataset.
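To make the requested format concrete, here is a minimal sketch of reading such a file. The function name parse_conll is hypothetical and not part of AlpacaTag. It assumes every data line is "token tag" separated by whitespace, that blank lines separate sentences, and that -DOCSTART- lines only mark document boundaries.

```python
def parse_conll(text):
    """Return a list of sentences, each a list of (token, tag) pairs."""
    sentences, current = [], []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Blank lines and document markers both close the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        # The tag is the last whitespace-separated column.
        # A line with a single column raises ValueError here.
        token, tag = line.rsplit(maxsplit=1)
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences


sample = """-DOCSTART- O

EU B-ORG
rejects O

-DOCSTART- O

Rare O
Hendrix B-PER
"""

parsed = parse_conll(sample)
print(parsed)
```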

Thanks!

Modifying model into spaCy pre-trained models

Hey, I'm really impressed by this work and appreciate that it is built on the remarkable annotation tool doccano. I'm wondering whether we can replace the existing model with, for example, a pre-trained model from spaCy. Is there any way to do that?
Thank you.

Download Error

I don't know much about Django, so how can I solve this problem?

Integrate Active Model

enable/disable sentence orders: enable -> active learning; disable -> online learning

where is the Automatic crowd consolidation code

Thanks for sharing!
I would like to know where the automatic crowd consolidation code is.
I understand that each worker has a PyTorch model; how are they consolidated?
Looking forward to your reply!

AlpacaClient() instance is dead

Hi, after adding at least 10 entries I waited for the model to train, but it never did, so I ran the code in debug mode. I see that the alpaca_client instance times out without being initialized. Is this a known issue?

The error is generated specifically by this code:

def alpaca_online_learning(train_docs, annotations, epoch, batch):
    global alpaca_client
    response = alpaca_client.online_learning(train_docs, annotations, epoch, batch)
    if response == 'error':
        print('error')
        time.sleep(2)
        alpaca_client.online_learning(train_docs, annotations, epoch, batch)

alpaca_client is None, so the whole function cannot execute.

\n

Thanks for providing this software.
I'm trying it for NER with text containing a lot of newline characters. Unfortunately, these are no longer displayed.
I see doccano has resolved this in pull request doccano/doccano#654.
Could this be added? Newlines are really important in my case.

Annotation on Nested Entities and non-overlapping entities?

Server api logic about alpaca_* such as alpaca_online_learning alpaca_active_learning and so on

The code seems to contain many blocks like this:

response = alpaca_client.online_learning(train_docs, annotations, epoch, batch)
if response == 'error':
    print('error')
    time.sleep(2)
    alpaca_client.online_learning(train_docs, annotations, epoch, batch)

If the first call returns "error", it will most likely return "error" again after sleeping for 2 s,
which always yields a response with status code 500.
For example, some RNN decoding scores may be inf.
This can happen when the downstream tag fitting process is not stable (as in normal training of
some neural networks).
Users should be able to monitor this incremental training process. If the network fits badly, training should be
terminated before it goes the wrong way for too long.
Returning only status 500 while the frontend is blocked does not seem appropriate.

So I think a real-time validation interface should be added to the client.
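As a small step in that direction, the retry could at least be bounded and report its failure explicitly. This is a minimal sketch, not AlpacaTag's code: call_with_retries and its parameters are hypothetical, and the 'error' sentinel check is modeled on the snippet above.

```python
import time


def call_with_retries(operation, max_attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call operation() until it stops returning 'error', waiting longer each time."""
    for attempt in range(1, max_attempts + 1):
        response = operation()
        if response != 'error':
            return response
        if attempt < max_attempts:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    # Surface the failure so the caller can report it instead of hanging.
    raise RuntimeError(f"operation still returned 'error' after {max_attempts} attempts")


# Illustration with a fake operation and a fake sleep, so nothing actually waits.
attempts = []
waits = []


def flaky_operation():
    attempts.append(1)
    return 'error' if len(attempts) < 3 else 'ok'


print(call_with_retries(flaky_operation, sleep=waits.append), waits)
```

The caller can then catch the RuntimeError and return a message that the frontend can display, rather than a bare status 500.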

select-by-click according to given boundaries
floating buttons for fast typing
