
multiuser_prodigy

This is a multi-annotator setup for Prodigy, Explosion AI's data annotation tool. It uses a Mongo DB to allocate annotation tasks to annotators working on different Prodigy instances running on separate ports. This use case focuses on collecting gold-standard annotations from a team of annotators using Prodigy, rather than on the single-annotator, active-learning workflow that Prodigy is primarily designed for.

The repo includes a few example annotation interfaces, including code for annotators training an NER model or doing sentence classification with document context. Each annotator works on the Prodigy instance/port assigned to them, and a new DBStream class handles pulling the examples assigned to each worker into Prodigy.
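The core of a DBStream-style class is just a generator that filters tasks by annotator before handing them to Prodigy. A minimal sketch of that idea, assuming hypothetical field names (`coders`, `text`) rather than the repo's actual schema:

```python
# Hypothetical sketch of a DBStream-style generator: each Prodigy worker
# pulls only the tasks assigned to it. `coll` stands in for a Mongo
# collection; here it can be any iterable of task dicts.

def db_stream(coll, annotator):
    """Yield tasks assigned to the given annotator, in stored order."""
    for task in coll:
        if annotator in task.get("coders", []):
            # Prodigy streams expect dicts with at least a "text" key
            yield {"text": task["text"], "meta": {"_id": task.get("_id")}}
```

With a real Mongo collection, `coll` would be replaced by a `find()` cursor so the filtering happens server-side rather than in Python.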

I've used this setup for three major annotation projects now, but you'll need to modify the code to adapt it to your own project.

Mongo database

All tasks are stored in a Mongo DB, which allows flexible logic for assigning tasks to annotators. For instance, an example can go out to annotators until three annotations are collected, an example can go to two predetermined annotators from the wider pool, or an example can automatically be resubmitted to a third annotator if the first two annotations disagree.
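The "annotate until N annotations are collected" policy boils down to a single Mongo filter. A sketch of how that filter might be built, where the `seen` and `coders` field names are assumptions for illustration, not a documented schema:

```python
# Sketch of a Mongo find() filter for the next task to give a coder:
# tasks with fewer than N completed annotations that this coder hasn't
# already annotated. Field names are illustrative assumptions.

def next_task_filter(coder, max_annotations=3):
    """Build a find() filter dict for the next task for `coder`."""
    return {
        "seen": {"$lt": max_annotations},  # fewer than N annotations so far
        "coders": {"$ne": coder},          # this coder hasn't done it yet
    }

# With pymongo this would be used as, e.g.:
# task = collection.find_one(next_task_filter("andy"))
```

Swapping in a different filter here is all it takes to implement the other policies (fixed annotator pairs, tie-breaking by a third annotator, and so on).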

You can start a Mongo DB in a Docker container:

sudo docker run -d -p 127.0.0.1:27017:27017 -v /home/andy/MIT/multiuser_prodigy/db:/data/db  mongo

To load a list of tasks into the database:

python mongo_load.py -i assault_not_assault.jsonl -c "assault_gsr"

where -i is a JSONL file of tasks and -c specifies the collection name to load them into.
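Conceptually, the loader just parses one task per JSONL line and inserts the documents into the named collection. A minimal sketch of that step (the `seen`/`coders` bookkeeping fields are assumptions mirroring the task-assignment logic, not a documented schema):

```python
import json

# Minimal sketch of a JSONL task loader: parse one task per line and
# add assignment-tracking fields before insertion into Mongo.

def read_tasks(path):
    """Parse one task per JSONL line, adding bookkeeping fields."""
    tasks = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            task = json.loads(line)
            task.setdefault("seen", 0)     # completed annotation count
            task.setdefault("coders", [])  # annotators who have done it
            tasks.append(task)
    return tasks

# With pymongo the insert would then be, e.g.:
# MongoClient().prodigy["assault_gsr"].insert_many(read_tasks("tasks.jsonl"))
```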

"seen" : {"$in" : [0,1]}}, {"coders"

Running

You'll need to modify the code of multiuser_db.py to point at the right collection, set the annotators' names and ports, and choose the desired interface (NER, classification, etc.).

Then launch the processes, either inside a screen session or in the background (e.g. with nohup):

python multiuser_db.py

Analysis

You can use Streamlit to set up a dashboard so annotators can check their progress. This one pulls results from the Mongo DB, but you could also query the Prodigy DB and show results from there.
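The core of such a progress dashboard is just counting completed annotations per coder. A sketch of that aggregation, where the `coder` field name is an assumption; Streamlit would then render the result (e.g. with `st.bar_chart`):

```python
from collections import Counter

# Sketch of the dashboard's core aggregation: count how many annotations
# each coder has submitted. `annotations` is any iterable of dicts with
# a "coder" field (an assumed field name for illustration).

def progress_by_coder(annotations):
    """Return a Counter mapping coder name -> annotations submitted."""
    return Counter(a["coder"] for a in annotations)
```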

A more complicated analysis dashboard setup is in Report.Rmd. This RMarkdown file reads a CSV of coding information and generates figures in an HTML page that can be served from the annotation server. To record how long each task takes, add something like eg['time_loaded'] = datetime.now().isoformat() to your stream code and eg['time_returned'] = datetime.now().isoformat() to your update code. report_maker.py exports the DB to CSV and knits the RMarkdown against that CSV.
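The timing instrumentation described above can be sketched as a pair of small helpers plus a duration calculation for the report (function names here are illustrative, not from the repo):

```python
from datetime import datetime

# Sketch of the timing instrumentation: stamp each example when it is
# sent out and again when it comes back, so per-task duration can be
# computed later for the report.

def mark_loaded(eg):
    """Call from the stream code when an example is sent out."""
    eg["time_loaded"] = datetime.now().isoformat()
    return eg

def mark_returned(eg):
    """Call from the update code when the annotation comes back."""
    eg["time_returned"] = datetime.now().isoformat()
    return eg

def task_seconds(eg):
    """Elapsed seconds between send-out and return."""
    start = datetime.fromisoformat(eg["time_loaded"])
    end = datetime.fromisoformat(eg["time_returned"])
    return (end - start).total_seconds()
```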

multiuser_prodigy's People

Contributors

ahalterman

multiuser_prodigy's Issues

Don't hard code processes

Make a function to take a list of tags/processes and have it iterate through to make the processes.

Figure out DB backup

The DB should be backed up at least once per day. Figure out the best way to export, compress, and upload it.

Add custom exclude logic for ner.manual

We want each annotator to be able to see a document (since they each do different tags), but we don't want the same annotator to see the same document twice. This is handled fine with --exclude for ner.teach, but not for ner.manual.

Subset big text into daily chunks

Write code to pull daily chunks off the large source text and format them as JSONL for each day's tasks.

Should it run the model over a bunch and pull out the uncertain ones like for the manifesto project?

Does --exclude prevent the example from being seen again, or the example + label combination?

Make into object

Enough with the passing stuff around. Make it an object so we can keep track of processes.
