GithubHelp home page GithubHelp logo

isabella232 / events-in-text Goto Github PK

View Code? Open in Web Editor NEW

This project forked from crowdtruth/events-in-text

0.0 0.0 0.0 12.27 MB

Text Events Detection using crowd-sourced annotation data

Python 100.00%

events-in-text's Introduction

Events-in-Text

#Harnessing disagreement in event text classification using CrowdTruth annotation

Gathering new expert annotated data for any machine learning task is a costly and timely approach. Using crowd-sourcing for gathering human interpretation on text, video and audio is a cost and time efficient alternative. There are many crowd-sourcing platforms available, for instance CrowdFlower or Amazon Mechanical Turk. However while experts have specific domain knowledge and incentive to annotate following a strict set of rules, this does not apply to the human annotators or Workers on the crowd-sourcing platforms. In "CrowdTruth methodology, a novel approach for gathering annotated data from the crowd", the authors state that human interpretation is subjective and thus having multiple people performing the same task on a crowd-sourcing platform will lead to multiple interpretations of this task. A common way of dealing with this problem is by using Majority Voting, take the annotation that has highest frequency and disregard all other annotations. However this removes the disagreement which is a signal, not noise. In this paper we introduce a novel way to incorporate the disagreement in the annotators as a way to weigh the features for textual data. We propose multiple novel algorithms to create the crowd-based distribution weights and discuss the pros and cons of each one. We test our weighted features with a modified Multinominal Naive Bayes algorithm. We show that using crowd based distribution weights we are able to perform as well and better than expert-labeled data while learning on only 10% of the training data. Furthermore our approach also has a much faster learning rate.

Annotation task

Fig.1: Task. Fig.2: Task. Fig.2: Task.

Task Workers/Unit Payment/Unit
Task 15 $0.05

Dataset files

Text used in this research orginate from the tempeval2 where all sentences have been annotated with an event and event type by a panel of experts.

This repository contains all the data and code that was used to annotate the text as all features and labels used during training and testing of the classifiers. We distinguish two different folders 'Code' and 'Data'. The folder 'Code' contains all python files used to compute the Sentence-Clarity, Sentence-Event- Clarity, TFIDF scores and to create the features and classifiers. All required jars have been provided. The 'Data' folder contains the input and output files from all crowdsourcing tasks aswell as the output files produced by the python files from 'Code'.

The data and results produced by this research have been achieved by following the following steps:

  1. Gather annotations crowdsourcing task 1 using the input file [https://raw.githubusercontent.com/CrowdTruth/Events-in-Text/master/Data/CrowdTask/sample_only_sentences_0.csv), employing batches with a maximum size of 30 sentences.
  2. Run python code /Code/ to create combinedTrain folder
  3. Run python code Code/Prepoccesing to compute the features
  4. Run python code Code/Bayes on the outputfiles of the previous stepts to train and test the classifier

events-in-text's People

Contributors

morgranth avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.