Harnessing diversity in crowds and machines for better NER performance


This repository contains the experimental results of identifying and typing named entities in English Wikipedia sentences. Even though current named entity recognition (NER) tools achieve nearly human-like performance on particular data types or domains, they are still highly dependent on the gold standard used for training and testing. The mainstream approach for gathering ground truth, or a gold standard, for training and evaluating named entity recognition tools is still by means of experts, who are typically expensive and hard to find. Furthermore, for each new input type or new domain, new gold standards need to be created. Moreover, experts follow over-generalized annotation guidelines, meant to increase the inter-annotator agreement between experts. Such guidelines tend to deny the intrinsic ambiguity of language and the multitude of perspectives and interpretations. Thus, ground truth datasets might not always be 'gold' or 'true' in terms of capturing the real meaning of the text and the diversity of its interpretations. In the last decade, crowdsourcing has also proven to be a suitable method for gathering such ground truth, but data ambiguity is still not handled.

In our work, however, we focus on capturing the inter-annotator disagreement to provide a new type of ground truth, i.e., crowd truth, by applying the CrowdTruth metrics and methodology, in which language features are taken into consideration. All the crowdsourcing experiments were performed through the CrowdTruth platform, while the results were processed and analyzed using the CrowdTruth methodology and metrics. For more information, check the CrowdTruth website. For gathering the annotated data, we used the CrowdFlower marketplace.

We propose a novel approach for extracting and typing named entities in texts, i.e., a hybrid multi-machine-crowd approach in which state-of-the-art NER tools are combined and their aggregated output is validated and improved through crowdsourcing. We report here the results of:

  1. Five state-of-the-art named entity recognition tools (SingleNER)
  2. The combined output of the five state-of-the-art named entity recognition tools (MultiNER)
  3. Crowdsourcing experiments for correcting and improving the MultiNER output and also for improving the expert-based gold standard (MultiNER+Crowd).

Check the Results & Download the Data: Crowdsourcing-Improved-NE-Gold-Standard

Table of Contents:

  • Experimental Data
  • Dataset Files
  • Crowdsourcing Experiments

Experimental Data:

We performed named entity extraction with five state-of-the-art NER tools: NERD-ML, TextRazor, THD, DBpediaSpotlight, and SemiTags. We performed a comparative analysis of (1) their performance (output) and (2) their combined performance (output), on two ground truth (GT) evaluation datasets used during Task 1 of the Open Knowledge Extraction (OKE) semantic challenge at ESWC in 2015 (OKE2015) and 2016 (OKE2016) respectively. The datasets can be checked here:

  1. OKE2015: Open Knowledge Extraction 2015 (OKE2015) semantic challenge: https://github.com/anuzzolese/oke-challenge
  2. OKE2016: Open Knowledge Extraction 2016 (OKE2016) semantic challenge: https://github.com/anuzzolese/oke-challenge-2016

In summary, there are 156 Wikipedia sentences with 1007 annotated named entities of types place, person, organization and role, distributed across the datasets in the following way:

| Named Entity Type | OKE2015 (101 sentences) | OKE2016 (55 sentences) |
|-------------------|-------------------------|------------------------|
| Place             | 120                     | 44                     |
| Person            | 304                     | 105                    |
| Organization      | 139                     | 105                    |
| Role              | 103                     | 86                     |
| Total             | 664                     | 340                    |

Dataset Files:

|--/aggregate

Various aggregated datasets for analyzing the output of multiple state-of-the-art named entity recognition tools (SingleNER), their combined output (MultiNER) and crowdsourcing data for correcting and improving the MultiNER approach and the gold standard.

|--/aggregate/OKE2015/OKE2015_SingleNER_and_MultiNER_eval.csv
|--/aggregate/OKE2016/OKE2016_SingleNER_and_MultiNER_eval.csv

These files contain the results of the five SOTA NER tools and the results of the MultiNER approach on the two aforementioned gold standard datasets. The files contain all the named entities in the gold standards and all the other alternatives (overlapping expressions) that were extracted by any SOTA NER tool for a given entity. The columns are listed below, followed by a short example of how they can be used:

  • Identifier: sentence ID as referenced in the gold standard datasets
  • Sentence: sentence content as referenced in the gold standard datasets
  • NamedEntity: a potential named entity extracted by any of the five SOTA NER tools;
  • StartOffset: start offset of the named entity
  • EndOffset: end offset of the named entity
  • GoldEntityType: the type of the named entity as provided in the gold standard
  • EntityScore: the likelihood of an entity to be in the gold standard based on how many NER tools extracted it. The score is equal to the ratio of NER tools that extracted the entity.
  • SingleNERCount: the number of SOTA NER tools that extracted the named entity
  • Gold: binary value describing whether the named entity is contained in the gold standard (1) or not (0)
  • MultiNER: binary value describing whether any of the NER tools extracted the named entity (1) or not (0)
  • NERD,TextRazor,SemiTags,THD,DBpediaSpotlight: binary value describing whether the given NER tool extracted the named entity (1) or not (0)
  • TP_MultiNER: binary value describing whether the named entity is a TP case (1) or not (0), with regard to the MultiNER approach
  • TP_NERD,TP_TextRazor,TP_SemiTags,TP_THD,TP_DBpediaSpotlight: binary value describing whether the named entity is a TP case (1) or not (0), with regard to the SingleNER approach
  • TN_MultiNER: binary value describing whether the named entity is a TN case (1) or not (0), with regard to the MultiNER approach
  • TN_NERD,TN_TextRazor,TN_SemiTags,TN_THD,TN_DBpediaSpotlight: binary value describing whether the named entity is a TN case (1) or not (0), with regard to the SingleNER approach
  • FP_MultiNER: binary value describing whether the named entity is a FP case (1) or not (0), with regard to the MultiNER approach
  • FP_NERD,FP_TextRazor,FP_SemiTags,FP_THD,FP_DBpediaSpotlight: binary value describing whether the named entity is a FP case (1) or not (0), with regard to the SingleNER approach
  • FN_MultiNER: binary value describing whether the named entity is a FN case (1) or not (0), with regard to the MultiNER approach
  • FN_NERD,FN_TextRazor,FN_SemiTags,FN_THD,FN_DBpediaSpotlight: binary value describing whether the named entity is a FN case (1) or not (0), with regard to the SingleNER approach
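
As an illustration only (not code from this repository), precision, recall and F1 for the MultiNER approach and each SingleNER tool can be recomputed from the binary TP_/FP_/FN_ columns above. The sketch below assumes the file is read from the repository path listed earlier and that the column names match the list exactly:

```python
# Minimal sketch: recompute precision/recall/F1 from the *_eval.csv columns
# described above. Assumes the column suffixes MultiNER, NERD, TextRazor,
# SemiTags, THD, DBpediaSpotlight as listed in this README.
import pandas as pd

df = pd.read_csv("aggregate/OKE2015/OKE2015_SingleNER_and_MultiNER_eval.csv")

def prf(frame, approach):
    """Precision, recall and F1 from the binary TP_/FP_/FN_ columns of one approach."""
    tp = frame[f"TP_{approach}"].sum()
    fp = frame[f"FP_{approach}"].sum()
    fn = frame[f"FN_{approach}"].sum()
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# MultiNER plus the five SingleNER tools.
for approach in ["MultiNER", "NERD", "TextRazor", "SemiTags", "THD", "DBpediaSpotlight"]:
    p, r, f = prf(df, approach)
    print(f"{approach:<18} P={p:.3f} R={r:.3f} F1={f:.3f}")
```
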
|--/aggregate/OKE2015/OKE2015_MultiNER_and_Crowd_eval.csv
|--/aggregate/OKE2016/OKE2016_MultiNER_and_Crowd_eval.csv

These files contain the results of the five SOTA NER tools, the results of the MultiNER approach on the two gold standards and the crowdsourcing results for every named entity in the gold standard that has multiple alternatives. The columns are:

  • Identifier: sentence ID as referenced in the gold standard datasets
  • Sentence: sentence content as referenced in the gold standard datasets
  • NamedEntity: a potential named entity extracted by any of the five SOTA NER tools;
  • StartOffset: start offset of the named entity
  • EndOffset: end offset of the named entity
  • GoldEntityType: the type of the named entity as provided in the gold standard
  • EntityScore: the likelihood of an entity to be in the gold standard based on how many NER tools extracted it. The score is equal to the ratio of NER tools that extracted the entity.
  • SingleNERCount: the number of SOTA NER tools that extracted the named entity
  • Gold: binary value describing whether the named entity is contained in the gold standard (1) or not (0)
  • CrowdGold: binary value describing whether the named entity is considered a valid named entity by the crowd (1) or not (0)
  • MultiNER: binary value describing whether any of the NER tools extracted the named entity (1) or not (0)
  • NERD,TextRazor,SemiTags,THD,DBpediaSpotlight: binary value describing whether the given NER tool extracted the named entity (1) or not (0)
  • TP_MultiNER: binary value describing whether the named entity is a TP case (1) or not (0), with regard to the MultiNER approach
  • TP_NERD,TP_TextRazor,TP_SemiTags,TP_THD,TP_DBpediaSpotlight: binary value describing whether the named entity is a TP case (1) or not (0), with regard to the SingleNER approach
  • TN_MultiNER: binary value describing whether the named entity is a TN case (1) or not (0), with regard to the MultiNER approach
  • TN_NERD,TN_TextRazor,TN_SemiTags,TN_THD,TN_DBpediaSpotlight: binary value describing whether the named entity is a TN case (1) or not (0), with regard to the SingleNER approach
  • FP_MultiNER: binary value describing whether the named entity is a FP case (1) or not (0), with regard to the MultiNER approach
  • FP_NERD,FP_TextRazor,FP_SemiTags,FP_THD,FP_DBpediaSpotlight: binary value describing whether the named entity is a FP case (1) or not (0), with regard to the SingleNER approach
  • FN_MultiNER: binary value describing whether the named entity is a FN case (1) or not (0), with regard to the MultiNER approach
  • FN_NERD,FN_TextRazor,FN_SemiTags,FN_THD,FN_DBpediaSpotlight: binary value describing whether the named entity is a FN case (1) or not (0), with regard to the SingleNER approach
  • MainAlternativeSpan: the largest span extracted by any of the NER tools that overlaps with a named entity in the gold standard
  • AlternativeStartOffset: start offset of the named entity alternative
  • AlternativeEndOffset: end offset of the named entity alternative
  • AlternativeCrowdScore: the likelihood of an entity to be in the gold standard based on the crowd assessment. The score is computed using the cosine similarity measure
  • RoleScore,PersonScore,OrganizationScore,PlaceScore,OtherScore: the likelihood of an entity to refer to the given type based on the crowd assessment. The score is computed using the cosine similarity measure (see the sketch after this list)
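
For readers unfamiliar with cosine-similarity-based scores, here is a minimal, hypothetical sketch of how a per-type score could be derived from aggregated worker votes for one expression. The vote counts are invented and the actual raw files under /raw may use a different layout; this only illustrates the idea behind the type scores above:

```python
# Hypothetical illustration of a cosine-similarity type score, in the spirit of
# the CrowdTruth metrics referenced in this README. Not code from the repository.
import numpy as np

TYPES = ["Role", "Person", "Organization", "Place", "Other"]

# Invented vote counts for one expression: how many workers picked each type.
votes = np.array([1, 9, 2, 0, 1], dtype=float)

norm = np.linalg.norm(votes)
for entity_type, count in zip(TYPES, votes):
    # Cosine similarity between the aggregated vote vector and the unit vector
    # of this type reduces to count / ||votes||.
    score = count / norm if norm else 0.0
    print(f"{entity_type}Score = {score:.3f}")
```
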
|--/input
|  |--/Valid Named Entity Expressions
|  |  |--/OKE2015
|  |  |--/OKE2016

The files contain the input for the crowdsourcing tasks for each dataset. An input unit is composed of a sentence and a set of expressions that refer to a named entity.
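
For illustration only, such an input unit could look roughly like the following; the field names and the sentence are invented, and the actual files under /input define the real format:

```python
# Purely illustrative input unit: one sentence plus the candidate expressions
# that all refer to the same named entity. Field names are hypothetical.
input_unit = {
    "sentence": "Brian Banner is a fictional villain from the Marvel Comics Universe.",
    "expressions": ["Brian Banner", "Brian", "Banner"],
}
```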

|--/raw
|  |--/Valid Named Entity Expressions
|  |  |--/OKE2015
|  |  |--/OKE2016

The raw data collected from the crowdsourcing tasks for each of the two datasets.

Crowdsourcing Experiments:

Overall, the aim of the crowdsourcing experiments is to:

  1. correct the mistakes of the NER tools
  2. identify the ambiguities in the ground truth and provide a better ground truth

Crowdsourcing Experimental Data

We select every entity in the ground truth for which the NER tools provided alternatives. We have the following two cases:

  • Crowd reduces the number of FP: For each named entity in the ground truth that has multiple alternatives (span alternatives), we create an entity cluster. We also add the largest span among all the alternatives.
  • Crowd reduces the number of FN: For each named entity in the ground truth that was not extracted, we create an entity cluster that contains the FN named entity and the alternatives returned by the NER tools. Further, we add every other combination of words contained in all the alternatives (see the sketch after this list). This step is necessary because we do not want to introduce bias in the task, i.e., the crowd should see all the possibilities, not only the expected one.
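
A minimal sketch of how such a candidate list could be built, assuming that "every other combination of words" means every contiguous word sub-span of the gold entity and its alternatives (our reading; this is not code from the repository):

```python
# Sketch: build the candidate expressions shown to the crowd for one entity
# cluster, under the assumption that all contiguous word sub-spans are offered.
def contiguous_subspans(expression):
    """All contiguous word sub-spans of an expression, e.g. 'New York City' ->
    {'New', 'York', 'City', 'New York', 'York City', 'New York City'}."""
    words = expression.split()
    return {
        " ".join(words[i:j])
        for i in range(len(words))
        for j in range(i + 1, len(words) + 1)
    }

def entity_cluster(gold_entity, ner_alternatives):
    """Candidate expressions for one gold entity and its NER alternatives."""
    candidates = set()
    for alternative in [gold_entity, *ner_alternatives]:
        candidates |= contiguous_subspans(alternative)
    return sorted(candidates)

# Invented example entity and alternatives, for illustration only.
print(entity_cluster("New York City", ["New York", "City of New York"]))
```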

Crowdsourcing Annotation Task

For the two cases described above, the goal of the crowdsourcing task is two-fold:

  • identification of the valid expressions in a list that refer to the phrase highlighted in yellow (Step 2 in the crowdsourcing template below)
  • selection of the type of each expression in the list, from a predefined set of choices: place, person, organization, role and other (Step 3 in the crowdsourcing template below).

The input of the crowdsourcing task consists of a sentence and a named entity for which multiple expressions were given by the five state-of-the-art NER tools.

Check the crowdsourcing templates below.

![Fig.1: CrowdTruth Workflow for Identifying Valid Named Entity Expressions and their Type.](https://raw.githubusercontent.com/CrowdTruth/Crowdsourcing-NamedEntities-GoldStandard/master/templates/Screen%20Shot%202016-11-29%20at%2015.20.34.png)

![Fig.2: CrowdTruth Workflow for Identifying Valid Named Entity Expressions and their Type.](https://raw.githubusercontent.com/CrowdTruth/Crowdsourcing-NamedEntities-GoldStandard/master/templates/Screen%20Shot%202016-11-29%20at%2015.26.11.png)
