isabella232 / entitydisambiguation

This project forked from azure-samples/entitydisambiguation

This project proposes a methodology to disambiguate misspelled entities by comparing the search retrieval performance with different custom search analyzers in a search engine

License: MIT License

Python 49.72% Jupyter Notebook 50.28%

page_type: sample
languages: python
name: Entity Disambiguation Using Azure Search
description: A methodology to disambiguate misspelled entities using different custom search analyzers in a search engine
products: azure-cognitive-search

Entity Disambiguation

This document proposes a methodology to disambiguate misspelled entities by comparing search retrieval performance across different custom search analyzers in a search engine. With a well-chosen analyzer setup, the search engine can answer a query that contains misspelled entities with higher precision and recall than it achieves with the default settings. The method can be applied to any search engine service that supports custom search analyzers.

Features

This project framework provides the following:

  • An approach to measure how well the search engine retrieves a misspelled personaName when it uses a single search analyzer or a combination of several.

Getting Started

Prerequisites

Quick Start

  1. git clone https://github.com/Azure-Samples/EntityDisambiguation.git
  2. python3 -m venv env
  3. (Unix or macOS) source env/bin/activate, or (Windows) env\Scripts\activate.bat
  4. python -m pip install -r requirements.txt
  5. cd src
  6. python main.py

Demo

The figure below shows the overall architecture, from the user's speech, through the search over the data source, to the response to the user's request.

[Figure: scenario]

Methodology

Our approach is to measure how well the search engine retrieves a misspelled personaName when it uses a single search analyzer or a combination of several.
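
To make the idea of single and combined analyzers concrete, the candidate setups can be enumerated as every non-empty subset of the available analyzer fields. The sketch below only illustrates that idea with the standard library. The function name `non_empty_subsets` is hypothetical, and this is not necessarily how the project's own `get_subsets` helper is implemented. The analyzer names are the ones that appear in this README.

```python
from itertools import combinations


def non_empty_subsets(fields):
    """Return every non-empty combination of the given fields, smallest first."""
    ordered = sorted(fields)  # sort so the output order is deterministic
    return [
        list(combo)
        for size in range(1, len(ordered) + 1)
        for combo in combinations(ordered, size)
    ]


# Analyzer names taken from the examples in this README.
analyzers = {"camelcase", "url_email", "text_microsoft"}
subsets = non_empty_subsets(analyzers)
print(len(subsets))  # 2**3 - 1 = 7 candidate setups
```

The number of setups grows as 2^n - 1, so with many analyzers it may be worth capping the subset size.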

Usage

Create a Search Index and insert documents in the index:

from azuresearchclient import AzureSearchClient

AZURE = AzureSearchClient()
# Create Search Index
AZURE.create_index("INDEX_NAME")
# Insert documents (correctly spelled names) into the search index
AZURE.insert_documents("INDEX_NAME")

Query the search index with misspelled names and calculate the performance:

  • Create a set of all analyzers (fields).
  • Load the misspelled names from names-misspelled.csv.
  • Load the expected names/results from names-expected.csv.
  • For each subset of analyzers:
    • Send a query to the search index with the misspelled name and the target fields.
    • Mark the response (e.g. TP, TN, FP, FN).
    • Calculate the precision, recall and F1 score.
  • The statistics are stored in the generated directory.

from constants import Constants
from statistics import Statistics

STATS = Statistics()
# target fields to be searched
FIELDS_SET = Constants.name_search_fields
all_subsets = STATS.utils.get_subsets(FIELDS_SET)
# list of correct names (already uploaded to the search index)
correct_list = STATS.utils.read_csv("names-expected.csv")
# list of misspelled names
misspelled_list = STATS.utils.read_csv("names-misspelled.csv")
# making queries (with misspelled names) and measure the result
STATS.calculate_statistics(correct_list, misspelled_list, all_subsets, AZURE, True)
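
The marking and scoring steps from the list above can be sketched as follows. This is a minimal illustration, not the project's `calculate_statistics` implementation. The helper names `mark_response` and `precision_recall_f1` are hypothetical, and so is the labelling rule: a wrong name counts as FP, an empty result as FN, and TN is left out because every query in this sketch has an expected match.

```python
def mark_response(returned, expected):
    """Label one query result; names are compared case-insensitively."""
    if returned is None:
        return "FN"  # nothing retrieved although a match was expected
    if returned.strip().lower() == expected.strip().lower():
        return "TP"  # the expected name was retrieved
    return "FP"  # something was retrieved, but it was the wrong name


def precision_recall_f1(labels):
    """Compute precision, recall and F1 from a list of TP/FP/FN labels."""
    tp = labels.count("TP")
    fp = labels.count("FP")
    fn = labels.count("FN")
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative (returned, expected) pairs, not measured results.
pairs = [
    ("Tom O'halleran", "Tom O'halleran"),
    ("Tom Canada", "Tom O'halleran"),
    (None, "Tom O'halleran"),
]
labels = [mark_response(returned, expected) for returned, expected in pairs]
print(labels, precision_recall_f1(labels))
```

The guards against zero denominators keep the scores defined when an analyzer subset returns nothing at all.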

Plot the F1 score for each Analyzer

from statistics import Statistics

STATS = Statistics()
SCORES = STATS.generate_f1()
STATS.create_plot(SCORES)

[Figure: plot of the F1 score for each analyzer]
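
Besides plotting, the same scores can be ranked to pick the best-performing setup. The sketch below assumes the scores are available as a mapping from setup name to F1 score. That shape is an assumption, since the return type of `generate_f1` is not documented here. The helper `rank_by_f1`, the setup names, and the values are all placeholders.

```python
def rank_by_f1(scores):
    """Sort analyzer setups from highest to lowest F1 score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder names and values for illustration only; NOT results from the project.
example_scores = {
    "setup_a": 0.25,
    "setup_b": 0.5,
    "setup_c": 0.75,
}
ranking = rank_by_f1(example_scores)
best_setup, best_f1 = ranking[0]
print(best_setup, best_f1)
```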

Experiment Result

Suppose the search index contains the name: Tom O'halleran

Our speech recognition or OCR system extracted this text with an incorrect spelling: Tom O Halleran

We try to disambiguate this name with two setups:

  • Default analyzer setup (standard_lucene)

  • Best-performing analyzer setup (camelcase, url_email, text_microsoft)

Default analyzer experiment:

from azuresearchclient import AzureSearchClient

AZURE = AzureSearchClient()
AZURE.make_search("Tom O Halleran", ["standard_lucene"])

Result:

"Tom Canada"

Best-performing analyzer experiment:

from azuresearchclient import AzureSearchClient

AZURE = AzureSearchClient()
AZURE.make_search("Tom O Halleran", ["camelcase", "url_email", "text_microsoft"])

Result:

"Tom O'halleran"

The default analyzer failed to retrieve the most relevant result, whereas the best-performing analyzer combination returned the correct name.

Resources
