page_type | languages | name | description | products
---|---|---|---|---
sample | | Entity Disambiguation Using Azure Search | A methodology to disambiguate misspelled entities using different custom search analyzers in a search engine |
Entity Disambiguation
This document proposes a methodology for disambiguating misspelled entities by comparing search retrieval performance across different custom search analyzers in a search engine. With a well-chosen analyzer configuration, the search engine can answer queries that contain misspelled entities with higher precision and recall than it achieves with the default settings. The method applies to any search engine service that supports custom search analyzers.
Features
This project framework provides the following:
- An approach to measure how well the search engine retrieves misspelled person names when it uses a single search analyzer or a combination of several analyzers.
Getting Started
Prerequisites
- Python 3.5.3 or later
- An Azure Cognitive Search service, created in the Azure portal, with its properties filled in to the config.yml file
Quick Start
- git clone https://github.com/Azure-Samples/EntityDisambiguation.git
- python3 -m venv env
- (Unix or macOS) source env/bin/activate
  (Windows) env\Scripts\activate.bat
- python -m pip install -r requirements.txt
- cd src
- python main.py
Demo
The figure below shows the overall architecture, from the user's speech, through searching the data source, to responding to the user's request.
Methodology
Our approach is to measure how well the search engine retrieves misspelled person names when it uses a single search analyzer or a combination of several analyzers.
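To make "a single analyzer or a combination of analyzers" concrete, here is a minimal sketch that enumerates every non-empty combination of analyzer fields. The field names are borrowed from the Experiment Result section. The helper `non_empty_subsets` is hypothetical and written only for this illustration. It is not necessarily how `get_subsets` in this repository is implemented.

```python
from itertools import combinations


def non_empty_subsets(fields):
    """Return every non-empty combination of the given fields, smallest first."""
    ordered = sorted(fields)
    return [
        list(combo)
        for size in range(1, len(ordered) + 1)
        for combo in combinations(ordered, size)
    ]


# Illustrative analyzer fields (assumed names, taken from the experiment section)
analyzer_fields = {"standard_lucene", "camelcase", "url_email", "text_microsoft"}
subsets = non_empty_subsets(analyzer_fields)

# Four fields give 2**4 - 1 non-empty combinations
print(len(subsets))
```

Each combination can then be evaluated as one candidate analyzer setup. Note that the number of combinations grows exponentially with the number of fields.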
Usage
Create a search index and insert documents into it:
- The search index is created from the index-schema.json file
- The documents are people's names sourced from names.csv
from azuresearchclient import AzureSearchClient
AZURE = AzureSearchClient()
# Create Search Index
AZURE.create_index("INDEX_NAME")
# Insert documents (correctly spelled names) into the search index
AZURE.insert_documents("INDEX_NAME")
Query the search index with misspelled names and calculate the performance:
- Create a set of all analyzers (fields) and generate its subsets
- Load the misspelled names from names-misspelled.csv
- Load the expected names/results from names-expected.csv
- For every subset of analyzers:
  - Send a query to the search index with the misspelled name and the target fields
  - Mark the response (e.g. TP, TN, FP, FN)
- Calculate the Precision, Recall and F1 score
- The statistics are stored in the generated directory
from constants import Constants
from statistics import Statistics
STATS = Statistics()
# target fields to be searched
FIELDS_SET = Constants.name_search_fields
all_subsets = STATS.utils.get_subsets(FIELDS_SET)
# list of correct names (already uploaded to the search index)
correct_list = STATS.utils.read_csv("names-expected.csv")
# list of misspelled names
misspelled_list = STATS.utils.read_csv("names-misspelled.csv")
# AZURE is the AzureSearchClient instance created in the previous step
# query the index with the misspelled names and measure the results
STATS.calculate_statistics(correct_list, misspelled_list, all_subsets, AZURE, True)
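The Precision, Recall and F1 step can be sketched as follows. This is a minimal illustration using the standard definitions. The function name `precision_recall_f1` and the example counts are hypothetical, and the repository's `Statistics` class may compute or store these values differently.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts.

    Returns 0.0 for any metric whose denominator would be zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical counts for one analyzer combination (not real results)
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

True negatives do not enter any of the three metrics, which is why the function only takes TP, FP and FN.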
Plot the F1 score for each Analyzer
from statistics import Statistics
STATS = Statistics()
SCORES = STATS.generate_f1()
STATS.create_plot(SCORES)
Experiment Result
Consider a name stored in our search index: Tom O'halleran
Our speech recognition or OCR system extracted this name with an incorrect spelling: Tom O Halleran
We try to disambiguate this name with two setups:
- Default analyzer setup (standard_lucene)
- Best-performing analyzer setup (camelcase, url_email, text_microsoft)
Default analyzer experiment
from azuresearchclient import AzureSearchClient
AZURE = AzureSearchClient()
AZURE.make_search("Tom O Halleran", ["standard_lucene"])
Result:
"Tom Canada"
Best-performing analyzer experiment
AZURE = AzureSearchClient()
AZURE.make_search("Tom O Halleran", ["camelcase", "url_email", "text_microsoft"])
Result:
"Tom O'halleran"
The default analyzer failed to retrieve the most relevant result, whereas the best-performing analyzer combination returned the expected name.
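One plausible reason for the difference is sketched below under simplifying assumptions. A standard word tokenizer typically keeps an apostrophe inside a word, so the indexed name and the misspelled query may share only the token "tom". The regular expression is a rough stand-in written for this illustration, not the actual Lucene analyzer. The helper names are hypothetical, and the token overlap is not the search engine's scoring function.

```python
import re


def simple_tokens(text):
    """Lowercase the text and split it into words, keeping apostrophes inside words."""
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", text.lower())


def shared_tokens(query, document):
    """Return the set of tokens that the query and the document have in common."""
    return set(simple_tokens(query)) & set(simple_tokens(document))


query = "Tom O Halleran"
for name in ["Tom O'halleran", "Tom Canada"]:
    # Show each candidate's tokens and its overlap with the misspelled query
    print(name, simple_tokens(name), shared_tokens(query, name))
```

Under this approximation both candidate names overlap with the query only on "tom", so a token-level match alone cannot prefer the intended name. Analyzers that split or normalize the name differently can recover the match.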