ParsiAnalyzer is an analysis plugin for Elasticsearch. Analysis is a process that consists of the following steps:
- Tokenizing a block of text into individual terms
- Normalizing these terms into a standard form
An analyzer is really just a wrapper that combines character filters, a tokenizer, and token filters. Elasticsearch provides many built-in analyzers, but there is still room for improvement, especially for the Persian language. This plugin provides tools for tokenizing, normalizing, and stemming Persian text.
- Tokenize Persian text
  - Convert whitespace to a zero-width non-joiner (نیم‌فاصله) wherever necessary. For example, می رود to می‌رود.
  - Convert Persian punctuation to its English equivalent. For example, ۳/۱۴ to ۳.۱۴.
  - Tokenize Persian text by whitespace and punctuation.
- Normalize Persian tokens into a single canonical form
  - Transform all forms of Yeh, Kaf, Heh, and Hamza to a unique form. For example, براي to برای.
  - Convert all Persian and Arabic digits to their English equivalent. For example, ۱۴۳ to 143.
  - Remove diacritics (اِعراب) from words. For example, اَرّه to اره.
  - Remove Kashida from words. For example, بادبــــــادک to بادبادک.
- Remove common Persian stop words
  - Persian stop words such as از and به are removed.
- Stem Persian words
  - Remove common Persian suffixes. For example, ها or ان.
  - Stemming reduces precision, so this feature is disabled by default.
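The tokenizing, normalizing, and stop-word steps listed above can be illustrated with a minimal Python sketch. This is not the plugin's code (the plugin runs inside Elasticsearch); the character tables, the two-word stop list, and the single `می` prefix rule are assumptions chosen to mirror the examples in this README, and Heh/Hamza unification and stemming are left out.

```python
import re
import unicodedata

ZWNJ = "\u200c"     # zero-width non-joiner
KASHIDA = "\u0640"  # tatweel, used only to stretch words visually
MI = "\u0645\u06cc" # the verb prefix "می"

# Assumed mapping table; the plugin's real table is likely larger.
CHAR_MAP = {
    "\u064a": "\u06cc",  # Arabic Yeh -> Persian Yeh
    "\u0649": "\u06cc",  # Alef Maksura -> Persian Yeh
    "\u0643": "\u06a9",  # Arabic Kaf -> Persian Kaf
}
for i in range(10):
    CHAR_MAP[chr(0x06F0 + i)] = str(i)  # Persian digits -> ASCII digits
    CHAR_MAP[chr(0x0660 + i)] = str(i)  # Arabic-Indic digits -> ASCII digits
TRANSLATION = str.maketrans(CHAR_MAP)

# Only the two stop words named in this README.
STOP_WORDS = {"\u0627\u0632", "\u0628\u0647"}

# Hypothetical rule: join a standalone "می" to the word that follows it.
MI_PREFIX = re.compile(r"\b" + MI + r" (?=\w)")


def join_verb_prefix(text):
    """Replace the space after the "می" prefix with a ZWNJ."""
    return MI_PREFIX.sub(MI + ZWNJ, text)


def normalize(text):
    """Unify Yeh/Kaf, convert digits, drop Kashida and diacritics."""
    text = text.translate(TRANSLATION).replace(KASHIDA, "")
    # Arabic diacritics are combining marks (Unicode category Mn).
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def analyze(text):
    """Join prefixes, normalize, split into tokens, drop stop words."""
    text = normalize(join_verb_prefix(text))
    tokens = re.findall(r"[\w" + ZWNJ + r"]+", text)
    return [t for t in tokens if t not in STOP_WORDS]
```

Normalizing before splitting is a deliberate choice in this sketch: diacritics are not word characters for Python's `\w`, so splitting first would break a word at each diacritic.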
To install the plugin for Elasticsearch 5.6.3, run this command:
bin\elasticsearch-plugin install https://github.com/NarimanN2/ParsiAnalyzer/releases/download/v1.0.0/ParsiAnalyzer-1.0.0.zip
To see how this plugin works, you can use Elasticsearch's analyze API:
GET _analyze
{
  "analyzer": "parsi",
  "text": "روباه قهوهاي چابك از روی سگ تنبل می پرد"
}
This will give you these tokens: [روباه, قهوهای, چابک, روی, سگ, تنبل, می‌پرد]
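The same request can be sent from a script. The sketch below uses only the Python standard library and assumes a node with the plugin installed is reachable at `http://localhost:9200`; that address and the helper names `build_analyze_request` and `fetch_tokens` are illustrative, not part of the plugin. It uses POST, which the analyze API accepts as well as GET.

```python
import json
import urllib.request


def build_analyze_request(text, host="http://localhost:9200"):
    """Build (but do not send) an _analyze request for the parsi analyzer."""
    body = json.dumps({"analyzer": "parsi", "text": text}, ensure_ascii=False)
    return urllib.request.Request(
        url=host + "/_analyze",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_tokens(text, host="http://localhost:9200"):
    """Send the request and return only the token strings from the response."""
    with urllib.request.urlopen(build_analyze_request(text, host)) as resp:
        payload = json.load(resp)
    return [entry["token"] for entry in payload["tokens"]]
```

Calling `fetch_tokens` with the sentence above should return the token list shown, provided the node is running and the plugin is loaded.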
ParsiAnalyzer can be specified directly in the field mapping as follows:
PUT /my_index
{
  "mappings": {
    "blog": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "parsi"
        }
      }
    }
  }
}
Email: n.esmaielyfard [at] gmail.com