Current version: 1.3.0 (changelog)
An extension of Solr's ExtendedDismaxQParserPlugin that splits queries into a "normal" query and a "synonym" query. This enables proper query-time synonym expansion, with no reindexing required.
It also fixes many bugs in how Solr typically handles synonyms using the SynonymFilterFactory.
For more details, read my blog post on the subject.
The following tutorial sets up a working synonym-enabled Solr app using the example/ directory from Solr itself, running in Jetty.
Download the latest JAR file depending on your Solr version:
- hon-lucene-synonyms-1.2.3-solr-3.x.jar for Solr 3.4.0, 3.5.0, 3.6.0, 3.6.1, and 3.6.2
- hon-lucene-synonyms-1.2.3-solr-4.0.0.jar for Solr 4.0.0
- hon-lucene-synonyms-1.2.3-solr-4.1.0.jar for Solr 4.1.0 and 4.2.0
- hon-lucene-synonyms-1.3.0-solr-4.3.0.jar for Solr 4.3.0 (requires config change)
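If you script your setup, the list above can be expressed as a simple lookup. This is only a sketch: the mapping is copied from the list, while the names JAR_BY_SOLR_VERSION and jar_for_solr are hypothetical and not part of the plugin.

```python
# Mapping copied from the download list above.
# The names below are illustrative assumptions, not part of the plugin.
JAR_BY_SOLR_VERSION = {
    "3.4.0": "hon-lucene-synonyms-1.2.3-solr-3.x.jar",
    "3.5.0": "hon-lucene-synonyms-1.2.3-solr-3.x.jar",
    "3.6.0": "hon-lucene-synonyms-1.2.3-solr-3.x.jar",
    "3.6.1": "hon-lucene-synonyms-1.2.3-solr-3.x.jar",
    "3.6.2": "hon-lucene-synonyms-1.2.3-solr-3.x.jar",
    "4.0.0": "hon-lucene-synonyms-1.2.3-solr-4.0.0.jar",
    "4.1.0": "hon-lucene-synonyms-1.2.3-solr-4.1.0.jar",
    "4.2.0": "hon-lucene-synonyms-1.2.3-solr-4.1.0.jar",
    "4.3.0": "hon-lucene-synonyms-1.3.0-solr-4.3.0.jar",
}


def jar_for_solr(version):
    """Return the JAR file name listed for an exact Solr version."""
    try:
        return JAR_BY_SOLR_VERSION[version]
    except KeyError:
        raise ValueError("no JAR listed for Solr %s" % version)
```

Versions that are not in the list raise a ValueError rather than guessing at compatibility.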
Download Solr from the Solr home page. For this tutorial, we'll use Solr 3.6.2. You do not need the sources; the tgz or zip file will work fine.
Extract the compressed file and cd to the example/ directory.
Now, you need to bundle the hon-lucene-synonyms-*.jar file into webapps/solr.war.
Below is a script that works on UNIX systems. Be sure to change the /path/to/my/hon-lucene-synonyms-*.jar part before running it.
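# unpack the WAR into a scratch directory
mkdir myjar
cd myjar
jar -xf ../webapps/solr.war
# add the plugin JAR (change this path first)
cp /path/to/my/hon-lucene-synonyms-*.jar WEB-INF/lib/
# repack the WAR in place
jar -cf ../webapps/solr.war *
cd ..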
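On systems without the jar tool, the same bundling step can be sketched with Python's standard zipfile module, since a WAR is a ZIP archive. This is an alternative illustration, not the script above: the function name add_jar_to_war is hypothetical, and it appends to the existing WAR instead of unpacking and repacking it.

```python
import os
import zipfile


def add_jar_to_war(war_path, jar_path):
    """Append jar_path to war_path under WEB-INF/lib/ and return the entry name."""
    entry = "WEB-INF/lib/" + os.path.basename(jar_path)
    with zipfile.ZipFile(war_path, "a") as war:
        if entry in war.namelist():
            # Already bundled; do not write a duplicate entry.
            return entry
        war.write(jar_path, entry)
    return entry
```

Call it from the example/ directory with webapps/solr.war and the real path to your JAR. The function takes one concrete file name, so the * in the placeholder path must be expanded by hand.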
Note that this plugin may not work in any location other than the WEB-INF/lib/ directory of the solr.war itself, because of issues with the ClassLoader.
UPDATE: We have also tested running with the JAR in $SOLR_HOME/lib, and it works (under Jetty).
Download example_synonym_file.txt and copy it to the solr/conf/ directory (or solr/collection1/conf/ in Solr 4.x).
Edit solr/conf/solrconfig.xml (solr/collection1/conf/solrconfig.xml in 4.x) and add these lines near the bottom (before </config>):
<queryParser name="synonym_edismax" class="solr.SynonymExpandingExtendedDismaxQParserPlugin">
  <str name="luceneMatchVersion">LUCENE_36</str>
  <lst name="synonymAnalyzers">
    <lst name="myCoolAnalyzer">
      <lst name="tokenizer">
        <str name="class">solr.StandardTokenizerFactory</str>
      </lst>
      <lst name="filter">
        <str name="class">solr.ShingleFilterFactory</str>
        <str name="outputUnigramsIfNoShingles">true</str>
        <str name="outputUnigrams">true</str>
        <str name="minShingleSize">2</str>
        <str name="maxShingleSize">4</str>
      </lst>
      <lst name="filter">
        <str name="class">solr.SynonymFilterFactory</str>
        <str name="tokenizerFactory">solr.KeywordTokenizerFactory</str>
        <str name="synonyms">example_synonym_file.txt</str>
        <str name="expand">true</str>
        <str name="ignoreCase">true</str>
      </lst>
    </lst>
  </lst>
</queryParser>
Note that you must modify the luceneMatchVersion above to match the <luceneMatchVersion>...</luceneMatchVersion> tag at the beginning of the solr/conf/solrconfig.xml file.
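A quick way to check the luceneMatchVersion requirement above is to read both values out of solrconfig.xml and compare them. The sketch below uses only Python's standard library. The function name match_versions is hypothetical, and it assumes the pre-4.3 layout shown above, where luceneMatchVersion is a direct child of the synonym_edismax queryParser element.

```python
import xml.etree.ElementTree as ET


def _clean(text):
    """Strip surrounding whitespace, passing None through unchanged."""
    return text.strip() if text is not None else None


def match_versions(solrconfig_xml):
    """Return (top_level, parser_level) luceneMatchVersion values.

    solrconfig_xml is the text of solrconfig.xml. Either value is None
    when the corresponding element is missing.
    """
    root = ET.fromstring(solrconfig_xml)
    top = root.findtext("luceneMatchVersion")
    parser = root.find("queryParser[@name='synonym_edismax']")
    inner = None
    if parser is not None:
        inner = parser.findtext("str[@name='luceneMatchVersion']")
    return _clean(top), _clean(inner)
```

If the two returned values differ, update the one inside the queryParser block.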
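As of version 1.3.0 (for Solr 4.3.0 and beyond), there is a new way of loading Tokenizers and TokenFilters, and the XML format is somewhat different: the tokenizer and filter classes are given by short names (standard, shingle, synonym), and luceneMatchVersion is set on each tokenizer and filter: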
<queryParser name="synonym_edismax" class="solr.SynonymExpandingExtendedDismaxQParserPlugin">
  <lst name="synonymAnalyzers">
    <lst name="myCoolAnalyzer">
      <lst name="tokenizer">
        <str name="class">standard</str>
        <str name="luceneMatchVersion">LUCENE_43</str>
      </lst>
      <lst name="filter">
        <str name="class">shingle</str>
        <str name="luceneMatchVersion">LUCENE_43</str>
        <str name="outputUnigramsIfNoShingles">true</str>
        <str name="outputUnigrams">true</str>
        <str name="minShingleSize">2</str>
        <str name="maxShingleSize">4</str>
      </lst>
      <lst name="filter">
        <str name="class">synonym</str>
        <str name="luceneMatchVersion">LUCENE_43</str>
        <str name="tokenizerFactory">solr.KeywordTokenizerFactory</str>
        <str name="synonyms">example_synonym_file.txt</str>
        <str name="expand">true</str>
        <str name="ignoreCase">true</str>
      </lst>
    </lst>
  </lst>
</queryParser>
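Both configurations describe the same analyzer chain: tokenize the query, emit unigrams plus shingles of 2 to 4 adjacent tokens, then look each one up as a whole in the synonym file. The Python sketch below is a simplified model of that idea, not the plugin's or Lucene's implementation. The SYNONYMS group is an assumed stand-in for the contents of example_synonym_file.txt, and whitespace splitting stands in for the standard tokenizer.

```python
# Assumed synonym group, standing in for example_synonym_file.txt.
SYNONYMS = [
    {"dog", "hound", "pooch", "canis familiaris", "man's best friend"},
]


def shingles(tokens, min_size=2, max_size=4):
    """Return unigrams plus every run of min_size..max_size adjacent tokens."""
    out = list(tokens)
    for size in range(min_size, max_size + 1):
        for i in range(len(tokens) - size + 1):
            out.append(" ".join(tokens[i:i + size]))
    return out


def expand(query):
    """Return the sorted synonyms of any unigram or shingle in the query."""
    tokens = query.lower().split()  # ignoreCase=true in the config
    found = set()
    for candidate in shingles(tokens):
        for group in SYNONYMS:
            if candidate in group:
                found |= group - {candidate}
    return sorted(found)
```

With the assumed group, expand("dog") returns the four other entries, and expand("my canis familiaris") finds the two-word shingle and returns dog, hound, man's best friend, and pooch. Matching whole shingles is why the synonym filter is configured with a keyword tokenizer: multi-word synonyms have to be compared as single strings.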
Start up the app by running java -jar start.jar. Jetty may print a ClassNotFoundException, but it shouldn't matter.
In your browser, navigate to
http://localhost:8983/solr/select/?q=dog&debugQuery=on&qf=text&defType=synonym_edismax&synonyms=true
You should see a response like this:
<response>
  ...
  <result name="response" numFound="0" start="0"/>
  <lst name="debug">
    <str name="rawquerystring">dog</str>
    <str name="querystring">dog</str>
    <str name="parsedquery">
      +(DisjunctionMaxQuery((text:dog)) (((DisjunctionMaxQuery((text:canis))
      DisjunctionMaxQuery((text:familiaris)))~2) DisjunctionMaxQuery((text:hound))
      ((DisjunctionMaxQuery((text:man's)) DisjunctionMaxQuery((text:best))
      DisjunctionMaxQuery((text:friend)))~3) DisjunctionMaxQuery((text:pooch))))
    </str>
    <str name="parsedquery_toString">
      +((text:dog) ((((text:canis) (text:familiaris))~2) (text:hound)
      (((text:man's) (text:best) (text:friend))~3) (text:pooch)))
    </str>
    <lst name="explain"/>
    <str name="QParser">SynonymExpandingExtendedDismaxQParser</str>
    ...
  </lst>
</response>
Note that the input query dog has been expanded into dog, hound, pooch, canis familiaris, and man's best friend.
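To check the expansion from a script rather than by eye, you can pull parsedquery_toString out of the debug section of the response. A minimal sketch using only the standard library; the function name parsed_query is hypothetical, and it assumes the XML response layout shown above.

```python
import xml.etree.ElementTree as ET


def parsed_query(response_xml):
    """Return parsedquery_toString on one line, or None if it is absent."""
    root = ET.fromstring(response_xml)
    text = root.findtext(
        "lst[@name='debug']/str[@name='parsedquery_toString']"
    )
    if text is None:
        return None
    # Collapse the line breaks and indentation inside the element.
    return " ".join(text.split())
```

The function takes the response body as text, so it works the same whether you fetched it with a browser, curl, or an HTTP library.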
Boost the non-synonym part to 1.1 and the synonym part to 1.2 by adding synonyms.originalBoost=1.1&synonyms.synonymBoost=1.2:
+((text:dog)^1.1 (((((text:canis) (text:familiaris))~2) (text:hound)
(((text:man's) (text:best) (text:friend))~3) (text:pooch))^1.2))
Apply a minimum "should" match of 75% by adding mm=75%25 (the URL-encoded form of mm=75%):
+((text:dog) ((((text:canis) (text:familiaris))~1) (text:hound)
(((text:man's) (text:best) (text:friend))~2) (text:pooch)))
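The ~2 and ~3 from the earlier output become ~1 and ~2 here. That is consistent with the percentage being applied to the number of optional clauses and rounded down, which is how Solr documents percentage values of mm. A small arithmetic sketch under that assumption (the function name is hypothetical, and it covers only positive percentages):

```python
def min_should_match(optional_clauses, percent):
    """Clauses that must match for a positive percentage mm, rounded down."""
    return (optional_clauses * percent) // 100


# "canis familiaris" has 2 clauses, "man's best friend" has 3.
two_word = min_should_match(2, 75)    # 1.5 rounds down
three_word = min_should_match(3, 75)  # 2.25 rounds down
```

Here two_word is 1 and three_word is 2, matching the ~1 and ~2 in the output above.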
Observe how phrase queries are properly handled by using q="dog" instead of q=dog:
+((text:dog) ((text:"canis familiaris") (text:hound) (text:"man's best friend") (text:pooch)))
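The query variants above differ only in their URL parameters, so it can be convenient to build them in code and let the library handle the encoding (for example, 75% becoming 75%25). A minimal sketch with Python's standard library; the host, port, and qf=text come from this tutorial, and the function name synonym_query_url is hypothetical.

```python
from urllib.parse import urlencode

# Tutorial defaults; adjust for your own Solr instance.
BASE_URL = "http://localhost:8983/solr/select/"


def synonym_query_url(q, extra=None):
    """Build a debug query URL for the synonym_edismax parser."""
    params = [
        ("q", q),
        ("debugQuery", "on"),
        ("qf", "text"),
        ("defType", "synonym_edismax"),
        ("synonyms", "true"),
    ]
    # extra holds any additional parameters, e.g. mm or the boosts.
    params.extend((extra or {}).items())
    return BASE_URL + "?" + urlencode(params)
```

For example, synonym_query_url("dog") reproduces the first URL in this tutorial, synonym_query_url("dog", {"mm": "75%"}) appends mm=75%25, and synonym_query_url('"dog"') encodes the quotes for the phrase query.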
Keep in mind that you must add defType=synonym_edismax and synonyms=true to enable the parser in the first place.
Also, you must either define qf in the query parameters or defaultSearchField in solr/conf/schema.xml, so that the parser knows which fields to use during synonym expansion.
The following are parameters that you can use to tweak the synonym expansion.
Param | Type | Default | Summary |
synonyms | boolean | false | Enable or disable synonym expansion entirely. True if enabled. |
synonyms.analyzer | String | null | Name of the analyzer defined in solrconfig.xml to use (e.g. myCoolAnalyzer in the examples). This must be non-null if you define more than one analyzer (e.g. for more than one language). |
synonyms.originalBoost | float | 1.0 | Boost value applied to the original (non-synonym) part of the query. |
synonyms.synonymBoost | float | 1.0 | Boost value applied to the synonym part of the query. |
synonyms.disablePhraseQueries | boolean | false | True if synonym expansion should be disabled when the user input contains a phrase query (i.e. a quoted query). This option is offered because expansion of phrase queries may be considered non-intuitive to users. |
synonyms.constructPhrases | boolean | false | v1.2.2+: True if expanded synonyms should always be treated like phrases (i.e. wrapped in quotes). This option is offered in case your synonyms contain lots of phrases composed of common words (e.g. "man's best friend" for "dog"). Only affects the expanded synonyms; not the original query. See issue #5 for more discussion. |
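If you set these parameters from code, it can help to keep the defaults from the table in one place and send only the values that differ. The sketch below does that. The defaults are copied from the table, while the names SYNONYM_PARAM_DEFAULTS and synonym_params are hypothetical.

```python
# Defaults copied from the parameter table above.
SYNONYM_PARAM_DEFAULTS = {
    "synonyms": False,
    "synonyms.analyzer": None,
    "synonyms.originalBoost": 1.0,
    "synonyms.synonymBoost": 1.0,
    "synonyms.disablePhraseQueries": False,
    "synonyms.constructPhrases": False,
}


def synonym_params(overrides):
    """Return query parameters for values that differ from the defaults."""
    params = {}
    for name, value in overrides.items():
        if name not in SYNONYM_PARAM_DEFAULTS:
            raise KeyError("unknown synonym parameter: %s" % name)
        if value == SYNONYM_PARAM_DEFAULTS[name]:
            continue  # same as the default, no need to send it
        if isinstance(value, bool):
            value = "true" if value else "false"
        params[name] = str(value)
    return params
```

Rejecting unknown names catches typos such as a misspelled synonyms.synonymBoost before the request is sent.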
Download the code and run:
mvn install
Since there are several branches depending on the Solr version, there's also a build script that will git checkout each branch, build it, and put it in the target/s3 directory:
./build_all_versions.sh
Basically, my strategy is to maintain a main master/solr-4.1.0 branch, with offshoot branches (solr-4.0.0 and solr-3.x) that are git rebase'd every time I need to build a new version.
Python-based unit tests are in the test/ directory. You can run them using:
# launches Solr on localhost:8983. Alternatively, you can just follow the "Getting Started" directions
./run_solr_for_unit_tests.sh /path/to/my/optional/solr-4.2.0.tgz
# run some Python unit tests against the local Solr on localhost:8983
nosetests test/
Currently I test against Solr 4.2.
- v1.3.0
  - Added support for Solr 4.3.0 (#219)
  - New way of loading Tokenizers and TokenFilters
  - New XML syntax for config in solrconfig.xml
- v1.2.3
  - Fixed #16
  - Verified support for Solr 4.2.0 with the 4.1.0 branch (unit tests passed)
  - Improved automation of unit tests
- v1.2.2
  - Added synonyms.constructPhrases option to fix issue #5
  - Added proper handling for phrase slop settings
- v1.2.1
  - Added support for Solr 4.1.0 (#4)
- v1.2
  - Added support for Solr 4.0.0 (#3)
- v1.1
  - Added support for Solr 3.6.1 and 3.6.2 (#1)
  - Added "Getting Started" instructions to clarify plugin usage (#2)
- v1.0
  - Initial release