
A text tagger based on Lucene / Solr, using FST technology

License: Apache License 2.0

Java 100.00%
named-entity-recognition nlp solr solr-plugin tagger

solrtexttagger's Introduction

OpenSextant

The Open Spatial Extraction and Tagging (OpenSextant) software provides geotagging and geocoding of unstructured textual data. The U.S. Government Joint Improvised Explosive Device Defeat Organization (JIEDDO) developed this capability in coordination with other U.S. government agencies and is pleased to provide it as open source software under an Apache 2.0 license. The software relies upon the open source General Architecture for Text Engineering (GATE) natural language processing software and the Apache Solr search software. Please see below for instructions on how to access the source code and binaries.

OpenSextant Suite

This suite comprises various projects for geospatial and temporal extraction. The core module is OpenSextantToolbox, which produces a GATE plugin and a toolkit for controlling the overall extraction and geocoding pipeline using that plugin.

Modules


Commons -- Common parent classes, data model and core utilities. TBD

Xponents -- Extractors

  • XText document conversion (to plain text)
  • XCoord coordinate extraction
  • XTemporal date/time extraction
  • FlexPat

OpenSextantToolbox -- A GATE-based plugin and various main programs for geotagging/geocoding

Gazetteer -- A Solr-based gazetteer supporting mainly NGA Geonames, USGS place data, and ad hoc catalogs

LanguageResources -- Linguistic tuning data

doc -- Documentation, user manuals, developer guides

Peer Projects


SolrTextTagger -- A text tagging solution for high-volume word lists or data sets

GISCore -- An API that manages GIS data formats.

  • geodesy geodetic primitives and routines used by OpenSextant and GISCore
  • giscore the main GISCore API which supports IO and data manipulation on GIS data

Additional content:

Testing -- (RELEASE TBD) test data and programs to give you ideas of what is possible

GeocoderEval -- (RELEASE TBD) a framework and ground truth for evaluating OpenSextant and other geotaggers

Getting Started Using OpenSextant

In the OpenSextant binary distribution you will find ./script/default.env. It contains OPENSEXTANT_HOME and other useful shell settings. The WinOS version is TBD.

To geocode files and folders, please use the reference script:

  $OPENSEXTANT_HOME/script/geocode.sh   <input> <output> <format>

where:

  • input is an input file or folder
  • output is an output file or folder; depends on format
  • format is the format of your output: one of GDB, CSV, Shapefile, WKT, KML

Getting Started Integrating OpenSextant


Javadoc is located at OPENSEXTANT_HOME/doc/javadoc. Typical ad hoc integration will be through the o.m.o.apps.SimpleGeocoder class, which leverages o.m.o.processing.TextInput on input and GeocodingResult/Geocoding as output classes.

Integration documentation is in progress, as of April 2013.

The main library JARs of interest are:

OpenSextantToolbox.jar, opensextant-apps.jar, opensextant-commons.jar

And the various Xponents: xtext.jar, xcoord.jar, xtemporal.jar, flexpat.jar

As of release time 2013-Q1, we are working on documenting and honing dependencies with other libraries, as well as our internal dependencies.

Getting Started Developing OpenSextant


For more information see ./doc/OpenSextantToolbox/doc/OpenSextant Developers Guide.docx

Set your maven proxy settings; see ./doc/developer/ for hints.

Ensure that JAVA_HOME environment variable is pointed at a Java 7 JDK.

Otherwise you may encounter Javadoc and/or compilation errors.

In the source tree, run "ant". This will build the various required components and build a release.

  cd ./opensextant

  # see that things compile
  ant compile

  # the release step compiles all modules and prepares a release
  ant release

Alternatively, Maven can be used to build Commons, Xponents, and SolrTextTagger. For example:

 cd Xponents
 mvn install 

But complete Maven build support is not planned at this time.

solrtexttagger's People

Contributors

dsmiley, jdeolive, jigarparekh80, jlleitschuh, joekiller, mikolajkania, mubaldino, treygrainger, westei


solrtexttagger's Issues

Configurable stopword handling

Discussion: #11 (comment)

In summary, if posInc > 1 then there was an omitted stopword. What should we do?

What we do now is cause an error at index time, and at query time finish any tags in-progress (i.e. a tag can't span the gap).

We might want a gap to be effectively ignored -- pretending posInc is the typical 1.

We might want an interesting wildcard-like match in which the tagger can know to accept all possible upcoming terms. At index time, a special wildcard token might be emitted that the tagger knows how to handle.

Summary of options:

  • error
  • tag break (query time only)
  • ignore
  • wildcard

And you might want different behavior at index & query time.

P.S. I have no need for this right now, but I want to record that this should ideally be configurable.
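The option set above could be modeled as a small policy enum. This is a hypothetical sketch with illustrative names (GapPolicy, applyPolicy) -- not the tagger's actual API:

```java
// Hypothetical sketch of the four proposed behaviors for a stopword gap
// (posInc > 1). Names here are illustrative, not the tagger's real code.
public class GapPolicyDemo {
    enum GapPolicy { ERROR, TAG_BREAK, IGNORE, WILDCARD }

    /** Decide what effective position increment to use when a gap is seen. */
    static int applyPolicy(int posInc, GapPolicy policy) {
        if (posInc <= 1) return posInc;  // no omitted stopword: nothing to decide
        switch (policy) {
            case ERROR:      // current index-time behavior: fail fast
                throw new IllegalArgumentException("stopword gap: posInc=" + posInc);
            case TAG_BREAK:  // current query-time behavior: finish in-progress tags
                return -1;   // sentinel meaning "close any open tags here"
            case IGNORE:     // pretend posInc is the typical 1
                return 1;
            case WILDCARD:   // caller would emit a special wildcard token instead
                return posInc;
            default:
                throw new AssertionError(policy);
        }
    }

    public static void main(String[] args) {
        System.out.println(applyPolicy(3, GapPolicy.IGNORE));     // prints 1
        System.out.println(applyPolicy(2, GapPolicy.TAG_BREAK));  // prints -1
    }
}
```

Index and query time could each carry their own GapPolicy setting, matching the observation that different behavior may be wanted at each.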

Publish SolrTextTagger releases to Maven Central

Publishing SolrTextTagger releases on Maven Central would ease its usage by other components.

Background:

I am in the process of implementing an Apache Stanbol Enhancement Engine that will use TaggerFstCorpus for in-memory entity linking -- suggesting entities for mentions in a processed text (STANBOL-1128).

To use a library in an Apache project it is preferred (quasi-required) that it be available on Maven Central. So having SolrTextTagger available on Maven Central would be really appreciated.

BTW: I would also be interested to know if SolrTextTagger is available on some other Maven server at the moment.

best
Rupert

Next best postingsFormat for fieldType

This is more of a question than an issue, and not terribly urgent, but would help if you could answer...

I noticed that a default Solr 5 instance (-Xmx 512m) ran out of memory after ingesting about 10 million terms into the FST. I have since increased the -Xmx to 6GB so I have some breathing room (~120 million terms by extrapolation), but I was wondering if you could recommend a postingsFormat for the tag fieldType that can spill over to disk (or work entirely from disk in the worst case). The boxes have SSDs so the disk penalty is not as great as with spinning disks.

I see that the possible values for postingsFormat (according to Dmitry Kan's comment on a page in the Solr ref guide) are - Lucene40, Lucene41, Pulsing41, SimpleText, Memory, BloomFilter, Direct, FSTPulsing41, FSTOrdPulsing41, FST41, and FSTOrd41. Going by the name I thought BloomFilter may be a good choice but Solr gives a runtime error. I tried removing the postingsFormat attribute and it works but I was wondering if there was some setting that is preferable after "Memory".

Also, my understanding is that I would have to reindex all the content if I changed the postingsFormat; is that correct?

Thanks in advance for your answers.

API change in Solr 6.3 [SOLR-9592]

The current version of SolrTextTagger does not work with Solr 6.3 because SolrIndexSearcher#getLeafReader was renamed to SolrIndexSearcher#getSlowAtomicReader (SOLR-9592).

Changing the code would mean that the most current version of SolrTextTagger would no longer work with Solr/Lucene versions < 6.3, so most likely this would require a new release to be used with Solr 6.3+.

In addition, the javadoc indicates that one should use IndexSearcher.leafContexts instead. However, this field is protected, so I am not sure how to use it.

Jericho 3.4 requires log4j-2.4.1 while Solr still uses 1.2.17

It appears that Solr is locked in still at 1.2.17 (log4j version: https://github.com/apache/lucene-solr/blob/master/lucene/ivy-versions.properties#L83 and slf4j-log4j12 version: https://github.com/apache/lucene-solr/blob/master/lucene/ivy-versions.properties#L296) while Jericho 3.4 uses the latest log4j library. When the SolrTextTagger hits the Jericho lib, it'll throw the error listed below.

Jericho's release notes state:

       - Upgraded to the following logger APIs:
         slf4j-api-1.7.12, log4j-2.4.1

Error:

o.a.s.s.SolrDispatchFilter null:java.lang.RuntimeException: java.lang.NoClassDefFoundError: org/apache/logging/log4j/LogManager
    at org.apache.solr.servlet.HttpSolrCall.sendError(HttpSolrCall.java:618)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:477)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:214)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:179)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
    at org.eclipse.jetty.server.Server.handle(Server.java:499)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
    at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.NoClassDefFoundError: org/apache/logging/log4j/LogManager
    at net.htmlparser.jericho.LoggerProviderLog4J.getLogger(LoggerProviderLog4J.java:35)
    at net.htmlparser.jericho.LoggerProviderLog4J.getSourceLogger(LoggerProviderLog4J.java:41)
    at net.htmlparser.jericho.Source.newLogger(Source.java:1685)
    at net.htmlparser.jericho.Source.<init>(Source.java:151)
    at net.htmlparser.jericho.StreamedSource.<init>(StreamedSource.java:235)
    at org.opensextant.solrtexttagger.HtmlOffsetCorrector.<init>(HtmlOffsetCorrector.java:46)
    at org.opensextant.solrtexttagger.TaggerRequestHandler.initOffsetCorrector(TaggerRequestHandler.java:251)
    at org.opensextant.solrtexttagger.TaggerRequestHandler.handleRequestBody(TaggerRequestHandler.java:154)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2068)
    at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:669)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:462)
    ... 22 more
Caused by: java.lang.ClassNotFoundException: org.apache.logging.log4j.LogManager
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at java.net.FactoryURLClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    ... 34 more

Issue with solrTextTagger2.3 and solr 6.3

Hi David,
I have configured Solr 6.3 to work with SolrTextTagger 2.3. I hope I did everything described in the configuration file. I have already indexed the cities.csv file.
But when I tried to tag a city name with the given example, I got the following error:

curl -X POST 'http://localhost:8983/solr/geonames/tag?fl=id,name,countrycode&wt=json&indent=on' -H 'Content-Type:text/plain' -d 'Hello New York City'

<title>Error 500 Server Error</title>

HTTP ERROR 500

Problem accessing /solr/geonames/tag. Reason:

    Server Error

Caused by:

java.lang.NoSuchMethodError: org.apache.solr.search.SolrIndexSearcher.getLeafReader()Lorg/apache/lucene/index/LeafReader;
	at org.opensextant.solrtexttagger.TaggerRequestHandler.handleRequestBody(TaggerRequestHandler.java:167)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:153)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:2213)
	at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
	at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:460)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:303)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:254)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1668)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1160)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1092)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
	at org.eclipse.jetty.server.Server.handle(Server.java:518)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:308)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:244)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
	at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
	at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceAndRun(ExecuteProduceConsume.java:246)
	at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:156)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:654)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:572)
	at java.lang.Thread.run(Thread.java:745)

I have created the .jar with the build and also tried downloading the .jar file from the given link. But no success.

Thanks a lot in advance,

Shrestha

Add support for PositionIncrementAttribute and PositionLengthAttribute

The SolrTextTagger should consider the PositionIncrementAttribute and PositionLengthAttribute when building the FST model.

The blog post Lucene's TokenStreams are actually graphs! does provide a good overview about how this is intended to work.

The main goal is to add support for Analyzer chains that create tokens with PositionIncrementAttribute == 0, but with the PositionLengthAttribute one can also correctly create FST arcs for more complex situations where there are alternate tokens (as shown by the "wi fi network" example in the linked blog post).

How would it work:

Let's assume a term with the label "thomas wi fi network" that gets analyzed similarly to the "wi fi network" text in the linked blog post.

The goal is to have the following three arcs in the FST

  1. thomas wi fi network
  2. thomas wifi network
  3. thomas hotspot

So what one needs to do is create a separate arc for each possible path through the directed acyclic graph represented by the TokenStream.

This also works for the other example given in the blog post: ショッピングセンター (shopping center) would result in the following two arcs:

  1. ショッピングセンター
  2. ショッピング センター
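The "one arc per path" idea can be sketched as a walk over every path through the token graph, where each token carries its text, absolute start position, and position length (as the PositionIncrement/PositionLength attributes would describe it). This is a standalone illustrative sketch with assumed names (Token, paths), not the tagger's code:

```java
// Hypothetical sketch: enumerate every path through a token graph. Each
// token is (text, start position, position length). Synonym/alternate
// tokens share a start position but may span several positions.
import java.util.ArrayList;
import java.util.List;

public class TokenGraphPaths {
    record Token(String text, int start, int len) {}

    /** All phrase strings formed by walking tokens from pos to endPos. */
    static List<String> paths(List<Token> tokens, int pos, int endPos) {
        if (pos == endPos) return List.of("");
        List<String> out = new ArrayList<>();
        for (Token t : tokens) {
            if (t.start() != pos) continue;                 // token must start here
            for (String rest : paths(tokens, pos + t.len(), endPos)) {
                out.add(rest.isEmpty() ? t.text() : t.text() + " " + rest);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "thomas wi fi network" with alternates "wifi" (len 2) and "hotspot" (len 3)
        List<Token> toks = List.of(
            new Token("thomas", 0, 1),
            new Token("wi", 1, 1), new Token("wifi", 1, 2), new Token("hotspot", 1, 3),
            new Token("fi", 2, 1),
            new Token("network", 3, 1));
        System.out.println(paths(toks, 0, 4));
        // → [thomas wi fi network, thomas wifi network, thomas hotspot]
    }
}
```

Each returned string corresponds to one FST arc as listed above.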

Implementation

By using both PositionIncrementAttribute AND PositionLengthAttribute it is possible to generate those arcs based on the tokens in the TokenStream.

In the 1.2 branch this needs to be done in the TaggerFstCorpus#analyze(..) method. This method would need to return an IntsRef[] array with one entry per path as described above.

For the 2.0 branch the ConcatenateFilter needs to build the strings as described above and emit them with a PositionIncrementAttribute == 0 to callers of its incrementToken() method. This should - AFAIK - cause Solr to index them correctly.

My plan is to try an implementation of this based on the 1.2 branch in the westei/SolrTextTagger fork.

Problem building SolrTextTagger with Lucene/Solr 4.7.0

I've just downloaded SolrTextTagger and added the required solr and lucene jars from Solr 4.7.0. I'm running into a compilation problem:

Buildfile: /Users/srosenthal/projects/solrtagger.d/SolrTextTagger/build.xml
compile:
[javac] /Users/srosenthal/projects/solrtagger.d/SolrTextTagger/build.xml:75: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
[javac] Compiling 13 source files to /Users/srosenthal/projects/solrtagger.d/SolrTextTagger/build
[javac] /Users/srosenthal/projects/solrtagger.d/SolrTextTagger/src/main/java/org/opensextant/solrtexttagger/TaggerRequestHandler.java:299: error: cannot find symbol
[javac] docBits = searcher.getDocSet(filterQuery).getBits();
[javac] ^
[javac] symbol: method getBits()
[javac] location: interface DocSet
[javac] 1 error
BUILD FAILED

I see the getBits() method in org.apache.lucene.util.OpenBitSet OK, so I'm not sure what's going on.
As a possible clue (and I'm definitely a Lucene newbie), I'm wondering whether https://issues.apache.org/jira/browse/LUCENE-5440 (Add LongFixedBitSet and replace usage of OpenBitSet), which went into Lucene/Solr 4.7, is relevant here?

Hope to hear back soon

-Simon

Field BoostFilter SearchComponent

I know that the SolrTextTagger is used by CareerBuilder to find interesting things in a user's query to then do other things (like boost or apply a filter). There is a cool Solr plugin by Ted Dunning at LucidWorks here: https://github.com/lucidworks/query-autofiltering-component that does this... although I have a bunch of concerns with it. Relevant blog: https://lucidworks.com/blog/2015/05/13/query-autofiltering-revisited-can-precise/

I think it would be cool to develop a SearchComponent similar to Ted's but based on the SolrTextTagger. It would build a "side-car index" (possibly held in memory -- configurable) and then use its results to either apply "fq" filter queries or dismax "bq" boost queries (or both). In the end, it should be much less code than Ted's, and it should have its analysis configurable via the Solr schema instead of being hard-coded.

Disclaimer: this is just an idea placeholder; I don't yet have plans to do this.

Solr 5.3 Refactor Breaks testTagStreaming

In the test testTagStreaming the document response has changed due to SOLR-7662. Per the comments in SOLR-7662, "javabin returns the primitive types of the fields while the text based writers return a IndexableField/StorableFIeld depends on whether you are in branch 5x or trunk".

The change results in the field values being returned with the following call as IndexableField/StorableField instead of the expected primitive value.

 assertEquals("Boston", refDoc.get().getFieldValue("name"));

How do you suggest adjusting for this?

I think this is the last bug for getting 5.3.1 working.

posinc = 1 error

I saw some text files with really long strings, like URLs or MD5 hashes and base64-encoded metadata. They appear to be giving the FST tagger heartburn, and the tagger throws this error:

REF: src/main/java/org/mitre/solr/tagger/TaggerFstCorpus.java

  if (posIncAtt.getPositionIncrement() != 1) {
    throw new IllegalArgumentException("term: " + text + " analyzed to a token with posinc != 1");
  }

My data is my data. I cannot really scrub the data before tagging. The FST tagger issue here may be a valid one, but we should figure out how to handle it more gracefully. Right now the whole document fails in OpSx PlaceNameMatcher.

Example data:
Run XText "convert.sh" on a simple PDF or other doc. Find the cached text file for that run. The bottom of the text file will have a XT:xxxxxxxxxxxxx ... long base64-encoded label.

I hope to reproduce the situation shortly.

Release version 2.0

Version 2.0-SNAPSHOT has been used quite a bit at MITRE and has had some enhanced tests -- it's solid. The memory usage is better and it's more flexible to build/maintain indices since it's based directly off of Lucene instead of building a custom persisted memory structure, and it's simpler code too. The tagging performance wasn't specifically measured, but it was in aggregate with the rest of OpenSextant and there was no marked difference -- that's success as far as I'm concerned.

What 2.0 lacks as of this writing is v1.2's ability to use richer text analysis, thanks to @westei. See #20. I'm a little conflicted on whether to just release it. I was about to announce that it'll happen next week anyway, but given this big feature mismatch and no truly pressing reason to release 2.0 quite yet, I should hold up and document some remaining issues.

When 2.0 does get released, the master branch will be renamed to 1_x, and MemPF will become master.

Port PhraseBuilder from v1.2 branch to master

I'd love to see the improvements made on the 1.2 branch ported to the MemPF branch (v2.0).

MemPF seems to work well in OpenSextant but it hasn't been as thoroughly evaluated. I suspect if Stanbol ports to MemPF, and if Rupert does his measurements as he's done before, it will become more clear through its tests, etc. how well MemPF does.

NullPointerException in TaggerRequestHandler.java:199

When I do:

curl -XPOST \
  'http://localhost:8983/solr/test/tag?overlaps=NO_SUB&tagsLimit=5000&fl=*' \
  -H 'Content-Type:text/plain' -d @example.txt

The core name is test. An unrelated question: the URL in the README.md is <host>:<port>/solr/tag, which, however, returns 404. In my case, <host>:<port>/solr/<core_name>/tag works.

The server returns (I extracted the trace from the XML result):

java.lang.NullPointerException
    at org.opensextant.solrtexttagger.TaggerRequestHandler$1.<init>(TaggerRequestHandler.java:199)
    at org.opensextant.solrtexttagger.TaggerRequestHandler.handleRequestBody(TaggerRequestHandler.java:168)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064)
    at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:640)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:436)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:227)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:196)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
    at org.eclipse.jetty.server.Server.handle(Server.java:497)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
    at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
    at java.lang.Thread.run(Thread.java:745)

I am using:

  • Solr: 5.2.1
  • SolrTextTagger 2.2
  • JRE: 1.8

schema.xml:

<schema name="test" version="1.5">
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>

    <fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="float" class="solr.TrieFloatField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" positionIncrementGap="0"/>

    <fieldType name="tint" class="solr.TrieIntField" precisionStep="8" positionIncrementGap="0"/>
    <fieldType name="tfloat" class="solr.TrieFloatField" precisionStep="8" positionIncrementGap="0"/>
    <fieldType name="tlong" class="solr.TrieLongField" precisionStep="8" positionIncrementGap="0"/>
    <fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8" positionIncrementGap="0"/>

        <fieldType name="tag" class="solr.TextField" positionIncrementGap="100" postingsFormat="Memory"
                           omitTermFreqAndPositions="true" omitNorms="true">
          <analyzer type="index">
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <filter class="solr.EnglishPossessiveFilterFactory" />
                <filter class="solr.ASCIIFoldingFilterFactory"/>
                <filter class="solr.LowerCaseFilterFactory" />

                <filter class="org.opensextant.solrtexttagger.ConcatenateFilterFactory" />
          </analyzer>
          <analyzer type="query">
                <tokenizer class="solr.StandardTokenizerFactory"/>
                <filter class="solr.EnglishPossessiveFilterFactory" />
                <filter class="solr.ASCIIFoldingFilterFactory"/>
                <filter class="solr.LowerCaseFilterFactory" />
          </analyzer>
        </fieldType>

        <field name="_version_" type="long" indexed="true" stored="true"/>
        <field name="surface_name" type="tag" indexed="true" stored="true"/>
        <field name="occurrences" type="int" indexed="false" stored="true"/>
        <field name="log_occurrences" type="double" indexed="false" stored="true"/>
</schema>

Part of solrconfig.xml:

  <requestHandler name="/tag" class="org.opensextant.solrtexttagger.TaggerRequestHandler">
        <lst name="defaults">
      <str name="field">surface_name</str>
      <str name="fq">*:*</str>
        </lst>
  </requestHandler>

Use Lucene MemoryIndex postings format instead of explicit FSTs

In my presentation on the text tagger at Lucene Revolution, I indicated that an experimental test of a single FST surprisingly had better compression than the pair that the tagger uses now. Using Lucene's "Memory" postings format puts all the terms into an FST and it also uses a compact encoding for the docId postings to save memory there. This could be used in place of the TaggerCorpus. There are other advantages too, such as not having a single expensive build moment -- it's effectively amortized during indexing. I'm not sure how it would affect tagging performance; we'll see.

I'll post more on this when I get started; probably tomorrow. This is a large internal change, so it'll go to a new branch, and a 2.0-SNAPSHOT version.

Supplementing filter query with request filter query.

Apologies if this is the wrong forum for such a question, but I didn't find a forum or mailing list for the project. We're using the text tagger along with the gazetteer with a configuration that sets some defaults for the filter query. The configuration per se was taken from https://github.com/OpenSextant/Xponents/blob/master/solr/gazetteer/conf/solrconfig.xml#L810.

However, we need to be able to specify an additional constraint "on the fly". For example to constrain tagging to a specific country code. The behaviour as I understand it is that if there is a configured filter query in solrconfig.xml it trumps any filter query supplied by the request.

I couldn't find a way (that wasn't a total hack) to join the two filter queries. What I've done for now is patch TaggerRequestHandler and made setTopInitArgsAsInvariants() protected so I can subclass it and do the "ANDing" there.

I guess my questions are (a) is there is a way to do this already that I am missing, and (b) if not, is something like what I've done acceptable?

Sentence segmentation

It would be neat to add some sort of sentence segmentation to the query-time text analysis to trigger a break in tagging. For example (a very silly one!), suppose the input document text is:
"I want to buy something new. England is a nice place to visit." Then, assuming "New England" is in the dictionary (and possibly "England", but that doesn't matter), the tagger will currently find "New England", which is undesirable. Of course this is a "naive tagger", as it was put to me when I joined the project; but nonetheless this sort of rule seems to me a good one to have at this layer in an overall system.

This could be implemented with a tokenizer that tokenizes sentences using Java's BreakIterator. It would set a new attribute that indicates the starting and ending offset of the sentence. Then the token would get split by other standard lucene components like WordDelimiterFilter which breaks on whitespace. Ultimately the Tagger could look for the custom attribute, and check if the last word's offsets don't fall within the current sentence as indicated by the attribute.
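As a rough sketch of the BreakIterator part of the idea above -- the class and method names (SentenceSpans, sentences) are illustrative, not a proposed API:

```java
// Sketch: Java's BreakIterator yields the [start, end) sentence offsets that
// the proposed custom attribute would carry alongside each token.
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSpans {
    /** Returns [startOffset, endOffset) pairs, one per sentence. */
    static List<int[]> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<int[]> spans = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            spans.add(new int[]{start, end});
        }
        return spans;
    }

    public static void main(String[] args) {
        String doc = "I want to buy something new. England is a nice place to visit.";
        for (int[] s : sentences(doc))
            System.out.println(doc.substring(s[0], s[1]).trim());
        // "new" and "England" fall in different spans, so a tag would not span them
    }
}
```

The Tagger would then compare a candidate tag's word offsets against the current sentence's span and break the tag when the last word falls outside it.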

But maybe sentence segmentation isn't aggressive enough. After all, shouldn't there be a tag break at nearly any punctuation?

Decide fate of ant build.xml

The official build is the Maven POM. The ant build.xml is legacy; I forgot to remove it as part of an OpenSextant reshuffle, but at least one user has voiced to me a strong preference for it.

Possible actions:

  • remove it
  • update it (not likely; I don't want to maintain it), perhaps with a disclaimer that it may be out of date
  • generate it automatically with maven (possible?).

solr 5.2

Hi David

Thanks for the good work with SolrTextTagger.
Just wanted to say there seems to be a problem when updating to Solr 5.2.
Below is a snippet of the problem.

Cheers
A.

java.lang.NoSuchMethodError: org.apache.lucene.index.Terms.iterator(Lorg/apache/lucene/index/TermsEnum;)Lorg/apache/lucene/index/TermsEnum;
    at org.opensextant.solrtexttagger.Tagger.process(Tagger.java:160)
    at org.opensextant.solrtexttagger.TaggerRequestHandler.handleRequestBody(TaggerRequestHandler.java:223)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)

Maven build fails due to UnsupportedTokenException

FYI -- just pulled the MemPF branch to build for myself. I was looking for the latest bug fixes in the 2.0 snapshot. No rush. But I did see this build test failure:

Results :

Failed tests: testUnsupportedMultiTokenSynonyms(org.opensextant.solrtexttagger.PosIncPosLenTaggerTest): expects an UnsupportedTokenException!

Tests in error:
testWhitespaceTokenizerWithWordDelimiterFilter(org.opensextant.solrtexttagger.PosIncPosLenTaggerTest): Null Request Handler '/tag2'
testRemovalOfAlternateTokens(org.opensextant.solrtexttagger.PosIncPosLenTaggerTest): Null Request Handler '/tag2'
testWordDelimiter(org.opensextant.solrtexttagger.PosIncPosLenTaggerTest): Null Request Handler '/tag2'
testAlternates(org.opensextant.solrtexttagger.PosIncPosLenTaggerTest): Null Request Handler '/tag2'
testSynonymsAndDelimiterCombined(org.opensextant.solrtexttagger.PosIncPosLenTaggerTest): Null Request Handler '/tag2'
testStopWords(org.opensextant.solrtexttagger.PosIncPosLenTaggerTest): Null Request Handler '/tag2'

SolrTextTagger directly with Lucene?

Hi David!

In this blog post you mentioned that:

If Solr adds more weight than you want, then you can just depend on Lucene, since most of the functionality doesn't depend on Solr.

Would you have general guidelines on how it could be used directly with Lucene (how it should be called, etc.)?

HDFS support for SolrTextTagger when using EmbeddedSolrServer

Hi David,
v2.0 still pumping away here at MITRE.

This is a request for an example of how to use STT in a read-only mode in a Hadoop Mapper or Spark situation. The use of EmbeddedSolrServer is crucial there, as one would want to minimize the network I/O that an HTTP server incurs. However, EmbeddedSolrServer is impossible to get working in a simple Hadoop Mapper.

I wonder if you have encountered any requests for supporting SolrTextTagger in Big Data environments using this approach? I did see Sujit Pal's post on his SODA work; however, that appears to use a bank of RESTful instances of SolrTextTagger.

... The power we would have if we could deploy SolrTextTagger + EmbeddedSolrServer -- I fired off 1000 mappers yesterday, each doing about 10 docs/sec (well, tweets). 10,000 tweets/sec would be good. But... the Solr mechanics in this situation are impenetrable.

This forum vs. Apache Solr: from the gist of the "EmbeddedSolrServer" discussion in the Solr camp, I sense it's not well supported or cared for, so I don't feel posting this as an issue there is worthwhile. The driving force would be SolrTextTagger + EmbeddedSolrServer + Big Data scaling. Hence I'm here.

Marc

Multi-word synonyms

I'm experimenting with different analysis that exercises your PhraseBuilder, specifically using multi-word synonyms, e.g. the input dictionary name "DNS" mapping to the alternate "domain name service". Based on the latest code, simply replace the DNS entry with this:

# Note: when expand=true both are synonyms of each other, but when
#  expand=false then the first term (DNS) is the target replacement for
#  the remainder (Domain Name Service).
DNS, Domain Name Service

So it doesn't work, which I suspect you already knew:

10:30:20.558 [TEST-PosIncPosLenTaggerTest.testAlternates-seed#[4C6FB28D9C934DC]] INFO o.o.solrtexttagger.TaggerFstCorpus - Building TaggerFstCorpus
10:30:20.558 [TEST-PosIncPosLenTaggerTest.testAlternates-seed#[4C6FB28D9C934DC]] DEBUG o.o.solrtexttagger.TaggerFstCorpus - Building word dict FST...
10:30:20.559 [TEST-PosIncPosLenTaggerTest.testAlternates-seed#[4C6FB28D9C934DC]] DEBUG o.o.solrtexttagger.TaggerFstCorpus - Building temporary phrase working set...
10:30:20.560 [TEST-PosIncPosLenTaggerTest.testAlternates-seed#[4C6FB28D9C934DC]] TRACE o.o.solrtexttagger.TaggerFstCorpus - Token: dns, posInc: 1, posLen: 1, offset: [0,3], termId 0
10:30:20.560 [TEST-PosIncPosLenTaggerTest.testAlternates-seed#[4C6FB28D9C934DC]] TRACE o.o.solrtexttagger.TaggerFstCorpus - Token: domain, posInc: 0, posLen: 1, offset: [0,3], termId 1
10:30:20.561 [TEST-PosIncPosLenTaggerTest.testAlternates-seed#[4C6FB28D9C934DC]] TRACE o.o.solrtexttagger.TaggerFstCorpus - Token: name, posInc: 1, posLen: 1, offset: [0,3], termId 2
10:30:20.561 [TEST-PosIncPosLenTaggerTest.testAlternates-seed#[4C6FB28D9C934DC]] ERROR o.o.solrtexttagger.PhraseBuilder - Unable to append term[offset: [0,3], posInc: 1, posLen 1] to any phrase.

I can see how there could be ambiguity about what to do with 'name' looking at the raw token metadata. Might it be possible to simply append the token "name" to the newly created partial phrase "domain", on the grounds that "domain" was the last token emitted? That seems like a practical solution. I don't know if it would break something else, but it appears worth trying.

htmlOffsetAdjust option

There should be an option similar to xmlOffsetAdjust but for parsing HTML. It'll use the Jericho HTML parser (EPL & LGPL dual-licensed) for the tagging. When this option is enabled, there need not be a top-level element to contain the text, and some tags (e.g. BR) are assumed to self-close even when not written that way.

Tagging UpdateRequestProcessor

It would be cool to have a Solr URP that does text tagging and applies the results as fields. The documents referenced by the tagging might include metadata that can be copied to the current document going through the URP. It might very well be demonstrative in nature, as applications are likely to have specific needs here. Nonetheless, it's better to start with something than from scratch.

Disclaimer: This is just a wish-list feature at this time. No plans yet.

just starting out

Hello,

I have downloaded the SolrTextTagger, and built my jar. I also have a current solr instance with the settings you have suggested.

I wanted to try out a sample dictionary (gazetteer), but I don't see one.
Is the format
"foo","bar" ?

This is amazingly cool code, I hope to get something running soon.

Thanks,
Evan
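(An editor's sketch, not an official sample: the field names below are assumptions, not necessarily the project's actual schema. The dictionary is simply whatever documents you index into Solr, so a CSV along these lines could be loaded through Solr's /update/csv handler, with the name field run through the analyzer the tagger is configured to use.)

```csv
id,name
1,Boston
2,New York
3,New England
```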

Solr 6.0 support

A first cut at compiling against Lucene 6.x showed that the lowest-level APIs have changed, specifically in the tagging attribute implementation:
org.apache.lucene.util.AttributeImpl has changed substantially, impacting org.opensextant.solrtexttagger.TaggingAttributeImpl.

This was just a quick look at Solr 6. Not urgent at all.

prefix word phrase matching

Hi David,

First of all I want to thank you for the contribution. Solr text tagger has really helped in building the solution we wanted.

However, as an extension, I am looking to use ShingleFilterFactory instead of ConcatenateFilter, the reason being that I also want to enable partial matches as suggestions.

But I only want to enable suggestions that match from the left edge, not in the middle.

For example, if the text is "Quick brown fox jumped",
then the expected tokens should be:
"Quick"
"Quick brown"
"Quick brown fox"
"Quick brown fox jumped"

But ShingleFilter also produces extra tokens such as:
"brown fox"
"fox jumped"
etc.

I would be really grateful if you can guide me on how to achieve it.
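For clarity, the desired left-anchored token set can be sketched standalone in plain Java (this is just the expected output, not a Lucene filter; a real solution would need a custom token filter, since I'm not aware of a ShingleFilter mode that anchors to the left edge):

```java
import java.util.ArrayList;
import java.util.List;

public class LeftEdgeShingles {
    /** Emits every whitespace-delimited prefix of the input phrase, shortest first. */
    public static List<String> shingles(String text) {
        List<String> out = new ArrayList<>();
        StringBuilder sb = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(word);
            out.add(sb.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        shingles("Quick brown fox jumped").forEach(System.out::println);
    }
}
```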

Best,
Amit

nonTaggableTags option

Sometimes when submitting HTML markup to tag, you don't want tagger tags to enclose certain elements (confusingly, also called "tags"). The elements "script" and "style" are already stripped out by Lucene's HTMLStripCharFilter. But you might not want to tag text inside "a" (anchor) elements, because your application is going to insert links and doesn't want those links to interfere with existing ones (no overlaps).

I'll add a nonTaggableTags option that is a comma-delimited list of HTML element (tag) names that, if found to overlap with a candidate tagger tag, will cause that tagger tag to be omitted. For now, this option will only work when htmlOffsetAdjust is true, but could be easily modified later for xmlOffsetAdjust likewise.

Release version 1.2

I think the current master branch, version 1.2-SNAPSHOT is ready to be released as 1.2. Rupert, let me know when you concur and I'll push a release to Maven central.

Possible bug in merging default initArgs for Tagger request handler with request params

I noticed that the consultant who was working on our deployment of the tagger made the following change to TaggerRequestHandler.java:


*** 354,360 ****
return;//short circuit; nothing to do
SolrParams topInvariants = new MapSolrParams(map);
// By putting putting the top level into the 1st arg, it overrides request params in 2nd arg.
! req.setParams(SolrParams.wrapDefaults(topInvariants, req.getParams()));
}

--- 354,362 ----
return;//short circuit; nothing to do
SolrParams topInvariants = new MapSolrParams(map);
// By putting putting the top level into the 1st arg, it overrides request params in 2nd arg.
! // Fixed, this was merging in the wrong direction, Francois Schiettecatte
! // req.setParams(SolrParams.wrapDefaults(topInvariants, req.getParams()));
! req.setParams(SolrParams.wrapDefaults(req.getParams(), topInvariants));

}
As far as I can see, it's a legitimate bug and fix, not corrected in current master (our source snapshot was from a year ago).
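For reference, the precedence at stake in the diff can be mimicked in plain Java (a sketch of SolrParams.wrapDefaults semantics, where the first argument wins; this is not Solr's actual implementation, and the parameter values are illustrative). Note that the method's name in the diff, setTopInitArgsAsInvariants, suggests the original precedence (configured values win) may have been intentional:

```java
import java.util.HashMap;
import java.util.Map;

public class WrapDefaultsDemo {
    /** Mimics SolrParams.wrapDefaults(params, defaults): params override defaults. */
    public static String get(Map<String, String> params, Map<String, String> defaults, String key) {
        String v = params.get(key);
        return v != null ? v : defaults.get(key);
    }

    public static void main(String[] args) {
        Map<String, String> configured = new HashMap<>();
        configured.put("fq", "category:place");   // from solrconfig.xml
        Map<String, String> request = new HashMap<>();
        request.put("fq", "cc:FR");               // from the HTTP request

        // Original code: configured values win (true invariants).
        System.out.println(get(configured, request, "fq"));
        // Patched code: request values win, so "invariants" become mere defaults.
        System.out.println(get(request, configured, "fq"));
    }
}
```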

cheers

-Simon

Adjust offsets for balancing Html/Xml elements in source text

If the input is XML/HTML, we can strip it via the HTMLStripCharFilter provided by Lucene text analysis. This ensures that the tagging won't try to tag the XML markup itself. Lucene takes care of mapping offsets such that the offsets the tagger returns point into the original text (XML in this case). However, if you were to use this information to insert new markup reflecting a tagger match, there is a distinct possibility that doing so would produce an imbalanced DOM structure, resulting in an error. For example, if the source text was:

Hello David <b>Wayne Smiley</b>.

And if "David Wayne" were in the gazetteer / corpus, then the tagger would return the obvious offsets above and, if you were to insert, say, an anchor element, it would result in incorrect XML:

Hello <a>David <b>Wayne</a> Smiley</b>.

I'd like to add a feature such that, at least in this case, the tag would be omitted. In other cases, the offsets need to be adjusted around opening/closing elements as appropriate. For example, if this was the input:

<p>David <b>Wayne</b></p>

Then the offsets should be adjusted such that inserting an anchor tag would yield:

<p><a>David <b>Wayne</b></a></p>

Multilingual support

This issue aims to discuss things related to using the SolrTextTagger to process texts in different languages and tag them against a vocabulary with labels in multiple languages (e.g. freebase.org).

Multilingual Vocabularies

Expected properties of the vocabulary (numbered to allow referring to them later in the text):

  • (1) defines labels in different languages
  • (2) labels without a language tag should be used for all languages
  • (3) not all entities define labels in all languages
  • (4) for uncommon languages, only a few entities define labels

Within the Solr index, labels in different languages will be stored in different fields (as users will want to configure different analyzers). For some languages a dynamic field with a generic text analyzer could be used, e.g.:

<field name="label-en" type="text-en" ... />
<field name="label-de" type="text-de" ... />
<!-- other label fields for specific languages -->
<!-- finally the field for labels without language and
       a dynamic field for other languages -->
<field name="label" type="text-gen" ... />
<dynamicField name="label-*" type="text-gen" ... />

Multilingual Tagging Process

Assuming that we know the language of the processed text (parsed or detected), we would like to tag the content using labels of the detected language as well as the default labels (2).

To achieve this I see several solutions:

  1. Building language-specific FST corpora and calling the SolrTextTagger twice: To allow this, the TaggerFstCorpus needs to be adapted to NOT throw a RuntimeException on documents where the stored field is not present, as this will happen because of (3). Also, building the FST is inefficient for (4), as it iterates over all documents in the index and most of them will be skipped because they do not define a label in that language. Another potential drawback is that the TagClusterReducer will only work within a single language; results of the two calls will still need to be merged/reduced.
  2. Building language-specific FST corpora that also include the default labels (2): While this would allow a single FST corpus to be used for tagging a text against a multilingual vocabulary, it would cause a lot of duplication, especially for vocabularies that contain a lot of default labels. TaggerFstCorpus would need to learn some new tricks, as it would need to be built from two fields with potentially different analyzers. The problem of different analyzers would also affect the Tagger, as it uses the same analyzer to process the parsed text. If the Tagger only used the analyzer defined by the field for the language of the parsed text, one would risk missed matches for default labels.
  3. Building a multilingual FST corpus: This would require merging labels in different languages (stored in different fields using different analyzers) into a single FST corpus. This corpus would need to be aware of the languages in which phrases are present, so that it can only suggest matches with labels of the language of the text as well as default labels. As with the 2nd option, one would also need to solve the problem of supporting two analyzers in the Tagger.

For now I am aiming for the first option, as it requires the fewest changes to SolrTextTagger, but I would be eager to get opinions/feedback on the other two options.

best
Rupert

Update to Solr 4.4

The update to Solr 4.4 needs some minor code changes because of changed APIs.

In addition, Solr 4.4 forces the StopFilter to use posInc > 1 values (see LUCENE-4963).

This might cause existing configurations to no longer work with SolrTextTagger, as

  • at FST generation time, SolrTextTagger will throw exceptions when encountering such posInc values
  • at tagging time, any advancing tags are completed on posInc values > 1, meaning that entities containing stopwords will no longer be tagged

Enhance README

  • Convert to Markdown
  • Add note on Solr version support
  • Add note on how to use Embedded, and need to create a special query class. Probably just point to the test.
  • When applicable, merge relevant feature docs in #20 (text analysis)

Write a "getting started" how-to.

Assume the user knows nothing about Solr but can nonetheless be directed to install Solr following Solr's installation instructions. The user might not know anything about text analysis either, but we can provide a sample.

Can you recognize sentence or paragraph boundaries when tagging a large text field?

One large text field which we tag is yielding a lot of erroneous multiword tags, due mostly to a large number of embedded newline characters. A simple (contrived) example of what we see:

I like my vitamin \n
A good time was had by all.

Since 'vitamin A' is in our tag dictionary, it will be tagged in this text if we use the standard tokenizer or the whitespace tokenizer. I've been playing around with adding a MappingCharFilter to the query analyzer, which substitutes an arbitrary non-space character for a newline (I'm using a Hebrew aleph) that can't occur in the English text or in our tag dictionary, followed by the standard tokenizer. This inserts a junk character between 'vitamin' and 'A' so no tag will be found. However, this seems to be exquisitely sensitive to the presence or absence of spaces around the '\n', so I don't think it's robust enough.
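For the record, the char-filter experiment described above looks roughly like this in the query analyzer (a sketch; the mapping file name and the aleph substitute are illustrative choices, not a recommendation):

```xml
<analyzer type="query">
  <!-- mapping-newline.txt contains the single rule:  "\n" => "\u05d0"  -->
  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-newline.txt"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```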

In an ideal world, I'd like the tagger to be able to recognize a new (tagger-specific) Lucene token attribute, ENDHERE, which would signal to the FST that this token is a boundary/terminal and not to look beyond it once a partial tag has been discovered. Obviously one would need some way of attaching this attribute to a token (presumably by extending existing tokenizers and filters). I'm not a Lucene expert, so I have no idea whether this is even feasible, which is why I'm reaching out here.

If all else fails I'll have to segment the text somehow upstream; there will probably be a performance hit (our workflow is all in Python), but there will be fewer constraints compared to working within the Lucene analysis framework.

Comments welcome; maybe someone has solved this problem already.

Enhancement suggestion: tagging multiple text fields concurrently in a single request

Would it be feasible to extend the API so that one could submit several separate text fields to be tagged in a single HTTP request, and have the tagging done concurrently (in multiple threads)? I suggest this because (a) we have a use case for this, and (b) it would be nice to take advantage of multicore/multiprocessor environments where possible.

At the moment I'm achieving concurrency in my application (written in Python), but that involves creating threads, each of which then has to issue its own HTTP request.

Thoughts?
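Client-side, the fan-out could be sketched like this (plain Java; tagOne is a hypothetical stand-in for the per-field HTTP call to the tagger):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ConcurrentTagger {
    /** Applies tagOne to each text in parallel, preserving input order. */
    public static List<String> tagAll(List<String> texts,
                                      Function<String, String> tagOne,
                                      int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String t : texts) {
                futures.add(pool.submit(() -> tagOne.apply(t)));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                results.add(f.get()); // rethrows any per-task failure
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // The lambda stands in for an HTTP POST to the tagger endpoint.
        System.out.println(tagAll(List.of("a", "b", "c"), s -> "tagged:" + s, 3));
    }
}
```

Doing the same fan-out server-side would save the per-request HTTP overhead, which is the gist of the suggestion.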

Distributed Requests

Hi,

I have a question concerning SolrCloud.
Is the TaggerRequestHandler capable of performing distributed requests over multiple shards?
I know that the standard Solr select handler is, and that it can be adjusted using the shards query parameter.

Thanks, Martin

Tagger should support non-Integer unique key

The tagger currently requires an integer unique key. Ideally this could be any valid type (especially string!).

The readme refers to OPENSEXTANT-73; I assume that is now moving here?

Change History, etc

David,

greetings. Hope all is well with you.

Could you please summarize any functional changes in going to 2.1?
I see the Solr requirement now sits at about Solr 5.2+, along with Java 7+.

A really basic change history and the integration requirements (versions) would help, in the README or in a second file on changes/integration.

I'd like to start looking at Solr 5.3 in the Xponents taggers (and so SolrTT 2.1).

thanks,
Marc

TermPrefixCursor incompatible with 5.3.0

In Solr (Lucene) 5.3.0, the way that deleted docs are detected changed. The issue is https://issues.apache.org/jira/browse/LUCENE-6553, and the comment there was: "The postings, spans and scorer APIs no longer take an acceptDocs parameter. Live docs are now always checked on top of these APIs."

So in TermPrefixCursor.java the following call no longer uses the liveDocs, which causes a few tests to fail:

postingsEnum = termsEnum.postings(liveDocs, postingsEnum, PostingsEnum.NONE);

Any advice on how to implement LUCENE-6553 within SolrTextTagger?
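The post-LUCENE-6553 pattern is to check live docs yourself on top of the enum. Ignoring the Lucene types, the control flow can be mimicked standalone (BitSet stands in for the Bits liveDocs; this is a sketch of the pattern, not the actual TermPrefixCursor fix):

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class LiveDocsFilterDemo {
    /**
     * Pre-5.3, postings(liveDocs, ...) skipped deleted docs for you.
     * Post-5.3, you iterate everything and test liveDocs per doc ID yourself.
     * liveDocs == null means "no deletions", i.e. every doc is live.
     */
    public static List<Integer> liveOnly(int[] postedDocIds, BitSet liveDocs) {
        List<Integer> out = new ArrayList<>();
        for (int docId : postedDocIds) {
            if (liveDocs == null || liveDocs.get(docId)) {
                out.add(docId);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        BitSet live = new BitSet();
        live.set(0);
        live.set(2);  // doc 1 is deleted
        System.out.println(liveOnly(new int[]{0, 1, 2}, live)); // [0, 2]
    }
}
```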
