I'm evaluating Mensa for keyword matching against textual data. I'm impressed with the advertised functionality, but I have a question about how punctuation and case insensitivity work.
I've written a small sample program that matches against strings, following the examples in the Mensa Wiki. The only way I can get punctuation and case insensitivity to work is if my ITextSource implementation tokenizes the text to include or ignore punctuation itself, I explicitly define my keywords in all lowercase, and I convert the symbols returned by ITextSource to lowercase.
Based on the documentation, it seems I should be able to configure the IKeywords to be case insensitive (which is what it reports as its default configuration) and have matching work without my ITextSource doing anything special. That does not appear to be the case.
Below is my sample test program. The second match iterator (over the mixed-case text) does not match any keywords unless I modify MyTextSource to convert each parsed symbol to lowercase.
import java.io.IOException;
import java.util.Iterator;
import com.dell.mensa.IFactory;
import com.dell.mensa.IKeyword;
import com.dell.mensa.IKeywords;
import com.dell.mensa.IMatch;
import com.dell.mensa.ITailBuffer;
import com.dell.mensa.ITextSource;
import com.dell.mensa.impl.generic.AbstractTextSource;
import com.dell.mensa.impl.generic.AhoCorasickMachine;
import com.dell.mensa.impl.generic.Factory;
import com.dell.mensa.impl.generic.Keyword;
import com.dell.mensa.impl.generic.Keywords;
public class MensaTest {

    public static void main(String[] args) throws Exception {
        IFactory<String> factory = new Factory<>();
        AhoCorasickMachine<String> machine = new AhoCorasickMachine<>(factory);
        IKeywords<String> keywords = new Keywords<>();

        // Keywords are defined in lowercase; print the sensitivity flags each one reports.
        Keyword<String> k1 = new Keyword<>(new String[] {"free", "buffet", "breakfast"});
        System.out.println("k1 - case sensitive: " + k1.isCaseSensitive());
        System.out.println("k1 - punctuation: " + k1.isPunctuationSensitive());
        keywords.add(k1);

        IKeyword<String> k2 = new Keyword<>(new String[] {"free", "breakfast"});
        System.out.println("k2 - case sensitive: " + k2.isCaseSensitive());
        System.out.println("k2 - punctuation: " + k2.isPunctuationSensitive());
        keywords.add(k2);

        IKeyword<String> k3 = new Keyword<>(new String[] {"parking"});
        System.out.println("k3 - case sensitive: " + k3.isCaseSensitive());
        System.out.println("k3 - punctuation: " + k3.isPunctuationSensitive());
        keywords.add(k3);

        machine.build(keywords);

        // Test 1: lowercase text with assorted punctuation and whitespace.
        String text1 = "free breakfast, and free;buffet,breakfast plus \tparking\t";
        System.out.println("\nTesting punctuation insensitivity with text: " + text1);
        ITextSource<String> textSource = new MyTextSource(text1);
        try {
            textSource.open();
            Iterator<IMatch<String>> iterator = machine.matchIterator(textSource);
            while (iterator.hasNext()) {
                System.out.println("Match found: " + iterator.next());
            }
        }
        finally {
            textSource.close();
        }

        // Test 2: mixed-case text; this is the case that fails to match.
        String text2 = "Free Breakfast, Free Buffet Breakfast and Parking";
        System.out.println("\nTesting case insensitivity with text: " + text2);
        ITextSource<String> textSource2 = new MyTextSource(text2);
        try {
            textSource2.open();
            Iterator<IMatch<String>> iterator = machine.matchIterator(textSource2);
            while (iterator.hasNext()) {
                System.out.println("Match found: " + iterator.next());
            }
        }
        finally {
            textSource2.close();
        }
    }

    public static class MyTextSource extends AbstractTextSource<String> {
        /**
         * The input text parsed into {@link String} words.
         */
        private String[] symbols;

        /**
         * The index of the next available symbol to be read.
         */
        private int position;

        private final String text;

        public MyTextSource(String text) {
            this.text = text;
        }

        @Override
        protected void closeImpl() throws IOException
        {
            symbols = null;
        }

        @Override
        protected void openImpl() throws IOException
        {
            // Split on runs of punctuation and whitespace; case is left unchanged.
            symbols = text.split("[-,.; \\t\\n]+");
            position = 0;
        }

        @Override
        protected String readImpl(final ITailBuffer<String> buffer_) throws IOException
        {
            if (position == symbols.length)
            {
                return null; // eof reached
            }
            final String symbol = symbols[position++];
            buffer_.add(symbol);
            return symbol;
        }
    }
}
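For reference, the lowercase workaround described above can be reproduced without Mensa at all. The following is only a minimal standalone sketch: the class name SymbolNormalizer and its normalize method are hypothetical and not part of the Mensa API. It uses the same delimiter pattern as MyTextSource.openImpl() and lowercases each token, which is the extra step the second test needs in order to match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Mensa: shows the normalization that
// MyTextSource currently has to perform itself for matches to be found.
final class SymbolNormalizer {

    private SymbolNormalizer() {
    }

    /** Splits text on punctuation/whitespace runs and lowercases each token. */
    static List<String> normalize(String text) {
        List<String> symbols = new ArrayList<>();
        for (String token : text.split("[-,.; \\t\\n]+")) {
            // split() yields an empty first token when the text starts with a delimiter.
            if (!token.isEmpty()) {
                // Locale.ROOT avoids locale-specific surprises (e.g. the Turkish dotless i).
                symbols.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return symbols;
    }

    public static void main(String[] args) {
        // prints [free, breakfast, free, buffet, breakfast, and, parking]
        System.out.println(normalize("Free Breakfast, Free Buffet Breakfast and Parking"));
    }
}
```

In the sample program, the equivalent change would be to lowercase each symbol in openImpl() or readImpl() before adding it to the tail buffer.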
Can you let me know whether case insensitivity should work without MyTextSource explicitly converting the text to lowercase? If not, what is the purpose of the caseSensitive and punctuationSensitive settings on the IKeywords?
Thanks!