qsft / mensa


Mensa is a generic, flexible, enhanced, and efficient Java implementation of a pattern matching state machine as described by the 1975 paper by Alfred V. Aho and Margaret J. Corasick: Efficient string matching: An aid to bibliographic search.

License: Apache License 2.0

Java 100.00%

mensa's People

Contributors

faseidl

mensa's Issues

Save automaton to file

Hi Mensa author,

I have used Mensa for matching strings and it works very well. I am trying to build a large automaton with several million words. As this takes time, I would like to save the automaton once it is built so that I can reuse it across multiple projects.

Mensa relies on Java generics, which seem to cause difficulties with serialization and deserialization. I added "implements Serializable" to several classes in the project, and that was enough to save a machine. The problem is that I cannot deserialize the machine once it has been saved to a file.
Can you please help me fix this issue, or give me some pointers that will help me find a solution?

Thanks in advance.
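
For reference, here is a minimal sketch of a serialization round trip for a generic, recursive structure. It does not use Mensa's classes: `Trie`, `Node`, `save`, and `load` are hypothetical names made up for this illustration. Generic type parameters are erased at runtime and are not themselves an obstacle. What matters is that every class reachable from the saved object implements `Serializable`, and that nested classes are `static` so they carry no hidden reference to an outer instance that may not be serializable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

public class SerializationSketch {

    /** Hypothetical trie node. Static, so it holds no reference to an outer instance. */
    static class Node<S extends Serializable> implements Serializable {
        private static final long serialVersionUID = 1L;
        final Map<S, Node<S>> next = new HashMap<>();
        boolean terminal;
    }

    /** Hypothetical stand-in for a keyword automaton (not Mensa's AhoCorasickMachine). */
    static class Trie<S extends Serializable> implements Serializable {
        private static final long serialVersionUID = 1L;
        private final Node<S> root = new Node<>();

        void add(S[] symbols) {
            Node<S> node = root;
            for (S symbol : symbols) {
                Node<S> child = node.next.get(symbol);
                if (child == null) {
                    child = new Node<>();
                    node.next.put(symbol, child);
                }
                node = child;
            }
            node.terminal = true;
        }

        boolean contains(S[] symbols) {
            Node<S> node = root;
            for (S symbol : symbols) {
                node = node.next.get(symbol);
                if (node == null) {
                    return false;
                }
            }
            return node.terminal;
        }
    }

    /** Serializes an object graph to bytes; a FileOutputStream would work the same way. */
    static byte[] save(Serializable object) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(object);
        }
        return bytes.toByteArray();
    }

    /** Reads the object graph back; the cast is unchecked because of type erasure. */
    @SuppressWarnings("unchecked")
    static <T> T load(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (T) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Trie<String> trie = new Trie<>();
        trie.add(new String[] {"free", "breakfast"});

        Trie<String> copy = load(save(trie));
        System.out.println(copy.contains(new String[] {"free", "breakfast"})); // prints "true"
    }
}
```

If deserialization still fails in a setup like this, the exception type usually narrows it down: `NotSerializableException` names a class in the graph that lacks `Serializable`, and `InvalidClassException` often points to a `serialVersionUID` mismatch between the class that wrote the file and the class that reads it. Default Java serialization also recurses through the object graph, so a very deeply linked structure can exhaust the stack.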

Bug Report

I noticed that the last line of the Apache 2.0 license is missing from
the version of the license you are using.

Compilation error on Java 1.7

Hello,
When I compile Mensa using Java 1.7, it fails with this error:

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project mensa: Compilation failure
[ERROR] /Users/rob/code/Mensa/mensa/src/main/java/com/dell/mensa/impl/generic/AhoCorasickMachine.java:[807,17] com.dell.mensa.impl.generic.AhoCorasickMachine.MatchIterator is not abstract and does not override abstract method remove() in java.util.Iterator

If I use java 1.8 it compiles ok.

See here for an explanation:
http://stackoverflow.com/questions/5425130/java-iterator-implementation-compile-error-does-not-override-abstract-method-re

You can insert a default implementation such as this somewhere around line 862 of AhoCorasickMachine.java:

    @Override
    public void remove() {
        throw new UnsupportedOperationException();
    }

Then it compiles and all tests pass under Java 1.7 (for me).
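
As a self-contained illustration of the same fix, here is a sketch of a read-only iterator that compiles on both Java 7 and Java 8. `ArrayIterator` and `join` are hypothetical names for this example, not part of Mensa. Before Java 8, `Iterator.remove()` was abstract, so every implementation had to define it; Java 8 turned it into a default method that throws `UnsupportedOperationException`, which is why the missing override only breaks the older compiler.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

public class ReadOnlyIteratorSketch {

    /** Hypothetical read-only iterator over an array. */
    static class ArrayIterator<T> implements Iterator<T> {
        private final T[] items;
        private int position;

        ArrayIterator(T[] items) {
            this.items = items;
        }

        @Override
        public boolean hasNext() {
            return position < items.length;
        }

        @Override
        public T next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            return items[position++];
        }

        // Required on Java 7 and earlier, where remove() is abstract.
        // On Java 8+ this matches the behavior of the inherited default method.
        @Override
        public void remove() {
            throw new UnsupportedOperationException("remove");
        }
    }

    /** Drains an iterator into a space-separated string. */
    static String join(Iterator<String> iterator) {
        StringBuilder result = new StringBuilder();
        while (iterator.hasNext()) {
            if (result.length() > 0) {
                result.append(' ');
            }
            result.append(iterator.next());
        }
        return result.toString();
    }

    public static void main(String[] args) {
        Iterator<String> iterator = new ArrayIterator<String>(new String[] {"free", "breakfast"});
        System.out.println(join(iterator)); // prints "free breakfast"
    }
}
```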

Punctuation and Case Insensitivity Question

I'm evaluating the use of Mensa to perform keyword matching against textual data. I'm impressed with the advertised functionality, but I have a question about how punctuation and case insensitivity work.

I've written a small sample program that will match against strings following the examples provided in the Mensa Wiki. The only way I am able to get the punctuation and case insensitivity to work is if my implementation of ITextSource tokenizes to include/ignore punctuation and I explicitly define my keywords as all lowercase and convert the symbols returned by ITextSource to lowercase.

Based on the documentation, it seems like I should be able to configure the IKeywords to be case insensitive (which is what it claims to be configured as, by default) and have it work without my ITextSource doing anything special. But this does not appear to be the case.

Below is my sample test program. The second match iterator does not successfully match the keywords unless I modify MyTextSource to convert each parsed symbol to lowercase.

import java.io.IOException;
import java.util.Iterator;

import com.dell.mensa.IFactory;
import com.dell.mensa.IKeyword;
import com.dell.mensa.IKeywords;
import com.dell.mensa.IMatch;
import com.dell.mensa.ITailBuffer;
import com.dell.mensa.ITextSource;
import com.dell.mensa.impl.generic.AbstractTextSource;
import com.dell.mensa.impl.generic.AhoCorasickMachine;
import com.dell.mensa.impl.generic.Factory;
import com.dell.mensa.impl.generic.Keyword;
import com.dell.mensa.impl.generic.Keywords;

public class MensaTest {
    
    public static void main(String[] args) throws Exception {
        IFactory<String> factory = new Factory<>();
        AhoCorasickMachine<String> machine = new AhoCorasickMachine<>(factory);
        
        IKeywords<String> keywords = new Keywords<>();
        Keyword<String> k1 = new Keyword<>(new String[] {"free", "buffet", "breakfast"});
        System.out.println("k1 - case sensitive: " + k1.isCaseSensitive());
        System.out.println("k1 - punctuation: " + k1.isPunctuationSensitive());
        keywords.add(k1);

        IKeyword<String> k2 = new Keyword<>(new String[] {"free", "breakfast"});
        System.out.println("k2 - case sensitive: " + k2.isCaseSensitive());
        System.out.println("k2 - punctuation: " + k2.isPunctuationSensitive());
        keywords.add(k2);

        IKeyword<String> k3 = new Keyword<>(new String[] {"parking"});
        System.out.println("k3 - case sensitive: " + k3.isCaseSensitive());
        System.out.println("k3 - punctuation: " + k3.isPunctuationSensitive());
        keywords.add(k3);

        machine.build(keywords);
        
        String text1 = "free    breakfast, and free;buffet,breakfast plus \tparking\t";
        System.out.println("\nTesting punctuation insensitivity with text: " + text1);
        ITextSource<String> textSource = new MyTextSource(text1);
        try {
            textSource.open();
            Iterator<IMatch<String>> iterator = machine.matchIterator(textSource);
            while (iterator.hasNext()) {
                System.out.println("Match found: " + iterator.next());
            }
        }
        finally {
            textSource.close();
        }
        
        String text2 = "Free Breakfast, Free Buffet Breakfast and Parking";
        System.out.println("\nTesting case insensitivity with text: " + text2);
        ITextSource<String> textSource2 = new MyTextSource(text2);
        try {
            textSource2.open();
            Iterator<IMatch<String>> iterator = machine.matchIterator(textSource2);
            while (iterator.hasNext()) {
                System.out.println("Match found: " + iterator.next());
            }
        }
        finally {
            textSource2.close();
        }       
    }
    
    public static class MyTextSource extends AbstractTextSource<String> {
        /**
         * The input text parsed into {@link String} words.
         */
        private String[] symbols;

        /**
         * The index of the next available symbol to be read.
         */
        private int position;
        
        private String text;
        
        public MyTextSource(String text) {
            this.text = text;
        }

        @Override
        protected void closeImpl() throws IOException
        {
            symbols = null;
        }

        @Override
        protected void openImpl() throws IOException
        {
            symbols = text.split("[-,.; \\t\\n]+");
            position = 0;
        }

        @Override
        protected String readImpl(final ITailBuffer<String> buffer_) throws IOException
        {
            if (position == symbols.length)
            {
                return null; // eof reached
            }

            final String symbol = symbols[position++];
            buffer_.add(symbol);

            return symbol;
        }   
    }
}

Can you let me know if I should be able to get the case insensitivity to work properly without having my MyTextSource explicitly convert the text to lowercase? If not, then what is the purpose of the IKeywords having a caseSensitive and punctuationSensitive setting?

Thanks!
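
For anyone reproducing this, the workaround described above (lowercasing symbols in the text source) can be sketched without any Mensa dependency. `SymbolNormalizerSketch` and `tokenize` are hypothetical names for this illustration; the delimiter pattern is the one from `openImpl()` in the test program. Whether Mensa is meant to perform this normalization itself is exactly the open question, so treat this only as a picture of what the text source currently has to do.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SymbolNormalizerSketch {

    /**
     * Splits text on the same delimiters as the test program and lowercases
     * each symbol, so that "Free" and "free" become the same symbol.
     */
    static List<String> tokenize(String text) {
        List<String> symbols = new ArrayList<>();
        for (String token : text.split("[-,.; \\t\\n]+")) {
            // split() yields an empty first token when the text starts with a delimiter.
            if (!token.isEmpty()) {
                // Locale.ROOT avoids locale-specific case mappings such as the Turkish dotless i.
                symbols.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return symbols;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Free Breakfast, Free Buffet Breakfast and Parking"));
        // prints "[free, breakfast, free, buffet, breakfast, and, parking]"
    }
}
```

A `readImpl()` that returned symbols normalized this way would hand the machine lowercase input matching the lowercase keywords, which is the behavior the second test needs.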
