oluies / scala-aho-corasick

This project forked from corruptmemory/scala-aho-corasick

A reasonably efficient implementation of Aho-Corasick in Scala

Home Page: http://www.corruptmemory.com

License: Apache License 2.0

scala-aho-corasick's Introduction

This is an imperative implementation of the Aho-Corasick string-matching algorithm written entirely in Scala. It is reasonably efficient, and is character-oriented rather than byte-oriented.

Usage

The library is easy to use: all it does is find dictionary strings in an input document. There are two ways to populate the builder. The first is the factory constructor, which takes a Seq of (word, data) pairs:

import com.corruptmemory.aho_corasick.AhoCorasickBuilder

val builder = AhoCorasickBuilder[Unit](List(("he",()),("she",()),("his",()),("hers",()),("her",())))
val finder = builder.build()

val results = finder.find("Several ushers rushed over to aid her in finding a seat.")
// => Vector(Match(10,he,he,()), Match(9,she,she,()), Match(10,her,her,()), Match(10,hers,hers,()), Match(18,he,he,()), Match(17,she,she,()), Match(34,he,he,()), Match(34,her,her,()))
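To make the offsets in the result above concrete, here is a minimal sketch of a naive reference scan. It is not the library's code and not Aho-Corasick: it simply tries every dictionary word at every offset, which is slow but easy to check by hand. The name naiveFind and its charMap parameter are assumptions made for illustration, and the local Match case class only mirrors the shape of the one this library returns.

```scala
// Local copy of the result shape, so the sketch is self-contained.
case class Match[T](start: Int, target: String, actual: String, data: T)

// Naive reference scan (NOT Aho-Corasick): tries every word at every offset.
// `naiveFind` and `charMap` are hypothetical names, not part of the library.
def naiveFind[T](dict: Seq[(String, T)], input: String,
                 charMap: Char => Char = _.toLower): Seq[Match[T]] =
  for {
    start <- input.indices
    (word, data) <- dict
    end = start + word.length
    if end <= input.length
    actual = input.substring(start, end)
    // Compare after mapping both sides, so matching ignores case by default.
    if actual.map(charMap) == word.map(charMap)
  } yield Match(start, word, actual, data)

val words = List("he", "she", "his", "hers", "her").map(w => (w, ()))
val hits = naiveFind(words, "Several ushers rushed over to aid her in finding a seat.")

// (start, target) pairs, ignoring result order.
val offsets = hits.map(m => (m.start, m.target)).toSet
```

The set of (start, target) pairs produced here is the same as in the result shown above, although the order of the matches differs. For instance, "she" at offset 9 is the "she" inside "ushers".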

You can also use the += operator on the builder to add elements:

builder += "it" -> ()

build()

Invoking the build() method returns a finder and clears all of the data in the builder. The builder can then be reused without interfering with finders that were already generated.

Matched results

Results are returned in a Match value that is defined as follows:

case class Match[T](start:Int,target:String,actual:String,data:T)

start is the offset, in characters, from the beginning of the input string to the first character of the match. target is the dictionary string that was searched for, and actual is the text actually matched in the input. target and actual can differ (for example, in case) because one of the optional arguments to the builder is a character map function that is applied to each character in the dictionary (trie) and to each character of the input during find. The default character map function maps every character to lower case, so matching is case-insensitive by default. Other useful character map functions could, for example, remove diacritical or accent marks.

data is arbitrary data associated with the dictionary entry that matched. You supply it when you add a word to the dictionary:

builder += "word" -> <data>

The data must conform to the builder's type parameter. In the usage example above the type was Unit, so () was supplied as the value.
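As an illustration of the character map idea described above, the following sketch shows one possible Char => Char function that lower-cases a character and drops its accent marks. The name foldChar is hypothetical, and how such a function is handed to the builder is not shown here because the name of that optional argument is not given in this README. The sketch relies only on java.text.Normalizer from the Java standard library.

```scala
import java.text.Normalizer

// Hypothetical character map: lower-cases and strips combining accent marks.
def foldChar(c: Char): Char = {
  // NFD splits an accented letter into a base letter plus combining marks.
  val decomposed = Normalizer.normalize(c.toString, Normalizer.Form.NFD)
  // Only keep the base letter when everything after it is a combining mark;
  // other decompositions (e.g. Hangul syllables) are left untouched.
  val marksOnly = decomposed.drop(1).forall { ch =>
    Character.getType(ch) == Character.NON_SPACING_MARK.toInt
  }
  val base = if (marksOnly) decomposed.charAt(0) else c
  base.toLower
}

val folded = "Café".map(foldChar) // "cafe"
```

With a map like this applied to both the dictionary and the input, a dictionary entry "cafe" would be reported with target "cafe" and actual "Café", which is the kind of difference between target and actual described above.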

License

This library is released under the Apache 2.0 license.
