Libtrie

This is an implementation of the prefix tree structure usable for looking up string keys and associated string data. The main goal of this library is to provide a means to query a structure without actually reading all the data into memory first.

This is achieved by first precomputing the trie from a plain text file. The created structure is then saved into a binary file on disk. When the file is loaded, it is mmapped and lazily read into memory as needed.
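The lazy-loading idea can be illustrated with Python's standard mmap module. This is only a sketch of the general technique, not the library's own loader: the file written here is a hypothetical stand-in for a compiled trie, and read_slice and demo are names made up for the example.

```python
import mmap
import os
import tempfile


def read_slice(path, offset, length):
    """Map a file read-only and copy out only the requested byte range."""
    with open(path, "rb") as f:
        # A length of 0 maps the whole file; the OS loads pages only on access.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[offset:offset + length]


def demo():
    # Hypothetical stand-in for a compiled file: 1 MiB of padding, then a marker.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"\0" * (1 << 20))
        tmp.write(b"node")
        path = tmp.name
    try:
        # Only the pages covering the last four bytes need to be touched.
        return read_slice(path, 1 << 20, 4)
    finally:
        os.remove(path)
```

Calling demo() returns the four marker bytes without the program ever reading the padding explicitly.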

Therefore, this library provides O(1) lookup in terms of the number of keys: the cost of a query depends on the length of the key, not on how many keys are stored.
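To see why the number of stored keys does not matter, consider a toy in-memory trie built from nested dictionaries. This is a sketch of the general data structure only; it says nothing about the library's binary layout, and build_trie and lookup here are illustrative names, not the library's API.

```python
def build_trie(pairs):
    """Build a nested-dict trie from (key, value) pairs."""
    root = {}
    for key, value in pairs:
        node = root
        for ch in key:
            node = node.setdefault(ch, {})
        # The None entry holds the values attached to the key ending here.
        node.setdefault(None, []).append(value)
    return root


def lookup(root, key):
    """Return (values, steps); steps counts the characters consumed."""
    node = root
    steps = 0
    for ch in key:
        node = node.get(ch)
        steps += 1
        if node is None:
            return [], steps
    return node.get(None, []), steps
```

The step count returned by lookup is bounded by the key length, however many keys were inserted.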

The size of the compiled trie on disk varies widely relative to the original file, depending on the actual keys and values. It can be as much as 15 times smaller or 10 times larger. The trie compresses best when the keys share many common prefixes and many values are equal.

The library is provided under the BSD 3-clause license. For details, see the LICENSE file.

Command line utilities

Apart from the shared object, this library provides two command line utilities. While there is no inherent problem with using the tools with data in any encoding, encodings other than UTF-8 were not tested and probably won't work out of the box. You should use UTF-8 everywhere anyway.

list-compile

This utility compiles the original file into a trie and serializes it. The input file should be a text file with Unix line endings and two columns separated by a delimiter, which may be any byte. The delimiter can be specified with the -d command line argument. The other available option is -e, which signifies that no data will be associated with the keys: the whole line will be stored in the trie as a key.
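The expected input format can be sketched as a small line parser. Several things here are assumptions made for illustration: the tab default, splitting at the first occurrence of the delimiter, and the names parse_line and keys_only (which mimics the effect described for -e). None of this is taken from the utility's source.

```python
def parse_line(line, delimiter="\t", keys_only=False):
    """Split one input line into (key, value).

    The tab default and first-occurrence split are assumptions.
    """
    line = line.rstrip("\n")
    if keys_only:
        # Mirrors the described -e behaviour: the whole line is the key.
        return line, None
    key, sep, value = line.partition(delimiter)
    if not sep:
        raise ValueError("delimiter not found in line: %r" % line)
    return key, value
```

For example, parse_line("walked\twalk\n") yields the key "walked" with the data "walk".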

By default, the library tries to compress the data. It does so by not storing the prefix that the data shares with its key, storing only the length of that prefix instead. This works very well for morphological data. Compression can be disabled with the -u argument.
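A minimal sketch of this kind of prefix compression follows, assuming the shared prefix is measured between a key and its associated data. It operates on Python strings and returns a tuple; the library's actual on-disk encoding is not described here, and the function names are made up for the example.

```python
def common_prefix_len(a, b):
    """Length of the longest prefix shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def compress(key, value):
    """Replace the prefix that value shares with key by its length."""
    n = common_prefix_len(key, value)
    return n, value[n:]


def decompress(key, packed):
    """Rebuild the original value from the key and the packed form."""
    n, suffix = packed
    return key[:n] + suffix
```

With morphological data such as a word form mapped to its lemma, most of the value is often covered by the key: compress("walked", "walk") stores only a prefix length of 4 and an empty suffix.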

The arguments can be reviewed by running the utility with the -h option.

If you pass - as input filename, the data will be read from standard input.

If the same key occurs more than once, its data will be concatenated as lines and stored together.
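The handling of repeated keys can be pictured as follows. This sketch assumes the data items are joined with a newline in input order; merge_duplicates is a hypothetical helper, not part of the utility.

```python
def merge_duplicates(pairs):
    """Group values by key and join each group with newlines."""
    merged = {}
    for key, value in pairs:
        merged.setdefault(key, []).append(value)
    # Assumption: values keep the order in which they appeared in the input.
    return {key: "\n".join(values) for key, values in merged.items()}
```

Given the pairs ("saw", "see") and ("saw", "saw"), the key "saw" ends up with the two-line data "see\nsaw".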

list-query

This tool can be used to query the compiled files. It has no command line options; just give it a file name to work with. It will read keys from standard input (one per line) and output the associated data for each.

Python interface

There is a Python module interface for the library written using the ctypes foreign function library. It exposes a single class Trie with one useful method, lookup.

The constructor of the class expects one positional argument – the filename to be loaded – and one optional keyword argument, encoding, specifying the encoding of the file. The default encoding is UTF-8. Should the loading fail, an IOError with a detailed description is raised.

The lookup method needs one positional argument – the unicode key to be looked up in the trie. This method returns a list of strings associated with the key. The list is empty if the key was not present in the trie.

C API

For details of exported C functions, see the trie.h header file.

Building

Building libtrie from a tarball needs only a C99-compliant compiler. Unpack the tarball and run the classic ./configure, make and make install. Note that libtrie supports out-of-tree builds. You can also use the make check target to run the tests (which are admittedly not very good).

This setup will by default install the command line tools as well as the shared library and Python bindings.

To build from Git, you will need autotools installed. After cloning the repository, run autoreconf -i and continue as though building from a tarball.

libtrie's Issues

Use proper build system

Instead of an ad-hoc Makefile, use some proper build system (autotools maybe?).

It should support:

  • conditional building of command line utilities
  • conditional building of shared library

Python module is not installed

The Python module does not get installed with make install. Moreover, it does not follow proper conventions for a Python module. There should be an __init__.py file somewhere. The location of the shared library should be supplied by autotools to support a custom prefix.
