GithubHelp home page GithubHelp logo

pombredanne / autocomplete Goto Github PK

View Code? Open in Web Editor NEW

This project forked from jermp/autocomplete

0.0 1.0 0.0 1.72 MB

The code for the experiments in the paper "Efficient and Effective Query Auto-Completion" by Simon Gog, Giulio Ermanno Pibiri, and Rossano Venturini, published in ACM SIGIR 2020.

License: MIT License

CMake 0.77% C++ 94.54% Python 3.12% HTML 0.37% CSS 0.28% JavaScript 0.42% Shell 0.50%

autocomplete's Introduction

Autocomplete

A Query Auto-Completion system based on the paper Efficient and Effective Query Auto-Completion, by Simon Gog, Giulio Ermanno Pibiri, and Rossano Venturini, published in ACM SIGIR 2020.

Table of contents
  1. Installation and quick start
  2. Compiling the code
  3. Input data format
  4. Running the unit tests
  5. Building an index
  6. Benchmarks
  7. Live demo

Installation and quick start

Just run

bash ./install.sh

from the parent directory. The script builds the code; prepare the test data in the folder test_data/trec_05_efficiency_queries for indexing; executes the unit tests.

After that, for having a minimal running example, just run

bash ./example.sh

and then access the service here.

Compiling the code

The code has been tested on Linux with gcc 7.4.0, 8.3.0, 9.0.0 and on Mac 10.14 with clang 10.0.0. To build the code, CMake is required.

Clone the repository with

git clone --recursive https://github.com/jermp/autocomplete.git

If you have cloned the repository without --recursive, you will need to perform the following commands before compiling:

git submodule init
git submodule update

To compile the code for a release environment (see file CMakeLists.txt for the used compilation flags), it is sufficient to do the following:

mkdir build
cd build
cmake ..
make

Hint: Use make -j to compile the library in parallel using all available threads.

For the best of performance, we recommend compiling with:

cmake .. -DCMAKE_BUILD_TYPE=Release -DUSE_SANITIZERS=Off -DUSE_INTRINSICS=On -DUSE_PDEP=On

For a testing environment, use the following instead:

mkdir debug_build
cd debug_build
cmake .. -DCMAKE_BUILD_TYPE=Debug -DUSE_SANITIZERS=On
make

Input data format

The input file should list all completions in lexicographical order. For example, see the the file test_data/trec_05_efficiency_queries/trec_05_efficiency_queries.completions.

The first column represent the ID of the completion; the other columns contain the tokens separated by white spaces. (The IDs for the file trec_05_efficiency_queries.completions are fake, i.e., they do not take into account any particular assignment.)

The script preprocess.sh in the directory test_data helps in preparing the data for indexing. Thus, from within the directory test_data, it is sufficient to do:

bash preprocess.sh <test_collection> <num_queries>

Therefore, for our example with trec_05_efficiency_queries, it would be:

bash preprocess.sh trec_05_efficiency_queries/trec_05_efficiency_queries.completions 300

The second argument in the example, i.e., 300, represents the number of completions (per completion size) that are drawn at random and could be used to query the indexes.

If you run the script, you will get:

  • trec_05_efficiency_queries.completions.dict: lists all the distinct tokens in the completions sorted in lexicographical order.

  • trec_05_efficiency_queries.completions.mapped: lists all completions whose tokens have been mapped to integer ids as assigned by a lexicographically-sorted string dictionary (that should be built from the tokens listed in trec_05_efficiency_queries.completions.dict). Each completion terminates with the id 0.

  • trec_05_efficiency_queries.completions.mapped.stats contains some statistics about the datasets, needed to build the data structures more efficiently.

  • trec05_efficiency_queries.completions.inverted is the inverted file.

  • trec_05_efficiency_queries.completions.forward is the forward file. Note that each list is not sorted, thus the lists are the same as the ones contained in trec_05_efficiency_queries.completions.mapped but sorted in docID order.

Running the unit tests

The unit tests are written using doctest.

After compilation and preparation of the data for indexing (see Section Input data format), it is advised to run the unit tests with:

make test

Building an index

After compiling the code, run the program ./build to build an index. You can specify the type of the index and the name of the file where the index will be written.

For example, with

./build ef_type1 ../test_data/trec_05_efficiency_queries/trec_05_efficiency_queries.completions -o trec05.ef_type1.bin

we can build an index of type ef_type1 from the test file ../test_data/trec_05_efficiency_queries/trec_05_efficiency_queries.completions, that will be serialized to the file trec05.ef_type1.bin.

Possible types are ef_type1, ef_type2, ef_type3 and ef_type4.

Note: the type ef_type4 requires an extra parameter to be specified, c. Use for example: -c 0.0001.

Benchmarks

To run the top-k benchmarks in the /benchmark directory, we first need some query logs. They should have been created already if you have run the script preprocess.sh, otherwise you can use

python partition_queries_by_length.py trec_05_efficiency_queries/trec_05_efficiency_queries.completions trec_05_efficiency_queries/trec_05_efficiency_queries.completions.queries 300

to partition the input completions by number of query terms and retain 300 queries at random. Query files are placed in the output directory trec_05_efficiency_queries/trec_05_efficiency_queries.completions.queries. (By default, 7 shards will be created: the ones having [1,6] query terms and the one collecting all completions with at least 7 query terms).

Then the command

./benchmark_topk ef_type1 10 trec05.ef_type1.bin 3 300 0.25 < ../test_data/trec_05_efficiency_queries/trec_05_efficiency_queries.completions.queries/queries.length=3.shuffled

will execute 1000 top-10 queries with 3 terms, from which only 25% of the prefix of the last token is retained.

We automated the collection of results with the script script/collected_topk_results_by_varying_percentage.py. From within the /build directory, run

python ../script/collect_results_by_varying_percentage.py ef_type1 topk trec05.ef_type1.bin ../test_data/trec_05_efficiency_queries/trec_05_efficiency_queries.completions 10 300

To benchmark the dictionaries (Front-Coding and trie), just run the following script from within the script directory:

bash benchmark_dictionaries.sh

Live demo

Start the web server with the program ./web_server <port> <index_filename> and access the demo at localhost:<port>.

autocomplete's People

Contributors

jermp avatar simongog avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.