elshize / irkit

Information Retrieval tools intended for academic research.

Home Page: https://elshize.github.io/irkit/

License: MIT License

Python 2.04% CMake 2.16% Shell 0.25% C++ 93.22% Vim Script 0.01% Tcl 2.31%

irkit's Issues

Produce and use document mappings when partitioning

Two-way mapping between global and local IDs should be available.
Term maps and tables should also be written at the cluster level.
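
As an illustration of the two-way mapping, here is a minimal sketch of a per-shard document map. The class name `shard_document_map` and its interface are hypothetical and not taken from irkit; it assumes local IDs are dense and assigned in the order the documents were placed in the shard.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical per-shard document map; not part of irkit's actual API.
class shard_document_map {
public:
    // `global_ids[i]` is the global ID of the document with local ID `i`.
    explicit shard_document_map(std::vector<std::uint32_t> global_ids)
        : local_to_global_(std::move(global_ids))
    {
        for (std::uint32_t local = 0; local < local_to_global_.size(); ++local) {
            const bool inserted =
                global_to_local_.emplace(local_to_global_[local], local).second;
            assert(inserted && "a global ID must not appear twice in one shard");
            (void)inserted;  // silence the unused warning when NDEBUG is defined
        }
    }

    // Local -> global: a plain array lookup (throws std::out_of_range if invalid).
    std::uint32_t to_global(std::uint32_t local) const
    {
        return local_to_global_.at(local);
    }

    // Global -> local: returns true and sets `local` if the document is in this shard.
    bool to_local(std::uint32_t global, std::uint32_t& local) const
    {
        const auto pos = global_to_local_.find(global);
        if (pos == global_to_local_.end()) {
            return false;
        }
        local = pos->second;
        return true;
    }

private:
    std::vector<std::uint32_t> local_to_global_;
    std::unordered_map<std::uint32_t, std::uint32_t> global_to_local_;
};
```

The local-to-global direction is the one that could be written to disk as a flat table; the reverse direction can be rebuilt from it when needed.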

Implement major functionalities for 0.1

Note: the list isn't final.

Here's a rough list of the functionalities to implement before releasing 0.1:

Consider replacing std::function with a template for scoring function

Might improve efficiency, not sure how much. Not enough to care right now, but maybe enough when running experiments for a paper.
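
A minimal sketch of the two alternatives, for comparison. The function names and the `double(std::uint32_t)` scorer signature are assumptions made for this example, not irkit's actual interface.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Type-erased variant: every call goes through std::function's indirection.
inline double sum_scores_erased(const std::vector<std::uint32_t>& freqs,
                                const std::function<double(std::uint32_t)>& score)
{
    assert(score && "an empty std::function would throw std::bad_function_call");
    double sum = 0.0;
    for (const auto freq : freqs) {
        sum += score(freq);
    }
    return sum;
}

// Template variant: the concrete callable type is known at compile time,
// which gives the compiler the chance to inline the call in the loop.
template<class ScoreFn>
double sum_scores(const std::vector<std::uint32_t>& freqs, ScoreFn score)
{
    double sum = 0.0;
    for (const auto freq : freqs) {
        sum += score(freq);
    }
    return sum;
}
```

Both variants accept the same lambda, so a benchmark can call them side by side on identical input to see whether the difference is measurable.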

prefix_map fails on terms.txt

The integration test fails on terms.txt, possibly due to some UTF-8/16 characters. I'm not sure how that should affect it since everything is coded byte by byte but that's the only idea I have so far.

Note that hutucker_codec works, but I haven't tested prefix encoding on its own; testing it might be the first step.

Why is ++posting_list_iterator slow?

Add enforce_exists for score-related files in sources

Fix varbyte_codec

The implementation of varbyte_codec looks awful. It needs to be much simpler and more readable.

Also, a benchmark should be written to make sure that changes don't slow it down.
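
For reference, a simple variable-byte scheme fits in a few lines. This is a generic sketch, not irkit's varbyte_codec: it assumes 7 payload bits per byte, least significant group first, with the high bit marking that more bytes follow. The byte layout irkit actually uses may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Appends `value` to `out`, 7 payload bits per byte, least significant group first.
// The high bit is set on every byte of a value except its last one.
inline void vbyte_encode(std::uint64_t value, std::vector<std::uint8_t>& out)
{
    while (value >= 128u) {
        out.push_back(static_cast<std::uint8_t>((value & 127u) | 128u));
        value >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(value));
}

// Decodes one value starting at `pos` and advances `pos` past it.
inline std::uint64_t vbyte_decode(const std::vector<std::uint8_t>& in, std::size_t& pos)
{
    std::uint64_t value = 0;
    unsigned shift = 0;
    while (true) {
        assert(pos < in.size() && "truncated input");
        assert(shift < 64 && "value does not fit in 64 bits");
        const std::uint8_t byte = in[pos++];
        value |= static_cast<std::uint64_t>(byte & 127u) << shift;
        if ((byte & 128u) == 0) {
            return value;
        }
        shift += 7;
    }
}
```

A round trip over boundary values (0, 127, 128, the largest 64-bit value) is a cheap correctness check to run alongside any benchmark.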

Figure out a consistent way of passing arguments by value or reference

Mixing passing ranges by values and references might cause bugs (see #2).

A consistent way of passing arguments must be designed to avoid these problems. Ideally, passing both containers and views should be seamless, but it must also be efficient. I see a few options, but I'm not sure which is best:

1. Use advanced SFINAE magic to automatically figure out the right signatures at compile time. Not sure if it's possible; definitely hard.
2. Always pass by value and provide an easy view wrapper (e.g., gsl::span):
   a. Still accept containers: more generic, but a user can copy a lot by mistake.
   b. Enforce a common type or, preferably, constrain types.
3. Provide by-copy and by-reference sets of functions. Might get messy. Should be constrained.

Write a "cluster" property file after partitioning index

{
    "avg_document_size": 758,
    "documents": 328644,
    "max_document_size": 24083,
    "occurrences": 249345847,
    "shards": 123
}

Extract generic template stuff into irtl

Create a new include directory irtl.
Use a new namespace irtl.
Move generic template stuff, like algorithms, to the new directory.

Some document IDs are being duplicated when building index

Sanitizers show errors for boost::filesystem

Investigate and turn on the sanitizers for tests again

Add timers to most irk-* command line tools

By most, I mean any tool that performs a single task, such as building an index, partitioning, or scoring.
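
One way to add timing without touching each tool's logic is a small wrapper around the task. The helper name `run_timed` is hypothetical, and how the result gets reported (logger, stderr) is left open.

```cpp
#include <cassert>
#include <chrono>
#include <utility>

// Hypothetical helper: runs `task` once and returns the elapsed wall-clock time.
template<class Task>
std::chrono::milliseconds run_timed(Task&& task)
{
    const auto start = std::chrono::steady_clock::now();
    std::forward<Task>(task)();
    const auto stop = std::chrono::steady_clock::now();
    assert(stop >= start);  // steady_clock is monotonic
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
}
```

A tool would wrap its main step, for example `auto elapsed = run_timed([&] { build(); });` with a hypothetical `build()`, and then log `elapsed.count()`.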

Use <score>-<bits> for precomputed scores, and <score> for dynamic

Change for irk-query, irk-threshold, and others.

Write benchmarks for query processing

For a chosen processing type. For now, only exhaustive:

DAAT
TAAT

For a given score or precomputed score.

Fast assert-type fail for expected

nonstd::expected is sometimes cumbersome to use. Whenever you require it to have a value, certain boilerplate is involved. Why not replace that boilerplate with a specific type of assert that aborts the program if the variable holds an unexpected value?

Should work for any printable unexpected type.

Example:

auto result = something_returning_expected().get_value_or_abort();
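
A possible shape for such a helper, written as a free function rather than the member function in the example above. This is only a sketch: `value_or_abort` is a hypothetical name, and `toy_expected` is a toy stand-in that mimics the `has_value()`, `value()`, and `error()` members and the `value_type` alias of nonstd::expected so the example is self-contained.

```cpp
#include <cassert>
#include <cstdlib>
#include <iostream>
#include <string>
#include <utility>

// Hypothetical helper: returns the contained value, or prints the error and aborts.
// `Expected` must provide value_type, has_value(), value(), and a printable error().
template<class Expected>
typename Expected::value_type value_or_abort(Expected result)
{
    if (!result.has_value()) {
        std::cerr << "unexpected value: " << result.error() << '\n';
        std::abort();
    }
    return std::move(result).value();
}

// Toy stand-in for nonstd::expected<int, std::string>, for illustration only.
struct toy_expected {
    using value_type = int;
    bool ok;
    int val;
    std::string err;

    bool has_value() const { return ok; }
    const int& value() const
    {
        assert(ok && "value() called on an errored result");
        return val;
    }
    const std::string& error() const { return err; }
};
```

With this, the call site becomes `auto result = value_or_abort(something_returning_expected());`.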

Use FakeIt with find_package

Fix merger unit test

For now, index integration test should ensure it's correct. Nevertheless, this should be fixed.

Conan: fetch entire boost

clang-tidy on TravisCI

clang-tidy doesn't find the configuration file
implement checking if there are any warnings

Taily extractor

Write irk-taily tool that extracts Taily scores from an index cluster.

Implement DAAT scoring

When scoring on-the-fly, it might matter how many look-ups to the size table are made, so implementing a DAAT-specific scoring, which scores an entire document at once, might be a good idea.

However, I have doubts it matters much, as the intersection might not be big enough to make a difference.

Another reason for this could be to find out how precision changes if we score with a smoothed language model and always consider all query terms, even those whose posting lists don't contain the document.
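
To make the idea concrete, here is a minimal sketch of exhaustive DAAT traversal that looks up the document size once per document instead of once per posting. The `posting` struct, the function name `daat_score`, and the `score(freq, size)` signature are all assumptions for this example; it also sums per-term scores only over the lists that contain the document.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct posting {
    std::uint32_t doc;
    std::uint32_t freq;
};

// Scores the union of `lists` (each sorted by document ID), one document at a time.
// `score(freq, size)` returns a single term's contribution to the document score.
template<class ScoreFn>
std::vector<std::pair<std::uint32_t, double>>
daat_score(const std::vector<std::vector<posting>>& lists,
           const std::vector<std::uint32_t>& doc_sizes,
           ScoreFn score)
{
    std::vector<std::size_t> cursor(lists.size(), 0);
    std::vector<std::pair<std::uint32_t, double>> results;
    while (true) {
        // Find the smallest document ID under any cursor.
        bool any = false;
        std::uint32_t current = 0;
        for (std::size_t i = 0; i < lists.size(); ++i) {
            if (cursor[i] < lists[i].size()) {
                const std::uint32_t doc = lists[i][cursor[i]].doc;
                if (!any || doc < current) {
                    current = doc;
                }
                any = true;
            }
        }
        if (!any) {
            break;  // all lists are exhausted
        }
        assert(current < doc_sizes.size());
        const std::uint32_t size = doc_sizes[current];  // one look-up per document
        double total = 0.0;
        for (std::size_t i = 0; i < lists.size(); ++i) {
            if (cursor[i] < lists[i].size() && lists[i][cursor[i]].doc == current) {
                total += score(lists[i][cursor[i]].freq, size);
                ++cursor[i];
            }
        }
        results.emplace_back(current, total);
    }
    return results;
}
```

Handling terms that are absent from a document, as a smoothed language model would require, would mean calling `score` with a zero frequency for the lists that do not match `current`.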

Write irk-vector

It will read/map a file and create a span of T to it, and perform lookups, print, etc.

irk-score doesn't save properties to disk

noexcept move constructor of disk_memory_source

The declarations below should be noexcept, but that fails with clang on Travis (though not locally or on moa). Investigate and make them noexcept if possible.

disk_memory_source(disk_memory_source&&) = default;
disk_memory_source& operator=(disk_memory_source&&) = default;
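
A defaulted move constructor is noexcept only if every base and member can be moved without throwing, so one way to investigate is to check the suspect types with type traits. The structs below are hypothetical stand-ins, not the real disk_memory_source or its members.

```cpp
#include <string>
#include <type_traits>

// Hypothetical stand-in for a class with defaulted move operations.
struct source_like {
    std::string path;
    source_like() = default;
    source_like(source_like&&) = default;
    source_like& operator=(source_like&&) = default;
};

// A member type whose move constructor is not declared noexcept.
struct throwing_move {
    throwing_move() = default;
    throwing_move(throwing_move&&) {}
};

// The defaulted move constructor inherits the weaker guarantee from its member.
struct holder {
    throwing_move member;
    holder() = default;
    holder(holder&&) = default;
};

static_assert(std::is_nothrow_move_constructible<source_like>::value,
              "std::string moves without throwing, so source_like does too");
static_assert(!std::is_nothrow_move_constructible<holder>::value,
              "holder's move constructor may throw because throwing_move's may");
```

Applying the same trait to each member type of the real class on the failing toolchain should show which one blocks the noexcept specification.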

Implement empty posting list

Use execve instead of the system call

Compact table lookup doesn't need a binary search

As opposed to reverse lookup, which does.

Export libraries

libstreamvbyte.so

Remove irk-part

There is a much better tool for doing this in bash: split.

Design Offset Format

Offsets, such as those to document IDs, frequencies and scores, should ideally follow the following rules:

Independent: Since the number of types of offsets is a variable, we should be able to easily read any subset of the offsets.
Fast: It should be fast to read any offset given a term ID; ideally, even from disk.
Small: They shouldn't take up too much space, especially in memory.

The question is: can we satisfy all of the above?

Note: ClueWeb09B has over 80 million terms.

"irk-prefixmap lookup" returns wrong IDs

Looks like there's something wrong across block edges. See the integration test.

It's all correct until the first element of the second block.

block: 0
515 ; idx: 515
block: 1
515 ; idx: 516

Implement bash auto-completion

https://www.gnu.org/software/bash/manual/bash.html#Programmable-Completion

DAAT queries with dynamic scores are slow

It performs well for precomputed scores. Also, there is a much smaller difference between dynamic and precomputed scores for TAAT.

Maybe implementing #61 will make a difference?

Fix clang on Travis and re-enable it

Offset table with 8-byte values fails during encoding on vbyte

irk-score fails with memory corruption

Index

moa:/data/index/irkit/cw09b-nospam

Command

~/irkit/build/bin/irk-score

Log

[2018-11-05 17:30:36.058] [score] [info] Initiating scoring using 8 threads
[2018-11-05 17:30:36.810] [score] [info] Calculating max score
[2018-11-05 17:45:45.419] [score] [info] Max score: 8.48841e+07; Min score: -1.09753e+07
malloc(): memory corruption
Aborted (core dumped)

Move block ownership to list instead of iterator

Reading WARC files

Write a tool that takes WARC (possibly gzipped) files and prints out the documents.

Output Format

Title<--->URL<--->size (bytes)<--->term term term...

Options

-s/--stem lang: use stemming in the selected language

Tool for gathering Taily stats

A tool, e.g., irk-scorestats, should take a parameter score=["bm25", "ql"], compute the max, mean, and variance of the score for each term, and store them in files:

<score>.max
<score>.mean
<score>.var
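
The per-term computation itself is small. The sketch below assumes the statistics are taken over the scores of the postings in one term's list and that the variance is the population variance; both are assumptions, and the names `score_stats` and `compute_stats` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct score_stats {
    double max;
    double mean;
    double variance;
};

// Max, mean, and population variance over the scores of one posting list.
inline score_stats compute_stats(const std::vector<double>& scores)
{
    assert(!scores.empty() && "statistics of an empty list are undefined");
    const double count = static_cast<double>(scores.size());
    double sum = 0.0;
    for (const double score : scores) {
        sum += score;
    }
    const double mean = sum / count;
    double squared_deviations = 0.0;
    for (const double score : scores) {
        squared_deviations += (score - mean) * (score - mean);
    }
    return {*std::max_element(scores.begin(), scores.end()),
            mean,
            squared_deviations / count};
}
```

The tool would call this once per term and append each field to the corresponding file.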

Use find_package for zlib

For some reason, if I use find_package to find zlib dependency within conan packages, it fails despite linking against the library:

/usr/bin/g++-7   -Wall -pedantic -fno-strict-aliasing -march=native  -O3   CMakeFiles/irk-warc.dir/irk-warc.cpp.o  -o ../bin/irk-warc -Wl,-rpath,/home/travis/.conan/data/zlib/1.2.11/conan/stable/package/5246c0bd84cb3855ffc2a458086a0813344953bf/lib:/home/travis/.conan/data/TBB/2018_U5/conan/stable/package/160b4bac5177b0f9a5f4857d00317fe0862a8a02/lib:/home/travis/.conan/data/streamvbyte/master/elshize/testing/package/6ae331b72e7e265ca2a3d1d8246faf73aa030238/lib:/home/travis/.conan/data/gumbo-parser/1.0/elshize/stable/package/6ae331b72e7e265ca2a3d1d8246faf73aa030238/lib: /home/travis/.conan/data/zlib/1.2.11/conan/stable/package/5246c0bd84cb3855ffc2a458086a0813344953bf/lib/libz.so -lpthread /usr/lib/x86_64-linux-gnu/libboost_iostreams.a /usr/lib/x86_64-linux-gnu/libboost_regex.a /usr/lib/x86_64-linux-gnu/libboost_filesystem.a /usr/lib/x86_64-linux-gnu/libboost_system.a /home/travis/.conan/data/fmt/5.1.0/bincrafters/stable/package/66c5327ebdcecae0a01a863939964495fa019a06/lib/libfmt.a /home/travis/.conan/data/TBB/2018_U5/conan/stable/package/160b4bac5177b0f9a5f4857d00317fe0862a8a02/lib/libtbb.so /home/travis/.conan/data/rax/master/elshize/testing/package/6ae331b72e7e265ca2a3d1d8246faf73aa030238/lib/librax.a /home/travis/.conan/data/streamvbyte/master/elshize/testing/package/6ae331b72e7e265ca2a3d1d8246faf73aa030238/lib/libstreamvbyte.so /home/travis/.conan/data/gumbo-parser/1.0/elshize/stable/package/6ae331b72e7e265ca2a3d1d8246faf73aa030238/lib/libgumbo.so

Check out Travis build for details: https://travis-ci.org/elshize/irkit/builds/449604806

traverse_list for gsl::span

Segmentation fault occurs when compiling with -g. Investigate and fix.

Resolve path issues for tests

unit_test_standard_block_list.cpp needs path fix and to be re-enabled.

Turn on sanitizers on Travis

Right now, for some reason, rax has problems with sanitizers. Investigate and fix.

Write benchmarks for posting traversal

1. Documents
2. Frequencies
3. Scores
4. (doc, freq)
5. (doc, score)
6. scored(doc, freq)

1, 2, and 3 should be about the same.
Same goes for 4 and 5, which will take twice the time (?) of 1.
6 will be slower than 5 but should be reasonable (probably should at some point find out what that means).