elshize / irkit

Information Retrieval tools intended for academic research.

Home Page: https://elshize.github.io/irkit/

License: MIT License

Python 2.04% CMake 2.16% Shell 0.25% C++ 93.22% Vim Script 0.01% Tcl 2.31%

Contributors: elshize

irkit's Issues

Implement major functionalities for 0.1

Note: the list isn't final.

Here's a rough list of the functionalities to implement before releasing 0.1:

  • Index builder
    • In-memory index builder and writer
    • Index merger
  • Term-wise scorers (both query-time and as pre-processors):
    • BM25
    • Query Likelihood
  • Query processing
    • TAAT (optimizations & max-score optional)
    • DAAT (max-score, WAND optional)
    • SAAT
  • Feature extractors
  • Command line
    • Build Index
    • Merge parts
    • Reading index
      • Document features: size
      • Single terms: documents, scores
    • Score: precalculate scores for postings.

prefix_map fails on terms.txt

The integration test fails on terms.txt, possibly because of some UTF-8/UTF-16 characters. I'm not sure how that would affect it, since everything is encoded byte by byte, but that's the only idea I have so far.

Note that hutucker_codec works, but I haven't tested prefix encoding on its own; that might be the first step.

Fix varbyte_codec

The implementation of varbyte_codec looks awful. It needs to be much simpler and more readable.

Also, a benchmark should be written to make sure that changes don't slow it down.
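As a point of reference for how small a variable-byte codec can be, here is a minimal sketch. It is not irkit's varbyte_codec: the function names are hypothetical, and it assumes the convention where the high bit marks a continuation byte (some implementations invert this).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Appends `value` to `out` using 7 payload bits per byte, least significant
// group first. The high bit is set on every byte except the last one.
inline void varbyte_encode(std::uint64_t value, std::vector<std::uint8_t>& out)
{
    while (value >= 128u) {
        out.push_back(static_cast<std::uint8_t>((value & 127u) | 128u));
        value >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(value));
}

// Decodes one value starting at `pos` and advances `pos` past the bytes read.
// Assumes well-formed input: each value is terminated by a byte with a clear
// high bit and spans at most 10 bytes.
inline std::uint64_t varbyte_decode(const std::vector<std::uint8_t>& in, std::size_t& pos)
{
    std::uint64_t value = 0;
    unsigned shift = 0;
    for (;;) {
        assert(pos < in.size());
        const std::uint8_t byte = in[pos++];
        value |= static_cast<std::uint64_t>(byte & 127u) << shift;
        if ((byte & 128u) == 0) {
            return value;
        }
        shift += 7;
    }
}
```

A round trip over a buffer of encoded values (encode several, then decode them in order from position 0) is also a natural body for the benchmark loop.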

Figure out a consistent way of passing arguments by value or reference

Mixing passing ranges by values and references might cause bugs (see #2).

A consistent way of passing arguments must be designed to avoid these problems. Ideally, passing both containers and views should be seamless, but it must also be efficient. I see a few options, but I'm not sure which is best:

  1. Use advanced SFINAE magic to automatically figure out the right signatures at compile time. Not sure if it's possible; definitely hard.
  2. Always pass by value and provide an easy view wrapper (e.g., gsl::span):
    a. Still accept containers---more generic but a user can copy a lot by mistake.
    b. Enforce a common type or, preferably, constrain types.
  3. Provide by-copy and by-reference set of functions. Might get messy. Should be constrained.
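Option 2b could look roughly like the sketch below. To stay self-contained it uses a hand-rolled `array_view` as a stand-in for gsl::span; both `array_view` and `sum_all` are hypothetical names, not part of irkit.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-in for gsl::span: a non-owning pointer and length.
template<class T>
class array_view {
public:
    array_view(const T* data, std::size_t size) : data_(data), size_(size) {}
    // Implicit conversion from a vector, so a container can be passed
    // directly without copying its elements.
    array_view(const std::vector<T>& v) : data_(v.data()), size_(v.size()) {}
    const T* begin() const { return data_; }
    const T* end() const { return data_ + size_; }
    std::size_t size() const { return size_; }

private:
    const T* data_;
    std::size_t size_;
};

// The view is taken by value: copying it copies a pointer and a length,
// never the underlying elements.
inline long sum_all(array_view<int> values)
{
    return std::accumulate(values.begin(), values.end(), 0L);
}
```

With this shape, `sum_all(some_vector)` and `sum_all(array_view<int>(ptr, n))` go through the same signature, and an accidental deep copy is not possible because the function never accepts an owning type. The cost is that every range has to be contiguous.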

Fast assert-type fail for expected

nonstd::expected is sometimes cumbersome to use. Whenever you require it to hold a value, some boilerplate is involved. Why not replace that boilerplate with a specific kind of assert that aborts the program if the variable holds an unexpected value?

Should work for any printable unexpected type.

Example:

auto result = something_returning_expected().get_value_or_abort();
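One way to get this behavior without modifying the library type is a free function rather than a member. The sketch below assumes only that the type exposes `value_type`, `has_value()`, `value()`, and `error()`; `value_or_abort` and `demo_expected` are hypothetical names, and `demo_expected` exists only so the example compiles without the library.

```cpp
#include <cstdlib>
#include <iostream>
#include <string>
#include <type_traits>
#include <utility>

// Returns the contained value by value, or prints the error to stderr and
// aborts. Requires the error type to be streamable to std::ostream.
template<class Expected>
typename std::decay<Expected>::type::value_type value_or_abort(Expected&& result)
{
    if (!result.has_value()) {
        std::cerr << "unexpected value: " << result.error() << '\n';
        std::abort();
    }
    return std::forward<Expected>(result).value();
}

// Hypothetical minimal stand-in for an expected<int, std::string>.
struct demo_expected {
    using value_type = int;
    bool ok;
    int val;
    std::string err;
    bool has_value() const { return ok; }
    int value() const { return val; }
    const std::string& error() const { return err; }
};
```

Usage would then read `auto result = value_or_abort(something_returning_expected());`.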

Fix merger unit test

For now, index integration test should ensure it's correct. Nevertheless, this should be fixed.

clang-tidy on TravisCI

  • clang-tidy doesn't find the configuration file
  • implement checking if there are any warnings

Taily extractor

Write an irk-taily tool that extracts Taily scores from an index cluster.

Implement DAAT scoring

When scoring on the fly, the number of look-ups to the size table might matter, so implementing DAAT-specific scoring (which scores an entire document at once) might be a good idea.

However, I have doubts it matters much, as the intersection might not be big enough to make a difference.

Another reason for this could be finding out how precision changes if we score with a smoothed language model and always consider all terms, even those whose posting lists don't contain the document.
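A minimal sketch of the idea, assuming uncompressed in-memory posting lists sorted by document ID: the traversal visits each document once, does a single size look-up for it, and hands the scorer one frequency per query term, with 0 for terms that don't contain the document. All names here (`posting`, `daat_score_all`, the scorer signature) are hypothetical, and document IDs are assumed to be smaller than the maximum `uint32_t`, which serves as a sentinel.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

struct posting {
    std::uint32_t doc;
    std::uint32_t freq;
};
using posting_list = std::vector<posting>;  // sorted by doc

// Scores every document that appears in at least one list.
// `score(freqs, size)` sees the frequencies of all terms for one document.
template<class ScoreFn>
std::vector<std::pair<std::uint32_t, double>> daat_score_all(
    const std::vector<posting_list>& lists,
    const std::vector<std::uint32_t>& doc_sizes,
    ScoreFn score)
{
    const std::uint32_t no_doc = std::numeric_limits<std::uint32_t>::max();
    std::vector<std::size_t> cursor(lists.size(), 0);
    std::vector<std::uint32_t> freqs(lists.size(), 0);
    std::vector<std::pair<std::uint32_t, double>> results;
    for (;;) {
        // The next document is the smallest ID under any cursor.
        std::uint32_t current = no_doc;
        for (std::size_t t = 0; t < lists.size(); ++t) {
            if (cursor[t] < lists[t].size()) {
                current = std::min(current, lists[t][cursor[t]].doc);
            }
        }
        if (current == no_doc) {
            break;  // all lists exhausted
        }
        // Collect one frequency per term and advance the matching cursors.
        for (std::size_t t = 0; t < lists.size(); ++t) {
            const bool present = cursor[t] < lists[t].size()
                && lists[t][cursor[t]].doc == current;
            freqs[t] = present ? lists[t][cursor[t]].freq : 0;
            if (present) {
                ++cursor[t];
            }
        }
        // Exactly one size-table look-up per scored document.
        results.emplace_back(current, score(freqs, doc_sizes[current]));
    }
    return results;
}
```

Because the scorer receives zero frequencies explicitly, a smoothed language model can still assign a non-zero contribution to missing terms.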

Write irk-vector

It will read/map a file and create a span of T to it, and perform lookups, print, etc.

noexcept move constructor of disk_memory_source

The move operations below should be noexcept, but that fails with clang on Travis (though not locally or on moa). Investigate and make them noexcept if possible.

disk_memory_source(disk_memory_source&&) = default;
disk_memory_source& operator=(disk_memory_source&&) = default;
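One way to narrow this down is to check the members with type traits: a defaulted move constructor is only non-throwing if the move constructors of all members and bases are. The sketch below illustrates that with hypothetical types (`throwing_move`, `source_ok`, `source_bad`); it says nothing about which member of disk_memory_source is actually responsible.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical member type whose move operations are not declared noexcept.
struct throwing_move {
    throwing_move() = default;
    throwing_move(throwing_move&&) {}
    throwing_move& operator=(throwing_move&&) { return *this; }
};

// All members are trivially movable, so the defaulted move is noexcept.
struct source_ok {
    const char* data;
    std::size_t size;
    source_ok(source_ok&&) = default;
    source_ok& operator=(source_ok&&) = default;
};

// One potentially-throwing member makes the defaulted move potentially throwing.
struct source_bad {
    throwing_move member;
    source_bad(source_bad&&) = default;
    source_bad& operator=(source_bad&&) = default;
};

static_assert(std::is_nothrow_move_constructible<source_ok>::value,
              "source_ok should be nothrow move constructible");
static_assert(!std::is_nothrow_move_constructible<source_bad>::value,
              "source_bad inherits the potentially-throwing move of its member");
```

Adding the same `static_assert` for each member type of disk_memory_source on the failing toolchain would show which one differs between standard library versions.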

Remove irk-part

There is a much better tool for doing this in bash: split.

Design Offset Format

Offsets, such as those to document IDs, frequencies and scores, should ideally follow the following rules:

  • Independent: Since the number of types of offsets is a variable, we should be able to easily read any subset of the offsets.
  • Fast: It should be fast to read any offset given a term ID; ideally, even from disk.
  • Small: They shouldn't take up too much space, especially in memory.

The question is: can we satisfy all of the above?

Note: ClueWeb09B has over 80 million terms.

"irk-prefixmap lookup" returns wrong IDs

Looks like there's something wrong across block edges. See the integration test.

It's all correct until the first element of the second block.

block: 0
515 ; idx: 515
block: 1
515 ; idx: 516

irk-score fails with memory corruption

Index

moa:/data/index/irkit/cw09b-nospam

Command

~/irkit/build/bin/irk-score

Log

[2018-11-05 17:30:36.058] [score] [info] Initiating scoring using 8 threads
[2018-11-05 17:30:36.810] [score] [info] Calculating max score
[2018-11-05 17:45:45.419] [score] [info] Max score: 8.48841e+07; Min score: -1.09753e+07
malloc(): memory corruption
Aborted (core dumped)

Reading WARC files

Write a tool that takes WARC files (possibly gzipped) and prints out the documents.

Output Format

Title<--->URL<--->size (bytes)<--->term term term...

Options

  • -s/--stem lang: use stemming in the selected language

Tool for gathering Taily stats

A tool, e.g., irk-scorestats, should take a parameter score=["bm25", "ql"], and compute, for each term, its max, mean, and variance, and store in files:

<score>.max
<score>.mean
<score>.var
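The per-term computation itself is small. The sketch below assumes the scores of one term's postings are already available as a non-empty vector and uses the population variance (dividing by n); whether Taily expects that or the sample variance is not stated here. `score_stats` and `compute_stats` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct score_stats {
    double max;
    double mean;
    double var;
};

// Computes max, mean, and population variance of one term's posting scores.
// Assumes `scores` is non-empty.
inline score_stats compute_stats(const std::vector<double>& scores)
{
    assert(!scores.empty());
    double max = scores.front();
    double sum = 0.0;
    for (double s : scores) {
        max = std::max(max, s);
        sum += s;
    }
    const double n = static_cast<double>(scores.size());
    const double mean = sum / n;
    // Second pass over the scores for the squared deviations from the mean.
    double squared = 0.0;
    for (double s : scores) {
        squared += (s - mean) * (s - mean);
    }
    return score_stats{max, mean, squared / n};
}
```

The tool would call this once per term and append each field to the corresponding `<score>.max`, `<score>.mean`, and `<score>.var` file.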

Use find_package for zlib

For some reason, if I use find_package to find zlib dependency within conan packages, it fails despite linking against the library:

/usr/bin/g++-7   -Wall -pedantic -fno-strict-aliasing -march=native  -O3   CMakeFiles/irk-warc.dir/irk-warc.cpp.o  -o ../bin/irk-warc -Wl,-rpath,/home/travis/.conan/data/zlib/1.2.11/conan/stable/package/5246c0bd84cb3855ffc2a458086a0813344953bf/lib:/home/travis/.conan/data/TBB/2018_U5/conan/stable/package/160b4bac5177b0f9a5f4857d00317fe0862a8a02/lib:/home/travis/.conan/data/streamvbyte/master/elshize/testing/package/6ae331b72e7e265ca2a3d1d8246faf73aa030238/lib:/home/travis/.conan/data/gumbo-parser/1.0/elshize/stable/package/6ae331b72e7e265ca2a3d1d8246faf73aa030238/lib: /home/travis/.conan/data/zlib/1.2.11/conan/stable/package/5246c0bd84cb3855ffc2a458086a0813344953bf/lib/libz.so -lpthread /usr/lib/x86_64-linux-gnu/libboost_iostreams.a /usr/lib/x86_64-linux-gnu/libboost_regex.a /usr/lib/x86_64-linux-gnu/libboost_filesystem.a /usr/lib/x86_64-linux-gnu/libboost_system.a /home/travis/.conan/data/fmt/5.1.0/bincrafters/stable/package/66c5327ebdcecae0a01a863939964495fa019a06/lib/libfmt.a /home/travis/.conan/data/TBB/2018_U5/conan/stable/package/160b4bac5177b0f9a5f4857d00317fe0862a8a02/lib/libtbb.so /home/travis/.conan/data/rax/master/elshize/testing/package/6ae331b72e7e265ca2a3d1d8246faf73aa030238/lib/librax.a /home/travis/.conan/data/streamvbyte/master/elshize/testing/package/6ae331b72e7e265ca2a3d1d8246faf73aa030238/lib/libstreamvbyte.so /home/travis/.conan/data/gumbo-parser/1.0/elshize/stable/package/6ae331b72e7e265ca2a3d1d8246faf73aa030238/lib/libgumbo.so

Check out Travis build for details: https://travis-ci.org/elshize/irkit/builds/449604806

Write benchmarks for posting traversal

  1. Documents
  2. Frequencies
  3. Scores
  4. (doc, freq)
  5. (doc, score)
  6. scored(doc, freq)

1, 2, and 3 should be about the same.
Same goes for 4 and 5, which will take twice the time (?) of 1.
6 will be slower than 5 but should be reasonable (probably should at some point find out what that means).

Remove include_directories

Remove include_directories in favor of the modern approach with target_include_directories. Figure out how it works and whether we can 'inherit' it from the irkit target. If so, find out what was wrong.

Index source should be stored in shared_ptr

This needs to be thought through further, but I think shared_ptr is a better way to manage the object's lifetime.

Alternative: can it be done with move semantics? Should it?
