PISA: Performant Indexes and Search for Academia
Home Page: https://pisa-engine.github.io/pisa/book
License: Apache License 2.0
Describe the solution you'd like
As discussed, the queries tool should be able to run evaluation queries and print out the results in TREC format.
Additional context
We need document mapping to be provided, in a similar fashion to terms: #97
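For reference, a TREC run file has one line per retrieved document: query ID, the literal Q0, document ID, rank, score, and a run tag. A minimal sketch of the output step (the Result struct and function names are illustrative, not PISA's API):

```cpp
#include <string>
#include <vector>

// One retrieved document for a query; an illustrative struct, not PISA's type.
struct Result {
    std::string docid;
    float score;
};

// Render results in the standard six-column TREC run format:
//   <qid> Q0 <docno> <rank> <score> <run_tag>
std::string to_trec(const std::string& qid,
                    const std::vector<Result>& results,
                    const std::string& run_tag) {
    std::string out;
    int rank = 1;
    for (const auto& r : results) {
        out += qid + " Q0 " + r.docid + " " + std::to_string(rank++) + " "
             + std::to_string(r.score) + " " + run_tag + "\n";
    }
    return out;
}
```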
Since https://github.com/tlx/tlx has some cool functionality we will need anyway at some point (B+ trees, RadixHeap, math functions, ...), it probably makes sense to replace some of the Boost string algorithms with the ones provided there.
Rewrite existing tests and get rid of Boost.Test
When run:
zcat /data/collections/CW09b/**/*gz | ./bin/parse_collection -f warc -b 10000 -o /data/index/pisa/tmp/cw09b
I get segfault:
2019-01-13 01:10:57: [Batch 99] Processed documents [990000, 1000000)
2019-01-13 01:11:01: [Batch 100] Processed documents [1000000, 1010000)
2019-01-13 01:11:05: [Batch 102] Processed documents [1020000, 1030000)
2019-01-13 01:11:18: [Batch 103] Processed documents [1030000, 1040000)
2019-01-13 01:11:29: [Batch 108] Processed documents [1080000, 1090000)
2019-01-13 01:11:34: [Batch 104] Processed documents [1040000, 1050000)
2019-01-13 01:11:34: [Batch 105] Processed documents [1050000, 1060000)
2019-01-13 01:11:36: [Batch 107] Processed documents [1070000, 1080000)
2019-01-13 01:11:41: [Batch 106] Processed documents [1060000, 1070000)
2019-01-13 01:11:44: [Batch 109] Processed documents [1090000, 1100000)
2019-01-13 01:12:04: [Batch 110] Processed documents [1100000, 1110000)
2019-01-13 01:12:16: [Batch 111] Processed documents [1110000, 1120000)
Segmentation fault (core dumped)
Needs investigating.
Opening this for discussion.
For example, if we pass --terms <file> or something of the sort, we assume that the input consists of terms to be translated to IDs later.
While we're at it, a program to map terms to IDs in a file would be useful as well: queries, thresholds, etc., with std::unordered_map.
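A minimal sketch of such a term-to-ID translation (the convention that a term's ID is its position in the sorted term list is an assumption here, as are the function names):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Build a term -> ID map, where a term's ID is its position in the list.
std::unordered_map<std::string, std::uint32_t>
build_term_map(const std::vector<std::string>& terms) {
    std::unordered_map<std::string, std::uint32_t> map;
    map.reserve(terms.size());
    for (std::uint32_t id = 0; id < terms.size(); ++id) {
        map.emplace(terms[id], id);
    }
    return map;
}

// Translate a parsed query into term IDs, dropping out-of-vocabulary terms.
std::vector<std::uint32_t>
translate(const std::vector<std::string>& query,
          const std::unordered_map<std::string, std::uint32_t>& map) {
    std::vector<std::uint32_t> ids;
    for (const auto& term : query) {
        if (auto it = map.find(term); it != map.end()) {
            ids.push_back(it->second);
        }
    }
    return ids;
}
```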
Enable and possibly modify the test
Lines 48 to 69 in e3a3121
It would be nice if we could build a pipeline of actions to perform on the input stream.
We could decide to: parse_warc | strip_http_header | parse_html | stem | ...
It would give us much more freedom and would simplify the code. The pipeline would first be built according to what we are performing (parse_warc or parse_plaintext) and then passed to the forward index builder.
The only library to do so that I am aware of is https://github.com/ericniebler/range-v3
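Without committing to range-v3, the basic idea can be sketched with plain std::function stages composed at startup; a minimal sketch (the stage names in the usage comment mirror the ones above but are hypothetical):

```cpp
#include <functional>
#include <string>
#include <vector>

// A stage transforms a document's text; a pipeline is just their composition.
using Stage = std::function<std::string(std::string)>;

// Compose the selected stages into one transformation that the
// forward-index builder can apply to each document, e.g.:
//   auto pipeline = build_pipeline({strip_http_header, parse_html, stem});
Stage build_pipeline(std::vector<Stage> stages) {
    return [stages = std::move(stages)](std::string text) {
        for (const auto& stage : stages) {
            text = stage(std::move(text));
        }
        return text;
    };
}
```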
Consider to substitute Boost iostreams with https://github.com/mandreyel/mio
Consider the following:
pisa/test/test_ranked_queries.cpp
Line 45 in 8177a47
There seems to be something wrong with the query results when I vary k. (Or I don't understand something.)
To show what I mean, I changed this to:
ranked_or_query<WandType> or_q(wdata, 1);
for (auto const& q : queries) {
    or_q(index, q);
    op_q(index, q);
    // BOOST_REQUIRE_EQUAL(or_q.topk().size(), op_q.topk().size());
    std::cerr << or_q.topk().size() << '\n';
    for (size_t i = 0; i < or_q.topk().size(); ++i) {
        BOOST_REQUIRE_CLOSE(or_q.topk()[i].first, op_q.topk()[i].first, 0.1); // tolerance is % relative
    }
}
I printed out the ranked_or_query result sizes for all queries with both k = 1 and k = 10. In the former case, there are some non-zero sizes (nothing equal to 10, though). In the latter, all are zeroes, when I expected some ones.
Here are the output files: k10.txt and k1.txt
If I set k = 1, I should still get the same top result as for k = 10, shouldn't I?
There seems to be something wrong but it needs further investigating. I open this as a reminder for myself to provide more information and validate the input network I used.
Add progress bar to the verify index task
Line 22 in 8177a47
Is this something we should be aware of? Or can it be removed?
Additional context
In master
, they added something we should definitely use, as it's clean and useful:
CLIUtils/CLI11#222
Essentially, you can add a set of available values, e.g.,
stemmer_option->check(CLI::IsMember{"porter2", "krovetz", "none"});
This feature is tagged for version 1.8
.
Anything else?
Isn't this a dead block? The UB scores are computed in the constructor of the partition and returned as t unless I am mistaken.
pisa/include/pisa/wand_data_compressed.hpp
Lines 119 to 126 in 3672c4c
queries.hpp includes a bunch of other headers, but it does so inside a namespace. This is a problem when you need to add an external header to one of those files: then you're including, say, the tbb namespace inside ds2i, which makes it ds2i::tbb.
Rather, we should put the namespace in each of those files separately.
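A self-contained illustration of the failure mode (toy namespaces standing in for ds2i and tbb):

```cpp
namespace outer {
// Imagine this block were `#include <tbb/tbb.h>` placed inside `namespace ds2i`:
// everything the header declares lands nested inside `outer`.
namespace tbb {
    inline int task_count() { return 1; }
}
}

// The header's names are now only reachable as outer::tbb::..., so a later,
// correctly placed include of the same header would declare a second,
// unrelated copy of the namespace at global scope.
int use() { return outer::tbb::task_count(); }
```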
Since we have a few dependencies we build every time, can we still cache them even if we use them as subdirectories?
Or, here's an unpopular opinion, maybe we shouldn't build them from subdirectories.
We can enable Catch2 helpers with:
include(CTest)
include(Catch)
and then add tests with
catch_discover_tests(executable_name)
It's nicer because it exposes all individual test cases to ctest, instead of just one per file.
NOTE: This automatically worked with find_package(Catch2); not sure if you need to do some extra include when you use a subdirectory, but I'm sure it's very easy either way.
We need to have term/query/document mappings to release data after ECIR, among other things.
I suggest two programs:
binary_collection to use for tests.
I got some code in my repo that I can use, so I'll try to incorporate it here.
When parsing WARC, we assume there is a TREC ID field. This is not true for CC-NEWS, so it needs to be taken into account.
WARC/1.0
WARC-Record-ID: <urn:uuid:4e7b712c-f7fd-4e31-81ad-b6b8b9a190c8>
Content-Length: 38840
WARC-Date: 2016-12-01T13:28:43Z
WARC-Type: response
WARC-Target-URI: http://www.banker.bg/finansov-dnevnik/read/parlamentut-prie-biudjetite-na-vsichki-ministerstva?utm_source=rss&utm_medium=click&utm_campaign=rss
Content-Type: application/http; msgtype=response
WARC-Payload-Digest: sha1:NKWIH3RIPHRCTJ7F7VM45KZFUR5DCXMW
WARC-Block-Digest: sha1:EOKDPEU3LQ7XESEPNRQO34YDAEDADIL6
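A hedged sketch of the record-ID selection this implies: prefer WARC-TREC-ID when present, otherwise fall back to WARC-Record-ID (field names are taken from the records shown; the surrounding parser and helper names are assumptions):

```cpp
#include <optional>
#include <sstream>
#include <string>

// Extract the value of a "Name: value" WARC header field, if present.
std::optional<std::string>
header_field(const std::string& headers, const std::string& name) {
    std::istringstream in(headers);
    std::string line;
    while (std::getline(in, line)) {
        if (line.rfind(name + ": ", 0) == 0) {  // line starts with "Name: "
            std::string value = line.substr(name.size() + 2);
            if (!value.empty() && value.back() == '\r') value.pop_back();
            return value;
        }
    }
    return std::nullopt;
}

// CC-NEWS records carry no WARC-TREC-ID, so fall back to WARC-Record-ID.
std::string record_id(const std::string& headers) {
    if (auto id = header_field(headers, "WARC-TREC-ID")) return *id;
    if (auto id = header_field(headers, "WARC-Record-ID")) return *id;
    return "";
}
```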
Lines 50 to 56 in c421f3d
This can be converted into a constexpr map, s.t. the following can be implemented with a table lookup.
pisa/benchmarks/selective_queries.cpp
Lines 67 to 78 in 8177a47
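The shape of such a constexpr map can be sketched as follows (the referenced block is not shown here, so the keys and codes below are placeholders, not the real entries):

```cpp
#include <array>
#include <optional>
#include <string_view>
#include <utility>

// A constexpr table of name -> code pairs; branching on a string becomes
// a single lookup. Requires C++17. Entries are illustrative placeholders.
constexpr std::array<std::pair<std::string_view, int>, 3> table{{
    {"porter2", 0},
    {"krovetz", 1},
    {"none", 2},
}};

// Linear scan is fine for a handful of entries and stays constexpr-friendly.
constexpr std::optional<int> lookup(std::string_view key) {
    for (std::size_t i = 0; i < table.size(); ++i) {
        if (table[i].first == key) return table[i].second;
    }
    return std::nullopt;
}

// The lookup can even be checked at compile time.
static_assert(*lookup("krovetz") == 1);
static_assert(!lookup("unknown").has_value());
```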
To Reproduce
1. parse_collection (mg4j version)
2. invert
3. ./bin/create_freq_index -t block_simdbp -c /data/index/pisa/cw09b/cw09b -o /data/index/pisa/cw09b/cw09b.block_simdbp
Error message
[2019-02-11 13:59:36.478] [info] Processing 83616447 documents
Create index: 96% [18m 41s]terminate called after throwing an instance of 'std::invalid_argument'
what(): List must be nonempty
Aborted (core dumped)
Expected behavior
invert should never create empty lists. There is a bug. invert should have a check at the end to verify that.
Describe the solution you'd like
A Scorer should have a generic constructor and an operator() which takes docid and freq as parameters.
The constructor will take care of storing any additional information needed by the scorer in order to perform the computation; the operator() will actually perform it and return the score.
Additional context
All the query algorithms need to be changed to take a vector of Scorer instances (one for each term), previously generated according to the Scorer chosen by the user.
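A sketch of the proposed shape (the TF-IDF-style formula is purely a stand-in to show "additional information" being captured at construction; it is not the project's actual scorer):

```cpp
#include <cmath>
#include <cstdint>

// One scorer per query term: the constructor captures everything the scoring
// function needs (here, collection size and the term's document frequency),
// and operator() computes the score for a single posting.
class Scorer {
  public:
    Scorer(std::uint64_t num_docs, std::uint64_t term_doc_freq)
        : m_idf(std::log(static_cast<double>(num_docs)
                         / static_cast<double>(term_doc_freq))) {}

    double operator()(std::uint32_t /* docid */, std::uint32_t freq) const {
        return m_idf * static_cast<double>(freq);
    }

  private:
    double m_idf;
};
```

A query algorithm would then hold one Scorer per term and call scorers[t](docid, freq) inside its traversal loop.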
I was thinking that maybe we should use an additional file, say, in JSON format, to store some additional information about an index, such as document/term count, number of postings, etc.
Additional context
For one, this is good to have. But I also have an inspirational example: when parsing a collection, we produce a forward index in binary_collection format without any information about the number of terms. But later on, inverting can be significantly simplified if the term count is known, so we pass it on the command line. This is (a) annoying, (b) error-prone, and (c) makes it more difficult (or less efficient) to automate inverting all shards at a time (either from the program or a script). It would be nice if we had a file we could read those stats from.
Backwards compatibility
I'd say we can make it a mandatory file for an inverted index, and only optional for a forward index for now. But even if we don't use it from the code compressing the index, we can use those stats.
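For illustration, such a sidecar file could look like the fragment below (the field names and values are invented, not a proposed final schema):

```json
{
  "documents": 83616447,
  "terms": 40000000,
  "postings": 25000000000,
  "stemmer": "porter2"
}
```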
We should use this: https://github.com/gabime/spdlog
It's fast, modern, header-only, flexible, powerful, and uses the {fmt} library for formatting, which is very convenient, too.
Unless there is a very good reason not to, we should support VS compilation as well.
Singleton blocks have been removed since they break tests.
Restore when bug is identified.
#pragma once

#include "global_parameters.hpp"
#include "util/util.hpp"

namespace ds2i {

struct all_ones_sequence {

    inline static uint64_t
    bitsize(global_parameters const& /* params */, uint64_t universe, uint64_t n)
    {
        return (universe == n || n == 1) ? 0 : uint64_t(-1);
    }

    template <typename Iterator>
    static void write(succinct::bit_vector_builder&,
                      Iterator begin,
                      uint64_t universe, uint64_t n,
                      global_parameters const&)
    {
        assert(universe == n || n == 1); (void)universe; (void)n;
        assert(*std::next(begin, n - 1) == universe - 1); (void)begin;
    }

    class enumerator {
    public:
        typedef std::pair<uint64_t, uint64_t> value_type; // (position, value)

        enumerator(succinct::bit_vector const&, uint64_t,
                   uint64_t universe, uint64_t n,
                   global_parameters const&)
            : m_n(n)
            , m_universe(universe)
            , m_position(size())
        {
            assert(universe == n || n == 1); (void)n;
        }

        value_type move(uint64_t position)
        {
            assert(position <= size());
            m_position = position;
            return value();
        }

        value_type next_geq(uint64_t lower_bound)
        {
            assert(lower_bound <= m_universe);
            if (m_n == 1) {
                m_position = 0;
            } else {
                m_position = lower_bound;
            }
            return value();
        }

        value_type next()
        {
            m_position += 1;
            return value();
        }

        uint64_t size() const
        {
            return m_n;
        }

        uint64_t prev_value() const
        {
            if (m_position == 0) {
                return 0;
            }
            if (m_n == 1) {
                return m_universe - 1;
            }
            return m_position - 1;
        }

    private:
        value_type value() const
        {
            if (m_n == 1) {
                return value_type(m_position, m_universe - 1);
            } else {
                return value_type(m_position, m_position);
            }
        }

        uint64_t m_n;
        uint64_t m_universe;
        uint64_t m_position;
    };
};

}
At the very least, we should:
We should also consider running ctest -VV on Travis, to be able to capture the output from there as well.
I think it would be nice to start using some strong types. I made the first attempt here: https://github.com/pisa-engine/pisa/blob/master/include/pisa/invert.hpp#L29
We can start with something like this, and then think about using the type_safe library. Or use it right away; not sure. Anyway, I'd like to start seeing some strong typing in the code. What do you think?
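A minimal sketch of what such a strong type can look like (simplified relative to the linked code, which should be treated as authoritative; the names here are illustrative):

```cpp
#include <cstdint>

// A tagged wrapper: Document_Id and Term_Id become distinct types even though
// both wrap a uint32_t, so mixing them up is a compile-time error.
template <typename Tag>
class Integer_Id {
  public:
    explicit constexpr Integer_Id(std::uint32_t value) : m_value(value) {}
    constexpr std::uint32_t get() const { return m_value; }
    constexpr bool operator==(Integer_Id other) const { return m_value == other.m_value; }

  private:
    std::uint32_t m_value;
};

struct Document_Tag {};
struct Term_Tag {};
using Document_Id = Integer_Id<Document_Tag>;
using Term_Id = Integer_Id<Term_Tag>;
// A function taking Document_Id now rejects a Term_Id at compile time.
```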
The first step to integrate Learning to Rank in Pisa is to have a feature extraction tool from the forward index.
Describe the bug
Batch files are not deleted after invert
program exits.
In the case when topk_queue::size() == k, we perform a pop and a push on the heap, which can be avoided.
Thanks to the property that the queue stays the same length, we can do a little trick:
1. Make the underlying vector k + 1 long; or, in other words, the capacity is this much.
2. When a new element arrives:
2.1. If the size is below k, then we simply do std::push_heap.
2.2. Otherwise, place the new element in the spare slot at the back.
2.3. Do std::pop_heap on the entire vector, and remove the last element.
The pop instruction in 2.3 will get rid of the top entry, replace it with the new one, and fix the heap. Only one heap traversal is necessary. This might not matter for k = 10, but for k = 1000 and beyond, it very well might. And I don't see any drawbacks of doing it this way. It should always be faster.
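The single-traversal idea above can be sketched with an explicit replace-top and sift-down (std::pop_heap formally requires the whole range to already be a heap, so this sketch hand-rolls that one step; the class and member names are illustrative, not PISA's):

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Fixed-capacity top-k queue: a min-heap over the current top k scores.
// When full, a better score overwrites the root and is sifted down once,
// instead of a full pop followed by a full push (two traversals).
class TopK {
  public:
    explicit TopK(std::size_t k) : m_k(k) { m_heap.reserve(k + 1); }

    void insert(float score) {
        if (m_heap.size() < m_k) {
            m_heap.push_back(score);
            std::push_heap(m_heap.begin(), m_heap.end(), std::greater<float>{});
        } else if (score > m_heap.front()) {
            // Replace the current minimum and restore the heap in one pass.
            m_heap.front() = score;
            sift_down(0);
        }
    }

    // Scores in descending order (for inspection; not on the hot path).
    std::vector<float> sorted() const {
        std::vector<float> out = m_heap;
        std::sort(out.begin(), out.end(), std::greater<float>{});
        return out;
    }

  private:
    void sift_down(std::size_t i) {
        for (;;) {
            std::size_t smallest = i;
            std::size_t left = 2 * i + 1, right = 2 * i + 2;
            if (left < m_heap.size() && m_heap[left] < m_heap[smallest]) smallest = left;
            if (right < m_heap.size() && m_heap[right] < m_heap[smallest]) smallest = right;
            if (smallest == i) return;
            std::swap(m_heap[i], m_heap[smallest]);
            i = smallest;
        }
    }

    std::size_t m_k;
    std::vector<float> m_heap;
};
```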
Consider substituting Boost.Test with GTest
So I'm using boost::filesystem right now for convenience, but we wanted to get rid of Boost. It seems that std::filesystem compiler support is still very spotty. We could use std::experimental::filesystem, which as far as I know is supported in both gcc and clang, and I think Visual Studio as well. Not sure about those, though.
Another option is to find an external library, but I'm not sure that's a good path to take.
I suggest we stay with boost::filesystem for now (we have other Boost dependencies anyway), and once we're closer to getting rid of Boost, we should probably migrate to std::experimental::filesystem for wider support (I'm guessing MS will be late to move it to std::filesystem).
This one is low-priority, but when I run, e.g., ctest -j8, one of the tests usually fails with a Bus error. I suspect several tests write to the same file or something.
But as I said, this one is low priority.
Describe the solution you'd like
Some cmd options will repeat throughout many tools, e.g., stem/nostem or index type. I suggest implementing some shortcut to defining these, so we don't have to repeatedly define a variable and then define the option. For one, it's annoying, and for two, any changes we want to make to the same option need to be manually propagated.
Additional context
I believe you can inherit from App to define custom bases, but that's limited (as inheritance is).
I've done something in irkit that we might reuse or adapt. For each commonly used option, you'd define something like this:
struct k_opt {
    int k = 1000;
    template<class Args>
    void set(CLI::App& app, Args& args)
    {
        app.add_option("-k", args->k, "Number of documents to retrieve", true);
    }
};
Then, you'd use it like this:
auto [app, args] = irk::cli::app(
    "Query inverted index",
    index_dir_opt{},
    nostem_opt{},
    id_range_opt{},
    score_function_opt{with_default<std::string>{"bm25"}},
    traversal_type_opt{with_default<Traversal_Type>{Traversal_Type::DAAT}},
    k_opt{},
    trec_run_opt{},
    trec_id_opt{},
    terms_pos{optional});
/* Potentially adding other custom options to app */
CLI11_PARSE(*app, argc, argv);
/* Parsed arguments are in args, e.g., args->index_dir */
We should start to use https://github.com/google/benchmark
Add Catch2 as the new testing framework, so we can already write new tests in it.
Describe the solution you'd like
range-v3 should be added as dependency in cmake.
Motivation
We've talked about using this in a big way, but I think we should start with just adding it and using it from that point forward to improve our code base. Simple examples include:
for (auto [idx, value] : enumerate(std::vector<int>{5, 5, 2, 1})) {}
for (auto [left, right] : zip(...)) {}
Caution
We should exercise caution when using these in performance-sensitive places. I'm not saying it's inefficient, but we should make sure it is before merging such a thing.
Consider substituting Boost thread with TBB
src that, say, compiles query processing stuff once, and exposes an easy-to-use interface from each src file.
Threshold should be in a file, passed with an optional argument, and propagated as an optional vector. The decision whether or not to use a threshold should be made at compile time, so it doesn't slow down other types of queries.
WARC/1.0
WARC-Type: response
WARC-Date: 2012-02-10T21:51:20Z
WARC-TREC-ID: clueweb12-0000tw-00-00013
WARC-Target-URI: http://cheapcosthealthinsurance.com/2012/01/25/what-is-hiv-aids/
WARC-Payload-Digest: sha1:YZUOJNSUMFG3JVUKM6LBHMRMMHWLVNQ4
WARC-IP-Address: 100.42.59.15
WARC-Record-ID: <urn:uuid:74edc71e-a881-4942-81fc-a40db4bf1fb9>
Content-Type: application/http; msgtype=response
Content-Length: 71726
HTTP/1.1 200 OK
Date: Fri, 10 Feb 2012 21:51:22 GMT
Server: Apache/2.2.21 (Unix) mod_ssl/2.2.21 OpenSSL/0.9.8e-fips-rhel5 mod_auth_passthrough/2.1 mod_bwlimited/1.4 FrontPage/5.0.2.2635 mod_jk/1.2.32
X-Powered-By: PHP/5.2.17
X-Pingback: http://cheapcosthealthinsurance.com/xmlrpc.php
Link: <http://cheapcosthealthinsurance.com/?p=711>; rel=shortlink
Connection: close
Content-Type: text/html; charset=UTF-8