pisa-engine / pisa

PISA: Performant Indexes and Search for Academia

Home Page: https://pisa-engine.github.io/pisa/book

License: Apache License 2.0

CMake 1.11% Python 6.92% C++ 89.30% Shell 1.78% Dockerfile 0.89%
information-retrieval inverted-index search search-engine

pisa's People

Contributors

amallia, elshize, gustingonzalez, jmmackenzie, ot, pombredanne, yuhongzhang98


pisa's Issues

Query processing for precision evaluation

Describe the solution you'd like
As discussed, we should be able to run evaluation queries and print out results in TREC format.

Additional context
We need document mapping to be provided, in a similar fashion to terms: #97

Segfault when parsing Clueweb in WARC format

When running:

zcat /data/collections/CW09b/**/*gz | ./bin/parse_collection -f warc -b 10000 -o /data/index/pisa/tmp/cw09b

I get a segfault:

2019-01-13 01:10:57: [Batch 99] Processed documents [990000, 1000000)
2019-01-13 01:11:01: [Batch 100] Processed documents [1000000, 1010000)
2019-01-13 01:11:05: [Batch 102] Processed documents [1020000, 1030000)
2019-01-13 01:11:18: [Batch 103] Processed documents [1030000, 1040000)
2019-01-13 01:11:29: [Batch 108] Processed documents [1080000, 1090000)
2019-01-13 01:11:34: [Batch 104] Processed documents [1040000, 1050000)
2019-01-13 01:11:34: [Batch 105] Processed documents [1050000, 1060000)
2019-01-13 01:11:36: [Batch 107] Processed documents [1070000, 1080000)
2019-01-13 01:11:41: [Batch 106] Processed documents [1060000, 1070000)
2019-01-13 01:11:44: [Batch 109] Processed documents [1090000, 1100000)
2019-01-13 01:12:04: [Batch 110] Processed documents [1100000, 1110000)
2019-01-13 01:12:16: [Batch 111] Processed documents [1110000, 1120000)
Segmentation fault (core dumped)

Needs investigating.

Support using string terms with forward index

For example, if we pass --terms <file> or something of the sort, we assume that the input consists of terms to be translated to IDs later.

While we're at it, a program to map terms to IDs in a file would be useful as well.

  • Lexicon and lexicon tools #89 will not do it for now
  • Support lexicon for queries, thresholds, etc. with std::unordered_map

Support pipeline of operations

} else if (format == "warc") {
    Forward_Index_Builder<warcpp::Warc_Record> builder;
    builder.build(
        std::cin,
        output_filename,
        [](std::istream &in) -> std::optional<warcpp::Warc_Record> {
            warcpp::Warc_Record record;
            if (warcpp::read_warc_record(in, record)) {
                return std::make_optional(record);
            }
            return std::nullopt;
        },
        [&](std::string &&term) -> std::string {
            std::transform(term.begin(),
                           term.end(),
                           term.begin(),
                           [](unsigned char c) { return std::tolower(c); });
            return stem::Porter2{}.stem(term);
        },
        parse_html_content,
        batch_size,
        threads);
}

It would be nice if we could build a pipeline of actions to perform on the input stream.

We could decide to: parse_warc | strip_http_header | parse_html | stem | ...
It would give us much more freedom and simplify the code. The pipeline would first be built according to what we are parsing (parse_warc or parse_plaintext) and then passed to the forward index builder.

The only library to do so that I am aware of is https://github.com/ericniebler/range-v3

Discrepancies in query result set sizes

ranked_or_query<WandType> or_q(wdata, 10);

There seems to be something wrong with query results when I vary k. (Or I don't understand something.)

To show what I mean, I change this to:

            ranked_or_query<WandType> or_q(wdata, 1);

            for (auto const& q: queries) {
                or_q(index, q);
                op_q(index, q);
                //BOOST_REQUIRE_EQUAL(or_q.topk().size(), op_q.topk().size());
                std::cerr << or_q.topk().size() << '\n';
                for (size_t i = 0; i < or_q.topk().size(); ++i) {
                    BOOST_REQUIRE_CLOSE(or_q.topk()[i].first, op_q.topk()[i].first, 0.1); // tolerance is % relative
                }
            }

I printed out ranked_or_query result sizes for all queries with both k = 1 and k = 10. In the former case, there are some non-zero sizes (nothing equal to 10, though). In the latter, all are zeroes, when I expected some ones.

Here's the output files: k10.txt and k1.txt

If I set k = 1, I should still get the same top result as for k = 10, shouldn't I?

Upgrade CLI11 to 1.8 once released

Additional context

In master, they added something we should definitely use, as it's clean and useful:
CLIUtils/CLI11#222

Essentially, you can add a set of available values, e.g.,

stemmer_option->check(CLI::IsMember{"porter2", "krovetz", "none"});

This feature is tagged for version 1.8.

Dead code block

Isn't this a dead block? The UB scores are computed in the constructor of the partition and returned there, unless I am mistaken.

std::vector<std::tuple<uint64_t, float, bool>> doc_score_top;
for (size_t i = 0; i < seq.docs.size(); ++i) {
    uint64_t docid = *(seq.docs.begin() + i);
    uint64_t freq = *(seq.freqs.begin() + i);
    float score = Scorer::doc_term_weight(freq, norm_lens[docid]);
    doc_score_top.emplace_back(docid, score, false);
    max_score = std::max(max_score, score);
}

queries.hpp should include algorithms outside of namespace

queries.hpp includes a bunch of other headers, but it does so inside a namespace. This is a problem when you need to add an external header to one of these files: you end up including, say, the tbb namespace inside ds2i, which makes it ds2i::tbb.

Rather, we should put the namespace in each of those files separately.

Caching dependencies in Travis

Since we have a few dependencies we build every time, can we still cache them even if we use them as subdirectories?

Or, here's an unpopular opinion, maybe we shouldn't build them from subdirectories.

Use Catch2 CMake functions to discover tests

We can enable Catch2 helpers with:

include(CTest)
include(Catch)

and then add tests with

catch_discover_tests(executable_name)

It's nicer because it exposes all individual test cases to ctest, rather than only one test per file.

NOTE: This automatically worked with find_package(Catch2); I'm not sure whether you need an extra include when you use a subdirectory, but I'm sure it's very easy either way.

Parsing collection and queries

We need to have term/query/document mappings to release data after ECIR, among other things.

I suggest two programs:

  1. A collection parser that takes a text collection and produces three things:
    a. a binary_collection to use for tests
    b. a two-way term/ID mapping
    c. a two-way document/ID mapping
  2. A query parser that uses the above to translate terms in queries to IDs.

I have some code in my repo that I can use, so I'll try to incorporate it here.

Parsing CC-NEWS

When parsing WARC, we assume there is a TREC ID field. This is not true for CC-NEWS, so it needs to be taken into account.

WARC/1.0
WARC-Record-ID: <urn:uuid:4e7b712c-f7fd-4e31-81ad-b6b8b9a190c8>
Content-Length: 38840
WARC-Date: 2016-12-01T13:28:43Z
WARC-Type: response
WARC-Target-URI: http://www.banker.bg/finansov-dnevnik/read/parlamentut-prie-biudjetite-na-vsichki-ministerstva?utm_source=rss&utm_medium=click&utm_campaign=rss
Content-Type: application/http; msgtype=response
WARC-Payload-Digest: sha1:NKWIH3RIPHRCTJ7F7VM45KZFUR5DCXMW
WARC-Block-Digest: sha1:EOKDPEU3LQ7XESEPNRQO34YDAEDADIL6

Convert into a constexpr unordered_map

#define DS2I_INDEX_TYPES \
(ef)(single)(uniform)(opt)(block_optpfor)(block_varintg8iu)(block_streamvbyte)( \
block_maskedvbyte)(block_interpolative)(block_qmx)(block_varintgb)(block_simple8b)( \
block_simple16)(block_simdbp)(block_mixed)
#define DS2I_BLOCK_INDEX_TYPES \
(block_optpfor)(block_varintg8iu)(block_streamvbyte)(block_maskedvbyte)(block_interpolative)( \
block_qmx)(block_varintgb)(block_simple8b)(block_simple16)(block_simdbp)(block_mixed)

This can be converted into a constexpr map, s.t. the following can be implemented with a table lookup.

if (false) {
#define LOOP_BODY(R, DATA, T) \
} else if (type == BOOST_PP_STRINGIZE(T)) { \
selective_queries<BOOST_PP_CAT(T, _index)> \
(index_filename, type);
/**/
BOOST_PP_SEQ_FOR_EACH(LOOP_BODY, _, DS2I_INDEX_TYPES);
#undef LOOP_BODY
} else {
logger() << "ERROR: Unknown type " << type << std::endl;
}

`invert` produces an empty list

To Reproduce

  1. Parse Clueweb09B with parse_collection (mg4j version)
  2. Invert using invert
  3. Try creating the index
./bin/create_freq_index -t block_simdbp -c /data/index/pisa/cw09b/cw09b -o /data/index/pisa/cw09b/cw09b.block_simdbp

Error message

[2019-02-11 13:59:36.478] [info] Processing 83616447 documents
Create index: 96% [18m 41s]terminate called after throwing an instance of 'std::invalid_argument'
  what():  List must be nonempty
Aborted (core dumped)

Expected behavior
invert should never create empty lists, so there is a bug. invert should also have a check at the end to verify that.

Define a Scorer function API

Describe the solution you'd like
A Scorer should have a generic constructor and an operator() which takes docid and freq as parameters.
The constructor will take care of storing any additional information needed by the scorer in order to perform the computation, the operator() will actually perform it and return the score.

Additional context
All the query algorithms need to be changed to take a vector of Scorer instances (one for each term), previously generated according to the Scorer chosen by the user.

Store properties and statistics of an index

I was thinking that maybe we should use an additional file, say, in JSON format, to store some additional information about an index, such as document/term counts, number of postings, etc.

Additional context

For one, this is good to have. But I also have an inspirational example: when parsing a collection, we produce a forward index in binary_collection without any information about the number of terms. But later on, inverting can be significantly simplified if the term count is known, so we pass it on the command line. This is (a) annoying, (b) error prone, and (c) makes it more difficult (or less efficient) to automate inverting all shards at once (either from the program or a script). It would be nice if we had a file we can read those stats from.

Backwards compatibility

I'd say we can make it a mandatory file for a forward index, and only optional for an inverted index for now. But even if we don't use it from the code that compresses the index, we can use those stats.
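For illustration, such a file might look like the following (the field names and all values except the Clueweb09B document count mentioned elsewhere in this tracker are made up):

```json
{
  "documents": 83616447,
  "terms": 987654,
  "postings": 123456789,
  "avg_document_length": 512.3
}
```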

Support Windows

Unless there is a very good reason not to, we should support VS compilation as well.

Restore Singleton blocks

Singleton blocks have been removed since they break tests.
Restore them once the bug is identified.

#pragma once

#include "global_parameters.hpp"
#include "util/util.hpp"

namespace ds2i {

    struct all_ones_sequence {

        inline static uint64_t
        bitsize(global_parameters const& /* params */, uint64_t universe, uint64_t n)
        {
            return (universe == n || n == 1) ? 0 : uint64_t(-1);
        }

        template <typename Iterator>
        static void write(succinct::bit_vector_builder&,
                          Iterator begin,
                          uint64_t universe, uint64_t n,
                          global_parameters const&)
        {
            assert(universe == n || n == 1); (void)universe; (void)n;
            assert(*std::next(begin, n - 1) == universe - 1); (void)begin;
        }

        class enumerator {
        public:

            typedef std::pair<uint64_t, uint64_t> value_type; // (position, value)

            enumerator(succinct::bit_vector const&, uint64_t,
                       uint64_t universe, uint64_t n,
                       global_parameters const&)
                : m_n(n)
                , m_universe(universe)
                , m_position(size())
            {
                assert(universe == n || n == 1); (void)n;
            }

            value_type move(uint64_t position)
            {
                assert(position <= size());
                m_position = position;
                return value();
            }

            value_type next_geq(uint64_t lower_bound)
            {
                assert(lower_bound <= m_universe);
                if (m_n == 1) {
                    m_position = 0;
                } else {
                    m_position = lower_bound;
                }
                return value();
            }

            value_type next()
            {
                m_position += 1;
                return value();
            }

            uint64_t size() const
            {
                return m_n;
            }

            uint64_t prev_value() const
            {
                if (m_position == 0) {
                    return 0;
                }
                if (m_n == 1) {
                    return m_universe - 1;
                }
                return m_position - 1;
            }

        private:
            value_type value() const {
                if (m_n == 1) {
                    return value_type(m_position, m_universe - 1);
                } else {
                    return value_type(m_position, m_position);
                }
            }

            uint64_t m_n;
            uint64_t m_universe;
            uint64_t m_position;
        };
    };
}

test_bit_vector fails nondeterministically

At the very least, we should:

  • Draw a random seed
  • Print the seed out
  • Use the seed for generating bit vectors
  • Debug using the seed that fails

We should also consider running ctest -VV on Travis, to be able to capture the output from there as well.

Improve topk_queue::insert() to require only one pass down the heap

In the case when topk_queue::size() == k, we perform a pop and a push on the heap, which can be avoided.

Thanks to the property that the queue stays the same length, we can do a little trick.

  1. We make sure that our internal array is at least k + 1 long; or, in other words, the capacity is this much.
  2. When pushing to the heap, we do the following:
    2.1. If we're at the threshold or above, we push/emplace back the current entry (even if the heap is full).
    2.2. If the size after push is at most k, then we simply do std::push_heap.
    2.3. If we overflow by one, we perform a std::pop_heap on the entire vector, and remove the last element.

Explanation

The pop instruction in 2.3 will get rid of the top entry, replace it with the new one, and fix the heap. Only one heap traversal is necessary. This might not matter for k = 10, but for k = 1000 and beyond it very well might. And I don't see any drawbacks of doing it this way; it should always be faster.

Replace boost::filesystem (or not)

So I'm using boost::filesystem right now for convenience, but we wanted to get rid of Boost. It seems that std::filesystem compiler support is still very spotty. We could use std::experimental::filesystem, which as far as I know is supported in both gcc and clang, and I think Visual Studio as well. Not sure about those, though.

Another option is to find an external library but I'm not sure that's a good path to take.

I suggest we stay with boost::filesystem right now (we have other Boost dependencies anyway), and once we're closer to getting rid of Boost, we should probably migrate to std::experimental::filesystem for wider support (I'm guessing MS will be late to move it to std::filesystem).

Tests fail when run in parallel

This one is low-priority but when I run, e.g., ctest -j8, one of the tests usually fails with Bus error. I suspect several tests write to the same file or something.

But as I said, this one is low priority.

Easier use of standard command line options

Describe the solution you'd like

Some cmd options will repeat throughout many tools, e.g., stem/nostem or index type. I suggest implementing some shortcut to defining these, so we don't have to repeatedly define a variable and then define the option. For one, it's annoying, and for two, any changes we want to make to the same option need to be manually propagated.

Additional context

I believe you can inherit from App to define custom bases, but that's limited (as inheritance is).

I've done something in irkit that we might reuse or adapt. For each commonly used option, you'd define something like this:

struct k_opt {
    int k = 1000;

    template<class Args>
    void set(CLI::App& app, Args& args)
    {
        app.add_option("-k", args->k, "Number of documents to retrieve", true);
    }
};

Then, you'd use it like this:

auto [app, args] = irk::cli::app(
        "Query inverted index",
        index_dir_opt{},
        nostem_opt{},
        id_range_opt{},
        score_function_opt{with_default<std::string>{"bm25"}},
        traversal_type_opt{with_default<Traversal_Type>{Traversal_Type::DAAT}},
        k_opt{},
        trec_run_opt{},
        trec_id_opt{},
        terms_pos{optional});
/* Potentially adding other custom options to app */
CLI11_PARSE(*app, argc, argv);
/* Parsed arguments are in args, e.g., args->index_dir */

Add Catch2

Add Catch2 as the new testing framework, so we can already start writing new tests in it.

Add range-v3 dependency

Describe the solution you'd like
range-v3 should be added as dependency in cmake.

Motivation
We've talked about using this in a big way, but I think we should start with just adding it and, from that point forward, using it to improve our code base. Simple examples include:

for (auto [idx, value] : enumerate(std::vector<int>{5, 5, 2, 1})) {}
for (auto [left, right] : zip(...)) {}

Caution
We should exercise caution when using these in performance-sensitive places. I'm not saying it's inefficient, but we should make sure it is before merging such a thing.

Parsing not working with Clueweb12

WARC/1.0
WARC-Type: response
WARC-Date: 2012-02-10T21:51:20Z
WARC-TREC-ID: clueweb12-0000tw-00-00013
WARC-Target-URI: http://cheapcosthealthinsurance.com/2012/01/25/what-is-hiv-aids/
WARC-Payload-Digest: sha1:YZUOJNSUMFG3JVUKM6LBHMRMMHWLVNQ4
WARC-IP-Address: 100.42.59.15
WARC-Record-ID: <urn:uuid:74edc71e-a881-4942-81fc-a40db4bf1fb9>
Content-Type: application/http; msgtype=response
Content-Length: 71726

HTTP/1.1 200 OK
Date: Fri, 10 Feb 2012 21:51:22 GMT
Server: Apache/2.2.21 (Unix) mod_ssl/2.2.21 OpenSSL/0.9.8e-fips-rhel5 mod_auth_passthrough/2.1 mod_bwlimited/1.4 FrontPage/5.0.2.2635 mod_jk/1.2.32
X-Powered-By: PHP/5.2.17
X-Pingback: http://cheapcosthealthinsurance.com/xmlrpc.php
Link: <http://cheapcosthealthinsurance.com/?p=711>; rel=shortlink
Connection: close
Content-Type: text/html; charset=UTF-8
