PISA: Performant Indexes and Search for Academia
Home Page: https://pisa-engine.github.io/pisa/book
License: Apache License 2.0
Describe the solution you'd like
As discussed, the queries tool should be able to run evaluation queries and print out the results in TREC format.
Additional context
We need document mapping to be provided, in a similar fashion to terms: #97
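For reference, a TREC run file has one line per retrieved document: query ID, the literal Q0, document ID, rank, score, and a run tag. A minimal sketch of the output step (the Result struct and function names are illustrative, not PISA's API):

```cpp
#include <string>
#include <vector>

// One retrieved document for a query; an illustrative struct, not PISA's type.
struct Result {
    std::string docid;
    float score;
};

// Render results in the standard six-column TREC run format:
//   <qid> Q0 <docno> <rank> <score> <run_tag>
std::string to_trec(const std::string& qid,
                    const std::vector<Result>& results,
                    const std::string& run_tag) {
    std::string out;
    int rank = 1;
    for (const auto& r : results) {
        out += qid + " Q0 " + r.docid + " " + std::to_string(rank++) + " "
             + std::to_string(r.score) + " " + run_tag + "\n";
    }
    return out;
}
```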
Since https://github.com/tlx/tlx has some cool functionality we will need anyway at some point (B+ trees, RadixHeap, math functions, ...), it probably makes sense to replace some of the Boost string algorithms with the ones provided there.
Rewrite existing tests and get rid of Boost.Test
When run:
zcat /data/collections/CW09b/**/*gz | ./bin/parse_collection -f warc -b 10000 -o /data/index/pisa/tmp/cw09b
I get segfault:
2019-01-13 01:10:57: [Batch 99] Processed documents [990000, 1000000)
2019-01-13 01:11:01: [Batch 100] Processed documents [1000000, 1010000)
2019-01-13 01:11:05: [Batch 102] Processed documents [1020000, 1030000)
2019-01-13 01:11:18: [Batch 103] Processed documents [1030000, 1040000)
2019-01-13 01:11:29: [Batch 108] Processed documents [1080000, 1090000)
2019-01-13 01:11:34: [Batch 104] Processed documents [1040000, 1050000)
2019-01-13 01:11:34: [Batch 105] Processed documents [1050000, 1060000)
2019-01-13 01:11:36: [Batch 107] Processed documents [1070000, 1080000)
2019-01-13 01:11:41: [Batch 106] Processed documents [1060000, 1070000)
2019-01-13 01:11:44: [Batch 109] Processed documents [1090000, 1100000)
2019-01-13 01:12:04: [Batch 110] Processed documents [1100000, 1110000)
2019-01-13 01:12:16: [Batch 111] Processed documents [1110000, 1120000)
Segmentation fault (core dumped)
Needs investigating.
Opening this for discussion.
For example, if we pass --terms <file> or something of the sort, we assume that the input consists of terms to be translated to IDs later.
While we're at it, a program to map terms to IDs in a file would be useful as well: queries, thresholds, etc., with std::unordered_map.
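A minimal sketch of such a term-to-ID translation (the convention that a term's ID is its position in the sorted term list is an assumption here, as are the function names):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Build a term -> ID map, where a term's ID is its position in the list.
std::unordered_map<std::string, std::uint32_t>
build_term_map(const std::vector<std::string>& terms) {
    std::unordered_map<std::string, std::uint32_t> map;
    map.reserve(terms.size());
    for (std::uint32_t id = 0; id < terms.size(); ++id) {
        map.emplace(terms[id], id);
    }
    return map;
}

// Translate a parsed query into term IDs, dropping out-of-vocabulary terms.
std::vector<std::uint32_t>
translate(const std::vector<std::string>& query,
          const std::unordered_map<std::string, std::uint32_t>& map) {
    std::vector<std::uint32_t> ids;
    for (const auto& term : query) {
        if (auto it = map.find(term); it != map.end()) {
            ids.push_back(it->second);
        }
    }
    return ids;
}
```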
Enable and possibly modify the test
Lines 48 to 69 in e3a3121
It would be nice if we could build a pipeline of actions to perform on the input stream.
We could decide to: parse_warc | strip_http_header | parse_html | stem | ...
It would give us much more freedom and would simplify the code. The pipeline would first be built according to what we are performing (parse_warc or parse_plaintext) and then passed to the forward index builder.
The only library to do so that I am aware of is https://github.com/ericniebler/range-v3
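Without committing to range-v3, the basic idea can be sketched with plain std::function stages composed at startup; a minimal sketch (the stage names in the usage comment mirror the ones above but are hypothetical):

```cpp
#include <functional>
#include <string>
#include <vector>

// A stage transforms a document's text; a pipeline is just their composition.
using Stage = std::function<std::string(std::string)>;

// Compose the selected stages into one transformation that the
// forward-index builder can apply to each document, e.g.:
//   auto pipeline = build_pipeline({strip_http_header, parse_html, stem});
Stage build_pipeline(std::vector<Stage> stages) {
    return [stages = std::move(stages)](std::string text) {
        for (const auto& stage : stages) {
            text = stage(std::move(text));
        }
        return text;
    };
}
```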
Consider to substitute Boost iostreams with https://github.com/mandreyel/mio
Consider the following:
pisa/test/test_ranked_queries.cpp
Line 45 in 8177a47
There seems to be something wrong with the query results when I vary k. (Or I don't understand something.)
To show what I mean, I changed this to:
ranked_or_query<WandType> or_q(wdata, 1);
for (auto const& q : queries) {
    or_q(index, q);
    op_q(index, q);
    // BOOST_REQUIRE_EQUAL(or_q.topk().size(), op_q.topk().size());
    std::cerr << or_q.topk().size() << '\n';
    for (size_t i = 0; i < or_q.topk().size(); ++i) {
        BOOST_REQUIRE_CLOSE(or_q.topk()[i].first, op_q.topk()[i].first, 0.1); // tolerance is % relative
    }
}
I printed out the ranked_or_query result sizes for all queries with both k = 1 and k = 10. In the former case, there are some non-zero sizes (nothing equal to 10, though). In the latter, all are zeroes, when I expected some ones.
Here are the output files: k10.txt and k1.txt
If I set k = 1, I should still get the same top result as for k = 10, shouldn't I?
There seems to be something wrong but it needs further investigating. I open this as a reminder for myself to provide more information and validate the input network I used.
Add progress bar to the verify index task
Line 22 in 8177a47
Is this something we should be aware of? Or can it be removed?
Additional context
In master
, they added something we should definitely use, as it's clean and useful:
CLIUtils/CLI11#222
Essentially, you can add a set of available values, e.g.,
stemmer_option->check(CLI::IsMember{"porter2", "krovetz", "none"});
This feature is tagged for version 1.8
.
Anything else?
Isn't this a dead block? The UB scores are computed in the constructor of the partition and returned as t unless I am mistaken.
pisa/include/pisa/wand_data_compressed.hpp
Lines 119 to 126 in 3672c4c
queries.hpp includes a bunch of other headers, but it does so inside a namespace. This is a problem when you need to add an external header to one of those files: then you're including, say, the tbb namespace inside ds2i, which makes it ds2i::tbb.
Rather, we should put the namespace in each of those files separately.
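A self-contained illustration of the failure mode (toy namespaces standing in for ds2i and tbb):

```cpp
namespace outer {
// Imagine this block were `#include <tbb/tbb.h>` placed inside `namespace ds2i`:
// everything the header declares lands nested inside `outer`.
namespace tbb {
    inline int task_count() { return 1; }
}
}

// The header's names are now only reachable as outer::tbb::..., so a later,
// correctly placed include of the same header would declare a second,
// unrelated copy of the namespace at global scope.
int use() { return outer::tbb::task_count(); }
```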
Since we have a few dependencies we build every time, can we still cache them even if we use them as subdirectories?
Or, here's an unpopular opinion, maybe we shouldn't build them from subdirectories.
We can enable Catch2 helpers with:
include(CTest)
include(Catch)
and then add tests with
catch_discover_tests(executable_name)
It's nicer because it exposes all individual test cases to ctest, instead of just one per file.
NOTE: This automatically worked with find_package(Catch2); not sure if you need to do some extra include when you use a subdirectory, but I'm sure it's very easy either way.
We need to have term/query/document mappings to release data after ECIR, among other things.
I suggest two programs:
binary_collection to use for tests.
I got some code in my repo that I can use, so I'll try to incorporate it here.
When parsing WARC, we assume there is a TREC ID field. This is not true for CC-NEWS, so it needs to be taken into account.
WARC/1.0
WARC-Record-ID: <urn:uuid:4e7b712c-f7fd-4e31-81ad-b6b8b9a190c8>
Content-Length: 38840
WARC-Date: 2016-12-01T13:28:43Z
WARC-Type: response
WARC-Target-URI: http://www.banker.bg/finansov-dnevnik/read/parlamentut-prie-biudjetite-na-vsichki-ministerstva?utm_source=rss&utm_medium=click&utm_campaign=rss
Content-Type: application/http; msgtype=response
WARC-Payload-Digest: sha1:NKWIH3RIPHRCTJ7F7VM45KZFUR5DCXMW
WARC-Block-Digest: sha1:EOKDPEU3LQ7XESEPNRQO34YDAEDADIL6
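A hedged sketch of the record-ID selection this implies: prefer WARC-TREC-ID when present, otherwise fall back to WARC-Record-ID (field names are taken from the records shown; the surrounding parser and helper names are assumptions):

```cpp
#include <optional>
#include <sstream>
#include <string>

// Extract the value of a "Name: value" WARC header field, if present.
std::optional<std::string>
header_field(const std::string& headers, const std::string& name) {
    std::istringstream in(headers);
    std::string line;
    while (std::getline(in, line)) {
        if (line.rfind(name + ": ", 0) == 0) {  // line starts with "Name: "
            std::string value = line.substr(name.size() + 2);
            if (!value.empty() && value.back() == '\r') value.pop_back();
            return value;
        }
    }
    return std::nullopt;
}

// CC-NEWS records carry no WARC-TREC-ID, so fall back to WARC-Record-ID.
std::string record_id(const std::string& headers) {
    if (auto id = header_field(headers, "WARC-TREC-ID")) return *id;
    if (auto id = header_field(headers, "WARC-Record-ID")) return *id;
    return "";
}
```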
Lines 50 to 56 in c421f3d
This can be converted into a constexpr map, s.t. the following can be implemented with a table lookup.
pisa/benchmarks/selective_queries.cpp
Lines 67 to 78 in 8177a47
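The shape of such a constexpr map can be sketched as follows (the referenced block is not shown here, so the keys and codes below are placeholders, not the real entries):

```cpp
#include <array>
#include <optional>
#include <string_view>
#include <utility>

// A constexpr table of name -> code pairs; branching on a string becomes
// a single lookup. Requires C++17. Entries are illustrative placeholders.
constexpr std::array<std::pair<std::string_view, int>, 3> table{{
    {"porter2", 0},
    {"krovetz", 1},
    {"none", 2},
}};

// Linear scan is fine for a handful of entries and stays constexpr-friendly.
constexpr std::optional<int> lookup(std::string_view key) {
    for (std::size_t i = 0; i < table.size(); ++i) {
        if (table[i].first == key) return table[i].second;
    }
    return std::nullopt;
}

// The lookup can even be checked at compile time.
static_assert(*lookup("krovetz") == 1);
static_assert(!lookup("unknown").has_value());
```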
To Reproduce
1. parse_collection (mg4j version)
2. invert
3. ./bin/create_freq_index -t block_simdbp -c /data/index/pisa/cw09b/cw09b -o /data/index/pisa/cw09b/cw09b.block_simdbp
Error message
[2019-02-11 13:59:36.478] [info] Processing 83616447 documents
Create index: 96% [18m 41s]terminate called after throwing an instance of 'std::invalid_argument'
what(): List must be nonempty
Aborted (core dumped)
Expected behavior
invert should never create empty lists. There is a bug. invert should have a check at the end to verify that.
Describe the solution you'd like
A Scorer should have a generic constructor and an operator() which takes docid and freq as parameters.
The constructor will take care of storing any additional information needed by the scorer in order to perform the computation; the operator() will actually perform it and return the score.
Additional context
All the query algorithms need to be changed to take a vector of Scorer instances (one for each term), previously generated according to the Scorer chosen by the user.
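A sketch of the proposed shape (the TF-IDF-style formula is purely a stand-in to show "additional information" being captured at construction; it is not the project's actual scorer):

```cpp
#include <cmath>
#include <cstdint>

// One scorer per query term: the constructor captures everything the scoring
// function needs (here, collection size and the term's document frequency),
// and operator() computes the score for a single posting.
class Scorer {
  public:
    Scorer(std::uint64_t num_docs, std::uint64_t term_doc_freq)
        : m_idf(std::log(static_cast<double>(num_docs)
                         / static_cast<double>(term_doc_freq))) {}

    double operator()(std::uint32_t /* docid */, std::uint32_t freq) const {
        return m_idf * static_cast<double>(freq);
    }

  private:
    double m_idf;
};
```

A query algorithm would then hold one Scorer per term and call scorers[t](docid, freq) inside its traversal loop.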
I was thinking that maybe we should use an additional file, say, in JSON format, to store some additional information about an index, such as document/term count, number of postings, etc.
Additional context
For one, this is good to have. But I also have an inspirational example: when parsing a collection, we produce a forward index in binary_collection format without any information about the number of terms. But later on, inverting can be significantly simplified if the term count is known, so we pass it on the command line. This is (a) annoying, (b) error-prone, and (c) makes it more difficult (or less efficient) to automate inverting all shards at a time (either from the program or a script). It would be nice if we had a file we could read those stats from.
Backwards compatibility
I'd say we can make it a mandatory file for an inverted index, and only optional for a forward index for now. But even if we don't use it from the code compressing the index, we can use those stats.
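For illustration, such a sidecar file could look like the fragment below (the field names and values are invented, not a proposed final schema):

```json
{
  "documents": 83616447,
  "terms": 40000000,
  "postings": 25000000000,
  "stemmer": "porter2"
}
```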
We should use this: https://github.com/gabime/spdlog
It's fast, modern, header-only, flexible, powerful, and uses the {fmt} library for formatting, which is very convenient, too.
Unless there is a very good reason not to, we should support VS compilation as well.
Singleton blocks have been removed since they break tests.
Restore when bug is identified.
#pragma once

#include "global_parameters.hpp"
#include "util/util.hpp"

namespace ds2i {

struct all_ones_sequence {

    inline static uint64_t
    bitsize(global_parameters const& /* params */, uint64_t universe, uint64_t n)
    {
        return (universe == n || n == 1) ? 0 : uint64_t(-1);
    }

    template <typename Iterator>
    static void write(succinct::bit_vector_builder&,
                      Iterator begin,
                      uint64_t universe, uint64_t n,
                      global_parameters const&)
    {
        assert(universe == n || n == 1); (void)universe; (void)n;
        assert(*std::next(begin, n - 1) == universe - 1); (void)begin;
    }

    class enumerator {
    public:
        typedef std::pair<uint64_t, uint64_t> value_type; // (position, value)

        enumerator(succinct::bit_vector const&, uint64_t,
                   uint64_t universe, uint64_t n,
                   global_parameters const&)
            : m_n(n)
            , m_universe(universe)
            , m_position(size())
        {
            assert(universe == n || n == 1); (void)n;
        }

        value_type move(uint64_t position)
        {
            assert(position <= size());
            m_position = position;
            return value();
        }

        value_type next_geq(uint64_t lower_bound)
        {
            assert(lower_bound <= m_universe);
            if (m_n == 1) {
                m_position = 0;
            } else {
                m_position = lower_bound;
            }
            return value();
        }

        value_type next()
        {
            m_position += 1;
            return value();
        }

        uint64_t size() const
        {
            return m_n;
        }

        uint64_t prev_value() const
        {
            if (m_position == 0) {
                return 0;
            }
            if (m_n == 1) {
                return m_universe - 1;
            }
            return m_position - 1;
        }

    private:
        value_type value() const
        {
            if (m_n == 1) {
                return value_type(m_position, m_universe - 1);
            } else {
                return value_type(m_position, m_position);
            }
        }

        uint64_t m_n;
        uint64_t m_universe;
        uint64_t m_position;
    };
};

}
At the very least, we should:
We should also consider running ctest -VV on Travis, to be able to capture the output from there as well.
I think it would be nice to start using some strong types. I made the first attempt here: https://github.com/pisa-engine/pisa/blob/master/include/pisa/invert.hpp#L29
We can start with something like this, and then think about using the type_safe library. Or use it right away; not sure. Anyway, I'd like to start seeing some strong typing in the code. What do you think?
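A minimal sketch of what such a strong type can look like (simplified relative to the linked code, which should be treated as authoritative; the names here are illustrative):

```cpp
#include <cstdint>

// A tagged wrapper: Document_Id and Term_Id become distinct types even though
// both wrap a uint32_t, so mixing them up is a compile-time error.
template <typename Tag>
class Integer_Id {
  public:
    explicit constexpr Integer_Id(std::uint32_t value) : m_value(value) {}
    constexpr std::uint32_t get() const { return m_value; }
    constexpr bool operator==(Integer_Id other) const { return m_value == other.m_value; }

  private:
    std::uint32_t m_value;
};

struct Document_Tag {};
struct Term_Tag {};
using Document_Id = Integer_Id<Document_Tag>;
using Term_Id = Integer_Id<Term_Tag>;
// A function taking Document_Id now rejects a Term_Id at compile time.
```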
The first step to integrate Learning to Rank in Pisa is to have a feature extraction tool from the forward index.
Describe the bug
Batch files are not deleted after invert
program exits.
In the case when topk_queue::size() == k, we perform a pop and a push on the heap, which can be avoided.
Thanks to the property that the queue stays the same length, we can do a little trick:
1. Make the underlying vector k + 1 long; or, in other words, the capacity is this much.
2. When a new element arrives:
2.1. If the size is below k, then we simply do std::push_heap.
2.2. Otherwise, place the new element in the spare slot at the back.
2.3. Do std::pop_heap on the entire vector, and remove the last element.
The pop instruction in 2.3 will get rid of the top entry, replace it with the new one, and fix the heap. Only one heap traversal is necessary. This might not matter for k = 10, but for k = 1000 and beyond, it very well might. And I don't see any drawbacks of doing it this way. It should always be faster.
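The single-traversal idea above can be sketched with an explicit replace-top and sift-down (std::pop_heap formally requires the whole range to already be a heap, so this sketch hand-rolls that one step; the class and member names are illustrative, not PISA's):

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Fixed-capacity top-k queue: a min-heap over the current top k scores.
// When full, a better score overwrites the root and is sifted down once,
// instead of a full pop followed by a full push (two traversals).
class TopK {
  public:
    explicit TopK(std::size_t k) : m_k(k) { m_heap.reserve(k + 1); }

    void insert(float score) {
        if (m_heap.size() < m_k) {
            m_heap.push_back(score);
            std::push_heap(m_heap.begin(), m_heap.end(), std::greater<float>{});
        } else if (score > m_heap.front()) {
            // Replace the current minimum and restore the heap in one pass.
            m_heap.front() = score;
            sift_down(0);
        }
    }

    // Scores in descending order (for inspection; not on the hot path).
    std::vector<float> sorted() const {
        std::vector<float> out = m_heap;
        std::sort(out.begin(), out.end(), std::greater<float>{});
        return out;
    }

  private:
    void sift_down(std::size_t i) {
        for (;;) {
            std::size_t smallest = i;
            std::size_t left = 2 * i + 1, right = 2 * i + 2;
            if (left < m_heap.size() && m_heap[left] < m_heap[smallest]) smallest = left;
            if (right < m_heap.size() && m_heap[right] < m_heap[smallest]) smallest = right;
            if (smallest == i) return;
            std::swap(m_heap[i], m_heap[smallest]);
            i = smallest;
        }
    }

    std::size_t m_k;
    std::vector<float> m_heap;
};
```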
Consider substituting Boost.Test with GTest
So I'm using boost::filesystem right now for convenience, but we wanted to get rid of Boost. It seems that std::filesystem compiler support is still very spotty. We could use std::experimental::filesystem, which as far as I know is supported in both gcc and clang, and I think Visual Studio as well. Not sure about those, though.
Another option is to find an external library, but I'm not sure that's a good path to take.
I suggest we stay with boost::filesystem for now (we have other Boost dependencies anyway), and once we're closer to getting rid of Boost, we should probably migrate to std::experimental::filesystem for wider support (I'm guessing MS will be late to move it to std::filesystem).
This one is low-priority, but when I run, e.g., ctest -j8, one of the tests usually fails with a Bus error. I suspect several tests write to the same file or something.
But as I said, this one is low priority.
Describe the solution you'd like
Some cmd options will repeat throughout many tools, e.g., stem/nostem or index type. I suggest implementing some shortcut to defining these, so we don't have to repeatedly define a variable and then define the option. For one, it's annoying, and for two, any changes we want to make to the same option need to be manually propagated.
Additional context
I believe you can inherit from App to define custom bases, but that's limited (as inheritance is).
I've done something in irkit that we might reuse or adapt. For each commonly used option, you'd define something like this:
struct k_opt {
    int k = 1000;
    template<class Args>
    void set(CLI::App& app, Args& args)
    {
        app.add_option("-k", args->k, "Number of documents to retrieve", true);
    }
};
Then, you'd use it like this:
auto [app, args] = irk::cli::app(
    "Query inverted index",
    index_dir_opt{},
    nostem_opt{},
    id_range_opt{},
    score_function_opt{with_default<std::string>{"bm25"}},
    traversal_type_opt{with_default<Traversal_Type>{Traversal_Type::DAAT}},
    k_opt{},
    trec_run_opt{},
    trec_id_opt{},
    terms_pos{optional});
/* Potentially adding other custom options to app */
CLI11_PARSE(*app, argc, argv);
/* Parsed arguments are in args, e.g., args->index_dir */
We should start to use https://github.com/google/benchmark
Add Catch2 as the new testing framework, so we can already write new tests in it.
Describe the solution you'd like
range-v3 should be added as dependency in cmake.
Motivation
We've talked about using this in a big way, but I think we should start with just adding it and using it from that point forward to improve our code base. Simple examples include:
for (auto [idx, value] : enumerate(std::vector<int>{5, 5, 2, 1})) {}
for (auto [left, right] : zip(...)) {}
Caution
We should exercise caution when using these in performance-sensitive places. I'm not saying it's inefficient, but we should make sure it is before merging such a thing.
Consider substituting Boost thread with TBB
src that, say, compiles query processing stuff once, and exposes an easy-to-use interface from each src file.
Threshold should be in a file, passed with an optional argument, and propagated as an optional vector. The decision whether or not to use a threshold should be made at compile time, so it doesn't slow down other types of queries.
WARC/1.0
WARC-Type: response
WARC-Date: 2012-02-10T21:51:20Z
WARC-TREC-ID: clueweb12-0000tw-00-00013
WARC-Target-URI: http://cheapcosthealthinsurance.com/2012/01/25/what-is-hiv-aids/
WARC-Payload-Digest: sha1:YZUOJNSUMFG3JVUKM6LBHMRMMHWLVNQ4
WARC-IP-Address: 100.42.59.15
WARC-Record-ID: <urn:uuid:74edc71e-a881-4942-81fc-a40db4bf1fb9>
Content-Type: application/http; msgtype=response
Content-Length: 71726
HTTP/1.1 200 OK
Date: Fri, 10 Feb 2012 21:51:22 GMT
Server: Apache/2.2.21 (Unix) mod_ssl/2.2.21 OpenSSL/0.9.8e-fips-rhel5 mod_auth_passthrough/2.1 mod_bwlimited/1.4 FrontPage/5.0.2.2635 mod_jk/1.2.32
X-Powered-By: PHP/5.2.17
X-Pingback: http://cheapcosthealthinsurance.com/xmlrpc.php
Link: <http://cheapcosthealthinsurance.com/?p=711>; rel=shortlink
Connection: close
Content-Type: text/html; charset=UTF-8