simongog / sdsl-lite

Succinct Data Structure Library 2.0

License: Other

CMake 1.64% R 2.00% Makefile 2.52% C++ 90.10% TeX 2.87% C 0.43% Python 0.04% Shell 0.32% Batchfile 0.03% GDB 0.07%

sdsl-lite's People

Contributors

adriangbrandon, amallia, chrisbarber, danielsaad, diegocaro, ekg, farruggia, fermeise, fmontoto, franramirez688, hmusta, jmmackenzie, kdm9, kimundi, koeppl, mbesta, mpetri, mr-c, murray1991, niklasb, olydis, rkonow, rsharris, simongog, smdgjmigop, stefan-it, tb38, thinred, ugermann, vpfautz

sdsl-lite's Issues

the util namespace is very messy and it is not clear to the library user that most of these functions exist

the util namespace/class is very messy (in my opinion).

for example there are many different functions in different places to read something from disk mixed with other functions that are just "helper" functions.

there are also other classes for example testutils::file which also perform I/O.

a consistent namespace, maybe sdsl::io, where all file reading/writing is performed would increase the usability of the library. (maybe make load/serialize of each class private and force the user to use the load_from_disk/store_to_disk functions?)

Reinsert PHI algorithm for LCP construction

The old library implemented two versions of the PHI algorithm.
The first uses 5n bytes, the second 4n bytes.
We decided to integrate only the first one, since it is faster and its memory peak is not much higher than that of the fast suffix array construction.
Planned optimization: keep track of the largest lcp value and use only n \log max_lcp bits instead of n \log n bits for the resulting lcp array.
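
To make the idea concrete, here is a minimal sketch of the basic PHI algorithm on plain std::vector containers. It is not the library's implementation: the names naive_suffix_array, lcp_phi and bits_needed are hypothetical, and the suffix array is built by naive sorting only to keep the example self-contained. bits_needed illustrates the planned optimization, i.e. the entry width that suffices once the largest lcp value is known.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Naive suffix array (sorts the suffixes directly); only meant for small examples.
std::vector<std::size_t> naive_suffix_array(const std::string& text) {
    std::vector<std::size_t> sa(text.size());
    for (std::size_t i = 0; i < sa.size(); ++i) sa[i] = i;
    std::sort(sa.begin(), sa.end(), [&text](std::size_t a, std::size_t b) {
        return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
    });
    return sa;
}

// PHI algorithm: lcp[i] is the length of the longest common prefix of the
// suffixes starting at sa[i-1] and sa[i]; lcp[0] is 0.
std::vector<std::size_t> lcp_phi(const std::string& text,
                                 const std::vector<std::size_t>& sa) {
    const std::size_t n = text.size();
    assert(sa.size() == n);
    // phi[p] = start of the suffix that precedes suffix p in sa (n means "none").
    std::vector<std::size_t> phi(n, n);
    for (std::size_t i = 1; i < n; ++i) phi[sa[i]] = sa[i - 1];
    // Compute the values in text order; the match length drops by at most one
    // from position p to p + 1, so the total work is linear.
    std::vector<std::size_t> plcp(n, 0);
    std::size_t l = 0;
    for (std::size_t p = 0; p < n; ++p) {
        const std::size_t q = phi[p];
        if (q == n) { l = 0; continue; }
        while (p + l < n && q + l < n && text[p + l] == text[q + l]) ++l;
        plcp[p] = l;
        if (l > 0) --l;
    }
    // Permute from text order into suffix array order.
    std::vector<std::size_t> lcp(n, 0);
    for (std::size_t i = 0; i < n; ++i) lcp[i] = plcp[sa[i]];
    return lcp;
}

// Smallest width w >= 1 such that max_value < 2^w.
unsigned bits_needed(std::uint64_t max_value) {
    unsigned w = 1;
    while (w < 64 && (max_value >> w) != 0) ++w;
    return w;
}
```

For the text "banana" the largest lcp value is 3, so two bits per entry would be enough in this sketch.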

Construct Integer Array > 2G elements

I have replaced the fixed-size integer vector by an int_vector in Larsson's algorithm. Everything works fine for inputs <= 2G elements, but it breaks for > 2G elements.
It's possibly an unsigned/signed problem.
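
If the cause is indeed a signed/unsigned problem, the limit is easy to state: a signed 32-bit index can address at most 2^31 elements. The following sketch uses hypothetical helpers (indices_fit and checked_index are not library functions) to test whether an index type is wide enough and to guard a narrowing conversion.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// True if every index in [0, n) can be represented by the integer type T.
template <typename T>
bool indices_fit(std::uint64_t n) {
    if (n == 0) return true;
    return (n - 1) <= static_cast<std::uint64_t>(std::numeric_limits<T>::max());
}

// Narrowing conversion that fails loudly in debug builds instead of wrapping.
template <typename T>
T checked_index(std::uint64_t i) {
    assert(i <= static_cast<std::uint64_t>(std::numeric_limits<T>::max()));
    return static_cast<T>(i);
}
```

With these definitions, indices_fit<std::int32_t> holds for exactly 2^31 elements and fails for one more, while a 64-bit type has no such problem.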

license change to lgpl

releasing a new version is a good time for a license change.

some files that need to be changed:

  • all the headers of the source code
  • the COPYING file
  • the readme file, which should clearly state the library license

Construction process improvements

Working with the library, I feel that I'm constrained by the construction workflow.

My understanding of the construction workflow is:

  • Make an instance of the desired structure, with the default constructor (which should do nothing).
  • Call the construct function, which (at least for WTs) (1) loads the file into an int_vector. (2) Writes that int_vector to disc. (3) Creates a new instance of the desired structure by calling its constructor with an int_vector_file_buffer. (4) Uses swap to swap the new structure with the one passed to the construct function. (5) Deletes the temporary file containing the int_vector structure.

I have multiple concerns with this workflow:

  • Copying the file into an int_vector and then storing this structure on disc seems rather wasteful, both in terms of I/O time and in terms of disc space (especially if one uses succinct structures because one works on very large data sets).
  • The int_vector_file_buffer is not very flexible. Seek functionality and the possibility to set 'limits' could be useful. By limits I mean that one could give a start and an end offset, after which the buffer would appear to only contain the values in between.
  • It is only possible to affect the structure before construction through template parameters, i.e. at compile time and not at run time.

Simple solutions to these concerns are possible, e.g. (1) one could simply not write the int_vector to disc (as it is now, it is completely loaded into memory anyway). (2) The extension to int_vector_file_buffer is the most extensive code work of these suggestions, but should be decoupled enough to only involve the int_vector_file_buffer implementation. (3) Though it affects the library more, the library interface could rely on a "construct" function in every structure instead of a constructor. This would save memory during construction, since we wouldn't need the temporary object, and make it possible to change the parameters of the structure at run time, before the actual construction. This change should be extremely simple, but it affects many structures.

So, what are your thoughts on the suggested changes? I think they would give more flexibility, and at the same time save some resources at construction time.
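
The 'limits' idea from the list above can be sketched as a small view type. This is only an illustration under assumptions: limited_view is a hypothetical name, and an in-memory std::vector stands in for the file-backed buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical view that exposes only the values in [begin, end) of a buffer.
class limited_view {
public:
    limited_view(const std::vector<std::uint64_t>& data, std::size_t begin, std::size_t end)
        : m_data(&data), m_begin(begin), m_end(end) {
        assert(begin <= end && end <= data.size());
    }

    // Number of values visible through the view.
    std::size_t size() const { return m_end - m_begin; }

    // Index 0 of the view corresponds to index begin of the underlying buffer.
    std::uint64_t operator[](std::size_t i) const {
        assert(i < size());
        return (*m_data)[m_begin + i];
    }

private:
    const std::vector<std::uint64_t>* m_data;
    std::size_t m_begin;
    std::size_t m_end;
};
```

A structure constructed from such a view would see only the selected range, without the range being copied first.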

temp_write_read_buffer

This class is only used in the construction of wt_int. However, we can also use int_vector_buffer there. I'll replace it and remove temp_write_read_buffer.hpp, if nobody else needs it.

Approximate matching with wildcards in the indexed string

I would love to see approximate string matching in SDSL-lite. Especially, I am interested in using compressed suffix trees with wildcard characters in the indexed string.

Example:

Indexed string T=AC*CA, where * is a wildcard character matching any symbol
Query Q=CTC

T has a match of Q at position 1, since the wildcard could be replaced by T. In the same way, there is a match for CAC, CGC, etc.

Note that I do not ask for approximate queries, but only approximate indexed strings. However, approximate queries (for instance wildcards in queries) would be nice too :-)

Thanks,
Sebastian
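
The requested semantics can be pinned down with a brute-force reference matcher. This is a sketch for illustration only: it scans the text in O(|T|·|Q|) time and does not use any index, and locate_with_text_wildcards is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns all positions in text at which query matches, where the wildcard
// character in the text matches any query symbol.
std::vector<std::size_t> locate_with_text_wildcards(const std::string& text,
                                                    const std::string& query,
                                                    char wildcard = '*') {
    // Wildcards are only supported in the indexed text, not in the query.
    assert(query.find(wildcard) == std::string::npos);
    std::vector<std::size_t> hits;
    if (query.empty() || query.size() > text.size()) return hits;
    for (std::size_t pos = 0; pos + query.size() <= text.size(); ++pos) {
        bool ok = true;
        for (std::size_t k = 0; k < query.size(); ++k) {
            const char t = text[pos + k];
            if (t != wildcard && t != query[k]) { ok = false; break; }
        }
        if (ok) hits.push_back(pos);
    }
    return hits;
}
```

For T=AC*CA and Q=CTC this reports the single match at position 1 described above.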

Jenkins CI cleanup after test cases

We have to delete all installed library files on the Jenkins server after the build and test run, since this is not done automatically by Jenkins and causes problems, when include files are removed from the library.

YGWYD paradigm

Complex structures like WTs, CSAs, and CSTs can be constructed by calling the construct method. The object created by construct is defined solely by its type and the input; it is not possible to pass values to the constructors of sub data structures. Therefore, parameters like block size, sample densities and so on should be template arguments, so that

You Get What You Declare :)
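
A minimal sketch of the paradigm, with hypothetical names (sampled_values and construct_sketch are not library types): the sample density is a template argument, so a generic construct function that knows only the type still builds exactly the declared variant.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical structure that keeps every t_dens-th value of its input.
template <std::uint32_t t_dens = 32>
class sampled_values {
    static_assert(t_dens > 0, "sample density must be positive");

public:
    static constexpr std::uint32_t sample_dens = t_dens;

    sampled_values() = default;

    explicit sampled_values(const std::vector<std::uint64_t>& v) {
        for (std::size_t i = 0; i < v.size(); i += t_dens) m_samples.push_back(v[i]);
    }

    std::size_t size() const { return m_samples.size(); }

    std::uint64_t operator[](std::size_t i) const {
        assert(i < m_samples.size());
        return m_samples[i];
    }

private:
    std::vector<std::uint64_t> m_samples;
};

// Hypothetical stand-in for a generic construct(): it receives no parameters
// besides the object and the input, so the type alone fixes the density.
template <typename t_index>
void construct_sketch(t_index& idx, const std::vector<std::uint64_t>& input) {
    t_index tmp(input);
    idx = std::move(tmp);
}
```

Declaring sampled_values<4> instead of sampled_values<> is then the only thing a user has to change to get a denser sampling.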

range_search_2d

Signature of range_search_2d: currently two pointers to vectors are passed to range_search_2d. The found index/value pairs are stored in these two vectors if the pointers are not nullptr. The number of found index/value pairs is returned.

I would suggest replacing the two pointers in the parameter list by a bool generate_pairs, which indicates whether an index/value pair vector should be created. The return value is then a pair consisting of the number of found pairs and a vector of index/value pairs. This vector has size zero in the case generate_pairs=false.
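
The proposed signature could look roughly like the following sketch. It is a naive linear scan over a std::vector rather than a wavelet tree query; the name range_search_2d_sketch and the choice of inclusive bounds are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using point_vec = std::vector<std::pair<std::size_t, std::uint64_t>>;

// Reports the entries v[i] with lb <= i <= rb and vlb <= v[i] <= vrb.
// Returns (number of matches, index/value pairs); the pair vector stays
// empty when generate_pairs is false.
std::pair<std::size_t, point_vec> range_search_2d_sketch(
    const std::vector<std::uint64_t>& v, std::size_t lb, std::size_t rb,
    std::uint64_t vlb, std::uint64_t vrb, bool generate_pairs = true) {
    assert(lb <= rb && rb < v.size());
    std::size_t cnt = 0;
    point_vec points;
    for (std::size_t i = lb; i <= rb; ++i) {
        if (vlb <= v[i] && v[i] <= vrb) {
            ++cnt;
            if (generate_pairs) points.emplace_back(i, v[i]);
        }
    }
    return std::make_pair(cnt, std::move(points));
}
```

Callers that only need the count pass generate_pairs=false and read the first member of the result.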

memory management enhancements

we have a quite powerful memory manager in the library which could be enhanced to support quite a few more functions.

  • keep track of the memory usage and support functions like peak_memory()
  • optionally "log" each int_vector<> "create" and "delete" action so we can easily create a "memory usage" graph
  • optionally move all bitmagic constants into memory managed by the memory manager so they reside in hugepage space if used
  • we could use a "real" memory manager like jemalloc to manage all our memory ourselves. we could use functions like mm::new, mm::free and mm::realloc to manage our own memory space, which would make using hugepages easier. this would be quite a bit of work though.
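
The first item of the list, tracking usage and reporting a peak, needs very little state. A sketch with hypothetical names (memory_tracker, on_alloc, on_free are not part of the library), assuming the memory manager calls the two hooks on every allocation and deallocation:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical bookkeeping that a memory manager could update on each
// allocation and deallocation.
class memory_tracker {
public:
    void on_alloc(std::size_t bytes) {
        m_current += bytes;
        m_peak = std::max(m_peak, m_current);
    }

    void on_free(std::size_t bytes) {
        assert(bytes <= m_current);
        m_current -= bytes;
    }

    std::size_t current_memory() const { return m_current; }
    std::size_t peak_memory() const { return m_peak; }

private:
    std::size_t m_current = 0;
    std::size_t m_peak = 0;
};
```

The peak is only ever raised in on_alloc, so it survives later frees.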

improve serialize std::vector sdsl code

some sdsl data structures use std::vectors to store data. currently there are several helper functions that can be used to serialize and load the content stored in these vectors. unfortunately, the current way this is implemented is very slow as each element in the vector is serialised individually. there might be a faster way to do this for POD.
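
One possible faster path for POD element types is to write the whole buffer with a single call. The sketch below assumes trivially copyable elements and the same machine for writing and reading (no endianness handling); serialize_pod_vector, load_pod_vector and round_trip are hypothetical names, not the existing helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <ios>
#include <istream>
#include <ostream>
#include <sstream>
#include <type_traits>
#include <vector>

// Writes the element count followed by the raw element bytes in one block.
// Returns the number of bytes written.
template <typename T>
std::uint64_t serialize_pod_vector(const std::vector<T>& v, std::ostream& out) {
    static_assert(std::is_trivially_copyable<T>::value, "bulk I/O needs a POD-like type");
    static_assert(!std::is_same<T, bool>::value, "std::vector<bool> has no contiguous storage");
    const std::uint64_t n = v.size();
    out.write(reinterpret_cast<const char*>(&n), sizeof(n));
    if (n > 0) {
        out.write(reinterpret_cast<const char*>(v.data()),
                  static_cast<std::streamsize>(n * sizeof(T)));
    }
    return sizeof(n) + n * sizeof(T);
}

// Reads a vector written by serialize_pod_vector; returns false on a short read.
// Note: the stored count is trusted, so corrupt input may request a huge resize.
template <typename T>
bool load_pod_vector(std::vector<T>& v, std::istream& in) {
    static_assert(std::is_trivially_copyable<T>::value, "bulk I/O needs a POD-like type");
    std::uint64_t n = 0;
    if (!in.read(reinterpret_cast<char*>(&n), sizeof(n))) return false;
    v.resize(static_cast<std::size_t>(n));
    if (n > 0) {
        in.read(reinterpret_cast<char*>(v.data()),
                static_cast<std::streamsize>(n * sizeof(T)));
    }
    return static_cast<bool>(in);
}

// Serializes v into an in-memory stream and loads it back.
template <typename T>
std::vector<T> round_trip(const std::vector<T>& v) {
    std::stringstream buf(std::ios::in | std::ios::out | std::ios::binary);
    serialize_pod_vector(v, buf);
    std::vector<T> w;
    const bool ok = load_pod_vector(w, buf);
    assert(ok);
    (void)ok;
    return w;
}
```

Element types that are not trivially copyable would still need the per-element path.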

Default parameter for int_vector_buffer?

int_vector_buffer is used most of the time for buffered reading of int_vectors from disk.
So I would suggest to have std::ios::in as default value for the second ctor parameter.
Agreed? @tb38?

Tests: pass test file names as command line arguments

As it is done in the benchmarks now. This makes the code of the test programs simpler, and make can do the job of reading the instances from config files.
Make will also download the test cases from an online source, if they are not already present.

Return type of lex_count for order-preserving WTs

@tb38 currently the method returns rank(c, i) and the number of symbols smaller/greater are returned by reference. With C++11 we have the possibility to return a three value tuple with no overhead (move constructor!).
This would, IMHO, make the usage of the method easier.

Another point: wt_int should have the same functionality as wt_pc for order-preserving shapes. Anyone keen to do the coding?
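
A sketch of what the tuple-returning variant could look like, written as a naive scan over a std::string instead of a wavelet tree. The name lex_count_sketch and the exact meaning of the three values (rank of c in [0, i), then symbols smaller and greater than c in [i, j)) are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <tuple>

// Returns (rank of c in s[0, i), #symbols < c in s[i, j), #symbols > c in s[i, j)).
std::tuple<std::size_t, std::size_t, std::size_t> lex_count_sketch(
    const std::string& s, std::size_t i, std::size_t j, char c) {
    assert(i <= j && j <= s.size());
    const unsigned char uc = static_cast<unsigned char>(c);
    std::size_t rank = 0, smaller = 0, greater = 0;
    for (std::size_t k = 0; k < i; ++k) {
        if (s[k] == c) ++rank;
    }
    for (std::size_t k = i; k < j; ++k) {
        const unsigned char x = static_cast<unsigned char>(s[k]);
        if (x < uc) ++smaller;
        else if (x > uc) ++greater;
    }
    return std::make_tuple(rank, smaller, greater);
}
```

Callers can unpack the result with std::tie or std::get, so no output parameters are needed.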

Remove select_support_dummy ?

select_support_scan can replace select_support_dummy, since both have the same serialized representation (=nothing is written). The _scan version even supports selection. Therefore I will remove the _dummy version.

openmp for construction

there would be quite a few places where we could speed things up using simple openmp primitives (similar to what libdivsufsort does). for example:

        void construct_init_rank_select() {
            util::init_support(m_tree_rank, &m_tree);
            util::init_support(m_tree_select0, &m_tree);
            util::init_support(m_tree_select1, &m_tree);
        }

of course this would have to be disabled when you do "construction time experiments"

magic numbers

besides bitmagic, much of the code contains magic numbers, which makes it hard for "newcomers" to get into. for example:

            for (size_type i=0; i < 511; ++i)
                m_nodes[i] = wt.m_nodes[i];
            for (size_type i=0; i<256; ++i)
                m_c_to_leaf[i] = wt.m_c_to_leaf[i];
            for (size_type i=0; i<256; ++i) {
                m_path[i] = wt.m_path[i];
            }

or

size_type sb    = (m_arg_cnt+4095)>>12;

maybe we should spend some time on making the code easier to read in this respect.
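
One way to do this is to give the numbers names. The sketch below is based on my reading of the snippets above, which is an assumption: 256 is taken to be the byte alphabet size, 511 the maximal node count of a binary tree with 256 leaves, and 4095/12 a round-up division by a superblock of 4096 arguments. All identifiers are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

namespace wt_constants {
// Size of the byte alphabet.
constexpr std::size_t sigma = 256;
// A binary tree with sigma leaves in which every inner node has two
// children has 2 * sigma - 1 nodes.
constexpr std::size_t max_nodes = 2 * sigma - 1;
// Superblocks cover 2^12 = 4096 arguments.
constexpr unsigned superblock_shift = 12;
constexpr std::size_t superblock_size = std::size_t(1) << superblock_shift;
}  // namespace wt_constants

// Number of superblocks needed for arg_cnt arguments (division rounded up).
constexpr std::size_t superblock_count(std::size_t arg_cnt) {
    return (arg_cnt + wt_constants::superblock_size - 1) >> wt_constants::superblock_shift;
}

// Superblock that contains argument arg_idx.
inline std::size_t superblock_of(std::size_t arg_idx, std::size_t arg_cnt) {
    assert(arg_idx < arg_cnt);
    (void)arg_cnt;
    return arg_idx >> wt_constants::superblock_shift;
}

static_assert(wt_constants::max_nodes == 511, "matches the literal used in the copy loop");
static_assert(wt_constants::superblock_size == 4096, "matches the literal 4095 + 1");
```

The loops would then read i < wt_constants::max_nodes and i < wt_constants::sigma, and the shift expression becomes superblock_count(m_arg_cnt).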

Renaming of bit_magic

I'm in favor of renaming good old bit_magic to simply bits. Also often used methods like b1Cnt (one bits count=popcount), i1BP (i-th 1-bits position=select), r1BP (rightmost 1-bit position), and l1BP (leftmost 1-bit position) should get better names:
How about that:

  • bit_magic::b1Cnt -> bits::cnt
  • bit_magic::i1BP -> bits::sel
  • bit_magic::r1BP -> bits::lo
  • bit_magic::l1BP -> bits::hi

?
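
To spell out what the four proposed names would compute, here is a portable loop-based sketch. It lives in a hypothetical namespace bits_sketch and makes no claim about how the library implements these operations (which would typically use lookup tables or hardware instructions); the 1-based argument of sel and the precondition x != 0 for lo and hi are assumptions.

```cpp
#include <cassert>
#include <cstdint>

namespace bits_sketch {

// Number of set bits in x (popcount).
inline unsigned cnt(std::uint64_t x) {
    unsigned c = 0;
    while (x != 0) {
        x &= x - 1;  // clear the rightmost set bit
        ++c;
    }
    return c;
}

// Position of the rightmost set bit of x; x must be nonzero.
inline unsigned lo(std::uint64_t x) {
    assert(x != 0);
    unsigned p = 0;
    while ((x & 1) == 0) {
        x >>= 1;
        ++p;
    }
    return p;
}

// Position of the leftmost set bit of x; x must be nonzero.
inline unsigned hi(std::uint64_t x) {
    assert(x != 0);
    unsigned p = 0;
    while ((x >>= 1) != 0) ++p;
    return p;
}

// Position of the i-th set bit of x, counted from the right, 1 <= i <= cnt(x).
inline unsigned sel(std::uint64_t x, unsigned i) {
    assert(i >= 1 && i <= cnt(x));
    while (--i != 0) x &= x - 1;  // drop the i-1 rightmost set bits
    return lo(x);
}

}  // namespace bits_sketch
```

For x = 0xB (binary 1011), cnt gives 3, lo gives 0, hi gives 3, and sel(x, 2) gives 1.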

lcp construction

I was trying to construct an lcp array and could not figure out how to do it. there seems to be no construct() method which constructs the lcp array from a file? There are also no examples showing how to construct an lcp array. I had to construct a cst and use cst.lcp[] to access the lcp array?

the construct_lcp classes all require cache configs which I don't know how to use? do I just point to the sa / text on disk using the constants KEY_BWT? do these have to be stored in sdsl format or in normal uint8_t or uint64_t? Can I somehow specify the num_bytes type of each input in the cache_config file? For example KEY_SA num_bytes=8 is an uncompressed sa on disk using 64bit integers.

another comment: the different values of num_bytes should be defines similar to KEY_SA

Signature of inverse permutations

Until now, inverse permutations like the inverse suffix array (inverse of the suffix array) and LF (inverse of psi) are accessed through operator(). This has led to user confusion several times. We will change the access to isa[] and lf[] to avoid this in the future.

Remove algorithms_for_string_matching.hpp

This source file contains basically the methods algorithm::count, algorithm::locate, and algorithm::extract.
I suggest to

  • place them into suffix_array_helper.hpp and suffix_tree_helper.hpp, since they expect such a data structure as input and the user does not have to include another header besides suffix_[array|tree]s.hpp.
  • rewrite them such that automatically the fastest algorithm is chosen. E.g. using backward decoding for extract on a csa_wt CSA but forward decoding on a csa_sada one. This can be done by using the same technique as in the construct methods.
  • get rid of the algorithm namespace.

Reintegration of LCP construction algorithms

Include the following LCP construction algorithms again:

  • semi-external PHI (sampling=64) is very space efficient; takes about the text size.
  • goPHI; space dependent on reducible values in the text
  • go; most times faster than goPHI, but quadratic worst case complexity

Method names in CSTs

The following method names are quite long

  • leaves_in_the_subtree(v)
  • leftmost_leaf_in_the_subtree(v)
  • rightmost_leaf_in_the_subtree(v)

and should be replaced by

  • size(v)
  • leftmost_leaf(v)
  • rightmost_leaf(v)

since it is clear from the context that node v represents a subtree.

unused code makes the library less readable/usable

there are many "defunct" classes and functions that don't do anything or should not be used in practice.

For example, a library user does not really know that i1BP is the right function to call instead of the many others (k1BP/j1BP).

same goes for classes/files like wt_fixed_block.hpp which are incomplete

code cleanup - remove NULL

searching for NULL in the code still results in a lot of occurrences. as we are now c++11 compatible these should all be changed to nullptr

Memory log

It would be nice to have a logger for the allocated space, which outputs the current allocated space each time an int_vector is created, resized or deleted.
I will add this functionality to the memory_management class mm.
You can then assign an output stream to mm and the information will be written into the stream.
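
A sketch of how such a logger might behave, with hypothetical names (memory_log, set_stream, on_resize and demo_log are not the interface of mm): creation is modelled as a resize from 0 bytes, deletion as a resize to 0 bytes, and each event writes the currently allocated space to the assigned stream.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical logger: tracks the allocated space and writes it to a stream
// on every create, resize or delete event.
class memory_log {
public:
    // Assigns the output stream; nullptr disables logging.
    void set_stream(std::ostream* out) { m_out = out; }

    // Create: old_bytes == 0. Delete: new_bytes == 0.
    void on_resize(std::size_t old_bytes, std::size_t new_bytes) {
        assert(old_bytes <= m_current);
        m_current = m_current - old_bytes + new_bytes;
        if (m_out != nullptr) *m_out << m_current << '\n';
    }

    std::size_t current() const { return m_current; }

private:
    std::ostream* m_out = nullptr;
    std::size_t m_current = 0;
};

// Replays a create / resize / delete sequence and returns the log text.
inline std::string demo_log() {
    std::ostringstream oss;
    memory_log log;
    log.set_stream(&oss);
    log.on_resize(0, 100);    // create a 100-byte vector
    log.on_resize(100, 250);  // resize it to 250 bytes
    log.on_resize(250, 0);    // delete it
    return oss.str();
}
```

Adding a timestamp to each line would be enough to plot memory usage over time.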

add hash of class type to avoid loading incorrect objects

to avoid segfaulting when loading an index created with a different class we could serialise the hash of the demangle() / util::class_to_hash() output. during loading we compare the stored hash with the hash of the object we are loading and abort if they do not match.

I also suggest we write the size first, and if it is small (say less than 1024 entries) we do not write the hash, as it might contribute noticeably to the size of small data structures. for larger data structures the 64bit hash value is negligible
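
The check itself can be sketched as follows. This is not the library's mechanism: the hash is taken from typeid(T).name() via std::hash, which is an assumption made for the example and is not stable across compilers, the toy types stand in for real index classes, and the size-based omission of the hash is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <type_traits>
#include <typeinfo>

// Hash of the (implementation-defined) type name; a stand-in for hashing
// the demangled class name.
template <typename T>
std::uint64_t type_hash() {
    return static_cast<std::uint64_t>(std::hash<std::string>{}(typeid(T).name()));
}

// Writes the type hash followed by the raw object bytes.
template <typename T>
void store_with_type_hash(const T& obj, std::ostream& out) {
    static_assert(std::is_trivially_copyable<T>::value, "sketch handles POD-like types only");
    const std::uint64_t h = type_hash<T>();
    out.write(reinterpret_cast<const char*>(&h), sizeof(h));
    out.write(reinterpret_cast<const char*>(&obj), sizeof(T));
}

// Refuses to load if the stored hash does not match the requested type.
template <typename T>
bool load_with_type_hash(T& obj, std::istream& in) {
    static_assert(std::is_trivially_copyable<T>::value, "sketch handles POD-like types only");
    std::uint64_t h = 0;
    if (!in.read(reinterpret_cast<char*>(&h), sizeof(h))) return false;
    if (h != type_hash<T>()) return false;
    return static_cast<bool>(in.read(reinterpret_cast<char*>(&obj), sizeof(T)));
}

// Two toy types with identical layout but different names.
struct toy_a { std::uint32_t x; };
struct toy_b { std::uint32_t x; };

// Stores a toy_a and tries to load it back as t_target.
template <typename t_target>
bool load_toy_a_as() {
    std::stringstream buf;
    store_with_type_hash(toy_a{42}, buf);
    t_target t{};
    return load_with_type_hash(t, buf);
}
```

Loading the stored toy_a as a toy_b is rejected even though both types have the same size, which is exactly the situation that would otherwise go unnoticed.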

Move operators

c++11 introduces the concept of moving objects. the library currently implements this concept only in the int_vector<> class I think. as we advertise to be c++11 compatible we should make sure that the main objects (csa/cst/wt) can be moved efficiently.

Construction in RAM

Most temporary results during the construction process are stored and reread from disk. This reduces the memory footprint of the construction. However for small inputs, reading and writing to disk produces an overhead compared to in-memory processing.

I would like to solve this by implementing a simple RAM-file system. It would be necessary to encapsulate the file IO operations to be able to use the RAM-file system.
RAM-files should start with a prefix like "RAM://"...
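
A minimal sketch of the idea, assuming the "RAM://" prefix mentioned above and a simple name-to-bytes map; ram_fs_sketch and is_ram_file are hypothetical names, and the wrapping of the actual file I/O operations is not shown.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

const std::string ram_prefix = "RAM://";

// True if the file name refers to the in-memory file system.
inline bool is_ram_file(const std::string& name) {
    return name.compare(0, ram_prefix.size(), ram_prefix) == 0;
}

// Hypothetical in-memory file store: maps file names to their contents.
class ram_fs_sketch {
public:
    void store(const std::string& name, std::vector<char> content) {
        assert(is_ram_file(name));
        m_files[name] = std::move(content);
    }

    bool exists(const std::string& name) const {
        return m_files.find(name) != m_files.end();
    }

    const std::vector<char>& content(const std::string& name) const {
        const auto it = m_files.find(name);
        assert(it != m_files.end());
        return it->second;
    }

    // Returns true if a file was removed.
    bool remove(const std::string& name) { return m_files.erase(name) > 0; }

private:
    std::map<std::string, std::vector<char>> m_files;
};
```

An encapsulated open/read/write layer would call is_ram_file first and fall back to the regular file system for all other names.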

Repository cleanup

The repository contains a few files that are not needed anymore or are not up to date:

  • CHANGES (out of date)
  • CMakemodules/Finddivsufsort* ?

anything else?

the content in the algorithms.hpp file looks like it should be moved somewhere else (the namespace is also called algorithm, not algorithms?)
