GithubHelp home page GithubHelp logo

jermp / lphash Goto Github PK

View Code? Open in Web Editor NEW
41.0 5.0 2.0 105.66 MB

Fast and compact locality-preserving minimal perfect hashing for k-mer sets.

License: MIT License

CMake 1.39% C++ 85.61% Shell 13.01%
bioinformatics hashing k-mers minimal-perfect-hash locality-preserving

lphash's Introduction

CodeQL DOI

LPHash

LPHash is a minimal perfect hash function (or MPHF) designed for k-mer sets where overlaps of k-1 bases between consecutive k-mers are exploited to:

  1. Reduce the space of the data structure. For example, with the right combination of parameters, LPHash is able to achive < 0.9 bits/k-mer in practice whereas the known theoretical lower-bound of a classic MPHF is 1.44 bits/k-mer and practical constructions take 2-3 bits/k-mer.

  2. Boost the evaluation speed of queries performed for consecutive k-mers.

  3. Preserve the locality of k-mers: consecutive k-mers are likely to receive consecutive hash codes.

The data structure and its construction/query algorithms are described in the paper Locality-Preserving Minimal Perfect Hashing of k-mers [1].

Please, cite this paper if you use LPHash.

Datasets are available on Zenodo here: https://zenodo.org/record/7239205.

Table of contents

Compiling the Code

The code is tested on Linux with gcc and on Mac OS with clang (Intel and ARM processors supported, like Apple M1). To build the code, CMake is required.

Clone the repository with

git clone --recursive https://github.com/jermp/lphash.git

If you have cloned the repository without --recursive, be sure you pull the dependencies with the following command before compiling:

git submodule update --init --recursive

K-mers can have maximal lengths of both 32 or 64, depending on their definition found in the file include/compile_constants.tpd (either uint64_t or __uint128_t). Although larger 128-bit k-mers also work for k <= 32, we recommend to use the right type whenever possible.

To compile the code for a release environment (see file CMakeLists.txt for the used compilation flags), it is sufficient to do the following:

mkdir build
cd build
cmake ..
make -j

For a testing environment, use the following instead:

mkdir debug_build
cd debug_build
cmake .. -D CMAKE_BUILD_TYPE=Debug -D LPHASH_USE_SANITIZERS=On
make -j

Dependencies

The repository has minimal dependencies: it only uses the PTHash library (for minimal perfect hashing [2,3]), and zlib to read gzip-compressed streams.

To automatically pull the PTHash dependency, just clone the repo with --recursive as explained in Compiling the Code.

If you do not have zlib installed, you can do

sudo apt-get install zlib1g

if you are on Linux/Ubuntu, or

brew install zlib

if you have a Mac (and Homebrew installed: https://brew.sh/).

Tools

There is one main executable called lphash after the compilation, which can be used to run a tool. Run ./lphash as follows to see a list of available tools.

LP-Hash: (L)ocality (P)reserving Minimal Perfect (Hash)ing of k-mers

Usage: ./lphash <tool> ...

Available tools:
  build-p      build a partitioned LP-MPHF
  build-u      build an unpartitioned LP-MPHF
  query-p      query a partitioned LP-MPHF
  query-u      query an unpartitioned LP-MPHF

Build a Function

The driver program called lphash can be used to build and query Locality-Preserving MPHFs. The subcommand build-p uses the partitioning strategy, while build-u does not (check out the paper for more details). In the following, we will focus on build-p alone since it is the most memory efficient between the two.

From within the directory where the code was compiled (see the section Compiling the Code), run the command:

./lphash build-p --help

to show the usage of the subcommand (reported below for convenience).

Usage: build-p [-h,--help] [-i input_filename] [-k k] [-m m] [-s seed] [-t threads] [-o output_filename] [-d tmp_dirname] [-c c] [--max-memory max-memory] [--check] [--verbose]

 [-i input_filename]
	REQUIRED: Must be a FASTA file (.fa/fasta extension) compressed with gzip (.gz) or not:
	- without duplicate nor invalid kmers
	- one DNA sequence per line.
	For example, it could be the de Bruijn graph topology output by BCALM.

 [-k k]
	REQUIRED: K-mer length (must be <= 63).

 [-m m]
	REQUIRED: Minimizer length (must be < k and <= 32).

 [-s seed]
	Seed for minimizer computation (default is 42).

 [-t threads]
	Number of threads for pthash (default is 1).

 [-o output_filename]
	Output file name where the data structure will be serialized (no files generated by default).

 [-d tmp_dirname]
	Temporary directory used for construction in external memory (default is directory '.').

 [-c c]
	A (floating point) constant that trades construction speed for space effectiveness of minimal perfect hashing. 
	A reasonable value lies between 3.0 and 10.0 (default is 3.00).

 [--max-memory max-memory]
	Maximum internal memory [GB] for building (8GB by default). Use external memory if needed.

 [--check]
	Check correctness after construction (disabled by default).

 [--verbose]
	Verbose output during construction (disabled by default).

 [-h,--help]
	Print this help text and silently exits.

Output format

./lphash build-p prints useful statistics about the constructed MPHF on stdout as single comma-separated values whose fields are:

  1. The input filename used to build the MPHF,
  2. k-mer length k,
  3. minimizer length m,
  4. the fraction of non-unique minimizer,
  5. the theoretical epsilon (number of super-k-mers / number of k-mers),
  6. the true epsilon,
  7. alpha (fragmentation factor = number of contigs / number of k-mers),
  8. the space of the MPHF in bits/k-mer.

Example

For the examples, we are going to use some collections of stitched unitigs from the directory data/unitigs_stitched. These collections were built for k = 31, 47 and 63, so MPHFs should be built with the right k to ensure correctness.

In the section Input Files, we explain how such collections of stitched unitigs can be obtained from raw FASTA files.

This example

./lphash build-p -i ../data/unitigs_stitched/se.ust.k31.fa.gz -k 31 -m 15 --check -o se_k31_m15.lph --verbose

builds a dictionary for the k-mers read from the file data/unitigs_stitched/se.ust.k31.fa.gz, with k = 31 and m = 15. It also check the correctness of the MPHF (--check option, for both minimality and collisions), and serializes it on disk to the file se_k31_m15.lph.

Another example is

./lphash build-p -i ../data/unitigs_stitched/se.ust.k63.fa.gz -k 63 -m 17 --check -o se_k63_m17.lph --verbose

which uses k = 63 and m = 17.

Similarly to build-p, the query-p command (or query-u if the MPHF was built by build-u) outputs single CSV lines on stdout. Queries can be performed by recomputing each hash value from scratch (random query) of by re-using the fact that successive k-mers overlap by k-1 bases (streaming). The fields are: input file,lphash file,total k-mers,barebone streaming time,barebone random time,full streaming time,full random time

  1. Input (fasta/fastq) file,
  2. the queried MPHF file,
  3. the total number of k-mers in the input,
  4. average streaming query time in ns/k-mer (without saving the hash values),
  5. average random query time in ns/k-mer (without saving the hash values),
  6. average streaming query time in ns/k-mer (by saving the hash values inside an in-memory vector),
  7. average random query time in ns/k-mer (by saving the hash values inside an in-memory vector).

Example

./lphash query-p -i se_k31_m15.lph -q ../data/queries/salmonella_enterica.fasta.gz

Input Files

LPHash is meant to index k-mers from collections that do not contain duplicates. These collections can be obtained, for example, by extracting the maximal unitigs of a de Bruijn graph.

To do so, we can use the tool BCALM2 [4]. This tool builds a compacted de Bruijn graph and outputs its maximal unitigs. From the output of BCALM2, we can then stitch (i.e., glue) some unitigs to reduce the number of nucleotides. The stitiching process is carried out using the UST tool.

Below we provide a complete example (assuming both BCALM2 and UST are installed correctly) that downloads the Human (GRCh38) Chromosome 13 and extracts the maximal stitiched unitigs for k = 31.

mkdir DNA_datasets
wget http://ftp.ensembl.org/pub/current_fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.13.fa.gz -O DNA_datasets/Homo_sapiens.GRCh38.dna.chromosome.13.fa.gz
~/bcalm/build/bcalm -in ~/DNA_datasets/Homo_sapiens.GRCh38.dna.chromosome.13.fa.gz -kmer-size 31 -abundance-min 1 -nb-cores 8
~/UST/ust -k 31 -i ~/Homo_sapiens.GRCh38.dna.chromosome.13.fa.unitigs.fa
gzip Homo_sapiens.GRCh38.dna.chromosome.13.fa.unitigs.fa.ust.fa
rm ~/Homo_sapiens.GRCh38.dna.chromosome.13.fa.unitigs.fa

Zenodo

We uploaded a copy of all the datasets used in the experiments of the paper, processed as we described above, on Zenodo: https://zenodo.org/record/7239205.

Authors

References

lphash's People

Contributors

jermp avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

Forkers

clayne tbenavi1

lphash's Issues

Building Time

Hello,

The building time for me seems to be much higher than the results in Table 5 of the paper. What value of k corresponds to the results in Table 5? I downloaded both the k=31 and k=63 human files from Zenodo and ran lphash to build the database.

Specifically, I ran:

lphash build-p -i human.k63.unitigs.fa.ust.fa.gz -c 5 -k 63 -m 28 --check -o human_c5_k63_m28.lph --verbose -t 4

How long is this expected to take? Thank you.

A kmer and its reverse complement hash to different values

Hello,
Based on my testing, Lphash doesn't convert kmers into their canonical representation. I would expect a kmer and its reverse complement to hash to the same value.

For example, BCALM 2 converts all k-mers into their canonical representation with respect to reverse-complements.

I believe that the kmers will need to be converted to their canonical form both during the building and the querying steps. Thanks for any insight.

Also, what is the expected output when you query lphash for a kmer that is not in the database? Does lphash return a particular value? I'm not 100% sure, but I think right now it will still output a hash number (that will collide with another kmer actually in the database). Thanks for any insights.

Different hashes error for S. cerevisiae

Hello,

I ran lphash on the S. cerevisiae reference genome and received the following error:

[Error] different hashes, maybe there were some Ns in the input (not supported as of now)

Specifically, I first ran bcalm with -kmer-size 31 and -abundance-min 1, then I ran ust with -k 31, then I ran lphash build-p with -k 31 and -m 15. As far as I can tell, there are no Ns in the genome.

Is this an error I should be concerned about? Thanks for your assistance.

I do notice that the genome has upper and lower case letters. Does lphash convert everything to uppercase for both building and querying?

Multithreading for query-p command

Hello,

Would it be possible to implement multiple threads for the query-p command. In particular, a very simple implementation would just be to have each thread analyze a different subset of the sequences in the input fasta file. Thanks for any assistance.

Code style / refactoring

  1. Use #pragma once directive instead of if-define guards.
  2. Move all code not strictly related to LPHash to external/ (or another directory), such as kseq.h and BooPHF.hpp.
  3. Delete unused scripts and code (such as prettyprint.hpp)
  4. Properly format all the code using the rules in .clang-format. For example: do not use if-else without parentheses. Only use one-liner ifs without parentheses.
  5. Maintain in src/ only the executables for building and performing queries. We can also have a unique driver program named lphash that takes as input an argument build or query specifying the sub-tool to use. See an example here: https://github.com/jermp/sshash-lite/blob/main/src/sshash-lite.cpp. This also implies having a unique tool to build both data structures, partitioned and un-partitioned (currently called -alt).
  6. Move tests into a separate folder called tests/.
  7. Use a build_configuration class to build the data structures with default parameters, as used here and here for example. A build_configuration object is then passed as input to the constructor of the mphf.

Associating metadata (satellite values) with kmers

Hello and thank you for this great tool! I was wondering if it would be possible to associate metadata with the kmers. The paper mentions abundance counts, reference identifiers, or contig identifiers as possible satellite values. How would one go about querying your lphash database for this information? Thanks for any insights.

(I understand that the database size and database construction time would not be optimized for this use case, but I am most interested in keeping a fast lookup time.)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.