sstadick / rust-lapper

Rust implementation of a fast, easy, interval tree library nim-lapper

Home Page: https://docs.rs/rust-lapper

License: MIT License

Rust 100.00%
rust interval-tree bioinformatics algorithms interval tree intervaltree

rust-lapper's Introduction

rust-lapper

This is a Rust port of Brent Pedersen's nim-lapper. It has a few notable differences, mostly that the find and seek methods both return iterators, so all adaptor methods may be used normally.

This crate works well for most interval data that does not include very long intervals that engulf a majority of other intervals. Even on such data it is still fairly comparable to other methods. If you absolutely need time guarantees in the worst case, see COITrees and IITree.

On more typical datasets, however, this crate is 4-10x faster than other interval overlap methods.

It should also be noted that the count method is agnostic to data type, and should be about as fast as it is possible to be on any dataset. It is an implementation of the BITS algorithm.
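As a rough illustration of why a BITS-style count can be fast, the sketch below counts overlaps with two binary searches over independently sorted start and stop arrays, without touching the intervals themselves. `BitsCounter` is a hypothetical stand-in written for this example, not the crate's actual type, and it assumes half-open `[start, stop)` intervals with `start < stop`.

```rust
/// Hypothetical stand-in for a BITS-style counter; not the crate's own type.
struct BitsCounter {
    starts: Vec<u64>, // every interval start, sorted ascending
    stops: Vec<u64>,  // every interval stop, sorted ascending (independently of starts)
}

impl BitsCounter {
    /// Build from half-open `[start, stop)` intervals given as tuples.
    fn new(intervals: &[(u64, u64)]) -> Self {
        let mut starts: Vec<u64> = intervals.iter().map(|iv| iv.0).collect();
        let mut stops: Vec<u64> = intervals.iter().map(|iv| iv.1).collect();
        starts.sort_unstable();
        stops.sort_unstable();
        BitsCounter { starts, stops }
    }

    /// Count the intervals that overlap the half-open query `[start, stop)`.
    fn count(&self, start: u64, stop: u64) -> usize {
        // Intervals entirely left of the query end at or before its start.
        let left = self.stops.partition_point(|&s| s <= start);
        // Intervals entirely right of the query begin at or after its stop.
        let right = self.starts.len() - self.starts.partition_point(|&s| s < stop);
        // Everything that is neither left nor right must overlap.
        self.starts.len() - left - right
    }
}
```

Each query costs two binary searches, so the running time does not depend on how many intervals actually overlap, which is consistent with the flat `count` timings in the benchmarks.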

Serde Support

rust-lapper supports serialization with serde for Lapper and Interval objects:

[dependencies]
rust-lapper = { version = "*", features = ["with_serde"] }

See examples/serde.rs for a brief example.

Benchmarks

Benchmarking interval tree-ish data structures is hard. Please see the interval_bakeoff project for details on how the benchmarks were run. It's not fully baked yet, though, and is finicky to run.

Command to run:

./target/release/interval_bakeoff fake -a -l RustLapper -l RustBio -l NestedInterval -n50000 -u100000

# This equates to the following params:
# num_intervals	50000
# universe_size	100000
# min_interval_size	500
# max_interval_size	80000
# add_large_span	true (universe spanning)

Set A / B Creation Times

crate/method      A time    B time
rust_lapper       15.625ms  31.25ms
nested_intervals  15.625ms  15.625ms
bio               15.625ms  31.25ms

100% hit rate (A vs A)

crate/method                        mean time   intersection
rust_lapper/find                    4.78125s    1469068763
rust_lapper/count                   15.625ms    1469068763
nested_intervals/query_overlapping  157.4375s   1469068763
bio/find                            33.296875s  1469068763

Sub 100% hit rate (A vs B)

crate/method                        mean time   intersection
rust_lapper/find                    531.25ms    176488436
rust_lapper/count                   15.625ms    176488436
nested_intervals/query_overlapping  11.109375s  196090092
bio/find                            4.3125s     176488436

Crates compared: nested_intervals, rust-bio. Note that rust-bio has a new interval tree structure which should be faster than what is shown here.

Example

use rust_lapper::{Interval, Lapper};

type Iv = Interval<usize, u32>;
fn main() {
    // create some fake data
    let data: Vec<Iv> = vec![
        Iv {
            start: 70,
            stop: 120,
            val: 0,
        }, // max_len = 50
        Iv {
            start: 10,
            stop: 15,
            val: 0,
        },
        Iv {
            start: 10,
            stop: 15,
            val: 0,
        }, // exact overlap
        Iv {
            start: 12,
            stop: 15,
            val: 0,
        }, // inner overlap
        Iv {
            start: 14,
            stop: 16,
            val: 0,
        }, // overlap end
        Iv {
            start: 40,
            stop: 45,
            val: 0,
        },
        Iv {
            start: 50,
            stop: 55,
            val: 0,
        },
        Iv {
            start: 60,
            stop: 65,
            val: 0,
        },
        Iv {
            start: 68,
            stop: 71,
            val: 0,
        }, // overlap start
        Iv {
            start: 70,
            stop: 75,
            val: 0,
        },
    ];

    // make lapper structure
    let mut lapper = Lapper::new(data);

    // Iterator based find to extract all intervals that overlap 11..15
    // If your queries are coming in start sorted order, use the seek method to retain a cursor for
    // a big speedup.
    assert_eq!(
        lapper.find(11, 15).collect::<Vec<&Iv>>(),
        vec![
            &Iv {
                start: 10,
                stop: 15,
                val: 0
            },
            &Iv {
                start: 10,
                stop: 15,
                val: 0
            }, // exact overlap
            &Iv {
                start: 12,
                stop: 15,
                val: 0
            }, // inner overlap
            &Iv {
                start: 14,
                stop: 16,
                val: 0
            }, // overlap end
        ]
    );

    // Merge overlapping regions within the lapper to simplify and speed up queries that only depend
    // on 'any
    lapper.merge_overlaps();
    assert_eq!(
        lapper.find(11, 15).collect::<Vec<&Iv>>(),
        vec![&Iv {
            start: 10,
            stop: 16,
            val: 0
        },]
    );

    // Get the number of positions covered by the lapper tree:
    assert_eq!(lapper.cov(), 73);

    // Get the union and intersect of two different lapper trees
    let data = vec![
        Iv {
            start: 5,
            stop: 15,
            val: 0,
        },
        Iv {
            start: 48,
            stop: 80,
            val: 0,
        },
    ];
    let (union, intersect) = lapper.union_and_intersect(&Lapper::new(data));
    assert_eq!(union, 88);
    assert_eq!(intersect, 27);

    // Get the depth at each position covered by the lapper
    for interval in lapper.depth().filter(|x| x.val > 2) {
        println!(
            "Depth at {} - {}: {}",
            interval.start, interval.stop, interval.val
        );
    }

}
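The coverage, union, and intersect figures asserted in the example (73, 88, and 27) can be checked by hand with a small standalone sketch. The functions below are hypothetical helpers over plain `(start, stop)` tuples, not the crate's implementation; they assume half-open intervals and derive the intersect by inclusion-exclusion.

```rust
/// Merge half-open `[start, stop)` intervals into a sorted, non-overlapping list.
fn merge(mut ivs: Vec<(u64, u64)>) -> Vec<(u64, u64)> {
    ivs.sort_unstable();
    let mut merged: Vec<(u64, u64)> = Vec::new();
    for (start, stop) in ivs {
        if let Some(last) = merged.last_mut() {
            if start < last.1 {
                // Overlaps the current run: extend it instead of starting a new one.
                last.1 = last.1.max(stop);
                continue;
            }
        }
        merged.push((start, stop));
    }
    merged
}

/// Number of positions covered by an already merged interval list.
fn coverage(merged: &[(u64, u64)]) -> u64 {
    merged.iter().map(|&(start, stop)| stop - start).sum()
}

/// Positions covered by either set, and positions covered by both sets.
fn union_and_intersect(a: &[(u64, u64)], b: &[(u64, u64)]) -> (u64, u64) {
    let a = merge(a.to_vec());
    let b = merge(b.to_vec());
    let mut both = a.clone();
    both.extend_from_slice(&b);
    let union = coverage(&merge(both));
    // Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|.
    let intersect = coverage(&a) + coverage(&b) - union;
    (union, intersect)
}
```

For the example data, merging yields 10..16, 40..45, 50..55, 60..65, and 68..120, which cover 6 + 5 + 5 + 5 + 52 = 73 positions.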

Release Notes

  • 0.4.0: Addition of the BITS count algorithm.
  • 0.4.2: Bugfix to update starts/stops vectors when overlaps are merged
  • 0.4.3: Remove leftover print statement
  • 0.5.0: Make Interval start/stop generic
  • 1.0.0: Add serde support via the with_serde feature flag
  • 1.1.0: Added insert functionality thanks to @zaporter

rust-lapper's People

Contributors

dependabot[bot], maxcountryman, sstadick, thomasetter, zaporter

rust-lapper's Issues

An API to find the closest interval to a value

It would be great to extend Lapper's modus operandi to also perform a "fuzzy" search: instead of requiring an exact match inside one of the intervals, allow searching for the closest interval when there is no exact match.

The API could be similar to find:

pub fn closest(&self, start: I, stop: I) -> IterFind<'_, I, T>
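To make the request concrete, here is a minimal sketch of what "closest" could mean. Everything in it is an assumption made for illustration: the `Iv` struct, the `gap` helper, and a `closest` that returns a single `Option` rather than the iterator proposed above. It uses a linear scan for clarity; a sorted structure could probably do better with a binary search.

```rust
/// Hypothetical interval type for this sketch; not the crate's `Interval`.
#[derive(Debug, Clone, PartialEq)]
struct Iv {
    start: u64,
    stop: u64,
    val: u32,
}

/// Distance between `iv` and the half-open query `[start, stop)`.
/// Returns 0 when the two overlap or touch.
fn gap(iv: &Iv, start: u64, stop: u64) -> u64 {
    if iv.stop <= start {
        start - iv.stop // interval lies to the left of the query
    } else if iv.start >= stop {
        iv.start - stop // interval lies to the right of the query
    } else {
        0
    }
}

/// Return the interval nearest to the query, or `None` for an empty slice.
/// On ties the earliest interval in the slice wins.
fn closest<'a>(ivs: &'a [Iv], start: u64, stop: u64) -> Option<&'a Iv> {
    ivs.iter().min_by_key(|iv| gap(iv, start, stop))
}
```

With intervals 10..15, 40..45, and 60..65, a query of 20..22 would pick 10..15 (gap 5) and a query of 30..35 would pick 40..45 (gap 5).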

Update / remove benchmarks in README

Basically, this lib is the fastest possible way of doing things for datatypes that don't have massively spanning intervals that encompass all the other intervals. In that case this is slower than many other implementations.

See coitrees for the fastest general-case interval list, though it is much less featureful and harder to add features to.

[Feature]: Use parallel gzip compression as implemented in pigz

Use parallel gzip compression as implemented in pigz.

https://zlib.net/pigz/pigz.pdf

Pigz compresses using threads to make use of multiple processors and cores. The input is broken up into 128 KB chunks with each compressed in parallel. The individual check value for each chunk is also calculated in parallel. The compressed data is written in order to the output, and a combined check value is calculated from the individual check values.

The compressed data format generated is in the gzip, zlib, or single-entry zip format using the deflate compression method. The compression produces partial raw deflate streams which are concatenated by a single write thread and wrapped with the appropriate header and trailer, where the trailer contains the combined check value.

Each partial raw deflate stream is terminated by an empty stored block (using the Z_SYNC_FLUSH option of zlib), in order to end that partial bit stream at a byte boundary. That allows the partial streams to be concatenated simply as sequences of bytes. This adds a very small four to five byte overhead to the output for each input chunk.

The default input block size is 128K, but can be changed with the -b option. The number of compress threads is set by default to the number of online processors, which can be changed using the -p option. Specifying -p 1 avoids the use of threads entirely.

The input blocks, while compressed independently, have the last 32K of the previous block loaded as a preset dictionary to preserve the compression effectiveness of deflating in a single thread. This can be turned off using the -i or --independent option, so that the blocks can be decompressed independently for partial error recovery or for random access. This also inserts an extra empty block to flag independent blocks by prefacing each with the nine-byte sequence (in hex): 00 00 FF FF 00 00 00 FF FF.

Decompression can't be parallelized, at least not without specially prepared deflate streams for that purpose. As a result, pigz uses a single thread (the main thread) for decompression, but will create three other threads for reading, writing, and check calculation, which can speed up decompression under some circumstances.

Specify Range Inclusivity and Exclusivity

I have been using Lapper quite a bit and have realized that there is one area of the API that seems a bit confusing.

Consider

let lap: Lapper<usize,usize> = Lapper::new(vec![
    Interval{start:0, stop:10, val:0},
    Interval{start:20, stop:30, val:1},
]);
assert_eq!(lap.find(10, 20).count(),0);

The documentation for find() states: "Find all intervals that overlap start .. stop".
If I interpret that as [start,stop) then it is consistent with the definition of Interval:

/// Represent a range from [start, stop)
/// Inclusive start, exclusive of stop
pub struct Interval<I, T>

However, right now I feel like I am walking on eggshells every time I make a query because I have to ensure I am thinking about these inclusivity bounds. Something like the following would take a lot of load off the programmer.

let lap: Lapper<usize,usize> = Lapper::new(vec![
    Interval{start:Inclusive::from(0), stop:Inclusive::from(10), val:0},
    Interval{start:20, stop:30, val:1},
]);
assert_eq!(lap.find(Inclusive::from(10), 20).count(),1); // should just include the first one
assert_eq!(lap.find(10, Inclusive::from(20)).count(),0); // should include neither as second is not inclusive at the start
assert_eq!(lap.find(10, 21).count(),1); // should include second

/\ This is just a quick demo. I am not attached to anything like this and think it is a bit ugly. (I also want to make sure that anyone who upgrades their version is not surprised by new behavior)

You're a more experienced Rust programmer: do you know a more idiomatic way to do this? Are you at all interested in merging something that would allow for explicit inclusivity/exclusivity?

I will happily write all of the code and do all of the leg work to get this merged if we can agree on an API design.
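One possibly more idiomatic direction for the question above is the standard library's `std::ops::RangeBounds` trait, which lets a caller write `10..20` or `10..=20` and have the bounds normalized internally. The sketch below is only an illustration of that idea: `to_half_open` and `overlaps` are hypothetical helpers, not part of rust-lapper, and the stored interval is assumed to stay half-open `[start, stop)`.

```rust
use std::ops::{Bound, RangeBounds};

/// Normalize any range syntax (`a..b`, `a..=b`, `..`, ...) to half-open `[start, stop)`.
/// An inclusive end of `usize::MAX` saturates, so that one position cannot be represented.
fn to_half_open<R: RangeBounds<usize>>(range: &R) -> (usize, usize) {
    let start = match range.start_bound() {
        Bound::Included(&s) => s,
        Bound::Excluded(&s) => s.saturating_add(1),
        Bound::Unbounded => 0,
    };
    let stop = match range.end_bound() {
        Bound::Included(&e) => e.saturating_add(1),
        Bound::Excluded(&e) => e,
        Bound::Unbounded => usize::MAX,
    };
    (start, stop)
}

/// Does the half-open interval `[iv_start, iv_stop)` overlap the query range?
fn overlaps<R: RangeBounds<usize>>(iv_start: usize, iv_stop: usize, query: R) -> bool {
    let (q_start, q_stop) = to_half_open(&query);
    iv_start < q_stop && iv_stop > q_start
}
```

Under this scheme `overlaps(20, 30, 10..20)` is false while `overlaps(20, 30, 10..=20)` is true, so the inclusivity is visible at the call site and existing half-open callers would keep their current behavior.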

Using lapper in helix

Hi,

How can I share a Lapper instance?
I am trying to pass a Lapper instance to another structure, but I get a compilation error.

Code:

#[macro_use]
extern crate helix;
extern crate rust_lapper;

use rust_lapper::{Interval, Lapper};

type IntervalType = (u32, u32);
type Iv = Interval<u32>;

ruby! {
    class LapperSearchOne {
        struct {
            lapper: Lapper<u32>
        }

        def initialize(helix, intervals: Vec<IntervalType>) {
            let data = intervals.into_iter().map(|interval|
                Iv{start: interval.0, stop: interval.1, val: 0}
            ).collect();
            LapperSearchOne { helix, lapper: Lapper::new(data) }
        }
    }
}

Error:

error[E0277]: the trait bound `rust_lapper::Lapper<u32>: std::clone::Clone` is not satisfied
  --> src/lib.rs:14:13
   |
14 |             lapper: Lapper<u32>
   |             ^^^^^^^^^^^^^^^^^^^ the trait `std::clone::Clone` is not implemented for `rust_lapper::Lapper<u32>`
   |
   = note: required by `std::clone::Clone::clone`

Is this due to the missing Clone in the derive macro on line https://github.com/sstadick/rust-lapper/blob/master/src/lib.rs#L92 ?

How to fix it? Thank you in advance.

Serialize and deserialize Interval and Lapper

Hi, thanks for the crate. I made some structs with Interval and Lapper in them. I'd like to serialize and deserialize the structs, but am stuck in serializing and deserializing Interval and Lapper. Any advice on this would be appreciated.

merge_overlaps loses values

merge_overlaps seems to take the value from the first Interval that overlaps and the other value is lost. I'm not sure the best way to handle this? Maybe a user-provided merge function that lets you create a new value when merging two ranges?
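A user-provided merge function along the lines suggested here might look like the sketch below. It is a standalone illustration under stated assumptions: `Iv` and `merge_overlaps_with` are hypothetical names, the intervals are half-open, and none of this reflects how the crate's own `merge_overlaps` is implemented.

```rust
/// Hypothetical interval type for this sketch; not the crate's `Interval`.
#[derive(Debug, Clone, PartialEq)]
struct Iv<T> {
    start: u64,
    stop: u64,
    val: T,
}

/// Merge overlapping intervals, folding their values together with `combine`
/// instead of keeping only the first value.
fn merge_overlaps_with<T, F>(mut ivs: Vec<Iv<T>>, mut combine: F) -> Vec<Iv<T>>
where
    F: FnMut(T, T) -> T,
{
    // Stable sort, so intervals with equal bounds are combined in input order.
    ivs.sort_by_key(|iv| (iv.start, iv.stop));
    let mut merged: Vec<Iv<T>> = Vec::new();
    for iv in ivs {
        match merged.pop() {
            None => merged.push(iv),
            Some(last) => {
                if iv.start < last.stop {
                    // Overlap: extend the run and combine the two values.
                    merged.push(Iv {
                        start: last.start,
                        stop: last.stop.max(iv.stop),
                        val: combine(last.val, iv.val),
                    });
                } else {
                    // No overlap: keep both intervals as they are.
                    merged.push(last);
                    merged.push(iv);
                }
            }
        }
    }
    merged
}
```

Passing `|a, b| a + b` sums the values of merged intervals, while `|a, _| a` would reproduce the keep-the-first behavior described in this issue.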

Add overlap merge function and allow for incoming vector to use it

use std::collections::VecDeque;

use rust_lapper::{Interval, Lapper};

// Written against the older single-parameter Interval<T> API (before 0.5.0).
type Iv = Interval<u32>;

fn merge_overlaps(ivs: Vec<Iv>) -> Vec<Iv> {
    let mut stack: VecDeque<&mut Iv> = VecDeque::new();
    // Lapper::new sorts the intervals, so they come back out in start order.
    let mut ivs: Vec<Iv> = Lapper::new(ivs).into_iter().collect();
    let mut ivs = ivs.iter_mut();
    if let Some(first) = ivs.next() {
        stack.push_back(first);
        for interval in ivs {
            let top = stack.pop_back().unwrap();
            if top.stop < interval.start {
                // No overlap: keep both intervals.
                stack.push_back(top);
                stack.push_back(interval);
            } else if top.stop < interval.stop {
                // Partial overlap: extend the interval on top of the stack.
                top.stop = interval.stop;
                stack.push_back(top);
            } else {
                // `interval` ends within `top`: keep `top` unchanged.
                stack.push_back(top);
            }
        }
        stack.into_iter().map(|x| Iv { start: x.start, stop: x.stop, val: x.val }).collect()
    } else {
        vec![]
    }
}
