remram44 / cdchunking-rs

Content-Defined Chunking for Rust

Home Page: https://remram44.github.io/cdchunking-rs/

rust chunk chunking rolling-hash-functions

cdchunking-rs's Introduction


Content-Defined Chunking

This crate provides a way to divide a stream of bytes into chunks, using methods that choose the split points from the content itself. This means that adding or removing a few bytes in the stream only changes the chunks directly affected. This is different from splitting every n bytes, because in that case every chunk after the modification changes unless the number of bytes added or removed is a multiple of n.

Content-defined chunking is useful for data de-duplication. It is used by many backup programs and by the rsync data synchronization tool.

This crate exposes both easy-to-use methods that implement the standard Iterator trait to iterate over the chunks in an input stream, and efficient zero-allocation methods that reuse an internal buffer.

Using this crate

First, add a dependency on this crate by adding the following to your Cargo.toml:

[dependencies]
cdchunking = "1.0"

And add this to your lib.rs:

extern crate cdchunking;

Then create a Chunker object using a specific method, for example the ZPAQ algorithm:

use cdchunking::{Chunker, ZPAQ};

let chunker = Chunker::new(ZPAQ::new(13)); // 13 bits = 8 KiB block average

There are multiple ways to get chunks out of some input data.

From an in-memory buffer: iterate on slices

If your whole input data is in memory at once, you can use the slices() method. It returns an iterator over slices of this buffer, allowing you to handle the chunks with no additional allocation.

for slice in chunker.slices(data) {
    println!("{:?}", slice);
}

From a file object: read chunks into memory

If you are reading from a file, or any object that implements Read, you can use Chunker to read whole chunks directly. Use the whole_chunks() method to get an iterator over chunks, each read into a new Vec<u8>.

for chunk in chunker.whole_chunks(reader) {
    let chunk = chunk.expect("Error reading from file");
    println!("{:?}", chunk);
}

You can also read all the chunks from the file and collect them in a Vec (of Vecs) using the all_chunks() method. It takes care of IO errors for you, returning an error if reading any of the chunks failed.

let chunks: Vec<Vec<u8>> = chunker.all_chunks(reader)
    .expect("Error reading from file");
for chunk in chunks {
    println!("{:?}", chunk);
}

From a file object: streaming chunks with zero allocation

If you are reading from one file to write to another, you might deem the allocation of intermediate Vec objects unnecessary. If you want, Chunker can provide chunk data straight from its internal read buffer, without allocating anything else. In that case, note that a chunk might be split across multiple read operations. This method works fine with any chunk size.

Use the stream() method to do this. Note that because an internal buffer is reused, we cannot implement the Iterator trait, so you will have to use a while loop:

use cdchunking::ChunkInput;

let mut chunk_iterator = chunker.stream(reader);
while let Some(chunk) = chunk_iterator.read() {
    let chunk = chunk.unwrap();
    match chunk {
        ChunkInput::Data(d) => {
            print!("{:?}, ", d);
        }
        ChunkInput::End => println!(" end of chunk"),
    }
}
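
As a usage follow-up to the example above, here is a sketch of the file-to-file case this section describes, using only the API shown so far; the file names are made up for the illustration:

extern crate cdchunking;

use std::fs::File;
use std::io::Write;

use cdchunking::{Chunker, ChunkInput, ZPAQ};

fn main() {
    // Hypothetical input/output paths, for illustration only.
    let reader = File::open("input.bin").expect("cannot open input");
    let mut writer = File::create("output.bin").expect("cannot create output");

    let chunker = Chunker::new(ZPAQ::new(13)); // 8 KiB average chunks
    let mut chunk_lengths = Vec::new();
    let mut current_len = 0;

    let mut chunk_iterator = chunker.stream(reader);
    while let Some(chunk) = chunk_iterator.read() {
        match chunk.expect("Error reading from file") {
            ChunkInput::Data(d) => {
                // Copy the chunk data straight out; no intermediate Vec is allocated.
                writer.write_all(d).expect("Error writing to file");
                current_len += d.len();
            }
            ChunkInput::End => {
                // A chunk boundary was reached; record the chunk's length.
                chunk_lengths.push(current_len);
                current_len = 0;
            }
        }
    }
    println!("wrote {} chunks", chunk_lengths.len());
}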

cdchunking-rs's People

Contributors

remram44

cdchunking-rs's Issues

Read whole chunks but reusing a growable buffer

Similar to stream(), but grow the buffer to accommodate whole chunks.
Similar to whole_chunks(), but reuse the Vec, so only provide one at a time.

This is done in dhstore on top of stream(). It is kind of easy to implement, but could be provided by Chunker. Alternatively it could be possible to change the size of stream()'s internal buffer.
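
For context, this is roughly what that looks like today on top of stream(): a minimal sketch (not provided by the crate) that rebuilds whole chunks into a single reused, growable Vec<u8>, assuming a chunker and reader set up as in the README examples:

use cdchunking::ChunkInput;

// Sketch only: accumulate each chunk into one growable, reused buffer.
let mut buf: Vec<u8> = Vec::new();
let mut chunk_iterator = chunker.stream(reader);
while let Some(chunk) = chunk_iterator.read() {
    match chunk.expect("Error reading from file") {
        ChunkInput::Data(d) => buf.extend_from_slice(d),
        ChunkInput::End => {
            // buf now holds exactly one whole chunk; hand it out, then
            // clear it, keeping the allocation for the next chunk.
            println!("chunk of {} bytes", buf.len());
            buf.clear();
        }
    }
}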

Eventual assertion failure when feeding in large files

This code reproduces the bug quickly on my machine.

extern crate cdchunking;

use cdchunking::{Chunker, ZPAQ};

fn main() {
    let chunker = Chunker::new(ZPAQ::new(8));
    // It fails quicker on a smaller chunk size, but fails on any chunk size

    let random = std::fs::File::open("/dev/urandom").unwrap();

    for chunk in chunker.whole_chunks(random) {}
}

thread 'main' panicked at 'assertion failed: self.status != EmitStatus::AtSplit', /home/jack/.cargo/registry/src/github.com-1ecc6299db9ec823/cdchunking-0.2.0/src/lib.rs:322:13
stack backtrace:
   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
             at libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
   1: std::sys_common::backtrace::_print
             at libstd/sys_common/backtrace.rs:71
   2: std::panicking::default_hook::{{closure}}
             at libstd/sys_common/backtrace.rs:59
             at libstd/panicking.rs:380
   3: std::panicking::default_hook
             at libstd/panicking.rs:396
   4: std::panicking::rust_panic_with_hook
             at libstd/panicking.rs:576
   5: std::panicking::begin_panic
             at /checkout/src/libstd/panicking.rs:537
   6: <cdchunking::ChunkStream<R, I>>::read
             at /home/jack/src/test/<panic macros>:3
   7: <cdchunking::WholeChunks<R, I> as core::iter::iterator::Iterator>::next
             at /home/jack/.cargo/registry/src/github.com-1ecc6299db9ec823/cdchunking-0.2.0/src/lib.rs:272
   8: testx::main
             at src/main.rs:11
   9: std::rt::lang_start::{{closure}}
             at /checkout/src/libstd/rt.rs:74
  10: std::panicking::try::do_call
             at libstd/rt.rs:59
             at libstd/panicking.rs:479
  11: __rust_maybe_catch_panic
             at libpanic_unwind/lib.rs:102
  12: std::rt::lang_start_internal
             at libstd/panicking.rs:458
             at libstd/panic.rs:358
             at libstd/rt.rs:58
  13: std::rt::lang_start
             at /checkout/src/libstd/rt.rs:74
  14: main
  15: __libc_start_main
  16: _start

zpaq algorithm does not appear to match impl in zpaq upstream

Hi, I was looking at the chunking provided by Zpaq in this crate (as part of comparing it to a zpaq implementation I've written), and noticed that it doesn't appear to match up with the upstream zpaq code.

In particular, it appears that the hash update equation doesn't match up (though I might be missing some transform you've done).

cdchunking code:

const HM: Wrapping<u32> = Wrapping(123_456_791);
...
pub fn update(&mut self, byte: u8) -> bool {
    if byte == self.o1[self.c1 as usize] {
        self.h = self.h * HM + Wrapping(byte as u32 + 1);
    } else {
        self.h = self.h * HM * Wrapping(2) + Wrapping(byte as u32 + 1);
    }
    self.o1[self.c1 as usize] = byte;
    self.c1 = byte;

    self.h.0 < (1 << self.nbits)
}

zpaq code

            if (c==o1[c1]) h=(h+c+1)*314159265u, ++hits;
            else h=(h+c+1)*271828182u;
            o1[c1]=c;
            c1=c;

In particular, zpaq adds h, c, and 1 prior to multiplying (and uses different constants), but cdchunking adds c and 1 after h is multiplied.
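
For comparison, a hypothetical Rust rendering of the upstream rule quoted above might look like the following; this is a sketch, not code from either project. It keeps the o1/c1 bookkeeping and the nbits boundary test from cdchunking but uses zpaq's add-then-multiply order and its constants:

pub fn update_upstream_style(&mut self, byte: u8) -> bool {
    let c = Wrapping(byte as u32);
    if byte == self.o1[self.c1 as usize] {
        // (upstream also increments a hit counter here)
        self.h = (self.h + c + Wrapping(1)) * Wrapping(314_159_265);
    } else {
        self.h = (self.h + c + Wrapping(1)) * Wrapping(271_828_182);
    }
    self.o1[self.c1 as usize] = byte;
    self.c1 = byte;

    self.h.0 < (1 << self.nbits)
}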

I've done some slight modification to zpaq itself so that it prints chunk lengths, and I'm not seeing a correspondence between the lengths produced by zpaq and cdchunking.

diff --git a/zpaq.cpp b/zpaq.cpp
index f468a7b..84e1f5f 100644
--- a/zpaq.cpp
+++ b/zpaq.cpp
@@ -83,6 +83,7 @@ Possible options:
 #endif
 #ifdef unix
 #define PTHREAD 1
+#include <inttypes.h>
 #include <sys/param.h>
 #include <sys/types.h>
 #include <sys/stat.h>
@@ -2416,6 +2417,8 @@ int Jidac::add() {
         assert(sz<=MAX_FRAGMENT);
         total_done+=sz;
 
+       printf("## %" PRIu64 "\n", sz);
+
         // Look for matching fragment
         assert(uint64_t(sz)==sha1.usize());
         memcpy(sha1result, sha1.result(), 20);
  • Is this difference in algorithm an oversight?
  • Or is it purposefully different?
  • If it is purposefully different, is it following another already published source?
  • If it is purposefully different, could the divergence be documented?

CDC Algorithms with some lookahead

Hello! I found your library while looking for a framework to implement multiple chunking algorithms for comparison. Your design of ChunkerImpl is pretty much exactly what I came up with while experimenting, and I was quite happy to see you've built an ecosystem around that :) I already implemented a few algorithms with ease, so that's nice.

I just came across QuickCDC, which I also wanted to implement and evaluate.
However, that's quite difficult with the current design:
The algorithm gradually fills a lookup table with (first n bytes of chunk, last m bytes of chunk, size) for generated chunks, and uses this to jump around a bit. Specifically, if the first n bytes of a previously seen chunk match the current input data, it jumps size bytes ahead and checks the last m bytes of that chunk:

  1. If they match, the chunk is emitted. This is doable.
  2. If they don't match, it must go back to the beginning of the chunk and "properly" chunk it, using some standard-ish CDC algorithm.
    I can implement this by keeping the current chunk in memory and jumping back, but I cannot express the result of this via ChunkerImpl.

Specifically, since I need to "go back", I might need to express a chunk boundary that lies before the current block of data being processed. Since the return value of find_boundary is (from my understanding) an index within data, I cannot express this "past" chunk boundary.

One solution would be to change the return type to Option<isize> to express negative values, or change the semantics of the usize return type to be global within the stream of data, not local within data.
Either way, both solutions are somewhat annoying: I don't see how to keep the current (very nice!) interface of zero-allocation reference-to-internal-buffer for chunking.
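
To make the first option concrete, a purely hypothetical signed-offset variant of the interface (not cdchunking's actual API) might look like this:

// Hypothetical sketch only, not the crate's real ChunkerImpl trait:
// a signed return value lets an implementation report a boundary that
// lies before the block of data it was just given.
pub trait LookaheadChunkerImpl {
    /// Find a chunk boundary. A non-negative value is an index into
    /// `data`, as today; a negative value means the boundary lies that
    /// many bytes *before* the start of `data`.
    fn find_boundary(&mut self, data: &[u8]) -> Option<isize>;

    /// Reset internal state between chunks.
    fn reset(&mut self) {}
}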

Anyway, what's your opinion on all this?
