
richox / orz


a high-performance, general-purpose data compressor written in the crab-lang

License: MIT License

Languages: Rust 98.23%, Makefile 0.47%, C 1.29%
Topics: compression, data, crab-lang

orz's Introduction

Orz

orz -- a general-purpose data compressor written in the crab-lang (Rust).


orz is an optimized ROLZ (reduced-offset Lempel-Ziv) general-purpose data compressor. input data is encoded as ROLZ matches (reduced offsets and match lengths), 2-byte words, and single bytes. all encoded symbols are then processed with a symbol-ranking (aka move-to-front) transformer and a static Huffman coder.
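to illustrate the symbol-ranking stage, here is a minimal move-to-front sketch; this is only an illustration of the idea, not orz's actual implementation (the real transformer and the ROLZ/Huffman stages are more involved):

struct Mtf {
    table: Vec<u8>, // current ranking; recently seen symbols sit near the front
}

impl Mtf {
    fn new() -> Self {
        Mtf { table: (0..=255).collect() }
    }

    // encode one symbol as its current rank, then move it to the front, so
    // frequently repeated symbols keep producing small ranks that a static
    // Huffman coder can compress well
    fn encode(&mut self, sym: u8) -> u8 {
        let rank = self.table.iter().position(|&s| s == sym).unwrap();
        let sym = self.table.remove(rank);
        self.table.insert(0, sym);
        rank as u8
    }
}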

thanks to the ROLZ algorithm, orz compresses several times faster than many other LZ-based compressors that reach a similar compression ratio, while decompression speed remains very acceptable.

orz is completely implemented in the crab-lang. clone the repo and run cargo build --release to get an executable orz binary.

installation

you can install orz with cargo:

cargo install orz --git https://github.com/richox/orz --tag v1.6.2

usage

for compression:

orz encode <source-file-input> <compressed-file-output>

for decompression:

orz decode <compressed-file-input> <source-file-output>

for more details, see orz --help

benchmarks

benchmark on the 100 MB Large Text Compression Benchmark file (enwik8, see http://mattmahoney.net/dc/text.html):

(for the latest enwik8 benchmark results, see GitHub Actions)

name         compressed size   encode time   decode time
xz -6        26,665,156        69.815s       1.309s
orz -l2      26,893,684        8.245s        1.414s
zstd -19     26,942,199        62.931s       0.239s
orz -l1      27,220,056        6.714s        1.393s
orz -l0      27,896,572        5.209s        1.405s
bzip2 -9     29,008,758        7.417s        3.538s
zstd -15     29,544,237        29.860s       0.196s
brotli -9    29,685,672        36.147s       0.285s
brotli -8    30,326,580        17.989s       0.271s
zstd -10     30,697,144        4.205s        0.192s
brotli -7    31,057,759        11.730s       0.267s
lzfse        36,157,828        1.762s        0.179s
gzip -6      36,548,933        4.461s        0.357s

orz's People

Contributors

artoria2e5, dependabot[bot], marcusklaas, neutron3529, richox


orz's Issues

Provide a C API+ABI

The world out there still speaks C by and large. To let more people use the library, Orz should get a C API exported, so that people can use it from C++, Objective-C, Nim, Python, Node.js and everything else.

The "A little Rust with your C" chapter explains how to make public functions C-compatible and how to generate headers using cbindgen.
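a minimal sketch of what an exported C entry point could look like, assuming the encode signature shown in the "Panics on debug" issue below (orz::encode(&mut reader, &mut writer, &LZCfg)); the exported names orz_compress and OrzResult are hypothetical, and a matching header would be generated with cbindgen as that chapter describes:

use orz::encode;
use orz::lz::LZCfg;
use std::slice;

#[repr(C)]
pub struct OrzResult {
    pub ok: i32,             // 1 on success, 0 on failure
    pub compressed_len: usize,
}

/// # Safety
/// `src` must be valid for `src_len` bytes and `dst` must be valid for `dst_cap` bytes.
#[no_mangle]
pub unsafe extern "C" fn orz_compress(
    src: *const u8,
    src_len: usize,
    dst: *mut u8,
    dst_cap: usize,
) -> OrzResult {
    let mut input = slice::from_raw_parts(src, src_len);
    let mut output = Vec::new();
    // cfg values copied from the reproduction code in the "Panics on debug" issue below
    let cfg = LZCfg { match_depth: 48, lazy_match_depth1: 32, lazy_match_depth2: 16 };
    match encode(&mut input, &mut output, &cfg) {
        Ok(_) if output.len() <= dst_cap => {
            std::ptr::copy_nonoverlapping(output.as_ptr(), dst, output.len());
            OrzResult { ok: 1, compressed_len: output.len() }
        }
        _ => OrzResult { ok: 0, compressed_len: 0 },
    }
}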

Default compression level

Hello,
the default compression level should be 2 instead of 3.
Level 3 throws an error because, I think, it was removed.

I think it is on line 20 of main.rs:

/// Set compression level (0..3)
#[structopt(long = "level", short = "l", default_value = "3")]
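a minimal sketch of the suggested change (default level 2, range 0..2); the surrounding field name level is hypothetical here:

/// Set compression level (0..2)
#[structopt(long = "level", short = "l", default_value = "2")]
level: u8,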

Produces incorrect output if input is a pipe

Steps to reproduce:

admin@ip-172-31-23-30:~/beat-orz/orz$ head -c 1000000 /dev/urandom > /tmp/a
admin@ip-172-31-23-30:~/beat-orz/orz$ cat /tmp/a | /home/admin/beat-orz/orz/target/release/orz encode > /tmp/or
[INFO] encode: 65536 bytes => 66663 bytes, 2.537MB/s
[INFO] statistics:
[INFO]   size:  65536 bytes => 66681 bytes
[INFO]   ratio: 101.75%
[INFO]   speed: 2.482 MB/s
[INFO]   time:  0.026 sec
admin@ip-172-31-23-30:~/beat-orz/orz$ /home/admin/beat-orz/orz/target/release/orz decode < /tmp/or > /tmp/trip
[INFO] decode: 65536 bytes <= 66663 bytes, 6.351MB/s
[INFO] statistics:
[INFO]   size:  65536 bytes => 66681 bytes
[INFO]   ratio: 101.75%
[INFO]   speed: 6.172 MB/s
[INFO]   time:  0.011 sec
admin@ip-172-31-23-30:~/beat-orz/orz$ cmp /tmp/trip /tmp/a
cmp: EOF on /tmp/trip after byte 65536, in line 284

OS: Debian Linux bookworm, rustc 1.73.0-nightly, orz 3380556

Possible reason: the return value of libc::read (or something similar) probably isn't being checked, so short reads from the pipe get dropped.
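a minimal sketch of the kind of read loop that copes with short reads from a pipe (pipes frequently return fewer bytes than requested); this is only an illustration, not the crate's actual I/O code:

use std::io::{self, Read};

// fill `buf` as far as possible, looping over short reads; returns the number
// of bytes read, which is less than buf.len() only at end of input
fn read_full(reader: &mut impl Read, buf: &mut [u8]) -> io::Result<usize> {
    let mut total = 0;
    while total < buf.len() {
        match reader.read(&mut buf[total..]) {
            Ok(0) => break, // EOF
            Ok(n) => total += n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(total)
}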

Performance?

I was super excited to see this! I'm currently looking for a fast compression alternative to zstd for compressing PostgreSQL WAL archives.

At least for this use case, I wasn't able to reproduce the benchmarks you've provided.

(orz v1.6.2 installed using cargo install as described in the README; also tested with cargo build --release from the current HEAD):

$ zstd 00000003000025EF0000007C
00000003000025EF0000007C : 50.55%   (  16.0 MiB =>   8.09 MiB, 00000003000025EF0000007C.zst)
'zstd 00000003000025EF0000007C' time: 0.064s, cpu: 104%

orz encode -l0 00000003000025EF0000007C 00000003000025EF0000007C.orz
[INFO] encode: 16777216 bytes => 8111757 bytes, 25.301MB/s
[INFO] statistics:
[INFO]   size:  16777216 bytes => 8111839 bytes
[INFO]   ratio: 48.35%
[INFO]   time:  0.669 sec

That is roughly a factor of 10 slower than zstd :(

Platform: M1 Apple Silicon macOS (native), x86_64 Linux (musl cross-compiled)

thread 'main' has overflowed its stack on Windows

When I use orz on a Windows machine I get the error:

thread 'main' has overflowed its stack

and it creates an empty zip file.

Checked on a Debian machine, it worked perfectly. The problem exists only on Windows.

Panics on debug

I tried this code (using the master branch)...

use orz::encode;
use orz::lz::LZCfg;

fn main() {
    let mut src = "Hola a todos!".as_bytes();
    let mut out: Vec<u8> = vec![];
    let cfg = LZCfg {
        match_depth: 48,
        lazy_match_depth1: 32,
        lazy_match_depth2: 16,
    };
    let result = encode(&mut src, &mut out, &cfg);
    match result {
        Ok(stat) => {
            println!(
                "source_size: {} -- target_size: {}",
                stat.source_size, stat.target_size
            );
        }
        Err(e) => eprintln!("Error: {:?}", e),
    };
}

It only works if I run it with the --release flag.


thread 'main' has overflowed its stack

I tried to test something but didn't even get to that point, as the built executable crashes with this message:
thread 'main' has overflowed its stack

gdb says this:

$  gdb --args orz__debug_w32 encode README.md README.md.orz
GNU gdb (GDB) 7.9.1
(...)
Reading symbols from orz__debug_w32...done.
(gdb) r
Starting program: d:\progs\dev\src\orz\orz\orz__debug_w32.exe encode README.md README.md.orz
[New Thread 14228.0x11dc]
[New Thread 14228.0x2980]
[New Thread 14228.0x1cc4]
[New Thread 14228.0x450]

Program received signal SIGSEGV, Segmentation fault.
_alloca () at ../../../../../src/gcc-8.1.0/libgcc/config/i386/cygwin.S:88
88      ../../../../../src/gcc-8.1.0/libgcc/config/i386/cygwin.S: No such file or directory.
(gdb) bt
#0  _alloca () at ../../../../../src/gcc-8.1.0/libgcc/config/i386/cygwin.S:88
#1  0x0057ca13 in orz::encode::h64d6467265acc7bf (
    source=<error reading variable: Cannot access memory at address 0x6a0fc30>,
    target=<error reading variable: Cannot access memory at address 0x6a0fc38>, cfg=0x1a0fae4) at src/lib.rs:44
#2  0x004036d6 in orz::main::hc5aba79d15bc2c2c () at src/main.rs:94
#3  0x00407f0b in core::ops::function::FnOnce::call_once::hfde464d49ace8ae2 ()
    at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3\library\core\src\ops/function.rs:248
#4  0x00402062 in std::sys_common::backtrace::__rust_begin_short_backtrace::h3b09b2cc1997b89a (
    f=0x402910 <orz::main::hc5aba79d15bc2c2c>)
    at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3\library\std\src\sys_common/backtrace.rs:122
#5  0x00408a93 in std::rt::lang_start::_$u7b$$u7b$closure$u7d$$u7d$::he2d87c0b87bf469b ()
    at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3\library\std\src/rt.rs:145
#6  0x0065c340 in call_once<(), (dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> () at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3\library\core\src\ops/function.rs:280
#7  do_call<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> () at library\std\src/panicking.rs:492
#8  try<i32, &(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> () at library\std\src/panicking.rs:456
#9  catch_unwind<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> () at library\std\src/panic.rs:137
#10 {closure#2} () at library\std\src/rt.rs:128
#11 do_call<std::rt::lang_start_internal::{closure_env#2}, isize> () at library\std\src/panicking.rs:492
#12 try<isize, std::rt::lang_start_internal::{closure_env#2}> () at library\std\src/panicking.rs:456
#13 catch_unwind<std::rt::lang_start_internal::{closure_env#2}, isize> () at library\std\src/panic.rs:137
#14 std::rt::lang_start_internal::h71a9cc7a00235f34 () at library\std\src/rt.rs:128
#15 0x00408a70 in std::rt::lang_start::h9847c1da96d8463b (main=0x402910 <orz::main::hc5aba79d15bc2c2c>, argc=4,
    argv=0x22e2df8) at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3\library\std\src/rt.rs:144
#16 0x004053c3 in main ()
(gdb) l
83      in ../../../../../src/gcc-8.1.0/libgcc/config/i386/cygwin.S
(gdb)  r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: d:\progs\dev\src\orz\orz\orz__debug_w32.exe encode README.md README.md.orz
[New Thread 22192.0x4bc8]
[New Thread 22192.0x24ac]
[New Thread 22192.0x3da4]
[New Thread 22192.0xf1c]

Program received signal SIGSEGV, Segmentation fault.
_alloca () at ../../../../../src/gcc-8.1.0/libgcc/config/i386/cygwin.S:88
88      in ../../../../../src/gcc-8.1.0/libgcc/config/i386/cygwin.S
(gdb) q

I used a brand-new rustc 1.62.1, host i686-pc-windows-gnu / x86_64-pc-windows-gnu, from here. I tried both with the same result.

New hash function

Coming from c656c07#r37659833

We should probably add a comment here that hash_dword is always taken modulo LZ_MF_BUCKET_ITEM_HASH_SIZE (5219); otherwise it would not make much sense to do anything here (usize always fits a u32 on 32- and 64-bit platforms).

Since log 5219 / log 2 ≈ 12.35, the largest hash we would ever need is a 16-bit one. A Pearson hash does not look too bad in this case:

const PEAR: [u8; 256] = /* RFC 3074 table here */;

#[inline]
fn hash_pearson(val: u32) -> u8 {
    let mut h: u8 = PEAR[(val >> 24) as usize];
    h = PEAR[(h ^ (val >> 16) as u8) as usize];
    h = PEAR[(h ^ (val >> 8) as u8) as usize];
    h = PEAR[(h ^ val as u8) as usize];
    h
}

/// Hash the u32 at buf[pos] to a usize (always taken modulo LZ_MF_BUCKET_ITEM_HASH_SIZE).
unsafe fn hash_dword(buf: &[u8], pos: usize) -> usize {
    let val = buf.read::<u32>(pos).to_be();
    ((hash_pearson(val) as usize) << 8) | hash_pearson(val ^ 0x0100_0000) as usize
}

(djb2 looks cool too, if you like the multiplication stuff.)
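for comparison, a minimal djb2-style variant (the constants are the classic djb2 ones, the final mask to 16 bits matches the discussion above, and this is only a sketch, not a concrete proposal for orz):

#[inline]
fn hash_djb2(val: u32) -> usize {
    let mut h: u32 = 5381;
    for byte in val.to_be_bytes() {
        h = h.wrapping_mul(33) ^ u32::from(byte);
    }
    (h & 0xffff) as usize
}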

build failed

error: linker cc not found
|
= note: No such file or directory (os error 2)

error: aborting due to previous error

error: could not compile libc

To learn more, run the command again with --verbose.
warning: build failed, waiting for other jobs to finish...
error: linker cc not found
|
= note: No such file or directory (os error 2)

error: aborting due to previous error

error: failed to compile orz v1.6.1 (https://github.com/richox/orz#28811d98), intermediate artifacts can be found at /tmp/cargo-install9LTsSH

Caused by:
build failed

Install instruction fails

The readme tells me to use

cargo install --git https://github.com/richox/orz --tag v1.6.1

to install it, but that fails with the error

error: multiple packages with binaries found: benchmark-tool, orz

Should be

cargo install orz --git https://github.com/richox/orz --tag v1.6.1

Please add an option to suppress [INFO] output

The program can decompress data and print it directly to standard output, but there currently seems to be no --quiet/--silent option to turn off the log messages. Small text files often need to be decompressed and viewed directly in the terminal, so please add such an option; otherwise the log messages get mixed into the decompressed text, which is not very elegant.
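a minimal sketch of such a flag, following the structopt style quoted in the "Default compression level" issue above; the field name silent is hypothetical:

/// Suppress [INFO] log output
#[structopt(long = "silent", short = "s")]
silent: bool,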


Cannot run on the win10 platform?

Hello. I ran into some trouble when running the program on the win10 platform. Could you please give me some advice?

F:\orz-master\orz-master\target\debug>.\orz.exe encode 1111111111111111111111111111111111
thread 'main' panicked at 'assertion index < len failed: index out of bounds: index = 16777251, len = 16777251', C:\Users\lenovo\.rustup\toolchains\stable-x86_64-pc-windows-msvc\lib/rustlib/src/rust\src\libcore\macros\mod.rs:16:9
note: run with the RUST_BACKTRACE=1 environment variable to display a backtrace

Any test?

@richox

It seems that there is no test code in the project, so how do we ensure that the compression and decompression results are correct? The project version has reached 1.4, which suggests the functionality is stable and can be used in production. In that case, it is very necessary to add the corresponding test code.
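a minimal round-trip test sketch; it assumes the encode signature shown in the "Panics on debug" issue above, and it also assumes a library-level decode with a mirrored (reader, writer) signature, which may differ from the real API:

use orz::{decode, encode, lz::LZCfg};

#[test]
fn roundtrip_small_buffer() {
    let original = b"the quick brown fox jumps over the lazy dog".repeat(100);
    // cfg values are arbitrary for this sketch
    let cfg = LZCfg { match_depth: 8, lazy_match_depth1: 3, lazy_match_depth2: 2 };

    // compress into an in-memory buffer
    let mut compressed = Vec::new();
    encode(&mut original.as_slice(), &mut compressed, &cfg).expect("encode failed");

    // decompress and compare byte-for-byte with the original input
    let mut restored = Vec::new();
    decode(&mut compressed.as_slice(), &mut restored).expect("decode failed");
    assert_eq!(restored, original);
}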

Default compression level (3): Error: "invalid level: 3"

On FreeBSD I get (using v1.6.2):

Error: "invalid level: 3"

When using the default compression level (3) while encoding:

# orz encode /COPYRIGHT /COPYRIGHT.orz
Error: "invalid level: 3"

Same when specifying -l 3

# orz encode -l 3 /COPYRIGHT /COPYRIGHT.orz
Error: "invalid level: 3"

Dropping to -l 2 seems to work:

# orz encode -l 2 /COPYRIGHT /COPYRIGHT.orz
[INFO] encode: 6109 bytes => 3147 bytes, 1.861MB/s
[INFO] statistics:
[INFO]   size:  6109 bytes => 3165 bytes
[INFO]   ratio: 51.81%
[INFO]   time:  0.016 sec

please add a magic header

Unlike all other Unix compressors, orz's format doesn't give any reliable way to sniff it in a maybe-compressed file. While in some contexts (private data, files with a .orz suffix) the format is already known, there are also cases where programs assume they can detect the transport compression by reading the start of the header. And e.g. libarchive/bsdtar have no mode other than sniffing.

I see that you haven't committed to a stable bitstream yet -- at least, the decompressor gives a warning when trying to uncompress a file made with an earlier version. Thus, adding such a magic might still be acceptable to you.

A proper magic would be:

  • at least 32-bits in length
  • not all in ASCII (unlike current version number)
  • unlikely to happen in unrelated files
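a minimal sketch of what writing and sniffing such a magic could look like; the byte values and function names here are purely illustrative, not a format the project has adopted:

use std::io::{self, Read, Write};

// illustrative 4-byte magic: 32 bits, not all ASCII, unlikely in unrelated files
const ORZ_MAGIC: [u8; 4] = [0x8f, b'o', b'r', b'z'];

fn write_magic(w: &mut impl Write) -> io::Result<()> {
    w.write_all(&ORZ_MAGIC)
}

// returns true if the stream starts with the magic (consumes the first 4 bytes)
fn sniff_magic(r: &mut impl Read) -> io::Result<bool> {
    let mut buf = [0u8; 4];
    r.read_exact(&mut buf)?;
    Ok(buf == ORZ_MAGIC)
}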

Can you support compressing directories?

Most compression software supports compressing directories, but this software currently only supports compressing a single file.
Can you support compressing directories?
