
richox / orz


a high-performance, general-purpose data compressor written in the crab-lang

License: MIT License

Languages: Rust 98.23%, Makefile 0.47%, C 1.29%
Topics: compression, data, crab-lang

orz's Introduction

Orz

orz -- a general-purpose data compressor written in the crab-lang (Rust).


orz is an optimized ROLZ (reduced-offset Lempel-Ziv) general-purpose data compressor. input data is encoded as ROLZ matches (reduced offsets and match lengths), 2-byte words, and single bytes. all encoded symbols are then processed with a symbol-ranking (aka move-to-front) transformer and a static Huffman coder.
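to illustrate the symbol-ranking stage, here is a minimal move-to-front sketch; this is only an illustration of the idea, not orz's actual implementation (the real transformer and the ROLZ/Huffman stages are more involved):

struct Mtf {
    table: Vec<u8>, // current ranking; recently seen symbols sit near the front
}

impl Mtf {
    fn new() -> Self {
        Mtf { table: (0..=255).collect() }
    }

    // encode one symbol as its current rank, then move it to the front, so
    // frequently repeated symbols keep producing small ranks that a static
    // Huffman coder can compress well
    fn encode(&mut self, sym: u8) -> u8 {
        let rank = self.table.iter().position(|&s| s == sym).unwrap();
        let sym = self.table.remove(rank);
        self.table.insert(0, sym);
        rank as u8
    }
}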

thanks to the ROLZ algorithm, orz compresses several times faster than many other LZ-based compressors that reach a similar compression ratio, while decompression speed remains very acceptable.

orz is completely implemented in the crab-lang. clone the repo and run cargo build --release to get an executable orz binary.

installation

you can install orz with cargo:

cargo install orz --git https://github.com/richox/orz --tag v1.6.2

usage

for compression:

orz encode <source-file-input> <compressed-file-output>

for decompression:

orz decode <compressed-file-input> <source-file-output>

for more details, see orz --help

benchmarks

benchmark on the 100 MB Large Text Compression Benchmark file (enwik8, see http://mattmahoney.net/dc/text.html):

(for the latest enwik8 benchmark results, see GitHub Actions)

name         compressed size   encode time   decode time
xz -6        26,665,156        69.815s       1.309s
orz -l2      26,893,684        8.245s        1.414s
zstd -19     26,942,199        62.931s       0.239s
orz -l1      27,220,056        6.714s        1.393s
orz -l0      27,896,572        5.209s        1.405s
bzip2 -9     29,008,758        7.417s        3.538s
zstd -15     29,544,237        29.860s       0.196s
brotli -9    29,685,672        36.147s       0.285s
brotli -8    30,326,580        17.989s       0.271s
zstd -10     30,697,144        4.205s        0.192s
brotli -7    31,057,759        11.730s       0.267s
lzfse        36,157,828        1.762s        0.179s
gzip -6      36,548,933        4.461s        0.357s

orz's People

Contributors

artoria2e5, dependabot[bot], marcusklaas, neutron3529, richox


orz's Issues

Provide a C API+ABI

The world out there still speaks C by and large. To let more people use the library, Orz should get a C API exported, so that people can use it from C++, Objective-C, Nim, Python, Node.js and everything else.

The "A little Rust with your C" chapter explains how to make public functions C-compatible and how to generate headers using cbindgen.
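a minimal sketch of what an exported C entry point could look like, assuming the encode signature shown in the "Panics on debug" issue below (orz::encode(&mut reader, &mut writer, &LZCfg)); the exported names orz_compress and OrzResult are hypothetical, and a matching header would be generated with cbindgen as that chapter describes:

use orz::encode;
use orz::lz::LZCfg;
use std::slice;

#[repr(C)]
pub struct OrzResult {
    pub ok: i32,             // 1 on success, 0 on failure
    pub compressed_len: usize,
}

/// # Safety
/// `src` must be valid for `src_len` bytes and `dst` must be valid for `dst_cap` bytes.
#[no_mangle]
pub unsafe extern "C" fn orz_compress(
    src: *const u8,
    src_len: usize,
    dst: *mut u8,
    dst_cap: usize,
) -> OrzResult {
    let mut input = slice::from_raw_parts(src, src_len);
    let mut output = Vec::new();
    // cfg values copied from the reproduction code in the "Panics on debug" issue below
    let cfg = LZCfg { match_depth: 48, lazy_match_depth1: 32, lazy_match_depth2: 16 };
    match encode(&mut input, &mut output, &cfg) {
        Ok(_) if output.len() <= dst_cap => {
            std::ptr::copy_nonoverlapping(output.as_ptr(), dst, output.len());
            OrzResult { ok: 1, compressed_len: output.len() }
        }
        _ => OrzResult { ok: 0, compressed_len: 0 },
    }
}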

Default compression level

Hello,
the default compression level should be 2 instead of 3.
Level 3 throws an error because, I think, it was removed.

I think it is on line 20 of main.rs:

/// Set compression level (0..3)
#[structopt(long = "level", short = "l", default_value = "3")]
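a minimal sketch of the suggested change (default level 2, range 0..2); the surrounding field name level is hypothetical here:

/// Set compression level (0..2)
#[structopt(long = "level", short = "l", default_value = "2")]
level: u8,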

Produces incorrect output if input is a pipe

Steps to reproduce:

admin@ip-172-31-23-30:~/beat-orz/orz$ head -c 1000000 /dev/urandom > /tmp/a
admin@ip-172-31-23-30:~/beat-orz/orz$ cat /tmp/a | /home/admin/beat-orz/orz/target/release/orz encode > /tmp/or
[INFO] encode: 65536 bytes => 66663 bytes, 2.537MB/s
[INFO] statistics:
[INFO]   size:  65536 bytes => 66681 bytes
[INFO]   ratio: 101.75%
[INFO]   speed: 2.482 MB/s
[INFO]   time:  0.026 sec
admin@ip-172-31-23-30:~/beat-orz/orz$ /home/admin/beat-orz/orz/target/release/orz decode < /tmp/or > /tmp/trip
[INFO] decode: 65536 bytes <= 66663 bytes, 6.351MB/s
[INFO] statistics:
[INFO]   size:  65536 bytes => 66681 bytes
[INFO]   ratio: 101.75%
[INFO]   speed: 6.172 MB/s
[INFO]   time:  0.011 sec
admin@ip-172-31-23-30:~/beat-orz/orz$ cmp /tmp/trip /tmp/a
cmp: EOF on /tmp/trip after byte 65536, in line 284

OS: Debian Linux bookworm, rustc 1.73.0-nightly, orz 3380556

Possible reason: the return value of libc::read (or something similar) probably isn't being checked, so short reads from the pipe get dropped.
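a minimal sketch of the kind of read loop that copes with short reads from a pipe (pipes frequently return fewer bytes than requested); this is only an illustration, not the crate's actual I/O code:

use std::io::{self, Read};

// fill `buf` as far as possible, looping over short reads; returns the number
// of bytes read, which is less than buf.len() only at end of input
fn read_full(reader: &mut impl Read, buf: &mut [u8]) -> io::Result<usize> {
    let mut total = 0;
    while total < buf.len() {
        match reader.read(&mut buf[total..]) {
            Ok(0) => break, // EOF
            Ok(n) => total += n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(total)
}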

Performance?

I was super excited to see this! I'm currently looking for a fast compression alternative to zstd for compressing PostgreSQL WAL archives.

At least for this use case, I wasn't able to reproduce the benchmarks you've provided.

(orz v1.6.2 installed using cargo install as described in the README; also tested with cargo build --release from the current HEAD):

$ zstd 00000003000025EF0000007C
00000003000025EF0000007C : 50.55%   (  16.0 MiB =>   8.09 MiB, 00000003000025EF0000007C.zst)
'zstd 00000003000025EF0000007C' time: 0.064s, cpu: 104%

orz encode -l0 00000003000025EF0000007C 00000003000025EF0000007C.orz
[INFO] encode: 16777216 bytes => 8111757 bytes, 25.301MB/s
[INFO] statistics:
[INFO]   size:  16777216 bytes => 8111839 bytes
[INFO]   ratio: 48.35%
[INFO]   time:  0.669 sec

That is roughly a factor of 10 slower than zstd :(

Platform: M1 Apple Silicon macOS (native), x86_64 Linux (musl cross-compiled)

thread 'main' has overflowed its stack on Windows

When I use orz on a Windows machine I get the error:

thread 'main' has overflowed its stack

and it creates an empty zip file.

Checked on a Debian machine, it worked perfectly. The problem exists only on Windows.

Panics on debug

I tried this code (using the master branch)...

use orz::encode;
use orz::lz::LZCfg;

fn main() {
    let mut src = "Hola a todos!".as_bytes();
    let mut out: Vec<u8> = vec![];
    let cfg = LZCfg {
        match_depth: 48,
        lazy_match_depth1: 32,
        lazy_match_depth2: 16,
    };
    let result = encode(&mut src, &mut out, &cfg);
    match result {
        Ok(stat) => {
            println!(
                "source_size: {} -- target_size: {}",
                stat.source_size, stat.target_size
            );
        }
        Err(e) => eprintln!("Error: {:?}", e),
    };
}

It only works if I run it with the --release flag.


thread 'main' has overflowed its stack

I tried to test something but didn't even get to that point, as the built executable crashes with this message:
thread 'main' has overflowed its stack

gdb says this:

$  gdb --args orz__debug_w32 encode README.md README.md.orz
GNU gdb (GDB) 7.9.1
(...)
Reading symbols from orz__debug_w32...done.
(gdb) r
Starting program: d:\progs\dev\src\orz\orz\orz__debug_w32.exe encode README.md README.md.orz
[New Thread 14228.0x11dc]
[New Thread 14228.0x2980]
[New Thread 14228.0x1cc4]
[New Thread 14228.0x450]

Program received signal SIGSEGV, Segmentation fault.
_alloca () at ../../../../../src/gcc-8.1.0/libgcc/config/i386/cygwin.S:88
88      ../../../../../src/gcc-8.1.0/libgcc/config/i386/cygwin.S: No such file or directory.
(gdb) bt
#0  _alloca () at ../../../../../src/gcc-8.1.0/libgcc/config/i386/cygwin.S:88
#1  0x0057ca13 in orz::encode::h64d6467265acc7bf (
    source=<error reading variable: Cannot access memory at address 0x6a0fc30>,
    target=<error reading variable: Cannot access memory at address 0x6a0fc38>, cfg=0x1a0fae4) at src/lib.rs:44
#2  0x004036d6 in orz::main::hc5aba79d15bc2c2c () at src/main.rs:94
#3  0x00407f0b in core::ops::function::FnOnce::call_once::hfde464d49ace8ae2 ()
    at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3\library\core\src\ops/function.rs:248
#4  0x00402062 in std::sys_common::backtrace::__rust_begin_short_backtrace::h3b09b2cc1997b89a (
    f=0x402910 <orz::main::hc5aba79d15bc2c2c>)
    at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3\library\std\src\sys_common/backtrace.rs:122
#5  0x00408a93 in std::rt::lang_start::_$u7b$$u7b$closure$u7d$$u7d$::he2d87c0b87bf469b ()
    at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3\library\std\src/rt.rs:145
#6  0x0065c340 in call_once<(), (dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> () at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3\library\core\src\ops/function.rs:280
#7  do_call<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> () at library\std\src/panicking.rs:492
#8  try<i32, &(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> () at library\std\src/panicking.rs:456
#9  catch_unwind<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> () at library\std\src/panic.rs:137
#10 {closure#2} () at library\std\src/rt.rs:128
#11 do_call<std::rt::lang_start_internal::{closure_env#2}, isize> () at library\std\src/panicking.rs:492
#12 try<isize, std::rt::lang_start_internal::{closure_env#2}> () at library\std\src/panicking.rs:456
#13 catch_unwind<std::rt::lang_start_internal::{closure_env#2}, isize> () at library\std\src/panic.rs:137
#14 std::rt::lang_start_internal::h71a9cc7a00235f34 () at library\std\src/rt.rs:128
#15 0x00408a70 in std::rt::lang_start::h9847c1da96d8463b (main=0x402910 <orz::main::hc5aba79d15bc2c2c>, argc=4,
    argv=0x22e2df8) at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3\library\std\src/rt.rs:144
#16 0x004053c3 in main ()
(gdb) l
83      in ../../../../../src/gcc-8.1.0/libgcc/config/i386/cygwin.S
(gdb)  r
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: d:\progs\dev\src\orz\orz\orz__debug_w32.exe encode README.md README.md.orz
[New Thread 22192.0x4bc8]
[New Thread 22192.0x24ac]
[New Thread 22192.0x3da4]
[New Thread 22192.0xf1c]

Program received signal SIGSEGV, Segmentation fault.
_alloca () at ../../../../../src/gcc-8.1.0/libgcc/config/i386/cygwin.S:88
88      in ../../../../../src/gcc-8.1.0/libgcc/config/i386/cygwin.S
(gdb) q

I used a brand-new rustc 1.62.1, host i686-pc-windows-gnu / x86_64-pc-windows-gnu, from here. I tried both with the same result.

New hash function

Coming from c656c07#r37659833

We should probably add a comment here that hash_dword is always taken modulo LZ_MF_BUCKET_ITEM_HASH_SIZE (5219); otherwise it would not make much sense to do anything here (usize always fits a u32 on 32- and 64-bit platforms).

Since log 5219 / log 2 ≈ 12.35, the largest hash we would ever need is a 16-bit one. A Pearson hash does not look too bad in this case:

const PEAR: [u8; 256] = /* RFC 3074 table here */;

#[inline]
fn hash_pearson(val: u32) -> u8 {
    let mut h: u8 = PEAR[(val >> 24) as usize];
    h = PEAR[(h ^ (val >> 16) as u8) as usize];
    h = PEAR[(h ^ (val >> 8) as u8) as usize];
    h = PEAR[(h ^ val as u8) as usize];
    h
}

/// Hash the u32 at buf[pos] to a usize (always taken modulo LZ_MF_BUCKET_ITEM_HASH_SIZE).
unsafe fn hash_dword(buf: &[u8], pos: usize) -> usize {
    let val = buf.read::<u32>(pos).to_be();
    ((hash_pearson(val) as usize) << 8) | hash_pearson(val ^ 0x0100_0000) as usize
}

(djb2 looks cool too, if you like the multiplication stuff.)
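for comparison, a minimal djb2-style variant (the constants are the classic djb2 ones, the final mask to 16 bits matches the discussion above, and this is only a sketch, not a concrete proposal for orz):

#[inline]
fn hash_djb2(val: u32) -> usize {
    let mut h: u32 = 5381;
    for byte in val.to_be_bytes() {
        h = h.wrapping_mul(33) ^ u32::from(byte);
    }
    (h & 0xffff) as usize
}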

build failed

error: linker cc not found
|
= note: No such file or directory (os error 2)

error: aborting due to previous error

error: could not compile libc

To learn more, run the command again with --verbose.
warning: build failed, waiting for other jobs to finish...
error: linker cc not found
|
= note: No such file or directory (os error 2)

error: aborting due to previous error

error: failed to compile orz v1.6.1 (https://github.com/richox/orz#28811d98), intermediate artifacts can be found at /tmp/cargo-install9LTsSH

Caused by:
build failed

Install instruction fails

The readme tells me to use

cargo install --git https://github.com/richox/orz --tag v1.6.1

to install it, but that fails with the error

error: multiple packages with binaries found: benchmark-tool, orz

Should be

cargo install orz --git https://github.com/richox/orz --tag v1.6.1

Please add an option to suppress [INFO] output

The program can decompress data and print it directly to standard output, but there currently seems to be no --quiet/--silent option to turn off the log messages. Small text files often need to be decompressed and viewed directly in the terminal, so please add such an option; otherwise the log messages get mixed into the decompressed text, which is not very elegant.
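a minimal sketch of such a flag, following the structopt style quoted in the "Default compression level" issue above; the field name silent is hypothetical:

/// Suppress [INFO] log output
#[structopt(long = "silent", short = "s")]
silent: bool,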


Cannot run on the win10 platform?

Hello. I ran into some trouble when running the program on the win10 platform. Could you please give me some advice?

F:\orz-master\orz-master\target\debug>.\orz.exe encode 1111111111111111111111111111111111
thread 'main' panicked at 'assertion index < len failed: index out of bounds: index = 16777251, len = 16777251', C:\Users\lenovo\.rustup\toolchains\stable-x86_64-pc-windows-msvc\lib/rustlib/src/rust\src\libcore\macros\mod.rs:16:9
note: run with the RUST_BACKTRACE=1 environment variable to display a backtrace

Any test?

@richox

It seems that there is no test code in the project, so how do we ensure that the compression and decompression results are correct? The project version has reached 1.4, which suggests the functionality is stable and can be used in production. In that case, it is very necessary to add the corresponding test code.
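a minimal round-trip test sketch; it assumes the encode signature shown in the "Panics on debug" issue above, and it also assumes a library-level decode with a mirrored (reader, writer) signature, which may differ from the real API:

use orz::{decode, encode, lz::LZCfg};

#[test]
fn roundtrip_small_buffer() {
    let original = b"the quick brown fox jumps over the lazy dog".repeat(100);
    // cfg values are arbitrary for this sketch
    let cfg = LZCfg { match_depth: 8, lazy_match_depth1: 3, lazy_match_depth2: 2 };

    // compress into an in-memory buffer
    let mut compressed = Vec::new();
    encode(&mut original.as_slice(), &mut compressed, &cfg).expect("encode failed");

    // decompress and compare byte-for-byte with the original input
    let mut restored = Vec::new();
    decode(&mut compressed.as_slice(), &mut restored).expect("decode failed");
    assert_eq!(restored, original);
}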

Default compression level (3): Error: "invalid level: 3"

On FreeBSD I get (using v1.6.2):

Error: "invalid level: 3"

When using the default compression level (3) while encoding:

# orz encode /COPYRIGHT /COPYRIGHT.orz
Error: "invalid level: 3"

Same when specifying -l 3

# orz encode -l 3 /COPYRIGHT /COPYRIGHT.orz
Error: "invalid level: 3"

Dropping to -l 2 seems to work:

# orz encode -l 2 /COPYRIGHT /COPYRIGHT.orz
[INFO] encode: 6109 bytes => 3147 bytes, 1.861MB/s
[INFO] statistics:
[INFO]   size:  6109 bytes => 3165 bytes
[INFO]   ratio: 51.81%
[INFO]   time:  0.016 sec

please add a magic header

Unlike all other Unix compressors, orz's format doesn't give any reliable way to sniff it in a maybe-compressed file. While in some contexts (private data, files with a .orz suffix) the format is already known, there are also cases where programs assume they can detect the transport compression by reading the start of the header. And e.g. libarchive/bsdtar have no mode other than sniffing.

I see that you haven't committed to a stable bitstream yet -- at least, the decompressor gives a warning when trying to uncompress a file made with an earlier version. Thus, adding such a magic might still be acceptable to you.

A proper magic would be:

  • at least 32-bits in length
  • not all in ASCII (unlike current version number)
  • unlikely to happen in unrelated files
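a minimal sketch of what writing and sniffing such a magic could look like; the byte values and function names here are purely illustrative, not a format the project has adopted:

use std::io::{self, Read, Write};

// illustrative 4-byte magic: 32 bits, not all ASCII, unlikely in unrelated files
const ORZ_MAGIC: [u8; 4] = [0x8f, b'o', b'r', b'z'];

fn write_magic(w: &mut impl Write) -> io::Result<()> {
    w.write_all(&ORZ_MAGIC)
}

// returns true if the stream starts with the magic (consumes the first 4 bytes)
fn sniff_magic(r: &mut impl Read) -> io::Result<bool> {
    let mut buf = [0u8; 4];
    r.read_exact(&mut buf)?;
    Ok(buf == ORZ_MAGIC)
}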

Can you support compressing directories?

Most compression software supports compressing directories, but this software currently only supports compressing a single file.
Can you support compressing directories?
