image-rs / weezl

LZW en- and decoding that goes weeeee!

License: Apache License 2.0

Rust 100.00%

weezl's Introduction

weezl

Overview

This library, written in purely safe and dependency-free Rust, provides encoding and decoding for LZW compression in the style used by the GIF and TIFF image formats. It includes a standalone binary that can process such data streams, but it is not compatible with Spencer's compress and uncompress binaries (though a drop-in replacement may be developed later).

Use in a no_std environment is also possible, though an allocator is required. This requirement, too, may be relaxed in a later release. A feature flag already exists, but it currently turns off almost all interfaces.
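
For readers unfamiliar with the scheme, here is a minimal sketch of the LZW dictionary idea itself. It is not weezl's implementation or API. It assumes a fixed 256-byte alphabet and a 4096-entry cap, and it emits codes as plain u16 values rather than a packed GIF or TIFF bitstream with clear and end codes. The names lzw_encode, lzw_decode and MAX_CODES are illustrative only.

```rust
use std::collections::HashMap;

/// Dictionary cap, mirroring the 12-bit code limit of GIF and TIFF (assumption of this sketch).
const MAX_CODES: usize = 1 << 12;

/// Encodes `data` into a sequence of unpacked LZW codes.
fn lzw_encode(data: &[u8]) -> Vec<u16> {
    // Maps (code of prefix, next byte) to the code of the longer string.
    let mut dict: HashMap<(u16, u8), u16> = HashMap::new();
    let mut next_code: u16 = 256;
    let mut out = Vec::new();
    let mut bytes = data.iter().copied();
    let mut current = match bytes.next() {
        Some(first) => u16::from(first),
        None => return out,
    };
    for byte in bytes {
        if let Some(&code) = dict.get(&(current, byte)) {
            // The longer string is already known: keep extending it.
            current = code;
        } else {
            out.push(current);
            if usize::from(next_code) < MAX_CODES {
                dict.insert((current, byte), next_code);
                next_code += 1;
            }
            current = u16::from(byte);
        }
    }
    out.push(current);
    out
}

/// Decodes codes produced by `lzw_encode`; returns `None` on an invalid code.
fn lzw_decode(codes: &[u16]) -> Option<Vec<u8>> {
    // table[i] holds the byte string for code 256 + i.
    let mut table: Vec<Vec<u8>> = Vec::new();
    let mut out = Vec::new();
    let mut prev: Option<Vec<u8>> = None;
    for &code in codes {
        let index = usize::from(code);
        let entry = if index < 256 {
            vec![code as u8]
        } else if let Some(known) = table.get(index - 256) {
            known.clone()
        } else if index - 256 == table.len() && index < MAX_CODES {
            // The code names the entry that is about to be created:
            // it must be the previous string plus its own first byte.
            let p = prev.as_ref()?;
            let mut e = p.clone();
            e.push(p[0]);
            e
        } else {
            return None;
        };
        // Rebuild the entry the encoder added after emitting the previous code.
        if let Some(mut grown) = prev.take() {
            if table.len() + 256 < MAX_CODES {
                grown.push(entry[0]);
                table.push(grown);
            }
        }
        out.extend_from_slice(&entry);
        prev = Some(entry);
    }
    Some(out)
}
```

Calling lzw_decode(&lzw_encode(input)) should give back the original bytes. Real GIF and TIFF streams additionally pack the codes into a variable-width bitstream and interleave clear and end codes, which is the part this crate handles.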

License

All code is dual licensed MIT OR Apache-2.0.

weezl's Issues

Add a `skip` method, discarding some amount of input

In GIF, part of a frame may lie outside the region of interest. In such cases it would be worth investigating whether decoding can be sped up by skipping over and discarding some data. A similar strategy might be useful for seeking in compressed archives.

Support implicit reset

I am reverse engineering a proprietary image format that uses LZW internally for compressing frames, and I also plan to write a converter for it in Rust as practice. I chose weezl because it looks promising and it's already a dependency of image-rs, which I am also using in the converter for reading and writing images in common formats.

However, one problem I ran into is that weezl doesn't like bitstreams with no leading clear code. The official converter for that image format apparently always emits this kind of bitstream. Other than that, they seem to be standard LZW LSB bitstreams and should be supported by weezl. I saw that there's a TODO in the decoder source code. Any chance that this will be supported?
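
One way a caller could probe for this situation before handing data to a decoder is to inspect the first code in the stream. The sketch below assumes GIF-style conventions: LSB-first packing, an initial code width of min_size + 1 bits, and a clear code of 1 << min_size. The helper names are hypothetical and are not part of weezl.

```rust
/// Reads the first `width`-bit code from an LSB-first packed stream.
/// Returns `None` if the width is out of range or the data is too short.
fn first_code_lsb(data: &[u8], width: u8) -> Option<u16> {
    if width == 0 || width > 12 {
        return None;
    }
    let needed = (usize::from(width) + 7) / 8;
    if data.len() < needed {
        return None;
    }
    // In LSB order the first byte supplies the lowest bits of the code.
    let mut acc: u32 = 0;
    for (i, &byte) in data.iter().take(needed).enumerate() {
        acc |= u32::from(byte) << (8 * i);
    }
    Some((acc & ((1u32 << width) - 1)) as u16)
}

/// Reports whether the stream begins with the clear code for `min_size`.
/// Hypothetical helper; assumes GIF-style code assignment.
fn starts_with_clear_code(data: &[u8], min_size: u8) -> Option<bool> {
    let width = min_size.checked_add(1)?;
    let first = first_code_lsb(data, width)?;
    // `width <= 12` was checked above, so this shift cannot overflow a u16.
    Some(first == 1u16 << min_size)
}
```

With min_size = 8, a stream starting with the bytes 0x00 0x01 begins with code 256 (the clear code), while one starting with 0x41 begins with a literal. A decoder supporting implicit reset would presumably treat the second case as if a clear code had just been seen.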

A way to avoid huge allocations

Currently the encoder unconditionally makes huge allocations. I assume this is a performance tradeoff.

https://github.com/image-rs/lzw/blob/0d3c809a37574cc84684e02c96c62ef079c926c9/src/encode.rs#L251

Could there be a way to select a different tradeoff, maximum, or maybe something based on the size of the input?
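
As an illustration of what an input-based bound could look like, note that an LZW encoder adds at most one dictionary entry per emitted code and emits at most one code per input byte. The sketch below turns that observation into a capacity hint. The function name and the idea of passing the input length up front are assumptions, not part of the crate, and whether the allocation at the linked line could actually be sized this way is untested.

```rust
/// Hard cap on dictionary entries for 12-bit codes.
const MAX_ENTRIES: usize = 1 << 12;

/// Upper bound on the number of dictionary entries an encoder can need
/// for `input_len` bytes. Hypothetical helper, not a weezl API.
fn max_table_entries(input_len: usize, min_size: u8) -> usize {
    // Literal codes plus the clear and end codes (GIF-style assumption).
    let initial = (1usize << min_size.min(12)) + 2;
    // At most one new entry per input byte, never beyond the 12-bit cap.
    initial.saturating_add(input_len).min(MAX_ENTRIES)
}
```

For a 10-byte input at min_size 8 this gives 268 entries instead of 4096, so small inputs would benefit most. Large inputs hit the cap either way.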

Add restore points

Complementing forward seeking (#8), add the ability to restore a particular state of decoding and encoding. For encoding specifically, this may in the future also be used to tune compression ratios by purposefully inserting additional reset codes, or by continuing with a full dictionary, to optimize dictionary usage.

Rename repository to weezl

This crate is published under the name weezl. It might make sense to rename the repository to match, to avoid possible confusion.

New lzw encoder creates invalid streams

Encoding of large-ish images results in GIF images that look broken.

For example re-encoding of this image:

gives this file:

(Firefox refuses to render it. Chrome and macOS render only 20-something lines and garbage pixels.)

It's easy to reproduce with the example code. I've verified using another codebase that it's a bug in the GIF encoder, not the reader. The bug is in v0.11. It's not in v0.10.

unit tests can't be run from the crate downloaded from crates.io

This is more of a 'for your information' rather than a bug report about something that is wrong in the project. But I thought it wouldn't hurt to inform upstream about it :)

The unit tests depend on a file named /benches/binary-8-msb.lzw that isn't included in the crate uploaded to crates.io.

test output:

error: couldn't read /tmp/r/weezl-0.1.5/benches/binary-8-msb.lzw: No such file or directory (os error 2)
    --> src/decode.rs:1240:37
     |
1240 |           const FILE: &'static [u8] = include_bytes!(concat!(
     |  _____________________________________^
1241 | |             env!("CARGO_MANIFEST_DIR"),
1242 | |             "/benches/binary-8-msb.lzw"
1243 | |         ));
     | |__________^
     |
     = note: this error originates in the macro `include_bytes` (in Nightly builds, run with -Z macro-backtrace for more info)

error: could not compile `weezl` due to previous error
warning: build failed, waiting for other jobs to finish...
error: build failed

This means that we have to disable the tests when packaging this crate for Debian. Would it be possible to include the /benches/binary-8-msb.lzw file in the next release?

Invalid codes being created during decode: `debug_asserts` for invariants

These codes are inserted into the table, but can't be used or referenced in the code text. They are just 'waste'.

I've thrown together a patch that adds debug_asserts for the actual invariants the code is working under:

Patch file

From ac575ce26fb081883092536e0fcbf00c2af59cc2 Mon Sep 17 00:00:00 2001
From: Andreas Molzer <[email protected]>
Date: Tue, 19 Apr 2022 21:29:13 +0200
Subject: [PATCH] Add debug assertions on internal invariants

---
 src/decode.rs | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/src/decode.rs b/src/decode.rs
index 283e31f..49f3bfd 100644
--- a/src/decode.rs
+++ b/src/decode.rs
@@ -711,7 +711,7 @@ impl<C: CodeBuffer> Stateful for DecodeState<C> {
             Some(tup) => {
                 status = Ok(LzwStatus::Ok);
                 code_link = Some(tup)
-            },
+            }
         };
 
         // Track an empty `burst` (see below) means we made no progress.
@@ -827,6 +827,7 @@ impl<C: CodeBuffer> Stateful for DecodeState<C> {
                 // the case of requiring an allocation (which can't occur in practice).
                 let new_link = self.table.derive(&link, cha, code);
                 self.next_code += 1;
+                debug_assert!(self.next_code as usize <= MAX_ENTRIES);
                 code = burst;
                 link = new_link;
             }
@@ -918,6 +919,8 @@ impl<C: CodeBuffer> Stateful for DecodeState<C> {
                 }
 
                 self.next_code += 1;
+                debug_assert!(self.next_code as usize <= MAX_ENTRIES);
+
                 new_link = link;
             } else {
                 // It's actually quite likely that the next code will be a reset but just in case.
@@ -1203,6 +1206,13 @@ impl Table {
     }
 
     fn derive(&mut self, from: &Link, byte: u8, prev: Code) -> Link {
+        debug_assert!(
+            self.inner.len() < MAX_ENTRIES,
+            "Invalid code would be created {:?} {} {:?}",
+            from.prev,
+            byte,
+            prev
+        );
         let link = from.derive(byte, prev);
         let depth = self.depths[usize::from(prev)] + 1;
         self.inner.push(link.clone());
-- 
2.35.1

The trace from running the decoder with those assertions suggests that the comparison itself relies on an incorrect assumption. Because it uses ==, it relies on self.next_code <= self.code_buffer.max_code(), but that doesn't hold. Once we reach 12 bits the code buffer does not get any larger and max_code() remains at 4095. Meanwhile next_code advances to 4096, and never beyond that in the sequential code path. That code is never created, so the sequential path still works correctly with the rest of the logic.

But when that is the exact moment we enter a burst, as is the case with the provided file, the burst advances next_code beyond that value without noticing that the maximum code has been reached. An easy fix would be to adjust the condition:

if potential_code >= self.code_buffer.max_code() - Code::from(self.is_tiff) {

I'll measure if that leads to too much of a performance loss due to executing less of the simple code reconstruction.
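
The difference between the two comparisons can be shown with a simplified model. This is not the decoder's actual code. MAX_CODE stands in for the value max_code() keeps once the code width is 12 bits, and the function names are made up for illustration.

```rust
type Code = u16;

/// Value `max_code()` stays at once the width has reached 12 bits.
const MAX_CODE: Code = 4095;

/// Equality check: only fires if the counter lands exactly on the limit.
fn limit_hit_eq(potential_code: Code, is_tiff: bool) -> bool {
    potential_code == MAX_CODE - Code::from(is_tiff)
}

/// Adjusted check: also fires once the counter has moved past the limit.
fn limit_hit_ge(potential_code: Code, is_tiff: bool) -> bool {
    potential_code >= MAX_CODE - Code::from(is_tiff)
}
```

At 4096, one past the limit, limit_hit_eq returns false while limit_hit_ge returns true. At exactly 4095 both agree, which is why the problem only shows up when a burst starts at the wrong moment.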

Originally posted by @HeroicKatora in #30 (comment)

Add dumb methods for lazy users

Sometimes you just want to get data (de)compressed and don't really have the patience to look at the elegance of a very finely adjustable system.

For those cases I propose adding these functions:

fn encode(data: &[u8], order: BitOrder, size: u8) -> Vec<u8>
fn encode_tiff(data: &[u8], order: BitOrder, size: u8) -> Vec<u8>
fn decode(data: &[u8], order: BitOrder, size: u8) -> Vec<u8>
fn decode_tiff(data: &[u8], order: BitOrder, size: u8) -> Vec<u8>

In reality the decode functions would probably return a Result instead.

LZW attempts to decode buffer after output is already filled

In DecodeState::advance, after processing a burst, the decoder unconditionally processes the new code. However, we shouldn't do so if the output is already filled, because the remaining bits in the buffer may be nonsense. This caused an InvalidCode error when trying to read one of my images.

I attempted to write a fix at https://github.com/dalcde/lzw/tree/check-out but I didn't make a PR because I'm not confident it is correct.
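
The ordering issue can be illustrated with a toy model, which is neither DecodeState::advance nor the linked fix. Here codes below 256 are literals, anything else counts as invalid, and the loop checks for remaining output space before interpreting the next code. The function name and return shape are assumptions made for the sketch.

```rust
/// Toy decoder: writes literal codes into `out` until it is full.
/// Returns (codes consumed, bytes written), or the offending code on error.
fn decode_literals(codes: &[u16], out: &mut [u8]) -> Result<(usize, usize), u16> {
    let mut written = 0;
    for (consumed, &code) in codes.iter().enumerate() {
        // Check for space *before* interpreting the code: once the output
        // is full, the remaining input may be padding or garbage.
        if written == out.len() {
            return Ok((consumed, written));
        }
        if code > 255 {
            return Err(code);
        }
        out[written] = code as u8;
        written += 1;
    }
    Ok((codes.len(), written))
}
```

With the codes [72, 105, 4000] and a two-byte output buffer, the function stops after two codes and never looks at 4000. With a three-byte buffer the same input is reported as an error, which mirrors the InvalidCode failure described above.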