kosarev / escapeless


Efficient binary encoding for large alphabets

License: MIT License


escapeless

Features

  • Low fixed-size overhead.
  • Compression-friendly output.
  • Arbitrary alphabets.
  • Fast and simple algorithm.
  • Does not involve heavy-weight arithmetic.

Comparison chart

 Encoding          Alphabet size  Overhead
 escapeless255     255            0.4%
 escapeless254     254            0.8%
 escapeless253     253            1.2%
 yEnc              252            1.6%*, 0-100%
 escapeless252     252            1.6%
 escapeless251     251            2.0%
 escapeless250     250            2.4%
 B-News            224            2.5%
 escapeless240     240            6.7%
 escapeless230     230            11.4%
 escapeless225     225            13.8%
 Base122           122            14.3%
 basE91            91             22%*, 14-23%
 Base94            94             22.2%
 Ascii85           85             25.0%
 Z85               85             25.0%
 Base64            64             33.3%
 uuencode          64             33.3%
 Base58            58             36.6%
 Base36 / 64-bit   36             59.2%*, 0-62.5%
 Base32            32             60.0%
 Base36 / 32-bit   36             62.0%*, 0-75%
 Base16            16             100.0%

(*) Assuming a uniform distribution of input octets.

Building and testing

$ git clone git@github.com:kosarev/escapeless.git
$ cd escapeless/c
$ make
$ make test

Basic idea

Given a source alphabet of size S and a target alphabet of size N < S, break the sequence of input characters into blocks so that the number of characters in each block does not exceed N − 1.

Since a block can contain at most N − 1 different characters and the target alphabet contains N characters, every character used in the block can be mapped to its own character of the target alphabet, and at least one character of the target alphabet remains unmapped. For example:

 A B C D E F G H I J K L    12  Characters of the source alphabet (S)
 A   C D E     H I   K L     8  Characters of the target alphabet (N)
   x       x x     x         4  Characters missing in the target alphabet (takeouts)
   | | | |     | | |         7  Characters used in the block
 .         . .       . .     5  Characters not used in the block

Here, one possible mapping is:

 B −> A
 J −> K

with L left unmapped and the remaining characters used in the block (C, D, E, H and I) mapped to themselves.

The unmapped character serves as a common substitute for the unused takeouts, such as F and G in the example: it is a character of the target alphabet that does not represent any character of the source alphabet within that block. Taking that into account, here is the complete mapping:

 B −> A
 F −> L
 G −> L
 J −> K

Once the mapping is determined, we can output the encoded block with the takeout characters in it replaced by members of the target alphabet. To let the decoder know the mapping, each encoded block is prefixed with the characters the takeouts are mapped to, one per takeout (A, L, L, K in the example). The decoder is assumed to be given the same takeout characters in the same order.
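
The mapping for the 12-character example above can be reproduced with a short Python sketch. The function name `build_takeout_map` and the policy of reserving the last free character for unused takeouts are assumptions made for illustration. They are not taken from the project's code, and any other choice of free characters would work equally well.

```python
def build_takeout_map(target, takeouts, block):
    """Return {takeout: substitute} for one block (illustrative sketch)."""
    used = set(block)
    # Target characters that do not occur in the block are free to stand in
    # for takeouts. A block of at most N - 1 characters always leaves enough.
    free = [c for c in target if c not in used]
    # Reserve the last free character for takeouts missing from the block
    # (an arbitrary choice for this sketch).
    filler = free.pop()
    mapping = {}
    for t in takeouts:
        # Takeouts present in the block each get their own substitute.
        mapping[t] = free.pop(0) if t in used else filler
    return mapping


target = "ACDEHIKL"   # N = 8
takeouts = "BFGJ"     # S - N = 4
block = "BCDEHIJ"     # N - 1 = 7 characters

mapping = build_takeout_map(target, takeouts, block)
print(mapping)  # {'B': 'A', 'F': 'L', 'G': 'L', 'J': 'K'}

# The encoded block is the substitutes in takeout order, then the rewritten block.
encoded = "".join(mapping[t] for t in takeouts) + "".join(mapping.get(c, c) for c in block)
print(encoded)  # ALLKACDEHIK
```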

Overhead formula

For a source alphabet of size S, a target alphabet of size N and a block of N − 1 characters, the size of the encoded block is:

 encoded_block_size = takeouts_map_size + block_size =
                      (S - N) + (N - 1) =
                      S - 1

The overhead is thus:

 overhead = (encoded_block_size - block_size) / block_size =
            ((S - 1) - (N - 1)) / (N - 1) =
            (S - 1 - N + 1) / (N - 1) =
            (S - N) / (N - 1)
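
As a quick check, the formula can be evaluated for S = 256 and a few of the alphabet sizes from the comparison chart. This is only a sketch; the function name `escapeless_overhead` is made up for the example.

```python
def escapeless_overhead(source_size, target_size):
    """Overhead of a full block: (S - N) / (N - 1)."""
    if not 1 < target_size < source_size:
        raise ValueError("expected 1 < N < S")
    return (source_size - target_size) / (target_size - 1)


for n in (255, 254, 252, 240, 225):
    print(f"escapeless{n}: {escapeless_overhead(256, n):.1%}")
# escapeless255: 0.4%
# escapeless254: 0.8%
# escapeless252: 1.6%
# escapeless240: 6.7%
# escapeless225: 13.8%
```

Note that the formula describes a full block of N − 1 characters. A shorter final block still carries the whole S − N character map, so its relative overhead is higher.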

Encoding algorithm

  1. Break the input message into blocks so that no block contains more than N - 1 characters, where N is the size of the target alphabet. Process every block separately as specified below.

  2. Map every takeout character to a character of the target alphabet that is not used in the block. Takeouts that occur in the block shall each map to a different character. All takeouts not used in the block shall map to the same character, distinct from the others.

  3. Replace takeout characters of the block using that map.

  4. Output the map followed by the rewritten block.
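
The four steps above can be sketched in Python for the common case of a 256-value byte alphabet. This is a minimal illustration and not the project's implementation: the function name `encode`, its signature, and the way free bytes are picked (lowest unused target byte first) are all assumptions.

```python
def encode(data, takeouts):
    """Encode bytes so that no byte listed in `takeouts` appears in the output."""
    takeouts = bytes(takeouts)
    banned = set(takeouts)
    if len(banned) != len(takeouts):
        raise ValueError("takeouts must be distinct")
    target = [b for b in range(256) if b not in banned]
    block_size = len(target) - 1  # N - 1 input bytes per block
    if block_size < 1:
        raise ValueError("target alphabet needs at least two bytes")

    out = bytearray()
    # Step 1: split the input into blocks of at most N - 1 bytes.
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        used = set(block)
        # Step 2: target bytes absent from the block can stand in for takeouts.
        free = (b for b in target if b not in used)
        filler = next(free)  # shared by all takeouts absent from the block
        mapping = {t: (next(free) if t in used else filler) for t in takeouts}
        # Step 4 emits the map first, then step 3's rewritten block.
        out += bytes(mapping[t] for t in takeouts)
        out += bytes(mapping.get(b, b) for b in block)
    return bytes(out)


encoded = encode(b"\x00\x01hello\n", takeouts=b"\x00\n")
print(encoded)  # b'\x03\x04\x03\x01hello\x04'
```

Here 0x02 is the filler (unused, because both takeouts occur in the block), 0x03 stands for 0x00 and 0x04 stands for 0x0A.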

Decoding algorithm

  1. Read the takeouts map and the encoded block.

  2. Using the map, restore the takeouts in the block.

  3. Output the decoded block.
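
A matching decoder for a 256-value byte alphabet might look like the sketch below. The function name `decode` and its signature are assumptions, and the sample input was built by hand following the encoding steps, with 0x03 and 0x04 standing in for the takeouts 0x00 and 0x0A.

```python
def decode(encoded, takeouts):
    """Reverse the block-wise takeout substitution (illustrative sketch)."""
    takeouts = bytes(takeouts)
    map_size = len(takeouts)             # S - N
    block_size = 256 - map_size - 1      # N - 1
    if block_size < 1:
        raise ValueError("target alphabet needs at least two bytes")
    chunk_size = map_size + block_size   # S - 1 bytes per full encoded block

    out = bytearray()
    for start in range(0, len(encoded), chunk_size):
        chunk = encoded[start:start + chunk_size]
        if len(chunk) < map_size:
            raise ValueError("truncated takeouts map")
        # Step 1: split the chunk into the takeouts map and the encoded block.
        subs, body = chunk[:map_size], chunk[map_size:]
        # Step 2: a substitute found in the body stands for its takeout. The
        # shared filler never occurs in the body, so its entry is harmless.
        restore = dict(zip(subs, takeouts))
        # Step 3: output the decoded block.
        out += bytes(restore.get(b, b) for b in body)
    return bytes(out)


print(decode(b"\x03\x04\x03\x01hello\x04", takeouts=b"\x00\n"))  # b'\x00\x01hello\n'
```

Because every block carries its own map, a decoder that knows where a block starts needs no earlier state to decode it.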

The idea explained in greater detail

Escapeless, Restartable, Binary Encoding

Thanks, Ian!

escapeless's Issues

Couple of Questions

Question 1: Is the Python port written in Python 2? Running it produced errors that, when I searched for them, pointed me to people trying to run Python 2 code in Python 3.

Question 2: Is there any way we could get a bit of documentation on how to use this? I get a base of what it's looking for, but even with the testers, it's hard to see what's happening.

Provide examples of use

#2 suggests that the implementations and tests are probably not enough to figure out how to use the code.

Suggestion: special cases

Hi there,

I am using a variant of this encoding in a project that runs on an ancient sub-1 MHz 8-bit machine, with data over serial port and extremely low bandwidth. I say a variant because I keep all excluded symbols in the 0...n range and exploit this fact.

I needed to make some optimisations and one applies to the general format of this encoding scheme and thus this project:

When marking the symbols used in a block, I keep track of the maximum byte found. If max < (0xff - n), you can safely pick the last n possible bytes as substitutions and move on without checking if they're taken. Obviously this is content-dependent, but it does trigger.

The other optimisation will not work with your code, but in case anyone else might want to do what I'm doing:

Because all of my "take outs" are in 0...n, when encoding I also find the minimum byte in a block, and if min > n, then there is no need for substitution at all. In this case I set at least the first two bytes of the replacement list to the same value. The receiver then checks whether the first two replacements are the same byte, and if they are, it knows that no substitution is required.

This encoding scheme is a great find by the way, my project would have been impossible to implement if it wasn't for this. It's a case where I need to send IP packets via a serial link that does not support hardware flow control (so SLIP will not work), but it still requires explicit flow control that must be controlled by the application (XON/XOFF will not work either), plus extra control characters (block acknowledgment and separator / end of packet = forced end of block). With escapeless I could easily implement my own flow control and still send arbitrary data using escapeless 128/4.
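
The maximum-byte shortcut from the suggestion above could look roughly like the following Python sketch. The function name `fast_substitutes` is hypothetical, and the sketch assumes that none of the takeouts fall within the top n byte values (true for the 0...n layout the suggestion describes). It gives every takeout its own substitute instead of sharing one among unused takeouts. That still decodes correctly, since substitutes of unused takeouts never occur in the block, but it departs from step 2 of the encoding algorithm as written.

```python
def fast_substitutes(block, takeouts):
    """Return substitutes for `takeouts` via the max-byte shortcut, or None."""
    n = len(takeouts)
    top = range(256 - n, 256)  # the last n byte values
    # Assumption of this sketch: no takeout lies in the top range.
    if any(t in top for t in takeouts):
        return None
    # If every byte of the block is below 0xFF - n, none of the top n values
    # occurs in it, so they can be used without scanning for free bytes.
    if block and max(block) < 0xFF - n:
        return bytes(top)
    return None  # caller falls back to the general search


print(fast_substitutes(b"\x00hello\x03", b"\x00\x01\x02\x03"))  # b'\xfc\xfd\xfe\xff'
print(fast_substitutes(b"\x00hello\xfe", b"\x00\x01\x02\x03"))  # None
```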
