escapeless

Efficient binary encoding for large alphabets.

Features

Low fixed-size overhead.
Compression-friendly output.
Arbitrary alphabets.
Fast and simple algorithm.
Does not involve heavy-weight arithmetic.

Comparison chart

Encoding	Alphabet Size	Overhead
escapeless255	255	0.4%
escapeless254	254	0.8%
escapeless253	253	1.2%
yEnc	252	1.6%*, 0-100%
escapeless252	252	1.6%
escapeless251	251	2.0%
escapeless250	250	2.4%
B-News	224	2.5%
escapeless240	240	6.7%
escapeless230	230	11.4%
escapeless225	225	13.8%
Base122	122	14.3%
basE91	91	22%*, 14-23%
Base94	94	22.2%
Ascii85	85	25.0%
Z85	85	25.0%
Base64	64	33.3%
uuencode	64	33.3%
Base58	58	36.6%
Base36 / 64-bit	36	59.2%*, 0-62.5%
Base32	32	60.0%
Base36 / 32-bit	36	62.0%*, 0-75%
Base16	16	100.0%

(*) On uniform distribution of input octets.

Building and testing

$ git clone [email protected]:kosarev/escapeless.git
$ cd c
$ make
$ make test

Basic idea

Given a source alphabet of size S and a target alphabet of size N < S, break the sequence of input characters into blocks so that the number of characters in each block does not exceed N − 1.

Since a block can contain at most N − 1 different characters and the target alphabet contains N characters, it is known that all those used characters can be mapped to the target alphabet and at least one extra character of the target alphabet will remain unmapped. For example:

 A B C D E F G H I J K L    12  Characters of the source alphabet (S)
 A   C D E     H I   K L     8  Characters of the target alphabet (N)
   x       x x     x         4  Characters missing in the target alphabet (takeouts)
   | | | |     | | |         7  Characters used in the block
 .         . .       . .     5  Characters not used in the block

Here, one possible mapping is:

 B −> A
 J −> K

with L left unmapped and all other characters of the target alphabet mapped to themselves.

What that unmapped character is for, is to make it possible to map unused takeouts, like F and G in the example, to a character of the target alphabet that does not represent any characters of the source alphabet for that block. Taking that into account, here's how a complete mapping would look:

 B −> A
 F -> L
 G -> L
 J −> K

Once the mapping is determined, we can output the encoded block with takeout characters in it replaced with members of the target alphabet. To let a decoder know the mapping, we also have to prepend each of the encoded blocks with a series of characters the takeouts are mapped to and assume that the decoder will be given the same set of takeout characters specified in the same order.

Overhead formula

For a source alphabet of size S, a target alphabet of size N and a block of N − 1 characters, the size of the encoded block is:

 encoded_block_size = takeouts_map_size + block_size =
                      (S − N) + (N - 1) =
                      S - 1

The overhead is thus:

 overhead = (encoded_block_size - block_size) / block_size =
            ((S - 1) - (N - 1)) / (N - 1) =
            (S - 1 - N + 1) / (N - 1) =
            (S - N) / (N - 1)

Encoding algorithm

Break the input message into blocks so that no block contains more than N - 1 characters, where N is the size of the target alphabet. Process every block separately as specified below.
Map every takeout character to a character of the target alphabet that is not used in the block and is not a takeout character. All takeouts not used in the block shall map to the same character.
Replace takeout characters of the block using that map.
Output the map followed by the rewritten block.

Decoding algorithm

Read the takeouts map and the encoded block.
Using the map, restore the takeouts in the block.
Output decoded block.

The idea explained in greater detail

Escapeless, Restartable, Binary Encoding

Thanks, Ian!

Suggestion: special cases

Hi there,

I am using a variant of this encoding in a project that runs on an ancient sub-1 MHz 8-bit machine, with data over serial port and extremely low bandwidth. I say a variant because I keep all excluded symbols in the 0...n range and exploit this fact.

I needed to make some optimisations and one applies to the general format of this encoding scheme and thus this project:

When marking the symbols used in a block, I keep track of the maximum byte found. If max < (0xff - n), you can safely pick the last n possible bytes as substitutions and move on without checking if they're taken. Obviously this is content-dependent, but it does trigger.

The other optimisation will not work with your code, but in case anyone else might want to do what I'm doing:

Because all of my "take outs" are in 0...n, when encoding I also find the minimum byte in a block, and if min > n, then there is no need for substitution at all - in this case I set at least the two first bytes of the replacement list to the same value. The receiver then checks if first two replacements are the same byte, and if they are, it knows that there is no substitution required.

This encoding scheme is a great find by the way, my project would have been impossible to implement if it wasn't for this. It's a case where I need to send IP packets via a serial link that does not support hardware flow control (so SLIP will not work), but it still requires explicit flow control that must be controlled by the application (XON/XOFF will not work either), plus extra control characters (block acknowledgment and separator / end of packet = forced end of block). With escapeless I could easily implement my own flow control and still send arbitrary data using escapeless 128/4.

kosarev / escapeless Goto Github PK

escapeless's Introduction

escapeless

Features

Comparison chart

Building and testing

Basic idea

Overhead formula

Encoding algorithm

Decoding algorithm

The idea explained in greater detail

escapeless's People

Contributors

Stargazers

Watchers

Forkers

escapeless's Issues

Recommend Projects

Recommend Topics

Recommend Org

Jobs