qntm / base32768

Binary-to-text encoding highly optimised for UTF-16

License: MIT License

JavaScript 100.00%

base32768's Issues

Add extra information in the ReadMe

This repo is very similar to https://github.com/rinick/base2e15 and https://github.com/grandchild/base32k, except that base2e15 uses CJK Unified Ideographs rather than non-CJK characters. It would be good to have a pros-and-cons section comparing them.

If this supports UTF-16, then why can't I use Uint16Array?

I'm not sure I understand: why can I only use Uint8Array and not Uint16Array?
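
A possible workaround, sketched under assumptions: the "UTF-16" in the project description appears to describe how the encoded text is stored, not the type of the input, which is plain bytes. A Uint16Array can still be encoded by viewing its underlying buffer as bytes. The helper name `bytesToUint16Array` is hypothetical and not part of base32768, and the byte order of the view depends on the platform.

```javascript
// View the memory of a Uint16Array as bytes, without copying.
const samples = new Uint16Array([0x1234, 0xabcd]);
const bytes = new Uint8Array(samples.buffer, samples.byteOffset, samples.byteLength);
// `bytes` can now be handed to any encoder that accepts a Uint8Array.

// Hypothetical helper: turn decoded bytes back into 16-bit values.
// slice() copies into a fresh buffer, so alignment and offset are not a concern.
function bytesToUint16Array(decoded) {
  if (decoded.byteLength % 2 !== 0) {
    throw new RangeError("byte length must be even to form 16-bit values");
  }
  return new Uint16Array(decoded.slice().buffer);
}

const restored = bytesToUint16Array(bytes);
console.log(restored[0] === samples[0], restored[1] === samples[1]);
```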

base.decode(...).buffer returns a buffer that is 1 byte too long

The title says it all.

Example:

base.decode("h").buffer.length == 1
base.decode("hi").buffer.length == 3
base.decode("hii").buffer.length == 3
base.decode("hiii").buffer.length == 5
base.decode("hiiii").buffer.length == 5

Consideration when string will be stored in text files

First of all, thanks a lot for creating base32768.

I'm experimenting with converting binary data into text, then store as text files in the filesystem.

The reason I'm doing this may not interest you, so I've folded it.

I need to ship an offline .html file to my users, who can simply double-click to open it, and it'll be a fancy SPA that runs on file:///. It can't read anything from the file system except CSS, JS, and images via src="" attributes. So I use const database = "a_long_long_string" inside a .js file to provide data to the app.

Since text files are UTF-8, this package seems to lose its edge. (In my tests, when I say "here is a big string, please write it to a file", both the browser and Node.js write UTF-8, so I guess the encoding isn't up to me.)

So I tested different ways to convert binary data to string, for instance, base32768, base64, and TextDecoder.

Base32768 works as intended, of course.

Base64 seems to do really well: it's faster than base32768 and produces smaller text files. I guess that's because each base64 character is 1 byte in UTF-8, correct? Your README says base64 is 75% efficient in UTF-8 while base32768 is 63%, which is consistent with my output file sizes.
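
The percentages quoted above can be reproduced with simple arithmetic. This is a rough sketch, assuming each base32768 character carries 15 bits and occupies 3 bytes in UTF-8 (some characters in the lower code point range take only 2 bytes, so this is an approximation). The function name `efficiency` is made up for illustration.

```javascript
// Efficiency = payload bits carried per character / bits used to store that character.
function efficiency(payloadBits, bytesPerChar) {
  return payloadBits / (8 * bytesPerChar);
}

const base64Utf8 = efficiency(6, 1);      // 6 bits in a 1-byte character
const base32768Utf8 = efficiency(15, 3);  // 15 bits in an (assumed) 3-byte character
const base64Utf16 = efficiency(6, 2);     // 6 bits in a 2-byte code unit
const base32768Utf16 = efficiency(15, 2); // 15 bits in a 2-byte code unit

console.log({ base64Utf8, base32768Utf8, base64Utf16, base32768Utf16 });
```

Under these assumptions base64 comes out ahead in UTF-8 and base32768 comes out ahead in UTF-16.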

TextDecoder is about 20x faster than the two above, but the conversion can't be reversed if the input Uint8Array contains bytes in the range 128-255, and I don't understand why:

TextDecoder test code:

function convertToStringThenBack(input: Uint8Array) {
  const string = new TextDecoder().decode(input.buffer);
  const back = new TextEncoder().encode(string);
  const isEqual =
    back.length === input.length &&
    back.every((value, index) => value === input[index]);
  if (isEqual) console.log("good");
  else console.log("bad");
}

convertToStringThenBack(new Uint8Array([1, 10, 100, 127])); // good
convertToStringThenBack(new Uint8Array([1, 10, 100, 128])); // bad

Performance is a big deal because my data is very large. I'd accept an extra step after TextDecoder to sanitize the string, but I simply don't know how to make it work.

Regardless, do you have any comments on what I wrote above? Thanks very much!
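
A short sketch of why the TextDecoder test above fails for byte 128: TextDecoder defaults to UTF-8, a lone byte 128 is not valid UTF-8, and so it is replaced with U+FFFD, which re-encodes as three different bytes. The second half shows one reversible alternative that maps each byte to one character code. The helper names are hypothetical, and the resulting string would still need escaping before being pasted into a .js file.

```javascript
const input = new Uint8Array([1, 10, 100, 128]);

// Lossy: the invalid byte becomes U+FFFD, which encodes as 239, 191, 189.
const text = new TextDecoder().decode(input);
const back = new TextEncoder().encode(text);
console.log(Array.from(back));

// Reversible (hypothetical helpers): one character code per byte.
function bytesToBinaryString(bytes) {
  let result = "";
  for (const byte of bytes) result += String.fromCharCode(byte);
  return result;
}

function binaryStringToBytes(str) {
  return Uint8Array.from(str, (ch) => ch.charCodeAt(0));
}

const roundTrip = binaryStringToBytes(bytesToBinaryString(input));
console.log(Array.from(roundTrip));
```

Note that characters 128-255 each take 2 bytes once the string is written out as UTF-8, so this does not beat base64 on file size.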

How to differentiate between `0011(111)` being decoded at the end of a stream and `00(11111)`?

I'm writing a port that works on bit streams instead of bytes and I have a problem when I'm not using 15*n +? 7 bits; decoding the string leads to an ambiguity. How can I solve this?
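
The ambiguity from the title can be reproduced in a few lines. This sketch assumes, as the notation in the title suggests, that the final chunk is padded with 1 bits up to 7 bits. `padWithOnes` and `stripPadding` are hypothetical helpers, not part of base32768, and carrying the bit length out of band is only one possible workaround.

```javascript
// Pad a string of "0"/"1" characters with 1s up to the given width.
function padWithOnes(bits, width) {
  return bits.padEnd(width, "1");
}

const fromFourBits = padWithOnes("0011", 7);
const fromTwoBits = padWithOnes("00", 7);
// Both inputs produce the same padded chunk, so a decoder cannot tell them apart.
console.log(fromFourBits === fromTwoBits);

// Assumed workaround: transmit the original bit length separately
// and cut the padding off after decoding.
function stripPadding(paddedBits, bitLength) {
  return paddedBits.slice(0, bitLength);
}

console.log(stripPadding(fromFourBits, 4), stripPadding(fromTwoBits, 2));
```

With whole bytes the problem does not arise, because fewer than 8 leftover bits can only be padding.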

Typing plans?

Hello. I am writing a project where I use base2048, base32768, and base65536 (all three; I give the user the choice of which they prefer). I am finding them very useful.

base2048 and base65536 have TypeScript typings, but base32768 does not. I get an error if I include base32768 in TypeScript without giving it an any-type exemption.

Are there plans to add types to base32768 like the other two have?
If (no promises) I contributed base32768 typings by copying what the other two have, would this be a welcome PR?

Thanks.

Time Benchmark

Requesting comparison between this and http://blog.kevinalbs.com/base122 and http://base91.sourceforge.net/

Language-agnostic test case files

These would make it easier to port base32768 to other programming languages.

How To Use In Browsers Section in FAQ

It would be nice to have a FAQ section. The question on my mind is: how do I use this in a browser, like Base64? In a single file, we can have
<img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4//8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU5ErkJggg==" alt="Red dot" />
and a red dot is displayed (example from Wikipedia). How can this be done with base32768?
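
One possible approach, sketched under assumptions: data: URLs only understand base64 or percent-encoded payloads, so a base32768 string cannot be placed in src="" directly. The decoded bytes can instead be wrapped in a Blob and given an object URL at runtime. `bytesToImageBlob` is a hypothetical helper, and the commented browser lines assume `decode` is the library's decode function returning a Uint8Array.

```javascript
// Wrap raw image bytes in a Blob with the right MIME type.
function bytesToImageBlob(bytes, mimeType) {
  return new Blob([bytes], { type: mimeType });
}

// In a browser, assuming `decode` comes from base32768 and `encodedString`
// holds the encoded PNG:
//   const blob = bytesToImageBlob(decode(encodedString), "image/png");
//   document.querySelector("img").src = URL.createObjectURL(blob);

// Stand-in bytes (the first four bytes of the PNG signature) for demonstration.
const demoBlob = bytesToImageBlob(new Uint8Array([0x89, 0x50, 0x4e, 0x47]), "image/png");
console.log(demoBlob.size, demoBlob.type);
```

Unlike a data: URL, this needs a script to run before the image appears.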

Case folding in base32768

We've been using your base32768 encoding in rclone via @Max-Sum's Go port https://github.com/Max-Sum/base32768 as a way to encode encrypted file names onto cloud storage systems. This seems particularly effective on OneDrive, which seems to use UTF-16 internally.

However we noticed in this issue rclone/rclone#6803 that there are some characters which can be case folded in the set of 32768 characters.

I wrote a little Go program to demonstrate this here: https://go.dev/play/p/SK5G4dnHM6T

This prints stuff like

Duplicate case folded rune ƃ into Ƃ
Duplicate case folded rune ƅ into Ƅ

and comes up with the summary

Found 521 case folding and 199 duplicate case folding characters out of 32896

Which means that there are 521 characters which have a case folded variant, but more importantly there are 199 characters which have both the upper case and lower case variants in the 32768 characters.
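
The duplicate check described above can be approximated in JavaScript. This is a sketch, not a port of the linked Go program: it uses `String.prototype.toUpperCase` as a stand-in for Unicode case folding, which may give different counts, and `findCaseDuplicates` is a made-up name. The sample alphabet is just the characters from the output above plus one unrelated letter.

```javascript
// Return [lower, upper] pairs where both variants appear in the alphabet.
function findCaseDuplicates(alphabet) {
  const present = new Set(alphabet);
  const pairs = [];
  for (const ch of present) {
    const upper = ch.toUpperCase();
    if (upper !== ch && present.has(upper)) {
      pairs.push([ch, upper]);
    }
  }
  return pairs;
}

const sample = ["\u0183", "\u0182", "\u0185", "\u0184", "a"]; // ƃ Ƃ ƅ Ƅ a
const duplicates = findCaseDuplicates(sample);
console.log(duplicates.length);
```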

This is important because rclone generates file names with these characters and OneDrive is case insensitive. So there is a small chance that two different encrypted file names map to two strings which are the same when compared case insensitively.

Now, I think the probability of this is quite small. The minimum length of a file name is 16 bytes (un-encoded) so 128 bits which makes 9 base32768 characters.

So there is about a 0.006 chance that any given character can be case folded. What we'd like to know is how many of these filenames we would have to put in a directory in order to have a 50% chance of a case-folded collision. I've had a few goes at working this out and I'm coming out with an answer of the order of 10²¹. I'm not sure I trust my maths here, but it's a big number, that I'm sure of.

We've been thinking about making a variant of base32768 which does not include both the upper and lower case versions of any characters which can be case folded.

Any thoughts?
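
The collision question in this issue can be explored with the standard birthday approximation. This is only a rough sketch, and every modelling choice is an assumption rather than something taken from rclone or base32768: names are treated as 9 independent, uniformly random characters from an alphabet of 32768, of which 2 × 199 characters have a case partner in the set.

```javascript
// Birthday approximation: how many random items are needed for a given chance
// of at least one colliding pair, when any single pair collides with
// probability pairProbability.
function itemsForCollisionChance(pairProbability, targetChance = 0.5) {
  return Math.sqrt((2 * Math.log(1 / (1 - targetChance))) / pairProbability);
}

// Assumed model parameters (see the lead-in).
const alphabetSize = 32768;
const charsWithPartner = 2 * 199;
const nameLength = 9;

// Chance that two random characters match case-insensitively, or exactly.
const perCharFoldEqual = (alphabetSize + charsWithPartner) / alphabetSize ** 2;
const perCharExactEqual = 1 / alphabetSize;

// Two names collide if they match under folding without being identical.
const pairProbability =
  perCharFoldEqual ** nameLength - perCharExactEqual ** nameLength;

console.log(itemsForCollisionChance(pairProbability).toExponential(1));
```

The printed estimate can be compared with the figure quoted in the issue. A different model of the final, partially filled character would shift the result.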

SyntaxError: Unexpected token ...

When attempting to run the example code from the README, I receive the following error:

$ node test.js 
/home/aceat64/node_modules/base32768/index.js:90
        var result = bits_to_bits([...buf.values()], MAGIC_NUMBER_B, MAGIC_NUMBER_A);
                                   ^^^

SyntaxError: Unexpected token ...
    at exports.runInThisContext (vm.js:53:16)
    at Module._compile (module.js:373:25)
    at Object.Module._extensions..js (module.js:416:10)
    at Module.load (module.js:343:32)
    at Function.Module._load (module.js:300:12)
    at Module.require (module.js:353:17)
    at require (internal/module.js:12:17)
    at Object.<anonymous> (/home/aceat64/test.js:1:79)
    at Module._compile (module.js:409:26)
    at Object.Module._extensions..js (module.js:416:10)
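
A likely explanation, offered as an assumption: the failing line uses spread syntax (`[...buf.values()]`), which older Node.js releases do not parse, and the `module.js` frames in the trace suggest such a release. Upgrading Node.js is the straightforward fix. For illustration only, the same array can be built without spread syntax; this is not a patch to the library.

```javascript
// Stand-in for the buffer being encoded.
const buf = new Uint8Array([104, 105, 33]);

// Equivalent of [...buf.values()] without spread syntax.
const asArray = Array.from(buf.values());

console.log(asArray);
```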
