qntm / base32768
Binary-to-text encoding highly optimised for UTF-16
License: MIT License
This repo is very similar to https://github.com/rinick/base2e15 and https://github.com/grandchild/base32k except that 2e15 uses Unified CJK characters instead of other non-CJK characters. It would be good to have a pros-cons section.
I'm not sure I understand, but can I only use Uint8Array, not Uint16Array?
Title says it all:
base.decode(...).buffer returns a buffer that is 1 byte too long.
Example:
base.decode("h").buffer.length == 1
base.decode("hi").buffer.length == 3
base.decode("hii").buffer.length == 3
base.decode("hiii").buffer.length == 5
base.decode("hiiii").buffer.length == 5
First of all, thanks a lot for creating base32768.
I'm experimenting with converting binary data into text and then storing it as text files in the filesystem.
I need to ship an offline .html file to my users, who can simply double-click to open it, and it'll be a fancy SPA that runs on file:///. It can't read anything from the file system except CSS, JS and images via src="" tags. So I use const database = "a_long_long_string" inside a .js file to provide data to the app.
Since text files are UTF-8, this package seems to lose its edge. (As far as my tests show, when my instruction is "hey, I have this big string, please write it to a file", both the browser and Node.js write UTF-8, so I guess it's not up to me to change.)
So I tested different ways to convert binary data to string, for instance, base32768, base64, and TextDecoder.
Base32768 works as intended, of course.
Base64 seems to do really well: it's faster than base32768 and produces smaller text files. I guess that's because each base64 character is 1 byte in UTF-8, correct? Your README says it's 75% efficient while base32768 is 63%, which is consistent with my output file size.
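The 75% and 63% figures quoted above can be reproduced with simple arithmetic. The sketch below is illustrative only: the helper name `efficiency` is hypothetical, and it assumes every base32768 character occupies 3 bytes in UTF-8, which is an approximation.

```typescript
// Efficiency = payload bits carried per character / bits the character
// occupies once the text is stored in a given encoding.
// `efficiency` is a hypothetical helper, not part of any library.
function efficiency(payloadBitsPerChar: number, storedBytesPerChar: number): number {
  return payloadBitsPerChar / (storedBytesPerChar * 8);
}

// Base64: 6 payload bits per character, 1 byte per character in UTF-8.
const base64Utf8 = efficiency(6, 1);

// Base32768: 15 payload bits per character.
// ASSUMPTION: 3 bytes per character in UTF-8 (an approximation).
const base32768Utf8 = efficiency(15, 3);

// In UTF-16 both alphabets take 2 bytes per character.
const base64Utf16 = efficiency(6, 2);
const base32768Utf16 = efficiency(15, 2);

console.log({ base64Utf8, base32768Utf8, base64Utf16, base32768Utf16 });
```

Under these assumptions base64 wins when the text is stored as UTF-8 and base32768 wins when it is stored as UTF-16, which matches the file sizes described above.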
TextDecoder is 20x more performant than the two above, but the conversion can't be reversed if the input Uint8Array contains bytes in the range 128-255, and I don't understand why:
function convertToStringThenBack(input: Uint8Array) {
  const string = new TextDecoder().decode(input.buffer);
  const back = new TextEncoder().encode(string);
  const isEqual =
    back.length === input.length &&
    back.every((value, index) => value === input[index]);
  if (isEqual) console.log("good");
  else console.log("bad");
}
convertToStringThenBack(new Uint8Array([1, 10, 100, 127])); // good
convertToStringThenBack(new Uint8Array([1, 10, 100, 128])); // bad
Performance is a big deal because my data is very large, even if it means I'll have to add an extra step after TextDecoder to sanitize the string; however, I simply don't know how to make it work.
Regardless, do you have any comments to whatever I wrote above? Thanks very much!
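On the TextDecoder question above, a likely explanation is that TextDecoder defaults to UTF-8, and a lone byte in the range 128-255 is not a valid UTF-8 sequence. In its default (non-fatal) mode the decoder substitutes U+FFFD, the replacement character, and TextEncoder then writes that character back as three bytes. The sketch below only demonstrates this behaviour with standard APIs; the variable names are illustrative.

```typescript
// A lone 0x80 byte is a UTF-8 continuation byte with no lead byte,
// so it cannot be decoded as text.
const input = new Uint8Array([1, 10, 100, 128]);

// Default TextDecoder: UTF-8, non-fatal. Invalid bytes become U+FFFD.
const text = new TextDecoder().decode(input);

// TextEncoder always emits UTF-8; U+FFFD becomes the bytes EF BF BD.
const back = new TextEncoder().encode(text);

console.log(text.charCodeAt(3).toString(16)); // code unit where byte 128 used to be
console.log(Array.from(back));                // longer than the 4-byte input
```

Because many different invalid byte sequences all collapse to the same replacement character, the original bytes cannot be recovered from the string, which is why a dedicated binary-to-text encoding is needed for arbitrary data.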
I'm writing a port that works on bit streams instead of bytes, and I have a problem when I'm not using 15*n +? 7 bits; decoding the string leads to an ambiguity. How can I solve this?
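The ambiguity described above can be illustrated with a small sketch. It rests on an assumed chunking rule that the thread does not confirm: the input is cut into 15-bit chunks, 1 to 7 leftover bits are padded into a final 7-bit character, and 8 to 14 leftover bits are padded into a final 15-bit character. All names here (`EncodedShape`, `encodedShape`, `bitLengthsWithSameShape`) are hypothetical.

```typescript
// What a decoder can observe: how many 15-bit characters there are and
// whether a trailing 7-bit character is present.
interface EncodedShape {
  chars15: number;
  has7: boolean;
}

// ASSUMED chunking rule (see lead-in); not taken from the library source.
function encodedShape(bitLength: number): EncodedShape {
  const full = Math.floor(bitLength / 15);
  const leftover = bitLength % 15;
  if (leftover === 0) return { chars15: full, has7: false };
  if (leftover <= 7) return { chars15: full, has7: true };
  return { chars15: full + 1, has7: false }; // padded up to a full character
}

// All nearby bit lengths that produce the same observable shape.
function bitLengthsWithSameShape(bitLength: number): number[] {
  const target = encodedShape(bitLength);
  const matches: number[] = [];
  const start = Math.max(1, bitLength - 15);
  for (let candidate = start; candidate <= bitLength + 15; candidate++) {
    const shape = encodedShape(candidate);
    if (shape.chars15 === target.chars15 && shape.has7 === target.has7) {
      matches.push(candidate);
    }
  }
  return matches;
}

const candidates = bitLengthsWithSameShape(16);
const byteAligned = candidates.filter((n) => n % 8 === 0);
console.log(candidates, byteAligned);
```

Under this assumption, several bit lengths share one shape but only one of them is a multiple of 8, so a byte-oriented decoder can discard the padding unambiguously while a bit-oriented one cannot. One possible way out for a bit-stream port is to carry the exact bit length alongside the encoded text.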
Hello. I am writing a project where I use base2048, base32768, and base65536 (all three; I give the user the choice of which they prefer). I am finding them very useful.
base2048 and base65536 have TypeScript typings, but base32768 does not. I get an error if I include base32768 in TypeScript without giving it an any-type exemption.
Are there plans to add types to base32768 like the other two have?
If (no promises) I contributed base32768 typings by copying what the other two have, would this be a welcome PR?
Thanks.
Requesting comparison between this and http://blog.kevinalbs.com/base122 and http://base91.sourceforge.net/
These would make it easier to port base32768 to other programming languages.
It would be nice to have a FAQ section. The question on my mind is: how do I use this in a browser, like Base64? In a single file, we can have
<img src="data:image/png;base64,iVBORw0KGgoAAA ANSUhEUgAAAAUAAAAFCAYAAACNbyblAAAAHElEQVQI12P4 //8/w38GIAXDIBKE0DHxgljNBAAO9TXL0Y4OHwAAAABJRU 5ErkJggg==" alt="Red dot" />
and a red dot is displayed (example from Wikipedia). How to do this in base32768?
We've been using your base32768 encoding in rclone via @Max-Sum's Go port https://github.com/Max-Sum/base32768 as a way to encode encrypted file names onto cloud storage systems. This seems particularly effective on OneDrive, which seems to use UTF-16 internally.
However we noticed in this issue rclone/rclone#6803 that there are some characters which can be case folded in the set of 32768 characters.
I wrote a little Go program to demonstrate this here: https://go.dev/play/p/SK5G4dnHM6T
This prints stuff like
Duplicate case folded rune ƃ into Ƃ
Duplicate case folded rune ƅ into Ƅ
and comes up with the summary
Found 521 case folding and 199 duplicate case folding characters out of 32896
Which means that there are 521 characters which have a case folded variant, but more importantly there are 199 characters which have both the upper case and lower case variants in the 32768 characters.
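The kind of check described above can be sketched in a few lines. This is not the Go program linked earlier; it is a minimal approximation that uses JavaScript's `toUpperCase` as a stand-in for Unicode case folding, and the function name `findCaseFoldDuplicates` and the five-character sample alphabet are hypothetical.

```typescript
// Returns [lower, upper] pairs where both variants occur in the alphabet.
// ASSUMPTION: toUpperCase() approximates case folding well enough for a demo.
function findCaseFoldDuplicates(alphabet: string): Array<[string, string]> {
  const chars = new Set<string>(Array.from(alphabet));
  const pairs: Array<[string, string]> = [];
  chars.forEach((ch) => {
    const upper = ch.toUpperCase();
    // A duplicate exists when the upper-case form is a different
    // character that is also part of the alphabet.
    if (upper !== ch && chars.has(upper)) {
      pairs.push([ch, upper]);
    }
  });
  return pairs;
}

// Tiny sample containing the two pairs quoted in the output above.
const samplePairs = findCaseFoldDuplicates("ƂƃƄƅҠ");
console.log(samplePairs);
```

Running the same function over the full repertoire would give a count comparable to the summary line, though the exact number may differ because `toUpperCase` and full case folding do not agree on every character.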
This is important because rclone generates file names with these characters and OneDrive is case insensitive. So there is a small chance that two different encrypted file names map to two strings which are the same when compared case insensitively.
Now, I think the probability of this is quite small. The minimum length of a file name is 16 bytes (un-encoded) so 128 bits which makes 9 base32768 characters.
So there is about a 0.006 chance any given character can be case folded. What we'd like to know is how many of these filenames we would have to put in a directory in order to have a 50% chance of a case-folded collision. I've had a few goes at working this out and I'm coming out with an answer of the order of 10²¹. I'm not sure I trust my maths here, but it's a big number, that I'm sure of.
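The estimate above can be cross-checked with a birthday-bound calculation. The sketch below is a rough model, not a definitive answer, and it makes three labelled assumptions: the 199 figure is read as 199 upper/lower pairs, every one of the 9 characters is treated as uniform and independent over all 32768 values (the padded final character is not), and the usual approximation n ≈ sqrt(2 ln 2 / q) is used for the 50% point.

```typescript
const ALPHABET_SIZE = 32768;
const FOLD_PAIRS = 199; // ASSUMPTION: 199 pairs, i.e. 398 affected characters
const NAME_LENGTH = 9;  // 128 bits -> 9 characters, as stated above

// Probability that two independent uniform characters match when case is
// ignored: each character matches itself, and each paired character also
// matches its partner.
const perCharFoldEqual =
  (ALPHABET_SIZE + 2 * FOLD_PAIRS) / (ALPHABET_SIZE * ALPHABET_SIZE);
const perCharExactEqual = 1 / ALPHABET_SIZE;

// Two names collide if they match ignoring case but are not identical.
const pairCollision =
  Math.pow(perCharFoldEqual, NAME_LENGTH) -
  Math.pow(perCharExactEqual, NAME_LENGTH);

// Birthday approximation for a 50% chance of at least one collision.
const namesFor50Percent = Math.sqrt((2 * Math.LN2) / pairCollision);

console.log(pairCollision, namesFor50Percent);
```

Under these assumptions the result lands in the same order of magnitude as the figure quoted above, so the conclusion that a collision is extremely unlikely appears to hold.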
We've been thinking about making a variant of base32768 which does not include both the upper and lower case versions of any characters which can be case folded.
Any thoughts?
When attempting to run the example code from the README, I receive the following error:
$ node test.js
/home/aceat64/node_modules/base32768/index.js:90
var result = bits_to_bits([...buf.values()], MAGIC_NUMBER_B, MAGIC_NUMBER_A);
^^^
SyntaxError: Unexpected token ...
at exports.runInThisContext (vm.js:53:16)
at Module._compile (module.js:373:25)
at Object.Module._extensions..js (module.js:416:10)
at Module.load (module.js:343:32)
at Function.Module._load (module.js:300:12)
at Module.require (module.js:353:17)
at require (internal/module.js:12:17)
at Object.<anonymous> (/home/aceat64/test.js:1:79)
at Module._compile (module.js:409:26)
at Object.Module._extensions..js (module.js:416:10)