base2048

Base2048 is a binary encoding optimised for transmitting data through Twitter. This JavaScript module, base2048, is the first implementation of this encoding. Using Base2048, up to 385 octets can fit in a single Tweet. Compare with Base65536, which manages only 280 octets.

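| Constraint | Encoding | Efficiency (UTF‑8) | Efficiency (UTF‑16) | Efficiency (UTF‑32) | Bytes per Tweet * |
| --- | --- | --- | --- | --- | --- |
| ASCII‑constrained | Unary / Base1 | 0% | 0% | 0% | 1 |
| ASCII‑constrained | Binary | 13% | 6% | 3% | 35 |
| ASCII‑constrained | Hexadecimal | 50% | 25% | 13% | 140 |
| ASCII‑constrained | Base64 | 75% | 38% | 19% | 210 |
| ASCII‑constrained | Base85 † | 80% | 40% | 20% | 224 |
| BMP‑constrained | HexagramEncode | 25% | 38% | 19% | 105 |
| BMP‑constrained | BrailleEncode | 33% | 50% | 25% | 140 |
| BMP‑constrained | Base2048 | 56% | 69% | 34% | 385 |
| BMP‑constrained | Base32768 | 63% | 94% | 47% | 263 |
| Full Unicode | Ecoji | 31% | 31% | 31% | 175 |
| Full Unicode | Base65536 | 56% | 64% | 50% | 280 |
| Full Unicode | Base131072 ‡ | 53%+ | 53%+ | 53% | 297 |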

* A Tweet can be up to 280 Unicode characters, give or take Twitter's complex "weighting" calculation.
† Base85 is listed for completeness, but all variants use characters which are considered hazardous for general use in text: escape characters, brackets, punctuation, etc.
‡ Base131072 is a work in progress, not yet ready for general use.

Installation

npm install base2048

Usage

import { encode, decode } from 'base2048'

const uint8Array = new Uint8Array([1, 2, 4, 8, 16, 32, 64, 128])
const str = encode(uint8Array)
console.log(str) // 'GƸOʜeҩ'

const uint8Array2 = decode(str)
console.log(uint8Array2)
// [1, 2, 4, 8, 16, 32, 64, 128]

API

base2048 accepts and returns Uint8Arrays. Note that every Node.js Buffer is a Uint8Array. A Uint8Array can be converted to a Node.js Buffer like so:

const buffer = Buffer.from(uint8Array.buffer, uint8Array.byteOffset, uint8Array.byteLength)
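The byteOffset and byteLength arguments matter when the Uint8Array is a view onto only part of a larger ArrayBuffer. The following is a minimal Node.js sketch of that case; the variable names are illustrative and not part of base2048. It shows that the resulting Buffer covers only the view's bytes and shares memory with the view rather than copying it:

```javascript
// Node.js only: Buffer is a global there.
const backing = new Uint8Array([1, 2, 4, 8, 16, 32, 64, 128])
const view = backing.subarray(2, 5) // a 3-byte view: 4, 8, 16

// With offset and length, the Buffer matches the view exactly.
const viewBuffer = Buffer.from(view.buffer, view.byteOffset, view.byteLength)

// Without them, the Buffer spans the whole underlying ArrayBuffer.
const wholeBuffer = Buffer.from(view.buffer)

console.log(viewBuffer.length) // 3
console.log(wholeBuffer.length) // 8

// No copy is made: writing through the Buffer changes the original view.
viewBuffer[0] = 255
console.log(view[0]) // 255
```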

encode(uint8Array)

Encodes a Uint8Array and returns a Base2048 String suitable for passing through Twitter. Give or take some padding characters, the output string has 1 character per 11 bits of input.
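As a rough illustration of the "1 character per 11 bits" rule, the output length can be estimated from the input length alone. The helper below is hypothetical and not part of the base2048 API. It assumes that any leftover bits at the end of the input always occupy exactly one extra character, which is consistent with the 8-byte example in Usage producing 6 characters:

```javascript
// Hypothetical helper, not exported by base2048.
// Assumption: a final partial group of bits costs exactly one character.
function estimateEncodedLength (byteLength) {
  const bits = byteLength * 8
  return Math.ceil(bits / 11)
}

console.log(estimateEncodedLength(8)) // 6
console.log(estimateEncodedLength(385)) // 280
```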

decode(string)

Decodes a Base2048 String and returns a Uint8Array containing the original binary data.

Rationale

Originally, Twitter allowed Tweets to be at most 140 characters. Discounting URLs, which have their own complex rules, Tweet length was computed as the number of Unicode code points in the Tweet — not the number of octets in any particular encoding of that Unicode string. In 2015, observing that most existing text-based encodings made negligible use of most of the Unicode code point space (e.g. Base64 encodes only 6 bits per character = 105 octets per Tweet), I developed Base65536, which encodes 16 bits per character = 280 octets per Tweet.

On 26 September 2017, Twitter announced that

we're going to try out a longer limit, 280 characters, in languages impacted by cramming (which is all except Japanese, Chinese, and Korean).

This statement is fairly light on usable details and/or factual accuracy. However, following some experimentation and examination of the new web client code, we now understand that maximum Tweet length is indeed 280 Unicode code points, except that code points U+1100 HANGUL CHOSEONG KIYEOK upwards now count double.

Effectively, Twitter divides Unicode into 4,352 "light" code points (U+0000 to U+10FF inclusive) and 1,109,760 "heavy" code points (U+1100 to U+10FFFF inclusive).
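The light/heavy split can be sketched as a small weighting function. This is a simplified model based only on the description above, not Twitter's actual code: it ignores URLs and any other special cases, and the names are made up for illustration.

```javascript
// Simplified model: code points below U+1100 weigh 1, all others weigh 2.
const FIRST_HEAVY_CODE_POINT = 0x1100

function tweetWeight (text) {
  let weight = 0
  // for...of iterates a string by code point, not by UTF-16 code unit.
  for (const ch of text) {
    weight += ch.codePointAt(0) < FIRST_HEAVY_CODE_POINT ? 1 : 2
  }
  return weight
}

console.log(tweetWeight('GƸOʜeҩ')) // 6 (all light)
console.log(tweetWeight('\u{1100}')) // 2 (first heavy code point)

// The counts quoted above follow from the boundary:
console.log(FIRST_HEAVY_CODE_POINT) // 4352 light code points
console.log(0x110000 - FIRST_HEAVY_CODE_POINT) // 1109760 heavy code points
```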

Base65536 solely uses heavy characters, which means that a new "long" Tweet can still only contain at most 140 characters of Base65536, encoding 280 octets. This seemed like an imperfect state of affairs to me, and so here we are.

Base2048 solely uses light characters, which means a new "long" Tweet can contain at most 280 characters of Base2048. Base2048 is an 11-bit encoding, so those 280 characters encode 3080 bits i.e. 385 octets of data, significantly better than Base65536.
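The arithmetic behind these figures can be checked with a short calculation. The function below is an illustrative sketch with hypothetical names. It assumes a weight limit of 280 and a uniform weight per character, and it ignores padding details specific to each encoding:

```javascript
// Illustrative only: octets that fit in one Tweet for a fixed-width encoding.
function maxOctetsPerTweet (bitsPerChar, weightPerChar, weightLimit = 280) {
  const maxChars = Math.floor(weightLimit / weightPerChar)
  return Math.floor((maxChars * bitsPerChar) / 8)
}

console.log(maxOctetsPerTweet(11, 1)) // 385 (Base2048: 280 light characters)
console.log(maxOctetsPerTweet(16, 2)) // 280 (Base65536: 140 heavy characters)
console.log(maxOctetsPerTweet(6, 1)) // 210 (Base64: 280 light characters)
```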

Note

At the time of writing, the sophisticated weighted-code point check is only carried out client-side. Server-side, the check is still a simple code point length check, now capped at 280 code points. So, by circumventing the client-side check, it's possible to send 280 characters of Base65536 i.e. 560 bytes of data in a single Tweet.

Base2048 was developed under the assumption that most people will not go to the trouble of circumventing the client-side check and/or that eventually the check will be implemented server-side as well.

Code point safety

Base2048 uses only "safe" Unicode code points (no unassigned code points, no control characters, no whitespace, no combining diacritics, ...). This guarantees that the data sent will remain intact when sent through any "Unicode-clean" text interface.

In the available space of 4,352 light code points, there are 2,343 safe code points. For Base2048, since I felt it improved the character repertoire, I further ruled out the four "Symbol" General Categories, leaving 2,212 safe code points, and the "Letter, Modifier" General Category, leaving 2,176 safe code points. From these I chose 2^11 = 2048 code points for the primary repertoire and 2^3 = 8 additional code points to use as padding characters.
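To give a feel for this kind of filtering, here is a rough sketch using JavaScript's Unicode property escapes. It is not the selection procedure actually used for Base2048. It rejects only the categories named explicitly above (the original list ends in "..." and excludes more), the function name is hypothetical, and the result depends on the Unicode version of the JavaScript engine, so it should not be expected to reproduce the counts quoted above.

```javascript
// Rough approximation only: unassigned, control, whitespace, combining marks,
// symbols and modifier letters are rejected. The real "safe" definition is stricter.
const EXCLUDED = /[\p{Cn}\p{Cc}\p{White_Space}\p{M}\p{S}\p{Lm}]/u

function isCandidateCodePoint (codePoint) {
  if (codePoint >= 0x1100) return false // heavy code points are out of scope
  return !EXCLUDED.test(String.fromCodePoint(codePoint))
}

console.log(isCandidateCodePoint(0x47)) // true: 'G', an uppercase letter
console.log(isCandidateCodePoint(0x20)) // false: space is whitespace
console.log(isCandidateCodePoint(0x301)) // false: combining acute accent
console.log(isCandidateCodePoint(0x2B)) // false: '+' is a math symbol
```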

License

MIT

base2048's People

Contributors

dependabot[bot], estella144, ojwb, qntm

base2048's Issues

Python Javascript Shenanigans

Hi,

I've been fiddling with base2048 via js2py. I can get encode to work, but decode is returning an empty list.

My Python is as follows:

import js2py
from js2py import require

base2048 = require('base2048')

t = [1, 2, 4, 8, 16, 32, 64, 128] # 'GƸOʜeҩ'
print(t)
a = base2048.encode(t)
print(a)
b = base2048.decode(a)
print(b)

And returns:
[1, 2, 4, 8, 16, 32, 64, 128]
GƸOʜeҩ
[]

Was wondering if you had any idea what's happening? I can't tell if it's a js2py or a base2048 thing, but decode seems to work in the sense that it doesn't like taking non-strings.

Thanks for the time

What about emojis?

If the goal is to fit as much text as possible in a tweet, why not use emojis as well?

Ability to select by double click

Some code points break words, which prevents the user from selecting the entire text by double click/tap.

Example:

ЭइໃݹƐಬໃݲЭॴເɕநՋໂڥѻටҽݤߜටԮڍஸق൱ݹࠅ৩ບݔƐഠഉٽॳඝಽٽࠌҼຯڤߜ૪ଞ1

BaseZalgo?

Base2048 hits a great sweet-spot for tweets & similar presentations. Another possible encoding optimization target might be horizontal width, via the use of arbitrary combining-diacriticals (~ 'Zalgo text') to either:

  • squeeze even more bits into one horizontal-character; or
  • layer some "out of band" data onto baseline-text (that remains somewhat readable/interpretable)

I see you mentioned this possibility at https://qntm.org/safe, but didn't consider such code-space expansion sufficiently safe/appropriate for your prior aims.

I'd be interested in working out some standard conventions for encoding binary data using combining-diacriticals, both for densifying Base2048 or adding a side channel to text where the uncomposed characters remain unchanged.

If you'd be interested in collaborating or reviewing any work-in-progress, please let me know.

At my current not-totally-hopeless but still-fairly-rudimentary understanding of tricky Unicode composing issues, I'm especially worried about the difficulty of "hardening against normalization" you mention, & hoping to find a rule-of-thumb, or searching/testing method, for minimizing risks without necessarily retreating to a simple-but-restrictive approach (like the "Base1-encoded number of combin[ers]" you mention).

Thanks!
