mathiasbynens / regexpu

A source code transpiler that enables the use of ES2015 Unicode regular expressions in ES5.

Home Page: https://mths.be/regexpu

License: MIT License

JavaScript 100.00%

regular-expression regexp regex code-generation javascript ecmascript es2015 unicode

regexpu's Introduction

regexpu

regexpu is a source code transpiler that enables the use of ES2015 Unicode regular expressions in JavaScript-of-today (ES5). It rewrites regular expressions that make use of the ES2015 u flag into equivalent ES5-compatible regular expressions.

Here’s an online demo.

Traceur v0.0.61+, Babel v1.5.0+, esnext v0.12.0+, and Bublé v0.12.0+ use regexpu for their u regexp transpilation. The REPL demos for Traceur, Babel, esnext, and Bublé let you try u regexps as well as other ES.next features.

Example

Consider a file named example-es2015.js with the following contents:

var string = 'foo💩bar';
var match = string.match(/foo(.)bar/u);
console.log(match[1]);
// → '💩'

// This regex matches any symbol from U+1F4A9 to U+1F4AB, and nothing else.
var regex = /[\u{1F4A9}-\u{1F4AB}]/u;
// The following regex is equivalent.
var alternative = /[💩-💫]/u;
console.log([
  regex.test('a'),  // false
  regex.test('💩'), // true
  regex.test('💪'), // true
  regex.test('💫'), // true
  regex.test('💬')  // false
]);

Let’s transpile it:

$ regexpu < example-es2015.js > example-es5.js

example-es5.js can now be used in ES5 environments. Its contents are as follows:

var string = 'foo💩bar';
var match = string.match(/foo((?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]))bar/);
console.log(match[1]);
// → '💩'

// This regex matches any symbol from U+1F4A9 to U+1F4AB, and nothing else.
var regex = /(?:\uD83D[\uDCA9-\uDCAB])/;
// The following regex is equivalent.
var alternative = /(?:\uD83D[\uDCA9-\uDCAB])/;
console.log([
  regex.test('a'),  // false
  regex.test('💩'), // true
  regex.test('💪'), // true
  regex.test('💫'), // true
  regex.test('💬')  // false
]);
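
As a sanity check, the two versions of the range regex above can be compared directly in an engine that supports the u flag natively. This is only a minimal sketch: the `samples` and `results` names are illustrative and are not part of regexpu.

```javascript
// The ES2015 original and the ES5 output, both copied from the example above.
const es2015 = /[\u{1F4A9}-\u{1F4AB}]/u;
const es5 = /(?:\uD83D[\uDCA9-\uDCAB])/;

// An ASCII letter plus U+1F4A8 through U+1F4AC (one symbol on each side of the range).
const samples = ['a', '\u{1F4A8}', '\u{1F4A9}', '\u{1F4AA}', '\u{1F4AB}', '\u{1F4AC}'];

const results = samples.map((sample) => ({
  codePoint: sample.codePointAt(0).toString(16).toUpperCase(),
  es2015: es2015.test(sample),
  es5: es5.test(sample),
}));

// Both regexes agree on every sample.
console.log(results.every((r) => r.es2015 === r.es5));
// → true
```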

Known limitations

regexpu only transpiles regular expression literals, so things like RegExp('…', 'u') are not affected.
regexpu doesn’t polyfill the RegExp.prototype.unicode getter because it’s not possible to do so without side effects.
regexpu doesn’t support canonicalizing the contents of back-references in regular expressions with both the i and u flag set, since that would require transpiling/wrapping strings.
regexpu doesn’t match lone low surrogates accurately. Unfortunately that is impossible to implement due to the lack of lookbehind support in JavaScript regular expressions.
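
The last limitation can be seen without running regexpu at all. The sketch below compares a native u regex with the lone-low-surrogate fragment that appears in the transpiled output of `.` in the example above. It assumes an engine with native u support for the comparison; the variable names are illustrative.

```javascript
// Native ES2015 behavior: only the lone low surrogate itself is matched.
const native = /[\uDC00-\uDFFF]/u.exec('a\uDC00');

// ES5 fragment taken from the transpiled output of `.` shown earlier.
// Without lookbehind, the code unit before the low surrogate has to be consumed.
const emulated = /(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]/.exec('a\uDC00');

console.log(native.index, native[0].length);     // → 1 1
console.log(emulated.index, emulated[0].length); // → 0 2
```

The emulated match starts one code unit too early and is one code unit too long, which is why match positions and captured text can differ from the native result.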

Installation

To use regexpu programmatically, install it as a dependency via npm:

npm install regexpu --save-dev

To use the command-line interface, install regexpu globally:

npm install regexpu -g

API

`regexpu.version`

A string representing the semantic version number.

`regexpu.rewritePattern(pattern, flags, options)`

This is an alias for the rewritePattern function exported by regexpu-core. Please refer to that project’s documentation for more information.

regexpu.rewritePattern uses regjsgen, regjsparser, and regenerate as internal dependencies. If you only need this function in your program, it’s better to include it directly:

// Instead of…
const rewritePattern = require('regexpu').rewritePattern;

// Use this:
const rewritePattern = require('regexpu-core');

This prevents the Recast and Esprima dependencies from being loaded into memory.

`regexpu.transformTree(ast, options)` or its alias `regexpu.transform(ast, options)`

This function accepts an abstract syntax tree representing some JavaScript code, and returns a transformed version of the tree in which any regular expression literals that use the ES2015 u flag are rewritten in ES5.

const regexpu = require('regexpu');
const recast = require('recast');
const tree = recast.parse(code); // ES2015 code
const transformedTree = regexpu.transform(tree);
const result = recast.print(transformedTree);
console.log(result.code); // transpiled ES5 code
console.log(result.map); // source map

The optional options object is passed to regexpu-core’s rewritePattern. For a description of the available options, see its documentation.

regexpu.transformTree uses Recast, regjsgen, regjsparser, and regenerate as internal dependencies. If you only need this function in your program, it’s better to include it directly:

const transformTree = require('regexpu/transform-tree');

This prevents the Esprima dependency from being loaded into memory.

`regexpu.transpileCode(code, options)`

This function accepts a string representing some JavaScript code, and returns a transpiled version of this code in which any regular expression literals that use the ES2015 u flag are rewritten in ES5.

const es2015 = 'console.log(/foo.bar/u.test("foo💩bar"));';
const es5 = regexpu.transpileCode(es2015);
// → 'console.log(/foo(?:[\\0-\\t\\x0B\\f\\x0E-\\u2027\\u202A-\\uD7FF\\uDC00-\\uFFFF]|[\\uD800-\\uDBFF][\\uDC00-\\uDFFF]|[\\uD800-\\uDBFF])bar/.test("foo💩bar"));'

The optional options object recognizes the following properties:

sourceFileName: a string representing the file name of the original ES2015 source file.
sourceMapName: a string representing the desired file name of the source map.
dotAllFlag: a boolean indicating whether to enable experimental support for the s (dotAll) flag.
unicodePropertyEscape: a boolean indicating whether to enable experimental support for Unicode property escapes.

The sourceFileName and sourceMapName properties must be provided if you want to generate source maps.

const result = regexpu.transpileCode(code, {
  'sourceFileName': 'es2015.js',
  'sourceMapName': 'es2015.js.map',
});
console.log(result.code); // transpiled source code
console.log(result.map); // source map

regexpu.transpileCode uses Esprima, Recast, regjsgen, regjsparser, and regenerate as internal dependencies. If you only need this function in your program, feel free to include it directly:

const transpileCode = require('regexpu/transpile-code');

Transpilers that use regexpu internally

If you’re looking for a general-purpose ES.next-to-ES5 transpiler with support for Unicode regular expressions, consider using one of these:

Traceur v0.0.61+
Babel v1.5.0+
esnext v0.12.0+
Bublé v0.12.0+

For maintainers

How to publish a new release

On the main branch, bump the version number in package.json:
```
npm version patch -m 'Release v%s'
```
Instead of patch, use minor or major as needed.

Note that this produces a Git commit + tag.
Push the release commit and tag:
```
git push && git push --tags
```
Our CI then automatically publishes the new release to npm.

Author


Mathias Bynens

License

regexpu is available under the MIT license.

regexpu's Issues

Valid pattern with unescaped dot fails to rewrite

The rewritePattern function generates an error when passed a regex pattern containing an unescaped dot within an alternatives group.

For example, the following pattern:

    /(x.x|x)/

Fails to rewrite in the following code:

const regex = /(x.x|x)/
const pattern = regex.toString()
rewritePattern(pattern)

And instead generates the following error:

Error: Invalid node type: dot; expected types: /^(?:anchor|characterClass|characterClassEscape|empty|group|quantifier|reference|unicodePropertyEscape|value)$/

The error goes away if the dot is escaped:

    /(x\.x|x)/

But of course this changes the meaning of the regex.

The problem is seen in the current regexpu version (4.6.0) but doesn't seem to exist in the previous version (tested 4.5.4).

Document inability to match lone low surrogates accurately

var regex = /^a[\u{D800}-\u{DBFF}\u{DC00}-\u{DFFF}]b$/u;
console.log(
  regex.test('a\uD800b'),
  regex.test('a\uDC00b')
);

Expected: true, true
Actual: true, false

https://mothereff.in/regexpu#var%20regex%20%3D%20%2F%5Ea%5B%5Cu%7BD800%7D-%5Cu%7BDBFF%7D%5Cu%7BDC00%7D-%5Cu%7BDFFF%7D%5Db%24%2Fu%3B%0Aconsole.log%28%0A%20%20regex.test%28%27a%5CuD800b%27%29%2C%0A%20%20regex.test%28%27a%5CuDC00b%27%29%0A%29%3B

Ref. mathiasbynens/regenerate#28 (comment).
https://esdiscuss.org/topic/q-lonely-surrogates-and-unicode-regexps#content-3

Transpiling of /(\1)+\1\1/u to /(\x01)+\1\1/

Your library transpiles /(\1)+\1\1/u to /(\x01)+\1\1/.

Is there a change in the ES6 specs which allows such interpretation?

The current behavior in ES5:

var string = '\x01\x01';
var match = string.match(/(\1)+\1\1/);
console.log(match);
-> [ "", "" ]

All of them are interpreted as capturing groups, as far as my testing reveals on Firefox. The ECMA 5 spec also seems to agree with this particular case: http://www.ecma-international.org/ecma-262/5.1/#sec-15.10.2.11

The spec seems to allow \1 to appear before its capturing group, as long as there are enough capturing groups in the entire expression.

It is an error if n is greater than the total number of left capturing parentheses in the entire regular expression.
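
The difference between the two patterns can be checked in an engine with native u support. This is a small sketch of the behavior described in this report, not regexpu code; the variable names are illustrative.

```javascript
const input = '\x01\x01';

// Here \1 is a back-reference to group 1. The group has not captured anything
// when \1 is first evaluated, so the back-reference matches the empty string.
const original = /(\1)+\1\1/u.exec(input);

// The transpiled form reported above treats the first \1 as U+0001 instead.
const transpiled = /(\x01)+\1\1/.exec(input);

console.log(original[0] === '' && original[1] === ''); // → true
console.log(transpiled);                               // → null
```

The transpiled pattern needs at least three U+0001 characters to match, so it rejects an input that the original accepts.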

Missing module jsesc

When attempting to use regexpu after installing it globally with npm, I got an error about a missing jsesc package. Performing a global install of jsesc resolved the issue.

doesn't recognize unicode character classes

I'm not sure whether this is intended, since I don't know whether ES6 plans to support it, but the transpiler rejects Unicode character classes ("\p" in POSIX regexes).
For example:
var match = string.match(/\p{L&}/u);
is rejected by the transpiler.

Transpiling of /[]/u to /(?:)/

Is transpiling /[]/u to /(?:)/ (matches empty string) correct according to ES6?

/[]/ is an empty character class that doesn't match anything in ES5.
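
The difference in behavior is easy to confirm. A minimal sketch, with illustrative variable names:

```javascript
// An empty character class can never match, with or without the u flag.
const emptyClass = /[]/.test('abc');
const emptyClassUnicode = /[]/u.test('abc');

// An empty non-capturing group always matches (the empty string).
const emptyGroup = /(?:)/.test('abc');

console.log(emptyClass, emptyClassUnicode, emptyGroup);
// → false false true
```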

alias .transformTree as .transform

This module is the only one in esnext that doesn't conform to the .transform() convention.

Transpilation of .ignoreCase for HTML `pattern`

No, really. I’m sure that sounds bizarre but I have a reason.

On my Node server, I'm generating HTML that uses the pattern attribute. Ideally, it would look something like this:

<input pattern="<% /^foo.bar$/i.toSource %>">

However, the pattern attribute is specified to act as if it only has the u flag. regexpu helps me with my server-side regexes that use dotAll and such for older browsers, but the i flag can’t be used.

Would it be in-scope to add ignoreCase as an option for regexpu?

Specifying astral plane character range in surrogate form

Does the draft spec say anything about this use case?

/[\uD80C\uDC00-\uD80D\uDC1F]/u

I expect it to behave the same as

/[\u{13000}-\u{1342F}]/u

since

/[\uD80C\uDC00\uD80D\uDC1F]/u

(without the range) is correctly recognized as 2 separate characters by regexpu.
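
For reference, this is how an engine with native u support treats the surrogate-form range. The `decodeSurrogatePair` helper is hypothetical and only applies the standard UTF-16 arithmetic. Note that the upper bound as written in the surrogate form decodes to U+1341F rather than the U+1342F used in the code point form.

```javascript
// Decode a surrogate pair into its code point (standard UTF-16 arithmetic).
function decodeSurrogatePair(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

const lower = decodeSurrogatePair(0xD80C, 0xDC00);
const upper = decodeSurrogatePair(0xD80D, 0xDC1F);
console.log(lower.toString(16), upper.toString(16)); // → 13000 1341f

// With the u flag, each \uXXXX\uXXXX pair in the class is a single code point,
// so the class is a range of code points rather than of code units.
const range = /^[\uD80C\uDC00-\uD80D\uDC1F]$/u;
console.log(range.test('\u{13000}')); // → true
console.log(range.test('\u{1341F}')); // → true
console.log(range.test('\u{13420}')); // → false
console.log(range.test('\uD80C'));    // → false
```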

Integration with esnext

resugar/resugar#30

Use Unicode v5.1.0 for whitespace

Quote from https://people.mozilla.org/~jorendorff/es6-draft.html#sec-white-space (emphasis mine):

ECMAScript implementations must recognize as Whitespace code points listed in the “Separator, space” (Zs) category by Unicode 5.1. ECMAScript implementations may also recognize as Whitespace additional category Zs code points from subsequent editions of the Unicode Standard.

At the moment we’re using Unicode v7.0.0 for everything, which means we’re missing out on some Unicode v5.1.0 code points.

Integration with Traceur

https://github.com/google/traceur-compiler/wiki/AddingTransformationPasses except we don’t even need special parsing (IIRC Traceur already parses regular expressions just fine, even those with ES6 flags).

Wrong result when `Unicode property escapes` is enabled but the `s (dotAll) flag` is disabled

The es2015+ code:

/^\p{Unified_Ideograph}.$/us.test('中\n')
// true

/hello.world/su.test('hello\nworld') 
// true

build options (1):

- [ ] enable s (dotAll) flag
- [x] enable Unicode property escapes (\p{…} and \P{…})
- [ ] use ES2015 u flag in output

result (1):

/^(?:[\u3400-\u4DB5\u4E00-\u9FEF\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29]|[\uD840-\uD868\uD86A-\uD86C\uD86F-\uD872\uD874-\uD879][\uDC00-\uDFFF]|\uD869[\uDC00-\uDED6\uDF00-\uDFFF]|\uD86D[\uDC00-\uDF34\uDF40-\uDFFF]|\uD86E[\uDC00-\uDC1D\uDC20-\uDFFF]|\uD873[\uDC00-\uDEA1\uDEB0-\uDFFF]|\uD87A[\uDC00-\uDFE0])(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])$/s.test('中\n')
// false

/hello(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])world/s.test('hello\nworld')
// false

build options (2):

- [x] enable s (dotAll) flag
- [x] enable Unicode property escapes (\p{…} and \P{…})
- [ ] use ES2015 u flag in output

result (2):

/^(?:[\u3400-\u4DB5\u4E00-\u9FEF\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29]|[\uD840-\uD868\uD86A-\uD86C\uD86F-\uD872\uD874-\uD879][\uDC00-\uDFFF]|\uD869[\uDC00-\uDED6\uDF00-\uDFFF]|\uD86D[\uDC00-\uDF34\uDF40-\uDFFF]|\uD86E[\uDC00-\uDC1D\uDC20-\uDFFF]|\uD873[\uDC00-\uDEA1\uDEB0-\uDFFF]|\uD87A[\uDC00-\uDFE0])(?:[\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])$/.test('中\n')
// true

/hello(?:[\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])world/.test('hello\nworld')
// true
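
The only difference between the two results is the BMP character class that replaces the dot. The sketch below isolates those two classes, copied from the outputs above, and tests them against the four line terminators. The variable names are illustrative.

```javascript
// BMP portion of the `.` replacement without dotAll (from result 1 above).
const dotBmp = /^[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]$/;
// BMP portion of the `.` replacement with dotAll enabled (from result 2 above).
const dotAllBmp = /^[\0-\uD7FF\uE000-\uFFFF]$/;

const lineTerminators = ['\n', '\r', '\u2028', '\u2029'];
const withoutDotAll = lineTerminators.map((ch) => dotBmp.test(ch));
const withDotAll = lineTerminators.map((ch) => dotAllBmp.test(ch));

console.log(withoutDotAll); // → [ false, false, false, false ]
console.log(withDotAll);    // → [ true, true, true, true ]
```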

How to use /\p{L}/u with babel?

This is probably the best place to ask: I was wondering whether it's possible to compile this regex to a non-Unicode regex.

It works with the demo at https://mothereff.in/regexpu but not with the Babel link provided in the README.

I need to generate non-Unicode regexes for /\p{N}/u and /\p{L}/u.

I can just copy and paste the regex, but I would prefer that it be generated by Babel in my build script.

Should this be reported to Babel?

Integration with ES6 to ES5 transpilers

Done

Traceur v0.0.61+ (google/traceur-compiler#1294)
~~6to5~~ Babel v1.5.0+ (babel/babel#11)
esnext v0.12.0+ (resugar/resugar#30)
Bublé v0.12.0+ (https://gitlab.com/Rich-Harris/buble/issues/3#note_12596080)

In progress

none

TODO

~~es6now~~ esdown (zenparsing/esdown#11)
es6-transpiler (termi/es6-transpiler#45)
jsdc (army8735/jsdc#17)
~~jstransform (https://github.com/facebook/jstransform/issues/64)~~ no longer maintained

Support for back-references

/(s)\1/ui is currently transformed to /([s\u017F])\x01/i.

Back references should probably not be transformed to hexadecimal escapes
Canonicalizing the back-reference's content is tricky. I'm not sure how this feature can be supported without canonicalizing the input string first, e.g. in /(s)\1/ui.test("s\u017f") == true.
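
The canonicalization problem can be reproduced in an engine with native u support. In this sketch the ES5 pattern is written by hand to mirror the output quoted above, with the back-reference kept as \1. It is an assumption for illustration, not actual regexpu output.

```javascript
const input = 's\u017F'; // 's' followed by U+017F LATIN SMALL LETTER LONG S

// Native ES2015: with both i and u set, U+017F case-folds to 's',
// so the back-reference matches it.
const native = /(s)\1/ui.test(input);

// Hand-written ES5 approximation: the group can be widened to [s\u017F], but
// the back-reference still compares captured text using the ES5 case rules,
// under which U+017F is not equivalent to 's'.
const es5 = /([s\u017F])\1/i.test(input);

console.log(native); // → true
console.log(es5);    // → false
```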

Update regexpu’s API following the breaking regexpu-core changes

Re-posted from mathiasbynens/regexpu-core#55:

mathiasbynens/regexpu-core#49 changed regexpu-core’s API. We should make sure that these changes are reflected in the upstream regexpu project as well, e.g. in regexpu.transpileCode(code, options). This is a blocker for updating the demo at https://mothereff.in/regexpu.

https://github.com/mathiasbynens/regexpu#regexputranspilecodecode-options

cc @nicolo-ribaudo

Confirm whether my interpretation of the spec + assumptions are correct

When the u flag is enabled, should inverse/uppercase character class escapes (e.g. \D) match all Unicode code points (rather than all BMP code points) except those in the lowercase variant of the character class escape (e.g. \d) set?
When the u flag is enabled, should negated character classes (e.g. [^a]) match all Unicode code points (rather than BMP code points) except those in the set?

http://esdiscuss.org/topic/questions-regarding-es6-unicode-regular-expressions cc @allenwb @NorbertLindenberg
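
For comparison, engines that implement the u flag natively behave as both questions suggest. A minimal check, with illustrative variable names:

```javascript
const astral = '\u{1F4A9}'; // a single code point outside the BMP

// Without u, the string is two code units, so one \D or [^a] cannot span it.
const withoutU = [/^\D$/.test(astral), /^[^a]$/.test(astral)];
// With u, matching proceeds code point by code point.
const withU = [/^\D$/u.test(astral), /^[^a]$/u.test(astral)];

console.log(withoutU); // → [ false, false ]
console.log(withU);    // → [ true, true ]
```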

`/\uD834\uDF06/u` vs. `/\u{D834}\u{DF06}/u`

https://bugs.ecmascript.org/show_bug.cgi?id=3521#c3

regexpu already handles this just fine, but there are no tests for this behavior specifically.
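
A test for this could be based on the native behavior, which distinguishes the two spellings. The following sketch assumes an engine with native u support; the variable names are illustrative.

```javascript
const symbol = '\u{1D306}'; // the same string as '\uD834\uDF06'

// Adjacent \uXXXX escapes that form a surrogate pair denote one code point.
const paired = /^\uD834\uDF06$/u.test(symbol);
// Two \u{…} escapes denote two separate lone-surrogate code points, which
// cannot match the halves of a well-formed pair.
const separate = /^\u{D834}\u{DF06}$/u.test(symbol);

console.log(paired);   // → true
console.log(separate); // → false
```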

`/^.$/us` is transpiled incorrectly

Current output:

const transpiled = /^(?:[\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])$/

Actual result:

transpiled.test('\u0001\udc00'); // true

Expected result:

/^.$/us.test('\u0001\udc00'); // false

Upstream Esprima patch

jquery/esprima#264
https://code.google.com/p/esprima/issues/detail?id=557

facebookarchive/esprima#42

Integration with 6to5

babel/babel#11

Runtime

If (Exception) val 1==1 }

clear input

With Firefox, if I click "Clear Recent History" with:

Time range to clear: Everything
Details:
- Cache
- Offline Website Data
- Site Preferences

Then refresh the page, my previous input remains. The only workaround I've found is to open a private window.

`/./u` and `/[^x]/u` matching surrogate halves individually

Reported by Marja Hölttä:

var string = '𝌆𝌆';
var match = string.match(/(....)/u);
console.log(match[1]);

I checked that the same behavior occurs for other character classes too, like this:

var string = 'a𝌆b';
var match = string.match(/a([^c][^c])b/u);
console.log(match[1]); // 𝌆

And as a bonus, it transforms /(.+)\1/u to this:

> r = /((?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF])+)\1/
> r.test('𝜆𝌆𝌇') // code units: D835 DF06 D834 DF06 D834 DF07
true

…which is pretty surprising :)

Assertion error '\\u03B8' == '[\\u03B8\\u03F4]'

Strangely, the Debian package for regexpu got broken recently, and since I didn't change anything, I don't know exactly what the reason is. Here is the link to the bug report

Remove the upper limit of hex digits in Unicode code point escapes

E.g. '\u{00000000000000000001D306}' == '\u{1D306}'.

The following dependencies need patching:

regjsparser: done in v0.1.3
Esprima: https://github.com/ariya/esprima/pull/293 (master) & https://github.com/ariya/esprima/pull/294 (harmony)
esprima-fb (used by recast by default): https://github.com/facebook/esprima (https://github.com/facebook/esprima/issues/77)
recast: benjamn/recast#142
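
The target behavior matches what engines with native ES2015 support already do for both string literals and u regexes. A short sketch, with illustrative variable names:

```javascript
// Leading zeros do not change the code point that an escape denotes.
const padded = '\u{00000000000000000001D306}';
const sameString = padded === '\u{1D306}';
const sameInRegex = /^\u{0000000001D306}$/u.test('\u{1D306}');

console.log(sameString);  // → true
console.log(sameInRegex); // → true
```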

Babel Plugin

Hi @mathiasbynens,

I'm one of the contributors to Babel. I was wondering if you might consider turning regexpu into a Babel plugin or something along those lines.

We want the npm download size to shrink and the extra dependencies that regexpu pulls in for transpilation are a big part of that.

Just interested in seeing what it would take for regexpu to switch?

`/[\u{11450}\u{11C50}\u{11C52}]/u`

Sorry if I'm reporting this bug in the wrong place. I'm also not sure whether the online demo runs the latest code.

For the regexp in the title, the last part is lost:
/(?:[\uD805\uD807]\uDC50)/, which matches only 2 code points, not 3.

Update:
It seems to be a bug in optimizeByLowSurrogates in "regenerate.js":

// String.fromCodePoint(0x11450) === String.fromCharCode(0xD805, 0xDC50)
// String.fromCodePoint(0x11C50) === String.fromCharCode(0xD807, 0xDC50)
// String.fromCodePoint(0x11C52) === String.fromCharCode(0xD807, 0xDC52)

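var regenerate = require('regenerate');

// All three code points should be covered by the generated pattern.
var set = regenerate()
  .add(0x11450)
  .add(0x11C50)
  .add(0x11C52);
console.log(set.toString());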
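
The lost code point can be confirmed without regenerate by testing the reported output against a native u regex. In this sketch, `handWritten` is an assumption: one ES5 pattern that would cover all three code points, written by hand rather than produced by regenerate.

```javascript
const native = /^[\u{11450}\u{11C50}\u{11C52}]$/u;
// The output reported above.
const reported = /^(?:[\uD805\uD807]\uDC50)$/;
// Hand-written ES5 alternative (an assumption, not regenerate output).
const handWritten = /^(?:\uD805\uDC50|\uD807[\uDC50\uDC52])$/;

const symbols = [0x11450, 0x11C50, 0x11C52].map((cp) => String.fromCodePoint(cp));
const nativeResults = symbols.map((s) => native.test(s));
const reportedResults = symbols.map((s) => reported.test(s));
const handWrittenResults = symbols.map((s) => handWritten.test(s));

console.log(nativeResults);      // → [ true, true, true ]
console.log(reportedResults);    // → [ true, true, false ]
console.log(handWrittenResults); // → [ true, true, true ]
```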