mathiasbynens / regexpu

A source code transpiler that enables the use of ES2015 Unicode regular expressions in ES5.

Home Page: https://mths.be/regexpu

License: MIT License

regexpu's Introduction

regexpu

regexpu is a source code transpiler that enables the use of ES2015 Unicode regular expressions in JavaScript-of-today (ES5). It rewrites regular expressions that make use of the ES2015 u flag into equivalent ES5-compatible regular expressions.

Here’s an online demo.

Traceur v0.0.61+, Babel v1.5.0+, esnext v0.12.0+, and Bublé v0.12.0+ use regexpu for their u regexp transpilation. The REPL demos for Traceur, Babel, esnext, and Bublé let you try u regexps as well as other ES.next features.

Example

Consider a file named example-es2015.js with the following contents:

var string = 'foo💩bar';
var match = string.match(/foo(.)bar/u);
console.log(match[1]);
// → '💩'

// This regex matches any symbol from U+1F4A9 to U+1F4AB, and nothing else.
var regex = /[\u{1F4A9}-\u{1F4AB}]/u;
// The following regex is equivalent.
var alternative = /[💩-💫]/u;
console.log([
  regex.test('a'),  // false
  regex.test('💩'), // true
  regex.test('💪'), // true
  regex.test('💫'), // true
  regex.test('💬')  // false
]);

Let’s transpile it:

$ regexpu < example-es2015.js > example-es5.js

example-es5.js can now be used in ES5 environments. Its contents are as follows:

var string = 'foo💩bar';
var match = string.match(/foo((?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF]))bar/);
console.log(match[1]);
// → '💩'

// This regex matches any symbol from U+1F4A9 to U+1F4AB, and nothing else.
var regex = /(?:\uD83D[\uDCA9-\uDCAB])/;
// The following regex is equivalent.
var alternative = /(?:\uD83D[\uDCA9-\uDCAB])/;
console.log([
  regex.test('a'),  // false
  regex.test('💩'), // true
  regex.test('💪'), // true
  regex.test('💫'), // true
  regex.test('💬')  // false
]);
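
The rewritten character class follows from how astral code points are encoded as UTF-16 surrogate pairs. The sketch below is only an illustration of that arithmetic, not regexpu's implementation; the helper name `toSurrogatePair` is hypothetical.

```javascript
// Hypothetical helper: split an astral code point (above U+FFFF) into its
// UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xD800 + (offset >> 10);
  const low = 0xDC00 + (offset & 0x3FF);
  return [high, low];
}

const hex = (n) => n.toString(16).toUpperCase();

const start = toSurrogatePair(0x1F4A9).map(hex);
const end = toSurrogatePair(0x1F4AB).map(hex);
console.log(start); // [ 'D83D', 'DCA9' ]
console.log(end);   // [ 'D83D', 'DCAB' ]

// Both ends share the high surrogate U+D83D, so a single class over the
// low surrogates is enough to express the range in ES5.
const es5 = /(?:\uD83D[\uDCA9-\uDCAB])/;
const es2015 = /[\u{1F4A9}-\u{1F4AB}]/u;
const samples = ['a', '💩', '💪', '💫', '💬'];
const agree = samples.every((s) => es5.test(s) === es2015.test(s));
console.log(agree); // true
```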

Known limitations

  1. regexpu only transpiles regular expression literals, so things like RegExp('…', 'u') are not affected.
  2. regexpu doesn’t polyfill the RegExp.prototype.unicode getter because it’s not possible to do so without side effects.
  3. regexpu doesn’t support canonicalizing the contents of back-references in regular expressions with both the i and u flag set, since that would require transpiling/wrapping strings.
  4. regexpu doesn’t match lone low surrogates accurately. Unfortunately that is impossible to implement due to the lack of lookbehind support in JavaScript regular expressions.
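
Limitation 4 can be reproduced in any engine that supports the `u` and `s` flags natively. The sketch below uses an ES5 expansion of `.` in the shape regexpu emits, plus a variant that is an assumption of this example: it swaps the last alternative for an ES2018 lookbehind, which ES5 targets do not have. The variable names are illustrative.

```javascript
// ES5 expansion of /^.$/us. The last alternative has to consume the code
// unit in front of a lone low surrogate, because ES5 has no lookbehind.
const es5Dot = /^(?:[\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])$/;

// Assumption for illustration: the same pattern with an ES2018 lookbehind,
// which checks the preceding code unit without consuming it.
const withLookbehind = /^(?:[\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF])$/;

// Two code points: U+0001 followed by a lone low surrogate.
const input = '\u0001\uDC00';

console.log(/^.$/us.test(input));        // false (native behavior)
console.log(es5Dot.test(input));         // true  (the inaccuracy)
console.log(withLookbehind.test(input)); // false

// A lone low surrogate on its own is still a single symbol in all three.
console.log(/^.$/us.test('\uDC00'), es5Dot.test('\uDC00'), withLookbehind.test('\uDC00'));
// true true true
```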

Installation

To use regexpu programmatically, install it as a dependency via npm:

npm install regexpu --save-dev

To use the command-line interface, install regexpu globally:

npm install regexpu -g

API

regexpu.version

A string representing the semantic version number.

regexpu.rewritePattern(pattern, flags, options)

This is an alias for the rewritePattern function exported by regexpu-core. Please refer to that project’s documentation for more information.

regexpu.rewritePattern uses regjsgen, regjsparser, and regenerate as internal dependencies. If you only need this function in your program, it’s better to include it directly:

// Instead of…
const rewritePattern = require('regexpu').rewritePattern;

// Use this:
const rewritePattern = require('regexpu-core');

This prevents the Recast and Esprima dependencies from being loaded into memory.

regexpu.transformTree(ast, options) or its alias regexpu.transform(ast, options)

This function accepts an abstract syntax tree representing some JavaScript code, and returns a transformed version of the tree in which any regular expression literals that use the ES2015 u flag are rewritten in ES5.

const regexpu = require('regexpu');
const recast = require('recast');
const tree = recast.parse(code); // ES2015 code
const transformedTree = regexpu.transform(tree);
const result = recast.print(transformedTree);
console.log(result.code); // transpiled ES5 code
console.log(result.map); // source map

The optional options object is passed to regexpu-core’s rewritePattern. For a description of the available options, see its documentation.

regexpu.transformTree uses Recast, regjsgen, regjsparser, and regenerate as internal dependencies. If you only need this function in your program, it’s better to include it directly:

const transformTree = require('regexpu/transform-tree');

This prevents the Esprima dependency from being loaded into memory.

regexpu.transpileCode(code, options)

This function accepts a string representing some JavaScript code, and returns a transpiled version of this code in which any regular expression literals that use the ES2015 u flag are rewritten in ES5.

const es2015 = 'console.log(/foo.bar/u.test("foo💩bar"));';
const es5 = regexpu.transpileCode(es2015);
// → 'console.log(/foo(?:[\\0-\\t\\x0B\\f\\x0E-\\u2027\\u202A-\\uD7FF\\uDC00-\\uFFFF]|[\\uD800-\\uDBFF][\\uDC00-\\uDFFF]|[\\uD800-\\uDBFF])bar/.test("foo💩bar"));'

The optional options object recognizes the following properties:

The sourceFileName and sourceMapName properties must be provided if you want to generate source maps.

const result = regexpu.transpileCode(code, {
  'sourceFileName': 'es2015.js',
  'sourceMapName': 'es2015.js.map',
});
console.log(result.code); // transpiled source code
console.log(result.map); // source map

regexpu.transpileCode uses Esprima, Recast, regjsgen, regjsparser, and regenerate as internal dependencies. If you only need this function in your program, feel free to include it directly:

const transpileCode = require('regexpu/transpile-code');

Transpilers that use regexpu internally

If you’re looking for a general-purpose ES.next-to-ES5 transpiler with support for Unicode regular expressions, consider using one of these: Traceur, Babel, esnext, or Bublé.

For maintainers

How to publish a new release

  1. On the main branch, bump the version number in package.json:

    npm version patch -m 'Release v%s'

    Instead of patch, use minor or major as needed.

    Note that this produces a Git commit + tag.

  2. Push the release commit and tag:

    git push && git push --tags

    Our CI then automatically publishes the new release to npm.

Author

Mathias Bynens

License

regexpu is available under the MIT license.

Contributors

azu, bnjmnt4n, eventualbuddha, greenkeeperio-bot, jdalton, mathiasbynens

regexpu's Issues

Valid pattern with unescaped dot fails to rewrite

The rewritePattern function generates an error when passed a regex pattern containing an unescaped dot within an alternatives group.

For example, the following pattern:

    /(x.x|x)/

Fails to rewrite in the following code:

const regex = /(x.x|x)/
const pattern = regex.toString()
rewritePattern(pattern)

And instead generates the following error:

Error: Invalid node type: dot; expected types: /^(?:anchor|characterClass|characterClassEscape|empty|group|quantifier|reference|unicodePropertyEscape|value)$/

The error goes away if the dot is escaped:

    /(x\.x|x)/

But of course this changes the meaning of the regex.

The problem is seen in the current regexpu version (4.6.0) but doesn't seem to exist in the previous version (tested 4.5.4).

Document inability to match lone low surrogates accurately

Transpiling of /(\1)+\1\1/u to /(\x01)+\1\1/

Your library transpiles /(\1)+\1\1/u to /(\x01)+\1\1/.

Is there a change in the ES6 specs which allows such interpretation?


The current behavior in ES5:

var string = '\x01\x01';
var match = string.match(/(\1)+\1\1/);
console.log(match);
-> [ "", "" ]

All of them are interpreted as capturing groups, as far as my testing reveals on Firefox. The ECMA 5 spec also seems to agree with this particular case: http://www.ecma-international.org/ecma-262/5.1/#sec-15.10.2.11

The spec seems to allow \1 to appear before its capturing group as long as there are enough capturing groups in the entire expression.

It is an error if n is greater than the total number of left capturing parentheses in the entire regular expression.
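
The difference between the two patterns can be checked in a non-`u` regex as a minimal sketch. It only compares the original pattern with the rewritten one quoted in this report; the variable names are illustrative.

```javascript
const input = '\x01\x01';

// Every \1 is a back-reference. Inside its own group the capture is still
// undefined, so the back-reference matches the empty string.
const original = /(\1)+\1\1/;
const originalMatch = input.match(original);
console.log(originalMatch[0] === '' && originalMatch[1] === ''); // true

// In the rewritten form the first \1 became the literal character U+0001,
// so the pattern now needs at least three U+0001 characters.
const rewritten = /(\x01)+\1\1/;
console.log(input.match(rewritten));            // null
console.log(rewritten.test('\x01\x01\x01'));    // true
```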

Missing module jsesc

When attempting to use regexpu after installing it globally with npm, I got an error about a missing jsesc package. Installing jsesc globally resolved the issue.

doesn't recognize unicode character classes

I'm not sure whether this is intended, because I'm not sure whether ECMAScript 6 intends to support this, but the transpiler rejects Unicode character classes ("\p" in POSIX regexes).
For example:
var match = string.match(/\p{L&}/u);
is rejected by the transpiler.

Transpiling of /[]/u to /(?:)/

Is transpiling /[]/u to /(?:)/ (matches empty string) correct according to ES6?

/[]/ is an empty character class that doesn't match anything in ES5.
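
The two patterns behave differently in any engine, which a short check makes concrete. This is a sketch of the native behavior only; it does not say what regexpu emits today.

```javascript
const emptyClass = /[]/;   // empty character class: no character is in the set
const emptyGroup = /(?:)/; // empty non-capturing group: matches the empty string
const samples = ['', 'a', '💩'];

console.log(samples.map((s) => emptyClass.test(s))); // [ false, false, false ]
console.log(samples.map((s) => emptyGroup.test(s))); // [ true, true, true ]

// With the u flag, the empty class still matches nothing.
const emptyClassUnicode = /[]/u;
console.log(samples.map((s) => emptyClassUnicode.test(s))); // [ false, false, false ]
```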

Transpilation of .ignoreCase for HTML `pattern`

No, really. I’m sure that sounds bizarre but I have a reason.

On my Node server, I'm generating HTML that uses the pattern attribute. Ideally, it would look something like this:

<input pattern="<% /^foo.bar$/i.toSource %>">

However, the pattern attribute is specified to act as if it only has the u flag. regexpu helps me with my server-side regexes that use dotAll and such for older browsers, but the i flag can’t be used.

Would it be in-scope to add ignoreCase as an option for regexpu?

Specifying astral plane character range in surrogate form

Does the draft spec say anything about this use case?

/[\uD80C\uDC00-\uD80D\uDC1F]/u

I expect it to behave the same as

/[\u{13000}-\u{1341F}]/u

since

/[\uD80C\uDC00\uD80D\uDC1F]/u

(without the range) is correctly recognized as 2 separate characters by regexpu.
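
Decoding the surrogate pairs gives the code point bounds that the surrogate-form range denotes. The sketch below does that arithmetic and then checks how an engine with native `u` support treats the range; the helper name `fromSurrogatePair` is hypothetical.

```javascript
// Hypothetical helper: combine a high and a low surrogate into a code point.
function fromSurrogatePair(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

const lower = fromSurrogatePair(0xD80C, 0xDC00);
const upper = fromSurrogatePair(0xD80D, 0xDC1F);
console.log(lower.toString(16)); // 13000
console.log(upper.toString(16)); // 1341f

// With the u flag, each escaped surrogate pair in the class is read as one
// code point, so the class is a range between the two decoded code points.
const surrogateForm = /^[\uD80C\uDC00-\uD80D\uDC1F]$/u;
const inRange = (codePoint) => surrogateForm.test(String.fromCodePoint(codePoint));
console.log(inRange(lower), inRange(upper), inRange(upper + 1)); // true true false
```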

Use Unicode v5.1.0 for whitespace

Quote from https://people.mozilla.org/~jorendorff/es6-draft.html#sec-white-space (emphasis mine):

ECMAScript implementations must recognize as Whitespace code points listed in the “Separator, space” (Zs) category by Unicode 5.1. ECMAScript implementations may also recognize as Whitespace additional category Zs code points from subsequent editions of the Unicode Standard.

At the moment we’re using Unicode v7.0.0 for everything, which means we’re missing out on some Unicode v5.1.0 code points.

get wrong result when enable `Unicode property escapes` but `disable s (dotAll) flag`

The ES2015+ code:

/^\p{Unified_Ideograph}.$/us.test('中\n')
// true

/hello.world/su.test('hello\nworld') 
// true

build options (1):

- [ ] enable s (dotAll) flag
- [x] enable Unicode property escapes (\p{…} and \P{…})
- [ ] use ES2015 u flag in output

result (1):

/^(?:[\u3400-\u4DB5\u4E00-\u9FEF\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29]|[\uD840-\uD868\uD86A-\uD86C\uD86F-\uD872\uD874-\uD879][\uDC00-\uDFFF]|\uD869[\uDC00-\uDED6\uDF00-\uDFFF]|\uD86D[\uDC00-\uDF34\uDF40-\uDFFF]|\uD86E[\uDC00-\uDC1D\uDC20-\uDFFF]|\uD873[\uDC00-\uDEA1\uDEB0-\uDFFF]|\uD87A[\uDC00-\uDFE0])(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])$/s.test('中\n')
// false

/hello(?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])world/s.test('hello\nworld')
// false

build options (2):

- [x] enable s (dotAll) flag
- [x] enable Unicode property escapes (\p{…} and \P{…})
- [ ] use ES2015 u flag in output

result (2):

/^(?:[\u3400-\u4DB5\u4E00-\u9FEF\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29]|[\uD840-\uD868\uD86A-\uD86C\uD86F-\uD872\uD874-\uD879][\uDC00-\uDFFF]|\uD869[\uDC00-\uDED6\uDF00-\uDFFF]|\uD86D[\uDC00-\uDF34\uDF40-\uDFFF]|\uD86E[\uDC00-\uDC1D\uDC20-\uDFFF]|\uD873[\uDC00-\uDEA1\uDEB0-\uDFFF]|\uD87A[\uDC00-\uDFE0])(?:[\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])$/.test('中\n')
// true

/hello(?:[\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])world/.test('hello\nworld')
// true
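
The only difference between the two results is the BMP part of the expanded dot. The sketch below rebuilds both `hello.world` outputs from string parts (the variable names are illustrative) and runs them without the `s` flag, as an ES5 engine would see them.

```javascript
// Alternatives shared by both outputs: surrogate pairs and lone surrogates.
const astral = '[\\uD800-\\uDBFF][\\uDC00-\\uDFFF]' +
  '|[\\uD800-\\uDBFF](?![\\uDC00-\\uDFFF])' +
  '|(?:[^\\uD800-\\uDBFF]|^)[\\uDC00-\\uDFFF]';

// Dot without dotAll: skips \n, \r, U+2028 and U+2029.
const plainDot = '(?:[\\0-\\t\\x0B\\f\\x0E-\\u2027\\u202A-\\uD7FF\\uE000-\\uFFFF]|' + astral + ')';
// Dot with dotAll: no line terminator is excluded.
const dotAllDot = '(?:[\\0-\\uD7FF\\uE000-\\uFFFF]|' + astral + ')';

const result1 = new RegExp('hello' + plainDot + 'world');
const result2 = new RegExp('hello' + dotAllDot + 'world');
const input = 'hello\nworld';

console.log(result1.test(input));             // false
console.log(result2.test(input));             // true
console.log(/hello.world/su.test(input));     // true (native behavior)
```

In other words, when the source uses the `s` flag, the dot only keeps its meaning if the dotAll transform is enabled together with the `u` transform.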

How to use /\p{L}/u with babel?

This is probably the best way to ask: I was wondering whether it's possible to compile this regex to a non-Unicode regex.

It works with this demo https://mothereff.in/regexpu but not with the Babel link provided in the README.

I need to generate non-Unicode regexes for /\p{N}/u and /\p{L}/u.

I can just copy and paste the regex, but I would prefer that it be generated by Babel in my build script.

Should this be reported to Babel?

Support for back-references

/(s)\1/ui is currently transformed to /([s\u017F])\x01/i.

  1. Back references should probably not be transformed to hexadecimal escapes
  2. Canonicalizing the back reference's content is tricky, I'm not sure how this feature can be supported without canonicalizing the input string first, e.g. in /(s)\1/ui.test("s\u017f") == true.
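
The second point can be observed directly in an engine with native `u` support. In this sketch the ES5 form keeps the back-reference as `\1`, which is an assumption for illustration rather than regexpu's current output.

```javascript
// 's' followed by U+017F LATIN SMALL LETTER LONG S.
const input = 's\u017F';

// With both flags, the back-reference is compared using Unicode case
// folding, under which U+017F folds to 's'.
console.log(/(s)\1/ui.test(input)); // true

// Assumed ES5 form: the class is widened, but the back-reference is still
// compared with non-Unicode case-insensitivity, which keeps U+017F distinct.
const es5 = /([s\u017F])\1/i;
console.log(es5.test(input)); // false
console.log(es5.test('ss'));  // true
```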

Confirm whether my interpretation of the spec + assumptions are correct

  1. When the u flag is enabled, should inverse/uppercase character class escapes (e.g. \D) match all Unicode code points (rather than all BMP code points) except those in the lowercase variant of the character class escape (e.g. \d) set?
  2. When the u flag is enabled, should negated character classes (e.g. [^a]) match all Unicode code points (rather than BMP code points) except those in the set?

http://esdiscuss.org/topic/questions-regarding-es6-unicode-regular-expressions cc @allenwb @NorbertLindenberg
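
For reference, both questions can be checked against an engine that implements the `u` flag natively. This is a sketch of observable behavior, not a reading of the spec text.

```javascript
const symbol = '💩'; // U+1F4A9, an astral symbol (two UTF-16 code units)

// Without u, \D and [^a] each match one code unit, so they cannot span
// the whole surrogate pair.
console.log(/^\D$/.test(symbol));    // false
console.log(/^[^a]$/.test(symbol));  // false

// With u, both match a full code point, astral or not.
console.log(/^\D$/u.test(symbol));   // true
console.log(/^[^a]$/u.test(symbol)); // true
```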

`/^.$/us` is transpiled incorrectly

Current output:

const transpiled = /^(?:[\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?:[^\uD800-\uDBFF]|^)[\uDC00-\uDFFF])$/

Actual result:

transpiled.test('\u0001\udc00'); // true

Expected result:

/^.$/us.test('\u0001\udc00'); // false

clear input

With Firefox, if I click "Clear Recent History" with:

Time range to clear: Everything
Details:
- Cache
- Offline Website Data
- Site Preferences

Then, when I refresh the page, my previous input remains. The only workaround I've found is to open a private window.

`/./u` and `/[^x]/u` matching surrogate halves individually

Reported by Marja Hölttä:

var string = '𝌆𝌆';
var match = string.match(/(....)/u);
console.log(match[1]);

I checked that the same behavior occurs for other character classes too, like this:

var string = 'a𝌆b';
var match = string.match(/a([^c][^c])b/u);
console.log(match[1]); // 𝌆

And as a bonus, it transforms /(.+)\1/u to this:

> r = /((?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF])+)\1/
> r.test('𝜆𝌆𝌇') // code units: D835 DF06 D834 DF06 D834 DF07
true

…which is pretty surprising :)
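
The match is easier to see with the input written as code units. The sketch below reuses the pattern quoted in this report and compares it with the native result; the variable names are illustrative.

```javascript
// Pattern from the report: the first alternative accepts low surrogates
// (U+DC00 to U+DFFF) on their own, and the last accepts lone high surrogates.
const r = /((?:[\0-\t\x0B\f\x0E-\u2027\u202A-\uD7FF\uDC00-\uFFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF]|[\uD800-\uDBFF])+)\1/;

// Three distinct astral symbols: U+1D706, U+1D306, U+1D307.
const input = '\uD835\uDF06\uD834\uDF06\uD834\uDF07';

const match = r.exec(input);
// The match starts in the middle of the first symbol: the group captures
// the code units DF06 D834, which then repeat right after.
console.log(match.index);                 // 1
console.log(match[1] === '\uDF06\uD834'); // true

// Natively, no symbol repeats, so there is no match.
console.log(/(.+)\1/u.test(input));       // false
```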

Babel Plugin

Hi @mathiasbynens,

I'm one of the contributors to Babel. I was wondering if you might consider turning regexpu into a Babel plugin or something along those lines.

We want the npm download size to shrink and the extra dependencies that regexpu pulls in for transpilation are a big part of that.

Just interested in seeing what it would take for regexpu to switch?

`/[\u{11450}\u{11C50}\u{11C52}]/u`

Sorry if I'm reporting this bug in the wrong place; I'm not sure whether the online demo has the latest code.

For the regexp in the title, the last part is lost:
/(?:[\uD805\uD807]\uDC50)/, which matches only 2 code points, not 3.

Update:
It seems to be a bug in regenerate.js, in optimizeByLowSurrogates:

// String.fromCodePoint(0x11450) === String.fromCharCode(0xD805, 0xDC50)
// String.fromCodePoint(0x11C50) === String.fromCharCode(0xD807, 0xDC50)
// String.fromCodePoint(0x11C52) === String.fromCharCode(0xD807, 0xDC52)

var set = regenerate()
  .add(0x11450)
  .add(0x11C50)
  .add(0x11C52)
  ;
console.log(set.toString());
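
The loss can be confirmed without regenerate. In the sketch below, `expected` is a hand-written ES5 pattern for the three code points, included as an assumption for comparison; it is not the output of any library.

```javascript
const codePoints = [0x11450, 0x11C50, 0x11C52];
const symbols = codePoints.map((codePoint) => String.fromCodePoint(codePoint));

// Output quoted in this report: the low surrogate DC52 is missing.
const reported = /^(?:[\uD805\uD807]\uDC50)$/;
// Hand-written alternative that groups the low surrogates per high surrogate.
const expected = /^(?:\uD805\uDC50|\uD807[\uDC50\uDC52])$/;

console.log(symbols.map((s) => reported.test(s))); // [ true, true, false ]
console.log(symbols.map((s) => expected.test(s))); // [ true, true, true ]

// U+11452 (D805 DC52) is not in the set and must not match.
console.log(expected.test('\uD805\uDC52')); // false
```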
