fent / ret.js
Tokenizes a string that represents a regular expression.
License: MIT License
There is a problem with tokenizing a character class that contains newlines: they are not included in the output.
Given the character class [ \t\r\n]:
Current result:
[
{ type: 7, value: 32 },
{ type: 7, value: 9 }
]
Expected result:
[
{ type: 7, value: 32 },
{ type: 7, value: 9 },
{ type: 7, value: 13 },
{ type: 7, value: 10 }
]
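For reference, the engine itself does match these characters, so the tokenizer should emit all four CHAR tokens. A quick native check (plain JavaScript, independent of ret):

```javascript
// The character class from the report, tested against the engine directly:
const ws = /[ \t\r\n]/;
console.log(ws.test('\r')); // true
console.log(ws.test('\n')); // true
```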
The s flag should have implications for the behavior of ret, specifically what it outputs for the . expression, but it always outputs ranges as if the s flag were not specified.
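For comparison, the engine's own dotall semantics that ret would need to mirror (note that ret only receives the pattern source string, so the flag would have to be passed in some other way):

```javascript
// Without the s flag, `.` excludes line terminators; with it, `.` matches anything.
console.log(/^.$/.test('\n'));  // false
console.log(/^.$/s.test('\n')); // true
```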
I've noticed some discrepancies between the docs and the actual behavior.
Here's my test script:
var ret = require('ret');
console.log(ret.types);
var regexes = [/foo|bar/, /abc/, /(a+)+/, /(aa+)+/, /(a+){40}$/];
regexes.forEach(function(re) {
var pattern = re.source;
var tokens = ret(pattern);
console.log('pattern: /' + pattern + '/ , tokens: ' + JSON.stringify(tokens, null, 2));
});
This produces plenty of output, including:
pattern: /(a+)+/ , tokens: {
  "type": 0,
  "stack": [
    {
      "type": 5,
      "min": 1,
      "max": null,
      "value": {
        "type": 1,
        "stack": [
          {
            "type": 5,
            "min": 1,
            "max": null,
            "value": {
              "type": 7,
              "value": 97
            }
          }
        ],
        "remember": true
      }
    }
  ]
}
This illustrates several discrepancies.
I'm currently updating my genex package to TypeScript, and while writing typings for ret I noticed this strange mention in the documentation that doesn't seem to make sense to me:
SET Contains a key set specifying what tokens are allowed and a key not specifying if the set should be negated. A set can contain other sets, ranges, and characters.
In which cases would a SET token contain other SET tokens? I can't think of an example of this.
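For what it's worth, one case that appears to produce this is a predefined class inside a character class: \w inside [...] expands to a nested SET, as in the repository's own range tests. A sketch of the shape (an illustrative object using ret's documented type values, not captured ret output):

```javascript
// Hypothetical tokenization of '[\\w]': the outer SET contains the SET
// that \w expands to (underscore, a-z, A-Z, 0-9).
const outerSet = {
  type: 3,    // SET
  not: false,
  set: [{
    type: 3,  // nested SET produced by \w
    not: false,
    set: [
      { type: 7, value: 95 },          // CHAR '_'
      { type: 4, from: 97, to: 122 },  // RANGE a-z
      { type: 4, from: 65, to: 90 },   // RANGE A-Z
      { type: 4, from: 48, to: 57 }    // RANGE 0-9
    ]
  }]
};
console.log(outerSet.set[0].type); // 3: a SET inside a SET
```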
Also, regarding typings, would you be willing to bundle them if I submit a PR? I currently have this:
declare function ret(input: string): ret.Root;

declare namespace ret {
  enum types {
    ROOT = 0,
    GROUP = 1,
    POSITION = 2,
    SET = 3,
    RANGE = 4,
    REPETITION = 5,
    REFERENCE = 6,
    CHAR = 7,
  }

  type Token = Group | Position | Set | Range | Repetition | Reference | Char;
  type Tokens = Root | Token;

  type Root = {
    type: types.ROOT;
    stack?: Token[];
    options?: Token[][];
  };

  type Group = {
    type: types.GROUP;
    remember: boolean;
    followedBy: boolean;
    notFollowedBy: boolean;
    stack?: Token[];
    options?: Token[][];
  };

  type Position = {
    type: types.POSITION;
    value: "^" | "$" | "B" | "b";
  };

  type Set = {
    type: types.SET;
    set: (Set | Range | Char)[];
    not: boolean;
  };

  type Range = {
    type: types.RANGE;
    from: number;
    to: number;
  };

  type Repetition = {
    type: types.REPETITION;
    min: number;
    max: number;
    value: Token;
  };

  type Reference = {
    type: types.REFERENCE;
    value: 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9;
  };

  type Char = {
    type: types.CHAR;
    value: number;
  };
}

export = ret;
I was attempting to resolve the issue noted in #25 (sorry for the delay, by the way). However, it appears that some of the suggested problem cases are not correctly tokenized to begin with.
For example, [2-\\]] should tokenize to:
{
  type: types.ROOT, stack: [{
    type: types.SET, not: false, set: [
      { type: types.RANGE, from: 50, to: 93 }
    ],
  }]
}
but it currently produces:
{
  type: types.ROOT, stack: [{
    type: types.SET, not: false, set: [
      { type: types.RANGE, from: 50, to: 92 }
    ],
  }, {
    type: types.CHAR,
    value: 93
  }]
}
Note that I have tested this on the codebase before and after it was rewritten in TypeScript, and the error is present in both versions.
I would also recommend adding the following to the main test file:
'Range (in set) test cases': {
  'Testing complex range cases': {

    'token.from is a hyphen and the range is preceded by a single character [a\\--\\-]': {
      'topic': ret('[a\\--\\-]'),

      'Tokenizes correctly': (t) => {
        assert.deepStrictEqual(t, {
          type: types.ROOT, stack: [{
            type: types.SET, not: false, set: [
              { type: types.CHAR, value: 97 },
              { type: types.RANGE, from: 45, to: 45 }
            ],
          }]
        })
      }
    },

    'token.from is a hyphen and the range is preceded by a single character [a\\--\\/]': {
      'topic': ret('[a\\--\\/]'),

      'Tokenizes correctly': (t) => {
        assert.deepStrictEqual(t, {
          type: types.ROOT, stack: [{
            type: types.SET, not: false, set: [
              { type: types.CHAR, value: 97 },
              { type: types.RANGE, from: 45, to: 47 }
            ],
          }]
        })
      }
    },

    'token.from is a hyphen and the range is preceded by a single character [c\\--a]': {
      'topic': ret('[c\\--a]'),

      'Tokenizes correctly': (t) => {
        assert.deepStrictEqual(t, {
          type: types.ROOT, stack: [{
            type: types.SET, not: false, set: [
              { type: types.CHAR, value: 99 },
              { type: types.RANGE, from: 45, to: 97 }
            ],
          }]
        })
      }
    },

    'token.from is a hyphen and the range is preceded by a single character [\\-\\--\\-]': {
      'topic': ret('[\\-\\--\\-]'),

      'Tokenizes correctly': (t) => {
        assert.deepStrictEqual(t, {
          type: types.ROOT, stack: [{
            type: types.SET, not: false, set: [
              { type: types.CHAR, value: 45 },
              { type: types.RANGE, from: 45, to: 45 }
            ],
          }]
        })
      }
    },

    'token.from is a hyphen and the range is preceded by a predefined set [\\w\\--\\-]': {
      'topic': ret('[\\w\\--\\-]'),

      'Tokenizes correctly': (t) => {
        assert.deepStrictEqual(t, {
          type: types.ROOT, stack: [{
            type: types.SET, not: false, set: [
              {
                type: types.SET, not: false, set: [
                  { type: types.CHAR, value: 95 },
                  { type: types.RANGE, from: 97, to: 122 },
                  { type: types.RANGE, from: 65, to: 90 },
                  { type: types.RANGE, from: 48, to: 57 }
                ]
              },
              { type: types.RANGE, from: 45, to: 45 }
            ],
          }]
        })
      }
    },

    'token.from is a caret and the range is the first item of the set [\\^-9]': {
      'topic': ret('[\\^-9]'),

      'Tokenizes correctly': (t) => {
        assert.deepStrictEqual(t, {
          type: types.ROOT, stack: [{
            type: types.SET, not: false, set: [
              { type: types.RANGE, from: 94, to: 57 }
            ],
          }]
        })
      }
    },

    'token.to is a closing square bracket [2-\\]]': {
      'topic': ret('[2-\\]]'),

      'Tokenizes correctly': (t) => {
        assert.deepStrictEqual(t, {
          type: types.ROOT, stack: [{
            type: types.SET, not: false, set: [
              { type: types.RANGE, from: 50, to: 93 }
            ],
          }]
        })
      }
    },

    'token.to is a closing square bracket [\\^-\\]]': {
      'topic': ret('[\\^-\\]]'),

      'Tokenizes correctly': (t) => {
        assert.deepStrictEqual(t, {
          type: types.ROOT, stack: [{
            type: types.SET, not: false, set: [
              { type: types.RANGE, from: 94, to: 93 }
            ],
          }]
        })
      }
    },

    'token.to is a closing square bracket [[-\\]]': {
      'topic': ret('[[-\\]]'),

      'Tokenizes correctly': (t) => {
        assert.deepStrictEqual(t, {
          type: types.ROOT, stack: [{
            type: types.SET, not: false, set: [
              { type: types.RANGE, from: 91, to: 93 }
            ],
          }]
        })
      }
    },

    'Contains empty set': {
      'topic': ret('[]'),

      'Tokenizes correctly': (t) => {
        assert.deepStrictEqual(t, {
          type: types.ROOT, stack: [{
            type: types.SET, not: false, set: [],
          }]
        })
      }
    },

    'Contains empty negated set': {
      'topic': ret('[^]'),

      'Tokenizes correctly': (t) => {
        assert.deepStrictEqual(t, {
          type: types.ROOT, stack: [{
            type: types.SET, not: true, set: [],
          }]
        })
      }
    },
  }
}
I think I found a bug related to the way the - character is parsed:
var ret = require('ret'), util = require('util');
console.log(util.inspect(ret(/[01]-[ab]/.source), false, null, true));
Output:
{
  "type": ret.types.ROOT,
  "stack": [
    {
      "type": ret.types.SET,
      "set": [
        { "type": ret.types.CHAR, "value": 48 },
        { "type": ret.types.CHAR, "value": 49 },
        { "type": ret.types.RANGE, "from": 93, "to": 91 },
        { "type": ret.types.CHAR, "value": 97 },
        { "type": ret.types.CHAR, "value": 98 }
      ],
      "not": false
    }
  ]
}
Expected output:
{
  "type": ret.types.ROOT,
  "stack": [
    {
      "type": ret.types.SET,
      "set": [
        { "type": ret.types.CHAR, "value": 48 },
        { "type": ret.types.CHAR, "value": 49 }
      ],
      "not": false
    },
    { "type": ret.types.CHAR, "value": 45 },
    {
      "type": ret.types.SET,
      "set": [
        { "type": ret.types.CHAR, "value": 97 },
        { "type": ret.types.CHAR, "value": 98 }
      ],
      "not": false
    }
  ]
}
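The expected output matches the engine's own semantics; a hyphen between two character classes is just a literal character:

```javascript
// A '-' outside a character class is a literal hyphen:
console.log(/[01]-[ab]/.test('0-a')); // true
console.log(/[01]-[ab]/.test('0a'));  // false: the hyphen is required
```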
Another bug related to the SET
parse tree:
var ret = require('ret'), util = require('util');
console.log(util.inspect(ret(/[]]/.source), false, null, true));
Output:
{
  "type": ret.types.ROOT,
  "stack": [
    {
      "type": ret.types.SET,
      "set": [],
      "not": false
    },
    { "type": ret.types.CHAR, "value": 93 }
  ]
}
Expected output:
{
  "type": ret.types.ROOT,
  "stack": [
    {
      "type": ret.types.SET,
      "set": [
        { "type": ret.types.CHAR, "value": 93 }
      ],
      "not": false
    }
  ]
}
Strangely, [[] produces the correct parse tree whereas []] doesn't; I assume it's related to greediness.
When is types.RANGE used? I cannot seem to write an expression that parses to it. In addition, I cannot find any code that generates them.
[^.], [^\.] and [^\\.] all tokenize to:
{
  "type": 0,
  "stack": [{
    "type": 3,
    "set": [{
      "type": 7,
      "value": 46
    }],
    "not": true
  }]
}
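For what it's worth, explicit ranges inside a character class are what should produce RANGE tokens, judging by the shapes in the docs. A sketch of what '[a-z]' would tokenize to (an illustrative object, not captured ret output):

```javascript
// Hypothetical tokens for ret('[a-z]'): a ROOT whose stack holds a SET
// containing a single RANGE from 'a' (97) to 'z' (122).
const tokens = {
  type: 0,  // ROOT
  stack: [{
    type: 3,  // SET
    not: false,
    set: [{ type: 4, from: 97, to: 122 }]  // RANGE
  }]
};
console.log(tokens.stack[0].set[0].type); // 4: types.RANGE
```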
To see what happens to your code in Node.js 10, Greenkeeper has created a branch with changes to the following files:
- .travis.yml
- package.json
If you're interested in upgrading this repo to Node.js 10, you can open a PR with these changes. Please note that this issue is just intended as a friendly reminder and the PR as a possible starting point for getting your code running on Node.js 10.
Greenkeeper has checked the engines key in any package.json file, the .nvmrc file, and the .travis.yml file, if present.
- engines was only updated if it defined a single version, not a range.
- .nvmrc was updated to Node.js 10.
- .travis.yml was only changed if there was a root-level node_js that didn't already include Node.js 10, such as node or lts/*. In this case, the new version was appended to the list. We didn't touch job or matrix configurations because these tend to be quite specific and complex, and it's difficult to infer what the intentions were.
For many simpler .travis.yml configurations, this PR should suffice as-is, but depending on what you're doing it may require additional work or may not be applicable at all. We're also aware that you may have good reasons to not update to Node.js 10, which is why this was sent as an issue and not a pull request. Feel free to delete it without comment, I'm a humble robot and won't feel rejected 🤖
There is a collection of frequently asked questions. If those don’t help, you can always ask the humans behind Greenkeeper.
Your Greenkeeper Bot 🌴
Parsing (.+)\1+ yields, among other things, this REPETITION token with a value field. I don't see this in the docs.
{
  "type": 5,
  "min": 1,
  "max": null,
  "value": {
    "type": 6,
    "value": 1
  }
}
const ret = require('ret');
let tokens = ret(/^\/store\/(?:([^\/]+?))/.source);
The question mark after the plus in the regex is a lazy quantifier, but it is treated as the optional quantifier, as seen in the result:
{
  "type": 0,
  "stack": [ { ...
    { "type": 1, "stack": [
      { "type": 1, "stack": [
        { "type": 5,
          "min": 0,   <-- this means that it is optional
          "max": 1,
          "value": { ... } } } ] ... } ] }
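For reference, the engine distinguishes the two: +? is a lazy one-or-more repetition, while ? alone makes the preceding item optional:

```javascript
console.log(/a+?/.exec('aaa')[0]);  // 'a': lazy, matches as little as possible
console.log(/a+/.exec('aaa')[0]);   // 'aaa': greedy
console.log(/(a+)?b/.exec('b')[0]); // 'b': here `?` makes the group optional
```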
It would be nice to support the latest RegExp JavaScript features:
- \p and \P: Unicode property escapes
- (?<group>) and \k<group>: Named groups (#43)
- (?<=) and (?<!): Lookbehind assertions (positive and negative)
They are all now part of the ECMAScript standard. Node 9 does not support them but Node 10 will.
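For reference, what these look like in an engine that supports them (runs on Node 10+):

```javascript
// Unicode property escape
console.log(/\p{Letter}/u.test('a'));                      // true
// Named group, accessed via the match's groups object
console.log(/(?<year>\d{4})/.exec('in 2018').groups.year); // '2018'
// Positive lookbehind
console.log(/(?<=\$)\d+/.exec('$42')[0]);                  // '42'
```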
Thanks for this project, it's really useful.
https://hackernoon.com/the-madness-of-parsing-real-world-javascript-regexps-d9ee336df983#.2l8qu3l76
The article gives the following test cases:
/\1/ // Matches Unicode code point 1 aka Ctrl-A
/()\1/ // Empty capture followed by a backreference to that capture
/()\01/ // Empty capture followed by code point 1
/\11/ // Match a tab character, which is code point 9!
/\18/ // Match code point 1, followed by "8"
/\176/ // Match a tilde, "~"
/\400/ // Match a space followed by a zero
The rule is that the whole number is taken as a decimal backreference number, but if it has leading zeros or it is out of range (there are not enough capture parentheses) we abandon that interpretation, switch number base, and reinterpret it as up to 3 digits of octal escape up to 255 (\377), possibly followed by literal numbers.
Every time I implement a parser for this, I’m convinced I can parse it in one pass, and every time I am wrong and have to do it with a two-pass algorithm (the first one just counts the captures).
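The rule above can be sketched as a small helper. This is a hypothetical function for illustration, not ret's actual implementation; it assumes the capture count has already been gathered by a first pass:

```javascript
// Hypothetical sketch of the disambiguation rule: a \NNN escape is a
// backreference only if it has no leading zero and its decimal value
// does not exceed the number of capture groups; otherwise reinterpret
// up to 3 octal digits (max \377, i.e. 255) as a code point, with any
// remaining digits taken literally.
function classifyEscape(digits, captureCount) {
  if (digits[0] !== '0' && parseInt(digits, 10) <= captureCount) {
    return { kind: 'backreference', index: parseInt(digits, 10) };
  }
  // Consume up to 3 octal digits whose value stays <= 0o377.
  let octal = '';
  for (const d of digits) {
    if (d > '7' || parseInt(octal + d, 8) > 0o377) break;
    octal += d;
  }
  const rest = digits.slice(octal.length);
  return { kind: 'octal', codePoint: parseInt(octal || '0', 8), literal: rest };
}

console.log(classifyEscape('11', 0));  // octal: code point 9 (tab)
console.log(classifyEscape('1', 1));   // backreference to group 1
console.log(classifyEscape('400', 0)); // octal: code point 32, literal '0'
```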
Taking a hint from Go, whose regexp/syntax package offers a Simplify method: I know it's not trivial, but it would be super useful to have this in ret.js as well.
Basic simplifications that remove redundancy would be more useful than advanced optimizations, for instance:
/(a|a|a)/ => /(a)/
/(a+)+/ => /(a+)/
/a{0,2}?/ => /a{0,2}/
/(a{0,2})?/ => /(a{0,2})/
/(?:a+)+/ => /(?:a)+/
If I have an expression /...\10/, the reference will be parsed as '1' rather than '10'. This is confirmed by the fact that only one digit is tested here.