fent / ret.js

Tokenizes a string that represents a regular expression.

License: MIT License

JavaScript 71.75% TypeScript 28.25%

node javascript regular-expressions parser

ret.js's Issues

Tokenize character class with new lines

Character classes that contain newline characters are tokenized incorrectly: the newline characters are missing from the result.

Given: character class [ \t\r\n]

Current result:

[
  { type: 7, value: 32 },
  { type: 7, value: 9 }
]

Expected result:

[
  { type: 7, value: 32 },
  { type: 7, value: 9 },
  { type: 7, value: 13 },
  { type: 7, value: 10 }
]
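For reference, the four expected values are just the character codes of space, tab, carriage return, and line feed. The sketch below checks that in plain JavaScript; it does not call ret, and `members` and `expected` are illustrative names.

```javascript
// Characters inside the class [ \t\r\n], written as real characters.
const members = [' ', '\t', '\r', '\n'];

// Build the CHAR tokens the issue expects (type 7 is CHAR in ret.types).
const expected = members.map((ch) => ({ type: 7, value: ch.charCodeAt(0) }));

console.log(expected.map((t) => t.value)); // [ 32, 9, 13, 10 ]
```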

No support for 's' flag

The 's' flag should affect ret's behavior, specifically what it outputs for the . expression, but ret always outputs ranges as if the s flag were not specified.
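To make the difference concrete, here is a small check against the native RegExp engine (not ret). It assumes a runtime that supports the s flag (ES2018 or later); the variable names are illustrative.

```javascript
// The four line terminators that '.' skips unless the s flag is set.
const lineTerminators = ['\n', '\r', '\u2028', '\u2029'];

// Without s: '.' matches none of them.
const withoutS = lineTerminators.filter((ch) => /./.test(ch));

// With s (dotAll): '.' matches all of them.
const withS = lineTerminators.filter((ch) => /./s.test(ch));

console.log(withoutS.length, withS.length); // 0 4
```

A tokenizer that honored the flag would presumably need to emit a different set for . in each mode.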

Documentation doesn't match behavior

I've noticed some discrepancies between the docs and the actual behavior.

Here's my test script:

var ret = require('ret');

console.log(ret.types);

var regexes = [/foo|bar/, /abc/, /(a+)+/, /(aa+)+/, /(a+){40}$/];
regexes.forEach(function(re) {
  var pattern = re.source;
  var tokens = ret(pattern);

  console.log('pattern: /' + pattern + '/ , tokens: ' + JSON.stringify(tokens, null, 2));
});

This produces plenty of output, including:

pattern: /(a+)+/ , tokens: {
  "type": 0,
  "stack": [
    {
      "type": 5,
      "min": 1,
      "max": null,
      "value": {
        "type": 1,
        "stack": [
          {
            "type": 5,
            "min": 1,
            "max": null,
            "value": {
              "type": 7,
              "value": 97
            }
          }
        ],
        "remember": true
      }
    }
  ]
}

This illustrates several discrepancies:

The 'max' field for the repetition is 'null', not 'Infinity'. This seems to be a JSON thing, since the source clearly sets the value of max to 'Infinity'.
The 'value' field for the REPETITION token type is not stated in the docs.
The GROUP token type seems to have either a 'stack' or an 'options' field. This is not clear in the docs.
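The first point can be reproduced without ret: JSON has no representation for Infinity, so JSON.stringify writes null. The sketch below uses a hand-written token shaped like the output above, plus a replacer (`showInfinity`, an illustrative helper, not part of ret) that keeps the value visible when printing.

```javascript
// A hand-written token shaped like a REPETITION for /a+/.
const token = { type: 5, min: 1, max: Infinity, value: { type: 7, value: 97 } };

// JSON cannot encode Infinity, so it is serialized as null.
const plain = JSON.stringify(token);
console.log(plain); // {"type":5,"min":1,"max":null,"value":{"type":7,"value":97}}

// Illustrative replacer: print Infinity as a string instead of null.
const showInfinity = (key, value) => (value === Infinity ? 'Infinity' : value);
const readable = JSON.stringify(token, showInfinity);
console.log(readable); // {"type":5,"min":1,"max":"Infinity","value":{"type":7,"value":97}}
```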

Typings and SET Clarification

I'm currently updating my genex package to TypeScript, and while writing typings for ret I noticed this mention in the documentation that doesn't make sense to me:

SET Contains a key set specifying what tokens are allowed and a key not specifying if the set should be negated. A set can contain other sets, ranges, and characters.

In which cases would a SET token return other SET tokens? I can't think of an example for this.
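One case shows up in the suggested tests of another issue on this page: a predefined class such as \w inside brackets is reported as a SET nested in the outer SET. Consumers that want a flat list could inline such sets. The sketch below is a hypothetical helper, not part of ret, and it deliberately leaves negated nested sets (such as \W) alone because they cannot be inlined without changing the meaning.

```javascript
const types = { SET: 3, RANGE: 4, CHAR: 7 }; // values from ret.types

// Hypothetical helper: inline nested, non-negated sets into the parent.
function flattenSet(set) {
  const out = [];
  for (const token of set.set) {
    if (token.type === types.SET && !token.not) {
      out.push(...flattenSet(token).set);
    } else {
      out.push(token); // chars, ranges, and negated sets stay as they are
    }
  }
  return { type: types.SET, not: set.not, set: out };
}

// An outer set holding the \w set followed by a hyphen.
const outer = {
  type: types.SET, not: false, set: [
    { type: types.SET, not: false, set: [
      { type: types.CHAR, value: 95 },
      { type: types.RANGE, from: 97, to: 122 },
      { type: types.RANGE, from: 65, to: 90 },
      { type: types.RANGE, from: 48, to: 57 },
    ] },
    { type: types.CHAR, value: 45 },
  ],
};

const flat = flattenSet(outer);
console.log(flat.set.length); // 5
```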

Also, regarding typings, would you be willing to bundle them if I submit a PR? I currently have this:

declare function ret(input: string): ret.Root;

declare namespace ret {
  enum types {
    ROOT = 0,
    GROUP = 1,
    POSITION = 2,
    SET = 3,
    RANGE = 4,
    REPETITION = 5,
    REFERENCE = 6,
    CHAR = 7,
  }

  type Token = Group | Position | Set | Range | Repetition | Reference | Char;
  type Tokens = Root | Token;

  type Root = {
    type: types.ROOT;
    stack?: Token[];
    options?: Token[][];
  };

  type Group = {
    type: types.GROUP;
    remember: boolean;
    followedBy: boolean;
    notFollowedBy: boolean;
    stack?: Token[];
    options?: Token[][];
  };

  type Position = {
    type: types.POSITION;
    value: "^" | "$" | "B" | "b";
  };

  type Set = {
    type: types.SET;
    set: (Set | Range | Char)[];
    not: boolean;
  };

  type Range = {
    type: types.RANGE;
    from: number;
    to: number;
  };

  type Repetition = {
    type: types.REPETITION;
    min: number;
    max: number;
    value: Token;
  };

  type Reference = {
    type: types.REFERENCE;
    value: 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9;
  };

  type Char = {
    type: types.CHAR;
    value: number;
  };
}

export = ret;

Bug in set tokenization when '\]' is in the set

I was attempting to resolve the issue noted in #25 (sorry for the delay, by the way). However, it appears that some of the suggested problem cases are not tokenized correctly to begin with.

Test Case `[2-\\]]`:

Expected Output:

{
  type: types.ROOT, stack: [{
    type: types.SET, not: false, set: [
      { type: types.RANGE, from: 50, to: 93 }
    ],
  }]
}

Actual Output:

{
  type: types.ROOT, stack: [{
    type: types.SET, not: false, set: [
      { type: types.RANGE, from: 50, to: 92 }
    ],
  }, {
   type: types.CHAR,
   value: 93
  }]
}

Note that I have tested this on the codebase both before and after it was rewritten in TypeScript, and the error is present in both versions.
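The actual output suggests the class is being closed at the escaped bracket. A minimal sketch of escape-aware scanning is shown below; `findClassEnd` is a hypothetical helper written for illustration and is not how ret is implemented.

```javascript
// Hypothetical helper: given the index of '[', return the index of the
// ']' that closes the class, skipping escaped characters such as '\]'.
function findClassEnd(source, open) {
  for (let i = open + 1; i < source.length; i++) {
    if (source[i] === '\\') {
      i++; // skip the escaped character, whatever it is
    } else if (source[i] === ']') {
      return i;
    }
  }
  return -1; // unterminated class
}

const source = '[2-\\]]'; // the six characters [2-\]]
console.log(findClassEnd(source, 0)); // 5
```

With the closing bracket found at index 5, the escaped bracket at index 4 stays inside the class and can become the end of the range.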

I would also recommend adding the following to the main test file:

    'Range (in set) test cases': {
      'Testing complex range cases': {
        'token.from is a hyphen and the range is preceded by a single character [a\\--\\-]': {
          'topic': ret('[a\\--\\-]'),
          'Tokenizes correctly': (t) => {
            assert.deepStrictEqual(t, {
              type: types.ROOT, stack: [{
                type: types.SET, not: false, set: [
                  { type: types.CHAR, value: 97 },
                  { type: types.RANGE, from: 45, to: 45 }
                ],
              }]
            })
          }
        },
        'token.from is a hyphen and the range is preceded by a single character [a\\--\\/]': {
          'topic': ret('[a\\--\\/]'),
          'Tokenizes correctly': (t) => {
            assert.deepStrictEqual(t, {
              type: types.ROOT, stack: [{
                type: types.SET, not: false, set: [
                  { type: types.CHAR, value: 97 },
                  { type: types.RANGE, from: 45, to: 47 }
                ],
              }]
            })
          }
        },
        'token.from is a hyphen and the range is preceded by a single character [c\\--a]': {
          'topic': ret('[c\\--a]'),
          'Tokenizes correctly': (t) => {
            assert.deepStrictEqual(t, {
              type: types.ROOT, stack: [{
                type: types.SET, not: false, set: [
                  { type: types.CHAR, value: 99 },
                  { type: types.RANGE, from: 45, to: 97 }
                ],
              }]
            })
          }
        },
        'token.from is a hyphen and the range is preceded by a single character [\\-\\--\\-]': {
          'topic': ret('[\\-\\--\\-]'),
          'Tokenizes correctly': (t) => {
            assert.deepStrictEqual(t, {
              type: types.ROOT, stack: [{
                type: types.SET, not: false, set: [
                  { type: types.CHAR, value: 45 },
                  { type: types.RANGE, from: 45, to: 45 }
                ],
              }]
            })
          }
        },
        'token.from is a hyphen and the range is preceded by a predefined set [\\w\\--\\-]': {
          'topic': ret('[\\w\\--\\-]'),
          'Tokenizes correctly': (t) => {
            assert.deepStrictEqual(t, {
              type: types.ROOT, stack: [{
                type: types.SET, not: false, set: [
                  {
                    type: types.SET, not: false, set: [
                      { type: types.CHAR, value: 95 },
                      { type: types.RANGE, from: 97, to: 122 },
                      { type: types.RANGE, from: 65, to: 90 },
                      { type: types.RANGE, from: 48, to: 57 }
                    ]
                  },
                  { type: types.RANGE, from: 45, to: 45 }
                ],
              }]
            })
          }
        },
        'token.from is a caret and the range is the first item of the set [\\^-9]': {
          'topic': ret('[\\^-9]'),
          'Tokenizes correctly': (t) => {
            assert.deepStrictEqual(t, {
              type: types.ROOT, stack: [{
                type: types.SET, not: false, set: [
                  { type: types.RANGE, from: 94, to: 57 }
                ],
              }]
            })
          }
        },
        'token.to is a closing square bracket [2-\\]]': {
          'topic': ret('[2-\\]]'),
          'Tokenizes correctly': (t) => {
            assert.deepStrictEqual(t, {
              type: types.ROOT, stack: [{
                type: types.SET, not: false, set: [
                  { type: types.RANGE, from: 50, to: 93 }
                ],
              }]
            })
          }
        },
        'token.to is a closing square bracket [\\^-\\]]': {
          'topic': ret('[\\^-\\]]'),
          'Tokenizes correctly': (t) => {
            assert.deepStrictEqual(t, {
              type: types.ROOT, stack: [{
                type: types.SET, not: false, set: [
                  { type: types.RANGE, from: 94, to: 93 }
                ],
              }]
            })
          }
        },
        'token.to is a closing square bracket [[-\\]]': {
          'topic': ret('[[-\\]]'),
          'Tokenizes correctly': (t) => {
            assert.deepStrictEqual(t, {
              type: types.ROOT, stack: [{
                type: types.SET, not: false, set: [
                  { type: types.RANGE, from: 91, to: 93 }
                ],
              }]
            })
          }
        },
        'Contains empty set': {
          'topic': ret('[]'),
          'Tokenizes correctly': (t) => {
            assert.deepStrictEqual(t, {
              type: types.ROOT, stack: [{
                type: types.SET, not: false, set: [],
              }]
            })
          }
        },
        'Contains empty negated set': {
          'topic': ret('[^]'),
          'Tokenizes correctly': (t) => {
            assert.deepStrictEqual(t, {
              type: types.ROOT, stack: [{
                type: types.SET, not: true, set: [],
              }]
            })
          }
        },
      }
    }

Wrongfully parsed SET

I think I found a bug related to the way the - character is parsed:

var ret = require('ret'), util = require('util');

console.log(util.inspect(ret(/[01]-[ab]/.source), false, null, true));

Output:

{
    "type": ret.types.ROOT,
    "stack": [
        {
            "type": ret.types.SET,
            "set": [
                {
                    "type": ret.types.CHAR,
                    "value": 48
                },
                {
                    "type": ret.types.CHAR,
                    "value": 49
                },
                {
                    "type": ret.types.RANGE,
                    "from": 93,
                    "to": 91
                },
                {
                    "type": ret.types.CHAR,
                    "value": 97
                },
                {
                    "type": ret.types.CHAR,
                    "value": 98
                }
            ],
            "not": false
        }
    ]
}

Expected Output:

{
    "type": ret.types.ROOT,
    "stack": [
        {
            "type": ret.types.SET,
            "set": [
                {
                    "type": ret.types.CHAR,
                    "value": 48
                },
                {
                    "type": ret.types.CHAR,
                    "value": 49
                }
            ],
            "not": false
        },
        {
            "type": ret.types.CHAR,
            "value": 45
        },
        {
            "type": ret.types.SET,
            "set": [
                {
                    "type": ret.types.CHAR,
                    "value": 97
                },
                {
                    "type": ret.types.CHAR,
                    "value": 98
                }
            ],
            "not": false
        },
    ]
}
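As a sanity check on the expected tree, the native RegExp engine (not ret) treats the hyphen between the two classes as a literal character:

```javascript
const re = /[01]-[ab]/;

// Two separate classes joined by a literal '-'.
const matchesFull = re.test('0-a');     // true
const matchesOther = re.test('1-b');    // true

// A single digit is not enough: '-' and one of [ab] must follow.
const matchesDigitOnly = re.test('0');  // false

console.log(matchesFull, matchesOther, matchesDigitOnly); // true true false
```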

Incoherent parsed SET

Another bug related to the SET parse tree:

var ret = require('ret'), util = require('util');

console.log(util.inspect(ret(/[]]/.source), false, null, true));

Output:

{
    "type": ret.types.ROOT,
    "stack": [
        {
            "type": ret.types.SET,
            "set": [],
            "not": false
        },
        {
            "type": ret.types.CHAR,
            "value": 93
        }
    ]
}

Expected output:

{
    "type": ret.types.ROOT,
    "stack": [
        {
            "type": ret.types.SET,
            "set": [
                {
                    "type": ret.types.CHAR,
                    "value": 93
                }
            ],
            "not": false
        }
    ]
}

Strangely, [[] produces the correct parse tree whereas []] doesn't. I assume it's related to greediness.
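For comparison, this is how the native JavaScript engine (without the u flag) reads these patterns: [] is an empty class that can never match, and the ] after it is a literal. That is the same shape as the reported output, which may be why ret produces it, although the issue does not confirm this. Other regex flavors treat a leading ] as a class member, which matches the expected output above.

```javascript
// In JavaScript, [] is an empty class, so /[]]/ is "nothing, then ']'".
const emptyThenBracket = /[]]/.test(']'); // false: the empty class never matches

// An opening bracket inside a class is an ordinary member.
const openBracket = /[[]/.test('[');      // true

// Escaping is the portable way to put ']' inside a class.
const escaped = /[\]]/.test(']');         // true

console.log(emptyThenBracket, openBracket, escaped); // false true true
```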

types.RANGE

When is types.RANGE used? I cannot seem to write an expression that parses to it. In addition, I cannot find any code that generates them.

`[^.]`, `[^\.]` and `[^\\.]` have the same token

[^.], [^\.] and [^\\.] all tokenize to

{
  "type":0,
  "stack":
    [{
      "type":3,
      "set":
        [{
          "type":7,
          "value":46
         }],
       "not":true
    }]
}
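If the three patterns were passed to ret as JavaScript string literals (an assumption; the issue does not say), identical tokens are what one would expect. The first two are the same string, and the third is the regex source [^\.], where the escaped dot means the same as a bare dot inside a class. The sketch uses only native strings and RegExp.

```javascript
// Assumption: the three patterns were written as string literals.
const plain = '[^.]';
const singleEscape = '[^\.]';  // '\.' in a string literal is just '.'
const doubleEscape = '[^\\.]'; // the regex source [^\.], an escaped dot

console.log(plain === singleEscape); // true
console.log(doubleEscape.length);    // 5

// Inside a class, an escaped dot and a bare dot mean the same thing.
const sameMeaning = ['.', 'a', '\\'].every(
  (ch) => new RegExp(plain).test(ch) === new RegExp(doubleEscape).test(ch)
);
console.log(sameMeaning); // true
```

If the third pattern was instead meant as the regex source [^\\.] (excluding both a backslash and a dot), it would need to be written as '[^\\\\.]' in a string literal.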

Version 10 of node.js has been released

Version 10 of Node.js (code name Dubnium) has been released! 🎊

To see what happens to your code in Node.js 10, Greenkeeper has created a branch with the following changes:

Added the new Node.js version to your .travis.yml
The new Node.js version is in-range for the engines in 1 of your package.json files, so that was left alone

If you’re interested in upgrading this repo to Node.js 10, you can open a PR with these changes. Please note that this issue is just intended as a friendly reminder and the PR as a possible starting point for getting your code running on Node.js 10.

More information on this issue

Greenkeeper has checked the engines key in any package.json file, the .nvmrc file, and the .travis.yml file, if present.

engines was only updated if it defined a single version, not a range.
.nvmrc was updated to Node.js 10
.travis.yml was only changed if there was a root-level node_js that didn’t already include Node.js 10, such as node or lts/*. In this case, the new version was appended to the list. We didn’t touch job or matrix configurations because these tend to be quite specific and complex, and it’s difficult to infer what the intentions were.

For many simpler .travis.yml configurations, this PR should suffice as-is, but depending on what you’re doing it may require additional work or may not be applicable at all. We’re also aware that you may have good reasons to not update to Node.js 10, which is why this was sent as an issue and not a pull request. Feel free to delete it without comment, I’m a humble robot and won’t feel rejected 🤖

FAQ and help

There is a collection of frequently asked questions. If those don’t help, you can always ask the humans behind Greenkeeper.

Your Greenkeeper Bot 🌴

npm README is out of sync with GitHub README

Parsing (.+)\1+ yields, among other things, this REPETITION token with a value field. I don't see this in the docs.

        {   
            "type": 5,
            "min": 1,
            "max": null,
            "value": {
                "type": 6,
                "value": 1
            }   
        }
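Code that walks the token tree has to follow this value field as well as stack and options. A minimal sketch of such a walker, assuming the token shapes shown in the issues on this page (`walk` is a hypothetical helper, not part of ret):

```javascript
// Hypothetical walker: visit every token in a tree.
function walk(token, visit) {
  visit(token);
  if (token.stack) token.stack.forEach((t) => walk(t, visit));
  if (token.options) {
    token.options.forEach((option) => option.forEach((t) => walk(t, visit)));
  }
  // REPETITION wraps a single token in `value`; CHAR and REFERENCE store
  // a number there, so only recurse when the value is an object.
  if (token.value !== null && typeof token.value === 'object') {
    walk(token.value, visit);
  }
}

// Hand-written tree shaped like the \1+ part of the output above.
const tree = {
  type: 0,
  stack: [{ type: 5, min: 1, max: Infinity, value: { type: 6, value: 1 } }],
};

const seen = [];
walk(tree, (t) => seen.push(t.type));
console.log(seen); // [ 0, 5, 6 ]
```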

Lazy quantifiers are not considered

const ret = require('ret');

let tokens = ret(/^\/store\/(?:([^\/]+?))/.source);

The question mark after the plus in the regex marks a lazy quantifier, but it is treated as an optional quantifier. In the result:

{ "type": 0, "stack": [ { .... { "type": 1, "stack": [ { "type": 1, "stack": [ { "type": 5, "min": 0 # this means it is optional, "max": 1, "value": { ... } } } ] ... } ] }
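The two readings give different matches in the native engine, which is why the distinction matters to anything that consumes the tokens. A small comparison using plain RegExp (not ret):

```javascript
// Greedy: takes as many characters as possible.
const greedy = 'aaa'.match(/a+/)[0];          // 'aaa'

// Lazy: '?' after a quantifier makes it take as few as possible.
const lazy = 'aaa'.match(/a+?/)[0];           // 'a'

// Optional: what the reported tokens describe, (a+) repeated 0 to 1 times.
const optional = 'aaa'.match(/(?:a+)?/)[0];   // 'aaa'

console.log(greedy, lazy, optional); // aaa a aaa
```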

Add support for latest RegExp JavaScript features

It would be nice to support the latest RegExp JavaScript features:

\p and \P: Unicode property escapes
(?<group>) and \k<group>: Named groups (#43)
(?<=) and (?<!): Positive and negative lookbehind assertions

They are all now part of the ECMAScript standard. Node 9 does not support them but Node 10 will.
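Whether a given runtime accepts these features can be probed with the RegExp constructor, which throws a SyntaxError on unsupported syntax. The sketch below is a rough feature check; `supports` and `features` are illustrative names, and the results depend on the Node version in use.

```javascript
// Returns true if the engine can compile the pattern, false otherwise.
function supports(source, flags = '') {
  try {
    new RegExp(source, flags);
    return true;
  } catch (err) {
    return false;
  }
}

const features = {
  unicodeProperty: supports('\\p{L}', 'u'),
  namedGroup: supports('(?<year>\\d{4})\\k<year>'),
  lookbehind: supports('(?<=a)b'),
  negativeLookbehind: supports('(?<!a)b'),
};

console.log(features);
```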

Thanks for this project, it's really useful.

Backreferences vs code points

https://hackernoon.com/the-madness-of-parsing-real-world-javascript-regexps-d9ee336df983#.2l8qu3l76

The article gives the following test cases:

/\1/ // Matches Unicode code point 1 aka Ctrl-A
/()\1/ // Empty capture followed by a backreference to that capture
/()\01/ // Empty capture followed by code point 1
/\11/ // Match a tab character, which is code point 9!
/\18/ // Match code point 1, followed by "8"
/\176/ // Match a tilde, "~"
/\400/ // Match a space followed by a zero

The rule is that the whole number is taken as a decimal backreference number, but if it has leading zeros or it is out of range (there are not enough capture parentheses) we abandon that interpretation, switch number base, and reinterpret it as up to 3 digits of octal escape up to 255 (\377), possibly followed by literal numbers.

Every time I implement a parser for this, I’m convinced I can parse it in one pass, and every time I am wrong and have to do it with a two-pass algorithm (the first one just counts the captures).
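The two-pass idea can be sketched as follows: first count the capture groups, then decide how to read the digits after a backslash. Both functions are hypothetical helpers written from the rule quoted above, not code from ret or from the article, and the capture counter only handles the basic syntax.

```javascript
// Pass 1: count capturing groups, skipping escapes and character classes.
function countCaptures(source) {
  let count = 0;
  let inClass = false;
  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    if (ch === '\\') { i++; continue; }
    if (inClass) { if (ch === ']') inClass = false; continue; }
    if (ch === '[') { inClass = true; continue; }
    if (ch !== '(') continue;
    // (?<name>...) captures; (?:...), lookaheads and lookbehinds do not.
    const named = source[i + 1] === '?' && source[i + 2] === '<' &&
      source[i + 3] !== '=' && source[i + 3] !== '!';
    if (source[i + 1] !== '?' || named) count++;
  }
  return count;
}

// Pass 2: interpret the digits that follow a backslash.
function interpretDigits(digits, captureCount) {
  // Decimal backreference: no leading zero and enough captures.
  if (digits[0] !== '0' && Number(digits) <= captureCount) {
    return { kind: 'backreference', index: Number(digits), rest: '' };
  }
  // Otherwise octal: up to 3 digits while the value stays <= 0o377.
  let len = 0;
  let code = 0;
  while (len < 3 && len < digits.length && /[0-7]/.test(digits[len])) {
    const next = code * 8 + Number(digits[len]);
    if (next > 0o377) break;
    code = next;
    len++;
  }
  if (len === 0) return { kind: 'literal', rest: digits }; // e.g. \8
  return { kind: 'codePoint', code, rest: digits.slice(len) };
}

console.log(interpretDigits('1', countCaptures('()\\1'))); // backreference 1
console.log(interpretDigits('400', 0)); // code 32 (space), rest '0'
```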

Feature Request - Ability to Simplify

Taking a hint from Golang: its regexp/syntax package offers a Simplify method.

I know it's not trivial, but it would be super useful to have this in ret.js as well.

Advanced simplifications and optimizations would be less useful than removing redundancy, for instance:

/(a|a|a)/ => /(a)/
/(a+)+/ => /(a+)/
/a{0,2}?/ => /a{0,2}/
/(a{0,2})?/ => /(a{0,2})/
/(?:a+)+/ => /(?:a)+/
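The first rewrite, /(a|a|a)/ => /(a)/, could be done directly on the token tree. The sketch below assumes the GROUP shape used elsewhere on this page (alternatives in `options`, a single sequence in `stack`); `dedupeOptions` is a hypothetical helper, not an existing ret API.

```javascript
const types = { GROUP: 1, CHAR: 7 }; // values from ret.types

// Hypothetical helper: drop duplicate alternatives from a group.
function dedupeOptions(group) {
  if (!group.options) return group;
  const seen = new Set();
  const unique = group.options.filter((option) => {
    // Structural key; good enough for a sketch of plain token objects.
    const key = JSON.stringify(option);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
  const { options, ...rest } = group;
  // A single remaining alternative becomes a plain stack.
  return unique.length === 1
    ? { ...rest, stack: unique[0] }
    : { ...rest, options: unique };
}

const a = { type: types.CHAR, value: 97 };
const group = { type: types.GROUP, remember: true, options: [[a], [a], [a]] };

const simplified = dedupeOptions(group);
console.log(JSON.stringify(simplified));
// {"type":1,"remember":true,"stack":[{"type":7,"value":97}]}
```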

[BUG]: Cannot handle references with a value of 10 or greater

If I have an expression /...\10/, the reference will be parsed as '1' rather than '10'. This is confirmed by the fact that only one digit is tested here.
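For comparison, the native engine reads \10 as a reference to the tenth group when ten groups exist. A quick check with plain RegExp (not ret):

```javascript
// Ten capture groups followed by a backreference to the tenth.
const re = /(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)\10/;

// Group 10 captured 'j', so the input must end with a second 'j'.
const asGroupTen = re.test('abcdefghijj');   // true

// If \10 meant "\1 followed by a literal 0", this input would match.
const asGroupOne = re.test('abcdefghija0');  // false

console.log(asGroupTen, asGroupOne); // true false
```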
