ammar / regexp_parser

A regular expression parser library for Ruby

License: MIT License

Ruby 92.12% Ragel 7.86% Shell 0.02%

regexp_parser's People

coatl launchthing gjtorikian jhart-r7 backus danascheider cadelaren lubyruffy barthez tumkit15 mschnitzer camertron pocke owst twalpole dgollahon jesterxd gogainda sysfce2 koic tagliala

regexp_parser's Issues

Possible feature incompatibility with Ruby versions.

During my recent work on regexp mutations in mutant, I found cases where regexp_parser indicates that a given Ruby version supports a specific Unicode property, while Ruby itself apparently does not. It could very well be that I am using the regexp_parser API incorrectly.

# report.rb
require 'regexp_parser'

syntax = ::Regexp::Syntax.version_class("ruby/#{RUBY_VERSION}")

puts "RUBY_VERSION:            #{RUBY_VERSION}"
puts "Regexp::Parser::VERSION: #{::Regexp::Parser::VERSION}"
puts "Syntax:                  #{syntax.class}"

puts "Does not recognize while indicated by regexp_parser:"

syntax
  .features
  .fetch(:property, []).each do |property|
    property_specifier = "\\p{#{property}}"

    begin
      /#{property_specifier}/
    rescue RegexpError
      puts property_specifier
    end
  end

I tested with the latest releases of the non-EOL Ruby versions and got the following output:

RUBY_VERSION:            2.7.6
Regexp::Parser::VERSION: 2.3.0
Syntax:                  Class
Does not recognize while indicated by regexp_parser:
\p{egyptian_hieroglyph_format_controls}
\p{ottoman_siyaq_numbers}
\p{small_kana_extension}
\p{symbols_and_pictographs_extended_a}
\p{tamil_supplement}
RUBY_VERSION:            3.0.4
Regexp::Parser::VERSION: 2.3.0
Syntax:                  Class
Does not recognize while indicated by regexp_parser:
\p{egyptian_hieroglyph_format_controls}
\p{ottoman_siyaq_numbers}
\p{small_kana_extension}
\p{symbols_and_pictographs_extended_a}
\p{tamil_supplement}
RUBY_VERSION:            3.1.2
Regexp::Parser::VERSION: 2.3.0
Syntax:                  Class
Does not recognize while indicated by regexp_parser:
\p{egyptian_hieroglyph_format_controls}
\p{ottoman_siyaq_numbers}
\p{small_kana_extension}
\p{symbols_and_pictographs_extended_a}
\p{tamil_supplement}

An easy fix may be to remove the indicated support, or to tell me where I am using the API incorrectly.
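
The same check can be wrapped in a small helper that asks the running interpreter directly. This is only a sketch and not part of regexp_parser: `property_supported?` is a hypothetical name, and it assumes that an unsupported property makes `Regexp.new` raise `RegexpError`, as in the report script above.

```ruby
# Hypothetical helper (not part of regexp_parser): returns true when the
# running Ruby can compile \p{name}, and false when it raises RegexpError.
def property_supported?(name)
  Regexp.new("\\p{#{name}}")
  true
rescue RegexpError
  false
end

puts property_supported?("alpha")
puts property_supported?("no_such_property_xyz")
```

Comparing this result against the syntax class's feature list for each property would give the set of mismatches per Ruby version.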

Ruby 2.5.0 support

Getting a Regexp::Syntax::UnknownSyntaxNameError on Ruby 2.5.0:

Unknown syntax name 'ruby/2.5.0'. Forgot to add it to Regexp::Syntax::VERSIONS?

When would support for 2.5.0 be available?
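
Until a release adds the new version, a caller could work around this kind of error by falling back to the newest version the library already knows. The sketch below is an assumption-laden illustration: `KNOWN_VERSIONS` and `closest_known_version` are hypothetical names, and the version list is made up rather than taken from regexp_parser.

```ruby
require "rubygems" # for Gem::Version (normally loaded by default)

# Hypothetical list of versions a syntax table knows about (illustrative only).
KNOWN_VERSIONS = %w[2.3.0 2.4.0 2.4.1].freeze

# Pick the newest known version that is not newer than the requested one.
# Returns nil when every known version is newer than the request.
def closest_known_version(requested, known = KNOWN_VERSIONS)
  target = Gem::Version.new(requested)
  best = known.map { |v| Gem::Version.new(v) }.select { |v| v <= target }.max
  best && best.to_s
end

puts closest_known_version("2.5.0")
```

The result could then be used to build a syntax name such as "ruby/#{version}" instead of passing RUBY_VERSION directly.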

Release new version

Hey @ammar any chance you could release a new gem version? Another option would be to add a .gemspec to this project so I can add a git or github reference to it in my Gemfile. Thanks!

I also can't figure out how to build the gem or run tests. Running rake gem:release doesn't build the gem (at least, it doesn't appear in pkg), and rake test fails with a huge stack trace. Could you add some guidelines to the readme?

Mishandled comments with extended mode regexp?

My understanding (based on the docs) is that the following are equivalent:

  /
    a # bc+
  /x

  /
    a #bc+
  /x

  /
    a# bc+
  /x

  /
    a#bc+
  /x

  /a # bc+/x
  /a #bc+/x
  /a# bc+/x
  /a#bc+/x

In all cases, the pattern should match only a single 'a', as whitespace and anything following a # on a line is ignored. Ruby (2.6.6) does indeed treat them as equivalent. However, regexp_parser does not, and only recognises the comment if:

The pattern is across multiple lines (examples 1-4), and
There is a space before the # (examples 1-2)

To demonstrate, run the following:

require 'regexp_parser'

p Regexp::Parser::VERSION

[
  /
    a # bc+
  /x,
  /
    a #bc+
  /x,
  /
    a# bc+
  /x,
  /
    a#bc+
  /x,
  /a # bc+/x,
  /a #bc+/x,
  /a# bc+/x,
  /a#bc+/x
].each do |pat|
  puts "Pattern: #{pat.inspect}"
  pat =~ 'abccc'
  puts "pat =~ 'abccc' last_match: #{Regexp.last_match}"
  puts Regexp::Parser.parse(pat).each_expression.map { |exp, _| exp.class.name.sub(/.*::/, "") }.join(', ')

  puts
end

prints:

"1.7.1"
Pattern: /
    a # bc+
  /x
pat =~ 'abccc' last_match: a
WhiteSpace, Literal, WhiteSpace, Comment, WhiteSpace

Pattern: /
    a #bc+
  /x
pat =~ 'abccc' last_match: a
WhiteSpace, Literal, WhiteSpace, Comment, WhiteSpace

Pattern: /
    a# bc+
  /x
pat =~ 'abccc' last_match: a
WhiteSpace, Literal, WhiteSpace, Literal, Literal, WhiteSpace

Pattern: /
    a#bc+
  /x
pat =~ 'abccc' last_match: a
WhiteSpace, Literal, Literal, WhiteSpace

Pattern: /a # bc+/x
pat =~ 'abccc' last_match: a
Literal, WhiteSpace, Literal, Literal

Pattern: /a #bc+/x
pat =~ 'abccc' last_match: a
Literal, WhiteSpace, Literal, Literal

Pattern: /a# bc+/x
pat =~ 'abccc' last_match: a
Literal, WhiteSpace, Literal, Literal

Pattern: /a#bc+/x
pat =~ 'abccc' last_match: a
Literal, Literal

Notice how all patterns match just 'a', but only the first two include a Comment expression when parsed.

Am I misunderstanding something, or is this a bug?
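
The Ruby side of the claim can be checked without the gem. The following sketch builds the same eight patterns from strings with `Regexp::EXTENDED` and collects what each one matches; the variable names are illustrative only.

```ruby
# The eight pattern sources from the report, built with the x flag.
sources = [
  "\n    a # bc+\n  ", "\n    a #bc+\n  ", "\n    a# bc+\n  ", "\n    a#bc+\n  ",
  "a # bc+", "a #bc+", "a# bc+", "a#bc+"
]

# In extended mode, whitespace and everything after an unescaped # is ignored,
# so every pattern should reduce to a plain /a/.
matches = sources.map { |src| Regexp.new(src, Regexp::EXTENDED).match("abccc")[0] }

p matches.uniq
```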

incompatible character encodings when calling `.to_s` on tree parsed from single regex

related: rubocop/rubocop#9056

Regexp::Parser.parse((/¡#≥
/x).to_s).to_s

expected: all nodes of tree are utf-8 (they all come from a utf-8 source string)
actual: raises Encoding::CompatibilityError: incompatible character encodings: UTF-8 and ASCII-8BIT

thank you for your great lib!
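
For context, this class of error can be reproduced in plain Ruby by joining a UTF-8 string with binary-tagged bytes when both contain non-ASCII characters. The sketch below only illustrates the error itself; it makes no claim about where inside regexp_parser the binary string originates.

```ruby
utf8   = "\u00A1"     # "¡" as a UTF-8 string
binary = "\u2265".b   # the bytes of "≥", tagged as binary (ASCII-8BIT)

# Joining the two raises, because neither side is ASCII-only.
raised = begin
  utf8 + binary
  false
rescue Encoding::CompatibilityError
  true
end

# Re-tagging the bytes as UTF-8 makes the strings compatible again.
fixed = utf8 + binary.dup.force_encoding(Encoding::UTF_8)

puts raised
puts fixed.encoding
```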

Regexp::Syntax::Ruby::* does not implement: [escape:codepoint_list]

(I'm not sure if I'm using the right terminology here so if my description doesn't make sense then just look at the irb snippet below)

It looks like regexp_parser successfully scans for escape:codepoint_list tokens but the different ruby version syntax classes don't say they implement this type of node.

Example:

2.3.0 :001 > codepoint_list_regex = /\u{9879}/
 => /\u{9879}/
2.3.0 :002 > require 'regexp_parser'
 => true
2.3.0 :003 > Regexp::Parser.parse(codepoint_list_regex)
Regexp::Syntax::NotImplementedError: Regexp::Syntax::Ruby::V230 does not implement: [escape:codepoint_list]

I have a fix which I should be able to PR soon

regexp_parser is not thread-safe

test case:

require 'uri'
require 'regexp_parser'

@results = []
1000.times { Thread.new { @results << Regexp::Parser.parse(URI.regexp) } }
@results.map(&:strfre).uniq.count # => 10 or so (should be 1)

I guess this is due to the "instance" variables on the Scanner, Lexer and Parser classes existing on the classes themselves and thus being shared among executions. If so, it can be easily fixed by making their class methods into instance methods. E.g. for Scanner:

def self.scan(input_object, &block)
  new.scan(input_object, &block)
end

def scan(input_object, &block)
  @literal, top, stack = nil, 0, []
  # ...
end
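
To show the proposed pattern in isolation, here is a toy example. `ToyScanner` is hypothetical and has nothing to do with the real Ragel-generated scanner; it only demonstrates that delegating the class method to a fresh instance keeps state private to each call.

```ruby
# Toy illustration only: ToyScanner is hypothetical, not Regexp::Scanner.
class ToyScanner
  # The class-level entry point delegates to a fresh instance per call,
  # so no state is shared between threads.
  def self.scan(input)
    new.scan(input)
  end

  def scan(input)
    @tokens = [] # instance state, private to this call
    input.each_char { |char| @tokens << char }
    @tokens
  end
end

# Run many scans concurrently and wait for each thread's return value.
results = Array.new(50) { Thread.new { ToyScanner.scan("a+b*") } }.map(&:value)

puts results.uniq.size
```

Using `Thread#value` also waits for each thread to finish, which the original test case does not do before inspecting `@results`.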

Octal escape sequences not parsed from character classes

Octal escape sequences get their own representation when parsed at the top level, but not from within a character class:

irb(main):032:0> Regexp::Parser::VERSION
=> "2.6.2"
irb(main):033:0> Regexp::Scanner.scan(/\101;\x42/)
=> [[:escape, :octal, "\\101", 0, 4],
    [:literal, :literal, ";", 4, 5],
    [:escape, :hex, "\\x42", 5, 9]]
irb(main):034:0> Regexp::Scanner.scan(/[\101;\x42]/)
=>
[[:set, :open, "[", 0, 1],
 [:escape, :literal, "\\1", 1, 3],
 [:literal, :literal, "0", 3, 4],
 [:literal, :literal, "1", 4, 5],
 [:literal, :literal, ";", 5, 6],
 [:escape, :hex, "\\x42", 6, 10],
 [:set, :close, "]", 10, 11]]

Since hex is the same in both cases, getting escape literal "\\1" instead of escape octal "\\101" seems like a bug.
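
A quick plain-Ruby check supports the expectation that Ruby itself reads `\101` inside a set as a single octal escape. This sketch only exercises Ruby's own regexp engine, not regexp_parser.

```ruby
set = /[\101;\x42]/

# \101 is octal for "A" and \x42 is hex for "B".
matched = %w[A ; B].map { |s| set.match?(s) }

# If \101 were read as \1 followed by the literals 0 and 1, these would match.
digits = %w[0 1].map { |s| set.match?(s) }

p matched
p digits
```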

Scanner fails on certain backslash sequences

Consider

Regexp::Scanner.scan('\c;')

And note that the result is []

\c; does compile in ruby but I'm not sure what it matches; in PCRE, it matches {, but not in ruby...

Similarly,

Regexp::Scanner.scan('')
Regexp::Scanner.scan('\x')
Regexp::Scanner.scan('\x;HelloWorld')

are all []. All are invalid in ruby.

Desired Behavior:

\cX should probably be emitted for any X as Ruby allows them.
For the others, a Scanner exception would be much better than doing nothing.

Thanks, c.

Question: Corpus test?

Hello regexp_parser team!

While improving mutant's regexp support I'm often challenged with "Did I cover all nodes?" type of questions, and for that reason I try to source as many test cases as possible from my dependencies. I did so in unparser, re-using the parser test suite for Ruby edge cases.

And I'd love to have the ability to do this for the regexp_parser dependency as well. Is there a good way to 'mass source' expressions from your library's tests?

Cheers,

Markus

Hex type (`\h`) and nonhex type (`\H`) not implemented

2.3.0 :012 > Regexp::Parser.parse(/\h/)
Regexp::Syntax::NotImplementedError: Regexp::Syntax::Ruby::V230 does not implement: [type:hex]
    from /Users/johnbackus/.rvm/gems/ruby-2.3.0/gems/regexp_parser-0.3.3/lib/regexp_parser/syntax.rb:154:in `implements!'
    from /Users/johnbackus/.rvm/gems/ruby-2.3.0/gems/regexp_parser-0.3.3/lib/regexp_parser/lexer.rb:22:in `block in lex'
    from /Users/johnbackus/.rvm/gems/ruby-2.3.0/gems/regexp_parser-0.3.3/lib/regexp_parser/scanner.rb:4200:in `emit'
    from /Users/johnbackus/.rvm/gems/ruby-2.3.0/gems/regexp_parser-0.3.3/lib/regexp_parser/scanner.rb:3366:in `scan'
    from /Users/johnbackus/.rvm/gems/ruby-2.3.0/gems/regexp_parser-0.3.3/lib/regexp_parser/lexer.rb:20:in `lex'
    from /Users/johnbackus/.rvm/gems/ruby-2.3.0/gems/regexp_parser-0.3.3/lib/regexp_parser/parser.rb:26:in `parse'
    from (irb):12
    from /Users/johnbackus/.rvm/rubies/ruby-2.3.0/bin/irb:11:in `<main>'

2.3.0 :013 > Regexp::Parser.parse(/\H/)
Regexp::Syntax::NotImplementedError: Regexp::Syntax::Ruby::V230 does not implement: [type:nonhex]
    from /Users/johnbackus/.rvm/gems/ruby-2.3.0/gems/regexp_parser-0.3.3/lib/regexp_parser/syntax.rb:154:in `implements!'
    from /Users/johnbackus/.rvm/gems/ruby-2.3.0/gems/regexp_parser-0.3.3/lib/regexp_parser/lexer.rb:22:in `block in lex'
    from /Users/johnbackus/.rvm/gems/ruby-2.3.0/gems/regexp_parser-0.3.3/lib/regexp_parser/scanner.rb:4200:in `emit'
    from /Users/johnbackus/.rvm/gems/ruby-2.3.0/gems/regexp_parser-0.3.3/lib/regexp_parser/scanner.rb:3367:in `scan'
    from /Users/johnbackus/.rvm/gems/ruby-2.3.0/gems/regexp_parser-0.3.3/lib/regexp_parser/lexer.rb:20:in `lex'
    from /Users/johnbackus/.rvm/gems/ruby-2.3.0/gems/regexp_parser-0.3.3/lib/regexp_parser/parser.rb:26:in `parse'
    from (irb):13
    from /Users/johnbackus/.rvm/rubies/ruby-2.3.0/bin/irb:11:in `<main>'

Should have basic rubocop linting

At least unused variables should be caught, c.f. #78

Ruby 2.2.7 syntax missing

There is no syntax file for ruby 2.2.7, which means that on a current ruby 2.2 the code now fails with "Regexp::Syntax::UnknownSyntaxNameError: Unknown syntax name 'ruby/2.2.7'. Forgot to add it in the case statement?". I realize that ruby 2.2 is old, but it is still officially maintained until March 2018.

Suggestion / TBD: improve options handling

Right now, it is hard to tell which options apply to a specific expression when passing in a ::Regexp with flags or when there are groups with option flags.

E.g.

P = Regexp::Parser
P.parse(/a/i).expressions[0].i?                      # => false 
P.parse(/(a)/i).expressions[0].i?                    # => false 
P.parse(/x(?i)a/).expressions.last.i?                # => false 
P.parse(/(?i:(a))/).expressions[0].expressions[0].i? # => false
P.parse(/(?i-i:a)/).expressions[0].i?                # => true (should be false)

To correctly determine whether a is case-insensitive, a user would have to keep track of the root options and any "currently active" option groups themselves.

@ammar if you agree, I think we could improve this by setting a correct options hash for every Token, as follows:

add another field to the Token class, :options, in token.rb:11
remove Parser::options and move all option processing to the Lexer instead. imho currently active options are similar to group_level or set_level.
set up initial @options based on input in the Lexer (use an empty Hash if input is a ::String instead of a ::Regexp)
dup and update these @options whenever encountering a group with option flags
keep track of option group "history" to correctly handle nesting
split [:group, :options, ...] into two tokens: [:group, :options_local, ...] and [:group, :options_switch, ...], to allow differentiating between group-local option modifications (e.g. /a(?i:b)c/) and those that persist after the group is closed (e.g. /a(?i)bc/)
pass these @options to every new Token created in lexer.rb:30
remove custom handling of options from root.rb
change option methods in expression.rb from (@options and @options[:m]) ? true : false to !!@options[:m]
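
The bookkeeping described in the steps above can be sketched with a stack of option hashes. Everything here is hypothetical: the token shapes, the `:options_local` and `:options_switch` names taken from the proposal, and the `active_options` helper are for illustration and do not reflect the actual Lexer.

```ruby
# Toy token stream: NOT the real regexp_parser token format.
# Returns [text, options] pairs for every literal token.
def active_options(tokens, root_options = {})
  stack = [root_options.dup] # one options hash per open group
  result = []

  tokens.each do |type, payload|
    case type
    when :group_open     # "(" inherits the current options
      stack.push(stack.last.dup)
    when :options_local  # "(?i:" opens a group with modified options
      stack.push(stack.last.merge(payload))
    when :options_switch # "(?i)" modifies the rest of the enclosing group
      stack[-1] = stack.last.merge(payload)
    when :group_close
      stack.pop
    when :literal
      result << [payload, stack.last.dup]
    end
  end

  result
end

# /a(?i:b)c/ : only b is case-insensitive.
local = active_options([[:literal, "a"], [:options_local, { i: true }],
                        [:literal, "b"], [:group_close], [:literal, "c"]])

# /a(?i)bc/ : the switch persists for b and c.
switch = active_options([[:literal, "a"], [:options_switch, { i: true }],
                         [:literal, "b"], [:literal, "c"]])

p local
p switch
```

Keeping one hash per nesting level is what lets a group-local change disappear on close while a switch keeps affecting the rest of its enclosing group.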

Warnings from expression.rb

This is similar to #10 and pretty simple. My spec output for my regexp_parser implementation (mbj/mutant#565) currently gets flooded with this warning:

/Users/johnbackus/Projects/regexp_parser/lib/regexp_parser/expression.rb:75: warning: instance variable @quantifier not initialized

I'll open a PR to address this. I also get some similar warnings on boot from scanner.rb as mentioned in #10. I will probably look into addressing this as well and include a fix in my PR if the changes are reasonable.

Ruby 2.4.3 raising errors

Similar to #41,

we're getting errors running under Ruby 2.4.3:

Unknown syntax name 'ruby/2.4.3'. Forgot to add it to Regexp::Syntax::VERSIONS?

Multi-byte named capture groups do not parse

Hi again,

I have found one interesting case where regexp_parser will not parse a regexp that MRI accepts:

/(?<æ>.)(.)(?<b>\d+)(\d)/.match('ab12').named_captures # => {"æ"=>"a", "b"=>"1"}

Regexp::Parser.parse(/(?<æ>.)(.)(?<b>\d+)(\d)/) # => Regexp::Scanner::InvalidGroupOption: Invalid group option  in (?

See this line and this line from ruby/spec for where I sourced this case.

Like in #75 I'm just curious if this is intended behavior and, if so, if it can be documented. Otherwise parity with MRI is preferred.

Thanks again!

Release 2.3.1 syntax?

The current release doesn't have a syntax file for ruby 2.3.1.

Support ruby 2.5.0

Just released. PR #49

https://www.ruby-lang.org/en/news/2017/12/25/ruby-2-5-0-released/

Only alphanumeric character set ranges are detected as ranges

S = Regexp::Scanner
S.scan /[1-9]/ # => [..., [:set, :range, "1-9", 1, 4], ...]
S.scan /[a-z]/ # => [..., [:set, :range, "a-z", 1, 4], ...]
S.scan /[!-%]/ # => [..., [:set, :member, "!", 1, 2], [:set, :member, "-", 2, 3], [:set, :member, "%", 3, 4], ...]
S.scan /[ä-ü]/ # => [..., [:set, :member, "\xC3\xA4", 1, 3], [:set, :member, "-", 3, 4], [:set, :member, "\xC3\xBC", 4, 6], ...]

I think we could detect everything as a range that follows this pattern:

anything_but_a_bracket . "-" . anything_but_a_bracket

Of course, "anything" can be a unicode escape or even a codepoint list. Funny enough this works:

/[\u{41 42}-\u{55 56}]/ =~ "\u{44}" # => 0

Besides that, are there any other pitfalls I didn’t think of?
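
That Ruby itself treats these as ranges, rather than as three separate members, can be confirmed with a small check. This is a plain-Ruby sketch and does not involve the Scanner.

```ruby
punct  = /[!-%]/
umlaut = /[\u00E4-\u00FC]/ # same as /[ä-ü]/

# "#" (0x23) lies between "!" (0x21) and "%" (0x25).
inside_punct = punct.match?("#")

# "-" (0x2D) would match if the set were the three members "!", "-", "%".
hyphen_matches = punct.match?("-")

# "ö" (U+00F6) lies between "ä" (U+00E4) and "ü" (U+00FC).
inside_umlaut = umlaut.match?("\u00F6")

p [inside_punct, hyphen_matches, inside_umlaut]
```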

Ruby 2.3.7 support

And 2.2.10, but I don't care so much about that one :)

Misses special characters after bare `#` (e.g. `/#(\d+)/`)

In ruby's regexp, /#(\d+)/ matches literal # then unnamed capture (\d+).
But Regexp::Parser.parse treats it as literal '#(\d+)'.

irb(main):001:0> /#(\d+)/.match('#123')
=> #<MatchData "#123" 1:"123">
irb(main):002:0> Regexp::Parser.parse('#(\d+)')
=> #<Regexp::Expression::Root:0x00005619a4ad5e30 @type=:expression, @token=:root, @text="", @ts=0, @level=nil, @set_level=nil, @conditional_level=nil, @nesting_level=0, @quantifier=nil, @options={}, @expressions=[#<Regexp::Expression::Literal:0x00005619a4b8cb80 @type=:literal, @token=:literal, @text="#(\\d+)", @ts=0, @level=0, @set_level=0, @conditional_level=0, @nesting_level=1, @quantifier=nil, @options={}>]>
irb(main):003:0> Regexp::Parser.parse('\#(\d+)')
=> #<Regexp::Expression::Root:0x00005619a4be42e0 @type=:expression, @token=:root, @text="", @ts=0, @level=nil, @set_level=nil, @conditional_level=nil, @nesting_level=0, @quantifier=nil, @options={}, @expressions=[#<Regexp::Expression::EscapeSequence::Literal:0x00005619a4bf5310 @type=:escape, @token=:literal, @text="\\#", @ts=0, @level=0, @set_level=0, @conditional_level=0, @nesting_level=1, @quantifier=nil, @options={}>, #<Regexp::Expression::Group::Capture:0x00005619a4bf52c0 @type=:group, @token=:capture, @text="(", @ts=2, @level=0, @set_level=0, @conditional_level=0, @nesting_level=1, @quantifier=nil, @options={}, @expressions=[#<Regexp::Expression::CharacterType::Digit:0x00005619a4bf51d0 @type=:type, @token=:digit, @text="\\d", @ts=3, @level=1, @set_level=0, @conditional_level=0, @nesting_level=2, @quantifier=#<Regexp::Expression::Quantifier:0x00005619a4bf51a8 @token=:one_or_more, @text="+", @mode=:greedy, @min=1, @max=-1>, @options={}>], @number=1, @number_at_level=1>]>

Regexp::Parser.parse('#(\d+)') should return the same tree as Regexp::Parser.parse('\#(\d+)') (except for the first node, which is "#" instead of "\\#")

Regexp accepted but rejected by regexp_parser

irb(main):002:0> /(a)(?('01'))/
=> /(a)(?('01'))/
irb(main):003:0> Regexp::Parser.parse("(a)(?('01'))")
/home/mutant-dev/.gem/ruby/3.2.2/gems/regexp_parser-2.8.2/lib/regexp_parser/parser.rb:290:in `conditional': Unknown Conditional token condition_open (Regexp::Parser::UnknownTokenError)
	from /home/mutant-dev/.gem/ruby/3.2.2/gems/regexp_parser-2.8.2/lib/regexp_parser/parser.rb:81:in `parse_token'
	from /home/mutant-dev/.gem/ruby/3.2.2/gems/regexp_parser-2.8.2/lib/regexp_parser/parser.rb:39:in `block in parse'
	from /home/mutant-dev/.gem/ruby/3.2.2/gems/regexp_parser-2.8.2/lib/regexp_parser/lexer.rb:74:in `emit'
	from /home/mutant-dev/.gem/ruby/3.2.2/gems/regexp_parser-2.8.2/lib/regexp_parser/lexer.rb:57:in `block in lex'
	from /home/mutant-dev/.gem/ruby/3.2.2/gems/regexp_parser-2.8.2/lib/regexp_parser/scanner.rb:2422:in `emit'
	from /home/mutant-dev/.gem/ruby/3.2.2/gems/regexp_parser-2.8.2/lib/regexp_parser/scanner.rb:2475:in `emit_literal'
	from /home/mutant-dev/.gem/ruby/3.2.2/gems/regexp_parser-2.8.2/lib/regexp_parser/scanner.rb:2408:in `emit'
	from /home/mutant-dev/.gem/ruby/3.2.2/gems/regexp_parser-2.8.2/lib/regexp_parser/scanner.rb:2172:in `scan'
	from /home/mutant-dev/.gem/ruby/3.2.2/gems/regexp_parser-2.8.2/lib/regexp_parser/scanner.rb:21:in `scan'
	from /home/mutant-dev/.gem/ruby/3.2.2/gems/regexp_parser-2.8.2/lib/regexp_parser/lexer.rb:33:in `lex'
	from /home/mutant-dev/.gem/ruby/3.2.2/gems/regexp_parser-2.8.2/lib/regexp_parser/lexer.rb:17:in `lex'
	from /home/mutant-dev/.gem/ruby/3.2.2/gems/regexp_parser-2.8.2/lib/regexp_parser/parser.rb:38:in `parse'
	from /home/mutant-dev/.gem/ruby/3.2.2/gems/regexp_parser-2.8.2/lib/regexp_parser/parser.rb:22:in `parse'
	from (irb):3:in `<main>'
	from /home/mutant-dev/.rubies/ruby-3.2.2/lib/ruby/gems/3.2.0/gems/irb-1.6.2/exe/irb:11:in `<top (required)>'
	from /home/mutant-dev/.rubies/ruby-3.2.2/bin/irb:25:in `load'

I've reduced this from another rubyspec corpus case.

Nested repetitions parsed potentially incorrectly

Summary: /o{2}{5}/ matches 10 o's (in ruby) but the {2} quantifier is lost in the parse tree.

Example code:

#!/usr/bin/env ruby

require 'rubygems'
require 'regexp_parser'
require 'pp'

re = /o{2}{5}/
pp Regexp::Parser.parse(re)

puts "o" if "o" =~ re
puts "2 o" if "o"*2 =~ re
puts "5 o" if "o"*5 =~ re
puts "10 o" if "o"*10 =~ re

Output:

#<Regexp::Expression::Root:0x105fd9d40
 @expressions=
  [#<Regexp::Expression::Literal:0x105fd6398
    @expressions=[],
    @options=nil,
    @quantifier=
     #<Regexp::Expression::Quantifier:0x105fd5e20
      @max=5,
      @min=5,
      @mode=:greedy,
      @text="{5}",
      @token=:interval>,
    @text="o",
    @token=:literal,
    @type=:literal>],
 @options=nil,
 @text="",
 @token=:root,
 @type=:expression>
10 o

Comments:

As far as I can tell, the nested quantifier syntax isn't documented in Ruby and is illegal in PCRE. Grep, for example, will not match any number of o's for the given regexp. As such, I'd be content with a will-not-fix verdict. But I thought you might be interested.

Thank you for your time.

regex_parser_error

@type tail
path /tmp/access_log-20170205
pos_file /tmp/grock1
@type grok
grok_pattern %{IP:ip_address}
tag grokked_log
@type elasticsearch
logstash_format false
host 192.168.1.36 #(optional; default="localhost")
port 9200 #(optional; default=9200)
index_name grok1 #(optional; default=fluentd)
type_name log #(optional; default=fluentd)

The error I am getting is:
2017-02-09 11:14:31 +0530 [error]: #0 error_class=NameError error="uninitialized constant Fluent::Plugin::RegexpParser"

my log file

vi access_log-20170205
192.168.0.204 - - [02/Feb/2017:16:45:34 +0530] "GET /browserconfig.xml HTTP/1.1" 404 294 "-" "Mozilla/5.0 (Windows NT 6.3; Win64; x64; Trident/7.0; rv:11.0) like Gecko"
192.168.0.204 - - [02/Feb/2017:17:47:01 +0530] "GET /browserconfig.xml HTTP/1.1" 404 294 "-" "Mozilla/5.0 (Windows NT 6.3; Win64; x64; Trident/7.0; rv:11.0) like Gecko"

Invalid parse of `/foo(?#)bar/`

It looks like the edge case /foo(?#)bar/ is not properly parsed. I found this while running this gem against rubyspec. Here is the relevant part of the spec: https://github.com/ruby/spec/blob/master/language/regexp_spec.rb#L102-L105

Proposal to eliminate warnings

Right now, code controlled by regexp_parser can generate warnings. On projects that have warnings enabled (such as mutant) and have the warning-generating call site in the main loop, this can lead to megabytes of warnings in the build logs.

This proposal is about:

1. Getting agreement on removing all warnings from regexp_parser
2. Removing all warnings
3. Getting agreement on making the build fail in future in the presence of warnings
4. Enforcing that warnings are absent on the build

I'm happy to take 2. and 4. in case @ammar agrees to 1. and 3.
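
As a rough idea of what enforcing the absence of warnings could look like, the sketch below hooks Ruby's `Warning` module and records every message so a test run can fail when the list is not empty. `WarningCollector` is a hypothetical name, and this is one possible approach rather than anything the project has agreed on.

```ruby
# Hypothetical collector: records warnings instead of printing them.
module WarningCollector
  def self.collected
    @collected ||= []
  end

  # Ruby routes the warnings it emits through Warning.warn; newer versions
  # may also pass a category keyword, which is accepted and ignored here.
  def warn(message, category: nil)
    WarningCollector.collected << message
  end
end

Warning.extend(WarningCollector)

# Simulate a warning being emitted.
Warning.warn("example warning\n")

puts WarningCollector.collected.size
```

A test suite could then call something like `abort` in an exit hook whenever `WarningCollector.collected` is not empty.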

Use of byte index vs. character index

It seems the ts and te values are byte indices, not character indices, even if you feed a multibyte string to the parser. Having to convert index values back and forth makes the parser harder to use, because a regexp is normally handled as multibyte text.

cf. rubocop/rubocop#8989

Is there any plan to optionally provide character index in addition to or instead of byte index? Thanks!
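
In the meantime, a caller can convert a byte offset into a character offset in plain Ruby. `char_index` below is a hypothetical helper, not a regexp_parser API, and it assumes the byte offset falls on a character boundary.

```ruby
# Hypothetical helper: converts a byte offset into a character offset.
# Assumes byte_index falls on a character boundary of the string.
def char_index(string, byte_index)
  string.byteslice(0, byte_index).length
end

source = "\u00E4(b)" # "ä(b)": ä occupies two bytes in UTF-8

byte_pos = source.b.index("(") # byte offset of the opening parenthesis
char_pos = char_index(source, byte_pos)

p [byte_pos, char_pos]
```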

New release with the YAML parsing changes

Hello 👋

I was wondering if you were planning on tagging a new release soon. This commit is interesting to me a408eb2 :)

Thanks!

Inconsistent scanning of properties within sets

Outside of sets, Regexp::Scanner recognizes two types of nonproperty:

S = Regexp::Scanner
S.scan /\p{ascii}/    # => [[:property,    :ascii, "\\p{ascii}", 0, 9]]
S.scan /\p{^ascii}/   # => [[:nonproperty, :ascii, "\\p{^ascii}", 0, 10]]
S.scan /\P{ascii}/    # => [[:nonproperty, :ascii, "\\P{ascii}", 0, 9]]

Within sets, only one of them is recognized as nonproperty:

S.scan /[\p{ascii}]/  # => [..., [:set, :ascii, "\\p{ascii}", 1, 10], ...]]
S.scan /[\p{^ascii}]/ # => [..., [:nonproperty, :ascii, "\\p{^ascii}", 1, 11], ...]]
S.scan /[\P{ascii}]/  # => [..., [:set, :ascii, "\\P{ascii}", 1, 10], ...]]

And I guess you would actually see it as a bug that \p{^...} does not get the type :set like everything else in a set, is that right? That would be easy to fix, and I think it would make sense from the viewpoint of consistency to change that.

However, fixing that would still make it necessary to scan the data of properties in sets for /\\(P|p\{\^)/ to detect whether they are negative. That is different from how they are handled outside of sets and different from how classes are handled, as classes have their "polarity" encoded at index 1:

S.scan /[[:ascii:]]/  # => [..., [:set, :class_ascii, "[:ascii:]", 1, 10], ...]
S.scan /[[:^ascii:]]/ # => [..., [:set, :class_nonascii, "[:^ascii:]", 1, 10], ...]

IMHO the ideal solution would be if all the information that tokens generate outside of sets was also available when they occur in a set. I am aware that this would require a substantial refactoring and would not be backwards compatible. But it would greatly help acting on specific tokens (escapes are another example) wherever they occur, without having to scan the token data.

Any thoughts?
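
For reference, the workaround of scanning the token text for polarity looks roughly like this. `negated_property_text?` is a hypothetical helper based on the regexp mentioned above, anchored to the start of the text; it only covers the three forms shown in this issue.

```ruby
# Matches \P{...} and \p{^...} at the start of a token's text.
NEGATED_PROPERTY = /\A\\(?:P|p\{\^)/

# Hypothetical helper: true when the property text is written in negated form.
def negated_property_text?(text)
  NEGATED_PROPERTY.match?(text)
end

texts = ["\\p{ascii}", "\\p{^ascii}", "\\P{ascii}"]
p texts.map { |text| negated_property_text?(text) }
```

Having the polarity encoded in the token itself, as it is for classes, would make this kind of text inspection unnecessary.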

Escape sequences not handled correctly in character classes

It looks like [\da-z] results in an incorrect members array in CharacterSet. Consider the following example:

Regexp::Parser.parse('[\da-z]').expressions.first.members
# => ["\\da", "-", "z"]

I think the array should contain a Digit object or '0-9' instead of the '\\d'.

Overhaul of Set#members needed

Right now, handling the content of character sets with regexp_parser is hard:

The Scanner only detects few ranges successfully, as detailed in issue #29.
The Scanner returns inconclusive information about member tokens because they all have the type :set. Issue #28 describes this for properties, but it also affects \a, \e, \n, \t, \u, \v and more.
The Parser then "throws away" even this limited information as it only relays the Token#text to Set#members. (Re-running Parser#parse on individual Set#members is a poor workaround for this.)

What I have in mind as a general solution is the following:

removing the :subset token type, leaving #set_level to differentiate between sets and subsets
using the :set token type only for tokens that are particular to sets ([, ^, &&, ] and ranges)
removing the :member, :member_hex, :range and :range_hex tokens
treating set ranges and members like any other sub-expression instead
removing the attr Set#members, leaving #expressions to access members, ranges and subsets

Thus, parsing /a[bc-d]/ could yield something like

#<Root @expressions=[
  #<Literal @type=:literal, @token=:literal, @text="a" >,
  #<CharacterSet @expressions=[
    #<Literal @type=:literal, @token=:literal, @text="b" >,
    #<Range @type=:set, @token=:range, @expressions=[
      #<Literal @type=:literal, @token=:literal, @text="c" >,
      #<Literal @type=:literal, @token=:literal, @text="d" >
    ]>
  ]>
]>

The only tricky bit is rewiring the ragel machines in the right way and catching all ranges.
On the other hand, it would probably lead to less code, as special treatment is only needed for a few things within sets: the set tokens plus ., \b, and [:...:] if I am not mistaken.

What do you think, @ammar?

Fail to parse negative lookbehind with closing angle bracket

Hi! I think I found a bug: let's say we want to capture all the letters y followed by a closing angle bracket > that are not preceded by the letter x, we could use this regex:

irb(main):003:0> re = Regexp.new("(?<!x)y>")
=> /(?<!x)y>/
irb(main):004:0> ["y>", "py>", "xy>"].map { |s| re.match?(s) }
=> [true, true, false]

When we try to parse it, it fails:

irb(main):003:0> Regexp::Scanner.scan(re)
/Users/serch/.rbenv/versions/3.0.2/lib/ruby/gems/3.0.0/gems/regexp_parser-2.6.0/lib/regexp_parser/scanner.rb:2521:in `scan': Premature end of pattern at (missing group closing paranthesis) [1] (Regexp::Scanner::PrematureEndError)

without the closing angle bracket it succeeds:

irb(main):004:0> Regexp::Scanner.scan(Regexp.new("(?<!x)y"))
=> [[:assertion, :nlookbehind, "(?<!", 0, 4], [:literal, :literal, "x", 4, 5], [:group, :close, ")", 5, 6], [:literal, :literal, "y", 6, 7]]

Failure to parse `\g`

The regexp /[a]\g/ is accepted by Ruby, yet regexp_parser raises an error:

> ruby -e "p(/[a]\g/ =~ 'ag')"
0
> ruby -r regexp_parser -e "Regexp::Parser.parse '/[a]\g/'"
Traceback (most recent call last):
	6: from -e:1:in `<main>'
	5: from /Users/mal/.rvm/gems/ruby-2.7.1/gems/regexp_parser-1.7.1/lib/regexp_parser/parser.rb:22:in `parse'
	4: from /Users/mal/.rvm/gems/ruby-2.7.1/gems/regexp_parser-1.7.1/lib/regexp_parser/parser.rb:38:in `parse'
	3: from /Users/mal/.rvm/gems/ruby-2.7.1/gems/regexp_parser-1.7.1/lib/regexp_parser/lexer.rb:15:in `lex'
	2: from /Users/mal/.rvm/gems/ruby-2.7.1/gems/regexp_parser-1.7.1/lib/regexp_parser/lexer.rb:28:in `lex'
	1: from /Users/mal/.rvm/gems/ruby-2.7.1/gems/regexp_parser-1.7.1/lib/regexp_parser/scanner.rb:71:in `scan'
/Users/mal/.rvm/gems/ruby-2.7.1/gems/regexp_parser-1.7.1/lib/regexp_parser/scanner.rb:2664:in `scan': Scan error at '\\g/' (Regexp::Scanner::ScannerError)

The same reasoning as #63 applies: while technically incorrect, this Regexp is actually accepted and might exist in the wild. Would it be possible to tweak the gem to parse it the same way Ruby does?

Warnings in generated scanner.rb in some situations

In some situations (TBD), while running unit tests for https://github.com/rapid7/recog/, we get warnings like:

/home/jhart/.rbenv/versions/2.2.1/lib/ruby/gems/2.2.0/gems/regexp_parser-0.3.0/lib/regexp_parser/scanner.rb:2463: warning: mismatched indentations at 'end' with 'begin' at 1713
... dozens of these ...
/home/jhart/.rbenv/versions/2.2.1/lib/ruby/gems/2.2.0/gems/regexp_parser-0.3.0/lib/regexp_parser/scanner.rb:2753: warning: statement not reached
/home/jhart/.rbenv/versions/2.2.1/lib/ruby/gems/2.2.0/gems/regexp_parser-0.3.0/lib/regexp_parser/scanner.rb:1648: warning: assigned but unused variable - testEof
/home/jhart/.rbenv/versions/2.2.1/lib/ruby/gems/2.2.0/gems/regexp_parser-0.3.0/lib/regexp_parser/scanner.rb:2236: warning: duplicated when clause is ignored
/home/jhart/.rbenv/versions/2.2.1/lib/ruby/gems/2.2.0/gems/regexp_parser-0.3.0/lib/regexp_parser/lexer.rb:118: warning: assigned but unused variable - replace
/home/jhart/.rbenv/versions/2.2.1/lib/ruby/gems/2.2.0/gems/regexp_parser-0.3.0/lib/regexp_parser/scanner.rb:4200: warning: instance variable @literal not initialized

I also got these with 0.2.1.

You can reproduce most of these with ruby -we 'require "regexp_parser"; Regexp::Scanner.scan(/(foo)/) do |token_parts| puts "capture" if token_parts.first == :group && ![:close, :passive].include?(token_parts[1]); end'

Support for Unicode blocks?

I noticed that Unicode blocks are not supported in

regexp_parser/lib/regexp_parser/syntax/tokens/unicode_property.rb

Line 103 in 3f03b4d

Script =[

Unicode blocks are basically like ranges of Unicode script: http://www.regular-expressions.info/unicode.html#block

I'm happy to make a PR for this if it's helpful.

Regexp accepted by ruby but rejected by regexp_parser

$ bundle exec irb -r regexp_parser
irb(main):001:0> Regexp::Parser.parse("\\99")
/home/mutant-dev/.gem/ruby/3.2.2/gems/regexp_parser-2.8.1/lib/regexp_parser/parser.rb:592:in `block in assign_referenced_expressions': Invalid reference 9 at pos 0 (Regexp::Parser::ParserError)
	from /home/mutant-dev/.gem/ruby/3.2.2/gems/regexp_parser-2.8.1/lib/regexp_parser/parser.rb:590:in `each'
	from /home/mutant-dev/.gem/ruby/3.2.2/gems/regexp_parser-2.8.1/lib/regexp_parser/parser.rb:590:in `assign_referenced_expressions'
	from /home/mutant-dev/.gem/ruby/3.2.2/gems/regexp_parser-2.8.1/lib/regexp_parser/parser.rb:45:in `parse'
	from /home/mutant-dev/.gem/ruby/3.2.2/gems/regexp_parser-2.8.1/lib/regexp_parser/parser.rb:22:in `parse'
	from (irb):1:in `<main>'
	from /home/mutant-dev/.rubies/ruby-3.2.2/lib/ruby/gems/3.2.0/gems/irb-1.6.2/exe/irb:11:in `<top (required)>'
	from /home/mutant-dev/.rubies/ruby-3.2.2/bin/irb:25:in `load'
	from /home/mutant-dev/.rubies/ruby-3.2.2/bin/irb:25:in `<top (required)>'
	from /home/mutant-dev/.rubies/ruby-3.2.2/lib/ruby/3.2.0/bundler/cli/exec.rb:58:in `load'
	from /home/mutant-dev/.rubies/ruby-3.2.2/lib/ruby/3.2.0/bundler/cli/exec.rb:58:in `kernel_load'
	from /home/mutant-dev/.rubies/ruby-3.2.2/lib/ruby/3.2.0/bundler/cli/exec.rb:23:in `run'
	from /home/mutant-dev/.rubies/ruby-3.2.2/lib/ruby/3.2.0/bundler/cli.rb:492:in `exec'
	from /home/mutant-dev/.rubies/ruby-3.2.2/lib/ruby/3.2.0/bundler/vendor/thor/lib/thor/command.rb:27:in `run'
	from /home/mutant-dev/.rubies/ruby-3.2.2/lib/ruby/3.2.0/bundler/vendor/thor/lib/thor/invocation.rb:127:in `invoke_command'
	from /home/mutant-dev/.rubies/ruby-3.2.2/lib/ruby/3.2.0/bundler/vendor/thor/lib/thor.rb:392:in `dispatch'
	from /home/mutant-dev/.rubies/ruby-3.2.2/lib/ruby/3.2.0/bundler/cli.rb:34:in `dispatch'
	... 7 levels...
irb(main):002:0> /\99/
=> /\99/

I suspect that regexp_parser has to be as lenient when it comes to backrefs as ruby itself?

Incorrect `#to_s` output for reluctant interval

2.3.0 :504 > regex = /a{3}?/
 => /a{3}?/
2.3.0 :505 > Regexp::Parser.parse(regex).to_re == regex
 => false
2.3.0 :506 > Regexp::Parser.parse(regex).to_re
 => /a{3}/

It looks like the issue is the quantifier being set with incorrect text:

2.3.0 :508 > Regexp::Parser.parse(regex).first
 => #<Regexp::Expression::Literal:0x007fedbbdc4398 @type=:literal, @token=:literal, @text="a", @ts=0, @level=0, @set_level=0, @conditional_level=0, @options=nil, @quantifier=#<Regexp::Expression::Quantifier:0x007fedbbdc4118 @token=:interval, @text="{3}", @mode=:reluctant, @min=3, @max=3>>
2.3.0 :509 > Regexp::Parser.parse(regex).first.quantifier
 => #<Regexp::Expression::Quantifier:0x007fedbbd75608 @token=:interval, @text="{3}", @mode=:reluctant, @min=3, @max=3>
2.3.0 :510 > Regexp::Parser.parse(regex).first.quantifier.text
 => "{3}"

`regexp_parser` rejects `/\xA/` but MRI accepts it

Hi,

I am working on re-introducing regexp mutation support in mutant. Since the old integration existed, regexp_parser seems to have stopped rejecting a large share of the regexps that Ruby accepts (#63). I did find one additional case that I could not find documented anywhere. (I tried brute-forcing millions of regexps to see whether there were any cases where regexp_parser was stricter than MRI, and this is the only class of instances I could find.)

"\xA" # => "\n"
/\xA/.match?("\n") # => true

 Regexp::Parser.parse(/\xA/) # => Regexp::Scanner::PrematureEndError: Premature end of pattern at \x

Is this a bug or intended behavior? Either is fine for my purposes since I can just add a special check to ignore errors in this case, but I was curious if this was an intended difference or not. The coverage matrix in the README suggests that hex escapes work but I guess this is a special case that was not highlighted. If it is intentional behavior, it would be helpful to document it (unless I missed where this was done already) or alternatively having parity with MRI would work for me.

Thanks!
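
The "special check" mentioned above could start by asking MRI directly whether a pattern source compiles. This is a minimal sketch using only core Ruby; the helper name `mri_accepts?` is hypothetical and not part of regexp_parser.

```ruby
# Hypothetical helper (not part of regexp_parser): returns true when MRI
# itself can compile the given pattern source.
def mri_accepts?(source)
  Regexp.new(source)
  true
rescue RegexpError
  false
end

puts mri_accepts?('\xA') # single-digit hex escape from the report above
puts mri_accepts?('(')   # unbalanced group, which MRI rejects
```

A caller could use a check like this to decide whether a parser error reflects a truly invalid pattern or a case where the parser is stricter than MRI.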

2.9.0 gem not found

Dependabot rubocop-rspec upgrades are installing the new version of regexp_parser, 2.9.0. However, we are getting this error:

➜  bundle install
Fetching gem metadata from https://rubygems.org/.........
Your bundle is locked to regexp_parser (2.9.0) from rubygems repository https://rubygems.org/ or installed locally, but that version can no longer be found in that source. That means the author of regexp_parser (2.9.0) has removed it. You'll need to update your bundle to a version other than regexp_parser (2.9.0) that hasn't been removed in order to install.

This is the diff from the rubocop-rspec upgrade:

Support for Ruby 2.4.1 Absent Operator

See Absent Operator.

Basically it matches anything not within the parens:

>> "John Doe" =~ /\A(?~John) Doe\z/
=> nil
>> "Jane Doe" =~ /\A(?~John) Doe\z/
=> 0

Kind of weird they released this in a patch version.

error

@type tail
path /tmp/grok210
pos_file /tmp/hhhh
tag grokked_log
@type grok
grok_pattern %{IP:ip_address}
@type stdout

getting neither error nor output

Alternations are sometimes not correctly parsed

Consider this regex (for matching postal codes from Ecuador): /[A-Z]\d{4}[A-Z]|(?:[A-Z]{2})?\d{6}/. The alternation in the middle should effectively split the regex down the middle, since the alternation operator should have the lowest precedence of all regex operators. The AST should look like this:

                root
                 |
                alt
               /   \
[A-Z]\d{4}[A-Z]    (?:[A-Z]{2})?\d{6}

However, the AST generated by regexp_parser looks like this:

                    root
                    /  \
                 alt    \d{6}
                /   \
[A-Z]\d{4}[A-Z]      (?:[A-Z]{2})?

I'm not sure how to go about fixing this, any thoughts?
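
The difference between the two trees can be checked against MRI without the gem. The sketch below writes the second tree out as an explicit grouping (the name `misparsed` is only illustrative) and tests both patterns on a string that satisfies only the left-hand alternative.

```ruby
# The pattern from the report, plus the grouping that the second tree implies.
expected  = /[A-Z]\d{4}[A-Z]|(?:[A-Z]{2})?\d{6}/
misparsed = /(?:[A-Z]\d{4}[A-Z]|(?:[A-Z]{2})?)\d{6}/

sample = 'A1234B' # satisfies only the left-hand alternative

puts expected.match?(sample)  # alternation spans the whole pattern
puts misparsed.match?(sample) # would need six digits after the group
```

If the two calls disagree for this sample, the trees describe different languages, so the second tree cannot be a faithful parse of the original pattern.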

Missing constant in v2.1.0

I started using the new release as a dependency of Ruby2JS and got this error:

Error: …/regexp_parser-2.1.0/lib/regexp_parser/scanner.rb:13:in `<class:Scanner>': uninitialized constant Regexp::Parser (NameError)

Downgrading to 2.0.3 worked again. ~~Since the error is caused by a missing constant within the gem code itself, I don't think it's directly caused by anything in Ruby2JS. (But if so, let me know if I can help with troubleshooting.)~~

Parser is too greedy with alternation

Example code:

#!/usr/bin/env ruby

require 'rubygems'
require 'regexp_parser'
require 'pp'

s = "prefi(?:x)a|b|c"

re = Regexp.new(s)
p "prefixa" =~ re
p "prefixb" =~ re
p "prefixc" =~ re

pp Regexp::Scanner.scan(s)
pp Regexp::Parser.parse(s)

Note that the regexp does match "prefixa", "prefixb", and "prefixc". However, the parser constructs an alternation node containing "prefi(?:x)a", "b", and "c" as the three alternatives.

The code in question is at parser.rb:94

when :alternation
  unless @node.token == :alternation
    alt = Alternation.new(token)
    seq = Sequence.new
    while @node.expressions.last
      seq.insert @node.expressions.pop
    end
    alt.alternative(seq)

    @node << alt
    @node = alt
    @node.alternative
  else
    @node.alternative
  end

The code

while @node.expressions.last

is too greedy. Would it be correct to absorb only the last expression rather than all previous expressions?
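
Before changing the parser, it may help to check which reading MRI itself applies. A small sketch using only core Ruby prints the offset at which each match begins:

```ruby
re = Regexp.new('prefi(?:x)a|b|c')

# Offset at which each match begins.
puts 'prefixa' =~ re
puts 'prefixb' =~ re
puts 'prefixc' =~ re
```

An offset of 0 means the prefix took part in the match. In MRI the last two matches begin at offset 6, so `b` and `c` match as stand-alone alternatives, which suggests that MRI also treats everything before the first `|` as one alternative.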

Add `Regexp::Syntax.supported?`

It would be nice if I could do something like

Regexp::Syntax.supported?('ruby/2.3')   # => true
Regexp::Syntax.supported?('ruby/2.3.1') # => false

This would make it easier to determine which version to use for Regexp::Parser. This is motivated by mbj/mutant#595
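
A rough sketch of the requested behavior, assuming the check is a simple lookup in a list of known syntax names. `SyntaxSupport` and `KNOWN_VERSIONS` are hypothetical names used for illustration, not the gem's actual API or data.

```ruby
# Hypothetical sketch: SyntaxSupport and KNOWN_VERSIONS are illustrative
# names, not the gem's real API or version list.
module SyntaxSupport
  KNOWN_VERSIONS = %w[ruby/2.2 ruby/2.3].freeze

  # True when the given syntax name is in the known list.
  def self.supported?(name)
    KNOWN_VERSIONS.include?(name)
  end
end

puts SyntaxSupport.supported?('ruby/2.3')
puts SyntaxSupport.supported?('ruby/2.3.1')
```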

Regexp::Scanner::PrematureEndError: Premature end of pattern at #{str}

Example:

2.3.0 :001 > /\#{str}/
 => /\#{str}/
2.3.0 :003 > require 'regexp_parser'
 => true
2.3.0 :004 > Regexp::Parser.parse('\#{str}')
Regexp::Scanner::PrematureEndError: Premature end of pattern at #{str}
  from /Users/johnbackus/.rvm/gems/ruby-2.3.0/gems/regexp_parser-0.3.2/lib/regexp_parser/scanner.rb:1698:in `scan'
  from /Users/johnbackus/.rvm/gems/ruby-2.3.0/gems/regexp_parser-0.3.2/lib/regexp_parser/lexer.rb:20:in `lex'
  from /Users/johnbackus/.rvm/gems/ruby-2.3.0/gems/regexp_parser-0.3.2/lib/regexp_parser/parser.rb:26:in `parse'
  from (irb):4
  from /Users/johnbackus/.rvm/rubies/ruby-2.3.0/bin/irb:11:in `<main>'

I don't understand yet what the source of this issue is

Ruby 2.4.4 Not Supported

Ruby 2.4.4 is out but is not yet supported by regexp_parser (Unknown syntax name 'ruby/2.4.4'. Forgot to add it to Regexp::Syntax::VERSIONS?)

This looks similar to #48

Would it make sense to structure the version parsing to assume that new patch releases support the same feature set as the latest explicitly defined patch release for the given major/minor version? E.g. ruby 2.4.4 supports all of the same features as ruby 2.4.3 unless explicitly overridden? It seems odd to need to explicitly define/whitelist new patch revisions. I can understand wanting to explicitly define new major and minor revisions, but a patch release by its very nature should never be removing functionality. In perusing https://github.com/ammar/regexp_parser/tree/master/lib/regexp_parser/syntax/ruby, it seems that's largely the way the files are already structured, albeit explicitly.
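
The fallback proposed above could look roughly like this. It is a sketch only: `KNOWN` and `resolve_syntax_version` are hypothetical names, and the version list is illustrative rather than the gem's actual Regexp::Syntax::VERSIONS.

```ruby
require 'rubygems'

# Hypothetical sketch: KNOWN holds illustrative version strings, not the
# gem's actual list of supported syntax versions.
KNOWN = %w[2.3.0 2.4.0 2.4.3].map { |v| Gem::Version.new(v) }.freeze

# Returns the newest known version with the same major.minor that is not
# newer than the requested one, or nil when no such version exists.
def resolve_syntax_version(requested)
  wanted = Gem::Version.new(requested)
  KNOWN
    .select { |v| v.segments.first(2) == wanted.segments.first(2) && v <= wanted }
    .max
end

puts resolve_syntax_version('2.4.4')         # falls back within the 2.4 series
puts resolve_syntax_version('2.5.0').inspect # unknown minor version
```

Keeping the fallback limited to the same major.minor means new minor releases would still have to be defined explicitly, as the paragraph above suggests.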

Fails to parse lone opening brace

Attempting to parse /{/ results in an error:

Regexp::Parser.parse '{'
# Regexp::Scanner::PrematureEndError (Premature end of pattern at {)

However, for MRI this is perfectly fine (although it's equivalent to /\{/).

/{/
#=> /{/

Rethink fallbacks for formally incorrect grammar

Hi, and thanks for the awesome gem!

Recently regexp_parser started to be used in Rubocop to check regexp redundancy.

That led to uncovering what can be considered a bug (rubocop bug: rubocop/rubocop#8083). When parsing regexps such as /{.+}/ (which is a valid Ruby regexp), regexp_parser fails, treating {} as an incorrect quantifier. The same applies to some other forms, like /]\[/.

I found that the matter was discussed in #15, the verdict being that it

...is an implementation quirk of the regex engine. In other words, it's not a documented feature.

...Hence I propose to never even try to implement "Ruby" but implement a sane subset, explicitly not supporting stuff that does not make sense outside MRI implementation quirks.

...It raises exceptions now, keep it like this. But document the fact that regexp_parser does not support each MRI quirk.

Actually, I believe that it is not an "MRI quirk" but sane behavior for a regexp parser: some characters have special meaning only in context. The handling of {something that is not a quantifier} and ] is consistent across:

Ruby
Python
JS
Perl
PHP
(probably most of the rest of the implementations, at this point I stopped checking)

So, it seems that a parser that fails on those cases becomes less useful than it might be.
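
The context-dependent treatment described above can be observed in MRI with core Ruby alone. In this short sketch the pattern sources come from the discussion and the sample strings are illustrative:

```ruby
# Pattern sources taken from the discussion above. MRI compiles each one and
# treats the braces and brackets as literal characters.
puts Regexp.new('{.+}').match?('{abc}')
puts Regexp.new(']\[').match?('][')
puts Regexp.new('{').match?('{')
```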
