ammar / regexp_parser
A regular expression parser library for Ruby
License: MIT License
While working on regexp mutations for mutant, I found cases where regexp_parser indicates that a given Ruby version supports a specific Unicode property, but Ruby itself apparently does not. It could very well be that I am using the regexp_parser API incorrectly.
# report.rb
require 'regexp_parser'
syntax = ::Regexp::Syntax.version_class("ruby/#{RUBY_VERSION}")
puts "RUBY_VERSION: #{RUBY_VERSION}"
puts "Regexp::Parser::VERSION: #{::Regexp::Parser::VERSION}"
puts "Syntax: #{syntax.class}"
puts "Does not recognize while indicated by regexp_parser:"
syntax
.features
.fetch(:property, []).each do |property|
property_specifier = "\\p{#{property}}"
begin
/#{property_specifier}/
rescue RegexpError
puts property_specifier
end
end
I tested with the latest release of each non-EOL Ruby and got the following output:
RUBY_VERSION: 2.7.6
Regexp::Parser::VERSION: 2.3.0
Syntax: Class
Does not recognize while indicated by regexp_parser:
\p{egyptian_hieroglyph_format_controls}
\p{ottoman_siyaq_numbers}
\p{small_kana_extension}
\p{symbols_and_pictographs_extended_a}
\p{tamil_supplement}
RUBY_VERSION: 3.0.4
Regexp::Parser::VERSION: 2.3.0
Syntax: Class
Does not recognize while indicated by regexp_parser:
\p{egyptian_hieroglyph_format_controls}
\p{ottoman_siyaq_numbers}
\p{small_kana_extension}
\p{symbols_and_pictographs_extended_a}
\p{tamil_supplement}
RUBY_VERSION: 3.1.2
Regexp::Parser::VERSION: 2.3.0
Syntax: Class
Does not recognize while indicated by regexp_parser:
\p{egyptian_hieroglyph_format_controls}
\p{ottoman_siyaq_numbers}
\p{small_kana_extension}
\p{symbols_and_pictographs_extended_a}
\p{tamil_supplement}
A fix may be as easy as removing the indicated support, or telling me where I am using the API incorrectly.
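To cross-check a single property name against the running interpreter, independent of regexp_parser, one option is to let Ruby compile the pattern and rescue the error. This is only a minimal sketch: the helper name ruby_supports_property? is an assumption made for illustration and is not part of any library.

```ruby
# Hypothetical helper (not part of regexp_parser): asks the running Ruby
# whether it can compile \p{name}.
def ruby_supports_property?(name)
  Regexp.new("\\p{#{name}}")
  true
rescue RegexpError
  # Ruby raises RegexpError for unknown character property names.
  false
end

puts ruby_supports_property?("Alpha")
puts ruby_supports_property?("no_such_property_xyz")
```

A check like this could be run over the names returned by the syntax class to produce the same report as the script above.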
Getting a Regexp::Syntax::UnknownSyntaxNameError on Ruby 2.5.0:
Unknown syntax name 'ruby/2.5.0'. Forgot to add it to Regexp::Syntax::VERSIONS?
When would support for 2.5.0 be available?
Hey @ammar any chance you could release a new gem version? Another option would be to add a .gemspec to this project so I can add a git or github reference to it in my Gemfile. Thanks!
I also can't figure out how to build the gem or run tests. Running rake gem:release doesn't build the gem (at least, it doesn't appear in pkg), and rake test fails with a huge stack trace. Could you add some guidelines to the readme?
My understanding (based on the docs) is that the following are equivalent:
/
a # bc+
/x
/
a #bc+
/x
/
a# bc+
/x
/
a#bc+
/x
/a # bc+/x
/a #bc+/x
/a# bc+/x
/a#bc+/x
In all cases, the pattern should match only a single 'a', as whitespace and anything on a line following a # is ignored. Indeed, Ruby (2.6.6) treats them as equivalent. However, regexp_parser does not: it only recognises the comment when the pattern spans multiple lines and whitespace precedes the # (examples 1-2).
To demonstrate, run the following:
require 'regexp_parser'
p Regexp::Parser::VERSION
[
/
a # bc+
/x,
/
a #bc+
/x,
/
a# bc+
/x,
/
a#bc+
/x,
/a # bc+/x,
/a #bc+/x,
/a# bc+/x,
/a#bc+/x
].each do |pat|
puts "Pattern: #{pat.inspect}"
pat =~ 'abccc'
puts "pat =~ 'abccc' last_match: #{Regexp.last_match}"
puts Regexp::Parser.parse(pat).each_expression.map { |exp, _| exp.class.name.sub(/.*::/, "") }.join(', ')
puts
end
prints:
"1.7.1"
Pattern: /
a # bc+
/x
pat =~ 'abccc' last_match: a
WhiteSpace, Literal, WhiteSpace, Comment, WhiteSpace
Pattern: /
a #bc+
/x
pat =~ 'abccc' last_match: a
WhiteSpace, Literal, WhiteSpace, Comment, WhiteSpace
Pattern: /
a# bc+
/x
pat =~ 'abccc' last_match: a
WhiteSpace, Literal, WhiteSpace, Literal, Literal, WhiteSpace
Pattern: /
a#bc+
/x
pat =~ 'abccc' last_match: a
WhiteSpace, Literal, Literal, WhiteSpace
Pattern: /a # bc+/x
pat =~ 'abccc' last_match: a
Literal, WhiteSpace, Literal, Literal
Pattern: /a #bc+/x
pat =~ 'abccc' last_match: a
Literal, WhiteSpace, Literal, Literal
Pattern: /a# bc+/x
pat =~ 'abccc' last_match: a
Literal, WhiteSpace, Literal, Literal
Pattern: /a#bc+/x
pat =~ 'abccc' last_match: a
Literal, Literal
Notice how all patterns match just 'a', but only the first two include a Comment expression when parsed.
Am I misunderstanding something, or is this a bug?
related: rubocop/rubocop#9056
Regexp::Parser.parse((/¡#≥
/x).to_s).to_s
expected: all nodes of tree are utf-8 (they all come from a utf-8 source string)
actual: raises Encoding::CompatibilityError: incompatible character encodings: UTF-8 and ASCII-8BIT
thank you for your great lib!
(I'm not sure if I'm using the right terminology here so if my description doesn't make sense then just look at the irb snippet below)
It looks like regexp_parser successfully scans for escape:codepoint_list tokens but the different ruby version syntax classes don't say they implement this type of node.
Example:
2.3.0 :001 > codepoint_list_regex = /\u{9879}/
=> /\u{9879}/
2.3.0 :002 > require 'regexp_parser'
=> true
2.3.0 :003 > Regexp::Parser.parse(codepoint_list_regex)
Regexp::Syntax::NotImplementedError: Regexp::Syntax::Ruby::V230 does not implement: [escape:codepoint_list]
I have a fix which I should be able to PR soon
test case:
require 'uri'
@results = []
1000.times { Thread.new { @results << Regexp::Parser.parse(URI.regexp) } }
@results.map(&:strfre).uniq.count # => 10 or so (should be 1)
I guess this is due to the "instance" variables on the Scanner, Lexer and Parser classes existing on the classes themselves and thus being shared among executions. If so, it can be easily fixed by making their class methods into instance methods. E.g. for Scanner:
def self.scan(input_object, &block)
new.scan(input_object, &block)
end
def scan(input_object, &block)
@literal, top, stack = nil, 0, []
# ...
end
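The delegation pattern suggested above can be tried in isolation with a toy class. ToyScanner below is a made-up stand-in, not the gem's real Scanner: the class method keeps the public entry point, while all mutable state lives on a fresh instance per call, so concurrent calls cannot overwrite each other's variables.

```ruby
# Toy stand-in for the scanner; the class and its token format are
# assumptions for illustration only.
class ToyScanner
  # Class-level entry point: builds a new instance for every call.
  def self.scan(input)
    new.scan(input)
  end

  # Instance-level state (@tokens) is private to this call.
  def scan(input)
    @tokens = []
    input.each_char { |char| @tokens << [:literal, char] }
    @tokens
  end
end

threads = Array.new(20) { Thread.new { ToyScanner.scan("abc" * 50) } }
results = threads.map(&:value) # Thread#value joins and returns the result
puts results.uniq.size
```

Using Thread#value also makes sure every thread has finished before the results are compared.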
Octal escape sequences get their own representation when parsed at the top level, but not from within a character class:
irb(main):032:0> Regexp::Parser::VERSION
=> "2.6.2"
irb(main):033:0> Regexp::Scanner.scan(/\101;\x42/)
=> [[:escape, :octal, "\\101", 0, 4],
[:literal, :literal, ";", 4, 5],
[:escape, :hex, "\\x42", 5, 9]]
irb(main):034:0> Regexp::Scanner.scan(/[\101;\x42]/)
=>
[[:set, :open, "[", 0, 1],
[:escape, :literal, "\\1", 1, 3],
[:literal, :literal, "0", 3, 4],
[:literal, :literal, "1", 4, 5],
[:literal, :literal, ";", 5, 6],
[:escape, :hex, "\\x42", 6, 10],
[:set, :close, "]", 10, 11]]
Since hex is the same in both cases, getting escape literal "\\1" instead of escape octal "\\101" seems like a bug.
Consider
Regexp::Scanner.scan('\c;')
And note that the result is []
\c; does compile in ruby but I'm not sure what it matches; in PCRE, it matches {, but not in ruby...
Similarly,
Regexp::Scanner.scan('')
Regexp::Scanner.scan('\x')
Regexp::Scanner.scan('\x;HelloWorld')
are all []. All are invalid in ruby.
Desired Behavior:
Thanks, c.
Hello regexp_parser team!
While improving mutant's regexp support I'm often challenged with "Did I cover all nodes?" type of questions, and for that reason I try to source as many test cases as possible from my dependencies. I did so in unparser by re-using the parser test suite for Ruby edge cases.
And I'd love if I had the ability to do this for the regexp parser dependency also. Is there a good way to 'mass source' expressions from your libraries tests?
Cheers,
Markus
2.3.0 :012 > Regexp::Parser.parse(/\h/)
Regexp::Syntax::NotImplementedError: Regexp::Syntax::Ruby::V230 does not implement: [type:hex]
from /Users/johnbackus/.rvm/gems/ruby-2.3.0/gems/regexp_parser-0.3.3/lib/regexp_parser/syntax.rb:154:in `implements!'
from /Users/johnbackus/.rvm/gems/ruby-2.3.0/gems/regexp_parser-0.3.3/lib/regexp_parser/lexer.rb:22:in `block in lex'
from /Users/johnbackus/.rvm/gems/ruby-2.3.0/gems/regexp_parser-0.3.3/lib/regexp_parser/scanner.rb:4200:in `emit'
from /Users/johnbackus/.rvm/gems/ruby-2.3.0/gems/regexp_parser-0.3.3/lib/regexp_parser/scanner.rb:3366:in `scan'
from /Users/johnbackus/.rvm/gems/ruby-2.3.0/gems/regexp_parser-0.3.3/lib/regexp_parser/lexer.rb:20:in `lex'
from /Users/johnbackus/.rvm/gems/ruby-2.3.0/gems/regexp_parser-0.3.3/lib/regexp_parser/parser.rb:26:in `parse'
from (irb):12
from /Users/johnbackus/.rvm/rubies/ruby-2.3.0/bin/irb:11:in `<main>'
2.3.0 :013 > Regexp::Parser.parse(/\H/)
Regexp::Syntax::NotImplementedError: Regexp::Syntax::Ruby::V230 does not implement: [type:nonhex]
from /Users/johnbackus/.rvm/gems/ruby-2.3.0/gems/regexp_parser-0.3.3/lib/regexp_parser/syntax.rb:154:in `implements!'
from /Users/johnbackus/.rvm/gems/ruby-2.3.0/gems/regexp_parser-0.3.3/lib/regexp_parser/lexer.rb:22:in `block in lex'
from /Users/johnbackus/.rvm/gems/ruby-2.3.0/gems/regexp_parser-0.3.3/lib/regexp_parser/scanner.rb:4200:in `emit'
from /Users/johnbackus/.rvm/gems/ruby-2.3.0/gems/regexp_parser-0.3.3/lib/regexp_parser/scanner.rb:3367:in `scan'
from /Users/johnbackus/.rvm/gems/ruby-2.3.0/gems/regexp_parser-0.3.3/lib/regexp_parser/lexer.rb:20:in `lex'
from /Users/johnbackus/.rvm/gems/ruby-2.3.0/gems/regexp_parser-0.3.3/lib/regexp_parser/parser.rb:26:in `parse'
from (irb):13
from /Users/johnbackus/.rvm/rubies/ruby-2.3.0/bin/irb:11:in `<main>'
At least unused variables should be caught, c.f. #78
There is no syntax file for ruby 2.2.7, which means that on a current ruby 2.2 the code now fails with "Regexp::Syntax::UnknownSyntaxNameError: Unknown syntax name 'ruby/2.2.7'. Forgot to add it in the case statement?". I realize that ruby 2.2 is old, but it is still officially maintained until March 2018.
Right now, it is hard to tell which options apply to a specific expression when passing in a ::Regexp with flags or when there are groups with option flags.
E.g.
P = Regexp::Parser
P.parse(/a/i).expressions[0].i? # => false
P.parse(/(a)/i).expressions[0].i? # => false
P.parse(/x(?i)a/).expressions.last.i? # => false
P.parse(/(?i:(a))/).expressions[0].expressions[0].i? # => false
P.parse(/(?i-i:a)/).expressions[0].i? # => true (should be false)
To correctly determine whether a is case-insensitive, a user would have to keep track of the root options and any "currently active" option groups themselves.
@ammar if you agree, I think we could improve this by setting a correct options hash for every Token, as follows:
- Token class, :options, in token.rb:11
- Parser::options and move all option processing to the Lexer instead. imho currently active options are similar to group_level or set_level.
- @options based on input in the Lexer (use an empty Hash if input is a ::String instead of a ::Regexp)
- @options whenever encountering a group with option flags
- [:group, :options, ...] into two tokens: [:group, :options_local, ...] and [:group, :options_switch, ...], to allow differentiating between group-local option modifications (e.g. /a(?i:b)c/) and those that persist after the group is closed (e.g. /a(?i)bc/)
- @options to every new Token created in lexer.rb:30
- (@options and @options[:m]) ? true : false to !!@options[:m]
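The bookkeeping described here can be pictured as a stack of option hashes, one frame per open group. The sketch below is not the gem's code: the [type, token, text] triples are a simplified, hypothetical token stream, the :options_local and :options_switch names are taken from the proposal, and annotate_options and flag_changes are invented helper names.

```ruby
# Turns "(?i-m:" into { i: true, m: false } and "(?i)" into { i: true }.
def flag_changes(text)
  on, off = text.match(/\A\(\?([imx]*)(?:-([imx]*))?/).captures
  changes = {}
  on.each_char { |flag| changes[flag.to_sym] = true }
  off.to_s.each_char { |flag| changes[flag.to_sym] = false }
  changes
end

# Returns each token with the options active at that point appended.
def annotate_options(tokens, root_options = {})
  stack = [root_options.dup] # bottom frame holds the root (Regexp) options
  tokens.map do |type, token, text|
    if type == :group && token == :close
      stack.pop if stack.size > 1
    elsif type == :group && token == :options_switch
      # (?i) changes the current frame until the enclosing group closes.
      stack[-1] = stack.last.merge(flag_changes(text))
    elsif type == :group
      # Any other group opens a new frame; (?i:...) also changes flags in it.
      changes = token == :options_local ? flag_changes(text) : {}
      stack.push(stack.last.merge(changes))
    end
    [type, token, text, stack.last.dup]
  end
end

# Hypothetical tokens for /(?i-i:a)/
tokens = [
  [:group, :options_local, "(?i-i:"],
  [:literal, :literal, "a"],
  [:group, :close, ")"]
]
p annotate_options(tokens).map(&:last)
```

With this scheme the literal inside (?i-i:a) ends up with i set to false, which is the result the example above asks for.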
This is similar to #10 and pretty simple. My spec output for my regexp_parser implementation (mbj/mutant#565) currently gets flooded with this warning:
/Users/johnbackus/Projects/regexp_parser/lib/regexp_parser/expression.rb:75: warning: instance variable @quantifier not initialized
I'll open a PR to address this. I also get some similar warnings on boot from scanner.rb as mentioned in #10. I will probably look into addressing this as well and include a fix in my PR if the changes are reasonable.
Similar to #41, we're getting errors running under Ruby 2.4.3:
Unknown syntax name 'ruby/2.4.3'. Forgot to add it to Regexp::Syntax::VERSIONS?
Hi again,
I have found one interesting case where regexp_parser will not parse a regexp that MRI accepts:
/(?<æ>.)(.)(?<b>\d+)(\d)/.match('ab12').named_captures # => {"æ"=>"a", "b"=>"1"}
Regexp::Parser.parse(/(?<æ>.)(.)(?<b>\d+)(\d)/) # => Regexp::Scanner::InvalidGroupOption: Invalid group option in (?
See this line and this line from ruby/spec
for where I sourced this case.
Like in #75 I'm just curious if this is intended behavior and, if so, if it can be documented. Otherwise parity with MRI is preferred.
Thanks again!
The current release doesn't have a syntax file for ruby 2.3.1.
Just released. PR #49
https://www.ruby-lang.org/en/news/2017/12/25/ruby-2-5-0-released/
S = Regexp::Scanner
S.scan /[1-9]/ # => [..., [:set, :range, "1-9", 1, 4], ...]
S.scan /[a-z]/ # => [..., [:set, :range, "a-z", 1, 4], ...]
S.scan /[!-%]/ # => [..., [:set, :member, "!", 1, 2], [:set, :member, "-", 2, 3], [:set, :member, "%", 3, 4], ...]
S.scan /[ä-ü]/ # => [..., [:set, :member, "\xC3\xA4", 1, 3], [:set, :member, "-", 3, 4], [:set, :member, "\xC3\xBC", 4, 6], ...]
I think we could detect everything as a range that follows this pattern:
anything_but_a_bracket . "-" . anything_but_a_bracket
Of course, "anything" can be a unicode escape or even a codepoint list. Funny enough this works:
/[\u{41 42}-\u{55 56}]/ =~ "\u{44}" # => 0
Besides that, are there any other pitfalls I didn't think of?
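For reference, plain Ruby (no regexp_parser involved) treats the two sets that the Scanner splits into members as ordinary ranges. The following short check is a sketch; the non-ASCII characters are written as \u escapes only to keep the snippet independent of the file encoding.

```ruby
# Each set below behaves as a range in Ruby itself.
punct_range  = /\A[!-%]\z/            # U+0021..U+0025
umlaut_range = /\A[\u00e4-\u00fc]\z/  # U+00E4..U+00FC

puts punct_range.match?("#")          # "#" is U+0023, inside the range
puts punct_range.match?("&")          # "&" is U+0026, just outside it
puts umlaut_range.match?("\u00f6")    # U+00F6 lies inside the range
```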
And 2.2.10, but I don't care so much about that one :)
In ruby's regexp, /#(\d+)/ matches literal # then unnamed capture (\d+).
But Regexp::Parser.parse treats it as literal '#(\d+)'.
irb(main):001:0> /#(\d+)/.match('#123')
=> #<MatchData "#123" 1:"123">
irb(main):002:0> Regexp::Parser.parse('#(\d+)')
=> #<Regexp::Expression::Root:0x00005619a4ad5e30 @type=:expression, @token=:root, @text="", @ts=0, @level=nil, @set_level=nil, @conditional_level=nil, @nesting_level=0, @quantifier=nil, @options={}, @expressions=[#<Regexp::Expression::Literal:0x00005619a4b8cb80 @type=:literal, @token=:literal, @text="#(\\d+)", @ts=0, @level=0, @set_level=0, @conditional_level=0, @nesting_level=1, @quantifier=nil, @options={}>]>
irb(main):003:0> Regexp::Parser.parse('\#(\d+)')
=> #<Regexp::Expression::Root:0x00005619a4be42e0 @type=:expression, @token=:root, @text="", @ts=0, @level=nil, @set_level=nil, @conditional_level=nil, @nesting_level=0, @quantifier=nil, @options={}, @expressions=[#<Regexp::Expression::EscapeSequence::Literal:0x00005619a4bf5310 @type=:escape, @token=:literal, @text="\\#", @ts=0, @level=0, @set_level=0, @conditional_level=0, @nesting_level=1, @quantifier=nil, @options={}>, #<Regexp::Expression::Group::Capture:0x00005619a4bf52c0 @type=:group, @token=:capture, @text="(", @ts=2, @level=0, @set_level=0, @conditional_level=0, @nesting_level=1, @quantifier=nil, @options={}, @expressions=[#<Regexp::Expression::CharacterType::Digit:0x00005619a4bf51d0 @type=:type, @token=:digit, @text="\\d", @ts=3, @level=1, @set_level=0, @conditional_level=0, @nesting_level=2, @quantifier=#<Regexp::Expression::Quantifier:0x00005619a4bf51a8 @token=:one_or_more, @text="+", @mode=:greedy, @min=1, @max=-1>, @options={}>], @number=1, @number_at_level=1>]>
Regexp::Parser.parse('#(\d+)') should return the same tree as Regexp::Parser.parse('\#(\d+)') (except for the first node, which is "#" instead of "\\#").
irb(main):002:0> /(a)(?('01'))/
=> /(a)(?('01'))/
irb(main):003:0> Regexp::Parser.parse("(a)(?('01'))")
/home/mutant-dev/.gem/ruby/3.2.2/gems/regexp_parser-2.8.2/lib/regexp_parser/parser.rb:290:in `conditional': Unknown Conditional token condition_open (Regexp::Parser::UnknownTokenError)
from /home/mutant-dev/.gem/ruby/3.2.2/gems/regexp_parser-2.8.2/lib/regexp_parser/parser.rb:81:in `parse_token'
from /home/mutant-dev/.gem/ruby/3.2.2/gems/regexp_parser-2.8.2/lib/regexp_parser/parser.rb:39:in `block in parse'
from /home/mutant-dev/.gem/ruby/3.2.2/gems/regexp_parser-2.8.2/lib/regexp_parser/lexer.rb:74:in `emit'
from /home/mutant-dev/.gem/ruby/3.2.2/gems/regexp_parser-2.8.2/lib/regexp_parser/lexer.rb:57:in `block in lex'
from /home/mutant-dev/.gem/ruby/3.2.2/gems/regexp_parser-2.8.2/lib/regexp_parser/scanner.rb:2422:in `emit'
from /home/mutant-dev/.gem/ruby/3.2.2/gems/regexp_parser-2.8.2/lib/regexp_parser/scanner.rb:2475:in `emit_literal'
from /home/mutant-dev/.gem/ruby/3.2.2/gems/regexp_parser-2.8.2/lib/regexp_parser/scanner.rb:2408:in `emit'
from /home/mutant-dev/.gem/ruby/3.2.2/gems/regexp_parser-2.8.2/lib/regexp_parser/scanner.rb:2172:in `scan'
from /home/mutant-dev/.gem/ruby/3.2.2/gems/regexp_parser-2.8.2/lib/regexp_parser/scanner.rb:21:in `scan'
from /home/mutant-dev/.gem/ruby/3.2.2/gems/regexp_parser-2.8.2/lib/regexp_parser/lexer.rb:33:in `lex'
from /home/mutant-dev/.gem/ruby/3.2.2/gems/regexp_parser-2.8.2/lib/regexp_parser/lexer.rb:17:in `lex'
from /home/mutant-dev/.gem/ruby/3.2.2/gems/regexp_parser-2.8.2/lib/regexp_parser/parser.rb:38:in `parse'
from /home/mutant-dev/.gem/ruby/3.2.2/gems/regexp_parser-2.8.2/lib/regexp_parser/parser.rb:22:in `parse'
from (irb):3:in `<main>'
from /home/mutant-dev/.rubies/ruby-3.2.2/lib/ruby/gems/3.2.0/gems/irb-1.6.2/exe/irb:11:in `<top (required)>'
from /home/mutant-dev/.rubies/ruby-3.2.2/bin/irb:25:in `load`
I've reduced this from another rubyspec corpus case.
Summary: /o{2}{5}/ matches 10 o's (in ruby) but the {2} quantifier is lost in the parse tree.
Example code:
#!/usr/bin/env ruby
require 'rubygems'
require 'regexp_parser'
require 'pp'
re = /o{2}{5}/
pp Regexp::Parser.parse(re)
puts "o" if "o" =~ re
puts "2 o" if "o"*2 =~ re
puts "5 o" if "o"*5 =~ re
puts "10 o" if "o"*10 =~ re
Output:
#<Regexp::Expression::Root:0x105fd9d40
@expressions=
[#<Regexp::Expression::Literal:0x105fd6398
@expressions=[],
@options=nil,
@quantifier=
#<Regexp::Expression::Quantifier:0x105fd5e20
@max=5,
@min=5,
@mode=:greedy,
@text="{5}",
@token=:interval>,
@text="o",
@token=:literal,
@type=:literal>],
@options=nil,
@text="",
@token=:root,
@type=:expression>
10 o
Comments:
As far as I can tell, the nested quantifier syntax isn't documented in ruby and is illegal in pcre. Grep for example, will not match any number of o's for the given regexp. As such, I'd be content with a will-not-fix verdict. But I thought you might be interested.
Thank you for your time.
It looks like the edge case /foo(?#)bar/ is not properly parsed. I found this while running this gem against rubyspec. Here is the relevant part of the spec: https://github.com/ruby/spec/blob/master/language/regexp_spec.rb#L102-L105
Right now, code controlled by regexp_parser can generate warnings. On projects that have warnings enabled (such as mutant) and have the warning-generating call site in the main loop, this can lead to megabytes of warnings in the build logs.
This proposal is about:
I'm happy to take 2. and 4. in case @ammar agrees to 1. and 4.
It seems the ts and te values are byte indices, not character indices, even if you feed a multibyte string to the parser. Having to convert index values makes the parser harder to use, because a regexp is normally handled as multibyte text.
Is there any plan to optionally provide character indices in addition to or instead of byte indices? Thanks!
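Until something like that exists, a byte offset can be converted on the caller's side with core String methods. This is a minimal sketch; char_offset is a hypothetical helper name, and it assumes the byte offset falls on a character boundary, as token boundaries should.

```ruby
# Hypothetical helper: converts a byte offset into a character offset by
# counting the characters in the prefix that ends at that byte.
def char_offset(source, byte_offset)
  source.byteslice(0, byte_offset).length
end

source = "[\u00e4-\u00fc]" # two 2-byte characters, 7 bytes and 5 characters
puts char_offset(source, 4)  # byte 4 is where the second 2-byte char starts
```

The conversion is linear in the prefix length, so callers converting many offsets for one pattern may prefer to build a lookup table once.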
Hello 👋
I was wondering if you were planning on tagging a new release soon. This commit is interesting to me a408eb2 :)
Thanks!
Outside of sets, Regexp::Scanner recognizes two types of nonproperty:
S = Regexp::Scanner
S.scan /\p{ascii}/ # => [[:property, :ascii, "\\p{ascii}", 0, 9]]
S.scan /\p{^ascii}/ # => [[:nonproperty, :ascii, "\\p{^ascii}", 0, 10]]
S.scan /\P{ascii}/ # => [[:nonproperty, :ascii, "\\P{ascii}", 0, 9]]
Within sets, only one of them is recognized as nonproperty:
S.scan /[\p{ascii}]/ # => [..., [:set, :ascii, "\\p{ascii}", 1, 10], ...]
S.scan /[\p{^ascii}]/ # => [..., [:nonproperty, :ascii, "\\p{^ascii}", 1, 11], ...]
S.scan /[\P{ascii}]/ # => [..., [:set, :ascii, "\\P{ascii}", 1, 10], ...]
And I guess you would actually see it as a bug that \p{^...} does not get the type :set like everything else in a set, is that right? That would be easy to fix, and I think it would make sense from the viewpoint of consistency to change that.
However, fixing that would still make it necessary to scan the data of properties in sets for /\\(P|p\{\^)/ to detect whether they are negative. That is different from how they are handled outside of sets and different from how classes are handled, as classes have their "polarity" encoded at index 1:
S.scan /[[:ascii:]]/ # => [..., [:set, :class_ascii, "[:ascii:]", 1, 10], ...]
S.scan /[[:^ascii:]]/ # => [..., [:set, :class_nonascii, "[:^ascii:]", 1, 10], ...]
IMHO the ideal solution would be if all the information that tokens generate outside of sets was also available when they occur in a set. I am aware that this would require a substantial refactoring and would not be backwards compatible. But it would greatly help acting on specific tokens (escapes are another example) wherever they occur, without having to scan the token data.
Any thoughts?
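The text check mentioned above, which callers currently need for properties inside sets, can be sketched as a small predicate. The helper name negative_property_text? is invented for illustration; the pattern is the one quoted in the issue, anchored to the start of the token text.

```ruby
# Pattern from the issue text, anchored so it only looks at the escape prefix.
NEGATED_PROPERTY = /\A\\(?:P|p\{\^)/

# Hypothetical helper: true when the token text spells a negated property.
def negative_property_text?(text)
  NEGATED_PROPERTY.match?(text)
end

puts negative_property_text?("\\p{ascii}")
puts negative_property_text?("\\p{^ascii}")
puts negative_property_text?("\\P{ascii}")
```

Note that this simple check would also flag the doubly negated form \P{^...}, which is one more reason to prefer polarity information on the token itself.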
It looks like [\da-z] results in an incorrect members array in CharacterSet. Consider the following example:
Regexp::Parser.parse('[\da-z]').expressions.first.members
# => ["\\da", "-", "z"]
I think the array should contain a Digit object or '0-9' instead of the '\\d'.
Right now, handling the content of character sets with regexp_parser is hard:
- Scanner only detects few ranges successfully, as detailed in issue #29.
- Scanner returns inconclusive information about member tokens because they all have the type :set. Issue #28 describes this for properties, but it also affects \a, \e, \n, \t, \u, \v and more.
- Parser then "throws away" even this limited information as it only relays the Token#text to Set#members. (Re-running Parser#parse on individual Set#members is a poor workaround for this.)
What I have in mind as a general solution is the following:
- :subset token type, leaving #set_level to differentiate between sets and subsets
- :set token type only for tokens that are particular to sets ([, ^, &&, ] and ranges)
- :member, :member_hex, :range and :range_hex tokens
- Set#members, leaving #expressions to access members, ranges and subsets
Thus, parsing /a[bc-d]/ could yield something like
#<Root @expressions=[
#<Literal @type=:literal, @token=:literal, @text="a" >,
#<CharacterSet @expressions=[
#<Literal @type=:literal, @token=:literal, @text="b" >,
#<Range @type=:set, @token=:range, @expressions=[
#<Literal @type=:literal, @token=:literal, @text="c" >,
#<Literal @type=:literal, @token=:literal, @text="d" >
]>
]>
]>
The only tricky bit is rewiring the ragel machines in the right way and catching all ranges.
On the other hand, it would probably lead to less code, as special treatment is only needed for a few things within sets: the set tokens plus ., \b, and [:...:] if I am not mistaken.
What do you think, @ammar?
Hi! I think I found a bug: let's say we want to capture all the letters y followed by a closing angle bracket > that are not preceded by the letter x. We could use this regex:
irb(main):003:0> re = Regexp.new("(?<!x)y>")
=> /(?<!x)y>/
irb(main):004:0> ["y>", "py>", "xy>"].map { |s| re.match?(s) }
=> [true, true, false]
When we try to parse it, it fails:
irb(main):003:0> Regexp::Scanner.scan(re)
/Users/serch/.rbenv/versions/3.0.2/lib/ruby/gems/3.0.0/gems/regexp_parser-2.6.0/lib/regexp_parser/scanner.rb:2521:in `scan': Premature end of pattern at (missing group closing paranthesis) [1] (Regexp::Scanner::PrematureEndError)
without the closing angle bracket it succeeds:
irb(main):004:0> Regexp::Scanner.scan(Regexp.new("(?<!x)y"))
=> [[:assertion, :nlookbehind, "(?<!", 0, 4], [:literal, :literal, "x", 4, 5], [:group, :close, ")", 5, 6], [:literal, :literal, "y", 6, 7]]
The following regexp /[a]\g/ is accepted by Ruby, yet regexp_parser raises an error:
> ruby -e "p(/[a]\g/ =~ 'ag')"
0
> ruby -r regexp_parser -e "Regexp::Parser.parse '/[a]\g/'"
Traceback (most recent call last):
6: from -e:1:in `<main>'
5: from /Users/mal/.rvm/gems/ruby-2.7.1/gems/regexp_parser-1.7.1/lib/regexp_parser/parser.rb:22:in `parse'
4: from /Users/mal/.rvm/gems/ruby-2.7.1/gems/regexp_parser-1.7.1/lib/regexp_parser/parser.rb:38:in `parse'
3: from /Users/mal/.rvm/gems/ruby-2.7.1/gems/regexp_parser-1.7.1/lib/regexp_parser/lexer.rb:15:in `lex'
2: from /Users/mal/.rvm/gems/ruby-2.7.1/gems/regexp_parser-1.7.1/lib/regexp_parser/lexer.rb:28:in `lex'
1: from /Users/mal/.rvm/gems/ruby-2.7.1/gems/regexp_parser-1.7.1/lib/regexp_parser/scanner.rb:71:in `scan'
/Users/mal/.rvm/gems/ruby-2.7.1/gems/regexp_parser-1.7.1/lib/regexp_parser/scanner.rb:2664:in `scan': Scan error at '\\g/' (Regexp::Scanner::ScannerError)
The same reasoning as #63 applies: while technically incorrect, this Regexp is actually accepted and might exist in the wild. Would it be possible to tweak the gem to parse it the same way Ruby does?
In some situations (TBD), while running unit tests for https://github.com/rapid7/recog/, we get warnings like:
/home/jhart/.rbenv/versions/2.2.1/lib/ruby/gems/2.2.0/gems/regexp_parser-0.3.0/lib/regexp_parser/scanner.rb:2463: warning: mismatched indentations at 'end' with 'begin' at 1713
... dozens of these ...
/home/jhart/.rbenv/versions/2.2.1/lib/ruby/gems/2.2.0/gems/regexp_parser-0.3.0/lib/regexp_parser/scanner.rb:2753: warning: statement not reached
/home/jhart/.rbenv/versions/2.2.1/lib/ruby/gems/2.2.0/gems/regexp_parser-0.3.0/lib/regexp_parser/scanner.rb:1648: warning: assigned but unused variable - testEof
/home/jhart/.rbenv/versions/2.2.1/lib/ruby/gems/2.2.0/gems/regexp_parser-0.3.0/lib/regexp_parser/scanner.rb:2236: warning: duplicated when clause is ignored
/home/jhart/.rbenv/versions/2.2.1/lib/ruby/gems/2.2.0/gems/regexp_parser-0.3.0/lib/regexp_parser/lexer.rb:118: warning: assigned but unused variable - replace
/home/jhart/.rbenv/versions/2.2.1/lib/ruby/gems/2.2.0/gems/regexp_parser-0.3.0/lib/regexp_parser/scanner.rb:4200: warning: instance variable @literal not initialized
I also got these with 0.2.1.
You can reproduce most of these with ruby -we 'require "regexp_parser"; Regexp::Scanner.scan(/(foo)/) do |token_parts| puts "capture" if token_parts.first == :group && ![:close, :passive].include?(token_parts[1]); end'
I noticed that Unicode blocks are not supported in
.Unicode blocks are basically like ranges of Unicode script: http://www.regular-expressions.info/unicode.html#block
I'm happy to make a PR for this if it's helpful.
$ bundle exec irb -r regexp_parser
irb(main):001:0> Regexp::Parser.parse("\\99")
/home/mutant-dev/.gem/ruby/3.2.2/gems/regexp_parser-2.8.1/lib/regexp_parser/parser.rb:592:in `block in assign_referenced_expressions': Invalid reference 9 at pos 0 (Regexp::Parser::ParserError)
from /home/mutant-dev/.gem/ruby/3.2.2/gems/regexp_parser-2.8.1/lib/regexp_parser/parser.rb:590:in `each'
from /home/mutant-dev/.gem/ruby/3.2.2/gems/regexp_parser-2.8.1/lib/regexp_parser/parser.rb:590:in `assign_referenced_expressions'
from /home/mutant-dev/.gem/ruby/3.2.2/gems/regexp_parser-2.8.1/lib/regexp_parser/parser.rb:45:in `parse'
from /home/mutant-dev/.gem/ruby/3.2.2/gems/regexp_parser-2.8.1/lib/regexp_parser/parser.rb:22:in `parse'
from (irb):1:in `<main>'
from /home/mutant-dev/.rubies/ruby-3.2.2/lib/ruby/gems/3.2.0/gems/irb-1.6.2/exe/irb:11:in `<top (required)>'
from /home/mutant-dev/.rubies/ruby-3.2.2/bin/irb:25:in `load'
from /home/mutant-dev/.rubies/ruby-3.2.2/bin/irb:25:in `<top (required)>'
from /home/mutant-dev/.rubies/ruby-3.2.2/lib/ruby/3.2.0/bundler/cli/exec.rb:58:in `load'
from /home/mutant-dev/.rubies/ruby-3.2.2/lib/ruby/3.2.0/bundler/cli/exec.rb:58:in `kernel_load'
from /home/mutant-dev/.rubies/ruby-3.2.2/lib/ruby/3.2.0/bundler/cli/exec.rb:23:in `run'
from /home/mutant-dev/.rubies/ruby-3.2.2/lib/ruby/3.2.0/bundler/cli.rb:492:in `exec'
from /home/mutant-dev/.rubies/ruby-3.2.2/lib/ruby/3.2.0/bundler/vendor/thor/lib/thor/command.rb:27:in `run'
from /home/mutant-dev/.rubies/ruby-3.2.2/lib/ruby/3.2.0/bundler/vendor/thor/lib/thor/invocation.rb:127:in `invoke_command'
from /home/mutant-dev/.rubies/ruby-3.2.2/lib/ruby/3.2.0/bundler/vendor/thor/lib/thor.rb:392:in `dispatch'
from /home/mutant-dev/.rubies/ruby-3.2.2/lib/ruby/3.2.0/bundler/cli.rb:34:in `dispatch'
... 7 levels...
irb(main):002:0> /\99/
=> /\99/
I suspect that regexp_parser has to be as lenient when it comes to backrefs as ruby itself?
2.3.0 :504 > regex = /a{3}?/
=> /a{3}?/
2.3.0 :505 > Regexp::Parser.parse(regex).to_re == regex
=> false
2.3.0 :506 > Regexp::Parser.parse(regex).to_re
=> /a{3}/
It looks like the issue is the quantifier being set with incorrect text:
2.3.0 :508 > Regexp::Parser.parse(regex).first
=> #<Regexp::Expression::Literal:0x007fedbbdc4398 @type=:literal, @token=:literal, @text="a", @ts=0, @level=0, @set_level=0, @conditional_level=0, @options=nil, @quantifier=#<Regexp::Expression::Quantifier:0x007fedbbdc4118 @token=:interval, @text="{3}", @mode=:reluctant, @min=3, @max=3>>
2.3.0 :509 > Regexp::Parser.parse(regex).first.quantifier
=> #<Regexp::Expression::Quantifier:0x007fedbbd75608 @token=:interval, @text="{3}", @mode=:reluctant, @min=3, @max=3>
2.3.0 :510 > Regexp::Parser.parse(regex).first.quantifier.text
=> "{3}"
Hi,
I am working on re-introducing regexp mutation support in mutant, and I noticed that since the old integration existed, regexp_parser seems to have stopped rejecting a large percentage of the regexps that Ruby accepts (#63). I did find one additional case that was not documented anywhere I looked (I tried brute-forcing millions of regexps to find cases where regexp_parser was stricter than MRI, and this is the only class of instances I could find).
"\xA" # => "\n"
/\xA/.match?("\n") # => true
Regexp::Parser.parse(/\xA/) # => Regexp::Scanner::PrematureEndError: Premature end of pattern at \x
Is this a bug or intended behavior? Either is fine for my purposes since I can just add a special check to ignore errors in this case, but I was curious if this was an intended difference or not. The coverage matrix in the README suggests that hex escapes work but I guess this is a special case that was not highlighted. If it is intentional behavior, it would be helpful to document it (unless I missed where this was done already) or alternatively having parity with MRI would work for me.
Thanks!
Dependabot rubocop-rspec upgrades are installing the new version of regexp_parser, 2.9.0. However, we are getting this error:
➜ bundle install
Fetching gem metadata from https://rubygems.org/.........
Your bundle is locked to regexp_parser (2.9.0) from rubygems repository https://rubygems.org/ or installed locally, but that version can no longer be found in that source. That means the author of regexp_parser (2.9.0) has removed it. You'll need to update your bundle to a version other than regexp_parser (2.9.0) that hasn't been removed in order to install.
This is the diff from the rubocop-rspec upgrade:
See Absent Operator.
Basically it matches anything not within the parens:
>> "John Doe" =~ /\A(?~John) Doe\z/
=> nil
>> "Jane Doe" =~ /\A(?~John) Doe\z/
=> 0
Kind of weird they released this in a patch version.
Consider this regex (for matching postal codes from Ecuador): /[A-Z]\d{4}[A-Z]|(?:[A-Z]{2})?\d{6}/. The alternation in the middle should effectively split the regex down the middle, since the alternation operator should have the lowest precedence of all regex operators. The AST should look like this:
root
|
alt
/ \
[A-Z]\d{4}[A-Z] (?:[A-Z]{2})?\d{6}
However, the AST generated by regexp_parser looks like this:
root
/ \
alt \d{6}
/ \
[A-Z]\d{4}[A-Z] (?:[A-Z]{2})?
I'm not sure how to go about fixing this, any thoughts?
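The expected precedence can be confirmed against Ruby itself, without regexp_parser. In this sketch the pattern is wrapped in anchors so that only whole-string matches count. If \d{6} were a sibling of the alternation, as in the second tree, a string such as "A1234B123456" would match. With alternation binding loosest, it does not.

```ruby
postal = /[A-Z]\d{4}[A-Z]|(?:[A-Z]{2})?\d{6}/
# Anchor the whole alternation so partial matches do not count.
whole = /\A(?:#{postal.source})\z/

puts whole.match?("A1234B")        # left alternative
puts whole.match?("123456")        # right alternative, optional prefix absent
puts whole.match?("AB123456")      # right alternative, optional prefix present
puts whole.match?("A1234B123456")  # would only match under the wrong tree
```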
I started using the new release as a dependency of Ruby2JS and got this error:
Error: …/regexp_parser-2.1.0/lib/regexp_parser/scanner.rb:13:in `<class:Scanner>': uninitialized constant Regexp::Parser (NameError)
Downgrading to 2.0.3 worked again. Since the error is caused by a missing constant within the gem code itself, I don't think it's directly caused by anything in Ruby2JS. (But if so, let me know if I can help with troubleshooting.)
Example code:
#!/usr/bin/env ruby
require 'rubygems'
require 'regexp_parser'
require 'pp'
s = "prefi(?:x)a|b|c"
re = Regexp.new(s)
p "prefixa" =~ re
p "prefixb" =~ re
p "prefixc" =~ re
pp Regexp::Scanner.scan(s)
pp Regexp::Parser.parse(s)
Note that the regexp does match "prefixa", "prefixb", and "prefixc". However, the parse constructs an alternation node containing "prefix(?x)", "b", and "c" as the three alternatives.
The code in question is at parser.rb:94
when :alternation
  unless @node.token == :alternation
    alt = Alternation.new(token)
    seq = Sequence.new
    while @node.expressions.last
      seq.insert @node.expressions.pop
    end
    alt.alternative(seq)
    @node << alt
    @node = alt
    @node.alternative
  else
    @node.alternative
  end
The loop
  while @node.expressions.last
is too greedy. Would it be correct to absorb only the last expression rather than all previous expressions?
It would be nice if I could do something like
Regexp::Syntax.supported?('ruby/2.3') # => true
Regexp::Syntax.supported?('ruby/2.3.1') # => false
This would make it easier to determine which version to use for Regexp::Parser. This is motivated by mbj/mutant#595
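A minimal sketch of what such a predicate could look like, written as standalone Ruby so it does not depend on the gem's internals. `KNOWN_SYNTAX_NAMES` and `syntax_supported?` are hypothetical names, and the list contents are made up to mirror the example above:

```ruby
# Hypothetical stand-in for whatever list of syntax names the library keeps.
KNOWN_SYNTAX_NAMES = %w[ruby/2.2 ruby/2.2.0 ruby/2.3 ruby/2.3.0].freeze

# Returns true only when the exact syntax name is known.
def syntax_supported?(name)
  KNOWN_SYNTAX_NAMES.include?(name)
end

p syntax_supported?('ruby/2.3')
p syntax_supported?('ruby/2.3.1')
```

A caller could then probe candidate names from most to least specific before handing one to the parser, instead of rescuing an unknown-syntax error.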
Example:
2.3.0 :001 > /\#{str}/
=> /\#{str}/
2.3.0 :003 > require 'regexp_parser'
=> true
2.3.0 :004 > Regexp::Parser.parse('\#{str}')
Regexp::Scanner::PrematureEndError: Premature end of pattern at #{str}
from /Users/johnbackus/.rvm/gems/ruby-2.3.0/gems/regexp_parser-0.3.2/lib/regexp_parser/scanner.rb:1698:in `scan'
from /Users/johnbackus/.rvm/gems/ruby-2.3.0/gems/regexp_parser-0.3.2/lib/regexp_parser/lexer.rb:20:in `lex'
from /Users/johnbackus/.rvm/gems/ruby-2.3.0/gems/regexp_parser-0.3.2/lib/regexp_parser/parser.rb:26:in `parse'
from (irb):4
from /Users/johnbackus/.rvm/rubies/ruby-2.3.0/bin/irb:11:in `<main>'
I don't understand yet what the source of this issue is.
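For reference, this is how MRI itself treats the pattern, shown as a small plain-Ruby sketch with an invented sample string. The backslash before `#` suppresses interpolation, and the `{str}` that follows is not a valid quantifier, so the engine appears to match it literally:

```ruby
# No variable named str exists here; the escaped "#" means none is needed.
pattern = /\#{str}/

p pattern.source

# Single quotes keep the sample text free of interpolation as well.
p pattern =~ 'literal #{str} text'
```

The scanner error above concerns the same `{str}` span, which suggests it is being read as the start of a quantifier.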
Ruby 2.4.4 is out but is not yet supported by regexp_parser (Unknown syntax name 'ruby/2.4.4'. Forgot to add it to Regexp::Syntax::VERSIONS?).
This looks similar to #48
Would it make sense to structure the version parsing to assume that new patch releases support the same feature set as the latest explicitly-defined patch release for the given major/minor version? E.g. ruby 2.4.4 supports all of the same features as ruby 2.4.3 unless explicitly overridden? It seems odd to need to explicitly define/whitelist new patch revisions. I can understand wanting to explicitly define new major and minor revisions, but a patch release by its very nature should never be removing functionality. In perusing https://github.com/ammar/regexp_parser/tree/master/lib/regexp_parser/syntax/ruby, it seems that's largely the way the files are already structured, albeit explicitly.
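The fallback proposed here could be sketched roughly as follows. This is not how regexp_parser resolves versions; `KNOWN_VERSIONS` and `resolve_syntax_version` are hypothetical, and the version list is invented for the example:

```ruby
require 'rubygems' # for Gem::Version; normally loaded already

# Hypothetical list of explicitly defined syntax versions.
KNOWN_VERSIONS = %w[2.3.0 2.3.1 2.4.0 2.4.1 2.4.2 2.4.3].freeze

# Returns the requested version if it is defined, otherwise the latest
# defined patch release with the same major.minor, or nil if there is none.
def resolve_syntax_version(requested)
  return requested if KNOWN_VERSIONS.include?(requested)

  wanted = Gem::Version.new(requested)
  same_minor = KNOWN_VERSIONS
               .map { |v| Gem::Version.new(v) }
               .select { |v| v.segments.first(2) == wanted.segments.first(2) }
  latest = same_minor.max
  latest && latest.to_s
end

p resolve_syntax_version('2.4.4') # falls back within 2.4
p resolve_syntax_version('2.3.1') # defined explicitly
p resolve_syntax_version('2.5.0') # unknown minor version
```

Unknown major or minor versions still return nil here, which keeps the explicit opt-in for releases that may change the regex feature set.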
Attempting to parse /{/ results in an error:
Regexp::Parser.parse '{'
# Regexp::Scanner::PrematureEndError (Premature end of pattern at {)
However, for MRI this is perfectly fine (although it's equivalent to /\{/).
/{/
#=> /{/
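The equivalence can be checked against MRI directly. The snippet below is a small sketch in plain Ruby with made-up sample strings. It compares the bare and escaped forms, and also tries a pattern where braces surround something that is not a quantifier:

```ruby
bare    = Regexp.new('{')   # brace with no valid quantifier body
escaped = Regexp.new('\{')  # explicitly escaped brace

samples = ['a{b', 'no brace', '{']
samples.each do |text|
  # Both forms should report the same match position (or nil).
  p [text, bare =~ text, escaped =~ text]
end

# Braces around text that is not a quantifier are matched literally too.
braces = Regexp.new('{.+}')
p braces.match('x{yz}')[0]
```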
Hi, and thanks for the awesome gem!
Recently, Rubocop started using regexp_parser to check regexp redundancy.
That uncovered what can be considered a bug (rubocop bug: rubocop/rubocop#8083). When parsing a regexp such as /{.+}/ (which is a valid Ruby regexp), regexp_parser fails, treating {} as an incorrect quantifier. The same applies to some other forms, like /]\[/
I found out that the matter was discussed at #15, with verdict being, that it
...is an implementation quirk of the regex engine. In other words, it's not a documented feature.
...Hence I propose to never even try to implement "Ruby" but implement a sane subset, explicitly not supporting stuff that does not make sense outside MRI implementation quirks.
...It raises exceptions now, keep it like this. But document the fact that regexp_parser does not support each MRI quirk.
Actually, I believe that this is not an "MRI quirk" but sane behavior for a regexp parser: some characters have special meaning only in context. The behavior when parsing {something that is not a quantifier} and ] is consistent through:
So it seems that a parser that fails on those cases is less useful than it could be.