
greenery

Tools for parsing and manipulating regular expressions. Note that this is a very different concept from that of simply creating and using those regular expressions, functionality which is present in basically every programming language in the world, Python included.

This project was undertaken because I wanted to be able to compute the intersection between two regular expressions. The "intersection" is the set of strings which both regular expressions will accept, represented as a third regular expression.

Installation

pip install greenery

Example

from greenery import parse

print(parse("abc...") & parse("...def"))
# "abcdef"

print(parse("\d{4}-\d{2}-\d{2}") & parse("19.*"))
# "19\d{2}-\d{2}-\d{2}"

print(parse("\W*") & parse("[a-g0-8$%\^]+") & parse("[^d]{2,8}"))
# "[$%\^]{2,8}"

print(parse("[bc]*[ab]*") & parse("[ab]*[bc]*"))
# "([ab]*a|[bc]*c)?b*"

print(parse("a*") & parse("b*"))
# ""

print(parse("a") & parse("b"))
# "[]"

In the penultimate example, the regular expression returned is the empty string, because only the empty string is in both of the regular languages a* and b*. In the final example, an empty character class has been returned. An empty character class can never match anything, which means greenery can use it to represent a regular expression which matches no strings at all. Note that this is different from matching only the empty string.
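
One way to see the distinction, using Pattern methods documented below (a sketch; the outputs follow from the definitions above):

from greenery import parse

# "a*" & "b*" matches exactly one string: the empty string.
print(len(parse("a*") & parse("b*")))
# 1

# "a" & "b" matches no strings at all: the empty language.
print((parse("a") & parse("b")).empty())
# True
print(len(parse("a") & parse("b")))
# 0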

Internally, greenery works by converting regular expressions to finite state machines, computing the intersection of the two FSMs as a third FSM, and using the Brzozowski algebraic method (q.v.) to convert the third FSM back to a regular expression.
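
The same pipeline can be driven by hand. This sketch assumes Pattern.to_fsm(), which appears in the issue reports later on this page but not in the API tables below:

from greenery import parse

f1 = parse("abc...").to_fsm()
f2 = parse("...def").to_fsm()
both = f1.intersection(f2)  # the intersection, as a third FSM
print(both.accepts("abcdef"))
# True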

API

parse(string)

This function takes a regular expression (i.e. a string) as input and returns a Pattern object (see below) representing that regular expression.

The following metacharacters and formations have their usual meanings: ., *, +, ?, {m}, {m,}, {m,n}, (), |, [], ^ within [] character ranges only, - within [] character ranges only, and \ to escape any of the preceding characters or itself.

These character escapes are possible: \t, \r, \n, \f, \v.

The predefined character sets \w, \d and \s, and their negations \W, \D and \S, also have their usual meanings. A dot (.) matches any character, including newlines and carriage returns.

An empty charclass [] is legal and matches no characters: when used in a regular expression, the regular expression may match no strings.

Unsupported constructs

  • This function is intentionally strict and tolerates no ambiguity. For example, a hyphen must be escaped in a character class even if it appears first or last: [-abc] is a syntax error, so write [\-abc] instead. Escaping a character which doesn't need it is a syntax error too: [\ab] resolves to neither [\\ab] nor [ab].

  • The ^ and $ metacharacters are not supported. By default, greenery assumes that all regexes are anchored at the start and end of any input string. Carets and dollar signs will be parsed as themselves. If you want to not anchor at the start or end of the string, put .* at the start or end of your regex respectively.

    This is because computing the intersection between .*a.* and .*b.* (1) is largely pointless and (2) usually results in gibberish coming out of the program.

  • The non-greedy operators *?, +?, ?? and {m,n}? are permitted but do nothing, because they do not alter the regular language. For example, abc{0,5}def and abc{0,5}?def represent precisely the same set of strings (see the sketch after this list).

  • Parentheses are used to alternate between multiple possibilities e.g. (a|bc) only, not for capture grouping. Here's why:

    print(parse("(ab)c") & parse("a(bc)"))
    # "abc"
  • The (?:...) syntax for non-capturing groups is permitted, but does nothing.

  • Other (?...) constructs are not supported (and most are not regular in the computer science sense).

  • Back-references, such as ([aeiou])\1, are not regular.
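
To illustrate the non-greedy and non-capturing-group points above, a quick sketch (the outputs assume the equivalences described in the bullets; equivalent() is documented in the Pattern section below):

from greenery import parse

# Non-greedy operators don't change the regular language...
print(parse("abc{0,5}def").equivalent(parse("abc{0,5}?def")))
# True

# ...and (?:...) non-capturing groups parse but have no effect.
print(parse("(?:ab)c").equivalent(parse("abc")))
# True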

Pattern

A Pattern represents a regular expression and exposes various methods for manipulating it and combining it with other regular expressions. Patterns are immutable.

A regular language is a possibly-infinite set of strings. With this in mind, Pattern implements numerous methods like those on frozenset, as well as many regular expression-specific methods.

It's not intended that you construct new Pattern instances directly; use parse(string), above.

  • pattern.matches("a") or "a" in pattern: returns True if the regular expression matches the string, False otherwise.
  • pattern.strings() or for string in pattern: returns a generator of all the strings that this regular expression matches.
  • pattern.empty(): returns True if this regular expression matches no strings, otherwise False.
  • pattern.cardinality() or len(pattern): returns the number of strings which the regular expression matches. Throws an OverflowError if this number is infinite.
  • pattern1.equivalent(pattern2): returns True if the two regular expressions match exactly the same strings, otherwise False.
  • pattern.copy(): returns a shallow copy of pattern.
  • pattern.everythingbut(): returns a regular expression which matches every string not matched by the original. pattern.everythingbut().everythingbut() matches the same strings as pattern, but is not necessarily identical in structure.
  • pattern.reversed() or reversed(pattern): returns a reversed regular expression. For each string that pattern matched, reversed(pattern) will match the reversed string. reversed(reversed(pattern)) matches the same strings as pattern, but is not necessarily identical.
  • pattern.times(multiplier) or pattern * multiplier: returns the input regular expression multiplied by any Multiplier (see below).
  • pattern1.concatenate(pattern2, ...) or pattern1 + pattern2 + ...: returns a regular expression which matches any string of the form a·b·... where a is a string matched by pattern1, b is a string matched by pattern2, and so on.
  • pattern1.union(pattern2, ...) or pattern1 | pattern2 | ...: returns a regular expression matching any string matched by any of the input regular expressions. This is also called alternation.
  • pattern1.intersection(pattern2, ...) or pattern1 & pattern2 & ...: returns a regular expression matching any string matched by all input regular expressions. The successful implementation of this method was the ultimate goal of this entire project.
  • pattern1.difference(pattern2, ...) or pattern1 - pattern2 - ...: subtracts the set of strings matched by pattern2 onwards from those matched by pattern1 and returns the resulting regular expression.
  • pattern1.symmetric_difference(pattern2, ...) or pattern1 ^ pattern2 ^ ...: returns a regular expression matching any string accepted by pattern1 or pattern2 but not both.
  • pattern.derive("a"): returns the Brzozowski derivative of the input regular expression with respect to "a".
  • pattern.reduce(): returns a regular expression which is equivalent to pattern (i.e. matches exactly the same strings) but is simplified as far as possible. See the dedicated section below.
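
A few of these methods in action (a sketch; the outputs follow from the behaviour documented above):

from greenery import parse

p = parse("a{2,4}")
print(p.matches("aaa"))
# True
print("aaaaa" in p)
# False
print(len(p))
# 3, for "aa", "aaa" and "aaaa"
print(sorted(p.strings()))
# ['aa', 'aaa', 'aaaa']
print(p.equivalent(parse("aaa?a?")))
# True: "aaa?a?" matches exactly the same strings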

pattern.reduce()

Call this method to try to simplify the regular expression object. The following simplification heuristics are supported:

  • (ab|cd|ef|)g to (ab|cd|ef)?g
  • ([ab])* to [ab]*
  • ab?b?c to ab{0,2}c
  • aa to a{2}
  • a(d(ab|a*c)) to ad(ab|a*c)
  • 0|[2-9] to [02-9]
  • abc|ade to a(bc|de)
  • xyz|stz to (xy|st)z
  • abc()def to abcdef
  • a{1,2}|a{3,4} to a{1,4}

The value returned is a new Pattern object.

Note that in a few cases this may not result in a shorter regular expression.
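
For example, using the ab?b?c heuristic from the list above (a sketch; the exact output string assumes the heuristic fires as described):

from greenery import parse

print(parse("ab?b?c").reduce())
# "ab{0,2}c"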

Multiplier

A combination of a finite lower Bound (see below) and a possibly-infinite upper Bound.

from greenery import parse, Bound, INF, Multiplier

print(parse("a") * Multiplier(Bound(3), INF)) # "a{3,}"

STAR

Special Multiplier, equal to Multiplier(Bound(0), INF). When it appears in a regular expression, this is {0,} or the Kleene star *.

QM

Special Multiplier, equal to Multiplier(Bound(0), Bound(1)). When it appears in a regular expression, this is {0,1} or ?.

PLUS

Special Multiplier, equal to Multiplier(Bound(1), INF). When it appears in a regular expression, this is {1,} or +.
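
Putting Multiplier, Bound and these constants together (a sketch; it assumes STAR, QM and PLUS are importable from the top-level package in the same way as Bound and INF):

from greenery import parse, Bound, Multiplier, STAR, QM, PLUS

print(parse("a") * STAR)   # "a*"
print(parse("a") * QM)     # "a?"
print(parse("a") * PLUS)   # "a+"
print(parse("a") * Multiplier(Bound(2), Bound(5)))   # "a{2,5}"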

Bound

Represents a non-negative integer or infinity.

INF

Special Bound representing no limit. Can be used as an upper bound only.

Charclass

This class represents a character class such as a, \w, ., [A-Za-z0-9_], and so on. Charclasses must be constructed longhand either using a string containing all the desired characters, or a tuple of ranges, where each range is a pair of characters to be used as the range's inclusive endpoints. Use ~ to negate a Charclass.

  • a = Charclass("a")
  • [abyz] = Charclass("abyz")
  • [a-z] = Charclass("abcdefghijklmnopqrstuvwxyz") or Charclass((("a", "z"),))
  • \w = Charclass((("a", "z"), ("A", "Z"), ("0", "9"), ("_", "_")))
  • [^x] = ~Charclass("x")
  • \D = ~Charclass("0123456789")
  • . = ~Charclass(())
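
A sketch of the constructions above (it assumes Charclass is importable from the top-level package, and that printing a Charclass renders the regular-expression form shown in this list):

from greenery import Charclass

print(Charclass((("a", "z"),)))
# "[a-z]"
print(~Charclass("0123456789"))
# "\D"
print(~Charclass(()))
# "."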

Fsm

An Fsm is a finite state machine which accepts strings (or more generally iterables of Unicode characters) as input. This is used internally by Pattern for most regular expression manipulation operations.

In theory, accepting strings as input means that every Fsm's alphabet is the same: the set of all 1,114,112 possible Unicode characters which can make up a string. But this is a very large alphabet; it would result in extremely large transition maps and very poor performance. So, in practice, Fsm uses not single characters but Charclasses (see above) for its alphabet and its transition map.

from greenery import Fsm, Charclass

# FSM accepting only the string "a"
a = Fsm(
    alphabet={Charclass("a"), ~Charclass("a")},
    states={0, 1, 2},
    initial=0,
    finals={1},
    map={
        0: {Charclass("a"): 1, ~Charclass("a"): 2},
        1: {Charclass("a"): 2, ~Charclass("a"): 2},
        2: {Charclass("a"): 2, ~Charclass("a"): 2},
    },
)

Notes:

  • The Charclasses which make up the alphabet must partition the space of all Unicode characters - every Unicode character must be a member of exactly one Charclass in the alphabet.
  • States must be integers.
  • The map must be complete. Omitting transition symbols or states is not permitted.
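
Feeding strings to the machine above (this reuses the Fsm a from the example; the results follow from its transition map):

print(a.accepts("a"))
# True: 0 -> 1, and state 1 is final
print(a.accepts(""))
# False: the initial state 0 is not final
print(a.accepts("aa"))
# False: 0 -> 1 -> 2, and state 2 is a dead end
print(a.accepts("b"))
# False: "b" falls into ~Charclass("a"), leading to state 2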

A regular language is a possibly-infinite set of strings. With this in mind, Fsm implements several methods like those on frozenset.

  • fsm.accepts("a"): returns True if the FSM accepts the string, False otherwise.
  • fsm.strings(): returns a generator of all the strings which this FSM accepts.
  • fsm.empty(): returns True if this FSM accepts no strings, otherwise False.
  • fsm.cardinality(): returns the number of strings which the FSM accepts. Throws an OverflowError if this number is infinite.
  • fsm1.equivalent(fsm2): returns True if the two FSMs accept exactly the same strings, otherwise False.
  • fsm.copy(): returns a shallow copy of fsm.
  • fsm.everythingbut(): returns an FSM which accepts every string not accepted by the original. fsm.everythingbut().everythingbut() accepts the same strings as fsm.
  • fsm1.concatenate(fsm2, ...): returns an FSM which accepts any string of the form a·b·... where a is a string accepted by fsm1, b is a string accepted by fsm2, and so on.
  • fsm.times(multiplier): returns the input FSM concatenated with itself multiplier times. multiplier must be a non-negative integer.
  • fsm.star(): returns an FSM which is the Kleene star closure of the original.
  • fsm1.union(fsm2, ...): returns an FSM accepting any string accepted by any of the input FSMs. This is also called alternation.
  • fsm1.intersection(fsm2, ...): returns an FSM accepting any string accepted by all input FSMs.
  • fsm1.difference(fsm2, ...): subtracts the set of strings accepted by fsm2 onwards from those accepted by fsm1 and returns the resulting FSM.
  • fsm1.symmetric_difference(fsm2, ...): returns an FSM accepting any string accepted by fsm1 or fsm2 but not both.
  • fsm.derive(string): returns the Brzozowski derivative of the input FSM with respect to the input string.
  • fsm.reduce(): returns an FSM which is equivalent to fsm (i.e. accepts exactly the same strings) but has a minimal number of states.
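
Continuing with the Fsm a defined earlier, a sketch of a few of these methods (outputs assume the documented behaviour):

a_star = a.star()   # Kleene star closure: accepts "", "a", "aa", ...
print(a_star.accepts(""))
# True
print(a_star.accepts("aaa"))
# True
print(a.equivalent(a_star))
# False: a itself accepts only "a"
print(a.cardinality())
# 1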

Note that methods combining FSMs usually output new FSMs with modified alphabets. For example, concatenating an FSM with alphabet {Charclass("a"), ~Charclass("a")} and another FSM with alphabet {Charclass("abc"), ~Charclass("abc")} usually results in a third FSM with a repartitioned alphabet of {Charclass("a"), Charclass("bc"), ~Charclass("abc")}. Notice how all three alphabets partition the space of all Unicode characters.

Several other methods on Fsm instances are available, but these are subject to change and should not be used.

EPSILON

Special Fsm which accepts only the empty string.

NULL

Special Fsm which accepts no strings.
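
A sketch (it assumes EPSILON and NULL are importable from the top-level package):

from greenery import EPSILON, NULL

print(EPSILON.accepts(""))
# True: the empty string, and nothing else
print(EPSILON.accepts("a"))
# False
print(NULL.empty())
# True: no strings at all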

Development

Running tests

pip install -r requirements.dev.txt
isort .
black .
mypy greenery
flake8 --count --statistics --show-source --select=E9,F63,F7,F82 .
flake8 --count --statistics --exit-zero --max-complexity=10 .
pylint --recursive=true .
pytest

Building and publishing new versions

  • Update the version in ./setup.py
  • Trash ./dist
  • python -m build - creates a ./dist directory containing the distribution files
  • python -m twine upload dist/*

greenery's People

Contributors

andersk, cfbolz, cunha, doni69, jacksontj, jrdek, pjkundert, qntm, rwe, seanny123, sei-eschwartz


greenery's Issues

deepcopy changes finite state machine

I am trying to use the finite state machines and regexes together in my code, but the finite state machine changes when I copy the object:

regex = parse(".*")
fsm = regex.to_fsm()
fsm_copy = copy.deepcopy(fsm)
fsm.equivalent(regex.to_fsm())  # True
fsm_copy.equivalent(regex.to_fsm()) # False
fsm_copy.equivalent(fsm) # False

(a{2})* should not reduce to a*

The routine for multiplying two multipliers together is faulty: it assumes that this is possible in all situations, when in fact it is not. For example, {2,2} multiplied by {0,inf} gives the possibilities 0, 2, 4, 6, 8, ..., which cannot be expressed as a single contiguous range. At the moment we return {0,inf} instead of throwing an exception.

Strange behaviour

>>> print(parse("(\d{2})+") & parse("(\d{3})+") == parse("(\d{6})+"))
False

I don't see how \d(\d{6})*\d{5} differs from (\d{6})+ (nor do I see why it's computed to be a minimal regex).

Can't convert this seemingly simple FSM to a regex in good time

Looking at this Twitter thread, we see a task which is exactly the kind of thing which greenery is intended to accomplish: building finite state machines and then converting them to regular expressions. I decided to apply greenery to this problem, and as an intermediate phase of this work, I ended up with the following finite state machine:

  name final? D  L  R  U
--------------------------
* 0    False  1        2
  1    False     3
  2    False     4
  3    False           5
  4    False  6
  5    False        7
  6    False        8
  7    False           9
  8    False  10
  9    False     11
  10   False     12
  11   False     13
  12   False     14
  13   False  15
  14   False           16
  15   False        17
  16   False        18
  17   False  19
  18   False           19
  19   False     20
  20   True

which can be built using the Python expression:

fsm(alphabet = {'R', 'L', 'U', 'D'}, states = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20}, initial = 0, finals = {20}, map = {0: {'D': 1, 'U': 2}, 1: {'L': 3}, 2: {'L': 4}, 3: {'U': 5}, 4: {'D': 6}, 5: {'R': 7}, 6: {'R': 8}, 7: {'U': 9}, 8: {'D': 10}, 9: {'L': 11}, 10: {'L': 12}, 11: {'L': 13}, 12: {'L': 14}, 13: {'D': 15}, 14: {'U': 16}, 15: {'R': 17}, 16: {'R': 18}, 17: {'D': 19}, 18: {'U': 19}, 19: {'L': 20}, 20: {}})

This FSM is simple enough that it's possible to convert it to a regular expression by hand. The result is:

(DLURULLDRD|ULDRDLLUR)L

However, running greenery.lego.from_fsm against this FSM seems to take minutes, at least, on my machine - long enough that it hasn't finished in the time it's taking me to raise this issue.

This indicates some kind of performance problem in greenery, or possibly some kind of accidental infinite loop.

Unicode regexes not supported

[21:53:07]>>> parse(u'我')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/d33tah/virtualenv-py2/local/lib/python2.7/site-packages/greenery/lego.py", line 1641, in __repr__
    string += ", ".join(repr(c) for c in self.concs)
  File "/home/d33tah/virtualenv-py2/local/lib/python2.7/site-packages/greenery/lego.py", line 1641, in <genexpr>
    string += ", ".join(repr(c) for c in self.concs)
  File "/home/d33tah/virtualenv-py2/local/lib/python2.7/site-packages/greenery/lego.py", line 1404, in __repr__
    string += ", ".join(repr(m) for m in self.mults)
  File "/home/d33tah/virtualenv-py2/local/lib/python2.7/site-packages/greenery/lego.py", line 1404, in <genexpr>
    string += ", ".join(repr(m) for m in self.mults)
  File "/home/d33tah/virtualenv-py2/local/lib/python2.7/site-packages/greenery/lego.py", line 1169, in __repr__
    string += repr(self.multiplicand)
  File "/home/d33tah/virtualenv-py2/local/lib/python2.7/site-packages/greenery/lego.py", line 636, in __repr__
    string += repr("".join(str(char) for char in sorted(self.chars, key=str)))
UnicodeEncodeError: 'ascii' codec can't encode character u'\u6211' in position 0: ordinal not in range(128)

Greenery fails to output meaningful regex for particular FSM

My sample python file:

from greenery import fsm, lego

S5, S26, S45, S63, S80, S97, S113, S127, S140, S152, S163, S175, S182 = range(13)
char0, char1, char2, char3, char4, char5, char6, char7, char8 = '_', 'a', 'd', 'e', 'g', 'm', 'n', 'o', 'p'

machine = fsm.fsm(
    alphabet = {char0, char1, char2, char3, char4, char5, char6, char7, char8},
    states = {S5, S26, S45, S63, S80, S97, S113, S127, S140, S152, S163, S175, S182},
    initial = S5,
    finals = {S182},
    map = {
        S113: {char0: S127},
        S127: {char7: S140},
        S140: {char6: S152},
        S152: {char0: S163},
        S163: {char5: S175},
        S175: {char8: S182},
        S26: {char1: S45},
        S45: {char5: S63},
        S5: {char2: S26},
        S63: {char1: S80},
        S80: {char4: S97},
        S97: {char3: S113},
    },
)

rex = lego.from_fsm(machine)
print(rex)

print(machine)


No errors reported and the output is:

[]
  name final? _  a  d  e  g  m  n  o  p
----------------------------------------
* 0    False        1
  1    False     2
  2    False                 3
  3    False     4
  4    False              5
  5    False           6
  6    False  7
  7    False                       8
  8    False                    9
  9    False  10
  10   False                 11
  11   False                          12
  12   True

What am I supposed to change in the Python file to see the regex damage_on_mp?

[\w] is not supported

glitchmr@strawberry ~/g/greenery> python3 main.py '[\w]' 'blah'
Traceback (most recent call last):
  File "main.py", line 15, in <module>
    p = parse(regexes[0])
  File "/home/glitchmr/git/greenery/lego.py", line 64, in parse
    raise Exception("Could not parse '" + string + "' beyond index " + str(i))
Exception: Could not parse '[\w]' beyond index 0

Don't call `reduce()` automatically on every new `lego` object

The @reduce_after decorator is used on numerous methods in the lego package to ensure that the resulting lego object is minimal. However, this call is potentially performance-intensive and ends up getting called numerous times, frequently unnecessarily, during normal FSM<->regex conversions. Really, it should be up to the user to call reduce() when they feel it to be necessary.

(In fact, it often isn't necessary: legibility is difficult to achieve even in theory for many of the results of this package, and Python can handle the complexity regardless.)

Method for testing two regular expressions for equivalence

For two regular expressions (or FSMs) A and B it is possible to determine whether they express the same regular language quite easily, by computing (A - B) | (B - A) (the symmetric difference) and checking that the result is the empty language. (I think.) Anyway, it would be cool to expose this as a useful method in both classes.
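
Both routes are available as Pattern methods per the API table earlier on this page; a minimal sketch (assuming the documented behaviour):

from greenery import parse

a, b = parse("a*"), parse("(a?)*")
print((a ^ b).empty())
# True: the symmetric difference is the empty language
print(a.equivalent(b))
# True: the dedicated equivalence method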

New release?

Hi,

Thanks for a great project!

I was wondering why we haven't seen a new release on PyPI since version 3.1 in April 2018.
greenery has gone through several improvements and bug fixes over that period, and it would be very useful if a new release were available on PyPI, especially for libraries which directly depend on greenery (like our https://github.com/IBM/jsonsubschema library).

Is there anything we can help with to speed this up?

Regards,

Unicode support

I've encountered some problems using this library with Russian-language text. As far as I can see, the problem is in the use of str and str literals in the code, as well as hardcoded character lists for English. I've managed to make some things work in Python 2.7 by adding `from __future__ import unicode_literals` to the code and tests, adding Russian to the character lists, and replacing str with unicode in some places. Maybe I'll come up with a pull request, but in the meantime I'd like to know whether you think it's really needed, or whether you have any ideas on the most straightforward way to implement it.

Parse escaped characters `\x??`, `\u????` and `\U????????`

It seems that greenery does not support escaped characters:

import greenery
greenery.parse(
    '^[\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\U00010000-\\U0010FFFF]*$'
)

... throws

greenery.parse.NoMatch: Could not parse '^[\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\U00010000-\\U0010FFFF]*$' beyond index 1

However, Python's re module works with the escapes:

import re
re.compile(
    '^[\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\U00010000-\\U0010FFFF]*$'
)

I would expect greenery to match built-in module re in this regard. Or is this behavior by design?

What is the license?

I'd like to use this code, but I'm not sure what the license is. Can you add a LICENSE file?

Support NFAs

If I run parse('(0|1)*1' + '(0|1)'*n).to_fsm(), I get an FSM with 2^(n+1) states.
This is a common issue with using DFAs instead of NFAs for regular expression parsing: while DFAs support faster matching (linear time instead of quadratic), they can take exponential space.

For my use I don't mind paying the cost of running the NFA using breadth-first search.
But I'd rather not accidentally allocate exponential memory and crash my program.
A lot of operations, like union, concatenation etc., are also easier with NFAs, as you can insert epsilon edges to solve a lot of issues.

Issue for isdisjoint()

Hello,

I am seeing the following output.

from greenery.lego import parse
parse("/etc/.*").to_fsm().isdisjoint(parse("/etc/something.*").to_fsm())
True

However, I can verify that:

parse("/etc/.*").to_fsm().accepts("/etc/something")
True
parse("/etc/something.*").to_fsm().accepts("/etc/something")
True

Am I misinterpreting the semantics of the isdisjoint() method, or is this a bug?

Some background info: I have a similar motivation to yours; I'm interested in the intersection of two regular expressions. However, I am really only interested in knowing whether this intersection is empty. Is there a more efficient way to do this than calculating parse(rx1) & parse(rx2) and checking if it is empty? I am really trying to check whether two regular expressions that describe file system paths will overlap to some degree.

Cannot parse `[.-]`.

As far as I know, in regular expressions, when the hyphen (-) is used inside a character class ([...]) and it is the last character within the class, it is treated as a literal hyphen.

However, greenery fails on this regex:

    def parse(string: str):
        '''
            Parse a full string and return a `Pattern` object. Fail if
            the whole string wasn't parsed
        '''
        obj, i = match_pattern(string, 0)
        if i != len(string):
>           raise Exception(
                f"Could not parse '{string}' beyond index {str(i)}"
            )
E           Exception: Could not parse '[.-]' beyond index 0

The standard re library works fine:

>>> re.compile('[.-]')
re.compile(r'[.-]', re.UNICODE)

If I escape the dash, like [.\-], greenery accepts it, but it shouldn't be necessary.

It is impossible to define a true Mealy machine

From my understanding, the current behavior of this library is that all transitions are assigned the same output symbol as their input symbol. The mathematical definition of a Mealy machine requires transitions to have separate inputs and outputs (See: wikipedia.) For example, a transition from state A to state B whose input is the symbol X must be able to output a different symbol Y when that transition occurs.

Would it be possible to add this capability to this library?

Infinite loop in from_fsm() on Python 2

Converting the following simple 2-state automaton into a regexp causes an infinite recursion, when run on Python 2.7:

from greenery import fsm
from greenery import lego
f = fsm.fsm(alphabet={'a','b'}, states={0,1}, initial=0, finals={0}, map={0: {'a':0, 'b':1}, 1: {'a':1, 'b':0}})
print(f)
l = lego.from_fsm(f)
print(l)

In particular, I get a maximum recursion depth exceeded error when running from_fsm() in the script above, on Python 2.7:

Traceback (most recent call last):
  File "testlego.py", line 5, in <module>
    l = lego.from_fsm(f)
  File ".local/lib/python2.7/site-packages/greenery/lego.py", line 159, in from_fsm
    return brz[f.initial][outside].reduce()
  File ".local/lib/python2.7/site-packages/greenery/lego.py", line 44, in new_method
    result = method(self, *args, **kwargs)
  File ".local/lib/python2.7/site-packages/greenery/lego.py", line 1441, in reduce
    reduced = [m.reduce() for m in self.mults]
  File ".local/lib/python2.7/site-packages/greenery/lego.py", line 44, in new_method
    result = method(self, *args, **kwargs)
  File ".local/lib/python2.7/site-packages/greenery/lego.py", line 1277, in reduce
    reduced = self.multiplicand.reduce()
  File ".local/lib/python2.7/site-packages/greenery/lego.py", line 47, in new_method
    return result.reduce()
  File ".local/lib/python2.7/site-packages/greenery/lego.py", line 47, in new_method
    return result.reduce()
[repeats many times]
  File ".local/lib/python2.7/site-packages/greenery/lego.py", line 47, in new_method
    return result.reduce()
  File ".local/lib/python2.7/site-packages/greenery/lego.py", line 44, in new_method
    result = method(self, *args, **kwargs)
  File ".local/lib/python2.7/site-packages/greenery/lego.py", line 1719, in reduce
    if reduced != self.concs:
  File ".local/lib/python2.7/site-packages/greenery/lego.py", line 1395, in __eq__
    return self.mults == other.mults
  File ".local/lib/python2.7/site-packages/greenery/lego.py", line 1160, in __eq__
    and self.multiplier == other.multiplier
  File ".local/lib/python2.7/site-packages/greenery/lego.py", line 990, in __eq__
    return self.min == other.min and self.max == other.max
RuntimeError: maximum recursion depth exceeded

I'm using Python 2.7:

$ python
Python 2.7.13 (default, Jan 12 2017, 17:59:37)
[GCC 6.3.1 20161221 (Red Hat 6.3.1-1)] on linux2

When I run the same script with Python 3.5, it works fine. Python's sys.getrecursionlimit() returns 1000 for both Python 2 and Python 3 on my platform. I installed greenery via pip --user, and I have version 3.0 of the greenery library (the latest that Pip could find).

Why are instance attributes initialized with `self.__dict__`?

Less of an issue, more of a curiosity.

In all the class __init__(self) methods, there is something like:

self.__dict__["chars"]   = chars
self.__dict__["negated"] = negateMe

Instead of:

self.chars   = chars
self.negated = negateMe

This breaks static checks, but since this is mostly your repository I don't think that's a problem. However, I am curious why you initialize attributes this way. Does it give some special functionality that I missed?

fsm.lego() should raise an error if you used non-characters in your alphabet

At present it looks like there is no particular check during a call to fsm.lego() to make sure that the alphabet used in the FSM consists entirely of single-character strings (or lego.otherchars). It's not possible to produce a meaningful regex under these circumstances, so a clear error should be raised instead.

Sorting on alphabet set fails on unpickled fsm object

Hi,
I have found that many of the functions of the fsm class will fail on an unpickled fsm object. Below is an example of checking if one regex is a subset of another both before and after pickling and unpickling two fsm objects.

import pickle
from greenery import fsm, lego

r1 = "[A-Za-z0-9]{0,4}"
r2 = "[A-Za-z0-9]{0,3}"

l1: lego.lego = lego.parse(r1)
l2: lego.lego = lego.parse(r2)

f1: fsm.fsm = l1.to_fsm()
f2: fsm.fsm = l2.to_fsm()

if f2 < f1:
    print("r2 is a proper subset of r1")
else:
    print("r2 is NOT a proper subset of r1")

with open("/tmp/f1.bin", "wb") as f:
    pickle.dump(f1, f)

with open("/tmp/f2.bin", "wb") as f:
    pickle.dump(f2, f)

with open("/tmp/f1.bin", "rb") as f:
    f1_unpickled: fsm.fsm = pickle.load(f)

with open("/tmp/f2.bin", "rb") as f:
    f2_unpickled: fsm.fsm = pickle.load(f)

if f2_unpickled < f1_unpickled:
    print("r2 is a proper subset of r1")
else:
    print("r2 is NOT a proper subset of r1")

In the first if-statement it will correctly print out r2 is a proper subset of r1, but in the second one it will fail with the following traceback:

Traceback (most recent call last):
  File "/home/test/test/test.py", line 33, in <module>
    if f2_unpickled < f1_unpickled:
  File "/home/test/test/.venv/lib/python3.8/site-packages/greenery/fsm.py", line 615, in __lt__
    return self.ispropersubset(other)
  File "/home/test/test/.venv/lib/python3.8/site-packages/greenery/fsm.py", line 608, in ispropersubset
    return self <= other and self != other
  File "/home/test/test/.venv/lib/python3.8/site-packages/greenery/fsm.py", line 601, in __le__
    return self.issubset(other)
  File "/home/test/test/.venv/lib/python3.8/site-packages/greenery/fsm.py", line 594, in issubset
    return (self - other).empty()
  File "/home/test/test/.venv/lib/python3.8/site-packages/greenery/fsm.py", line 542, in __sub__
    return self.difference(other)
  File "/home/test/test/.venv/lib/python3.8/site-packages/greenery/fsm.py", line 539, in difference
    return parallel(fsms, lambda accepts: accepts[0] and not any(accepts[1:]))
  File "/home/test/test/.venv/lib/python3.8/site-packages/greenery/fsm.py", line 757, in parallel
    return crawl(alphabet, initial, final, follow).reduce()
  File "/home/test/test/.venv/lib/python3.8/site-packages/greenery/fsm.py", line 782, in crawl
    for symbol in sorted(alphabet, key=key):
TypeError: '<' not supported between instances of 'anything_else_cls' and 'str'

As can be seen, it seems to be caused by not being able to sort the alphabet set, because instances of anything_else_cls and str cannot be compared.
I have found that casting the symbol variable to a string in the key function, like below, fixes the issue, but I don't know if it is the correct way to do it:

def key(symbol):
    '''Ensure `fsm.anything_else` always sorts last'''
    return (symbol is anything_else, str(symbol))

Pylance sees greenery.lego.parse as returning NoReturn

This code fails to pass Pylance with a 'Cannot access member "everythingbut" for type "NoReturn"' error:

from greenery.lego import parse

test_regex = parse(".")
test_regex.everythingbut()

The heart of the issue is that the lego class has a number of functions that do not have a return type hint and are not marked with @abstractmethod, forcing Pylance to infer that the class never actually returns, so it is labelled NoReturn.

Suggestions would be to add at least return type hints to the lego and pattern classes.

Installation issue with pip

There seems to be an issue in the setup that prevents pip from installing the package:

pip3 install https://github.com/ferno/greenery/archive/master.zip
 Collecting https://github.com/ferno/greenery/archive/master.zip        
  Downloading https://github.com/ferno/greenery/archive/master.zip     
     - 51.9 kB 3.2 MB/s 0:00:00
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [6 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "C:\Users\username\AppData\Local\Temp\pip-req-build-j7vs03bj\setup.py", line 3, in <module>
          from greenery import __version__
      ImportError: cannot import name '__version__' from 'greenery' (C:\Users\lukas.rudischhauser\AppData\Local\Temp\pip-req-build-j7vs03bj\greenery\__init__.py)
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

Do not instantiate character ranges?

Thank you a lot for this really nice library!

We are trying to merge a couple of patterns, where one pattern consists of a series of character ranges. Namely, we look into characters permitted by XML 1.0:

import greenery

greenery.parse('^[\t\n\r -\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]*$')

We noticed that greenery immediately instantiates the character ranges into individual characters. Merging this pattern with any other pattern becomes obviously intractable on personal computers.

Have you ever considered handling of character ranges in a special manner rather than instantiating them immediately?

Regexp standard: posix or w3c?

I want to know which regex format and standard this tool is compliant with. Thanks.

Is it W3C based or POSIX based?

Character | is left unescaped

Sample script:

from greenery import fsm, lego

S1, S5 = range(2)
char0, char1, char2 = '\n', '\r', '|'

machine = fsm.fsm(
    alphabet = {char0, char1, char2},
    states = {S1, S5},
    initial = S1,
    finals = {S5},
    map = {
        S1: {char2: S5, char1: S5, char0: S5},
    },
)

rex = lego.from_fsm(machine)
print(rex)

Currently the output is [\n\r|]. I assume it should be [\n\r\|], as it becomes [\n\r\[] when I substitute | with [.

`pip install` command recommended in README doesn't work

$ pip install -r requirements.dev.txt
Collecting black
  Using cached black-23.1.0-py3-none-any.whl (174 kB)
ERROR: Cannot install flake8 because these package versions have conflicting dependencies.

The conflict is caused by:
    The user requested flake8
    The user requested (constraint) flake8==6.0.0

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts

I'll be honest: I do not comprehend how this could possibly constitute a dependency conflict. It seems like flake8 with no version specifier is perfectly reconcilable with flake8==6.0.0. @rwe

greedy vs non-greedy, which is default?

In the readme there is the sentence

The greedy operators *?, +?, ?? and {m,n}? are not supported, since they do not alter the regular language.

As far as I know, in a regex context the operators with ? are the non-greedy ones (https://stackoverflow.com/a/34806154/1739884) and the operators *, + and ? are the greedy ones.
This seems to be a contradiction.

So what does the sentence mean? There are two options:

  1. Is it that the non-greedy operators (*?, +?, ??) are not supported, or
  2. That the greedy operators (*, +, ?) are not supported (and that *?, +? and ?? are what greenery implements)?

If it is 2., my question would be whether the greedy operators are really regular in the computer science sense.
Example: the greedy regular expression A.*Z would accept "AxZ", "AZ" and "AZZ", but the non-greedy A.*?Z will accept "AxZ" and "AZ" but not "AZZ". So greedy matching uses some kind of backtracking (https://stackoverflow.com/questions/5319840/greedy-vs-reluctant-vs-possessive-qualifiers); I'm not sure whether this makes the regex "regular".

More efficient way to evaluate empty intersections?

Hey, not sure if this is possible, but the computation takes a really long time for expressions with wildcards at the beginning and end.

Is there some way to stop the computation early when computing the intersection if we identify that it's non-empty (rather than computing the full intersection)? Thanks!

Edit: Noticing that it only seems to take a while for expressions where I can see there is certainly a non-empty intersection. Is it safe to assume that if the intersection takes more than maybe 30 seconds to compute then it's not empty?

The patterns for XML text and RFC 8089 can not be intersected

I observe that the following patterns parse:

import greenery
r1 = greenery.parse(
    "^[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]*$"
)
r2 = greenery.parse(
    "^file:(//((localhost|(\[((([0-9A-Fa-f]{1,4}:){6}([0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}|([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]))|::([0-9A-Fa-f]{1,4}:){5}([0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}|([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]))|([0-9A-Fa-f]{1,4})?::([0-9A-Fa-f]{1,4}:){4}([0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}|([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]))|(([0-9A-Fa-f]{1,4}:)?[0-9A-Fa-f]{1,4})?::([0-9A-Fa-f]{1,4}:){3}([0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}|([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]))|(([0-9A-Fa-f]{1,4}:){2}[0-9A-Fa-f]{1,4})?::([0-9A-Fa-f]{1,4}:){2}([0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}|([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]))|(([0-9A-Fa-f]{1,4}:){3}[0-9A-Fa-f]{1,4})?::[0-9A-Fa-f]{1,4}:([0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}|([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]))|(([0-9A-Fa-f]{1,4}:){4}[0-9A-Fa-f]{1,4})?::([0-9A-Fa-f]{1,4}:[0-9A-Fa-f]{1,4}|([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5]))|(([0-9A-Fa-f]{1,4}:){5}[0-9A-Fa-f]{1,4})?::[0-9A-Fa-f]{1,4}|(([0-9A-Fa-f]{1,4}:){6}[0-9A-Fa-f]{1,4})?::)|[vV][0-9A-Fa-f]+\.([a-zA-Z0-9\-._~]|[!$&'()*+,;=]|:)+)\]|([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])|([a-zA-Z0-9\-._~]|%[0-9A-Fa-f][0-9A-Fa-f]|[!$&'()*+,;=])*)))?/((([a-zA-Z0-9\-._~]|%[0-9A-Fa-f][0-9A-Fa-f]|[!$&'()*+,;=]|[:@]))+(/(([a-zA-Z0-9\-._~]|%[0-9A-Fa-f][0-9A-Fa-f]|[!$&'()*+,;=]|[:@]))*)*)?|/((([a-zA-Z0-9\-._~]|%[0-9A-Fa-f][0-9A-Fa-f]|[!$&'()*+,;=]|[:@]))+(/(([a-zA-Z0-9\-._~]|%[0-9A-Fa-f][0-9A-Fa-f]|[!$&'()*+,;=]|[:@]))*)*)?)$"
)

... but when I try to merge them, greenery does not return and seems to run forever:

print(r1 & r2)

Just for the context: the first pattern corresponds to any XML text and the second pattern corresponds to RFC 8089.

fsm.accepts() doesn't work alongside lego.otherchars

It is possible to add lego.otherchars to the alphabet of your FSM, e.g. {"a", "b", "c", lego.otherchars}. If the FSM then accepts only "b", "c" or lego.otherchars, then as a regular expression this will be written out as [^a].

However, if you then pass the string "d" into fsm.accepts(), you'll get an exception of some kind because "d" is not in the FSM's alphabet. The "d" should be recognised as one of the many characters represented by lego.otherchars, and accepted. What you should get is True. (Or, in the case of other FSMs, perhaps False.)

Default "dead" state for FSMs with missing transitions

I find that a great deal of the time I need to add a dead state to my FSM. For example, this, for recognising block comments in C-like languages:

ws = greenery.fsm.fsm(
    alphabet = {"/", "*", greenery.lego.otherchars},
    states = {0, 1, 2, 3, 4, None},
    initial = 0,
    finals = {4},
    map = {
        0    : { "/" : 1   , "*" : None, greenery.lego.otherchars : None },
        1    : { "/" : None, "*" : 2   , greenery.lego.otherchars : None },
        2    : { "/" : 2   , "*" : 3   , greenery.lego.otherchars : 2    },
        3    : { "/" : 4   , "*" : 3   , greenery.lego.otherchars : 2    },
        4    : { "/" : None, "*" : None, greenery.lego.otherchars : None },
        None : { "/" : None, "*" : None, greenery.lego.otherchars : None },
    }
)

If there were a default "dead" state, I could reduce this to:

ws = greenery.fsm.fsm(
    alphabet = {"/", "*", greenery.lego.otherchars},
    states = {0, 1, 2, 3, 4},
    initial = 0,
    finals = {4},
    map = {
        0    : { "/" : 1},
        1    : { "*" : 2},
        2    : { "/" : 2, "*" : 3, greenery.lego.otherchars : 2},
        3    : { "/" : 4, "*" : 3, greenery.lego.otherchars : 2},
    }
)

which is far simpler and more legible. There are probably performance improvements to be had here as well - the missing state and transitions need not explicitly appear in the fsm object, which can help matters with very large FSMs.

Stop using asserts in live code

So I guess I use assert all over the code of this package? Yikes! Bad form, and it also makes unit testing with assert itself troublesome.

Are you interested in contributions?

I finally picked up greenery this afternoon, and have had a lot of fun with it. Before I spend much more time on it though I thought I'd check if you're interested in contributions:

  • I wrote an alternative string-to-lego parser which leans heavily on the CPython sre_parse module - master...Zac-HD:pyregex. It supports every construct that can be re.compiled, but has somewhat worse errors at the moment (eg no context given when bailing on a groupref). Also needs more testing for eg repeats 😄

  • I also wrote some property-based tests with Hypothesis, master...Zac-HD:property-tests. This has already turned up a few bugs of the form x != parse(str(x)) for some lego object x, but there's little point looking for more if you don't consider this a bug worth fixing. (I originally started this on the parser branch, but similar problems seem to exist on master too)

Either way, thanks for a great little library and a fun evening poking at it!

Kleene closure is broken

>>> parse('(ab*)*').to_fsm().accepts('bb')
True

>>> parse('(ab*)*').to_fsm()
fsm(alphabet = set(['a', anything_else, 'b']), states = set([0]), initial = 0, finals = set([0]), map = {0: {'a': 0, 'b': 0}})

Eliminate reliance on `hasattr` in `lego` module

This is a bad general practice and speaks to some kind of nebulous poor design choice. In my view the problem here is that charclass, mult, conc and pattern are all too public and it is possible to add them together in arbitrary combinations, like add a mult to a conc and still expect it to work.

I have a mind to conceal these inner classes from the user entirely and provide public access only to the pattern class, and handle all other forms of reduction from there. This way, the only possible operations are pattern-on-pattern operations. On the other hand this still wouldn't eliminate the ambiguity of having both charclass and pattern be acceptable multiplicands. Not really sure what the desired public API here looks like.

Multiple tests fail for lego_test.py, on Python 2

On Python 2.7, 24 of the 95 tests in lego_test.py fail. This includes:

  • Infinite recursion for test_binary_3, test_base_N, test_everythingbut, test_isinstance_bug, and others (maybe related to issue #32 ?)
  • Assertion failures for test_conc_common, test_pattern_commonconc, test_pattern_commonconc_suffix, test_conc_dock, and others
  • Exception raised for test_mult_doc and others

On Python 3.5, all the tests pass. I'm using greenery version 3.0, as installed by Pip.

Suggesting a unit test

Hi! Thanks for the quick reaction on issue #48 yesterday.
I have another real-life FSM; I just want to share it for a unit test.
Just in case: the current version processes it just fine.
Most likely I will have more examples. If you are interested, contact me at [email protected].

from greenery import fsm, lego

S40, S59 = range(2)
char0, char1, char10, char2, char3, char4, char5, char6, char7, char8, char9 = '0', '1', '.', '2', '3', '4', '5', '6', '7', '8', '9'

machine = fsm.fsm(
    alphabet = {char0, char1, char10, char2, char3, char4, char5, char6, char7, char8, char9},
    states = {S40, S59},
    initial = S59,
    finals = {S40},
    map = {
        S40: {char7: S40, char3: S40, char4: S40, char0: S40, char8: S40, char5: S40, char1: S40, char9: S40, char6: S40, char2: S40},
        S59: {char7: S59, char3: S59, char10: S40, char4: S59, char0: S59, char8: S59, char5: S59, char1: S59, char9: S59, char6: S59, char2: S59},
    },
)

rex = lego.from_fsm(machine)
print(rex)

Outputs as expected: \d*\.\d*

fsm.union() with no args fails, on Python 2; causes failing test case in fsm_test.py

Calling fsm.union() with no arguments fails on Python 2.7:

>>> from greenery.fsm import fsm
>>> fsm.union()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unbound method union() must be called with fsm instance as first argument (got nothing instead)   

The method documentation makes it sound like it should succeed. This succeeds on Python 3.5.

This causes one of the test cases in fsm_test.py to fail, on Python 2.7:

    def test_new_set_methods(a, b):
>   	assert len(fsm.union()) == 0
E    TypeError: unbound method union() must be called with fsm instance as first argument (got nothing instead)

In contrast, all tests pass for fsm_test.py on Python 3.5. I'm using version 3.0 of the greenery library, as installed via Pip.

Error when parse lego.parse('[\S]') or similar regex

When I parse a regex like [\S], it gives an error like:

>>> lego.parse('[\S\s]')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/yyy/.pyenv/versions/xx/lib/python3.6/site-packages/greenery/lego.py", line 68, in parse
    return pattern.parse(string)
  File "/Users/yy/.pyenv/versions/xx/lib/python3.6/site-packages/greenery/lego.py", line 244, in parse
    raise Exception("Could not parse '" + string + "' beyond index " + str(i))
Exception: Could not parse '[\S\s]' beyond index 0

`lego` should depend on `fsm`, `fsm` should never refer to `lego`

After much soul-searching I have concluded that the fsm module should stand completely alone, making no references to the lego sibling module. This will involve moving the lego.otherchars constant over to fsm and moving the fsm.lego() constructor-esque thing over to the lego class, as well as some unit tests.

lego, however, requires the fsm module to do many of its most useful regular expression manipulations, so it should continue to depend on that module as a prerequisite.

(..+)? wrongly reduces to .*

>>> str(lego.parse('(..+)?').reduce())
'.*'
>>> lego.parse('(..+)?').matches('a')
False
>>> lego.parse('(..+)?').reduce().matches('a')
True

Matching strings generator

I would like to be able to put in a regex (or FSM) and be returned a list of strings which that regex matches. E.g. "b(ee|ea)r" => ["beer", "bear"]. Given that this list is probably infinite, maybe "yield" would be worth investigating. E.g. "be*r" => ["br", "ber", "beer", "beeer", ...]. A breadth-first search might be the best approach, since otherwise something like "a*b" could go on to infinite depth without returning a hit.

Accept unescaped caret in character ranges

(I am not quite sure what part of the regular expression is problematic for greenery, so please change the title accordingly.)

I can compile the following pattern with re:

import re
re.compile(
    '^([!#$%&\'*+\\-.^_`|~0-9a-zA-Z])+/([!#$%&\'*+\\-.^_`|~0-9a-zA-Z])+([ \t]*;[ \t]*([!#$%&\'*+\\-.^_`|~0-9a-zA-Z])+=(([!#$%&\'*+\\-.^_`|~0-9a-zA-Z])+|"(([\t !#-\\[\\]-~]|[\\x80-\\xff])|\\\\([\t !-~]|[\\x80-\\xff]))*"))*$'
)

... but greenery fails:

import greenery

greenery.parse(
    '^([!#$%&\'*+\\-.^_`|~0-9a-zA-Z])+/([!#$%&\'*+\\-.^_`|~0-9a-zA-Z])+([ \t]*;[ \t]*([!#$%&\'*+\\-.^_`|~0-9a-zA-Z])+=(([!#$%&\'*+\\-.^_`|~0-9a-zA-Z])+|"(([\t !#-\\[\\]-~]|[\\x80-\\xff])|\\\\([\t !-~]|[\\x80-\\xff]))*"))*$'
)

with the exception:

greenery.parse.NoMatch: Could not parse '^([!#$%&\'*+\\-.^_`|~0-9a-zA-Z])+/([!#$%&\'*+\\-.^_`|~0-9a-zA-Z])+([ \t]*;[ \t]*([!#$%&\'*+\\-.^_`|~0-9a-zA-Z])+=(([!#$%&\'*+\\-.^_`|~0-9a-zA-Z])+|"(([\t !#-\\[\\]-~]|[\\x80-\\xff])|\\\\([\t !-~]|[\\x80-\\xff]))*"))*$' beyond index 1
