python-hyper / rfc3986

A Python Implementation of RFC3986 including validations

Home Page: https://rfc3986.readthedocs.io/en/latest/

License: Other

Python 100.00%

rfc3986's Issues

1.5.0: pytest `DeprecationWarning` warnings

I'm trying to package your module as an RPM package, so I'm using the typical build, install, and test cycle used when building packages from a non-root account:

"setup.py build"
"setup.py install --root </install/prefix>"
"pytest" with PYTHONPATH pointing to sitearch and sitelib inside </install/prefix>

Here is the pytest output:

+ PYTHONPATH=/home/tkloczko/rpmbuild/BUILDROOT/python-rfc3986-1.5.0-2.fc35.x86_64/usr/lib64/python3.8/site-packages:/home/tkloczko/rpmbuild/BUILDROOT/python-rfc3986-1.5.0-2.fc35.x86_64/usr/lib/python3.8/site-packages
+ /usr/bin/pytest -ra
.................................................................................................................................................................... [  5%]
.................................................................................................................................................................... [ 11%]
.................................................................................................................................................................... [ 17%]
.................................................................................................................................................................... [ 23%]
.................................................................................................................................................................... [ 29%]
.................................................................................................................................................................... [ 35%]
.................................................................................................................................................................... [ 40%]
.................................................................................................................................................................... [ 46%]
.................................................................................................................................................................... [ 52%]
.................................................................................................................................................................... [ 58%]
.................................................................................................................................................................... [ 64%]
.................................................................................................................................................................... [ 70%]
.................................................................................................................................................................... [ 75%]
.................................................................................................................................................................... [ 81%]
.................................................................................................................................................................... [ 87%]
.................................................................................................................................................................... [ 93%]
.................................................................................................................................................................... [ 99%]
.......................                                                                                                                                              [100%]
============================================================================= warnings summary =============================================================================
tests/test_api.py: 1 warning
tests/test_uri.py: 254 warnings
tests/test_unicode_support.py: 3 warnings
  /home/tkloczko/rpmbuild/BUILDROOT/python-rfc3986-1.5.0-2.fc35.x86_64/usr/lib/python3.8/site-packages/rfc3986/_mixin.py:116: DeprecationWarning: Please use rfc3986.validators.Validator instead. This method will be eventually removed.
    warnings.warn(

tests/test_api.py: 1 warning
tests/test_uri.py: 254 warnings
tests/test_unicode_support.py: 3 warnings
  /home/tkloczko/rpmbuild/BUILDROOT/python-rfc3986-1.5.0-2.fc35.x86_64/usr/lib/python3.8/site-packages/rfc3986/_mixin.py:172: DeprecationWarning: Please use rfc3986.validators.Validator instead. This method will be eventually removed.
    warnings.warn(

tests/test_api.py: 1 warning
tests/test_uri.py: 253 warnings
tests/test_unicode_support.py: 3 warnings
  /home/tkloczko/rpmbuild/BUILDROOT/python-rfc3986-1.5.0-2.fc35.x86_64/usr/lib/python3.8/site-packages/rfc3986/_mixin.py:144: DeprecationWarning: Please use rfc3986.validators.Validator instead. This method will be eventually removed.
    warnings.warn(

tests/test_api.py: 1 warning
tests/test_uri.py: 246 warnings
tests/test_unicode_support.py: 2 warnings
  /home/tkloczko/rpmbuild/BUILDROOT/python-rfc3986-1.5.0-2.fc35.x86_64/usr/lib/python3.8/site-packages/rfc3986/_mixin.py:191: DeprecationWarning: Please use rfc3986.validators.Validator instead. This method will be eventually removed.
    warnings.warn(

tests/test_api.py: 1 warning
tests/test_uri.py: 245 warnings
tests/test_unicode_support.py: 2 warnings
  /home/tkloczko/rpmbuild/BUILDROOT/python-rfc3986-1.5.0-2.fc35.x86_64/usr/lib/python3.8/site-packages/rfc3986/_mixin.py:210: DeprecationWarning: Please use rfc3986.validators.Validator instead. This method will be eventually removed.
    warnings.warn(

tests/test_api.py: 1 warning
tests/test_uri.py: 244 warnings
tests/test_unicode_support.py: 2 warnings
  /home/tkloczko/rpmbuild/BUILDROOT/python-rfc3986-1.5.0-2.fc35.x86_64/usr/lib/python3.8/site-packages/rfc3986/_mixin.py:229: DeprecationWarning: Please use rfc3986.validators.Validator instead. This method will be eventually removed.
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/warnings.html
2811 passed, 1517 warnings in 6.75s
pytest-xprocess reminder::Be sure to terminate the started process by running 'pytest --xkill' if you have not explicitly done so in your fixture with 'xprocess.getinfo(<process_name>).terminate()'.

helpers for RFC 7230 productions

Hello! I'm looking at improving h11's handling of URI-related stuff, and of course RFC 7230 delegates a bunch of the heavy lifting to RFC 3986. And I'd kinda rather not have to implement an RFC 3986 parser from scratch.

Annoyingly, though, RFC 7230 likes to refer directly to some of the intermediate productions inside RFC 3986. Specifically, I need to be able to:

check if a string matches the production origin-form = absolute-path [ "?" query] (where absolute-path and query are from RFC 3986)
check if a string matches the production authority, with empty user-info (I guess checking for "no user-info" is easy if you can check for authority, since an authority has a "@" in it iff it has a non-empty user-info)
check if a string matches the production host ":" port
check if a string is a valid absolute-URI, and if so, split it into scheme, authority, and everything else (path + query).

AFAICT rfc3986 has all the stuff it needs for doing these things, but most of it's not exposed. (Except for the last one -- I think I could implement that using parsed = rfc3986.uri_reference(purported_url); assert parsed.fragment is None; everything_else = parsed.path + "?" + parsed.query.)

Ideally what I'd want is the regex text (as opposed to compiled regex objects) for each of those productions. Is that something rfc3986 could easily provide?

(P.S.: what's going on with unicode handling? It seems like on py3, if I pass a byte-string to uri_reference then I get regular strs back?)
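
For illustration, the origin-form production mentioned above can be assembled as regex text from the RFC 3986 grammar. This is a minimal sketch using only the standard library; the constant names and the is_origin_form helper are hypothetical and are not part of rfc3986's API.

```python
import re

# Character classes from RFC 3986, written for use inside [...].
UNRESERVED = r"A-Za-z0-9\-._~"
SUB_DELIMS = r"!$&'()*+,;="
PCT_ENCODED = r"%[0-9A-Fa-f]{2}"

# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
PCHAR = r"(?:[{0}{1}:@]|{2})".format(UNRESERVED, SUB_DELIMS, PCT_ENCODED)
# absolute-path = 1*( "/" segment ), where segment = *pchar
ABSOLUTE_PATH = r"(?:/{0}*)+".format(PCHAR)
# query = *( pchar / "/" / "?" )
QUERY = r"(?:{0}|[/?])*".format(PCHAR)
# origin-form = absolute-path [ "?" query ]   (RFC 7230)
ORIGIN_FORM = r"{0}(?:\?{1})?".format(ABSOLUTE_PATH, QUERY)


def is_origin_form(target):
    """Hypothetical helper: True if target matches origin-form exactly."""
    return re.fullmatch(ORIGIN_FORM, target) is not None


print(is_origin_form("/where?q=now"))  # a request target with a query
print(is_origin_form("where"))         # no leading slash
```

Keeping the productions as plain strings, rather than compiled objects, is what lets a caller embed them in larger patterns.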

UnicodeDecodeError: 'utf8' codec can't decode byte 0x81 in position 31: invalid start byte

Trying the following:

from rfc3986 import is_valid_uri
assert is_valid_uri("http://fr.dbpedia.org/resource/\201dimbourg")

This doesn't raise an assertion error. Instead, it produces a traceback:

Traceback (most recent call last):
  File "./validate.py", line 2, in <module>
    assert is_valid_uri("http://fr.dbpedia.org/resource/\201dimbourg")
  File "/home/rob/tmp/rfc3986/rfc3986/api.py", line 62, in is_valid_uri
    return URIReference.from_string(uri, encoding).is_valid(**kwargs)
  File "/home/rob/tmp/rfc3986/rfc3986/uri.py", line 69, in from_string
    uri_string = to_str(uri_string, encoding)
  File "/home/rob/tmp/rfc3986/rfc3986/compat.py", line 24, in to_str
    b = b.decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x81 in position 31: invalid start byte
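
The failure can be reproduced without the library. The sketch below is written for Python 3, where the input has to be a bytes object to hit the same decode step; try_decode is a hypothetical helper, shown only to illustrate how a caller might turn the exception into a plain "not valid" result.

```python
def try_decode(raw, encoding="utf-8"):
    """Return (text, None) on success or (None, error) if decoding fails."""
    try:
        return raw.decode(encoding), None
    except UnicodeDecodeError as error:
        return None, error


# "\201" in the Python 2 report is the single byte 0x81.
raw_uri = b"http://fr.dbpedia.org/resource/\x81dimbourg"
text, error = try_decode(raw_uri)
if error is not None:
    print("cannot decode byte at position", error.start)
```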

Using GerritHub

Document flow for GerritHub

Incorrectly parses URLs with ports but no scheme.

There should be a way to work around this. We should strive to be better than the stdlib, not just marginally better.

>>> rfc3986.urlparse('172.25.241.180:8181/os-releases/11.1.0/')
ParseResult(scheme=u'172.25.241.180', userinfo=None, host=None, port=None, path=u'8181/os-releases/11.1.0/', query=None, fragment=None)
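
One possible workaround, sketched here with the standard library rather than rfc3986, is to mark the start of the authority explicitly before parsing. split_schemeless is a hypothetical helper, and the heuristic it uses (no "://" and no leading "//" means the input starts at the authority) is an assumption that will not suit every input.

```python
from urllib.parse import urlsplit


def split_schemeless(url):
    """Hypothetical helper: parse input that may lack a scheme."""
    # Assumption: without "://" or a leading "//", the string begins
    # with the authority, so prefix the "//" that RFC 3986 expects.
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    return urlsplit(url)


parts = split_schemeless("172.25.241.180:8181/os-releases/11.1.0/")
print(parts.hostname, parts.port, parts.path)
```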

To encode or not encode - best practices for "uncommon" uri characters, including whitespaces (%20)?

Dear all,

I am currently struggling with a whitespace problem that I guess should not be that complicated, so I am probably missing something here.

MWE:

import rfc3986.builder
rfc3986.builder.URIBuilder.from_uri("scheme:").extend_path("path 1").extend_path("path2").geturl()
# output: 'scheme:/path 1/path2'
rfc3986.builder.URIBuilder.from_uri("scheme:").extend_path("path 1/path2").geturl()
# output: 'scheme:/path 1/path2'
rfc3986.builder.URIBuilder.from_uri("scheme:path 1").extend_path("path2").geturl()
# output: 'scheme:/path%201/path2'

Therefore: if there is whitespace in the from_uri part, it gets escaped as %20, whereas whitespace passed as a parameter to extend_path is used as is.

From the broader scope, I am storing URIs in a database. One component constructs them "from scratch" (containing whitespace ...), whereas another component passes them in a URL-encoded, conformant form.
I already figured out that there is an equivalence when passing maybe-url-encoded strings to from_uri:

from_uri_a=rfc3986.builder.URIBuilder.from_uri("scheme:/path 1/path2").finalize()
from_uri_b=rfc3986.builder.URIBuilder.from_uri("scheme:/path%201/path2").finalize()
from_uri_a == from_uri_b
# is True

My main goal is to store the URIs in a future-proof way in my database and from the requirements I am having it does not really make a big difference whether or not I am storing the URLs encoded or not - but from the broader scope I am unsure whether the current implementation is desired or not (aka. a bug or a feature).

From the rfc, sec. 2.4, I guess that an encoding should take place in the extend_path method:

Under normal circumstances, the only time when octets within a URI
are percent-encoded is during the process of producing the URI from
its component parts. This is when an implementation determines which
of the reserved characters are to be used as subcomponent delimiters
and which can be safely used as data. Once produced, a URI is always
in its percent-encoded form.

Any thoughts on this?
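
As a point of comparison, encoding each segment at the moment the path is assembled (which is how the quoted section 2.4 reads) can be sketched with the standard library. build_path is a hypothetical helper and does not reflect how URIBuilder.extend_path is implemented.

```python
from urllib.parse import quote


def build_path(*segments):
    """Hypothetical helper: percent-encode each segment, then join."""
    # safe="" also encodes "/" so that a slash inside a segment is
    # treated as data rather than as a segment delimiter.
    return "/" + "/".join(quote(segment, safe="") for segment in segments)


uri = "scheme:" + build_path("path 1", "path2")
print(uri)  # scheme:/path%201/path2
```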

Allow URIBuilder to be initialized with a URL

For example,

print(URIBuilder.from_uri('https://github.com').add_path('/python-hyper').finalize().unsplit())
# => https://github.com/python-hyper

normalisation of urls containing non-ascii domains is broken and loses data

Initial parsing works:

>>> rfc3986.uri_reference('http://æåëý.com/path?query#fragment')
URIReference(scheme='http', authority='æåëý.com', path='/path', query='query', fragment='fragment')

Subsequent normalisation silently loses data:

>>> rfc3986.uri_reference('http://æåëý.com/path?query#fragment').normalize()
URIReference(scheme='http', authority=None, path='/path', query='query', fragment='fragment')

normalize_uri replaces 6 with %36

"6" omitted in important_characters['unreserved_chars']:
https://github.com/sigmavirus24/rfc3986/blob/master/rfc3986/misc.py#L35
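
A hand-typed character list is easy to get wrong in exactly this way. As a sketch (the constant name is hypothetical), the unreserved set from RFC 3986 section 2.3 can be built from the string module instead, so that no digit or letter can be left out:

```python
import string

# unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"   (RFC 3986, section 2.3)
UNRESERVED_CHARS = frozenset(string.ascii_letters + string.digits + "-._~")

print("6" in UNRESERVED_CHARS)  # True
print(len(UNRESERVED_CHARS))    # 52 letters + 10 digits + 4 marks = 66
```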

Allow specifying which schemes you want in validation.

Right now, if you want to check that a URI is valid and has one of a set of schemes, you have to parse the URI, check its validity, then manually check the scheme. It'd be nice to be able to just do it all in one go, something like:

# Require schemes, but doesn't matter which.
is_valid_uri("...", require_scheme=True)

# Require schemes, and must be either http or https.
is_valid_uri("...", require_scheme={"http", "https"})

support idna :)

Perhaps have normalize() use it.

Add documentation

Add sphinx docs
Have them hosted on ReadTheDocs

rfc3986 cannot be installed on Python 3 with locale encoding different than UTF-8

This bug was already fixed 3 months ago by pull request #14 (commit cf15373). The problem is that no new release has been made since this commit was merged. Can you please release a new version?

By the way, since the project doesn't need 2to3 or any other hack to support both Python 2 and Python 3, it looks like rfc3986 works well on Python 3 without any change and that you can release a "universal wheel": a single binary package for Python 2 and Python 3. Or at least, please also upload a wheel package for Python 3.

2.0.1 patch release?

Apologies for asking this as an issue, but is a 2.0.1 patch release likely anytime soon? It would be very helpful to not have to deal with the constant warnings anymore (#95).

Empty (but present) authority component lost

The library in its latest pip version (1.5.0; also observed in 1.4 as shipped in Debian) does not distinguish between present and absent authority components:

>>> rfc3986.urlparse('foo:///').unsplit()
'foo:/'

I don't see how RFC 3986 would justify normalizing this away, especially given the comment on page 41 that in some cases a URI would be normalized to one that has an empty (not absent) authority.

Parsing of fragments through ParseResult.from_string can be cut short by a newline character

Background

According to RFC 3986, line termination characters like \n are not allowed in any part of a URI, but their percent-encoded versions (such as %0A) are allowed. For other sections of the URL, such as the query and the path, ParseResult.from_string will normalize \n characters into percent-encoded characters and accept them. This is not true for the fragment section of the URI.

The Bug

Inserting any line termination character into the fragment section of the URL will result in the parsing of the fragment section being cut short.

Minimal Reproducible Example

import rfc3986
# Arguments are (URL, encoding, strict, lazy_normalize)
parsed_url = rfc3986.ParseResult.from_string('scheme://[email protected]:80/path?query#Fragment\nThatIsIllusive', 'utf-8', True, False)
print("Fragment: " + parsed_url.fragment)

This will print Fragment: Fragment

In contrast Furl, Hyperlink, Urllib, and Yarl all return Fragment: Fragment%0AThatIsIllusive

Cause

This is the regex used to parse different parts of a URI

SCHEME_RE = "[a-zA-Z][a-zA-Z0-9+.-]*"
_AUTHORITY_RE = "[^\\\\/?#]*"
_PATH_RE = "[^?#]*"
_QUERY_RE = "[^#]*"
_FRAGMENT_RE = ".*"

This bug is a result of the use of .* in the fragment regex. The . symbol in regex accepts every character except for line termination characters.
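
The behavior of . can be checked with the standard re module alone. The sketch below only demonstrates the matching difference; whether re.DOTALL (or a pattern such as [\s\S]*) is the right fix for the library is an assumption, not something the report settles.

```python
import re

fragment = "Fragment\nThatIsIllusive"

# By default "." stops at the first newline.
default_match = re.match(r".*", fragment).group()
# With re.DOTALL, "." also matches newline characters.
dotall_match = re.match(r".*", fragment, re.DOTALL).group()

print(repr(default_match))  # 'Fragment'
print(repr(dotall_match))   # 'Fragment\nThatIsIllusive'
```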

PATH_EMPTY regex does not actually match empty path

rfc3986/src/rfc3986/abnf_regexp.py

Line 153 in 1640734

PATH_EMPTY = "^$"

The PATH_EMPTY regex won't ever match anything when embedded within another regex, as it is in several places like here:

rfc3986/src/rfc3986/abnf_regexp.py

Lines 184 to 190 in 1640734

HIER_PART_RE = "(//{}{}|{}|{}|{})".format(
    COMPONENT_PATTERN_DICT["authority"],
    PATH_ABEMPTY,
    PATH_ABSOLUTE,
    PATH_ROOTLESS,
    PATH_EMPTY,
)

This causes the URIReference.is_absolute() method to return an incorrect result for a URI with a scheme but no path. For example, http: (or even http:?q=v) is an absolute URI according to section-4.3, but URIReference.from_string('http:').is_absolute() returns False.

There are likely other places where this causes intended matches to fail, but I have not tried to investigate further. It's a little odd looking, but a fix that would probably work is just to change the value of PATH_EMPTY to an empty string.
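
The effect of embedding "^$" can be shown with a much smaller pattern. The two regexes below are simplified stand-ins (a scheme, a colon, then either a slash-led path or the "empty path" alternative) and are not the library's actual HIER_PART_RE.

```python
import re

SCHEME = r"[a-zA-Z][a-zA-Z0-9+.-]*"

# Empty-path alternative written as "^$": "^" cannot match after "http:".
with_anchors = re.compile(r"^{}:(?:/[^?#]*|^$)$".format(SCHEME))
# Empty-path alternative written as an empty string, as suggested above.
with_empty = re.compile(r"^{}:(?:/[^?#]*|)$".format(SCHEME))

print(with_anchors.match("http:"))  # None
print(with_empty.match("http:") is not None)  # True
```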

Better handling of escaping in regex patterns

Currently we erroneously accept \\ in the USERINFO_RE (and probably many more!) due to regex escaping acting differently when inside or outside of a character class.

Example: http://user\\@google.com shouldn't be a valid URL but is accepted by us.

2.0.1 patch release?

Apologies for this probably not being the right place to ask about this, but is there a timeline for a 2.0.1 release? I am currently having to put a git commit in my pyproject.toml for a contracting project in order to keep it from emitting tons of warnings that were fixed in PR #95. I suspect my clients would be happier if rfc3986 were being installed from PyPI 🙂

Please let me know if there's anything I can do to help facilitate a patch release.

IPv4 regex can match any arbitrary value

A ValueError is raised when trying to validate the host as an IPv4:

Traceback (most recent call last):
  File "/lib/python3.7/site-packages/rfc3986/uri.py", line 354, in normalize
    (self.userinfo, self.host, self.port)),
  File "/lib/python3.7/site-packages/rfc3986/uri.py", line 198, in userinfo
    authority = self.authority_info()
  File "/lib/python3.7/site-packages/rfc3986/uri.py", line 169, in authority_info
    validators.valid_ipv4_host_address(host)):
  File "/lib/python3.7/site-packages/rfc3986/validators.py", line 376, in valid_ipv4_host_address
    return all([0 <= int(byte, base=10) <= 255 for byte in host.split('.')])
  File "/lib/python3.7/site-packages/rfc3986/validators.py", line 376, in <listcomp>
    return all([0 <= int(byte, base=10) <= 255 for byte in host.split('.')])
ValueError: invalid literal for int() with base 10: '6g9m8V6'

Indeed, the value 6g9m8V6 matches the current regex ([0-9]{1,3}.){3}[0-9]{1,3}. This is because the . symbol is not escaped, so it can match any character, which defeats the goal of matching only IPv4 addresses.

A corrected regex can be: ([0-9]{1,3}\.){3}[0-9]{1,3}.
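
Both patterns can be compared directly with the standard re module; the constant names below are only for this sketch.

```python
import re

UNESCAPED = re.compile(r"([0-9]{1,3}.){3}[0-9]{1,3}")
ESCAPED = re.compile(r"([0-9]{1,3}\.){3}[0-9]{1,3}")

# "6g", "9m", "8V" each satisfy "[0-9]{1,3}." when "." is a wildcard.
print(UNESCAPED.fullmatch("6g9m8V6") is not None)  # True
print(ESCAPED.fullmatch("6g9m8V6") is not None)    # False
print(ESCAPED.fullmatch("172.25.241.180") is not None)  # True
```

Note that the escaped pattern still accepts out-of-range octets such as 999, which is why a numeric range check like the one in the traceback is still needed afterwards.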

geturl() of the ParseResultBytes can't get correct url

When I run geturl() on an instance of ParseResultBytes, it just returns b'https:':

>>> print(parsed)
ParseResultBytes(scheme=b'https', userinfo=None, host=b'xn--i-7iq.ws', port=None, path=None, query=None, fragment=None)
>>> parsed.geturl()
b'https:'

document how to remove components (using copy_with()?)

Is the copy_with() method combined with empty strings (note that None doesn't work) the correct way to remove components like a fragment altogether?

>>> s = 'http://example.org/?this-is-the-query#this-is-the-fragment'
>>> rfc3986.uri_reference(s).copy_with(fragment='').unsplit()
'http://example.org/?this-is-the-query'

If so, this may be worth mentioning in the docs. If not, please point me to the proper approach. :)

Update for RFC 6874 (Zone Identifiers in IPv6 addresses)

https://tools.ietf.org/html/rfc6874

Resolution result is not consistent with RFC 3986 when the base has "rootless" path without authority

remove_dot_segments, as defined in RFC 3986 (section 5.2.4), sometimes adds a leading slash when the base path is "rootless".

STEP   OUTPUT BUFFER         INPUT BUFFER

 1 :                         foo/../baz
 2E:   foo                   /../baz
 2C:                         /baz
 2E:   /baz

So, resolving ../baz against scheme:foo/bar should result in scheme:/baz.
However, output of this library differs from that.

$ python
Python 3.9.9 (main, Jan 10 2022, 18:52:39)
[GCC 11.2.1 20211127] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from rfc3986 import uri_reference
>>> b = uri_reference('scheme:foo/bar')
>>> r = uri_reference('../baz')
>>> t = r.resolve_with(b)
>>> t
URIReference(scheme='scheme', authority=None, path='baz', query=None, fragment=None)
>>> t.unsplit()
'scheme:baz'
>>>

I think there should be special handling such as "when .. segment appears but output stack is empty, set prepend_slash flag" or something like that.

Path segment normalization should be consistent with RFC 3986 Section 5.2.4

https://tools.ietf.org/html/rfc3986#section-5.2.4

The remove_dot_segments() function returns a relative path for an absolute input path after removing dot-dot segments.

The remove_dot_segments() function returns an empty-string path for an absolute input path after removing dot-dot segments if the number of dot-dot segments is greater than the deepest path level. This is not compatible with the algorithm suggested by RFC3986 section-5.2.4.

INPUT: '/a/b/c/../../../../'
EXPECTED OUTPUT: '/'
Got: ''

Code snippet to reproduce the issue:

from rfc3986.normalizers import remove_dot_segments
assert remove_dot_segments('/a/b/c/../../../../') == '/'
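
For reference, the algorithm in RFC 3986 section 5.2.4 can be transcribed almost step for step. This is a minimal sketch, not the library's implementation; the function name remove_dot_segments_rfc is hypothetical, chosen to avoid confusion with rfc3986.normalizers.remove_dot_segments.

```python
def remove_dot_segments_rfc(path):
    """Sketch of RFC 3986 section 5.2.4 (steps 2A to 2E)."""
    output = []
    while path:
        if path.startswith("../"):        # 2A
            path = path[3:]
        elif path.startswith("./"):       # 2A
            path = path[2:]
        elif path.startswith("/./"):      # 2B: replace "/./" with "/"
            path = path[2:]
        elif path == "/.":                # 2B
            path = "/"
        elif path.startswith("/../"):     # 2C: replace with "/", drop last segment
            path = path[3:]
            if output:
                output.pop()
        elif path == "/..":               # 2C
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):         # 2D
            path = ""
        else:                             # 2E: move the first segment to the output
            start = 1 if path.startswith("/") else 0
            end = path.find("/", start)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return "".join(output)


print(remove_dot_segments_rfc("/a/b/c/../../../../"))  # /
print(remove_dot_segments_rfc("/a/b/c/./../../g"))     # /a/g
```

Because step 2C always leaves a leading "/" in the input buffer, an absolute input path can never collapse to the empty string.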

ParseResult.from_string accepts ports that aren't preceded by a ':' character

Background

RFC 3986 defines an authority as such

authority = [ userinfo "@" ] host [ ":" port ]

WHATWG says that

An opaque-host-and-port string must be either the empty string or: a valid opaque-host string, optionally followed by U+003A (:) and a URL-port string.
A scheme-relative-special-URL string must be "//", followed by a valid host string, optionally followed by U+003A (:) and a URL-port string, optionally followed by a path-absolute-URL string.

The Bug

ParseResult.from_string will parse a port number which is not preceded by a colon. This conflicts with both specifications.

Minimal Reproducible Example

import rfc3986
# Arguments are (URL, encoding, strict, lazy_normalize)
parsed_url = rfc3986.ParseResult.from_string('scheme://[v1.ip]8000/path', 'utf-8', True, False)
print("Host: " + str(parsed_url.host)) # prints 'Host: [v1.ip]'
print("Port: " + str(parsed_url.port)) # prints 'Port: 8000'

Cause

This is the regex used to parse the authority component in misc.py

SUBAUTHORITY_MATCHER = re.compile(
    (
        "^(?:(?P<userinfo>{})@)?"  # userinfo
        "(?P<host>{})"  # host
        ":?(?P<port>{})?$"  # port
    ).format(
        abnf_regexp.USERINFO_RE, abnf_regexp.HOST_PATTERN, abnf_regexp.PORT_RE
    )
)

This bug is a result of the first '?' character in the regex used for the port :?(?P<port>{})?$. This regex allows the colon to be optional independently of an optional port number. However, according to the specs a port number and colon should always be paired.
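
One way to pair the colon with the port is to put both inside a single optional group. The sketch below uses simplified stand-ins for the host and port patterns (they are assumptions for this example, not the values of abnf_regexp.HOST_PATTERN or abnf_regexp.PORT_RE) and leaves out userinfo.

```python
import re

HOST = r"\[[^\]]*\]|[^:\[\]]*"  # simplified stand-in: bracketed literal or plain name
PORT = r"[0-9]{1,5}"            # simplified stand-in

# Colon and port are optional independently, as in the reported regex.
LOOSE = re.compile(r"^(?P<host>{0}):?(?P<port>{1})?$".format(HOST, PORT))
# Colon and port are optional only together.
STRICT = re.compile(r"^(?P<host>{0})(?::(?P<port>{1}))?$".format(HOST, PORT))

print(LOOSE.match("[v1.ip]8000").group("port"))    # 8000
print(STRICT.match("[v1.ip]8000"))                 # None
print(STRICT.match("[v1.ip]:8000").group("port"))  # 8000
```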

Why doesn't `normalize_url` encode `%` as `%25`?

Sorry if this is a stupid question or was addressed elsewhere.

any way to validate only relative path?

Any way to validate only relative path? For example, just check characters.

reg-name RE is not sufficient

Your pattern for the reg-name component is::

reg_name = '[\w\d.]+'

However, this is not sufficient, I believe -- the Python regex tokens \w match any "word" characters which notably include "alpha" characters, plus numbers, plus the underscore, but not the full range of characters permitted in reg-names from RFC 3986 section 3.2.2 (despite the comment in your source). The spec specifies this formation for reg-name::

reg-name    = *( unreserved / pct-encoded / sub-delims )

Your reg_name particle should get redefined to support the fuller character set as per the RFC.
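
A pattern that follows the quoted production could look like the sketch below. The constant names are hypothetical; the character sets are the unreserved, sub-delims, and pct-encoded rules from RFC 3986.

```python
import re

UNRESERVED = r"A-Za-z0-9\-._~"
SUB_DELIMS = r"!$&'()*+,;="
PCT_ENCODED = r"%[0-9A-Fa-f]{2}"

# reg-name = *( unreserved / pct-encoded / sub-delims )
REG_NAME = r"(?:[{0}{1}]|{2})*".format(UNRESERVED, SUB_DELIMS, PCT_ENCODED)

print(re.fullmatch(REG_NAME, "my-host.example") is not None)   # True
print(re.fullmatch(r"[\w\d.]+", "my-host.example") is not None)  # False: "-" is not in \w
```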

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1030: ordinal not in range(128) during install on python 3.4

Copied from openstack nova launchpad bug:

https://bugs.launchpad.net/nova/+bug/1460206

http://logs.openstack.org/15/186315/4/check/gate-nova-python34/c48b0f5/console.html#_2015-05-29_15_57_16_241

2015-05-29 15:57:16.241 | Collecting rfc3986>=0.2.0 (from -r /home/jenkins/workspace/gate-nova-python34/requirements.txt (line 46))
2015-05-29 15:57:16.241 | Downloading http://pypi.region-b.geo-1.openstack.org/packages/source/r/rfc3986/rfc3986-0.2.2.tar.gz
2015-05-29 15:57:16.241 | Complete output from command python setup.py egg_info:
2015-05-29 15:57:16.241 | Traceback (most recent call last):
2015-05-29 15:57:16.242 | File "<string>", line 20, in <module>
2015-05-29 15:57:16.242 | File "/tmp/tmp.ALfXvU5OeC/pip-build-ottfpn10/rfc3986/setup.py", line 22, in <module>
2015-05-29 15:57:16.242 | readme = f.read()
2015-05-29 15:57:16.242 | File "/home/jenkins/workspace/gate-nova-python34/.tox/py34/lib/python3.4/encodings/ascii.py", line 26, in decode
2015-05-29 15:57:16.242 | return codecs.ascii_decode(input, self.errors)[0]
2015-05-29 15:57:16.242 | UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1030: ordinal not in range(128)
2015-05-29 15:57:16.242 |
2015-05-29 15:57:16.242 | ----------------------------------------
2015-05-29 15:57:16.242 | Command "python setup.py egg_info" failed with error code 1 in /tmp/tmp.ALfXvU5OeC/pip-build-ottfpn10/rfc3986

Looks like it might be an encoding problem with the README.rst?
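
If the README is indeed the cause, a common way to make the read independent of the locale is to pass the encoding explicitly. This is a sketch of that general technique, assuming the file is UTF-8; read_text is a hypothetical helper and this is not the project's actual setup.py.

```python
import io


def read_text(path, encoding="utf-8"):
    """Hypothetical helper: read a text file with an explicit encoding."""
    # io.open accepts encoding= on both Python 2 and Python 3, so the
    # result does not depend on the locale's preferred encoding.
    with io.open(path, encoding=encoding) as handle:
        return handle.read()
```

In a setup.py, this would replace a bare open('README.rst').read() call.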

Incorrect result returned by is_valid_uri()

According to RFC 3986, there are some ABNF rules:

URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / ".")

And I tried:

>>> from rfc3986 import is_valid_uri
>>> is_valid_uri("1")
True
>>> is_valid_uri("//")
True
>>> is_valid_uri("http#:/")
True

I think these results are not correct.

Encode userinfo appropriately

If we have a ParseResult with a userinfo section like foo:b@r#/z we should properly encode that so @, #, and / are percent-encoded.

We might also consider how this will affect the underlying URIReference object.
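
As a rough illustration with the standard library, the userinfo rule from RFC 3986 (unreserved, pct-encoded, sub-delims, and ":") maps onto the safe argument of urllib.parse.quote. encode_userinfo is a hypothetical helper, not part of rfc3986, and it assumes its input is not already percent-encoded, since quote would encode an existing "%" again.

```python
from urllib.parse import quote

# sub-delims plus ":"; quote() never encodes unreserved characters.
USERINFO_SAFE = "!$&'()*+,;=:"


def encode_userinfo(userinfo):
    """Hypothetical helper: percent-encode raw (unencoded) userinfo."""
    return quote(userinfo, safe=USERINFO_SAFE)


print(encode_userinfo("foo:b@r#/z"))  # foo:b%40r%23%2Fz
```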

Simple typo in the URIReference docstring in src/rfc3986/uri.py

Should be address rather than adddres - originally reported in https://github.com/urllib3/urllib3/pull/1642/files who suggested fixing here.

Host ']'

The following malformed URL is accepted by rfc3986:

B://]

Although the character ']' is allowed in a host, it must be in the context of an IPv6 or an IPvFuture, which this is not.

This malformed URL is rejected by urllib, urllib3, hyperlink, yarl, furl, and Boost.URL.

URIMixin.resolve_with emits a warning

URIMixin.resolve_with (which is not itself deprecated) emits warning "Please use rfc3986.validators.Validator instead".

Normalize RFC 4007 delimiter for IPv6 Zone IDs

RFC 4007 allows % as the delimiter for IPv6 zone IDs whereas RFC 6874 specifically uses %25 to comply with the host component rules in RFC 3986. We need to assume 6874 and normalize 4007 -> 6874. See discussion in urllib3/urllib3#1531.
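
A minimal sketch of that normalization, for a bracketed host only, is shown below. normalize_zone_delimiter is a hypothetical helper, and it makes one loud assumption: a zone ID that starts with "25" is taken to be already in RFC 6874 form, which is ambiguous for RFC 4007 input whose zone ID really begins with those digits.

```python
def normalize_zone_delimiter(host):
    """Hypothetical helper: rewrite "[addr%zone]" as "[addr%25zone]"."""
    if not (host.startswith("[") and host.endswith("]")):
        return host  # not an IP literal
    address, delimiter, zone = host[1:-1].partition("%")
    if not delimiter:
        return host  # no zone ID present
    # Assumption: a leading "25" means the delimiter is already "%25".
    if zone.startswith("25"):
        zone = zone[2:]
    return "[{}%25{}]".format(address, zone)


print(normalize_zone_delimiter("[fe80::1%eth0]"))    # [fe80::1%25eth0]
print(normalize_zone_delimiter("[fe80::1%25eth0]"))  # [fe80::1%25eth0]
```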

Allow for methods to extend components instead of replacing them

Right now the add_<component> methods will replace the existing value. We should probably allow for someone to extend these in an intelligent way.

For example: It might be nice to have an existing path, e.g., /python-hyper and allow them to add something "underneath" that, e.g., /rfc3986 such that the final path is /python-hyper/rfc3986.

Also it may be nice to allow someone to add to the query argument list instead of replacing it wholesale and interact with it as a list of tuples.

Newest release notes are not shown on https://rfc3986.readthedocs.io

I tried to check the release notes for the newest 1.4.0 release, but the latest release notes shown on https://rfc3986.readthedocs.io/en/latest/release-notes/index.html# are for version 1.1.0

Then I checked the commits on github and found 25dffd6 with the newest release notes.
By the way, these release notes have a bug: the heading shows the wrong version number (1.3.0 instead of 1.4.0).

New coverage release breaks the tests on Python 3.2

https://travis-ci.org/sigmavirus24/rfc3986/jobs/86443853

May as well stop (explicitly) supporting 3.2 at this point.

`test_encode_invalid_iri[http://\U0002f868.com]` fails

Running tox -e py27 -v -- tests/test_iri.py::test_encode_invalid_iri I got the error:

F..                                                                                     [100%]
========================================== FAILURES ===========================================
_______________________ test_encode_invalid_iri[http://\U0002f868.com] ________________________

iri = 'http://㛼.com'

    @requires_idna
    @pytest.mark.parametrize("iri", [
        u'http://㛼.com',
        u'http://♥.net',
        u'http://\u0378.net',
    ])
    def test_encode_invalid_iri(iri):
        # import pdb;pdb.set_trace()
        iri_ref = rfc3986.iri_reference(iri)
        with pytest.raises(InvalidAuthority):
>           iri_ref.encode()
E           Failed: DID NOT RAISE <class 'rfc3986.exceptions.InvalidAuthority'>

tests/test_iri.py:60: Failed

The backtrace:

-> testfunction(**testargs)
  /usr/src/RPM/BUILD/python-module-rfc3986-1.3.1/tests/test_iri.py(60)test_encode_invalid_iri()
-> iri_ref.encode()
  /usr/src/RPM/BUILD/python-module-rfc3986-1.3.1/.tox/py27/lib/python2.7/site-packages/rfc3986/iri.py(132)encode()
-> if self.host:
  /usr/src/RPM/BUILD/python-module-rfc3986-1.3.1/.tox/py27/lib/python2.7/site-packages/rfc3986/_mixin.py(61)host()
-> authority = self.authority_info()
  /usr/src/RPM/BUILD/python-module-rfc3986-1.3.1/.tox/py27/lib/python2.7/site-packages/rfc3986/_mixin.py(31)authority_info()
-> match = self._match_subauthority()
> /usr/src/RPM/BUILD/python-module-rfc3986-1.3.1/.tox/py27/lib/python2.7/site-packages/rfc3986/iri.py(76)_match_subauthority()
-> return misc.ISUBAUTHORITY_MATCHER.match(self.authority)
(Pdb) self.authority
u'\U0002f868.com'
(Pdb) misc.ISUBAUTHORITY_MATCHER.match(self.authority)
(Pdb)

An important note (the system Python 2 is configured to use UCS-2):

(Pdb) import sys
(Pdb) sys.maxunicode > 0xFFFF
False

Consider Andrey's Thoughts

From @shazow:

17:18.45      shazow it's the on-demand parsing that is the problem
17:18.49      shazow urlparse does the same thing
17:19.15      shazow by the time i access url.port or whatever, i should be able to assume that url is safely parsed
17:19.34      shazow otherwise i gotta shove try/except around every single time i access a property, rather than in one sensible place
17:19.48      shazow or i gotta maintain my own parsed state outside of the one you give me

bug report (fuzz testing)

'http+unix://%2Fvar%2Frun%2Fsocket/path?key=value'

rfc3986/src/rfc3986/abnf_regexp.py

Line 74 in 4102c50

REGULAR_NAME_RE = REG_NAME = '(({0})*|[{1}]*)'.format(

Resolving a relative URI Reference against a base with only an authority does not work

From IETF RFC 3986 Uniform Resource Identifier (URI): Generic Syntax I would expect the following test to pass:

from rfc3986 import uri_reference  # type: ignore[import]
from rfc3986.uri import URIReference  # type: ignore[import]


def test_schema_only_base() -> None:
    relative_uri: URIReference = uri_reference("john.smith")
    base_uri: URIReference = uri_reference("example:")
    resolved_uri: URIReference = relative_uri.resolve_with(base_uri, strict=True)
    assert resolved_uri.unsplit() == "example:/john.smith"

This is because of:

5.2.1. Pre-parse the Base URI

Note that only the scheme component is required to be
present in a base URI; the other components may be empty or
undefined.

5.2.2. Transform References

# Branches that won't execute elided with [...]
if defined(R.scheme) then
   [...]
else
   if defined(R.authority) then
      [...]
   else
      if (R.path == "") then
         [...]
      else
         if (R.path starts-with "/") then
            [...]
         else
            T.path = merge(Base.path, R.path);
            T.path = remove_dot_segments(T.path);
         endif;
         T.query = R.query;
      endif;
      T.authority = Base.authority;
   endif;
   T.scheme = Base.scheme;
endif;

T.fragment = R.fragment;

5.2.3. Merge Paths

The pseudocode above refers to a "merge" routine for merging a
relative-path reference with the path of the base URI. This is
accomplished as follows:

If the base URI has a defined authority component and an empty
path, then return a string consisting of "/" concatenated with the
reference's path; otherwise,

So according to this, my understanding is that R, Base and T should be as follows:

R = URIReference(scheme=None, authority=None, path='john.smith', query=None, fragment=None)
Base = URIReference(scheme='example', authority=None, path=None, query=None, fragment=None)
T = URIReference(scheme='example', authority=None, path='/john.smith', query=None, fragment=None)

And T, when unsplit, will be "example:/john.smith". I may of course be missing something though, and if I am please do help me out here in seeing what it is.
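
For reference, the merge routine of section 5.2.3 can be sketched as below, including the "otherwise" branch that the excerpt above cuts off. This is written from the RFC text, not from this library's code, and the function signature (`base_authority`, `base_path`, `ref_path`) is an assumption made for the example.

```python
def merge(base_authority, base_path, ref_path):
    """Merge a relative-path reference with a base path (RFC 3986, 5.2.3).

    ``base_authority`` is None when the base URI has no authority component;
    an empty base path is passed as "".
    """
    if base_authority is not None and base_path == "":
        return "/" + ref_path
    # Otherwise: keep the base path up to and including its last "/",
    # or drop it entirely when it contains no "/".
    return base_path[: base_path.rfind("/") + 1] + ref_path


print(merge("a", "/b/c/d;p", "g"))
print(merge("a", "", "g"))
```

Read this way, the first branch applies only when the base has an authority; a base such as `example:` would fall through to the second branch.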

When I run the actual test code it fails, so clearly the current implementation of RFC 3986 in this library does not share my interpretation:

$ pytest test_rfc3986.py -rA --log-level DEBUG
============================================================================ test session starts ============================================================================
platform linux -- Python 3.10.1, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
rootdir: /home/iwana/sw/d/gitlab.com/aucampia/scraps
plugins: asyncio-0.16.0
collected 1 item                                                                                                                                                            

test_rfc3986.py F                                                                                                                                                     [100%]

================================================================================= FAILURES ==================================================================================
___________________________________________________________________________ test_schema_only_base ___________________________________________________________________________

    def test_schema_only_base() -> None:
        relative_uri: URIReference = uri_reference("john.smith")
        base_uri: URIReference = uri_reference("example:")
>       resolved_uri: URIReference = relative_uri.resolve_with(base_uri, strict=True)

test_rfc3986.py:8: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = URIReference(scheme=None, authority=None, path='john.smith', query=None, fragment=None)
base_uri = URIReference(scheme='example', authority=None, path=None, query=None, fragment=None), strict = True

    def resolve_with(self, base_uri, strict=False):
        """Use an absolute URI Reference to resolve this relative reference.
    
        Assuming this is a relative reference that you would like to resolve,
        use the provided base URI to resolve it.
    
        See http://tools.ietf.org/html/rfc3986#section-5 for more information.
    
        :param base_uri: Either a string or URIReference. It must be an
            absolute URI or it will raise an exception.
        :returns: A new URIReference which is the result of resolving this
            reference using ``base_uri``.
        :rtype: :class:`URIReference`
        :raises rfc3986.exceptions.ResolutionError:
            If the ``base_uri`` is not an absolute URI.
        """
        if not isinstance(base_uri, URIMixin):
            base_uri = type(self).from_string(base_uri)
    
        if not base_uri.is_absolute():
>           raise exc.ResolutionError(base_uri)
E           rfc3986.exceptions.ResolutionError: example: is not an absolute URI.

/home/iwana/.local/lib/python3.10/site-packages/rfc3986/_mixin.py:266: ResolutionError
========================================================================== short test summary info ==========================================================================
FAILED test_rfc3986.py::test_schema_only_base - rfc3986.exceptions.ResolutionError: example: is not an absolute URI.
============================================================================= 1 failed in 0.10s =============================================================================

URLs ending in "\n" are stripped of the new-line before normalization

Caused by the URI_MATCHER and IRI_MATCHER not using re.DOTALL.
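
The effect can be reproduced with a much simpler pattern. The sketch below uses a hypothetical stand-in (`PATTERN`), not the library's real `URI_MATCHER`: without `re.DOTALL`, `.` stops before the newline, so the captured component silently loses it; with the flag, the newline is kept and can be rejected by later validation.

```python
import re

# Hypothetical, heavily simplified stand-in for a URI pattern:
# a scheme followed by "everything else".
PATTERN = r"(?P<scheme>[a-z]+):(?P<rest>.*)"

plain = re.match(PATTERN, "http://example.com/\n")
dotall = re.match(PATTERN, "http://example.com/\n", re.DOTALL)

rest_plain = plain.group("rest")    # "." does not match the newline
rest_dotall = dotall.group("rest")  # the newline is part of the capture

print(repr(rest_plain), repr(rest_dotall))
```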

The first path segment is unexpectedly interpreted as an authority after normalization

scheme:/..///bar has scheme="scheme", authority=None, path="/..///bar".
However, after normalization its string form is scheme://bar, which has scheme="scheme", authority="bar".
Similarly, let t1 be the IRI ..///bar resolved against scheme:.
t1 should have scheme="scheme" and authority=None (since ..///bar does not contain an authority).
However, the resulting string is scheme://bar, which has authority="bar".

And some more examples:

$ python
Python 3.9.9 (main, Jan 10 2022, 18:52:39)
[GCC 11.2.1 20211127] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from rfc3986 import uri_reference
>>> b = uri_reference('scheme:')
>>> r1 = uri_reference('..///bar')
>>> t1 = r1.resolve_with(b)
>>> t1
URIReference(scheme='scheme', authority=None, path='//bar', query=None, fragment=None)
>>> t1.unsplit()
'scheme://bar'
>>> r2 = uri_reference('/..///bar')
>>> r2.resolve_with(b)
URIReference(scheme='scheme', authority=None, path='//bar', query=None, fragment=None)
>>> uri_reference('scheme:/..///bar').normalize()
URIReference(scheme='scheme', authority=None, path='//bar', query=None, fragment=None)
>>> uri_reference('scheme:/..///bar').normalize().unsplit()
'scheme://bar'

I'm not sure how this should be handled.
Collapsing the // at the beginning is not explicitly allowed by RFC 3986, so I think the normalization and the resolution cannot produce valid output and should fail in this case.
(But RFC 3986 does not seem to state that they can fail!)

This can be caused by normalization during resolution, so #84 may also be affected by this issue.
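
The round-trip problem can be shown without the library. The sketch below uses the splitting regex from RFC 3986, Appendix B, and a minimal recomposition in the style of section 5.3 (query and fragment omitted); `split` and `recompose` are names chosen for this example. Section 3.3 of the RFC states that a URI without an authority cannot have a path beginning with two slashes, which is exactly the combination recomposed here.

```python
import re

# Splitting regex from RFC 3986, Appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split(uri):
    """Return (scheme, authority, path); missing components are None."""
    m = URI_RE.match(uri)
    return m.group(2), m.group(4), m.group(5)


def recompose(scheme, authority, path):
    """Recompose components as in RFC 3986, section 5.3 (no query/fragment)."""
    out = ""
    if scheme is not None:
        out += scheme + ":"
    if authority is not None:
        out += "//" + authority
    return out + path


text = recompose("scheme", None, "//bar")
reparsed = split(text)
print(text, reparsed)
```

The components that went in had no authority, but the string that comes out parses with one, so the two representations no longer agree.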

http://[ffff::] deemed invalid incorrectly

At abnf_regexp.py#110, the pattern should be changed from

IPv6_RE = '(({0})|({1})|({2})|({3})|({4})|({5})|({6})|({7}))'.format( *variations )

to

IPv6_RE = '(({0})|({1})|({2})|({3})|({4})|({5})|({6})|({7})|({8}))'.format( *variations )

so that the rule

[ *6( h16 ":" ) h16 ] "::"

can be applied.
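
A likely reason this went unnoticed is that `str.format` ignores unused positional arguments. The sketch below assumes, as the proposed change implies, that `variations` holds nine alternatives; the placeholder strings are stand-ins, not the library's real sub-patterns. It also checks the address from the title against the standard library's `ipaddress` module.

```python
import ipaddress

# Hypothetical stand-ins for the nine alternatives in ``variations``.
variations = ["v{}".format(i) for i in range(9)]

eight = "(({0})|({1})|({2})|({3})|({4})|({5})|({6})|({7}))".format(*variations)
nine = "(({0})|({1})|({2})|({3})|({4})|({5})|({6})|({7})|({8}))".format(*variations)

# str.format silently ignores unused positional arguments, so the
# eight-placeholder template drops the last alternative without any error.
last_is_dropped = "v8" not in eight and "v8" in nine

# The standard library accepts the address from the report.
addr = ipaddress.IPv6Address("ffff::")

print(last_is_dropped, addr.compressed)
```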

":" in relative path within relative reference

'.://' should not parse. It is not an absolute URI because '.' is not a valid scheme, and it is not a relative URI because a path-noscheme cannot begin with a ':'.

The relevant grammar rules from the RFC:

   scheme        = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
   relative-ref  = relative-part [ "?" query ] [ "#" fragment ]
   relative-part = "//" authority path-abempty
                 / path-absolute
                 / path-noscheme
                 / path-empty
   path-noscheme = segment-nz-nc *( "/" segment )
   segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
                 ; non-zero-length segment without any colon ":"
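
The two rules that reject '.://' can be checked directly. The sketch below transcribes `scheme` and `segment-nz-nc` from the grammar above into regular expressions; the helper names are chosen for this example, and the check covers only these two rules, not the full `relative-ref` grammar.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")
# segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" ), no ":"
SEGMENT_NZ_NC = re.compile(
    r"(?:[A-Za-z0-9._~-]|%[0-9A-Fa-f]{2}|[!$&'()*+,;=]|@)+"
)


def could_be_scheme(text):
    """True if ``text`` matches the scheme rule."""
    return SCHEME.fullmatch(text) is not None


def first_segment_ok(path):
    """True if the first segment of ``path`` is a valid segment-nz-nc."""
    first = path.split("/", 1)[0]
    return SEGMENT_NZ_NC.fullmatch(first) is not None


print(could_be_scheme("."), first_segment_ok(".://"))
```

Both checks fail for '.://': '.' does not start with a letter, and the first path segment '.:' contains a colon.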

Document how to create a pattern to match any IP address hostname

Put it within abnf_regexp; this is useful for libraries that need to determine whether a host is an IP address (for IDNA encoding, for example).
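
Until such a pattern is documented, the same question can be answered with the standard library. The sketch below is a hypothetical helper built on `ipaddress`, not on rfc3986's own patterns, and it deliberately ignores IPvFuture literals.

```python
import ipaddress


def is_ip_host(host):
    """Return True if ``host`` is an IPv4 address or a bracketed IPv6 literal.

    Hypothetical helper; IPvFuture literals are not handled.
    """
    if host.startswith("[") and host.endswith("]"):
        try:
            ipaddress.IPv6Address(host[1:-1])
        except ValueError:
            return False
        return True
    try:
        ipaddress.IPv4Address(host)
    except ValueError:
        return False
    return True


for candidate in ("127.0.0.1", "[ffff::]", "example.com"):
    print(candidate, is_ip_host(candidate))
```

A caller doing IDNA encoding could skip the encoding step whenever this returns True.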

HIER_PART_RE = "(//{}{}|{}|{}|{})".format(
    COMPONENT_PATTERN_DICT["authority"],
    PATH_ABEMPTY,
    PATH_ABSOLUTE,
    PATH_ROOTLESS,
    PATH_EMPTY,
)
