python-hyper / rfc3986
A Python Implementation of RFC3986 including validations
Home Page: https://rfc3986.readthedocs.io/en/latest/
License: Other
I'm trying to package your module as an RPM package, so I'm using the typical build, install, and test cycle used when building packages from a non-root account.
+ PYTHONPATH=/home/tkloczko/rpmbuild/BUILDROOT/python-rfc3986-1.5.0-2.fc35.x86_64/usr/lib64/python3.8/site-packages:/home/tkloczko/rpmbuild/BUILDROOT/python-rfc3986-1.5.0-2.fc35.x86_64/usr/lib/python3.8/site-packages
+ /usr/bin/pytest -ra
.................................................................................................................................................................... [ 5%]
.................................................................................................................................................................... [ 11%]
.................................................................................................................................................................... [ 17%]
.................................................................................................................................................................... [ 23%]
.................................................................................................................................................................... [ 29%]
.................................................................................................................................................................... [ 35%]
.................................................................................................................................................................... [ 40%]
.................................................................................................................................................................... [ 46%]
.................................................................................................................................................................... [ 52%]
.................................................................................................................................................................... [ 58%]
.................................................................................................................................................................... [ 64%]
.................................................................................................................................................................... [ 70%]
.................................................................................................................................................................... [ 75%]
.................................................................................................................................................................... [ 81%]
.................................................................................................................................................................... [ 87%]
.................................................................................................................................................................... [ 93%]
.................................................................................................................................................................... [ 99%]
....................... [100%]
============================================================================= warnings summary =============================================================================
tests/test_api.py: 1 warning
tests/test_uri.py: 254 warnings
tests/test_unicode_support.py: 3 warnings
/home/tkloczko/rpmbuild/BUILDROOT/python-rfc3986-1.5.0-2.fc35.x86_64/usr/lib/python3.8/site-packages/rfc3986/_mixin.py:116: DeprecationWarning: Please use rfc3986.validators.Validator instead. This method will be eventually removed.
warnings.warn(
tests/test_api.py: 1 warning
tests/test_uri.py: 254 warnings
tests/test_unicode_support.py: 3 warnings
/home/tkloczko/rpmbuild/BUILDROOT/python-rfc3986-1.5.0-2.fc35.x86_64/usr/lib/python3.8/site-packages/rfc3986/_mixin.py:172: DeprecationWarning: Please use rfc3986.validators.Validator instead. This method will be eventually removed.
warnings.warn(
tests/test_api.py: 1 warning
tests/test_uri.py: 253 warnings
tests/test_unicode_support.py: 3 warnings
/home/tkloczko/rpmbuild/BUILDROOT/python-rfc3986-1.5.0-2.fc35.x86_64/usr/lib/python3.8/site-packages/rfc3986/_mixin.py:144: DeprecationWarning: Please use rfc3986.validators.Validator instead. This method will be eventually removed.
warnings.warn(
tests/test_api.py: 1 warning
tests/test_uri.py: 246 warnings
tests/test_unicode_support.py: 2 warnings
/home/tkloczko/rpmbuild/BUILDROOT/python-rfc3986-1.5.0-2.fc35.x86_64/usr/lib/python3.8/site-packages/rfc3986/_mixin.py:191: DeprecationWarning: Please use rfc3986.validators.Validator instead. This method will be eventually removed.
warnings.warn(
tests/test_api.py: 1 warning
tests/test_uri.py: 245 warnings
tests/test_unicode_support.py: 2 warnings
/home/tkloczko/rpmbuild/BUILDROOT/python-rfc3986-1.5.0-2.fc35.x86_64/usr/lib/python3.8/site-packages/rfc3986/_mixin.py:210: DeprecationWarning: Please use rfc3986.validators.Validator instead. This method will be eventually removed.
warnings.warn(
tests/test_api.py: 1 warning
tests/test_uri.py: 244 warnings
tests/test_unicode_support.py: 2 warnings
/home/tkloczko/rpmbuild/BUILDROOT/python-rfc3986-1.5.0-2.fc35.x86_64/usr/lib/python3.8/site-packages/rfc3986/_mixin.py:229: DeprecationWarning: Please use rfc3986.validators.Validator instead. This method will be eventually removed.
warnings.warn(
-- Docs: https://docs.pytest.org/en/stable/warnings.html
2811 passed, 1517 warnings in 6.75s
pytest-xprocess reminder::Be sure to terminate the started process by running 'pytest --xkill' if you have not explicitly done so in your fixture with 'xprocess.getinfo(<process_name>).terminate()'.
Hello! I'm looking at improving h11's handling of URI-related stuff, and of course RFC 7230 delegates a bunch of the heavy lifting to RFC 3986. And I'd kinda rather not have to implement an RFC 3986 parser from scratch.
Annoyingly, though, RFC 7230 likes to refer directly to some of the intermediate productions inside RFC 3986. Specifically, I need to be able to:
- origin-form = absolute-path [ "?" query ] (where absolute-path and query are from RFC 3986)
- authority, with empty user-info (I guess the check for "no user-info" is easy if you can check for authority, since an authority has a "@" in it iff it has a non-empty user-info)
- host ":" port
- absolute-URI, and if so, split it into scheme, authority, and everything else (path + query).
AFAICT rfc3986 has all the stuff it needs for doing these things, but most of it is not exposed. (Except for the last one -- I think I could implement that using parsed = rfc3986.uri_reference(purported_url); assert parsed.fragment is None; everything_else = parsed.path + "?" + parsed.query.)
Ideally what I'd want is the regex text (as opposed to compiled regex objects) for each of those productions. Is that something rfc3986 could easily provide?
(P.S.: what's going on with unicode handling? It seems like on py3, if I pass a byte-string to uri_reference then I get regular strs back?)
Trying the following:
from rfc3986 import is_valid_uri
assert is_valid_uri("http://fr.dbpedia.org/resource/\201dimbourg")
This doesn't raise an assertion error. Instead, it produces a traceback:
Traceback (most recent call last):
File "./validate.py", line 2, in <module>
assert is_valid_uri("http://fr.dbpedia.org/resource/\201dimbourg")
File "/home/rob/tmp/rfc3986/rfc3986/api.py", line 62, in is_valid_uri
return URIReference.from_string(uri, encoding).is_valid(**kwargs)
File "/home/rob/tmp/rfc3986/rfc3986/uri.py", line 69, in from_string
uri_string = to_str(uri_string, encoding)
File "/home/rob/tmp/rfc3986/rfc3986/compat.py", line 24, in to_str
b = b.decode('utf-8')
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x81 in position 31: invalid start byte
There should be a way to work around this. We should strive to be better than the stdlib, not just marginally better.
>>> rfc3986.urlparse('172.25.241.180:8181/os-releases/11.1.0/')
ParseResult(scheme=u'172.25.241.180', userinfo=None, host=None, port=None, path=u'8181/os-releases/11.1.0/', query=None, fragment=None)
Dear all,
I am currently struggling with a whitespace problem which I guess should not be that complicated, so I am probably missing something here.
MWE:
import rfc3986.builder
rfc3986.builder.URIBuilder.from_uri("scheme:").extend_path("path 1").extend_path("path2").geturl()
# output: 'scheme:/path 1/path2'
rfc3986.builder.URIBuilder.from_uri("scheme:").extend_path("path 1/path2").geturl()
# output: 'scheme:/path 1/path2'
rfc3986.builder.URIBuilder.from_uri("scheme:path 1").extend_path("path2").geturl()
# output: 'scheme:/path%201/path2'
So: if there is whitespace in the from_uri part, it gets escaped as %20, whereas whitespace in the parameter to extend_path is used as is.
From the broader scope, I am storing URIs in a database which get constructed in one component "from scratch" (containing whitespace ...), whereas they are passed in a URL-encoded, conformant manner in another component.
I already figured out that there is an equivalence when passing maybe-URL-encoded strings to from_uri:
from_uri_a=rfc3986.builder.URIBuilder.from_uri("scheme:/path 1/path2").finalize()
from_uri_b=rfc3986.builder.URIBuilder.from_uri("scheme:/path%201/path2").finalize()
from_uri_a == from_uri_b
# is True
My main goal is to store the URIs in a future-proof way in my database and from the requirements I am having it does not really make a big difference whether or not I am storing the URLs encoded or not - but from the broader scope I am unsure whether the current implementation is desired or not (aka. a bug or a feature).
From the rfc, sec. 2.4, I guess that an encoding should take place in the extend_path method:
Under normal circumstances, the only time when octets within a URI
are percent-encoded is during the process of producing the URI from
its component parts. This is when an implementation determines which
of the reserved characters are to be used as subcomponent delimiters
and which can be safely used as data. Once produced, a URI is always
in its percent-encoded form.
Any thoughts on this?
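One way to get the same output regardless of how the path was supplied is to percent-encode each segment before handing it to the builder. The following is only a sketch using the standard library: encode_path is a hypothetical helper, not part of rfc3986, and it assumes its input has not been percent-encoded yet (an already encoded "%20" would be encoded a second time).

```python
from urllib.parse import quote


def encode_path(path):
    """Percent-encode a raw (not yet encoded) path, keeping "/" as the delimiter.

    Hypothetical helper for illustration; not part of rfc3986.
    """
    # Encode each segment on its own so that only "/" separates segments.
    return "/".join(quote(segment, safe="") for segment in path.split("/"))


print(encode_path("path 1/path2"))  # path%201/path2
```

Passing the result of such a helper to extend_path would make the stored form independent of whether the whitespace arrived through from_uri or through extend_path.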
For example,
print(URIBuilder.from_uri('https://github.com').add_path('/python-hyper').finalize().unsplit())
# => https://github.com/python-hyper
Initial parsing works:
>>> rfc3986.uri_reference('http://æåëý.com/path?query#fragment')
URIReference(scheme='http', authority='æåëý.com', path='/path', query='query', fragment='fragment')
Subsequent normalisation silently loses data:
>>> rfc3986.uri_reference('http://æåëý.com/path?query#fragment').normalize()
URIReference(scheme='http', authority=None, path='/path', query='query', fragment='fragment')
"6" omitted in important_characters['unreserved_chars']:
https://github.com/sigmavirus24/rfc3986/blob/master/rfc3986/misc.py#L35
Right now if you want to validate that a URI is valid and that it has one of a set of schemes, you have to parse the URI, check its validity, then manually check the scheme. It'd be nice to be able to just do it all in one go, something like:
# Require schemes, but doesn't matter which.
is_valid_uri("...", require_scheme=True)
# Require schemes, and must be either http or https.
is_valid_uri("...", require_scheme={"http", "https"})
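Until something like that exists, the scheme part of the check can be approximated outside the library. This is a rough sketch only: scheme_is_acceptable is a hypothetical name, it uses the generic splitting regex from RFC 3986 Appendix B plus the scheme rule, and it checks nothing but the scheme (it does not validate the rest of the URI).

```python
import re

# Generic component split from RFC 3986, Appendix B.
URI_SPLIT = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")
# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")


def scheme_is_acceptable(uri, require_scheme=True):
    """Return True if `uri` has a scheme and, optionally, one from a given set.

    Hypothetical helper for illustration; not part of rfc3986.
    """
    scheme = URI_SPLIT.match(uri).group(2)
    if scheme is None or SCHEME.fullmatch(scheme) is None:
        return False
    if require_scheme is True:
        return True
    # Schemes are case-insensitive, so compare in lowercase.
    return scheme.lower() in {s.lower() for s in require_scheme}


print(scheme_is_acceptable("http://example.org"))
print(scheme_is_acceptable("ftp://example.org", {"http", "https"}))
```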
Perhaps have normalize() use it.
This bug was already fixed 3 months ago by pull request #14 (commit cf15373). The problem is that no new release has been made since this commit was merged. Can you please release a new version?
By the way, since the project doesn't need 2to3 or other hacks to support Python 2 and Python 3, it looks like rfc3986 works well on Python 3 without any change and that you can release a "universal wheel": a single binary package for Python 2 and Python 3. Or at least, please also upload a wheel package for Python 3.
Apologies for asking this as an issue, but is a 2.0.1 patch release likely anytime soon? It would be very helpful to not have to deal with the constant warnings anymore (#95).
The library in its latest pip version (1.5.0; also observed in 1.4 as shipped in Debian) does not distinguish between present and absent authority components:
>>> rfc3986.urlparse('foo:///').unsplit()
'foo:/'
I don't see how RFC3986 would justify normalizing this away, especially given the comment on page 41 that in some cases a URI would be normalized to one that has an empty (not absent) authority.
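For comparison, the generic regex from RFC 3986 Appendix B does keep the two cases apart: the group holding "//authority" is unset when the "//" marker is missing, and set (with an empty authority) when it is present. A small sketch, where split_authority is a hypothetical helper and not something the library provides:

```python
import re

# Generic component split from RFC 3986, Appendix B.
URI_SPLIT = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_authority(uri):
    """Return (authority, path); authority is None when the "//" marker is absent."""
    match = URI_SPLIT.match(uri)
    # Group 3 is the "//authority" part; it is None only if "//" never appeared.
    authority = match.group(4) if match.group(3) is not None else None
    return authority, match.group(5)


print(split_authority("foo:///"))  # authority is "" (present but empty)
print(split_authority("foo:/"))    # authority is None (absent)
```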
According to RFC 3986, line termination characters like \n are not allowed in any part of a URI, but the percent-encoded versions such as %0A are allowed. For other sections of the URL, such as the query and the path, ParseResult.from_string will normalize \n characters into percent-encoded characters and accept them. This is not true for the fragment section of the URI.
Inserting any line termination character into the fragment section of the URL will result in the parsing of the fragment section being cut short.
import rfc3986
# Arguments are (URL, encoding, strict, lazy_normalize)
parsed_url = rfc3986.ParseResult.from_string('scheme://[email protected]:80/path?query#Fragment\nThatIsIllusive', 'utf-8', True, False)
print("Fragment: " + parsed_url.fragment)
This will print Fragment: Fragment
In contrast Furl, Hyperlink, Urllib, and Yarl all return Fragment: Fragment%0AThatIsIllusive
This is the regex used to parse different parts of a URI
SCHEME_RE = "[a-zA-Z][a-zA-Z0-9+.-]*"
_AUTHORITY_RE = "[^\\\\/?#]*"
_PATH_RE = "[^?#]*"
_QUERY_RE = "[^#]*"
_FRAGMENT_RE = ".*"
This bug is a result of the use of .* in the fragment regex. The . symbol in regex accepts every character except for line termination characters.
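The effect is easy to reproduce with the standard re module alone. The sketch below does not use the library's actual matchers; it only contrasts a fragment-style .* pattern compiled with and without re.DOTALL.

```python
import re

text = "Fragment\nThatIsIllusive"

# Same pattern, compiled without and with re.DOTALL.
default_dot = re.compile(r"^(?P<fragment>.*)")
dotall_dot = re.compile(r"^(?P<fragment>.*)", re.DOTALL)

# Without DOTALL the match stops at the newline.
print(repr(default_dot.match(text).group("fragment")))
# With DOTALL the newline is consumed as well.
print(repr(dotall_dot.match(text).group("fragment")))
```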
rfc3986/src/rfc3986/abnf_regexp.py
Line 153 in 1640734
The PATH_EMPTY regex won't ever match anything when embedded within another regex, as it is in several places like here:
rfc3986/src/rfc3986/abnf_regexp.py
Lines 184 to 190 in 1640734
This causes the URIReference.is_absolute() method to return an incorrect result for a URI with a scheme but no path. For example, http: (or even http:?q=v) is an absolute URI according to section 4.3, but URIReference.from_string('http:').is_absolute() returns False.
There are likely other places where this causes intended matches to fail, but I have not tried to investigate further. It's a little odd looking, but a fix that would probably work is just to change the value of PATH_EMPTY to an empty string.
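To illustrate why an anchored "empty" pattern cannot match once it is embedded, here is a sketch with stand-in patterns. It assumes PATH_EMPTY is anchored along the lines of ^$ (an assumption; the referenced source lines are not reproduced here), and the surrounding template is a simplified, hypothetical absolute-URI pattern rather than the library's own.

```python
import re

# Hypothetical stand-ins: an anchored "empty" pattern versus a truly empty one.
ANCHORED_EMPTY = "^$"
PLAIN_EMPTY = ""

# Simplified template: scheme ":" (path | empty) [ "?" query ]
template = r"[a-z]+:(?:/[a-z/]*|{})(?:\?[a-z=]*)?"
anchored = re.compile(template.format(ANCHORED_EMPTY))
plain = re.compile(template.format(PLAIN_EMPTY))

# "^" can only match at the start of the string, so after "http:" it never does.
print(anchored.fullmatch("http:"))
# An empty alternative matches at any position.
print(plain.fullmatch("http:"))
print(plain.fullmatch("http:?q=v"))
```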
Currently we erroneously accept \\ in the USERINFO_RE (and probably many more!) due to regex escaping acting differently when inside or outside of a character class.
Example: http://user\\@google.com shouldn't be a valid URL but is accepted by us.
Apologies for this probably not being the right place to ask about this, but is there a timeline for a 2.0.1 release? I am currently having to put a git commit in my pyproject.toml for a contracting project in order to keep it from emitting tons of warnings that were fixed in PR #95. I suspect my clients would be happier if rfc3986 were being installed from PyPI 🙂
Please let me know if there's anything I can do to help facilitate a patch release.
A ValueError is raised when trying to validate the host as an IPv4 address:
Traceback (most recent call last):
File "/lib/python3.7/site-packages/rfc3986/uri.py", line 354, in normalize
(self.userinfo, self.host, self.port)),
File "/lib/python3.7/site-packages/rfc3986/uri.py", line 198, in userinfo
authority = self.authority_info()
File "/lib/python3.7/site-packages/rfc3986/uri.py", line 169, in authority_info
validators.valid_ipv4_host_address(host)):
File "/lib/python3.7/site-packages/rfc3986/validators.py", line 376, in valid_ipv4_host_address
return all([0 <= int(byte, base=10) <= 255 for byte in host.split('.')])
File "/lib/python3.7/site-packages/rfc3986/validators.py", line 376, in <listcomp>
return all([0 <= int(byte, base=10) <= 255 for byte in host.split('.')])
ValueError: invalid literal for int() with base 10: '6g9m8V6'
Indeed, when the value 6g9m8V6 is tested against the current regex ([0-9]{1,3}.){3}[0-9]{1,3}, it matches. This is because the . symbol is not escaped, so it can match any character, which defeats the goal of matching only IPv4 addresses.
A corrected regex would be: ([0-9]{1,3}\.){3}[0-9]{1,3}.
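Both patterns can be compared directly with the standard re module. In this sketch, is_ipv4 is a hypothetical helper (not the library's valid_ipv4_host_address) that applies the escaped pattern before converting the octets, so int() is never called on non-numeric text.

```python
import re

UNESCAPED = re.compile(r"([0-9]{1,3}.){3}[0-9]{1,3}")
ESCAPED = re.compile(r"([0-9]{1,3}\.){3}[0-9]{1,3}")


def is_ipv4(host):
    """True only for dotted-decimal text whose four octets are all in 0..255."""
    if ESCAPED.fullmatch(host) is None:
        return False
    return all(0 <= int(octet, 10) <= 255 for octet in host.split("."))


# The unescaped "." lets letters stand in for the separators.
print(UNESCAPED.fullmatch("6g9m8V6") is not None)  # True
print(ESCAPED.fullmatch("6g9m8V6") is not None)    # False
print(is_ipv4("172.25.241.180"))                   # True
```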
When I run geturl() on an instance of ParseResultBytes, it just returns b'https:':
>>> print(parsed)
ParseResultBytes(scheme=b'https', userinfo=None, host=b'xn--i-7iq.ws', port=None, path=None, query=None, fragment=None)
>>> parsed.geturl()
b'https:'
Is the copy_with() method combined with empty strings (note that None doesn't work) the correct way to remove components like a fragment altogether?
>>> s = 'http://example.org/?this-is-the-query#this-is-the-fragment'
>>> rfc3986.uri_reference(s).copy_with(fragment='').unsplit()
'http://example.org/?this-is-the-query'
If so, this may be worth mentioning in the docs. If not, please point me to the proper approach. :)
remove_dot_segments, as defined in RFC 3986 (section 5.2.4), sometimes adds a leading slash when the base path is "rootless".
STEP   OUTPUT BUFFER   INPUT BUFFER
1 :                    foo/../baz
2E:    foo             /../baz
2C:                    /baz
2E:    /baz
So, resolving ../baz against scheme:foo/bar should result in scheme:/baz.
However, the output of this library differs from that.
$ python
Python 3.9.9 (main, Jan 10 2022, 18:52:39)
[GCC 11.2.1 20211127] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from rfc3986 import uri_reference
>>> b = uri_reference('scheme:foo/bar')
>>> r = uri_reference('../baz')
>>> t = r.resolve_with(b)
>>> t
URIReference(scheme='scheme', authority=None, path='baz', query=None, fragment=None)
>>> t.unsplit()
'scheme:baz'
>>>
I think there should be special handling such as "when a .. segment appears but the output stack is empty, set the prepend_slash flag" or something like that.
The remove_dot_segments() function returns an empty-string path for an absolute input path after removing dot-dot segments if the number of dot-dot segments is greater than the deepest path level. This is not compatible with the algorithm suggested by RFC3986 section-5.2.4.
INPUT: '/a/b/c/../../../../'
EXPECTED OUTPUT: '/'
Got: ''
Code snippet to reproduce the issue:
from rfc3986.normalizers import remove_dot_segments
assert remove_dot_segments('/a/b/c/../../../../') == '/'
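For reference, a literal transcription of the section 5.2.4 steps (2A to 2E) into Python gives '/' for this input, and '/baz' for the rootless foo/../baz case as well. This is only a sketch of the RFC's buffer-based description, not the library's implementation; the function name carries a suffix to make clear it is a stand-alone illustration.

```python
def remove_dot_segments_rfc(path):
    """Follow RFC 3986 section 5.2.4 step by step (illustrative sketch)."""
    remaining = path  # the "input buffer"
    output = []       # the "output buffer", one segment per entry
    while remaining:
        if remaining.startswith("../"):            # 2A
            remaining = remaining[3:]
        elif remaining.startswith("./"):           # 2A
            remaining = remaining[2:]
        elif remaining.startswith("/./"):          # 2B
            remaining = remaining[2:]
        elif remaining == "/.":                    # 2B
            remaining = "/"
        elif remaining.startswith("/../"):         # 2C
            remaining = remaining[3:]
            if output:
                output.pop()
        elif remaining == "/..":                   # 2C
            remaining = "/"
            if output:
                output.pop()
        elif remaining in (".", ".."):             # 2D
            remaining = ""
        else:                                      # 2E
            # Move the first segment, including its leading "/" if any.
            cut = remaining.find("/", 1)
            if cut == -1:
                output.append(remaining)
                remaining = ""
            else:
                output.append(remaining[:cut])
                remaining = remaining[cut:]
    return "".join(output)


print(remove_dot_segments_rfc("/a/b/c/../../../../"))  # /
print(remove_dot_segments_rfc("foo/../baz"))           # /baz
```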
RFC 3986 defines an authority as such
authority = [ userinfo "@" ] host [ ":" port ]
WHATWG says that
An opaque-host-and-port string must be either the empty string or: a valid opaque-host string, optionally followed by U+003A (:) and a URL-port string.
A scheme-relative-special-URL string must be "//", followed by a valid host string, optionally followed by U+003A (:) and a URL-port string, optionally followed by a path-absolute-URL string.
ParseResult.from_string will parse a port number which is not preceded by a colon. This conflicts with both specifications.
Minimal Reproducible Example
import rfc3986
# Arguments are (URL, encoding, strict, lazy_normalize)
parsed_url = rfc3986.ParseResult.from_string('scheme://[v1.ip]8000/path', 'utf-8', True, False)
print("Host: " + str(parsed_url.host)) # prints 'Host: [v1.ip]'
print("Port: " + str(parsed_url.port)) # prints 'Port: 8000'
This is the regex used to parse the authority component in misc.py
SUBAUTHORITY_MATCHER = re.compile(
(
"^(?:(?P<userinfo>{})@)?" # userinfo
"(?P<host>{})" # host
":?(?P<port>{})?$" # port
).format(
abnf_regexp.USERINFO_RE, abnf_regexp.HOST_PATTERN, abnf_regexp.PORT_RE
)
)
This bug is a result of the first '?' character in the regex used for the port, :?(?P<port>{})?$. This regex allows the colon to be optional independently of an optional port number. However, according to the specs a port number and colon should always be paired.
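One possible shape for a fix is to put the colon and the port inside the same optional group. The sketch below uses simplified stand-ins for the host and port patterns (HOST and PORT here are assumptions for illustration, not abnf_regexp.HOST_PATTERN or abnf_regexp.PORT_RE) to show the difference in behaviour.

```python
import re

# Simplified stand-ins, for illustration only.
HOST = r"\[[^\]]+\]|[^:@\[\]]*"
PORT = r"[0-9]*"

# Colon and port are independently optional (the reported behaviour).
loose = re.compile(r"^(?P<host>%s):?(?P<port>%s)?$" % (HOST, PORT))
# Colon and port are optional only as a pair.
paired = re.compile(r"^(?P<host>%s)(?::(?P<port>%s))?$" % (HOST, PORT))

print(loose.match("[v1.ip]8000").group("host", "port"))
print(paired.match("[v1.ip]8000"))
print(paired.match("[v1.ip]:8000").group("host", "port"))
```

With the paired form, an authority such as [v1.ip]8000 no longer matches at all, while [v1.ip]:8000 and a host without a port still do.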
Sorry if this is a stupid question or was addressed elsewhere.
Any way to validate only relative path? For example, just check characters.
Your pattern for the reg-name component is::
reg_name = '[\w\d.]+'
However, this is not sufficient, I believe -- the Python regex tokens \w match any "word" characters which notably include "alpha" characters, plus numbers, plus the underscore, but not the full range of characters permitted in reg-names from RFC 3986 section 3.2.2 (despite the comment in your source). The spec specifies this formation for reg-name::
reg-name = *( unreserved / pct-encoded / sub-delims )
Your reg_name particle should get redefined to support the fuller character set as per the RFC.
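A pattern built directly from the three alternatives in that rule might look like the following. This is a sketch written from the ABNF in the RFC, not a patch against the library's source; the constant names are chosen here for illustration.

```python
import re

# unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = r"A-Za-z0-9._~-"
# sub-delims  = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
SUB_DELIMS = r"!$&'()*+,;="
# pct-encoded = "%" HEXDIG HEXDIG
PCT_ENCODED = r"%[0-9A-Fa-f]{2}"

# reg-name = *( unreserved / pct-encoded / sub-delims )
# The sub-delims go first so the "-" of unreserved ends the character class.
REG_NAME = re.compile(r"(?:[%s%s]|%s)*" % (SUB_DELIMS, UNRESERVED, PCT_ENCODED))

print(REG_NAME.fullmatch("my-host.example") is not None)          # True
print(re.fullmatch(r"[\w\d.]+", "my-host.example") is not None)  # False
```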
Copied from openstack nova launchpad bug:
https://bugs.launchpad.net/nova/+bug/1460206
2015-05-29 15:57:16.241 | Collecting rfc3986>=0.2.0 (from -r /home/jenkins/workspace/gate-nova-python34/requirements.txt (line 46))
2015-05-29 15:57:16.241 | Downloading http://pypi.region-b.geo-1.openstack.org/packages/source/r/rfc3986/rfc3986-0.2.2.tar.gz
2015-05-29 15:57:16.241 | Complete output from command python setup.py egg_info:
2015-05-29 15:57:16.241 | Traceback (most recent call last):
2015-05-29 15:57:16.242 | File "<string>", line 20, in <module>
2015-05-29 15:57:16.242 | File "/tmp/tmp.ALfXvU5OeC/pip-build-ottfpn10/rfc3986/setup.py", line 22, in <module>
2015-05-29 15:57:16.242 | readme = f.read()
2015-05-29 15:57:16.242 | File "/home/jenkins/workspace/gate-nova-python34/.tox/py34/lib/python3.4/encodings/ascii.py", line 26, in decode
2015-05-29 15:57:16.242 | return codecs.ascii_decode(input, self.errors)[0]
2015-05-29 15:57:16.242 | UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1030: ordinal not in range(128)
2015-05-29 15:57:16.242 |
2015-05-29 15:57:16.242 | ----------------------------------------
2015-05-29 15:57:16.242 | Command "python setup.py egg_info" failed with error code 1 in /tmp/tmp.ALfXvU5OeC/pip-build-ottfpn10/rfc3986
Looks like it might be an encoding problem with the README.rst?
According to RFC 3986, there are some ABNF rules:
URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / ".")
And I tried:
>>> from rfc3986 import is_valid_uri
>>> is_valid_uri("1")
True
>>> is_valid_uri("//")
True
>>> is_valid_uri("http#:/")
True
I think these results are not correct.
If we have a ParseResult with a userinfo section like foo:b@r#/z we should properly encode that so @, #, and / are percent-encoded.
We might also consider how this will affect the underlying URIReference object.
Should be address rather than adddres - originally reported in https://github.com/urllib3/urllib3/pull/1642/files who suggested fixing here.
The following malformed URL is accepted by rfc3986:
B://]
Although the character ']' is allowed in a host, it must be in the context of an IPv6 or an IPvFuture, which this is not.
This malformed URL is rejected by urllib, urllib3, hyperlink, yarl, furl, and Boost.URL.
URIMixin.resolve_with (which is not itself deprecated) emits the warning "Please use rfc3986.validators.Validator instead".
RFC 4007 allows % as the delimiter for IPv6 zone IDs whereas RFC 6874 specifically uses %25 to comply with the host component rules in RFC 3986. We need to assume 6874 and normalize 4007 -> 6874. See discussion in urllib3/urllib3#1531.
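The normalization itself could be as small as the sketch below. normalize_zone_id is a hypothetical name, and the function makes one loud assumption: a zone that starts with "25" is taken to be the RFC 6874 form already, so a bare RFC 4007 zone ID that genuinely begins with "25" is not handled.

```python
def normalize_zone_id(host):
    """Rewrite an RFC 4007 style zone delimiter ("%") as RFC 6874 "%25".

    Hypothetical helper for illustration; not part of rfc3986.
    """
    if not (host.startswith("[") and host.endswith("]")):
        return host  # not an IP literal, nothing to do
    address, delimiter, zone = host[1:-1].partition("%")
    if not delimiter:
        return host  # no zone ID present
    if zone.startswith("25"):
        # Assumption: "%25" already is the RFC 6874 delimiter, so a bare
        # zone ID that itself begins with "25" is not supported here.
        zone = zone[2:]
    return "[{}%25{}]".format(address, zone)


print(normalize_zone_id("[fe80::1%eth0]"))    # [fe80::1%25eth0]
print(normalize_zone_id("[fe80::1%25eth0]"))  # [fe80::1%25eth0]
```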
Right now the add_<component> methods will replace the existing value. We should probably allow for someone to extend these in an intelligent way.
For example: It might be nice to have an existing path, e.g., /python-hyper and allow them to add something "underneath" that, e.g., /rfc3986 such that the final path is /python-hyper/rfc3986.
Also it may be nice to allow someone to add to the query argument list instead of replacing it wholesale and interact with it as a list of tuples.
I tried to check the release notes for the newest 1.4.0 release, but the latest release notes shown on https://rfc3986.readthedocs.io/en/latest/release-notes/index.html# are for version 1.1.0
Then I checked the commits on github and found 25dffd6 with the newest release notes.
By the way, these release notes have a bug, because the heading shows the wrong version number (1.3.0 instead of 1.4.0).
https://travis-ci.org/sigmavirus24/rfc3986/jobs/86443853
May as well stop (explicitly) supporting 3.2 at this point.
Running tox -e py27 -v -- tests/test_iri.py::test_encode_invalid_iri
I got the error:
F.. [100%]
========================================== FAILURES ===========================================
_______________________ test_encode_invalid_iri[http://\U0002f868.com] ________________________
iri = 'http://㛼.com'
@requires_idna
@pytest.mark.parametrize("iri", [
u'http://㛼.com',
u'http://♥.net',
u'http://\u0378.net',
])
def test_encode_invalid_iri(iri):
# import pdb;pdb.set_trace()
iri_ref = rfc3986.iri_reference(iri)
with pytest.raises(InvalidAuthority):
> iri_ref.encode()
E Failed: DID NOT RAISE <class 'rfc3986.exceptions.InvalidAuthority'>
tests/test_iri.py:60: Failed
The backtrace:
-> testfunction(**testargs)
/usr/src/RPM/BUILD/python-module-rfc3986-1.3.1/tests/test_iri.py(60)test_encode_invalid_iri()
-> iri_ref.encode()
/usr/src/RPM/BUILD/python-module-rfc3986-1.3.1/.tox/py27/lib/python2.7/site-packages/rfc3986/iri.py(132)encode()
-> if self.host:
/usr/src/RPM/BUILD/python-module-rfc3986-1.3.1/.tox/py27/lib/python2.7/site-packages/rfc3986/_mixin.py(61)host()
-> authority = self.authority_info()
/usr/src/RPM/BUILD/python-module-rfc3986-1.3.1/.tox/py27/lib/python2.7/site-packages/rfc3986/_mixin.py(31)authority_info()
-> match = self._match_subauthority()
> /usr/src/RPM/BUILD/python-module-rfc3986-1.3.1/.tox/py27/lib/python2.7/site-packages/rfc3986/iri.py(76)_match_subauthority()
-> return misc.ISUBAUTHORITY_MATCHER.match(self.authority)
(Pdb) self.authority
u'\U0002f868.com'
(Pdb) misc.ISUBAUTHORITY_MATCHER.match(self.authority)
(Pdb)
Important note (the system Python 2 is configured to use UCS-2):
(Pdb) import sys
(Pdb) sys.maxunicode > 0xFFFF
False
From @shazow:
17:18.45 shazow it's the on-demand parsing that is the problem
17:18.49 shazow urlparse does the same thing
17:19.15 shazow by the time i access url.port or whatever, i should be able to assume that url is safely parsed
17:19.34 shazow otherwise i gotta shove try/except around every single time i access a property, rather than in one sensible place
17:19.48 shazow or i gotta maintain my own parsed state outside of the one you give me
'http+unix://%2Fvar%2Frun%2Fsocket/path?key=value'
rfc3986/src/rfc3986/abnf_regexp.py
Line 74 in 4102c50
From IETF RFC 3986 Uniform Resource Identifier (URI): Generic Syntax I would expect the following test to pass:
from rfc3986 import uri_reference # type: ignore[import]
from rfc3986.uri import URIReference # type: ignore[import]
def test_schema_only_base() -> None:
relative_uri: URIReference = uri_reference("john.smith")
base_uri: URIReference = uri_reference("example:")
resolved_uri: URIReference = relative_uri.resolve_with(base_uri, strict=True)
assert resolved_uri.unsplit() == "example:/john.smith"
This is because of:
Note that only the scheme component is required to be
present in a base URI; the other components may be empty or
undefined.
# Branches that won't execute elided with [...]
if defined(R.scheme) then
   [...]
else
   if defined(R.authority) then
      [...]
   else
      if (R.path == "") then
         [...]
      else
         if (R.path starts-with "/") then
            [...]
         else
            T.path = merge(Base.path, R.path);
            T.path = remove_dot_segments(T.path);
         endif;
         T.query = R.query;
      endif;
      T.authority = Base.authority;
   endif;
   T.scheme = Base.scheme;
endif;
T.fragment = R.fragment;
The pseudocode above refers to a "merge" routine for merging a
relative-path reference with the path of the base URI. This is
accomplished as follows:
- If the base URI has a defined authority component and an empty
path, then return a string consisting of "/" concatenated with the
reference's path; otherwise,
So according to this, my understanding is that R, Base and T should be as follows:
R = URIReference(scheme=None, authority=None, path='john.smith', query=None, fragment=None)
Base = URIReference(scheme='example', authority=None, path=None, query=None, fragment=None)
T = URIReference(scheme='example', authority=None, path='/john.smith', query=None, fragment=None)
And T, when unsplit, will be "example:/john.smith". I may of course be missing something though, and if I am please do help me out here in seeing what it is.
When I run the actual test code it fails, so clearly the current implementation of RFC3986 in this library does not share my interpretation, it may be
$ pytest test_rfc3986.py -rA --log-level DEBUG
============================================================================ test session starts ============================================================================
platform linux -- Python 3.10.1, pytest-6.2.5, py-1.11.0, pluggy-1.0.0
rootdir: /home/iwana/sw/d/gitlab.com/aucampia/scraps
plugins: asyncio-0.16.0
collected 1 item
test_rfc3986.py F [100%]
================================================================================= FAILURES ==================================================================================
___________________________________________________________________________ test_schema_only_base ___________________________________________________________________________
def test_schema_only_base() -> None:
relative_uri: URIReference = uri_reference("john.smith")
base_uri: URIReference = uri_reference("example:")
> resolved_uri: URIReference = relative_uri.resolve_with(base_uri, strict=True)
test_rfc3986.py:8:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = URIReference(scheme=None, authority=None, path='john.smith', query=None, fragment=None)
base_uri = URIReference(scheme='example', authority=None, path=None, query=None, fragment=None), strict = True
def resolve_with(self, base_uri, strict=False):
"""Use an absolute URI Reference to resolve this relative reference.
Assuming this is a relative reference that you would like to resolve,
use the provided base URI to resolve it.
See http://tools.ietf.org/html/rfc3986#section-5 for more information.
:param base_uri: Either a string or URIReference. It must be an
absolute URI or it will raise an exception.
:returns: A new URIReference which is the result of resolving this
reference using ``base_uri``.
:rtype: :class:`URIReference`
:raises rfc3986.exceptions.ResolutionError:
If the ``base_uri`` is not an absolute URI.
"""
if not isinstance(base_uri, URIMixin):
base_uri = type(self).from_string(base_uri)
if not base_uri.is_absolute():
> raise exc.ResolutionError(base_uri)
E rfc3986.exceptions.ResolutionError: example: is not an absolute URI.
/home/iwana/.local/lib/python3.10/site-packages/rfc3986/_mixin.py:266: ResolutionError
========================================================================== short test summary info ==========================================================================
FAILED test_rfc3986.py::test_schema_only_base - rfc3986.exceptions.ResolutionError: example: is not an absolute URI.
============================================================================= 1 failed in 0.10s =============================================================================
Caused by the URI_MATCHER and IRI_MATCHER not using re.DOTALL.
scheme:/..///bar has scheme="scheme", authority=None, path=/..///bar.
[...] "scheme", authority="bar"
..///bar resolved against scheme:.
[...] "scheme" and authority=None (since ..///bar does not contain authority).
[...] scheme://bar, it has authority=bar.
And some more examples:
$ python
Python 3.9.9 (main, Jan 10 2022, 18:52:39)
[GCC 11.2.1 20211127] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from rfc3986 import uri_reference
>>> b = uri_reference('scheme:')
>>> r1 = uri_reference('..///bar')
>>> t1 = r1.resolve_with(b)
>>> t1
URIReference(scheme='scheme', authority=None, path='//bar', query=None, fragment=None)
>>> t1.unsplit()
'scheme://bar'
>>> r2 = uri_reference('/..///bar')
>>> r2.resolve_with(b)
URIReference(scheme='scheme', authority=None, path='//bar', query=None, fragment=None)
>>> uri_reference('scheme:/..///bar').normalize()
URIReference(scheme='scheme', authority=None, path='//bar', query=None, fragment=None)
>>> uri_reference('scheme:/..///bar').normalize().unsplit()
'scheme://bar'
I'm not sure how this should be handled.
Collapsing the // at the beginning is not explicitly allowed by RFC 3986, so I think the normalization and the resolution cannot produce valid output and should fail in this case.
(But RFC 3986 does not seem to state that they can fail!)
This can be caused by normalization during resolution, so #84 may also be affected by this issue.
From abnf_regexp.py#110
should be changed from
IPv6_RE = '(({0})|({1})|({2})|({3})|({4})|({5})|({6})|({7}))'.format( *variations )
to
IPv6_RE = '(({0})|({1})|({2})|({3})|({4})|({5})|({6})|({7})|({8}))'.format( *variations )
So, the rule
[ *6( h16 ":" ) h16 ] "::"
could be applied
'.://' should not parse. It is not an absolute URI because '.' is not a valid scheme, and it is not a relative URI because a path-noscheme cannot begin with a ':'.
The relevant grammar rules from the RFC:
scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
relative-ref = relative-part [ "?" query ] [ "#" fragment ]
relative-part = "//" authority path-abempty
/ path-absolute
/ path-noscheme
/ path-empty
path-noscheme = segment-nz-nc *( "/" segment )
segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
; non-zero-length segment without any colon ":"
Put it within abnf_regex, useful for libraries that need to determine if a host is an IP address or not (For IDNA encoding, for example).