
Comments (14)

vinniefalco commented on July 17, 2024

related:
https://stackoverflow.com/questions/11490326/is-array-syntax-using-square-brackets-in-url-query-strings-valid

from url.

vinniefalco commented on July 17, 2024

This additional signature could work:

bool parse_uri( url& dest, string_view s, error_code& ec, parse_options const& opt = {} );

It would require a separate bnf to handle non-compliant queries.


alandefreitas commented on July 17, 2024

We need to reach an agreement before fixing that.

My proposed solution is that we differentiate between producers (url) and consumers (url_view), as defined by the RFC. url would always encode most gen-delims, but url_view would accept unencoded gen-delims that are not ambiguous, without any "loose" parsing mode.

I have two reasons and some evidence for each:

  1. The reserved characters change depending on the URL component. Even for producers (url), the RFC allows more than the reserved characters in some subcomponents.
    1. The general case forbids gen-delims: Of the ASCII character set, the characters : / ? # [ ] @ (gen-delims) are reserved for use as delimiters of the generic URI components and must be percent-encoded – for example, %3F for a question mark. RFC3986 2.2
    2. The general case allows sub-delims that are not ambiguous:
      1. The characters ! $ & ' ( ) * + , ; = are permitted by generic URI syntax to be used unencoded in the user information, host, and path as delimiters. RFC3986 3.2.2 and RFC3986 3.3
      2. Additionally, : and @ may appear unencoded within the path, query, and fragment; and ? and / may appear unencoded as data within the query or fragment. RFC3986 3.3, RFC3986 3.4, and RFC3986 3.5
  2. Consumers should accept reserved characters that are not ambiguous. For producers (url), the RFC tells us to usually encode the reserved gen-delims characters, but it also says that consumers (url_view) should very often accept reserved characters that are not ambiguous in that component.
    1. RFC3986 and RFC2396 define a difference between producers and consumers, even though they talk much more about producers and these references are sparse.
    2. Producers should use unencoded chars sometimes
      1. Even for producers, it is sometimes recommended for usability to avoid percent-encoding some reserved characters in sub-delims. RFC3986 3.4
    3. Consumers should accept unencoded chars that are not ambiguous. There's no need for a "loose" parsing mode.
      1. The regular expression ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))? is also considered a valid URL for consumers. This includes reserved delimiters that are not ambiguous. RFC3986 B
      2. Everything between the first ? and the first # fits the spec's definition of a query. It can include any characters such as : / . ?. RFC3986 3.4 and SO question
      3. HTML establishes that a form submitted via HTTP GET should encode the form values as name-value pairs in the form "?key1=value1&key2=value2..." (properly encoded). Parsing of the query string is up to the server-side code (e.g. Java servlet engine). URIs should support that.
      4. Accepting reserved chars that are not ambiguous is common practice for consumers: I replicated the consumer algorithm used by Apache here. It basically accepts anything that is not ambiguous. For instance, anything after # is a valid fragment. For instance, anything but # after ? is a valid query, and so on. All other libraries I checked, including Apache, Javascript URL and folly, present the same behaviour.
      5. I don't know what was on their mind when allowing consumers to accept non-ambiguous delimiters, but this relaxation allows parsers to be faster. For instance, the Apache algorithm just looks for the delimiters ? and then #, and something similar happens for other components.
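
The delimiter-driven consumer described in the last two points can be sketched as follows. This is a minimal illustration, not the Apache algorithm or anything in Boost.URL: `loose_parts` and `split_loose` are hypothetical names, and the scheme, authority, and path are deliberately left unsplit. It looks for `#` first so that a `?` inside the fragment is not mistaken for the query delimiter.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical result type for illustration; not part of Boost.URL.
struct loose_parts
{
    std::string_view before_query; // scheme, authority and path, left unsplit
    std::string_view query;        // text between the first '?' and the first '#'
    std::string_view fragment;     // text after the first '#'
    bool has_query = false;
    bool has_fragment = false;
};

// Split a URL reference using only the unambiguous delimiters.
// Nothing inside the query or fragment is validated here.
inline loose_parts split_loose(std::string_view s)
{
    loose_parts r;

    // Everything after the first '#' is the fragment.
    auto const hash = s.find('#');
    if (hash != std::string_view::npos)
    {
        r.has_fragment = true;
        r.fragment = s.substr(hash + 1);
        s = s.substr(0, hash);
    }

    // In what remains, everything after the first '?' is the query.
    auto const qmark = s.find('?');
    if (qmark != std::string_view::npos)
    {
        r.has_query = true;
        r.query = s.substr(qmark + 1);
        s = s.substr(0, qmark);
    }

    r.before_query = s;
    return r;
}
```

With this sketch, `split_loose("/path/path?key[]=value#frag?x")` yields the query `key[]=value` and the fragment `frag?x`; the square brackets need no special handling because they cannot be confused with either delimiter.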


vinniefalco commented on July 17, 2024

I think we should focus on the query instead of broadening the question to the entire URL


alandefreitas commented on July 17, 2024

> I think we should focus on the query instead of broadening the question to the entire URL

We could just change the query BNF to accept [ and ] and fix this. It's just that related issues keep coming up all the time and they're probably not going to stop.


vinniefalco commented on July 17, 2024

I think that the grammar for query should accept any unescaped character except the pound sign ( # ), and that any percent sign ( % ) must be followed by two valid hex digits. When converting an unencoded string into a percent-encoded query string, it should use the general character set specified in the RFC:

    query           = *( pchar / "/" / "?" )
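
A minimal sketch of the two rules proposed in this comment, assuming they are applied to the query as a whole rather than to individual keys and values. All names below (`is_loose_query`, `is_rfc_query_char`, `encode_query`) are hypothetical and are not Boost.URL functions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helpers for illustration; none of these names exist in Boost.URL.

inline bool is_hex(char c)
{
    return (c >= '0' && c <= '9') ||
           (c >= 'a' && c <= 'f') ||
           (c >= 'A' && c <= 'F');
}

// Parsing side: any character except '#' is accepted, and every '%'
// must be followed by two hex digits.
inline bool is_loose_query(std::string_view s)
{
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        if (s[i] == '#')
            return false;
        if (s[i] == '%')
        {
            if (s.size() - i < 3 || !is_hex(s[i + 1]) || !is_hex(s[i + 2]))
                return false;
            i += 2; // skip the two hex digits
        }
    }
    return true;
}

// Producing side: the RFC 3986 set, query = *( pchar / "/" / "?" ),
// without the pct-encoded alternative because the input is raw text.
inline bool is_rfc_query_char(char c)
{
    constexpr std::string_view others = "-._~!$&'()*+,;=:@/?";
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') ||
           others.find(c) != std::string_view::npos;
}

// Percent-encode every character outside the RFC query set.
inline std::string encode_query(std::string_view raw)
{
    static constexpr char hex[] = "0123456789ABCDEF";
    std::string out;
    for (char ch : raw)
    {
        auto const c = static_cast<unsigned char>(ch);
        if (is_rfc_query_char(ch))
        {
            out += ch;
        }
        else
        {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        }
    }
    return out;
}
```

The asymmetry is the point of the proposal: `is_loose_query("key[]=value")` holds, while `encode_query` never emits a raw bracket. Because this sketch treats the query as one string, it leaves `&` and `=` untouched; an encoder working on individual keys and values would have to escape them as well.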


alandefreitas commented on July 17, 2024

> I think that the grammar for query should accept any unescaped character except the pound sign ( # )

This is what #124 ended up doing. The only reserved chars it didn't accept before were ['#', '[', ']'].

> any percent sign ( % ) must be followed by two valid hex digits

We should probably expand that to other components


vinniefalco commented on July 17, 2024

> We should probably expand that to other components

I think it already works that way, right?


alandefreitas commented on July 17, 2024

> I think it already works that way, right?

key_chars accepts whatever is in unreserved_chars + subdelim_chars + ':' + '@' + '/' + '?' - '&' - '='. You mean we should wrap it in pct_encoded_rule, like pct_encoded_rule<fragment_chars_t>, right?
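
The set arithmetic in that expression can be illustrated with a small stand-in type. This is a sketch only: `char_set` and `make_key_chars` are hypothetical names built on `std::bitset`, not the library's lookup-table class, and the member lists are written out from the RFC 3986 definitions of unreserved and sub-delims.

```cpp
#include <bitset>
#include <cassert>
#include <string_view>

// Hypothetical stand-in for a lookup-table character set.
class char_set
{
    std::bitset<256> bits_;

public:
    char_set() = default;

    // A set holding a single character.
    char_set(char c)
    {
        bits_.set(static_cast<unsigned char>(c));
    }

    // A set holding every character of s.
    char_set(std::string_view s)
    {
        for (unsigned char c : s)
            bits_.set(c);
    }

    bool contains(char c) const
    {
        return bits_.test(static_cast<unsigned char>(c));
    }

    // Union of two sets.
    friend char_set operator+(char_set a, char_set const& b)
    {
        a.bits_ |= b.bits_;
        return a;
    }

    // Members of a that are not in b.
    friend char_set operator-(char_set a, char_set const& b)
    {
        a.bits_ &= ~b.bits_;
        return a;
    }
};

// The key character set as written in the comment above.
inline char_set make_key_chars()
{
    char_set const unreserved(
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "abcdefghijklmnopqrstuvwxyz"
        "0123456789-._~");
    char_set const sub_delims("!$&'()*+,;=");
    return unreserved + sub_delims + ':' + '@' + '/' + '?' - '&' - '=';
}
```

Assuming the expression is quoted accurately, `make_key_chars().contains('[')` is false: the brackets are gen-delims, so they enter the set only if they are added explicitly.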


alandefreitas commented on July 17, 2024

If a query doesn't need to be interpreted as params, why do we parse a range of key/value pairs in query_rule? If we remove this constraint, we could parse a query_rule as pct_encoded_rule<query_chars_t> and just make it:

    constexpr
    query_chars_t() noexcept
        : grammar::lut_chars(
            pchars + '/' + '?' + '[' + ']')
    {
    }

or something directly more permissive like

    constexpr
    query_chars_t() noexcept
        : grammar::lut_chars(
            pchars + gen_delim_chars - '#')
    {
    }


vinniefalco commented on July 17, 2024

Because we need to know how many key/value pairs there are
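
As a rough illustration of why the pair structure matters to the parser, the count can be derived from the `&` separators alone. This sketch assumes that a query which is present always holds at least one pair and that pairs are separated only by `&`; `count_params` is a hypothetical name, and the rule is not claimed to match what query_rule does for edge cases such as an empty query.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper: number of key/value pairs in a query that is present,
// assuming one pair plus one more for every '&' separator.
inline std::size_t count_params(std::string_view query)
{
    return 1 + static_cast<std::size_t>(
        std::count(query.begin(), query.end(), '&'));
}
```

Under these assumptions `count_params("key[]=value")` is 1 and `count_params("a=1&b=2&c")` is 3, regardless of which other characters the keys and values contain.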


alandefreitas commented on July 17, 2024

> Because we need to know how many key/value pairs there are

OK. As key_chars and value_chars are already wrapped in pct_encoded_rule, I believe PR #124 is ready then.

    pct_encoded_rule<
        query_rule::key_chars> t0;
    pct_encoded_rule<
        query_rule::value_chars> t1;


mkarasevych commented on July 17, 2024

Hi! Although this issue is closed, I'm encountering a similar problem. Could you clarify whether square brackets are accepted as part of a query? My test fails in Boost 1.84:

    auto origin1 = "/path/path?key=value";
    auto rv1 = boost::urls::parse_origin_form( origin1 );
    BOOST_CHECK( rv1.has_value() );

    auto origin2 = "/path/path?key[]=value";
    auto rv2 = boost::urls::parse_origin_form( origin2 );
    BOOST_CHECK( rv2.has_value() ); // fails
    BOOST_CHECK( rv2.error().message() == "leftover" ); // passes


alandefreitas commented on July 17, 2024

@mkarasevych Thanks for reporting that.

Yes. It seems like #124 wasn't enough. I'll have a look.

