Comments (3)
First we need to research the subject and figure out what a conforming implementation looks like (see RFC 3986). We should also check what other URL libraries are doing in terms of comparison. The notes can be collected here in this issue.
So it seems like this issue and #8 are almost the same issue. Each normalization strategy represents a comparison strategy. The main difference for us, because we care about memory allocation, is that we probably want the comparison algorithms to work as if the underlying strings were normalized, instead of reusing the normalization algorithms.
Boost.URL design (TL;DR)
The final choice here is between String comparison and Syntax-Based normalization. For url::normalize, Syntax-Based normalization is the only possibility that makes sense.
For url_view::operator==, Syntax-Based normalization also seems like the best alternative to enable containers of URLs and lead to a syntax that's self-explanatory, because we can always perform a string comparison with:
assert( u1.string() == u2.string() );
and Syntax-Based comparison with:
assert( u1 == u2 );
The only difference between normalization and comparison is that comparison acts as if the URLs were normalized, without modifying them. This has a cost when the URLs are different, but the extra work is constant per code point and does not require any memory allocations from the heap.
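To illustrate what "as if normalized" can mean, here is a minimal sketch for a single case-sensitive component, covering only percent-encoding normalization. The names unit, next_unit and equal_as_if_normalized are hypothetical and are not part of Boost.URL. Each side is decoded one unit at a time while comparing, so no normalized copy is ever built.

```cpp
#include <cstddef>
#include <string_view>

// One logical unit of a component: the octet, and whether it has to
// stay percent-encoded (it was escaped and is not an unreserved char).
struct unit { unsigned char value; bool escaped; };

inline int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Unreserved characters from RFC 3986 section 2.3.
inline bool is_unreserved(unsigned char c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           (c >= '0' && c <= '9') ||
           c == '-' || c == '.' || c == '_' || c == '~';
}

// Reads the next unit starting at pos and advances pos past it.
inline unit next_unit(std::string_view s, std::size_t& pos) {
    if (s[pos] == '%' && pos + 2 < s.size()) {
        int hi = hex_value(s[pos + 1]);
        int lo = hex_value(s[pos + 2]);
        if (hi >= 0 && lo >= 0) {
            unsigned char v = static_cast<unsigned char>(hi * 16 + lo);
            pos += 3;
            // "%63" behaves like "c"; "%2F" stays distinct from "/".
            return { v, !is_unreserved(v) };
        }
    }
    return { static_cast<unsigned char>(s[pos++]), false };
}

// Compares two components as if both were percent-encoding normalized.
// No heap allocation: each string is walked once.
bool equal_as_if_normalized(std::string_view a, std::string_view b) {
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        unit x = next_unit(a, i);
        unit y = next_unit(b, j);
        if (x.value != y.value || x.escaped != y.escaped)
            return false;
    }
    return i == a.size() && j == b.size();
}
```

Hex digit case is ignored because escapes are compared by value, which has the same effect as normalizing them to uppercase first.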
Other libs
Other libraries, such as folly and apache, don't include normalization. JavaScript's URL library doesn't include normalization either, but there are some popular libraries, such as sindresorhus/normalize-url, that people seem to use.
sindresorhus/normalize-url, however, has options that are not closely related to what RFC 3986 describes as normalization. It implicitly assumes URLs are always http, doesn't include the RFC normalization rules, and includes lots of rules that go beyond even http normalization, because the normalized URLs would point to different HTTP resources. In practice, these are at most useful conversion functions for URLs, and Boost.URL provides alternatives for each of them.
Methods
These are the normalization/comparison strategies, ordered from the highest to the lowest probability of false negatives and from the lowest to the highest cost.
- String comparison
- Syntax-Based normalization
- Scheme-Based Normalization
- Protocol-Based Normalization
Only the first two make sense for the level of abstraction of Boost.URL. Scheme-Based Normalization fits better in Boost.HTTP. Protocol-Based Normalization makes sense in some web spiders.
Trade-offs
False negatives (two URLs to the same resource being considered different) can never be completely eliminated because they depend a lot on the context. Example: when the same website is served from two servers, comparison will return false for the same resource.
We can only reduce false negatives further with rules augmented by the scheme, the protocol, and the context. Minimizing false negatives also has an extra cost for each normalization.
So the goal is to minimize false negatives and completely eliminate false positives.
Note on Relative references
Applications should not compare relative references directly to identify resources; the references should first be resolved to their target URIs. Fragments should often be ignored when URLs are compared to select a network action. A positive example is HTML anchors, which represent the same resource. A negative example is a Git commit tag, which represents different resources.
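As a small illustration of ignoring fragments when selecting a network action, the sketch below compares two URL strings up to the first "#". The helper names without_fragment and same_network_target are hypothetical, and the comparison after the split is a plain string comparison.

```cpp
#include <string_view>

// Returns the URL without its fragment (everything before the first '#').
// If there is no '#', find() returns npos and the whole string is kept.
std::string_view without_fragment(std::string_view uri) {
    return uri.substr(0, uri.find('#'));
}

// Two URLs select the same network action if they match once the
// fragment is excluded from the comparison.
bool same_network_target(std::string_view a, std::string_view b) {
    return without_fragment(a) == without_fragment(b);
}
```

This fits the HTML anchor case; for the Git commit tag case the fragment would have to be part of the comparison.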
More details on each method:
Comparison Methods and Normalizations by the probability of false negatives and cost:
- String comparison:
  - No direct relationship with resource
  - False negatives are mostly caused by URI aliases
  - Example: two strings pointing to the same resource: http://myothersite/user/65 != http://myothersite/user/bob
  - Implementations will, in their own best interest, be consistent in providing URI references though
- Syntax-Based normalization:
  - Used by Web user agents, such as browsers, to determine if a cached response is available
  - Uses definitions provided by RFC 3986 itself
  - Example: "example://a/b/c/%7Bfoo%7D" == "eXAMPLE://a/./b/../b/%63/%7bfoo%7d"
  - Case normalization:
    - The hexadecimal digits are case-insensitive: normalize to uppercase
    - Scheme and host are case-insensitive: normalize to lowercase
    - The other generic syntax components are case-sensitive
  - Percent-encoding normalization:
    - Decode percent-encoded octets that do not require percent-encoding (unreserved characters, 2.3)
  - Path Segment Normalization:
    - Keep "." and ".." only in relative references (4.1)
    - Remove them in non-relative paths (5.2)
    - Use the remove_dot_segments algorithm (5.2.4)
- Scheme-Based Normalization:
  - Example: HTTP default port 80, empty path == "/", so these are the same:
    - http://example.com, http://example.com/, http://example.com:/, and http://example.com:80/ are equivalent
  - Other common normalizations:
    - Normalize HTTP empty path to "/"
    - Remove HTTP port 80 or empty port
    - Normalize empty host to localhost, a default, or error
    - Don't remove host if userinfo or port are not empty
    - Lowercase other components
- Protocol-Based Normalization:
  - Example: identifying that "http://example.com/data" always redirects to "http://example.com/data/"
  - Usually requires actually accessing resources at least once
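The HTTP rules listed under Scheme-Based Normalization can be sketched on already-parsed components. The http_parts struct and normalize_http function below are hypothetical and only cover the default port and empty path rules; they assume the scheme is already known to be http.

```cpp
#include <string>

// Hypothetical, already-parsed pieces of an http URL.
struct http_parts {
    std::string host;
    std::string port;  // empty when no port (or an empty port) was given
    std::string path;
};

// Applies two http-specific rules: port 80 is the default, so it is
// dropped, and an empty path is equivalent to "/".
http_parts normalize_http(http_parts p) {
    if (p.port == "80")
        p.port.clear();
    if (p.path.empty())
        p.path = "/";
    return p;
}
```

With these rules, the components of http://example.com, http://example.com/, http://example.com:/ and http://example.com:80/ all normalize to the same host, port and path.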
Some updates:
I've implemented the Syntax-Based comparison as if the components were normalized to avoid allocating memory. The problem we have is that the remove_dot_segments algorithm (5.2.4) requires a stack of some form. Creating the stack would allocate memory, but not creating the stack would make the algorithm O(n^2).
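For reference, the steps of RFC 3986 section 5.2.4 can be sketched as follows. This version is not allocation-free: the output std::string plays the role of the stack, which is exactly the cost discussed above. It is a direct transcription of the RFC steps, not the Boost.URL implementation.

```cpp
#include <string>
#include <string_view>

// remove_dot_segments following RFC 3986 section 5.2.4, steps A to E.
std::string remove_dot_segments(std::string_view in) {
    std::string out;
    auto starts_with = [&](std::string_view p) {
        return in.substr(0, p.size()) == p;
    };
    // Removes the last segment and its preceding "/" (if any) from out.
    auto pop_last = [&] {
        auto pos = out.rfind('/');
        out.erase(pos == std::string::npos ? 0 : pos);
    };
    while (!in.empty()) {
        if (starts_with("../"))       in.remove_prefix(3);            // A
        else if (starts_with("./"))   in.remove_prefix(2);            // A
        else if (starts_with("/./"))  in.remove_prefix(2);            // B
        else if (in == "/.")          in = "/";                       // B
        else if (starts_with("/../")) { in.remove_prefix(3); pop_last(); } // C
        else if (in == "/..")         { in = "/"; pop_last(); }       // C
        else if (in == "." || in == "..") in = std::string_view{};    // D
        else {                                                        // E
            // Move the first segment, including its leading "/" if any,
            // up to but not including the next "/".
            auto end = in.find('/', 1);
            if (end == std::string_view::npos) end = in.size();
            out.append(in.substr(0, end));
            in.remove_prefix(end);
        }
    }
    return out;
}
```

Applied to the path of the earlier example, "/a/./b/../b/%63/%7bfoo%7d" becomes "/a/b/%63/%7bfoo%7d"; percent-encoding normalization is a separate step.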
In practice, I think this is rarely going to be Θ(n^2) because:
- not many paths will have many ".."s specifically meant to make it Θ(n^2),
- we can always create a logical stack buffer that goes up to a point, and
- URLs themselves are limited to ~2000 chars
So I'm still exploring the solution with no allocations and then we can later include some of these optimizations.