ada-url / ada
WHATWG-compliant and fast URL parser written in modern C++
Home Page: https://ada-url.com
License: Apache License 2.0
A pointer can default to null.
Similar to Prettier.
The convention in the C++ runtime is that capacity() merely indicates how much memory may have been allocated. However, as far as I know, it is not considered safe to write to that memory.
Potentially unsafe usage:
Line 53 in e79260c
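To illustrate: reserve() only allocates; the bytes between size() and capacity() are not part of the string, so the safe pattern is to resize() before writing. A minimal sketch (fill_safely is a hypothetical example, not ada code):

```cpp
#include <cstddef>
#include <string>

// capacity() only reports how much memory was allocated; bytes between
// size() and capacity() are not part of the string, and writing to them
// through data() is undefined behavior. resize() first, then write.
std::string fill_safely(std::size_t n, char c) {
  std::string out;
  out.reserve(n);  // allocation hint only: size() is still 0 here
  out.resize(n);   // now the n bytes belong to the string
  for (std::size_t i = 0; i < n; i++) {
    out[i] = c;    // in-bounds write: i < out.size()
  }
  return out;
}
```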
Hardcoded constants are tuned for the M1 Max. We should investigate: https://gist.github.com/ibireme/173517c208c7dc333ba962c1f0d67d12#file-kpc_demo-c-L868-L1069
cc @ibireme
I think that these should be the tests for it...
{
  "comment": [
    "# Additional tests designed by the ada team."
  ],
  "pathname": [
    {
      "href": "https://lemire.me",
      "new_value": "école",
      "encoding": "UTF-8",
      "expected": {
        "href": "https://example.net/%C3%A9cole",
        "pathname": "/%C3%A9cole"
      }
    },
    {
      "href": "https://lemire.me",
      "new_value": "école",
      "encoding": "UTF-16LE",
      "expected": {
        "href": "https://example.net/%E9%00c%00o%00l%00e%00",
        "pathname": "/%E9%00c%00o%00l%00e%00"
      }
    },
    {
      "href": "https://example.net#nav",
      "new_value": "école",
      "encoding": "UTF-16BE",
      "expected": {
        "href": "https://lemire.me/%00%E9%00c%00o%00l%00e",
        "pathname": "/%00%E9%00c%00o%00l%00e"
      }
    }
  ]
}
This commit failed on CI while running wpt_tests. We need to investigate whether there's a bug within our testing infra. 0e36cba
http://./
as an input is valid for both Safari and Chrome, but it is invalid for us.
We should allow compile-time selection of a logging flag so that a deep understanding of the execution is easy (without having to use a debugger).
See for example how we did it with simdjson:
simdjson/simdjson#1938
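A minimal sketch of what such a compile-time switch could look like (ADA_LOGGING and ada_log are hypothetical names, not the actual implementation):

```cpp
#include <iostream>

// Hypothetical compile-time logging switch in the spirit of simdjson's:
// when ADA_LOGGING is 0 (the default), the branch is dead code and the
// compiler removes the logging entirely, so release builds pay nothing.
#ifndef ADA_LOGGING
#define ADA_LOGGING 0
#endif

#define ada_log(msg)                              \
  do {                                            \
    if (ADA_LOGGING) {                            \
      std::cerr << "ada: " << (msg) << std::endl; \
    }                                             \
  } while (0)

int next_state(int state) {
  ada_log("advancing the state machine");
  return state + 1;
}
```

Building with `-DADA_LOGGING=1` would then turn on the trace without touching a debugger.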
Now that we have a parse_host function, the set_host function should call it directly instead of parsing a whole URL.
Dropping the ICU dependency is not impossibly hard. It would simplify builds and improve our engineering.
I estimate that it would take about a week of work.
The following seems like a minimum:
It might look as follows...
std::string(input_url.get_scheme())
  + ":"
  + (input_url.host.has_value() ?
      "//"
      + input_url.username
      + (input_url.password.empty() ? "" : ":" + input_url.password)
      + (input_url.includes_credentials() ? "@" : "")
      + input_url.host.value()
      + (input_url.port.has_value() ? ":" + std::to_string(input_url.port.value()) : "")
    : "")
  + input_url.path
  + (input_url.query.has_value() ? "?" + input_url.query.value() : "")
  + (input_url.fragment.has_value() ? "#" + input_url.fragment.value() : "");
The current code is still young with lots and lots of temporary comments. We should clean that out.
Both of them have the same code, except for a single line:
If state override is given and state override is [hostname state](https://url.spec.whatwg.org/#hostname-state), then return.
What is the purpose of this optional URL?
Line 22 in 20cac12
If it is dead code, it should be removed.
Current work in progress: nodejs/node#46410
It appears that the percent_encoding tests in our unit tests have not been completed.
Take the URI https://faß.ExAmPlE/ (encoded as https://fa\xc3\x9f.ExAmPlE/).
The Brave browser and Microsoft Edge map it to https://fass.example
Firefox and Safari map it to https://xn--fa-hia.example
The command-line curl tool can't seem to process it.
The command-line wget tool maps it to fa\303\237.example.
If you try the following in curl...
#include <curl/curl.h>
#include <stdio.h>

int main() {
  CURLU *url = curl_url();
  CURLUcode rc = curl_url_set(url, CURLUPART_URL, "https://fa\xc3\x9f.ExAmPlE/",
                              CURLU_URLENCODE);
  if (!rc) {
    char *host;
    rc = curl_url_get(url, CURLUPART_HOST, &host, 0);
    if (!rc) {
      printf("the host is %s\n", host);
      curl_free(host);
    }
    rc = curl_url_get(url, CURLUPART_HOST, &host, CURLU_URLENCODE);
    if (!rc) {
      printf("the host is %s\n", host);
      curl_free(host);
    }
  }
  curl_url_cleanup(url);
}
(Compile as c++ test.cpp -lcurl), I get...
the host is faß.ExAmPlE
the host is faß.ExAmPlE
Maybe I am misusing curl?
There is an additional flag in curl related to punycode, but it is only available in curl 7.88.0, which is seemingly unreleased as I write these lines: https://curl.se/changes.html
We need to re-tune the performance before any release.
When you receive a domain name or label, you should preserve its case. The rationale for this choice is that we may someday need to add full binary domain names for new services; existing services would not be changed.
RFC 1034 : https://www.rfc-editor.org/rfc/rfc1034
I do not find anything at https://url.spec.whatwg.org/#url-parsing saying that we should lowercase. The spec refers to case-insensitive matching, which can be accomplished by lowercasing the strings, but that is not the same thing as storing the lowercase version of the domain.
We should, similarly, check whether other strings that we manipulate should be stored with their case changed.
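For example, case-insensitive matching can be done on the fly, without storing a lowercased copy; a minimal sketch (equals_ignoring_case is a hypothetical helper):

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

// Case-insensitive ASCII comparison that leaves both inputs untouched:
// we can match "ExAmPlE" against "example" while still storing the
// domain with its original case preserved.
bool equals_ignoring_case(std::string_view a, std::string_view b) {
  if (a.size() != b.size()) { return false; }
  for (std::size_t i = 0; i < a.size(); i++) {
    if (std::tolower(static_cast<unsigned char>(a[i])) !=
        std::tolower(static_cast<unsigned char>(b[i]))) {
      return false;
    }
  }
  return true;
}
```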
We should run tests in CI where we do cmake --install
and then check that a project can make use of the result.
Here is how it is done in other projects...
cmake --build . &&
ctest -j --output-on-failure &&
cmake --install . &&
cd ../tests/installation_tests/find &&
mkdir build && cd build && cmake -DCMAKE_INSTALL_PREFIX:PATH=../../../build/destination .. && cmake --build .
(see simdutf)
The current implementation of parse_url is not ideal as far as maintainability is concerned. It originally worked with std::string_view::iterator instances, and the finite state machine would sometimes decrement the iterators before the start of the string, only to soon increment them again. We have since replaced the iterators with an integer position (input_position), maintained in a range between 0 and the input length (inclusive). @ronag proposed what I feel is a better design at: #169
Instead of systematically incrementing the integer position with each pass, he just expects the states to increment as needed. This is simpler because we don't need to decrement and then reincrement.
It should be possible to implement parse_url in a forward-only design (where you never go back, you only go forward).
Alternatively, it could be implemented with a string_view that gets shorter and shorter.
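Under the shrinking-string_view design, one step might look like this sketch (consume_scheme is a hypothetical helper, not actual ada code):

```cpp
#include <cstddef>
#include <string_view>

// Forward-only parsing with a shrinking view: each step consumes a
// prefix via remove_prefix, so the parser never moves backwards and
// the view only ever gets shorter.
std::string_view consume_scheme(std::string_view& input) {
  std::size_t colon = input.find(':');
  if (colon == std::string_view::npos) { return {}; }
  std::string_view scheme = input.substr(0, colon);
  input.remove_prefix(colon + 1);  // the view shrinks; no decrement ever
  return scheme;
}
```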
Node has domainToASCII
and domainToUnicode
coupled with the URL parser. We should have an API layer that does the same thing:
void DomainToUnicode(const FunctionCallbackInfo<Value>& args) {
  Environment* env = Environment::GetCurrent(args);
  CHECK_GE(args.Length(), 1);
  CHECK(args[0]->IsString());
  Utf8Value value(env->isolate(), args[0]);

  URLHost host;
  // Assuming the host is used for a special scheme.
  host.ParseHost(*value, value.length(), true, true);
  if (host.ParsingFailed()) {
    args.GetReturnValue().Set(FIXED_ONE_BYTE_STRING(env->isolate(), ""));
    return;
  }
  std::string out = host.ToStringMove();
  args.GetReturnValue().Set(
      String::NewFromUtf8(env->isolate(), out.c_str()).ToLocalChecked());
}
Currently, unit tests are not passing.
RFC 1034 specifically gives the example poneria.ISI.EDU.
We should add tests.
Currently, we assume valid UTF-8 inputs and we only test with valid UTF-8 inputs; we do not actually check that the input is valid UTF-8.
We need to determine whether the library is expected to handle invalid Unicode inputs productively.
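If we decide to validate, a scalar validator is short. A sketch (a production validator, e.g. simdutf's, would be faster):

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Scalar UTF-8 validator sketch: checks lead-byte classes, continuation
// bytes, overlong encodings, surrogates, and the U+10FFFF upper bound.
bool validate_utf8(std::string_view s) {
  std::size_t i = 0;
  while (i < s.size()) {
    std::uint8_t lead = static_cast<std::uint8_t>(s[i]);
    if (lead < 0x80) { i++; continue; }  // ASCII fast path
    std::size_t len;
    std::uint32_t min_cp;
    if ((lead & 0xE0) == 0xC0) { len = 2; min_cp = 0x80; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; min_cp = 0x800; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; min_cp = 0x10000; }
    else { return false; }  // stray continuation or invalid lead byte
    if (i + len > s.size()) { return false; }  // truncated sequence
    std::uint32_t cp = lead & (0xFF >> (len + 1));
    for (std::size_t j = 1; j < len; j++) {
      std::uint8_t cont = static_cast<std::uint8_t>(s[i + j]);
      if ((cont & 0xC0) != 0x80) { return false; }
      cp = (cp << 6) | (cont & 0x3F);
    }
    // reject overlong forms, surrogates, and out-of-range code points
    if (cp < min_cp || cp > 0x10FFFF ||
        (cp >= 0xD800 && cp <= 0xDFFF)) { return false; }
    i += len;
  }
  return true;
}
```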
Supported states:
wpt_tests are missing the href setter tests.
We need a WPT runner to understand where we are and what kinds of errors we have. This commit has a script to update the results from the Git repository of WPT. It would be satisfactory if we could parse the JSON and run the tests according to each object.
The scheme parsing should be pulled out from the large state machine.
This project needs a logo.
/Users/yagiz/Developer/url-parser/tests/basic_fuzzer.cpp:37:59: warning: multiple unsequenced modifications to 'counter' [-Wunsequenced]
copy.insert(copy.begin()+(211311*counter++)%copy.size(), char(counter++*777));
^ ~~
1 warning generated.
[55/55] Linking CXX executable benchmarks/bench
Build finished
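A sketch of a fix: split the two increments into separate statements so the side effects are sequenced (insert_pseudo_random_char is a hypothetical refactoring of the fuzzer line, not the committed fix):

```cpp
#include <cstddef>
#include <string>

// The original expression mutated `counter` twice with no sequencing
// between the two counter++ side effects, so the result is unspecified
// (hence -Wunsequenced). Splitting the increments into separate
// statements makes the evaluation order explicit and deterministic.
void insert_pseudo_random_char(std::string& copy, std::size_t& counter) {
  std::size_t index = (211311 * counter++) % copy.size();
  char value = static_cast<char>(counter++ * 777);
  copy.insert(copy.begin() + index, value);
}
```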
For example, Node uses the following struct. We need to construct an ada::url from such a struct. If we use ada::url value; and then call value.set_port(), it might not work, since there are limitations on the setters' side (for example, you can't set port if host does not have a value). This also needs to properly set the private variables inside our ada::url struct.
struct url {
  int32_t flags = URL_FLAGS_NONE;
  int port = -1;
  std::string scheme;
  std::string username;
  std::string password;
  std::string host;
  std::string query;
  std::string fragment;
  std::vector<std::string> path;
  std::string href;
};
with the following flags:
#define FLAGS(XX)                                  \
  XX(URL_FLAGS_NONE, 0)                            \
  XX(URL_FLAGS_FAILED, 0x01)                       \
  XX(URL_FLAGS_CANNOT_BE_BASE, 0x02)               \
  XX(URL_FLAGS_INVALID_PARSE_STATE, 0x04)          \
  XX(URL_FLAGS_TERMINATED, 0x08)                   \
  XX(URL_FLAGS_SPECIAL, 0x10)                      \
  XX(URL_FLAGS_HAS_USERNAME, 0x20)                 \
  XX(URL_FLAGS_HAS_PASSWORD, 0x40)                 \
  XX(URL_FLAGS_HAS_HOST, 0x80)                     \
  XX(URL_FLAGS_HAS_PATH, 0x100)                    \
  XX(URL_FLAGS_HAS_QUERY, 0x200)                   \
  XX(URL_FLAGS_HAS_FRAGMENT, 0x400)                \
  XX(URL_FLAGS_IS_DEFAULT_SCHEME_PORT, 0x800)      \
We mix implementation (definitions) and declarations in our headers. We should split the headers into something.h (declarations only) and implementation-inl.h (inline definitions).
It says....
A URL’s path is either a URL path segment or a list of zero or more URL path segments, usually identifying a location. It is initially « ». A special URL’s path is always a list, i.e., it is never opaque. A URL has an opaque path if its path is a URL path segment.
Are we standards-compliant?
Our set_host, set_path, and set_scheme functions take an encoding parameter, but set_search does not. We should make sure that this is the desired API.
....
We fail to parse this scenario:
{
  "input": "http:foo.com",
  "base": "http://example.org/foo/bar",
  "href": "http://example.org/foo/foo.com",
  "origin": "http://example.org",
  "protocol": "http:",
  "username": "",
  "password": "",
  "host": "example.org",
  "hostname": "example.org",
  "port": "",
  "pathname": "/foo/foo.com",
  "search": "",
  "hash": ""
},
Also in this scenario:
{
  "input": "a:\t foo.com",
  "base": "http://example.org/foo/bar",
  "href": "a: foo.com",
  "origin": "null",
  "protocol": "a:",
  "username": "",
  "password": "",
  "host": "",
  "hostname": "",
  "port": "",
  "pathname": " foo.com",
  "search": "",
  "hash": ""
}
There are other failing cases.
Currently, we call to_ascii with be_strict set to false. Thus we deliberately allow overly long domain names.
Is that what we want to do? The specification should not allow them, and ICU is aware of these limits; this should be tested and forbidden.
Right now, cmake includes both the benchmark and the tests folders if tests are enabled. Separating them would reduce the build time both for ourselves and for CI. It would be good to have flags to distinguish tests from benchmarks.
The parse_url function returns a URL. In many URL parsing libraries, there is no parse_url and you just do...
ada::url url("http://google.com")
or the equivalent.
Now, having a parsing function could make sense if you have a parser. So you do...
ada::parser p;
ada::url url = p.parse("http://google.com");
This can make sense because the parser instance can hold some resources (such as space for temporary buffers), and it can also hold configuration switches.
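A sketch of that design, with toy_url and toy_parser as simplified stand-ins for the real types:

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Parser-object design sketch: the parser owns reusable scratch space
// and configuration switches; parse() can be called repeatedly without
// reallocating temporaries. These types are illustrative only.
struct toy_url {
  std::string scheme;
};

struct toy_parser {
  std::string buffer;      // reusable temporary buffer
  bool be_strict = false;  // example configuration switch

  std::optional<toy_url> parse(std::string_view input) {
    std::size_t colon = input.find(':');
    if (colon == std::string_view::npos) { return std::nullopt; }
    toy_url out;
    out.scheme = std::string(input.substr(0, colon));
    return out;
  }
};
```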
The entire code base assumes UTF-8. To support UTF-16, we simply need to transcode (easy!).
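A naive transcoding front end might look like this sketch; in production we would use a vetted library such as simdutf rather than hand-rolled code:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Minimal UTF-16 (host order) to UTF-8 transcoder, illustrating that
// UTF-16 support only requires a transcoding step in front of the
// existing UTF-8 parser. Unpaired surrogates are passed through as-is.
std::string utf16_to_utf8(const std::u16string& in) {
  std::string out;
  for (std::size_t i = 0; i < in.size(); i++) {
    std::uint32_t cp = in[i];
    // combine a surrogate pair into a single code point
    if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
      std::uint32_t low = in[i + 1];
      if (low >= 0xDC00 && low <= 0xDFFF) {
        cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
        i++;
      }
    }
    if (cp < 0x80) {            // 1-byte sequence (ASCII)
      out += static_cast<char>(cp);
    } else if (cp < 0x800) {    // 2-byte sequence
      out += static_cast<char>(0xC0 | (cp >> 6));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {  // 3-byte sequence
      out += static_cast<char>(0xE0 | (cp >> 12));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                    // 4-byte sequence
      out += static_cast<char>(0xF0 | (cp >> 18));
      out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    }
  }
  return out;
}
```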