Comments (8)
Here are two cases I came across recently:
- Page level TTFB exceeds range of smallint (was a site in China tested from Dulles - but can't find the waterfall ATM)
- Response code for a request is -2, so the import fails because requests.status is unsigned.
This test on Ilya's site is one example - http://www.webpagetest.org/result/140601_Q7_PTB/1/details/
Edit:
The second issue might be fixed by catchpoint/WebPageTest#256
from legacy.httparchive.org.
Here's an example of the current crawl's batch_report:
failed: 3014 (11%)
0 - submission failed
0 - WPT test failed
735 - test result processing failed
2279 - HAR import failed
More detail is written to batch.log. I removed unique info (pageid, url, time) and sorted by error type:
1146 The first request () failed with status 12007.
352 The first request () failed with status 12029.
155 The first request () failed with status 404.
100 The first request () failed with status 403.
97 no first HTML URL found.
97 AggregateStats failed. Purging pageid
38 The first request () failed with status 12031.
34 The first request () failed with status 503.
25 The first request () failed with status 500.
12 The first request () failed with status 12152.
11 The first request () failed with status 504.
9 The first request () failed with status 502.
9 The first request () failed with status 400.
8 The first request () failed with status 522.
4 The first request () failed with status 401.
3 The first request () failed with status 523.
3 The first request () failed with status 520.
2 The first request () failed with status 405.
1 The first request () failed with status 521.
1 The first request () failed with status 512.
1 The first request () failed with status 505.
1 The first request () failed with status 408.
1 The first request () failed with status 12055.
The "first request () failed" errors occur midway during HAR parsing, so the HAR was successfully downloaded but there were no valid entries.
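The strip-and-count step described above can be sketched as a small script. This is only a sketch: the leading "pageid url timestamp" field layout and the sample lines are assumptions for illustration, not the actual batch.log format.

```python
import re
from collections import Counter

def error_type(line):
    """Strip per-page details from a batch.log line, leaving only the
    error message for grouping. The log layout here is an assumption."""
    # Drop an assumed leading "pageid url timestamp " prefix, if present.
    line = re.sub(r'^\d+\s+\S+\s+\S+\s+', '', line.strip())
    # Blank out the URL inside "The first request (...) failed ..." so
    # every failure with the same status code groups together.
    return re.sub(r'\(.*?\)', '()', line)

def summarize(log_lines):
    """Count batch.log lines by error type, most common first."""
    counts = Counter(error_type(l) for l in log_lines if l.strip())
    return counts.most_common()

sample = [
    "123 http://a.example 2014-06-01 The first request (http://a.example/) failed with status 12007.",
    "124 http://b.example 2014-06-01 The first request (http://b.example/) failed with status 12007.",
    "125 http://c.example 2014-06-01 no first HTML URL found.",
]
for msg, count in summarize(sample):
    print(count, msg)
```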
I found these 12xxx error code definitions in chromeExtensionUtils.js:
'net::ERR_NAME_NOT_RESOLVED': 12007,
'net::ERR_CONNECTION_ABORTED': 12030,
'net::ERR_ADDRESS_UNREACHABLE': 12029,
'net::ERR_CONNECTION_REFUSED': 12029,
'net::ERR_CONNECTION_TIMED_OUT': 12029,
'net::ERR_CONNECTION_RESET': 12031
What do 12055 and 12152 mean?
I increased the "retry" value to "3" (so it makes 2 more passes through the failed URLs). The failure rate drops from ~11% to ~5% (need to verify this once the current crawl is done). It would be interesting to track the URLs that failed once or twice and then worked. That might help indicate why they failed initially.
Then we should add some code to track URLs that always fail and remove them from the crawl.
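The tracking idea could look something like the sketch below: record each attempt's outcome per URL, then separate transient failures (failed once or twice, then worked) from URLs that always fail and are candidates for removal. The `RetryTracker` interface is hypothetical, not the existing batch code.

```python
from collections import defaultdict

class RetryTracker:
    """Record per-URL attempt outcomes across retry passes so transient
    failures can be separated from URLs that never succeed."""

    def __init__(self):
        self.attempts = defaultdict(list)  # url -> [True/False outcomes]

    def record(self, url, success):
        self.attempts[url].append(success)

    def transient_failures(self):
        """URLs that failed at first but eventually succeeded."""
        return [u for u, a in self.attempts.items() if not a[0] and a[-1]]

    def always_failed(self):
        """URLs that never succeeded; candidates for removal from the crawl."""
        return [u for u, a in self.attempts.items() if not any(a)]

tracker = RetryTracker()
tracker.record("http://flaky.example/", False)
tracker.record("http://flaky.example/", True)   # worked on retry
tracker.record("http://dead.example/", False)
tracker.record("http://dead.example/", False)
tracker.record("http://dead.example/", False)
print(tracker.transient_failures())  # -> ['http://flaky.example/']
print(tracker.always_failed())       # -> ['http://dead.example/']
```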
from legacy.httparchive.org.
Here are the wininet error codes: http://msdn.microsoft.com/en-us/library/windows/desktop/aa385465%28v=vs.85%29.aspx
12055 - The SSL certificate contains errors
12152 - The server response could not be parsed.
We might also be able to do a pre-crawl step where we take the URL list and run it through a script that uses curl with an IE user agent on the base page to weed out broken domains/pages.
In particular we can use it to figure out if www. works or if we should just use the bare domain.
That won't necessarily help if there is a transient server or network issue but it would help reduce the amount of time spent testing invalid pages.
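A pre-crawl check along these lines might be sketched as below, assuming a plain GET with an IE user-agent string is enough to weed out dead domains and to decide between the bare domain and the www. variant. The UA string and candidate ordering are illustrative assumptions.

```python
import urllib.request
import urllib.error

# An IE user-agent string, since the crawl's test agents render as IE
# (illustrative; the real pre-crawl script could shell out to curl instead).
IE_UA = "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/6.0)"

def candidates(domain):
    """Candidate base-page URLs: bare domain first, then the www. variant."""
    return ["http://%s/" % domain, "http://www.%s/" % domain]

def first_working_url(domain, timeout=10):
    """Return the first candidate URL that answers without an error,
    or None if all of them fail (broken domain, skip it in the crawl)."""
    for url in candidates(domain):
        req = urllib.request.Request(url, headers={"User-Agent": IE_UA})
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                if resp.status < 400:
                    return url
        except (urllib.error.URLError, OSError):
            continue  # 4xx/5xx, DNS failure, timeout: try next candidate
    return None
```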
For the cases where no first URL is found it would be nice to know if the HAR file itself failed to generate (maybe retry in case there was a WPT server issue) or something else.
from legacy.httparchive.org.
Is the batch log somewhere I can download it? I'd like to see what the errors were that were fixed with a re-run (or is that already included?). If page X was re-submitted twice and failed all 3 times are there 3 entries in the counts above or just the last failure?
from legacy.httparchive.org.
The batch log is not downloadable. A failed page is only counted once, but the errors will show up 3 times in the log.
from legacy.httparchive.org.
Some notes: The "urls" table has some helpful columns like "optout" and "urlFixed". These could be used to address this bug. For example,
- If the Alexa zip file contains a domain that no longer exists, you could set optout=true in the urls table and it will no longer be tested
- If our HA code incorrectly converted a domain to http://www.foo.com and that produces an error but http://foo.com works, then you could set urlFixed to http://foo.com. I believe the crawl code already prefers urlFixed and then falls back to urlOrig.
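The selection logic described in these notes can be illustrated with a short sketch (not the actual crawl code; the dict stands in for a row of the "urls" table):

```python
def url_to_test(row):
    """Pick the URL to crawl for one `urls` table row: skip opted-out
    rows, prefer urlFixed when set, otherwise fall back to urlOrig."""
    if row.get("optout"):
        return None
    return row.get("urlFixed") or row["urlOrig"]

print(url_to_test({"urlOrig": "http://www.foo.com/", "urlFixed": "http://foo.com/"}))
# -> http://foo.com/
print(url_to_test({"urlOrig": "http://gone.example/", "optout": True}))
# -> None
```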
from legacy.httparchive.org.
Not sure if this is directly related, but I picked up a couple of really minor errors with the September 1st run (both mobile and desktop). A couple of sites have a NULL numDomains, which is a constraint violation when I import them into Postgres. I think this constraint is correct, hence these are errors.
DETAIL: Failing row contains (3448296, 1473341436, All, Sep 1 2016, , 0, http://www.mc361.com/, 1473336845, 12088, 11430, 36559, null, null, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, null, 2016-09-01, 4576, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2226, 0, 0, 0, 0, 0, 0, 0, 38050, null, 14523, 40100, 221861, 7247, null, null, 454, null, ).
CONTEXT: COPY urls, line 3868: "3448296 1473341436 All Sep 1 2016 454 0 http://www.mc361.com/ \N 1473336845 4576 12088 11430 36559 ..."
2016-09-12T18:05:11.142000+02:00 ERROR Database error 23502: null value in column "numDomains" violates not-null constraint
DETAIL: Failing row contains (3448299, 1473341436, All, Sep 1 2016, , 0, http://www.kurogal.com/, 1473337075, 12192, 19818, 25634, null, null, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, null, 2016-09-01, 8366, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1305, 0, 0, 0, 0, 0, 0, 0, 27839, null, 20503, 26100, 361553, 19184, null, null, 454, null, width=device-width, initial-scale=1.0, maximum-scale=1.0).
CONTEXT: COPY urls, line 3: "3448299 1473341436 All Sep 1 2016 454 0 http://www.kurogal.com/ \N 1473337075 8366 12192 19818 2563..."
2016-09-12T18:05:11.548000+02:00 ERROR Database error 23502: null value in column "numDomains" violates not-null constraint
DETAIL: Failing row contains (3448300, 1473341437, All, Sep 1 2016, , 0, http://www.digitalsummit.com/, 1473336869, 7662, 4288, 25830, null, null, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, null, 2016-09-01, 2059, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 488, 0, 0, 0, 0, 0, 0, 0, 33855, null, 8677, 11400, 1289606, 81990, null, null, 454, null, width=device-width, initial-scale=1, maximum-scale=1).
CONTEXT: COPY urls, line 1: "3448300 1473341437 All Sep 1 2016 454 0 http://www.digitalsummit.com/ \N 1473336869 2059 7662 4288 ..."
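One way to keep the Postgres import from aborting on these rows is to filter out lines with a NULL numDomains before running COPY. A sketch, assuming a tab-separated dump; the column index used here is hypothetical and would need to match the real dump layout.

```python
def drop_null_rows(tsv_lines, col_index):
    """Yield only the TSV lines whose given column is not Postgres's
    NULL marker (\\N), so a NOT NULL column doesn't abort the COPY.
    col_index is the 0-based position of numDomains in the dump."""
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        if col_index < len(fields) and fields[col_index] != r"\N":
            yield line

rows = [
    "3448296\thttp://www.mc361.com/\t\\N\n",  # NULL numDomains: dropped
    "3448300\thttp://ok.example/\t42\n",      # kept
]
kept = list(drop_null_rows(rows, col_index=2))
print(len(kept))  # -> 1
```

The alternative would be to relax the NOT NULL constraint, but since the constraint looks correct, skipping (and logging) the bad rows seems preferable.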
from legacy.httparchive.org.
Closing as obsolete. Feel free to reopen if there is still interest in this.
FYI the error rate is lower now that we are using the CrUX corpus.
from legacy.httparchive.org.