cdx_writer.py
Python script to create CDX index files of WARC data.
Usage
Usage: cdx_writer.py [options] warc.gz
Options:
-h, --help show this help message and exit
--format=FORMAT A space-separated list of fields [default: 'N b a m s k r M S V g']
--use-full-path Use the full path of the warc file in the 'g' field
--file-prefix=FILE_PREFIX Path prefix for warc file name in the 'g' field.
Useful if you are going to relocate the warc.gz file
after processing it.
--all-records By default we only index http responses. Use this flag
to index all WARC records in the file
--screenshot-mode Special Wayback Machine mode for handling WARCs
containing screenshots
--exclude-list=EXCLUDE_LIST File containing url prefixes to exclude
--stats-file=STATS_FILE Output json file containing statistics
Output is written to stdout. The first line of output is the CDX header.
This header line begins with a space so that the cdx file can be passed
through sort
while keeping the header at the top.
Format
The supported format options are:
M meta tags (AIF) *
N massaged url
S compressed record size
V compressed arc file offset *
a original url **
b date **
g file name
k new style checksum *
m mime type of original document *
r redirect *
s response code *
* in alexa-made dat file
** in alexa-made dat file meta-data line
More information about the CDX format syntax can be found here: http://www.archive.org/web/researcher/cdx_legend.php
Installation
Unfortunately, this script is not propery packaged and cannot be installed via pip. See the .travis.yml file for hints on how to get it running.
Differences between cdx_writer.py and archive-access cdx files
The CDX files produced by the archive-access and that produced by cdx_writer.py differ in these cases:
Differences in SURTs:
- archive-access doesn't encode the %7F character in SURTs
Differences in MIME Type:
- archive-access does not parse mime type for large warc payloads, and just returns 'unk'
- If the HTTP Content-Type header is sent with a blank value, archive-access
returns the value of the previous header as the mime type. cdx_writer.py
returns 'unk' in this case. Example WARC Record (returns "close" as the mime type):
...Content-Length: 0\r\nConnection: close\r\nContent-Type: \r\n\r\n\r\n\r\n
Differences in Redirect urls:
- archive-access does not escape whitespace, cdx_writer.py uses %20 escaping so we can split these files on whitespace.
- archive-access removes unicode characters from redirect urls, cdx_writer.py version keeps them
- archive-access does not decode html entities in redirect urls
- archive-access sometimes does not turn relative URLs into absolute urls
- archive-access sometimes does not remove /../ from redirect urls
- archive-access uses the value from the previous HTTP header for the redirect url if the location header is empty
- cdx_writer.py only looks for http-equiv=refresh meta tag inside
HEAD
element
Differences in Meta Tags:
- cdx_writer.py only looks for meta tags in the
HEAD
element - archive-access version doesn't parse multiple html meta tags, only the first one
- archive-access misses FI meta tags sometimes
- cdx_writer.py always returns tags in A, F, I order. archive-access does not use a consistent order
Differences in HTTP Response Codes
- archive-access returns response code 0 if HTTP header line contains unicode:
HTTP/1.1 302 D\xe9plac\xe9 Temporairement\r\n...