cdx_writer.py

Python script to create CDX index files of WARC data.

Usage

Usage: cdx_writer.py [options] warc.gz

Options:

-h, --help                  show this help message and exit
--format=FORMAT             A space-separated list of fields [default: 'N b a m s k r M S V g']
--use-full-path             Use the full path of the warc file in the 'g' field
--file-prefix=FILE_PREFIX   Path prefix for warc file name in the 'g' field.
                            Useful if you are going to relocate the warc.gz file
                            after processing it.
--all-records               By default we only index http responses. Use this flag
                            to index all WARC records in the file
--screenshot-mode           Special Wayback Machine mode for handling WARCs
                            containing screenshots
--exclude-list=EXCLUDE_LIST File containing url prefixes to exclude
--stats-file=STATS_FILE     Output json file containing statistics

Output is written to stdout. The first line of output is the CDX header. This header line begins with a space so that the cdx file can be passed through sort while keeping the header at the top.

Format

The supported format options are:

M meta tags (AIF) *
N massaged url
S compressed record size
V compressed arc file offset *
a original url **
b date **
g file name
k new style checksum *
m mime type of original document *
r redirect *
s response code *

* in alexa-made dat file
** in alexa-made dat file meta-data line

More information about the CDX format syntax can be found here: http://www.archive.org/web/researcher/cdx_legend.php

Installation

Unfortunately, this script is not propery packaged and cannot be installed via pip. See the .travis.yml file for hints on how to get it running.

Differences between cdx_writer.py and archive-access cdx files

The CDX files produced by the archive-access and that produced by cdx_writer.py differ in these cases:

Differences in SURTs:

archive-access doesn't encode the %7F character in SURTs

Differences in MIME Type:

archive-access does not parse mime type for large warc payloads, and just returns 'unk'
If the HTTP Content-Type header is sent with a blank value, archive-access returns the value of the previous header as the mime type. cdx_writer.py returns 'unk' in this case. Example WARC Record (returns "close" as the mime type): ...Content-Length: 0\r\nConnection: close\r\nContent-Type: \r\n\r\n\r\n\r\n

Differences in Redirect urls:

archive-access does not escape whitespace, cdx_writer.py uses %20 escaping so we can split these files on whitespace.
archive-access removes unicode characters from redirect urls, cdx_writer.py version keeps them
archive-access does not decode html entities in redirect urls
archive-access sometimes does not turn relative URLs into absolute urls
archive-access sometimes does not remove /../ from redirect urls
archive-access uses the value from the previous HTTP header for the redirect url if the location header is empty
cdx_writer.py only looks for http-equiv=refresh meta tag inside HEAD element

Differences in Meta Tags:

cdx_writer.py only looks for meta tags in the HEAD element
archive-access version doesn't parse multiple html meta tags, only the first one
archive-access misses FI meta tags sometimes
cdx_writer.py always returns tags in A, F, I order. archive-access does not use a consistent order

Differences in HTTP Response Codes

archive-access returns response code 0 if HTTP header line contains unicode: HTTP/1.1 302 D\xe9plac\xe9 Temporairement\r\n...

rlugojr / cdx-writer Goto Github PK

cdx-writer's Introduction

cdx_writer.py

Usage

Format

Installation

Differences between cdx_writer.py and archive-access cdx files

Differences in SURTs:

Differences in MIME Type:

Differences in Redirect urls:

Differences in Meta Tags:

Differences in HTTP Response Codes

cdx-writer's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent

Jobs