rlugojr / cdx-writer

This project forked from internetarchive/cdx-writer

Python script to create CDX index files of WARC data

Home Page: http://archive.org

License: GNU Affero General Public License v3.0

cdx_writer.py

Python script to create CDX index files of WARC data.

Usage

Usage: cdx_writer.py [options] warc.gz

Options:

-h, --help                  show this help message and exit
--format=FORMAT             A space-separated list of fields [default: 'N b a m s k r M S V g']
--use-full-path             Use the full path of the warc file in the 'g' field
--file-prefix=FILE_PREFIX   Path prefix for warc file name in the 'g' field.
                            Useful if you are going to relocate the warc.gz file
                            after processing it.
--all-records               By default we only index http responses. Use this flag
                            to index all WARC records in the file
--screenshot-mode           Special Wayback Machine mode for handling WARCs
                            containing screenshots
--exclude-list=EXCLUDE_LIST File containing url prefixes to exclude
--stats-file=STATS_FILE     Output json file containing statistics

Output is written to stdout. The first line of output is the CDX header. This header line begins with a space so that the cdx file can be passed through sort while keeping the header at the top.
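
The leading space works because a space (0x20) sorts before every other printable ASCII character, so a byte-wise sort leaves the header first. A minimal sketch in Python, assuming the header takes the form ` CDX` followed by the field list; the two data lines are made-up placeholders, not real cdx_writer.py output:

```python
# Hypothetical CDX lines; only the leading space on the header matters here.
header = " CDX N b a m s k r M S V g"
records = [
    "org,example)/b 20120101000000 http://example.org/b text/html 200 - - - 512 1024 sample.warc.gz",
    "org,example)/a 20120101000000 http://example.org/a text/html 200 - - - 498 0 sample.warc.gz",
]


def sort_cdx(lines):
    """Sort CDX lines by code point, which mirrors a byte-wise `LC_ALL=C sort`."""
    return sorted(lines)


sorted_lines = sort_cdx(records + [header])
```

With the command-line sort utility, the ordering depends on the locale, so setting LC_ALL=C is the safer way to get the same byte-wise behaviour.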

Format

The supported format options are:

M meta tags (AIF) *
N massaged url
S compressed record size
V compressed arc file offset *
a original url **
b date **
g file name
k new style checksum *
m mime type of original document *
r redirect *
s response code *

* in alexa-made dat file
** in alexa-made dat file meta-data line

More information about the CDX format syntax can be found here: http://www.archive.org/web/researcher/cdx_legend.php
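
Because every field is separated by a single space, a CDX line can be read back by pairing each value with its letter from the format string. The following is a rough sketch rather than part of cdx_writer.py: `parse_cdx_line` is a hypothetical helper and the sample line is invented for illustration.

```python
DEFAULT_FORMAT = "N b a m s k r M S V g"

# Field letters and descriptions copied from the list above.
FIELD_NAMES = {
    "M": "meta tags (AIF)",
    "N": "massaged url",
    "S": "compressed record size",
    "V": "compressed arc file offset",
    "a": "original url",
    "b": "date",
    "g": "file name",
    "k": "new style checksum",
    "m": "mime type of original document",
    "r": "redirect",
    "s": "response code",
}


def parse_cdx_line(line, fmt=DEFAULT_FORMAT):
    """Split one CDX data line into a dict keyed by field description."""
    letters = fmt.split()
    values = line.rstrip("\n").split(" ")
    if len(values) != len(letters):
        raise ValueError("expected %d fields, got %d" % (len(letters), len(values)))
    return {FIELD_NAMES[letter]: value for letter, value in zip(letters, values)}


# Invented sample line in the default field order.
sample = "org,example)/ 20120101000000 http://example.org/ text/html 200 ABCDEF - - 512 1024 sample.warc.gz"
record = parse_cdx_line(sample)
```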

Installation

Unfortunately, this script is not properly packaged and cannot be installed via pip. See the .travis.yml file for hints on how to get it running.

Differences between cdx_writer.py and archive-access cdx files

CDX files produced by archive-access and by cdx_writer.py differ in the following cases:

Differences in SURTs:

  • archive-access doesn't encode the %7F character in SURTs

Differences in MIME Type:

  • archive-access does not parse mime type for large warc payloads, and just returns 'unk'
  • If the HTTP Content-Type header is sent with a blank value, archive-access returns the value of the previous header as the mime type, while cdx_writer.py returns 'unk'. Example WARC record (archive-access returns "close" as the mime type): ...Content-Length: 0\r\nConnection: close\r\nContent-Type: \r\n\r\n\r\n\r\n
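
The blank Content-Type case can be illustrated with a small header scan that falls back to 'unk' instead of reusing a neighbouring header. This is a sketch under assumptions, not the code cdx_writer.py actually uses: the function name is hypothetical, and stripping parameters, lowercasing, and returning 'unk' for a missing header are choices made only for the example.

```python
def mime_type_from_headers(raw_headers):
    """Return the mime type from raw HTTP header bytes, or 'unk' if blank or absent."""
    for line in raw_headers.split(b"\r\n"):
        name, sep, value = line.partition(b":")
        if sep and name.strip().lower() == b"content-type":
            # Assumption: drop parameters such as "; charset=utf-8".
            mime = value.split(b";")[0].strip().lower()
            return mime.decode("latin-1") if mime else "unk"
    return "unk"


# The header block from the example above: Content-Type is present but blank.
blank = b"Content-Length: 0\r\nConnection: close\r\nContent-Type: \r\n\r\n"
normal = b"Content-Type: text/html; charset=utf-8\r\nConnection: close\r\n\r\n"
```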

Differences in Redirect urls:

  • archive-access does not escape whitespace; cdx_writer.py uses %20 escaping so that CDX files can be split on whitespace
  • archive-access removes unicode characters from redirect urls; cdx_writer.py keeps them
  • archive-access does not decode html entities in redirect urls
  • archive-access sometimes does not turn relative URLs into absolute urls
  • archive-access sometimes does not remove /../ from redirect urls
  • archive-access uses the value from the previous HTTP header for the redirect url if the location header is empty
  • cdx_writer.py only looks for http-equiv=refresh meta tag inside HEAD element
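
Several of the redirect behaviours listed above (decoding html entities, resolving relative urls and `..` segments, and %20 escaping) can be approximated with the Python standard library. The sketch below is illustrative only and is not taken from cdx_writer.py; `normalize_redirect` is a hypothetical name, and `urljoin` resolves `..` segments only for relative references, not for urls that already carry a host.

```python
import html
import re
from urllib.parse import urljoin


def normalize_redirect(base_url, location):
    """Turn a raw redirect target into an absolute, whitespace-free url."""
    # Decode html entities such as &amp; and trim surrounding whitespace.
    target = html.unescape(location).strip()
    # Resolve relative references (including ../ segments) against the page url.
    absolute = urljoin(base_url, target)
    # Escape remaining whitespace so the CDX line can still be split on spaces.
    return re.sub(r"\s", "%20", absolute)
```

For example, `normalize_redirect("http://example.com/a/b/index.html", "../c d.html")` gives `http://example.com/a/c%20d.html`.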

Differences in Meta Tags:

  • cdx_writer.py only looks for meta tags in the HEAD element
  • archive-access version doesn't parse multiple html meta tags, only the first one
  • archive-access misses FI meta tags sometimes
  • cdx_writer.py always returns tags in A, F, I order. archive-access does not use a consistent order
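
The HEAD-only lookup and the fixed A, F, I ordering can be sketched with `html.parser`. This is not the parser cdx_writer.py uses. The mapping of robots directives to letters (noarchive to A, nofollow to F, noindex to I) and the '-' placeholder for "no flags" are assumptions made for this example and should be checked against the CDX legend.

```python
from html.parser import HTMLParser

# Assumed mapping from robots directives to CDX meta-tag letters.
FLAG_FOR_DIRECTIVE = {"noarchive": "A", "nofollow": "F", "noindex": "I"}


class RobotsMetaParser(HTMLParser):
    """Collect robots meta flags, looking only inside the HEAD element."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.flags = set()

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attr = dict(attrs)
            if (attr.get("name") or "").lower() == "robots":
                for token in (attr.get("content") or "").lower().split(","):
                    letter = FLAG_FOR_DIRECTIVE.get(token.strip())
                    if letter:
                        self.flags.add(letter)

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def meta_flags(document):
    """Return the collected flags in A, F, I order, or '-' if there are none."""
    parser = RobotsMetaParser()
    parser.feed(document)
    return "".join(letter for letter in "AFI" if letter in parser.flags) or "-"
```

Collecting the flags in a set and emitting them by walking the string "AFI" is what keeps the output order stable regardless of the order of the tags in the page.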

Differences in HTTP Response Codes:

  • archive-access returns response code 0 if HTTP header line contains unicode: HTTP/1.1 302 D\xe9plac\xe9 Temporairement\r\n...
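
One way to stay robust against such status lines is to match the code on the raw bytes, so the reason phrase never needs to be decoded. A minimal sketch, not the code cdx_writer.py uses; the function name and the `None` fallback are assumptions for illustration:

```python
import re

# Match "HTTP/<major>.<minor> <3-digit code>" at the start of the raw status line.
STATUS_LINE = re.compile(rb"HTTP/\d\.\d\s+(\d{3})")


def response_code(status_line):
    """Return the 3-digit status code as a string, or None if the line is malformed."""
    match = STATUS_LINE.match(status_line)
    return match.group(1).decode("ascii") if match else None


# The status line from the example above, with a Latin-1 encoded reason phrase.
code = response_code(b"HTTP/1.1 302 D\xe9plac\xe9 Temporairement\r\n")
```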

Contributors

galgeek, jcushman, kngenie, nlevitt, rajbot
