This project forked from nla/httrack2warc

Converts HTTrack crawls to WARC files

License: Apache License 2.0

Java 99.64% Shell 0.36%

httrack2warc

Converts HTTrack crawls to WARC files.

Status: Working on many crawls but needs more testing on corner cases. We're not using it in production yet.

This tool works by reading the HTTrack cache directory (hts-cache) and any available log files to reconstruct an approximation of the original requests and responses. The reconstruction is imperfect because the necessary information is not always available, and some of what is available appears only in debug log messages that were never intended for machine consumption. Please see the list of known issues and limitations below.
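
Because the conversion depends on the hts-cache directory, a quick preflight check can catch a wrong path before running the converter. The following is a minimal sketch and not part of httrack2warc: the function name check_crawl_dir and its messages are hypothetical, and only the hts-cache name comes from the description above.

```shell
# Sketch: verify that a directory looks like an HTTrack crawl.
# check_crawl_dir is a hypothetical helper, not shipped with httrack2warc.
check_crawl_dir() {
  crawl_dir=$1
  # httrack2warc reads the cache directory, so fail early if it is missing.
  if [ ! -d "$crawl_dir/hts-cache" ]; then
    echo "missing hts-cache in $crawl_dir" >&2
    return 1
  fi
  echo "ok: $crawl_dir"
}
```

For example, `check_crawl_dir /tmp/crawl && echo "ready to convert"` would only print the second message when the cache directory is present.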

Usage

Download the latest release jar and run it under Java 8 or later.

Usage:
  java -jar httrack2warc-0.2.1-shaded.jar [OPTIONS...] -o outdir crawldir

Options:
  -h, --help                   Show this screen.
  -o, --outdir DIR             Directory to write output (default: current working directory).
  -s, --size BYTES             WARC size target (default: 1GB).
  -n, --name PATTERN           WARC name pattern (default: crawl-%d.warc.gz).
  -Z, --timezone ZONEID        Timezone of HTTrack logs (default: system local time).
  -I, --warcinfo 'KEY: VALUE'  Add extra lines to warcinfo record.
  -C, --compression none|gzip  Type of compression to use (default: gzip).
  --cdx FILENAME               Write a CDX index file for the generated WARCs.
  --strict                     Abort on issues normally considered a warning.
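
To illustrate how the options combine, the sketch below wraps one invocation in a shell function. The function name convert_crawl, the paths, the name pattern, the warcinfo line and the time zone are all illustrative assumptions. It also assumes that ZONEID accepts a region ID such as Australia/Sydney and that BYTES is a plain integer.

```shell
# Sketch: convert one crawl with several options set explicitly.
# convert_crawl is a hypothetical wrapper; every value below is an example.
convert_crawl() {
  crawl_dir=$1
  out_dir=$2
  # Create the output directory in case the tool does not (assumption).
  mkdir -p "$out_dir" || return 1
  # 104857600 bytes = 100 MiB target size per WARC file.
  java -jar httrack2warc-0.2.1-shaded.jar \
    --outdir "$out_dir" \
    --size 104857600 \
    --name 'example-%d.warc.gz' \
    --timezone Australia/Sydney \
    --warcinfo 'operator: Example Operator' \
    --cdx "$out_dir/example.cdx" \
    "$crawl_dir"
}
```

A call such as `convert_crawl /tmp/crawl /tmp/warcs` would then run the converter with those settings. Add `--strict` to the argument list if warnings should abort the run.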

Example

Conduct a crawl into a temporary directory (/tmp/crawl) using HTTrack:

$ httrack -O /tmp/crawl http://www.example.org/
Mirror launched on Mon, 08 Jan 2018 13:50:40 by HTTrack Website Copier/3.49-2 [XR&CO'2014]
mirroring http://www.example.org/ with the wizard help..
Done.www.example.org/ (1270 bytes) - OK
Thanks for using HTTrack!

Run httrack2warc over the output to produce a WARC file. By default the output file will be named crawl-0.warc.gz.

$ java -jar httrack2warc-0.2.1-shaded.jar /tmp/crawl
Httrack2Warc - www.example.org/index.html -> http://www.example.org/

Replay the generated WARC files using a replay tool like pywb:

$ pip install --user pywb
$ PATH="$PATH:$HOME/.local/bin"
$ wb-manager init test
$ wb-manager add test crawl-*.warc.gz
[INFO]: Copied crawl-0.warc.gz to collections/test/archive
$ wayback
[INFO]: Starting pywb Wayback Web Archive Replay on port 8080
# Open in browser: http://localhost:8080/test/*/example.org/

Known issues and limitations

HTTP headers

By default, HTTrack does not record HTTP headers. If the --debug-headers option is specified, however, HTTrack produces the file hts-ioinfo.txt, which contains a log of the request and response headers.

When headers are available httrack2warc produces WARC records of type request and response. When headers are unavailable only WARC resource records are produced.
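
As a rough illustration, the sketch below crawls with header logging enabled and then reports which record types to expect. Both function names are hypothetical, and the check assumes that hts-ioinfo.txt is written to the top level of the HTTrack output directory. It is only a crawl-level heuristic, since header availability may differ per URL.

```shell
# Sketch: crawl with header logging, then predict the WARC record types.
# Both helpers are hypothetical and not part of HTTrack or httrack2warc.

# Run HTTrack with --debug-headers so that hts-ioinfo.txt is produced.
crawl_with_headers() {
  httrack --debug-headers -O "$1" "$2"
}

# Assumes hts-ioinfo.txt sits directly inside the crawl directory.
expected_record_types() {
  crawl_dir=$1
  if [ -s "$crawl_dir/hts-ioinfo.txt" ]; then
    echo "request+response"
  else
    echo "resource"
  fi
}
```

For example, running `crawl_with_headers /tmp/crawl http://www.example.org/` followed by `expected_record_types /tmp/crawl` shows whether the header log was produced.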

The Transfer-Encoding header is always stripped as the encoded bytes of the message are not recorded by HTTrack.

Redirects and error codes

Currently, without hts-ioinfo.txt and an entry in the cache zip (available with newer versions of HTTrack), responses with a non-200 status code are converted to resource records and the status code is lost. See issue #3.

IP addresses and DNS records

HTTrack does not record DNS records or the IP addresses of hostnames, so httrack2warc cannot produce WARC-IP-Address fields or DNS records.

HTTrack version compatibility

Some testing has been done against crawls generated by the following versions: 3.01, 3.21-4, 3.49-2. Not all combinations of options have been tested.

Link rewriting

For cases when the original HTML is unavailable, there is an experimental --rewrite-links option which modifies the HTML, changing links from filenames to absolute URLs. This feature is somewhat primitive and does not currently attempt to rewrite URLs inside CSS or JavaScript.
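
To make the idea concrete, the sketch below replaces one local filename link with an absolute URL using a literal string match. It is not how httrack2warc implements --rewrite-links. The function name rewrite_link is hypothetical, and the sketch only handles double-quoted href attributes with values free of backslashes.

```shell
# Sketch: rewrite href="FILENAME" to href="URL" in HTML read from stdin.
# rewrite_link is a hypothetical helper, not the tool's own implementation.
rewrite_link() {
  # $1 = local filename as it appears in the mirrored HTML
  # $2 = absolute URL that should replace it
  awk -v from="href=\"$1\"" -v to="href=\"$2\"" '{
    out = ""
    # index() matches literally, so "." in filenames is not a wildcard.
    while ((i = index($0, from)) > 0) {
      out = out substr($0, 1, i - 1) to
      $0 = substr($0, i + length(from))
    }
    print out $0
  }'
}
```

For example, `rewrite_link about.html http://www.example.org/about < page.html` writes the rewritten HTML to standard output, where page.html is a placeholder filename.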

Compilation

Install Java JDK 8 (or later) and Maven. On Fedora Linux:

dnf install java-1.8.0-openjdk-devel maven

Then compile using Maven from the top level of this repository:

cd httrack2warc
mvn package

This will produce an executable jar file which you can run like so:

java -jar target/httrack2warc-*-shaded.jar --help

License

Copyright (C) 2017 National Library of Australia

Licensed under the Apache License, Version 2.0.
