webrecorder / pywb

Core Python Web Archiving Toolkit for replay and recording of web archives

Home Page: https://pypi.python.org/pypi/pywb

License: GNU General Public License v3.0

Python 38.34% JavaScript 57.93% CSS 0.12% HTML 1.50% Shell 0.09% Arc 0.07% Dockerfile 0.02% Vue 1.92%

python wayback pywb web-archiving web-archives

pywb's Introduction

Webrecorder pywb 2.8

Web Archiving Tools for All

View the full pywb documentation

pywb is a Python 3 web archiving toolkit for replaying web archives large and small as accurately as possible. The toolkit now also includes new features for creating high-fidelity web archives.

This toolset forms the foundation of the Webrecorder project, but also provides a generic web archiving toolkit that is used by other web archives, including the traditional "Wayback Machine" functionality.

New Features

The 2.x release included a major overhaul of pywb and introduced many new features, including the following:

Dynamic multi-collection configuration system with no-restart updates.
New recording capability to create new web archives from the live web or other archives.
Componentized architecture with standalone Warcserver, Recorder and Rewriter components.
Support for Memento API aggregation and fallback chains for querying multiple remote and local archival sources.
HTTP/S Proxy Mode with customizable certificate authority for proxy mode recording and replay.
Flexible rewriting system with pluggable rewriters for different content-types.
Standalone, modular client-side rewriting system (wombat.js) to handle most modern web sites.
Improved 'calendar' query UI with incremental loading, grouping results by year and month, and updated replay banner.
Extensible UI customizations system for modifying all aspects of the UI.
Robust access control system for blocking or excluding URLs, by prefix or by exact match.
New in 2.6: Access control embargo and HTTP-header-based access control settings.
New in 2.6: Support for localization and multi-language deployment.
New in 2.7: New banner/calendar UI written in Vue, with interactive timeline and easier theming of colors and logo via config.yaml.

Please see the full documentation for more detailed info on all these features.

Installation for Deployment

To install pywb for usage, you can use:

pip install pywb

Note: depending on your Python installation, you may have to use pip3 instead of pip.

Installation from local copy

git clone https://github.com/webrecorder/pywb

To install from a locally cloned copy, install with pip install -e . or python setup.py install.

To run tests, we recommend installing the test tools with pip install tox tox-current-env and then running tox --current-env to test in your current Python environment.

To build the docs locally, run: cd docs; make html. (The docs will be built in ./_build/html/index.html)

Running

After installation, you can run pywb or wayback.

Consult the local or online docs for latest usage and configuration details.

Documentation

The pywb documentation is extensive.

Contributions & Bug Reports

Users are encouraged to fork and contribute to this project to keep improving web archiving tools. Please consult the contributing guide for information on how to contribute to pywb.

pywb's Issues

TypeError: unsupported operand type(s) for +=: 'NoneType' and 'str'

The cdx-indexer fails with a message like this:

Traceback (most recent call last):
  File "/usr/local/bin/cdx-indexer", line 9, in <module>
    load_entry_point('pywb==0.6.3', 'console_scripts', 'cdx-indexer')()
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/warc/cdxindexer.py", line 252, in main
    cdx09=cmd.cdx09)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/warc/cdxindexer.py", line 139, in write_multi_cdx_index
    for entry in entry_iter:
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/warc/archiveiterator.py", line 378, in create_index_iter
    for entry in entry_iter:
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/warc/archiveiterator.py", line 206, in create_record_iter
    for record in arcv_iter.iter_records():
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/warc/archiveiterator.py", line 50, in iter_records
    record = self._next_record(next_line)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/warc/archiveiterator.py", line 138, in _next_record
    self.known_format)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/warc/recordloader.py", line 138, in parse_record_stream
    status_headers = self.http_parser.parse(stream)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/utils/statusandheaders.py", line 172, in parse
    value += next_line
TypeError: unsupported operand type(s) for +=: 'NoneType' and 'str'

Input warc file here:
https://www.dropbox.com/s/jy9n5s5479850yd/0704wb-31.warc.gz?dl=0

Documentation of variables available in templates

Need a list of variables available in the jinja2 templates.

Route based on PATH_INFO?

Speaking of REQUEST_URI, would it make sense to do the routing part with PATH_INFO instead of the full request_uri? For example, my main wsgi file routes /warc urls to pywb:

from werkzeug.wsgi import DispatcherMiddleware
application = DispatcherMiddleware(
    get_wsgi_application(), # Django
    {
        '/warc': warc_application # pywb
    }
)

Then a request to /warc/foo/bar?a=b comes in with env = {'SCRIPT_NAME': '/warc', 'PATH_INFO': '/foo/bar', 'QUERY_STRING': 'a=b'}

If my pywb routes then match against PATH_INFO, I can change the location of the whole application without needing to edit the routes.
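The idea can be sketched as follows. This is a minimal illustration, not pywb's actual router: ROUTES and match_route are hypothetical names, and the single route pattern is an assumption made for the example. Matching is done against PATH_INFO only, and the externally visible prefix is rebuilt from SCRIPT_NAME.

```python
import re

# Hypothetical route table: patterns are written relative to the mount point.
ROUTES = [
    (re.compile(r'^/(?P<coll>[^/]+)/(?P<rest>.*)$'), 'replay'),
]


def match_route(environ):
    """Match against PATH_INFO only; SCRIPT_NAME carries the mount prefix."""
    path = environ.get('PATH_INFO', '')
    for pattern, name in ROUTES:
        m = pattern.match(path)
        if m:
            # Rebuild the externally visible prefix from SCRIPT_NAME, so that
            # generated links stay correct wherever the app is mounted.
            prefix = environ.get('SCRIPT_NAME', '') + '/' + m.group('coll') + '/'
            return name, prefix, m.group('rest')
    return None


env = {'SCRIPT_NAME': '/warc', 'PATH_INFO': '/foo/bar', 'QUERY_STRING': 'a=b'}
result = match_route(env)
```

With this layout, moving the application from /warc to another mount point changes only SCRIPT_NAME; the route patterns stay untouched.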

Ensure IDN urls can be replayed/proxied

Ensure that international domain name sites (IDN) can be live proxied and replayed without issues.

Make new config system also compatible with/understand bagit directory structure.

Reference: http://en.wikipedia.org/wiki/BagIt
This is related to new config system, #55

As part of config system improvements, it may be very useful to support bagit directory structures.
Figure out what is needed to support this layout. It may just be recursive directory structures, or possibly recognizing additional metadata.

Add index / better error page.

Currently /index.html results in an error.

Add a placeholder index page that lists all routes
Add a better error page for empty or invalid request.

Support YAML Configs

Basic cdx/warc source config should be definable via yaml.

Support Memento Protocol

Implement Memento Support for replay, as well as timegate and timemap, as an optional config setting.

Simplified Configuration System for multiple collections

Add an optional 'convention over configuration' system for setting up collections of warcs and cdx files.

Currently, setting up multiple collections can be a bit tedious. For example, configuring two collections, each with a custom set of cdx files, warcs, search page and banner, might look like this:

coll1:
        index_paths: ./coll1/cdx/
        archive_paths: ./coll1/warcs/
        banner_html: ./coll1/templates/banner.html
        search_html: ./coll1/templates/search.html

coll2:
        index_paths: ./coll2/cdx/
        archive_paths: ./coll2/warcs/
        banner_html: ./coll2/templates/banner.html
        search_html: ./coll2/templates/search.html

With the new convention system, the collections can instead be configured implicitly within a collections directory, e.g. collections/coll1/ and collections/coll2/, each having an indexes subdir, an archive subdir and an optional templates subdir.

Edit: changed cdx -> indexes, warcs -> archive in new config system
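A rough sketch of how such implicit discovery might work is shown below. It assumes the indexes / archive / templates layout described above; discover_collections is a hypothetical helper, not part of pywb, and the generated keys simply mirror the explicit config example.

```python
import tempfile
from pathlib import Path


def discover_collections(root):
    """Derive per-collection settings from the directory layout (hypothetical)."""
    collections = {}
    for coll_dir in sorted(Path(root).iterdir()):
        if not coll_dir.is_dir():
            continue
        entry = {
            'index_paths': str(coll_dir / 'indexes'),
            'archive_paths': str(coll_dir / 'archive'),
        }
        # Templates are optional: only add the ones that actually exist.
        for name in ('banner', 'search'):
            template = coll_dir / 'templates' / (name + '.html')
            if template.is_file():
                entry[name + '_html'] = str(template)
        collections[coll_dir.name] = entry
    return collections


# Build a throwaway layout to show the result.
with tempfile.TemporaryDirectory() as root:
    for sub in ('coll1/indexes', 'coll1/archive', 'coll1/templates',
                'coll2/indexes', 'coll2/archive'):
        (Path(root) / sub).mkdir(parents=True)
    (Path(root) / 'coll1' / 'templates' / 'banner.html').write_text('<div>banner</div>')
    found = discover_collections(root)
```

Here coll1 picks up its custom banner, while coll2 gets only the index and archive paths and would fall back to the default templates.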

ampersand causes extra semicolon

There's an extra semicolon in the playback for this URL:

https://webrecorder.io/replay/20140317180312/http://www.boston.com/business/news/2013/10/08/jones-lang-lasalle-survey-boylston-seventh-most-expensive-street-for-office-rents/CeaD6LLvrNKyhPegK7RreM/story.html

The menu item "A&E" becomes "A&E;".

I suspect that some stage of the rewrite process is normalizing the HTML and treating "&E" as an HTML entity.

Let me know if the original .warc would be helpful as well. Thanks!
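One way this symptom can arise is sketched below, assuming a rewriter built on an event-based HTML parser that re-serializes every entity reference with a trailing semicolon. NaiveRewriter is a hypothetical stand-in, not pywb's rewriter; it uses Python's standard html.parser, which reports the unterminated "&E" as an entity reference named "E".

```python
from html.parser import HTMLParser


class NaiveRewriter(HTMLParser):
    """Hypothetical pass-through rewriter that re-serializes parser events."""

    def __init__(self):
        # convert_charrefs=False makes the parser report entity references
        # through handle_entityref() instead of decoding them itself.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append('</%s>' % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        # Naive: always terminates the reference, even if the source did not.
        self.out.append('&%s;' % name)


def rewrite(html):
    parser = NaiveRewriter()
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)


broken = rewrite('<a href="/ae">A&E</a>')
```

A rewriter that wants to be byte-faithful would need to check whether the original text actually contained the semicolon before emitting one.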

Implement binsearch over text file

Generic binsearch over seekable stream, which supports size(), seek(), readline()
Necessary for further cdx server functionality.
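A minimal sketch of such a search, assuming a binary-mode stream of sorted, newline-terminated lines. The size is obtained by seeking to the end rather than through a size() method, and both function names are hypothetical.

```python
import io


def binsearch_offset(stream, key):
    """Return the byte offset of the first line >= key in a sorted stream.

    `stream` must be in binary mode and support seek(), tell() and
    readline(); `key` is bytes.
    """
    stream.seek(0, io.SEEK_END)
    lo, hi = 0, stream.tell()
    while lo < hi:
        mid = (lo + hi) // 2
        stream.seek(mid)
        if mid > 0:
            stream.readline()  # discard the line we landed in
        line = stream.readline()
        if line and line.rstrip(b'\r\n') < key:
            lo = mid + 1
        else:
            hi = mid
    # Re-align to the start of the line that the search settled on.
    stream.seek(lo)
    if lo > 0:
        stream.readline()
    return stream.tell()


def iter_prefix(stream, prefix):
    """Yield every line that starts with `prefix`, in file order."""
    stream.seek(binsearch_offset(stream, prefix))
    for line in iter(stream.readline, b''):
        if not line.startswith(prefix):
            break
        yield line.rstrip(b'\n')


cdx = io.BytesIO(b'com,example)/ 2014\ncom,example)/a 2013\norg,archive)/ 2012\n')
hits = list(iter_prefix(cdx, b'com,example)/'))
```

Because only seek() and readline() are used, the same code works on a regular file opened with open(path, 'rb').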

Incorrect URLS and redirect loop

looking at list of captures:
mydomain.com/pywb/*/<url>

on localhost the links are of the format:
mydomain.com/pywb/<id>/<url>

on my deployed site the links are:
mydomain.com/pywb/<current page url>/<id>/<url>/

If I manually enter in the correct url for a capture, I get a redirect loop. not a problem on localhost.

my environment is setup the same as described here: #39
this is probably an issue with how I have things set up

nytimes.com redirect issues.

@dwhly dug this up with testing on our "via.hypothes.is" installation of PyWB.

If you make a request for an nytimes.com page...such as:
https://via.hypothes.is/h/http://www.nytimes.com/2015/01/12/business/media/pop-music-critic-leaves-the-new-yorker-to-annotate-lyrics-for-a-start-up.html?gwh=D7261152B8A43951CDB507B033AE73A8&gwt=pay&assetType=nyt_now&_r=0
(with or without query parameters)
...the nytimes.com site will send the browser on an endless redirect journey to nowhere. 😦

It's likely for some user "fingerprinting" of some kind.

Would love your thoughts on the matter @ikreymer. Thanks!

Create custom not_found.html template separate from generic error.html

Not found errors are different than other server-side errors. Provide a separate not_found.html template, overridable per collection. Based on discussion in #58

encoding issue - failing to playback warc

see stack trace below. we took a warc from our collection, indexed and visited the url in a locally running pywayback. this warc was made by wget (we can send the file via email but it is too big to upload here). other warcs we have tried that we created using webrecorder.io work perfectly. we're on v0.4.5

Pywb Error

'utf8' codec can't decode byte 0xf1 in position 12562: invalid continuation byte
Error Details:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/pywb/framework/wsgi_wrappers.py", line 62, in __call__
    response = wb_router(env)
  File "/usr/local/lib/python2.7/site-packages/pywb/framework/proxy.py", line 28, in __call__
    response = super(ProxyArchivalRouter, self).__call__(env)
  File "/usr/local/lib/python2.7/site-packages/pywb/framework/archivalrouter.py", line 33, in __call__
    result = route(env, self.abs_path)
  File "/usr/local/lib/python2.7/site-packages/pywb/framework/archivalrouter.py", line 78, in __call__
    return self.handler(wbrequest) if wbrequest else None
  File "/usr/local/lib/python2.7/site-packages/pywb/webapp/handlers.py", line 41, in __call__
    cdx_callback)
  File "/usr/local/lib/python2.7/site-packages/pywb/webapp/replay_views.py", line 81, in __call__
    return self.render_content(wbrequest, *args)
  File "/usr/local/lib/python2.7/site-packages/pywb/webapp/replay_views.py", line 159, in render_content
    failed_files)
  File "/usr/local/lib/python2.7/site-packages/pywb/webapp/replay_views.py", line 224, in replay_capture
    response_iter)
  File "/usr/local/lib/python2.7/site-packages/pywb/webapp/replay_views.py", line 242, in buffered_response
    for buff in iterator:
  File "/usr/local/lib/python2.7/site-packages/pywb/rewrite/rewrite_content.py", line 224, in stream_to_gen
    buff = rewrite_func(buff)
  File "/usr/local/lib/python2.7/site-packages/pywb/rewrite/rewrite_content.py", line 151, in do_rewrite
    buff = self._decode_buff(buff, stream, encoding)
  File "/usr/local/lib/python2.7/site-packages/pywb/rewrite/rewrite_content.py", line 179, in _decode_buff
    buff = buff.decode(encoding)
  File "/usr/local/Cellar/python/2.7.6/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf1 in position 12562: invalid continuation byte

SCRIPT_NAME environment variable undefined?

index page loads fine.
tried to hit mydomain.com/pywb/*/example.com

Pywb Error

'SCRIPT_NAME'
Error Details:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/pywb/framework/wsgi_wrappers.py", line 62, in __call__
    response = wb_router(env)
  File "/usr/local/lib/python2.7/dist-packages/pywb/framework/proxy.py", line 28, in __call__
    response = super(ProxyArchivalRouter, self).__call__(env)
  File "/usr/local/lib/python2.7/dist-packages/pywb/framework/archivalrouter.py", line 33, in __call__
    result = route(env, self.abs_path)
  File "/usr/local/lib/python2.7/dist-packages/pywb/framework/archivalrouter.py", line 77, in __call__
    wbrequest = self.parse_request(env, use_abs_prefix)
  File "/usr/local/lib/python2.7/dist-packages/pywb/framework/archivalrouter.py", line 90, in parse_request
    rel_prefix = env['SCRIPT_NAME'] + '/' + matched_str + '/'
KeyError: 'SCRIPT_NAME'

I'm running pywb 0.4.7 (installed w/ pip) via uWSGI behind Nginx.

Nginx server block

upstream pywb {
    server 127.0.0.1:8001;
}

server {
    listen 80 default_server;
    listen [::]:80 default_server ipv6only=on;

    location / {
        uwsgi_pass pywb;
        include /etc/nginx/uwsgi_params;
       }
}

contents of uwsgi_params:
https://github.com/phusion/nginx/blob/master/conf/uwsgi_params

command to run uWSGI:
$ /usr/local/bin/uwsgi --ini /etc/pywb/wsgi.ini

contents of /etc/pywb/wsgi.ini

[uwsgi]
socket = :8001
master = true
processes = 10
buffer-size = 65536
die-on-term = true

# specify config file here
env = PYWB_CONFIG_FILE=/etc/pywb/config.yaml
chdir = /usr/local/lib/python2.7/dist-packages/pywb/
wsgi = pywb.apps.wayback

contents of /etc/pywb/config.yaml

# pywb config file
# ========================================
#
# Settings for each collection

collections:
    # <name>: <cdx_path>
    # collection will be accessed via /<name>
    # <cdx_path> is a string or list of:
    #  - string or list of one or more local .cdx file
    #  - string or list of one or more local dirs with .cdx files
    #  - a string value indicating remote http cdx server
    pywb: /my_archive/cdx/

    # ex with filtering: filter CDX lines by filename starting with 'dupe'
    #pywb-filt: {'index_paths': './sample_archive/cdx/', 'filters': ['filename:dupe*']}

# indicate if cdx files are sorted by SURT keys -- eg: com,example)/
# SURT keys are recommended for future indices, but non-SURT cdxs
# are also supported
#
#   * Set to true if cdxs start with surts: com,example)/
#   * Set to false if cdx start with urls: example.com)/
#
# default:
# surt_ordered: true

# list of paths prefixes for pywb look to 'resolve'  WARC and ARC filenames
# in the cdx to their absolute path
#
# if path is:
#   * local dir, use path as prefix
#   * local file, lookup prefix in tab-delimited sorted index
#   * http:// path, use path as remote prefix
#   * redis:// path, use redis to lookup full path for w:<warc> as key

archive_paths: /my_archive/warcs/

# The following are default settings -- uncomment to change
# Set to '' to disable the ui

# ==== UI: HTML/Jinja2 Templates ====

# template for <head> insert into replayed html content
#head_insert_html: ui/head_insert.html

# template for 'calendar' query,
# eg, a listing of captures  in response to a ../*/<url>
#
# may be a simple listing or a more complex 'calendar' UI
# if omitted, will list raw cdx in plain text
#query_html: ui/query.html

# template for search page, which is displayed when no search url is entered
# in a collection
#search_html: ui/search.html

# template for home page.
# if no other route is set, this will be rendered at /, /index.htm and /index.html
#home_html: ui/index.html


# error page template for formatting error message and details
# if omitted, a text response is returned
#error_html: ui/error.html

# ==== Other Paths ====

# list of host names that pywb will be running from to detect
# 'fallthrough' requests based on referrer
#
# eg: an incorrect request for http://localhost:8080/image.gif with a referrer
# of http://localhost:8080/pywb/index.html, pywb can correctly redirect
# to http://localhost:8080/pywb/image.gif
#

#hostpaths: ['http://localhost:8080']

# Rewrite urls with absolute paths instead of relative
#absoulte_paths: true

# List of route names:
# <route>: <package or file path>
# default route static/default for pywb defaults
static_routes:
          static/default: pywb/static/

# ==== New / Experimental Settings ====
# Not yet production ready -- used primarily for testing

# Enable simple http proxy mode
enable_http_proxy: true

# enable cdx server api for querying cdx directly (experimental)
enable_cdx_api: true

# custom rules for domain specific matching
# set to false to disable
#domain_specific_rules: rules.yaml

# Memento support, enable
enable_memento: true

# Replay content in an iframe
framed_replay: true

Better handling of content when Content-Type is wrong or absent.

Currently, pywb relies on Content-Type to determine the type of rewriting, if any needs to be performed. However, content-type may be wrong or missing.

Some possible ideas:

For text types:

If chardet returns 0.0 confidence and no mime type, assume binary and skip rewriting.
maybe: use https://pypi.python.org/pypi/binaryornot/0.2.0 to verify
either keep wrong content-type and hope client ignores it, or use https://github.com/ahupp/python-magic to attempt to determine correct type.

For binary or non-rewritable content-type:

no checking, or use https://pypi.python.org/pypi/binaryornot/0.2.0 to ensure binary/text

For no content-type:

either serve without content-type
use https://github.com/ahupp/python-magic to determine correct type.
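The ideas above could be combined roughly as follows. This is a standard-library-only sketch: looks_binary is a crude heuristic standing in for chardet or binaryornot, and should_rewrite, REWRITABLE and the threshold value are assumptions made for illustration, not pywb behavior.

```python
# Hypothetical set of content types that a rewriter would handle.
REWRITABLE = {'text/html', 'text/css', 'text/javascript', 'application/javascript'}


def looks_binary(sample, threshold=0.30):
    """Guess whether the first bytes of a payload are binary (heuristic)."""
    if not sample:
        return False
    if b'\x00' in sample:
        return True  # NUL bytes almost never occur in text
    try:
        sample.decode('utf-8')
        return False
    except UnicodeDecodeError:
        pass
    # Not valid UTF-8: fall back to the share of non-whitespace control bytes.
    control = sum(1 for b in sample if b < 32 and b not in (9, 10, 13))
    return control / len(sample) > threshold


def should_rewrite(content_type, first_bytes):
    """Decide whether to attempt text rewriting for a response."""
    mime = (content_type or '').split(';')[0].strip().lower()
    if mime and mime not in REWRITABLE:
        return False  # declared as a type that is never rewritten
    # Declared text type or no type at all: verify against the content.
    return not looks_binary(first_bytes)


png_header = b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR'
decision = should_rewrite('text/html', png_header)
```

In this sketch a PNG mislabeled as text/html is passed through untouched, while a response with no Content-Type is rewritten only if its first bytes look like text.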

lxml rewriting error

Something weird happens when using lxml in the first doc in this WARC: https://www.dropbox.com/s/tmb7cusy7vg3u3o/audobon.warc

The URL is http://web4.audubon.org/bird/stateofthebirds/cbid/

When it gets played back and lxml is on, the HTML ends partway through, like this:

<!-- Begin Breadcrumb Nav -->
<tr>
    <td class="breadcrumbnav">
<a href="/warc/Y5UN-LGT2/http://www.audubon.org/bird/stateofthebirds/">State of the Birds</a> &gt;
<a href="/warc/Y5UN-LGT2/http://web4.audubon.org/bird/stateofthebirds/cbid/index.php">Common Birds in Decline</a>
</td></tr></table></td></tr></table></td></tr></table></td></tr></table></body></html>

Seems to work fine without lxml. Any thoughts?

Too many open files ?

I get this strange error:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/framework/wsgi_wrappers.py", line 98, in handle_methods
    response = wb_router(env)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/framework/proxy.py", line 37, in __call__
    response = super(ProxyArchivalRouter, self).__call__(env)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/framework/archivalrouter.py", line 36, in __call__
    return route.handler(wbrequest)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/webapp/handlers.py", line 75, in __call__
    return self.handle_request(wbrequest)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/webapp/handlers.py", line 133, in handle_request
    cdx_lines, output = self.index_reader.load_for_request(wbrequest)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/webapp/query_handler.py", line 71, in load_for_request
    cdx_iter = self.load_cdx(wbrequest, params)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/webapp/query_handler.py", line 92, in load_cdx
    cdx_iter = self.cdx_server.load_cdx(**params)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/cdx/cdxserver.py", line 77, in load_cdx
    return self._check_cdx_iter(cdx_iter, query)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/cdx/cdxserver.py", line 45, in _check_cdx_iter
    cdx_iter = self.peek_iter(cdx_iter)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/cdx/cdxserver.py", line 85, in peek_iter
    first = next(iterable)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/cdx/cdxops.py", line 110, in <genexpr>
    return (cdx for cdx, _ in itertools.izip(cdx_iter, xrange(limit)))
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/cdx/cdxops.py", line 201, in cdx_filter
    for cdx in cdx_iter:
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/cdx/cdxops.py", line 100, in <genexpr>
    return (cls(line) for line in text_iter)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/cdx/cdxops.py", line 83, in create_merged_cdx_gen
    source_iters = map(lambda src: src.load_cdx(query), sources)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/cdx/cdxops.py", line 83, in <lambda>
    source_iters = map(lambda src: src.load_cdx(query), sources)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/cdx/cdxsource.py", line 31, in load_cdx
    source = open(self.filename)
IOError: [Errno 24] Too many open files: '/opt/clueweb12B13/cdx/1508wb-50.cdx'

130.125.11.134 - - [04/Nov/2014 16:34:01] "GET /clueweb/*/www.repubblica.it HTTP/1.1" 400 2830

Add support for lxml parser

Optionally support lxml target parser api, if lxml is available.
http://lxml.de/parsing.html#the-target-parser-interface

Provide option to toggle lxml via 'use_lxml_parser'

Support PathIndexPrefixResolver

Resolve warc paths from a simple, tab-delimited text file, eg:

file.warc.<tab>file://full/path/to/file.warc
...
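A minimal sketch of such a resolver, assuming one "name<tab>full path" pair per line. load_path_index and resolve are hypothetical names, and the whole index is loaded into a dict here, which would not suit a very large index.

```python
import io


def load_path_index(lines):
    """Parse 'filename<TAB>full path' lines into a lookup table."""
    index = {}
    for line in lines:
        line = line.rstrip('\n')
        if '\t' not in line:
            continue  # skip blank or malformed lines
        name, full_path = line.split('\t', 1)
        index[name] = full_path
    return index


def resolve(index, filename):
    """Return the full path for a warc filename, or None if unknown."""
    return index.get(filename)


sample = io.StringIO(
    'a.warc.gz\tfile://full/path/to/a.warc.gz\n'
    'b.warc.gz\tfile://other/path/to/b.warc.gz\n'
)
paths = load_path_index(sample)
```

For large indexes, keeping the file sorted and using a binary search over it would avoid loading everything into memory.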

Support Http Proxy Mode

Should support replay in http proxy mode, with rewriting.
Need to:

Rewrite https -> http urls
Filter encoding related headers

Support Url Wildcard Query

eg. /pywb/example.com* to return results from multiple urls under example.com, in some fashion.

Some very dynamic pages do not work with the proxy

The following page looks like a normal static page in the browser: http://readwrite.com/.... However it is using angular. For example using pywb-h/h/http://readwrite.com/... shows an empty page.

I assume this is a proxy problem. Or should this be reported on pywb-hypothesis?

Slow response proxying archive.org

The really fun news is that this works:
https://via.hypothes.is/h/https://web.archive.org/web/20141111131547/http://www.theatlantic.com/magazine/archive/1945/07/as-we-may-think/303881/

But I waited probably three minutes for it to succeed. I actually thought it was hung. Notice that annotations that were made at the original page on theatlantic.com over a year ago work! (Because theatlantic uses rel=canonical, and we are paying attention).

Any idea why it might take so long? Did it for you?

Better reporting of non-chunked gzip warcs/arcs

cdx-indexer will return an obscure message when trying to parse a non-chunked gzip. Detect this better and return a useful error message, something like:
This WARC/ARC is not properly compressed, to use it, please decompress the file first
Raised by #44
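One possible detection approach is sketched below: decompress only the first gzip member and see whether it holds more than one record. check_gzip_chunking is a hypothetical helper; it only looks for WARC/1.0 headers (ARC files are not handled), and counting header lines is a crude heuristic because a payload could contain the same bytes.

```python
import gzip
import zlib

ERROR_MSG = ('This WARC/ARC is not properly compressed, '
             'to use it, please decompress the file first')


def check_gzip_chunking(data):
    """Return 'ok' if the first gzip member holds a single record."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header; decompression stops
    # at the end of the first member and leaves the rest in unused_data.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    first_member = decomp.decompress(data)
    if first_member.count(b'WARC/1.0\r\n') > 1:
        return ERROR_MSG
    return 'ok'


rec1 = b'WARC/1.0\r\nWARC-Type: response\r\nContent-Length: 1\r\n\r\na\r\n\r\n'
rec2 = b'WARC/1.0\r\nWARC-Type: response\r\nContent-Length: 1\r\n\r\nb\r\n\r\n'

per_record = gzip.compress(rec1) + gzip.compress(rec2)  # one member per record
whole_file = gzip.compress(rec1 + rec2)                 # one member for everything

status_chunked = check_gzip_chunking(per_record)
status_whole = check_gzip_chunking(whole_file)
```

A real indexer would read the file incrementally instead of holding it all in memory, but the member boundary check is the same.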

Missing file in 0.3.0 release

Looks like the 0.3.0 .tar.gz here ( https://pypi.python.org/pypi/pywb/0.3.0 ) is missing the pywb/rules.yaml file, which is referred to in pywb/utils/dsrules.py .

Implement pywb cdx server!

Basic cdx server implementation which will support cdx loading for pywb, and for use as a standalone module

Redirect from bare domain to www subdomain throws Self Redirect error.

Suppose you archive http://metafilter.com/ using phantomjs and warcprox. That URL redirects to http://www.metafilter.com/. So you end up with an entry in your .warc for metafilter.com, which is a 301 redirect, and www.metafilter.com, which is the actual contents of the page.

Now suppose you try to replay http://metafilter.com/. replay.py will throw a self-redirect error at this point:

        # Check for self redirect
        if wbresponse.status_headers.statusline.startswith('3'):
            if self.isSelfRedirect(wbrequest, wbresponse.status_headers):
                raise wbexceptions.CaptureException('Self Redirect: ' + str(cdx))

The reason is that surt normalizes http://metafilter.com/ and http://www.metafilter.com/ to be the same thing, so they both come back as the 301 redirect:

>>> surt.surt('http://metafilter.com')
'com,metafilter)/'
>>> surt.surt('http://www.metafilter.com')
'com,metafilter)/'

I have no idea if this is a surt bug, a warcprox bug, or even a pywb bug, but figured you might know ...
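The collision, and one possible way around it, can be sketched like this. simple_key is a very rough stand-in for surt canonicalization (it is not the surt library), and pick_capture with its 'status' and 'location' fields is a hypothetical helper, not pywb's actual fix.

```python
from urllib.parse import urlsplit


def simple_key(url):
    """Rough stand-in for surt: drop the scheme and 'www.', reverse the host."""
    parts = urlsplit(url)
    host = parts.hostname or ''
    if host.startswith('www.'):
        host = host[4:]
    return ','.join(reversed(host.split('.'))) + ')' + (parts.path or '/')


def pick_capture(request_url, captures):
    """Skip captures that merely redirect back to the URL being requested."""
    for capture in captures:
        is_redirect = capture['status'].startswith('3')
        if is_redirect and capture.get('location') == request_url:
            continue  # would be a self redirect; try the next capture
        return capture
    return None


# Both URLs share one index key, so both lookups return the same captures.
captures = [
    {'status': '301', 'location': 'http://www.metafilter.com/'},
    {'status': '200'},
]
bare = pick_capture('http://metafilter.com/', captures)
www = pick_capture('http://www.metafilter.com/', captures)
```

Under these assumptions the bare domain still replays the recorded 301, while the www URL falls through to the 200 capture instead of raising a self-redirect error.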

Support Vine capture and replay

Several use cases have cropped up for archiving Vines, both embedded and direct.
Vine video is HTML5 based but looks like a few custom rules may still be needed to get the pages to work nicely.

ZipNum: add support for compressed cdx

Support for compressed chunked cdx lines with a secondary index
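The general idea can be sketched as follows: sorted cdx lines are grouped into chunks, each chunk is stored as its own gzip member, and a small secondary index records the first key, offset and length of every chunk. This is only an illustration of the technique; the function names and the in-memory index are assumptions, and the real ZipNum on-disk format differs in its details.

```python
import gzip
from bisect import bisect_left


def build_zipnum(lines, lines_per_chunk=3):
    """Return (blob, index); index entries are (first_key, offset, length)."""
    blob = bytearray()
    index = []
    for i in range(0, len(lines), lines_per_chunk):
        chunk = lines[i:i + lines_per_chunk]
        member = gzip.compress(''.join(l + '\n' for l in chunk).encode('utf-8'))
        index.append((chunk[0].split(' ', 1)[0], len(blob), len(member)))
        blob += member
    return bytes(blob), index


def lookup(blob, index, key):
    """Return all lines whose first field equals `key`."""
    keys = [first_key for first_key, _, _ in index]
    # Matches may begin in the chunk just before the first chunk whose
    # first key is >= key, so start one chunk earlier.
    pos = max(bisect_left(keys, key) - 1, 0)
    matches = []
    for first_key, offset, length in index[pos:]:
        if first_key > key:
            break
        text = gzip.decompress(blob[offset:offset + length]).decode('utf-8')
        matches.extend(l for l in text.splitlines() if l.split(' ', 1)[0] == key)
    return matches


cdx_lines = [
    'com,a)/ 2014 x',
    'com,b)/ 2013 x',
    'com,b)/ 2014 x',
    'com,b)/ 2015 x',
    'com,c)/ 2012 x',
]
blob, index = build_zipnum(cdx_lines, lines_per_chunk=2)
found = lookup(blob, index, 'com,b)/')
```

Only the chunks that can contain the key are decompressed, which is what makes the scheme practical for very large indexes.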

Improved top rewriting

'window.top' needs to be rewritten in framed mode. However, it can be tricky to get right, since the variable can be written as just 'top', and 'top' can also be used for other things besides window, e.g. a local variable named top.

This is an attempt to improve detecting when top needs to be rewritten.

Problematic page that shows an advertisement and then loads the content

http://www.businessinsider.com/... shows an advertisement for some time and then redirects to the article. If you visit the page again, it remembers (by using a cookie) that you have seen the ad and goes directly to the page.

This does not work with the proxy http://pywb-h.herokuapp.com/h/http://www.businessinsider.com/....

To see the effect, make sure you have removed cookies for pywb-h.herokuapp.com. If you reload the page after it is stuck in the ad page, it will go directly to the article.

Avoid masking re-raised errors.

In this example in wbapp.py, and similar spots, it would make for easier debugging to use plain raise instead of raise e:

    except Exception as e:
        logging.exception('*** pywb could not init with settings from {0}.pywb_config()!\n'.format(config_name))
        raise e

That way python will print out a stack trace that goes back to the original source of the error, rather than the point where it was re-raised. Thanks!
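A small sketch of the suggested pattern follows; load_config and init_app are hypothetical names. Note that the difference matters most on Python 2, where raise e restarts the traceback at the re-raise point. On Python 3 the exception object carries its traceback, so both forms show the original source, but a bare raise is still the idiomatic way to re-raise unchanged.

```python
import logging
import traceback


def load_config():
    # Original source of the error.
    raise ValueError('bad setting')


def init_app():
    try:
        load_config()
    except Exception:
        logging.exception('*** could not init with the given settings')
        raise  # re-raise the active exception unchanged


def get_trace():
    try:
        init_app()
    except ValueError:
        return traceback.format_exc()


trace = get_trace()
```

The resulting trace includes the load_config frame, so the log points at the line that actually failed.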

Plaintext content on the live web is not recorded/replayed correctly

This was pointed out to me by @phonedude and applicable to webrecorder (which uses pywb). URIs where the only resource is a text document are "recorded" but when the download button is pressed and Web ARChive (WARC) is selected, the user is returned an error:

WebRecorder.io error
Temporary Warc Error
Please try downloading again.

Example URI:
http://www.cs.odu.edu/~mkelly/semester/2014_summer/bioCopy.txt

Templated Head Insert

Support head insert which can be either fixed string or generated by (Jinja2 or other) template, with info about current capture and request available to template.

Multithread architecture ?

Hi,
I've been profiling the wayback web server a bit, and it seems to be inherently single-threaded. Is this the case? I've tried with an increasing number of clients (cURL clients) that concurrently issue requests to the same page, and the %time_total reported by cURL increases steadily with the number of clients.
I run pywb on an 8-core Xeon processor, and during the benchmark execution, only 1 core is pushed to 100% of its capacity.
Is there some quick hack to improve performance?
Do you plan to support multi-core architectures?

Here's the scripts I use to benchmark:

:~/pywb$ cat single_client_pywb.sh 
#!/bin/bash
URL=$1
for i in $(seq 1 1 1000); do 
        curl -w 'Total time: \t%{time_total}\n' $URL -o /dev/null -s; 
done
:~/pywb$ cat bench.sh 
#!/bin/bash
N_CLIENTS=$1
for i in $(seq 1 1 $N_CLIENTS); do 
        ./single_client_pywb.sh http://localhost:8080/clueweb/20120418042230/http://wikitravel.org/en/282001 &
done
wait

I can provide the warc.gz if required, but any page would work.

Would like custom port

Pywb always runs on port 8080, I'd like to be able to change that via config.yaml and/or a command line parameter.

Client-Side Testing Suite

The server-side testing suite is pretty extensive (e.g. 99% coverage). However, this does not yet include client-side rewriting or testing of actual page content for replay accuracy.

Need at least some way to test the client side rewriting, and maybe more extensive testing after that.

cdx-indexer fails to index warc.gz file

I got this:

~/pywb$ cdx-indexer --sort clueweb12b13/cdx/ /tmp/clue/
Traceback (most recent call last):
  File "/usr/local/bin/cdx-indexer", line 9, in <module>
    load_entry_point('pywb==0.6.3', 'console_scripts', 'cdx-indexer')()
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/warc/cdxindexer.py", line 252, in main
    cdx09=cmd.cdx09)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/warc/cdxindexer.py", line 120, in write_multi_cdx_index
    write_cdx_index(outfile, infile, filename, **options)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/warc/cdxindexer.py", line 160, in write_cdx_index
    for entry in entry_iter:
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/warc/archiveiterator.py", line 378, in create_index_iter
    for entry in entry_iter:
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/warc/archiveiterator.py", line 206, in create_record_iter
    for record in arcv_iter.iter_records():
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/warc/archiveiterator.py", line 50, in iter_records
    record = self._next_record(next_line)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/warc/archiveiterator.py", line 138, in _next_record
    self.known_format)
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/warc/recordloader.py", line 97, in parse_record_stream
    known_format))
  File "/usr/local/lib/python2.7/dist-packages/pywb-0.6.3-py2.7.egg/pywb/warc/recordloader.py", line 171, in _detect_type_load_headers
    raise ArchiveLoadFailed(msg + str(se.statusline))
pywb.warc.recordloader.ArchiveLoadFailed: Invalid WARC record, first line: WARC-Type: response

The /tmp/clue/ directory contains only 1 file, 1000tw-00.warc.gz.
If the file is unzipped, the cdx-indexer works fine.

Custom behavior and error page per collection on missing captures

Currently, the error page is a global setting. It would be helpful to have an error page per collection.

In proxy mode, there should be the option to just pipe requests through to the web.

Support remaining query api from wayback-cdx-server

Support roughly same features as documented under:
https://github.com/iipc/openwayback/tree/master/wayback-cdx-server
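
As a rough illustration of the kind of query this would cover, the sketch below builds a query string from parameters documented for wayback-cdx-server (`url`, `matchType`, `limit`). The helper name `build_cdx_query` and the endpoint `http://localhost:8080/pywb-cdx` are assumptions made for the example, not an existing pywb API or route.

```python
from urllib.parse import urlencode


def build_cdx_query(endpoint, url, **params):
    """Build a CDX query URL (hypothetical helper, not part of pywb).

    `params` holds wayback-cdx-server style options such as
    matchType or limit; they are sorted so the output is stable.
    """
    query = [("url", url)]
    query.extend(sorted(params.items()))
    return endpoint + "?" + urlencode(query)


# The endpoint below is an assumed placeholder.
query_url = build_cdx_query(
    "http://localhost:8080/pywb-cdx",
    "example.com/",
    matchType="prefix",
    limit=5,
)
```

Parameters whose names are Python keywords, such as `from`, would need to be passed as `**{"from": "2014"}`.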

Ensure pywb runs on Windows

Mostly issues related to path-to-URL conversion, as well as possible line-ending differences.

CDXSource can't return unicode strings

I'm not sure if this is a bug or expected behavior, but if a CDXSource object returns unicode strings from load_cdx, it seems to cause errors in the rewrite phase for some files:

Traceback (most recent call last):
  File "/home/vagrant/.virtualenvs/perma/local/lib/python2.7/site-packages/pywb/framework/wsgi_wrappers.py", line 98, in handle_methods
    response = wb_router(env)
  File "/vagrant/perma_web/warc_server/pywb_config.py", line 71, in __call__
    return super(Router, self).__call__(env)
  File "/home/vagrant/.virtualenvs/perma/local/lib/python2.7/site-packages/pywb/framework/archivalrouter.py", line 36, in __call__
    return route.handler(wbrequest)
  File "/home/vagrant/.virtualenvs/perma/local/lib/python2.7/site-packages/pywb/webapp/handlers.py", line 75, in __call__
    return self.handle_request(wbrequest)
  File "/home/vagrant/.virtualenvs/perma/local/lib/python2.7/site-packages/pywb/webapp/handlers.py", line 138, in handle_request
    return self.handle_replay(wbrequest, cdx_lines)
  File "/home/vagrant/.virtualenvs/perma/local/lib/python2.7/site-packages/pywb/webapp/handlers.py", line 152, in handle_replay
    cdx_callback)
  File "/home/vagrant/.virtualenvs/perma/local/lib/python2.7/site-packages/pywb/webapp/replay_views.py", line 83, in render_content
    failed_files)
  File "/home/vagrant/.virtualenvs/perma/local/lib/python2.7/site-packages/pywb/webapp/replay_views.py", line 148, in replay_capture
    response_iter)
  File "/home/vagrant/.virtualenvs/perma/local/lib/python2.7/site-packages/pywb/webapp/replay_views.py", line 166, in buffered_response
    for buff in iterator:
  File "/home/vagrant/.virtualenvs/perma/local/lib/python2.7/site-packages/pywb/rewrite/rewrite_content.py", line 250, in rewrite_text_stream_to_gen
    buff = rewrite_func(buff)
  File "/home/vagrant/.virtualenvs/perma/local/lib/python2.7/site-packages/pywb/rewrite/regex_rewriters.py", line 54, in rewrite
    return self.regex.sub(lambda x: self.replace(x), string)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)
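
A possible workaround on the custom source side, assuming the rewriters expect byte strings: encode any unicode line to UTF-8 before returning it from load_cdx. The helper `iter_cdx_bytes` below is hypothetical and not part of pywb; it is only a sketch of the idea.

```python
def iter_cdx_bytes(lines, encoding="utf-8"):
    """Yield every CDX line as a byte string (hypothetical helper).

    Lines that are already bytes pass through unchanged; text
    (unicode) lines are encoded with the given encoding.
    """
    for line in lines:
        if isinstance(line, bytes):
            yield line
        else:
            yield line.encode(encoding)
```

A custom load_cdx could then wrap its result, for example `return iter_cdx_bytes(my_lines)`, where `my_lines` stands for whatever the source produces.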

wombat hooks for missing video

Sometimes embedded videos supported by the yt-dl library are not available for various reasons at the time of archiving: video deleted, copyright violation, user closed their account, and so forth.

If a native video player is not supported at archiving time and a reason is known why the archiving failed, provide javascript hooks in wombat to handle those.

If the native video player can be archived and already shows the reason, this is not required.

Add Basic Support for an Exclusion/Perms System

Support filtering cdx lines based on a permissions checking module.
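
A minimal sketch of what such a filter could look like, assuming space-separated CDX lines with the original URL in the third field. Both `filter_cdx_lines` and the `is_allowed` callback are hypothetical names used for illustration; the actual permissions module interface is still to be designed.

```python
def filter_cdx_lines(cdx_lines, is_allowed):
    """Yield only the CDX lines whose original URL passes is_allowed.

    Assumes fields are space-separated: urlkey, timestamp, original URL, ...
    Lines with fewer than three fields are dropped (fail closed).
    """
    for line in cdx_lines:
        fields = line.split(" ")
        if len(fields) < 3:
            continue
        if is_allowed(fields[2]):
            yield line


# Example policy (an assumption): block everything under one prefix.
def allow_all_but_private(url):
    return not url.startswith("http://example.com/private/")
```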

Ability to modify html rewriting rules as needed.

Currently, the set of tags that get rewritten is somewhat hardcoded in the htmlparser.

Would be nice to have this more configurable, especially for certain specific use cases, like tags.

A particular use case is not rewriting to work with pywb-hypothesis integration.

Cleanup rewriting response

Currently, html rewriting is fully buffered, and css/js/xml rewriting is buffered.
Clean up the interface so that all rewriting can be either fully buffered (and served with Content-Length) or streamed (without Content-Length).
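
One way the unified interface could look, as a sketch only: a single function takes the rewritten chunks and a flag, and returns the extra headers together with the body iterator. The name `prepare_body` and the return shape are assumptions, not existing pywb code.

```python
def prepare_body(chunks, buffered):
    """Return (headers, body_iter) for an iterable of byte chunks.

    buffered=True  -> join everything first so Content-Length is known.
    buffered=False -> pass chunks through and send no Content-Length.
    """
    if buffered:
        body = b"".join(chunks)
        return [("Content-Length", str(len(body)))], iter([body])
    return [], iter(chunks)
```

With this shape, the choice between buffering and streaming becomes a per-response option rather than a property of the content type.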

WARCs of text files on web are indexed but not replayable.

Uploaded WARCs containing captures of text-only pages are indexed, with the URIs properly extracted, but I receive a sad face icon (see the screenshot below) when webrecorder tries to replay the content. This occurs both when the WARC is uploaded and when it is placed somewhere on the web and webrecorder is given the URI.

Example WARC:
http://matkelly.com/temp/20140811154106173.warc

Validity verified:
./jwattools.sh test -e ./20140811154106173.warc
sh cdx-indexer ./20140811154106173.warc

Screenshot:

/cc @phonedude

Better general rewriting of css generated in JS

A few sites generate css in JS. So far, explicit rewrite rules have been added for these (e.g. wikimedia blackout, instagram). However, it may make sense to add a general JS rule if possible.

Luckily, the urls are generally of the form url(\/\/example.com) or url(//example.com), so it may be possible to detect the css url() wrapper in JS, but this needs care. It may still be best to apply the rule on a per-site basis.

Note: this is needed because intercepting a style as it's being set has proven to be quite difficult, so it is best to rewrite the style string itself.
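
A rough sketch of such a general rule, assuming only scheme-relative urls inside a css url() wrapper should be touched, in either plain (`//`) or JS-escaped (`\/\/`) form. The names `CSS_URL_IN_JS` and `rewrite_css_urls`, and the `rewrite_url` callback, are hypothetical; the real rule would need to plug into the existing regex rewriters.

```python
import re

# Matches url( + optional quote, then a scheme-relative url that starts
# with // or with the JS-escaped form \/\/ and runs up to whitespace,
# a quote, or the closing parenthesis.
CSS_URL_IN_JS = re.compile(r"""(url\(\s*['"]?)((?:\\/\\/|//)[^\s'")]+)""")


def rewrite_css_urls(js_text, rewrite_url):
    """Rewrite scheme-relative css url() targets found inside JS text.

    rewrite_url is a caller-supplied function (an assumption here) that
    maps an unescaped url such as //example.com/a.png to its replay form.
    """
    def repl(match):
        url = match.group(2)
        escaped = url.startswith("\\/")
        if escaped:
            url = url.replace("\\/", "/")  # undo JS escaping first
        url = rewrite_url(url)
        if escaped:
            url = url.replace("/", "\\/")  # restore JS escaping
        return match.group(1) + url

    return CSS_URL_IN_JS.sub(repl, js_text)
```

Ordinary paths such as url(/local/bg.png) are left alone because they do not start with a double slash, which keeps the rule narrow enough to try per site first.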