License: MIT License


Archive Now (archivenow)

A Tool To Push Web Resources Into Web Archives

Archive Now (archivenow) is currently configured to push resources into four public web archives. You can easily add more archives by writing a new archive handler (e.g., myarchive_handler.py) and placing it inside the "handlers" folder.

Update (January 2021)

Originally, archivenow was configured to push to six public web archives. Two have since been removed: WebCite, which no longer accepts archiving requests, and Archive.st, which now presents a CAPTCHA when a push is attempted. In addition, the method for pushing to archive.today and megalodon.jp has changed: both now require Selenium.

As explained below, this library can be used through:

  • Command Line Interface (CLI)
  • A Web Service
  • A Docker Container
  • Python

Installing

The latest release of archivenow can be installed using pip:

$ pip install archivenow

The latest development version containing changes not yet released can be installed from source:

$ git clone git@github.com:oduwsdl/archivenow.git
$ cd archivenow
$ pip install -r requirements.txt
$ pip install ./

In order to push to archive.today and megalodon.jp, archivenow uses Selenium, which is already listed in requirements.txt. However, Selenium additionally needs a driver to interface with the chosen browser. It is recommended to use archivenow with Firefox and its corresponding GeckoDriver.

You can download the latest versions of Firefox and the GeckoDriver to use with archivenow.

After installing the driver, you can push to archive.today and megalodon.jp from archivenow.
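As a quick sanity check before pushing, you can verify from Python that the driver is discoverable on your PATH. This is a small sketch, assuming the driver binary uses the default name "geckodriver"; the helper name is hypothetical:

```python
# Check whether the GeckoDriver binary that Selenium needs is on the PATH.
# (Assumes the default binary name "geckodriver".)
import shutil

def geckodriver_available():
    """Return the path to geckodriver, or None if it is not on the PATH."""
    return shutil.which("geckodriver")

if __name__ == "__main__":
    path = geckodriver_available()
    print(path or "geckodriver not found; the archive.today and "
                  "megalodon.jp handlers will not work")
```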

CLI USAGE

Usage of sub-commands in archivenow can be displayed by providing the -h or --help flag, as shown below.

$ archivenow -h
usage: archivenow.py [-h] [--mg] [--cc] [--cc_api_key [CC_API_KEY]]
                     [--is] [--ia] [--warc [WARC]] [-v] [--all]
                     [--server] [--host [HOST]] [--agent [AGENT]]
                     [--port [PORT]]
                     [URI]

positional arguments:
  URI                   URI of a web resource

optional arguments:
  -h, --help            show this help message and exit
  --mg                  Use Megalodon.jp
  --cc                  Use The Perma.cc Archive
  --cc_api_key [CC_API_KEY]
                        An API KEY is required by The Perma.cc Archive
  --is                  Use The Archive.is
  --ia                  Use The Internet Archive
  --warc [WARC]         Generate WARC file
  -v, --version         Report the version of archivenow
  --all                 Use all possible archives
  --server              Run archiveNow as a Web Service
  --host [HOST]         A server address
  --agent [AGENT]       Use "wget" or "squidwarc" for WARC generation
  --port [PORT]         A port number to run a Web Service

Examples

Example 1

To save the web page (www.foxnews.com) in the Internet Archive:

$ archivenow --ia www.foxnews.com
https://web.archive.org/web/20170209135625/http://www.foxnews.com

Example 2

By default, the web page (e.g., www.foxnews.com) will be saved in the Internet Archive if no optional arguments are provided:

$ archivenow www.foxnews.com
https://web.archive.org/web/20170215164835/http://www.foxnews.com

Example 3

To save the web page (www.foxnews.com) in the Internet Archive (archive.org) and Archive.is:

$ archivenow --ia --is www.foxnews.com
https://web.archive.org/web/20170209140345/http://www.foxnews.com
http://archive.is/fPVyc

Example 4

To save the web page (https://nypost.com/) in all configured web archives. In addition to preserving the page in every configured archive, this command also creates a WARC file locally:

$ archivenow --all https://nypost.com/ --cc_api_key $Your-Perma-CC-API-Key
http://archive.is/dcnan
https://perma.cc/53CC-5ST8
https://web.archive.org/web/20181002081445/https://nypost.com/
https://megalodon.jp/2018-1002-1714-24/https://nypost.com:443/
https_nypost.com__96ec2300.warc

Example 5

To download the web page (https://nypost.com/) and create a WARC file:

$ archivenow --warc=mypage --agent=wget https://nypost.com/
mypage.warc

Server

You can run archivenow as a web service. You can specify the server address and/or the port number (e.g., --host localhost --port 12345):

$ archivenow --server

Running on http://0.0.0.0:12345/ (Press CTRL+C to quit)

Example 6

To save the web page (www.foxnews.com) in The Internet Archive through the web service:

$ curl -i http://0.0.0.0:12345/ia/www.foxnews.com

    HTTP/1.0 200 OK
    Content-Type: application/json
    Content-Length: 95
    Server: Werkzeug/0.11.15 Python/2.7.10
    Date: Tue, 02 Oct 2018 08:20:18 GMT

    {
      "results": [
        "https://web.archive.org/web/20181002082007/http://www.foxnews.com"
      ]
    }
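The same endpoint can also be called from Python. A minimal sketch, where the `/<archive-id>/<URI>` URL layout is taken from the examples in this README, the helper names are hypothetical, and a server is assumed to already be running:

```python
import json
import urllib.request

def service_url(uri, archive="ia", host="http://0.0.0.0:12345"):
    # The web service routes requests as /<archive-id>/<URI>
    return "%s/%s/%s" % (host, archive, uri)

def save_via_service(uri, archive="ia"):
    # Returns the "results" list from the JSON response shown above
    with urllib.request.urlopen(service_url(uri, archive)) as resp:
        return json.load(resp)["results"]
```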

Example 7

To save the web page (www.foxnews.com) in all configured archives through the web service:

$ curl -i http://0.0.0.0:12345/all/www.foxnews.com

    HTTP/1.0 200 OK
    Content-Type: application/json
    Content-Length: 385
    Server: Werkzeug/0.11.15 Python/2.7.10
    Date: Tue, 02 Oct 2018 08:23:53 GMT

    {
      "results": [
        "Error (The Perma.cc Archive): An API Key is required ", 
        "http://archive.is/ukads", 
        "https://web.archive.org/web/20181002082007/http://www.foxnews.com", 
        "Error (Megalodon.jp): We can not obtain this page because the time limit has been reached or for technical ... ", 
        "http://www.webcitation.org/72rbKsX8B"
      ]
    }

Example 8

Because an API Key is required by Perma.cc, the HTTP request should be as follows:

$ curl -i "http://127.0.0.1:12345/all/https://nypost.com/?cc_api_key=$Your-Perma-CC-API-Key"

Or use only Perma.cc:

$ curl -i "http://127.0.0.1:12345/cc/https://nypost.com/?cc_api_key=$Your-Perma-CC-API-Key"

Running as a Docker Container

$ docker image pull oduwsdl/archivenow

Different ways to run archivenow

$ docker container run -it --rm oduwsdl/archivenow -h

Accessible at 127.0.0.1:12345:

$ docker container run -p 12345:12345 -it --rm oduwsdl/archivenow --server --host 0.0.0.0

Accessible at 127.0.0.1:22222:

$ docker container run -p 22222:11111 -it --rm oduwsdl/archivenow --server --port 11111 --host 0.0.0.0


To save the web page (http://www.cnn.com) in The Internet Archive:

$ docker container run -it --rm oduwsdl/archivenow --ia http://www.cnn.com

Python Usage

>>> from archivenow import archivenow

Example 9

To save the web page (www.foxnews.com) in all configured archives:

>>> archivenow.push("www.foxnews.com","all")
['https://web.archive.org/web/20170209145930/http://www.foxnews.com','http://archive.is/oAjuM','http://www.webcitation.org/6o9LcQoVV','Error (The Perma.cc Archive): An API KEY is required']

Example 10

To save the web page (www.foxnews.com) in Perma.cc:

>>> archivenow.push("www.foxnews.com","cc",{"cc_api_key":"$YOUR-Perma-cc-API-KEY"})
['https://perma.cc/8YYC-C7RM']

Example 11

To start the server from Python, do the following. The host and/or port number can be passed (e.g., start(port=1111, host='localhost')):

>>> archivenow.start()

    2017-02-09 15:02:37
    Running on http://127.0.0.1:12345
    (Press CTRL+C to quit)

Configuring a new archive or removing an existing one

Additional archives may be added by creating a handler file in the "handlers" directory.

For example, to add a new archive named "My Archive", create a file "ma_handler.py" and store it in the folder "handlers". The "ma" prefix becomes the archive identifier, so to push a web page (e.g., www.cnn.com) to this archive from Python, write:

archivenow.push("www.cnn.com","ma")

In the file "ma_handler.py", the name of the class must be "MA_handler". This class must have at least one function called "push" which has one argument. See the existing handler files for examples on how to organized a newly configured archive handler.

Removing an archive can be done by one of the following options:

  • Removing the archive handler file from the folder "handlers"
  • Renaming the archive handler file to a name that does not end with "_handler.py"
  • Setting the variable "enabled" to "False" inside the handler file

Notes

The Internet Archive (IA) sets a time gap of at least two minutes between creating different copies of the "same" resource.

For example, if you ask IA to capture www.cnn.com at 10:00 PM, IA creates a new copy (C) of this URI and returns C for all further requests for this URI until 10:02 PM. The same procedure applies to Archive.is, but with a time gap of five minutes.
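Client code can mirror this behaviour with a small cache so it does not re-submit the same URI inside the gap. A sketch under that assumption; the class and its names are hypothetical, not part of archivenow:

```python
import time

class GapCache:
    """Reuse a recent memento instead of re-submitting within the gap."""

    def __init__(self, gap_seconds=120):  # two minutes, as IA enforces
        self.gap = gap_seconds
        self.last = {}  # uri -> (submit_time, memento)

    def push(self, push_fn, uri, now=None):
        now = time.time() if now is None else now
        entry = self.last.get(uri)
        if entry and now - entry[0] < self.gap:
            return entry[1]  # still inside the gap: reuse the recent copy
        memento = push_fn(uri)
        self.last[uri] = (now, memento)
        return memento
```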

Citing Project

@INPROCEEDINGS{archivenow-jcdl2018,
  AUTHOR    = {Mohamed Aturban and
               Mat Kelly and
               Sawood Alam and
               John A. Berlin and
               Michael L. Nelson and
               Michele C. Weigle},
  TITLE     = {{ArchiveNow}: Simplified, Extensible, Multi-Archive Preservation},
  BOOKTITLE = {Proceedings of the 18th {ACM/IEEE-CS} Joint Conference on Digital Libraries},
  SERIES    = {{JCDL} '18},
  PAGES     = {321--322},
  MONTH     = {June},
  YEAR      = {2018},
  ADDRESS   = {Fort Worth, Texas, USA},
  URL       = {https://doi.org/10.1145/3197026.3203880},
  DOI       = {10.1145/3197026.3203880}
}

archivenow's People

Contributors: a-mabe, evil-wayback, ibnesayeed, lebnan, machawk1, myano, ruebot, shawnmjones, veloute, waybackarchiver


archivenow's Issues

Will submit pr: submit ghostarchive.org

Cool site that uses webrecorder render and also archives videos.

The endpoint for submitting an archive is "/archive", and it is a POST request. Once the request is submitted, it redirects (302) to the URL where the archive will be stored.

I will submit the PR myself, but any objections before I do?

documentation incorrect on how to pass parameters to a handler

The current README suggests this code for use in Python:
archivenow.push("www.foxnews.com","cc","cc_api_key=$YOUR-Perma-cc-API-KEY")
But when push() is called directly, it actually expects additional parameters in dictionary form, e.g.:
archivenow.push("www.foxnews.com","cc",{"cc_api_key":"$YOUR-Perma-cc-API-KEY"})

Better defaults in the UI

For better user experience you might want to:

  • make the first three checkboxes checked by default
  • add a link to the page where Perma.cc API key can be generated, and
  • make the API key persist in the user's browser localStorage (if entered)

Web Service?

This is an amazing tool, thank you for building and publishing it! Do you happen to know if anyone is hosting a web service that uses this tool, so users can paste a URL once and generate archives across all supported archive providers in one go? That would be amazing. If not, I may be interested in building such a tool. Let me know what you think.

Add Support for Megalodon.jp

If possible, please add support for Megalodon.jp.


I've found some snippets of code around the Internet, but when I've tried making requests with the information from these projects, I always get the megalodon.jp URL as res.url and nothing useful in res.headers in the server's response.

I've tried replicating the cookies back but sometimes I get the error from their server "「Cookieが無効な状態」" which means it is complaining about them.

Anyone have any thoughts on how to submit URLs to Megalodon.jp in Python?

ImportError: No module named pathlib

In Ubuntu 18.04
pip install archivenow
...
Successfully installed Jinja2-2.11.1 MarkupSafe-1.1.1 Werkzeug-1.0.0 archivenow-2019.7.27.2.35.46 certifi-2019.11.28 chardet-3.0.4 click-7.1.1 flask-1.1.1 idna-2.9 itsdangerous-1.1.0 requests-2.23.0 urllib3-1.25.8
I tried running a test with
archivenow --all https://nypost.com/
The response was:

Traceback (most recent call last):
  File "/home/myusername/.local/bin/archivenow", line 7, in <module>
    from archivenow.archivenow import args_parser
  File "/home/myusername/.local/lib/python2.7/site-packages/archivenow/archivenow.py", line 13, in <module>
    from pathlib import Path
ImportError: No module named pathlib

Archive sites in addition to submitting URIs

One of the use cases in https://github.com/webrecorder/warcit is to grab a site's contents using wget and then run the tool to create a WARC file from the local file contents. It would be useful for a tool called "archivenow" to do more than submit URIs; rather, it could perform some form of archiving itself.

I would like to propose replicating this model in the archivenow tool but in a single command. For example, running archivenow --warc=news.warc --agent=wget --ia http://cnn.com would use wget to create a WARC of cnn.com, store it locally as news.warc, and also submit the URI to IA.

via.hypothes.is nesting

Support the nesting of Hypothesis links as an archive link.
e.g. http://archive.today/?run=1&url=https://via.hypothes.is/https://www.agileservicemanifesto.org/
where http://archive.today/?run=1&url=https://via.hypothes.is/ is the prefix and https://www.agileservicemanifesto.org/ is the URL (encodeURIComponent in JS function)
The same should apply to all archive sites.

Restructure the response JSON

I would suggest that response

{
	"results": [
		"https://web.archive.org/web/20170209143327/http://www.foxnews.com",
		"http://archive.is/H2Yfg",
		"http://www.webcitation.org/6o9Jubykh",
		"Error (The Perma.cc Archive): An API KEY is required"
	]
}

should be changed to

{
	"uri": "http://www.foxnews.com",
	"request-datetime": "20170209143321",
	"mementos": {
		"web.archive.org": "https://web.archive.org/web/20170209143327/http://www.foxnews.com",
		"archive.is": "http://archive.is/H2Yfg",
		"webcitation.org": "http://www.webcitation.org/6o9Jubykh",
		"perma.cc": "Error: An API KEY is required"
	}
}
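A sketch of such a transformation over the current flat "results" list. The keying by archive hostname follows the proposal, but putting error strings under a separate "errors" list (rather than guessing which archive produced them) is my assumption:

```python
from urllib.parse import urlparse

def restructure(uri, request_datetime, results):
    # Key each memento by the archive's hostname; collect error strings
    # separately, since the flat list does not identify their archive.
    mementos, errors = {}, []
    for r in results:
        if r.startswith("http"):
            mementos[urlparse(r).netloc] = r
        else:
            errors.append(r)
    return {"uri": uri,
            "request-datetime": request_datetime,
            "mementos": mementos,
            "errors": errors}
```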

Archive Web Site

Can you add the ability to archive a complete web site

  • spidering from a given directory to any depth or a specified depth
  • up to a certain depth for links outside the site

Some files may be document files like doc, pdf with links.

Self-report module version number

In #7 I had to resort to pip to verify the version of the library I was using. The version is reported on installation, but it is common for a module to be able to self-report its version.

Allow archivenow -v and archivenow --version to print the version of the module to stdout. This should help with debugging.

Handle case where no optional parameters are specified

I attempted to specify no optional parameters but simply the URI positional parameter via:

archivenow http://some-urir

and was shown the command-line help instead. It would be better to handle this usage more gracefully, i.e., by triggering the "--all" or "--ia" flag when no archive is explicitly specified.

archive.today fails, the site presents a captcha challenge

When trying to archive a URL to archive.today through archivenow --is URL, it always returns:

Error (The Archive.is): 429 Client Error: Too Many Requests for url: https://archive.is/submit/

I have Firefox and geckodriver installed and available in my PATH.

When submitting a URL on the site regularly through a browser, the site returns 429 on submit and requires the completion of a reCAPTCHA challenge, and then proceeds to archive the URL.

archivenow --ia "http://www.hotcactus.nyc/" returns incorrect URL

archivenow --ia "http://www.hotcactus.nyc/" returns the following for me:

https://web.archive.org/web/20210723204229/https://www.google.com/maps/embed?pb=!1m0!3m2!1sen!2sus!4v1492711765912!6m8!1m7!1sUl8AEIci9YYO2dP_SwO1oQ!2m2!1d40.71478671950204!2d-73.99018606424495!3f303.3614672677981!4f-5.896130905148539!5f0.7820865974627469

Which is an embed on the site, but not the top level site itself — I'd expect it to return something like:

https://web.archive.org/web/20210723204228/http://www.hotcactus.nyc/

instead.

ModuleNotFoundError: No module named '__init__'

When installed on Heroku with pip here's the error I get.

from archivenow import archivenow

Triggers this:

Traceback (most recent call last):
  File "manage.py", line 22, in <module>
    execute_from_command_line(sys.argv)
  File "/app/.heroku/python/lib/python3.6/site-packages/django/core/management/__init__.py", line 364, in execute_from_command_line
    utility.execute()
  File "/app/.heroku/python/lib/python3.6/site-packages/django/core/management/__init__.py", line 356, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/app/.heroku/python/lib/python3.6/site-packages/django/core/management/base.py", line 283, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/app/.heroku/python/lib/python3.6/site-packages/django/core/management/base.py", line 330, in execute
    output = self.handle(*args, **options)
  File "/app/archive/management/commands/testis.py", line 10, in handle
    tasks.is_memento(clip.id)
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/local.py", line 191, in __call__
    return self._get_current_object()(*a, **kw)
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/app/task.py", line 380, in __call__
    return self.run(*args, **kwargs)
  File "/app/archive/tasks.py", line 13, in is_memento
    from archivenow import archivenow
  File "/app/.heroku/python/lib/python3.6/site-packages/archivenow/archivenow.py", line 10, in <module>
    from __init__ import __version__ as archiveNowVersion
ModuleNotFoundError: No module named '__init__'

[Feature request] Add retry logic

Description

The program should have a retry logic in case the request to the archive service fails. In my experience, this happens a lot with The Internet Archive. For example:

$ archivenow --ia --is https://twitter.com/Itaoka1/status/494145244540063745
Error (The Internet Archive): HTTPSConnectionPool(host='web.archive.org', port=443): Max retries exceeded with url: /save/https://twitter.com/Itaoka1/status/494145244540063745 (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f6d61a34d10>, 'Connection to web.archive.org timed out. (connect timeout=120)'))
https://archive.li/wip/v4SzI

I would prefer that the command does not complete before it actually succeeds with the requests to all of the given archive services, or at least before a certain number of maximum retries (per service) is reached. The retry count should be configurable, via a command line option (e.g. --max-retries 20), and it should have a reasonable default (5?) in case the option isn’t given by the user.

Currently, the user has to manually issue new archivals for the services for which the request was unsuccessful.
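A minimal sketch of the requested behaviour as a wrapper around a push-style function. The wrapper name is hypothetical, and detecting failures via the "Error" prefix is an assumption based on the error strings archivenow returns:

```python
import time

def push_with_retries(push_fn, uri, arc_id, max_retries=5, delay=2.0,
                      sleep=time.sleep):
    # Retry until no result string starts with "Error", or retries run out.
    results = push_fn(uri, arc_id)
    for _ in range(max_retries):
        if not any(r.startswith("Error") for r in results):
            break
        sleep(delay)
        results = push_fn(uri, arc_id)
    return results
```

Per-service retries (re-submitting only to the archives that failed) would need the restructured response discussed elsewhere, since the flat list does not identify which archive produced an error.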

Reduce nesting

Too much nesting is generally a bad idea unless it is necessary. Try refactoring your code to reduce nesting in general. The file archivenow.py could use the glob module to filter files with a specific name pattern, reducing the nesting and making the code more readable.

Support Python 3

It is now safe and preferred to write code for Python 3 as Python 2 is reaching the end of its extended life. There is no need to support Python 2 anymore in libraries like these.

bug(windows): Error: No enabled archive handler found

1. Summary

I can't begin to use archivenow CLI on my Windows.

2. Environment

  • Windows 10 Enterprise LTSB 64-bit EN
  • Python 3.7.2
  • archivenow 2019.1.5.2.19.34

3. Steps to reproduce

I install archivenow to virtual environment:

D:\SashaDebugging>mkvirtualenv archivenowenv
Using base prefix 'c:\\python37'
New python executable in C:\Users\SashaChernykh\Envs\archivenowenv\Scripts\python.exe
Installing setuptools, pip, wheel…
done.

(archivenowenv) D:\SashaDebugging>toggleglobalsitepackages

    Disabled global site-packages

(archivenowenv) D:\SashaDebugging>pip install archivenow
Collecting archivenow
  Using cached https://files.pythonhosted.org/packages/32/25/0d3051d362e2a42322e8716dc359557aeb143cc9ca6e5d19efad74a0f6d2/archivenow-2019.1.5.2.19.34-py2.py3-none-any.whl
Collecting flask (from archivenow)
  Using cached https://files.pythonhosted.org/packages/7f/e7/08578774ed4536d3242b14dacb4696386634607af824ea997202cd0edb4b/Flask-1.0.2-py2.py3-none-any.whl
Collecting requests (from archivenow)
  Using cached https://files.pythonhosted.org/packages/7d/e3/20f3d364d6c8e5d2353c72a67778eb189176f08e873c9900e10c0287b84b/requests-2.21.0-py2.py3-none-any.whl
Collecting itsdangerous>=0.24 (from flask->archivenow)
  Using cached https://files.pythonhosted.org/packages/76/ae/44b03b253d6fade317f32c24d100b3b35c2239807046a4c953c7b89fa49e/itsdangerous-1.1.0-py2.py3-none-any.whl
Collecting Werkzeug>=0.14 (from flask->archivenow)
  Using cached https://files.pythonhosted.org/packages/20/c4/12e3e56473e52375aa29c4764e70d1b8f3efa6682bef8d0aae04fe335243/Werkzeug-0.14.1-py2.py3-none-any.whl
Collecting Jinja2>=2.10 (from flask->archivenow)
  Using cached https://files.pythonhosted.org/packages/7f/ff/ae64bacdfc95f27a016a7bed8e8686763ba4d277a78ca76f32659220a731/Jinja2-2.10-py2.py3-none-any.whl
Collecting click>=5.1 (from flask->archivenow)
  Using cached https://files.pythonhosted.org/packages/fa/37/45185cb5abbc30d7257104c434fe0b07e5a195a6847506c074527aa599ec/Click-7.0-py2.py3-none-any.whl
Collecting idna<2.9,>=2.5 (from requests->archivenow)
  Using cached https://files.pythonhosted.org/packages/14/2c/cd551d81dbe15200be1cf41cd03869a46fe7226e7450af7a6545bfc474c9/idna-2.8-py2.py3-none-any.whl
Collecting urllib3<1.25,>=1.21.1 (from requests->archivenow)
  Using cached https://files.pythonhosted.org/packages/62/00/ee1d7de624db8ba7090d1226aebefab96a2c71cd5cfa7629d6ad3f61b79e/urllib3-1.24.1-py2.py3-none-any.whl
Collecting chardet<3.1.0,>=3.0.2 (from requests->archivenow)
  Using cached https://files.pythonhosted.org/packages/bc/a9/01ffebfb562e4274b6487b4bb1ddec7ca55ec7510b22e4c51f14098443b8/chardet-3.0.4-py2.py3-none-any.whl
Collecting certifi>=2017.4.17 (from requests->archivenow)
  Using cached https://files.pythonhosted.org/packages/9f/e0/accfc1b56b57e9750eba272e24c4dddeac86852c2bebd1236674d7887e8a/certifi-2018.11.29-py2.py3-none-any.whl
Collecting MarkupSafe>=0.23 (from Jinja2>=2.10->flask->archivenow)
  Using cached https://files.pythonhosted.org/packages/44/6e/41ac9266e3db762dfd9089f6b0d2298c84160f54ef2a7257c17b0e7ec2ec/MarkupSafe-1.1.0-cp37-cp37m-win_amd64.whl
Installing collected packages: itsdangerous, Werkzeug, MarkupSafe, Jinja2, click, flask, idna, urllib3, chardet, certifi, requests, archivenow
Successfully installed Jinja2-2.10 MarkupSafe-1.1.0 Werkzeug-0.14.1 archivenow-2019.1.5.2.19.34 certifi-2018.11.29 chardet-3.0.4 click-7.0 flask-1.0.2 idna-2.8 itsdangerous-1.1.0 requests-2.21.0 urllib3-1.24.1

I try run commands as in examples.

4. Expected behavior

Save web-pages on archiving services.

5. Actual behavior

I get Error: No enabled archive handler found every time.

(archivenowenv) D:\SashaDebugging>archivenow kristinita.ru

 Error: No enabled archive handler found


(archivenowenv) D:\SashaDebugging>archivenow google.com

 Error: No enabled archive handler found


(archivenowenv) D:\SashaDebugging>archivenow -all kristinita.ru

 Error: No enabled archive handler found


(archivenowenv) D:\SashaDebugging>archivenow -ia google.com

 Error: No enabled archive handler found

Thanks.

Add a Dockerfile

In order to deploy it on our server, please add a Dockerfile in it and also add a corresponding image in DockerHub.

Problems with pushing mementos into Internet Archive

I noticed this when I was using ArchiveNow this morning.

# archivenow www.foxnews.com
Error (The Internet Archive): 445 Client Error:  for url: https://web.archive.org/save/www.foxnews.com

If I add a user agent to the arguments to the requests.get on line 15 of archivenow/archivenow/handlers/ia_handler.py then it works.

r = requests.get(uri, timeout=120, allow_redirects=True)

I'm uncertain as to how you want to handle the user specifying their own user agent. The existing --agent argument appears to be for specifying which tool the user desires to employ for creating WARCs. Also, there doesn't appear to be a way to submit changes to any of the request headers in archivenow/archivenow.py.

As I'm calling ArchiveNow within Python code, I would prefer an available parameter to the push function on line 129 of archivenow/archivenow.py.

def push(URI, arc_id, p_args={}):
    global handlers
    global res_uris
    try:
        res_uris_idx = str(uuid.uuid4())
        res_uris[res_uris_idx] = []
        # push to the chosen archives (or to all of them)
        threads = []
        for handler in handlers:
            if (arc_id == handler) or (arc_id == 'all'):
                threads.append(Thread(target=push_proxy,
                                      args=(handlers[handler], str(URI),
                                            p_args, res_uris_idx,)))
        for th in threads:
            th.start()
        for th in threads:
            th.join()
        res = res_uris[res_uris_idx]
        del res_uris[res_uris_idx]
        return res
    except:
        del res_uris[res_uris_idx]
        return ["bad request"]

For example, we could have:

def push(URI, arc_id, p_args={}, headers={}):

where the user can override any of the request headers by assigning them as a dictionary to the headers parameter. This dictionary would have to be re-submitted through the code on line 154 to the function executed via multithreading.

I haven't submitted a pull request yet because all handlers would need to be updated to receive and act on this parameter. I'm not sure of the implications of that.

Archiving resources with relative Content-Location

archivenow --ia https://www.w3.org/TR/webarch/
https://web.archive.orgOverview.html

See also from curl where a resource returns Content-Location:

curl -I https://www.w3.org/TR/webarch/
content-location: Overview.html

in comparison to the ones that don't:

curl -I http://csarven.ca/

So, when I do something like:

curl -ki 'https://web.archive.org/save/https://www.w3.org/TR/webarch/' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:69.0) Gecko/20100101 Firefox/69.0' -H 'Accept: */*' -H 'Accept-Language: en-CA,en;q=0.7,en-US;q=0.3' --compressed -H 'Referer: https://localhost:8443/' -H 'Origin: https://localhost:8443' -H 'Connection: keep-alive' -H 'DNT: 1' -H 'Pragma: no-cache' -H 'Cache-Control: no-cache'

I get:

content-location: Overview.html

And that kind of screws things up for me because I can't figure out the actual snapshot location from the headers. It's okay if a JS-enabled agent makes the request, because it eventually redirects, but that's not what I want: I'm making this call from a client-side application and only want to work with headers (or whatever proper structured data is available, as opposed to scraping).

This is in comparison to say:

curl -ki 'https://web.archive.org/save/http://csarven.ca/' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:69.0) Gecko/20100101 Firefox/69.0' -H 'Accept: */*' -H 'Accept-Language: en-CA,en;q=0.7,en-US;q=0.3' --compressed -H 'Referer: https://localhost:8443/' -H 'Origin: https://localhost:8443' -H 'Connection: keep-alive' -H 'DNT: 1' -H 'Pragma: no-cache' -H 'Cache-Control: no-cache'

which gives a nice workable:

content-location: /web/20190708123256/http://csarven.ca/

Ideas?

Archive images in IA

It would be nice if the tool would also archive embedded content for Internet Archive requests. This could be done by downloading the archived page and searching for any /save/_embed/[^"'<>\(\)]* URLs in the page source.

(It would also be nice if the tool could download lazy-loaded files and/or any linked media files, although even the Wayback Machine can't really do that in many cases.)
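The suggested scan could look like this. The regular expression is taken verbatim from the issue; the function name is hypothetical, and fetching the archived page source is assumed to happen elsewhere:

```python
import re

# Pattern suggested in the issue for embedded-resource save URLs
EMBED_RE = re.compile(r"/save/_embed/[^\"'<>\(\)]*")

def find_embeds(page_source):
    """Return all embedded /save/_embed/ URLs found in the page source."""
    return EMBED_RE.findall(page_source)
```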

502 bad gateway error

Hi,

I am getting the following error from both Python and CLI usage. archive.cli is working fine though.

Error (The Internet Archive): 502 Server Error: Bad Gateway for url: https://web.archive.org/save/www.google.co.in

I used it successfully last month. However, currently I am getting this error. I tried using the web page and it is working fine.

Thanks
