
oduwsdl / archivenow

A Tool To Push Web Resources Into Web Archives

License: MIT License

Python 58.99% HTML 40.31% Dockerfile 0.71%
web-archiving internet-archive

archivenow's Issues

via.hypothes.is nesting

Support the nesting of Hypothesis links as an archive link.
e.g. http://archive.today/?run=1&url=https://via.hypothes.is/https://www.agileservicemanifesto.org/
where http://archive.today/?run=1&url=https://via.hypothes.is/ is the prefix and https://www.agileservicemanifesto.org/ is the URL (encodeURIComponent in JS function)
Same should be applied to all sites
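
The nesting described above could be sketched in Python roughly as follows. This is only an illustration: nest_url and the prefix constant are hypothetical names, and urllib.parse.quote with safe="" is used as an approximate stand-in for JavaScript's encodeURIComponent.

```python
from urllib.parse import quote

# Prefix taken from the example in this issue.
ARCHIVE_TODAY_VIA_PREFIX = "http://archive.today/?run=1&url=https://via.hypothes.is/"


def nest_url(prefix: str, url: str) -> str:
    """Append the percent-encoded target URL to an archive prefix (hypothetical helper)."""
    # safe="" also encodes "/" and ":", similar to encodeURIComponent.
    return prefix + quote(url, safe="")
```

Calling nest_url(ARCHIVE_TODAY_VIA_PREFIX, "https://www.agileservicemanifesto.org/") yields the prefix followed by the encoded target; the same helper could be reused with other prefixes.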

Archive sites in addition to submitting URIs

One of the use cases in https://github.com/webrecorder/warcit is to grab a site's contents using wget and then run the tool to create a WARC file from the local file contents. It would be useful for a tool called "archivenow" to do more than submit URIs and to perform some form of archiving itself.

I would like to propose replicating this model in the archivenow tool, but in a single command. For example, running archivenow --warc=news.warc --agent=wget --ia http://cnn.com would use wget to create a WARC of cnn.com, store it locally at news.warc, and also submit the URI to IA.
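
A minimal sketch of the wget half of that proposal, assuming wget is installed and using its --warc-file and --page-requisites options. The function names are hypothetical and this is not how archivenow is known to implement it.

```python
import subprocess


def build_wget_warc_command(uri: str, warc_path: str) -> list:
    """Build a wget command line that records the fetch into a WARC file."""
    # wget appends the ".warc.gz" (or ".warc") extension itself,
    # so strip it from the requested name before passing it on.
    base = warc_path
    for suffix in (".warc.gz", ".warc"):
        if base.endswith(suffix):
            base = base[: -len(suffix)]
            break
    return ["wget", "--page-requisites", "--warc-file=" + base, uri]


def create_warc(uri: str, warc_path: str) -> int:
    """Run wget and return its exit status (requires wget on the PATH)."""
    return subprocess.run(build_wget_warc_command(uri, warc_path)).returncode
```

Note that wget compresses WARC output by default, so the file on disk would be news.warc.gz unless --no-warc-compression is added. Submitting the URI to IA would remain a separate step after create_warc returns.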

Web Service?

This is an amazing tool, thank you for building and publishing it! Do you by chance know if anyone is hosting a web service that uses this tool to let users paste a URL once and generate archives across all of the supported archive service providers in one go? That would be amazing. If not, I may be interested in building such a tool. Let me know what you think.

Reduce nesting

Too much nesting is generally a bad idea unless it is necessary. Try refactoring your code to reduce nesting in general. The file archivenow.py could use the glob method to filter files matching a specific name pattern, which would reduce the nesting and make the code more readable.
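
As an illustration of the glob suggestion, assuming the nested code in question walks a handlers directory looking for files named like ia_handler.py or is_handler.py (an assumption; the issue does not say which loop is meant), a flat version might look like this. find_handler_files is a hypothetical name.

```python
import glob
import os


def find_handler_files(handlers_dir: str) -> list:
    """Return paths of files matching '*_handler.py', sorted by name."""
    pattern = os.path.join(handlers_dir, "*_handler.py")
    # glob does the name filtering, so no nested loop with if-checks is needed.
    return sorted(glob.glob(pattern))
```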

archivenow --ia "http://www.hotcactus.nyc/" returns incorrect URL

archivenow --ia "http://www.hotcactus.nyc/" returns the following for me:

https://web.archive.org/web/20210723204229/https://www.google.com/maps/embed?pb=!1m0!3m2!1sen!2sus!4v1492711765912!6m8!1m7!1sUl8AEIci9YYO2dP_SwO1oQ!2m2!1d40.71478671950204!2d-73.99018606424495!3f303.3614672677981!4f-5.896130905148539!5f0.7820865974627469

Which is an embed on the site, but not the top-level site itself. I'd expect it to return something like:

https://web.archive.org/web/20210723204228/http://www.hotcactus.nyc/

instead.

Restructure the response JSON

I would suggest that the response

{
	"results": [
		"https://web.archive.org/web/20170209143327/http://www.foxnews.com",
		"http://archive.is/H2Yfg",
		"http://www.webcitation.org/6o9Jubykh",
		"Error (The Perma.cc Archive): An API KEY is required"
	]
}

should be changed to

{
	"uri": "http://www.foxnews.com",
	"request-datetime": "20170209143321",
	"mementos": {
		"web.archive.org": "https://web.archive.org/web/20170209143327/http://www.foxnews.com",
		"archive.is": "http://archive.is/H2Yfg",
		"webcitation.org": "http://www.webcitation.org/6o9Jubykh",
		"perma.cc": "Error: An API KEY is required"
	}
}
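
A rough sketch of how the proposed structure could be assembled in Python. The function names are hypothetical; the 14-digit timestamp format is inferred from the "request-datetime" value in the example above.

```python
import json
from datetime import datetime


def timestamp14(moment: datetime) -> str:
    """Format a datetime as a 14-digit YYYYMMDDhhmmss string."""
    return moment.strftime("%Y%m%d%H%M%S")


def build_response(uri: str, requested_at: datetime, mementos: dict) -> dict:
    """Assemble the proposed response, with results keyed by archive host."""
    return {
        "uri": uri,
        "request-datetime": timestamp14(requested_at),
        "mementos": dict(mementos),  # copy so the caller's dict is not shared
    }


def to_json(response: dict) -> str:
    """Serialize the response with tab indentation, as in the example."""
    return json.dumps(response, indent="\t")
```

Keying by host lets a client look up a single archive's result directly instead of guessing from the position in a list.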

502 bad gateway error

Hi,

I am getting the following error from both Python and CLI usage. archive.cli is working fine though.

Error (The Internet Archive): 502 Server Error: Bad Gateway for url: https://web.archive.org/save/www.google.co.in

I used it successfully last month. However, currently I am getting this error. I tried using the web page and it is working fine.

Thanks

Support Python 3

It is now safe and preferred to write code for Python 3 as Python 2 is reaching the end of its extended life. There is no need to support Python 2 anymore in libraries like these.

ModuleNotFoundError: No module named '__init__'

When installed on Heroku with pip here's the error I get.

from archivenow import archivenow

Triggers this:

Traceback (most recent call last):
  File "manage.py", line 22, in <module>
    execute_from_command_line(sys.argv)
  File "/app/.heroku/python/lib/python3.6/site-packages/django/core/management/__init__.py", line 364, in execute_from_command_line
    utility.execute()
  File "/app/.heroku/python/lib/python3.6/site-packages/django/core/management/__init__.py", line 356, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/app/.heroku/python/lib/python3.6/site-packages/django/core/management/base.py", line 283, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/app/.heroku/python/lib/python3.6/site-packages/django/core/management/base.py", line 330, in execute
    output = self.handle(*args, **options)
  File "/app/archive/management/commands/testis.py", line 10, in handle
    tasks.is_memento(clip.id)
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/local.py", line 191, in __call__
    return self._get_current_object()(*a, **kw)
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/app/task.py", line 380, in __call__
    return self.run(*args, **kwargs)
  File "/app/archive/tasks.py", line 13, in is_memento
    from archivenow import archivenow
  File "/app/.heroku/python/lib/python3.6/site-packages/archivenow/archivenow.py", line 10, in <module>
    from __init__ import __version__ as archiveNowVersion
ModuleNotFoundError: No module named '__init__'
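
The failing line, from __init__ import __version__, is an implicit relative import, which Python 3 does not resolve when the module is loaded as part of a package. The sketch below demonstrates an explicit relative import in a throwaway package; demo_pkg and load_demo_version are hypothetical, and this is one possible fix rather than necessarily the one adopted by the project.

```python
import importlib
import os
import sys
import tempfile


def load_demo_version() -> str:
    """Create a tiny package on disk and read its version via an explicit import."""
    root = tempfile.mkdtemp()
    pkg_dir = os.path.join(root, "demo_pkg")
    os.makedirs(pkg_dir)
    with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
        f.write('__version__ = "1.2.3"\n')
    with open(os.path.join(pkg_dir, "core.py"), "w") as f:
        # Explicit relative import: resolves when core is imported as demo_pkg.core.
        f.write("from . import __version__ as pkg_version\n")
    sys.path.insert(0, root)
    try:
        core = importlib.import_module("demo_pkg.core")
        return core.pkg_version
    finally:
        sys.path.remove(root)
```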

Archive Web Site

Can you add the ability to archive a complete web site

  • spidering from a given directory to any depth or a specified depth
  • up to a certain depth for links outside the site

Some files may be document files like doc, pdf with links.

bug(windows): Error: No enabled archive handler found

1. Summary

I can't get started with the archivenow CLI on Windows.

2. Environment

  • Windows 10 Enterprise LTSB 64-bit EN
  • Python 3.7.2
  • archivenow 2019.1.5.2.19.34

3. Steps to reproduce

I install archivenow into a virtual environment:

D:\SashaDebugging>mkvirtualenv archivenowenv
Using base prefix 'c:\\python37'
New python executable in C:\Users\SashaChernykh\Envs\archivenowenv\Scripts\python.exe
Installing setuptools, pip, wheel…
done.

(archivenowenv) D:\SashaDebugging>toggleglobalsitepackages

    Disabled global site-packages

(archivenowenv) D:\SashaDebugging>pip install archivenow
Collecting archivenow
  Using cached https://files.pythonhosted.org/packages/32/25/0d3051d362e2a42322e8716dc359557aeb143cc9ca6e5d19efad74a0f6d2/archivenow-2019.1.5.2.19.34-py2.py3-none-any.whl
Collecting flask (from archivenow)
  Using cached https://files.pythonhosted.org/packages/7f/e7/08578774ed4536d3242b14dacb4696386634607af824ea997202cd0edb4b/Flask-1.0.2-py2.py3-none-any.whl
Collecting requests (from archivenow)
  Using cached https://files.pythonhosted.org/packages/7d/e3/20f3d364d6c8e5d2353c72a67778eb189176f08e873c9900e10c0287b84b/requests-2.21.0-py2.py3-none-any.whl
Collecting itsdangerous>=0.24 (from flask->archivenow)
  Using cached https://files.pythonhosted.org/packages/76/ae/44b03b253d6fade317f32c24d100b3b35c2239807046a4c953c7b89fa49e/itsdangerous-1.1.0-py2.py3-none-any.whl
Collecting Werkzeug>=0.14 (from flask->archivenow)
  Using cached https://files.pythonhosted.org/packages/20/c4/12e3e56473e52375aa29c4764e70d1b8f3efa6682bef8d0aae04fe335243/Werkzeug-0.14.1-py2.py3-none-any.whl
Collecting Jinja2>=2.10 (from flask->archivenow)
  Using cached https://files.pythonhosted.org/packages/7f/ff/ae64bacdfc95f27a016a7bed8e8686763ba4d277a78ca76f32659220a731/Jinja2-2.10-py2.py3-none-any.whl
Collecting click>=5.1 (from flask->archivenow)
  Using cached https://files.pythonhosted.org/packages/fa/37/45185cb5abbc30d7257104c434fe0b07e5a195a6847506c074527aa599ec/Click-7.0-py2.py3-none-any.whl
Collecting idna<2.9,>=2.5 (from requests->archivenow)
  Using cached https://files.pythonhosted.org/packages/14/2c/cd551d81dbe15200be1cf41cd03869a46fe7226e7450af7a6545bfc474c9/idna-2.8-py2.py3-none-any.whl
Collecting urllib3<1.25,>=1.21.1 (from requests->archivenow)
  Using cached https://files.pythonhosted.org/packages/62/00/ee1d7de624db8ba7090d1226aebefab96a2c71cd5cfa7629d6ad3f61b79e/urllib3-1.24.1-py2.py3-none-any.whl
Collecting chardet<3.1.0,>=3.0.2 (from requests->archivenow)
  Using cached https://files.pythonhosted.org/packages/bc/a9/01ffebfb562e4274b6487b4bb1ddec7ca55ec7510b22e4c51f14098443b8/chardet-3.0.4-py2.py3-none-any.whl
Collecting certifi>=2017.4.17 (from requests->archivenow)
  Using cached https://files.pythonhosted.org/packages/9f/e0/accfc1b56b57e9750eba272e24c4dddeac86852c2bebd1236674d7887e8a/certifi-2018.11.29-py2.py3-none-any.whl
Collecting MarkupSafe>=0.23 (from Jinja2>=2.10->flask->archivenow)
  Using cached https://files.pythonhosted.org/packages/44/6e/41ac9266e3db762dfd9089f6b0d2298c84160f54ef2a7257c17b0e7ec2ec/MarkupSafe-1.1.0-cp37-cp37m-win_amd64.whl
Installing collected packages: itsdangerous, Werkzeug, MarkupSafe, Jinja2, click, flask, idna, urllib3, chardet, certifi, requests, archivenow
Successfully installed Jinja2-2.10 MarkupSafe-1.1.0 Werkzeug-0.14.1 archivenow-2019.1.5.2.19.34 certifi-2018.11.29 chardet-3.0.4 click-7.0 flask-1.0.2 idna-2.8 itsdangerous-1.1.0 requests-2.21.0 urllib3-1.24.1

I try to run commands as in the examples.

4. Expected behavior

Save web-pages on archiving services.

5. Actual behavior

I get Error: No enabled archive handler found every time.

(archivenowenv) D:\SashaDebugging>archivenow kristinita.ru

 Error: No enabled archive handler found


(archivenowenv) D:\SashaDebugging>archivenow google.com

 Error: No enabled archive handler found


(archivenowenv) D:\SashaDebugging>archivenow -all kristinita.ru

 Error: No enabled archive handler found


(archivenowenv) D:\SashaDebugging>archivenow -ia google.com

 Error: No enabled archive handler found

Thanks.

ImportError: No module named pathlib

In Ubuntu 18.04
pip install archivenow
...
Successfully installed Jinja2-2.11.1 MarkupSafe-1.1.1 Werkzeug-1.0.0 archivenow-2019.7.27.2.35.46 certifi-2019.11.28 chardet-3.0.4 click-7.1.1 flask-1.1.1 idna-2.9 itsdangerous-1.1.0 requests-2.23.0 urllib3-1.25.8
I tried running a test with
archivenow --all https://nypost.com/
The response was:

Traceback (most recent call last):
  File "/home/myusername/.local/bin/archivenow", line 7, in <module>
    from archivenow.archivenow import args_parser
  File "/home/myusername/.local/lib/python2.7/site-packages/archivenow/archivenow.py", line 13, in <module>
    from pathlib import Path
ImportError: No module named pathlib

Self-report module version number

In #7 I had to resort to pip to verify the version of the library I was using. The version is reported on installation, but I have found it common for a module to self-report its version.

Allow archivenow -v and archivenow --version to print the version of the module to stdout. This should help with debugging.
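
With argparse this could be done using the built-in "version" action. A minimal sketch, where the version string is a placeholder and build_parser is a hypothetical name:

```python
import argparse

__version__ = "0.0.0"  # placeholder; the real value would come from the package


def build_parser() -> argparse.ArgumentParser:
    """Return a parser that answers -v / --version."""
    parser = argparse.ArgumentParser(prog="archivenow")
    # action="version" prints the version string and exits with status 0.
    parser.add_argument(
        "-v", "--version",
        action="version",
        version="%(prog)s " + __version__,
    )
    return parser
```

On Python 3.8 and later the version string is written to stdout, which matches the request above.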

Better defaults in the UI

For better user experience you might want to:

  • make the first three checkboxes checked by default
  • add a link to the page where Perma.cc API key can be generated, and
  • make the API key persist in user's browser's localstorage (if entered)

Docker and archive.today

The docker image contains a very outdated version of /app/archivenow/handlers/is_handler.py

There are three versions of this file: The one in Docker (original), the one in pip install (more updates), and the one on this repo (fully redone with Selenium support).

If you have a key to avoid captchas the one you want is in the pip install. If you have no key, you can try Selenium support but some users have reported it unsuccessful.

Archive images in IA

It would be nice if the tool would also archive embedded content for Internet Archive requests. This could be done by downloading the archived page and searching for any /save/_embed/[^"'<>\(\)]* URLs in the page source.

(It would also be nice if the tool could download lazy-loaded files and/or any linked media files, although even the Wayback Machine can't really do that in many cases.)
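
The search step could be sketched like this, using the pattern quoted above. find_embed_urls is a hypothetical helper, and prefixing the matches with https://web.archive.org is an assumption about where those paths live.

```python
import re

# Pattern quoted from the issue text.
EMBED_PATTERN = re.compile(r"""/save/_embed/[^"'<>\(\)]*""")


def find_embed_urls(page_source: str, base: str = "https://web.archive.org") -> list:
    """Return absolute /save/_embed/ URLs found in the page, without duplicates."""
    seen = []
    for path in EMBED_PATTERN.findall(page_source):
        url = base + path
        if url not in seen:
            seen.append(url)
    return seen
```

Each returned URL would then need to be requested in turn to trigger the capture of that embedded resource.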

Will submit pr: submit ghostarchive.org

Cool site that uses webrecorder render and also archives videos.

The endpoint for submitting an archive is "/archive", and it is a POST request. Once the request is submitted, it will redirect (302) you to the URL where the archive will be stored.

I will submit a PR myself, but are there any objections before I do?

Archiving resources with relative Content-Location

archivenow --ia https://www.w3.org/TR/webarch/
https://web.archive.orgOverview.html

See also from curl where a resource returns Content-Location:

curl -I https://www.w3.org/TR/webarch/
content-location: Overview.html

in comparison to the ones that don't:

curl -I http://csarven.ca/

So, when I do something like:

curl -ki 'https://web.archive.org/save/https://www.w3.org/TR/webarch/' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:69.0) Gecko/20100101 Firefox/69.0' -H 'Accept: */*' -H 'Accept-Language: en-CA,en;q=0.7,en-US;q=0.3' --compressed -H 'Referer: https://localhost:8443/' -H 'Origin: https://localhost:8443' -H 'Connection: keep-alive' -H 'DNT: 1' -H 'Pragma: no-cache' -H 'Cache-Control: no-cache'

I get:

content-location: Overview.html

And that kind of screws things up for me because I can't figure out the actual snapshot location from the headers. It's okay if a JS-enabled agent is making the request because it eventually redirects, but that's not what I want: I'm making this call from a client-side application and only want to work with headers (or whatever proper structured data is available, as opposed to scraping).

This is in comparison to say:

curl -ki 'https://web.archive.org/save/http://csarven.ca/' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:69.0) Gecko/20100101 Firefox/69.0' -H 'Accept: */*' -H 'Accept-Language: en-CA,en;q=0.7,en-US;q=0.3' --compressed -H 'Referer: https://localhost:8443/' -H 'Origin: https://localhost:8443' -H 'Connection: keep-alive' -H 'DNT: 1' -H 'Pragma: no-cache' -H 'Cache-Control: no-cache'

which gives a nice workable:

content-location: /web/20190708123256/http://csarven.ca/

Ideas?
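
One defensive option, sketched below, is to accept the Content-Location header only when it looks like a Wayback path and otherwise treat the snapshot location as unknown. This does not recover the snapshot URL in the Overview.html case; it only avoids producing results like https://web.archive.orgOverview.html. The function name is hypothetical.

```python
from typing import Optional

WAYBACK_HOST = "https://web.archive.org"


def snapshot_from_content_location(value: Optional[str]) -> Optional[str]:
    """Return an absolute snapshot URL, or None if the header is not a Wayback path."""
    if value and value.startswith("/web/"):
        return WAYBACK_HOST + value
    # A relative value such as "Overview.html" comes from the original
    # resource, so it cannot be used to locate the snapshot.
    return None
```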

archive.today fails, the site presents a captcha challenge

When trying to archive a URL to archive.today through archivenow --is URL, it always returns:

Error (The Archive.is): 429 Client Error: Too Many Requests for url: https://archive.is/submit/

I have Firefox and geckodriver installed and available in my PATH.

When submitting a URL on the site regularly through a browser, the site returns 429 on submit and requires the completion of a reCAPTCHA challenge, and then proceeds to archive the URL.

[Feature request] Add retry logic

Description

The program should have a retry logic in case the request to the archive service fails. In my experience, this happens a lot with The Internet Archive. For example:

$ archivenow --ia --is https://twitter.com/Itaoka1/status/494145244540063745
Error (The Internet Archive): HTTPSConnectionPool(host='web.archive.org', port=443): Max retries exceeded with url: /save/https://twitter.com/Itaoka1/status/494145244540063745 (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f6d61a34d10>, 'Connection to web.archive.org timed out. (connect timeout=120)'))
https://archive.li/wip/v4SzI

I would prefer that the command does not complete before it actually succeeds with the requests to all of the given archive services, or at least before a certain number of maximum retries (per service) is reached. The retry count should be configurable, via a command line option (e.g. --max-retries 20), and it should have a reasonable default (5?) in case the option isn’t given by the user.

Currently, the user has to manually issue new archivals for the services for which the request was unsuccessful.
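
A rough sketch of the requested behaviour, assuming a failed push is reported as a string beginning with "Error" (as in the output above). push_with_retries and its parameters are hypothetical, and push_once stands for any callable that submits one URI to one archive.

```python
import time


def push_with_retries(push_once, uri, max_retries=5, delay_seconds=0.0):
    """Call push_once(uri) until it stops reporting an error or retries run out."""
    result = push_once(uri)
    retries = 0
    while result.startswith("Error") and retries < max_retries:
        time.sleep(delay_seconds)  # pause between attempts
        result = push_once(uri)
        retries += 1
    return result
```

The last result is returned either way, so the caller still sees the final error message if every attempt fails.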

Add a Dockerfile

In order to deploy it on our server, please add a Dockerfile in it and also add a corresponding image in DockerHub.

Problems with pushing mementos into Internet Archive

I noticed this when I was using ArchiveNow this morning.

# archivenow www.foxnews.com
Error (The Internet Archive): 445 Client Error:  for url: https://web.archive.org/save/www.foxnews.com

If I add a user agent to the arguments to the requests.get on line 15 of archivenow/archivenow/handlers/ia_handler.py then it works.

r = requests.get(uri, timeout=120, allow_redirects=True)

I'm uncertain as to how you want to handle the user specifying their own user agent. The existing --agent argument appears to be for specifying which tool the user desires to employ for creating WARCs. Also, there doesn't appear to be a way to submit changes to any of the request headers in archivenow/archivenow.py.

As I'm calling ArchiveNow within Python code, I would prefer an available parameter to the push function on line 129 of archivenow/archivenow.py.

def push(URI, arc_id, p_args={}):
    global handlers
    global res_uris
    try:
        # push to all possible archives
        res_uris_idx = str(uuid.uuid4())
        res_uris[res_uris_idx] = []
        ### if arc_id == 'all':
            ### for handler in handlers:
                ### if (handlers[handler].api_required):
                    # pass args like key API
                    ### res.append(handlers[handler].push(str(URI), p_args))
                ### else:
                    ### res.append(handlers[handler].push(str(URI)))
        ### else:
        # push to the chosen archives
        threads = []
        for handler in handlers:
            if (arc_id == handler) or (arc_id == 'all'):
            ### if (arc_id == handler): ### and (handlers[handler].api_required):
                #res.append(handlers[handler].push(str(URI), p_args))
                #push_proxy( handlers[handler], str(URI), p_args, res_uris_idx)
                threads.append(Thread(target=push_proxy, args=(handlers[handler],str(URI), p_args, res_uris_idx,)))
            ### elif (arc_id == handler):
                ### res.append(handlers[handler].push(str(URI)))
        for th in threads:
            th.start()
        for th in threads:
            th.join()
        res = res_uris[res_uris_idx]
        del res_uris[res_uris_idx]
        return res
    except:
        del res_uris[res_uris_idx]
        pass
    return ["bad request"]

For example, we could have:

def push(URI, arc_id, p_args={}, headers={}):

where the user can override any of the request headers by assigning them as a dictionary to the headers parameter. This dictionary would have to be re-submitted through the code on line 154 to the function executed via multithreading.

I haven't submitted a pull request yet because all handlers would need to be updated to receive and act on this parameter. I'm not sure of the implications of that.
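
To illustrate the header override only (not the handler changes), a handler could merge caller-supplied headers over a default set before making its request. DEFAULT_HEADERS, its User-Agent value, and merge_headers are all hypothetical.

```python
# Hypothetical default; the actual user agent string is not specified in this issue.
DEFAULT_HEADERS = {"User-Agent": "archivenow (hypothetical default)"}


def merge_headers(overrides=None):
    """Return request headers with caller-supplied values taking precedence."""
    merged = dict(DEFAULT_HEADERS)   # copy so the defaults are never mutated
    merged.update(overrides or {})   # None or {} leaves the defaults untouched
    return merged
```

A handler could then pass headers=merge_headers(headers) to its requests.get call. Using None rather than {} as the default avoids sharing one mutable dictionary between calls.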

Handle case where no optional parameters are specified

I attempted to specify no optional parameters but simply the URI positional parameter via:

archivenow http://some-urir

and was supplied the command-line help functionality. It would be better to handle this usage in a smarter manner, i.e., triggering the "--all" or "--ia" flags when no archive is explicitly specified.
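
A minimal argparse sketch of that fallback, choosing "--ia" as the default. Only three of the flags are modelled, and parse_args is a hypothetical stand-in rather than the tool's actual parser.

```python
import argparse


def parse_args(argv):
    """Parse a reduced set of archivenow-style flags, defaulting to --ia."""
    parser = argparse.ArgumentParser(prog="archivenow")
    parser.add_argument("uri")
    parser.add_argument("--ia", action="store_true", help="Internet Archive")
    # "is" is a Python keyword, so the value is stored under "is_".
    parser.add_argument("--is", dest="is_", action="store_true", help="archive.today")
    parser.add_argument("--all", action="store_true", help="all archives")
    args = parser.parse_args(argv)
    # Fall back to the Internet Archive when no archive flag was given.
    if not (args.ia or args.is_ or args.all):
        args.ia = True
    return args
```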

documentation incorrect on how to pass parameters to a handler

Current readme suggests this code for use in Python:
archivenow.push("www.foxnews.com","cc","cc_api_key=$YOUR-Perma-cc-API-KEY")
But actually, when push() is called directly, it seems to expect additional parameters in object form, e.g.:
archivenow.push("www.foxnews.com","cc",{"cc_api_key":"$YOUR-Perma-cc-API-KEY"})

Add Support for Megalodon.jp

If possible, please add support for Megalodon.jp.


I've found some snippets of code around the Internet, but when I've tried making requests with the information from these projects, I always get the megalodon.jp URL as res.url and nothing useful in res.headers in the response from the server.

I've tried replicating the cookies back, but sometimes I get the error "「Cookieが無効な状態」" ("cookies are disabled") from their server, which means it is complaining about them.

Anyone have any thoughts on how to submit URLs to Megalodon.jp in Python?
