oduwsdl / archivenow

A Tool To Push Web Resources Into Web Archives

License: MIT License

Python 58.99% HTML 40.31% Dockerfile 0.71%

web-archiving internet-archive

archivenow's Issues

Make the default host 0.0.0.0

Changing the default host to 0.0.0.0 would slim down the readme and cause fewer unexpected results.

Support the nesting of Hypothesis links as an archive link.
e.g. http://archive.today/?run=1&url=https://via.hypothes.is/https://www.agileservicemanifesto.org/
where http://archive.today/?run=1&url=https://via.hypothes.is/ is the prefix and https://www.agileservicemanifesto.org/ is the URL (encoded with encodeURIComponent in a JS function).
The same should apply to all supported archive sites.
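A minimal sketch of how such a nested link could be assembled, assuming Python's urllib.parse.quote as a stand-in for encodeURIComponent (the two differ slightly in which punctuation they leave unescaped). The function and constant names are hypothetical and are not part of archivenow.

```python
from urllib.parse import quote

# Prefix copied from the example in the issue text.
ARCHIVE_TODAY_VIA_PREFIX = "http://archive.today/?run=1&url=https://via.hypothes.is/"


def nested_archive_link(url, prefix=ARCHIVE_TODAY_VIA_PREFIX):
    """Append the percent-encoded target URL to an archive submission prefix."""
    # safe="" also encodes "/" and ":", similar to encodeURIComponent.
    return prefix + quote(url, safe="")


print(nested_archive_link("https://www.agileservicemanifesto.org/"))
```

Other archives would need their own prefix; only the archive.today one appears in the issue.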

Archive sites in addition to submitting URIs

One of the use cases in https://github.com/webrecorder/warcit is to grab a site's contents using wget and then run the tool to create a WARC file from the local file contents. It would be useful for a tool called "archivenow" to do more than submit URIs and to perform some form of archiving itself.

I would like to propose replicating this model in archivenow, but in a single command. For example, running archivenow --warc=news.warc --agent=wget --ia http://cnn.com would use wget to create a WARC of cnn.com and store it locally at news.warc, and also submit the URI to IA.
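As a rough illustration of the local half of that proposal, the sketch below only builds the wget argument list; it does not run wget or contact IA. It assumes wget's --warc-file option (which takes a base name and adds the suffix itself) and --no-warc-compression so that the output is a plain .warc file. The function name is hypothetical.

```python
def build_wget_warc_command(uri, warc_path):
    """Return the wget argument list that would write a WARC for `uri`."""
    # wget appends ".warc" itself, so strip it from the requested file name.
    suffix = ".warc"
    base = warc_path[:-len(suffix)] if warc_path.endswith(suffix) else warc_path
    return ["wget", "--warc-file=" + base, "--no-warc-compression", uri]


print(" ".join(build_wget_warc_command("http://cnn.com", "news.warc")))
```

The resulting list could be handed to subprocess.run, with the URI submission to IA performed as a separate step.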

Web Service?

This is an amazing tool, thank you for building and publishing it! Do you by chance know if anyone is hosting a web service that uses this tool to let users paste a URL once and generate archives across all of the supported archive service providers in one go? That would be amazing. If not, I may be interested in building such a tool. Let me know what you think.

Reduce nesting

Too much nesting is generally a bad idea unless it is necessary. Try refactoring the code to reduce nesting in general. The file archivenow.py could use the glob module to select files matching a specific name pattern, which would reduce nesting and make the code more readable.
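A small sketch of the glob idea, assuming handler modules follow the *_handler.py naming seen elsewhere in this tracker (ia_handler.py, is_handler.py). The function name is hypothetical and this is not the project's actual loading code.

```python
import glob
import os


def find_handler_files(handlers_dir):
    """Return handler module paths matching the *_handler.py naming pattern."""
    pattern = os.path.join(handlers_dir, "*_handler.py")
    # sorted() gives a stable order regardless of the file system.
    return sorted(glob.glob(pattern))
```

A single pattern match replaces a directory walk with nested extension and name checks.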

perma.cc handler fails with 415 Unsupported Media Type

All use of the perma.cc handler via Python fails with errors of the following form:

['Error (The Perma.cc Archive): 415 Client Error: Unsupported Media Type for url: https://api.perma.cc/v1/archives/?api_key=asdfasdfasdfadf']

It looks like Perma.cc requires explicitly setting Content-Type to application/json, as per: https://perma.cc/docs/developer#create-an-archive
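To illustrate the header in question, here is a sketch that builds (but does not send) a POST request with Content-Type set to application/json, using only the standard library. The JSON body with a "url" field is an assumption about the Perma.cc payload, and the function name is hypothetical; the handler in archivenow may be structured differently.

```python
import json
import urllib.request


def build_perma_request(api_key, target_url):
    """Build a JSON POST request for the endpoint shown in the error above."""
    body = json.dumps({"url": target_url}).encode("utf-8")  # assumed payload shape
    return urllib.request.Request(
        "https://api.perma.cc/v1/archives/?api_key=" + api_key,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With the requests library, passing the payload through the json= keyword sets the same header automatically.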

Change tabs to spaces

Usually 4 spaces are used, but avoid using tabs for indentation completely.

archivenow --ia "http://www.hotcactus.nyc/" returns incorrect URL

archivenow --ia "http://www.hotcactus.nyc/" returns the following for me:

https://web.archive.org/web/20210723204229/https://www.google.com/maps/embed?pb=!1m0!3m2!1sen!2sus!4v1492711765912!6m8!1m7!1sUl8AEIci9YYO2dP_SwO1oQ!2m2!1d40.71478671950204!2d-73.99018606424495!3f303.3614672677981!4f-5.896130905148539!5f0.7820865974627469

Which is an embed on the site, but not the top level site itself — I'd expect it to return something like:

https://web.archive.org/web/20210723204228/http://www.hotcactus.nyc/

instead.

Restructure the response JSON

I would suggest that the response

{
	"results": [
		"https://web.archive.org/web/20170209143327/http://www.foxnews.com",
		"http://archive.is/H2Yfg",
		"http://www.webcitation.org/6o9Jubykh",
		"Error (The Perma.cc Archive): An API KEY is required"
	]
}

should be changed to

{
	"uri": "http://www.foxnews.com",
	"request-datetime": "20170209143321",
	"mementos": {
		"web.archive.org": "https://web.archive.org/web/20170209143327/http://www.foxnews.com",
		"archive.is": "http://archive.is/H2Yfg",
		"webcitation.org": "http://www.webcitation.org/6o9Jubykh",
		"perma.cc": "Error: An API KEY is required"
	}
}
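A sketch of how the flat list could be converted into the proposed shape. Memento keys are derived from each URL's host; error strings only carry a display name, so the ARCHIVE_HOSTS mapping below is a hypothetical lookup table, as is the function name.

```python
import re
from urllib.parse import urlparse

ERROR_RE = re.compile(r"^Error \((?P<name>[^)]*)\): (?P<msg>.*)$")
# Hypothetical mapping from the name used in error strings to a host key.
ARCHIVE_HOSTS = {"The Perma.cc Archive": "perma.cc"}


def restructure(uri, results, request_datetime):
    """Convert a flat result list into a dict keyed by archive host."""
    mementos = {}
    for item in results:
        match = ERROR_RE.match(item)
        if match:
            name = match.group("name")
            mementos[ARCHIVE_HOSTS.get(name, name)] = "Error: " + match.group("msg")
        else:
            host = urlparse(item).netloc
            if host.startswith("www."):
                host = host[4:]
            mementos[host] = item
    return {"uri": uri, "request-datetime": request_datetime, "mementos": mementos}
```

Keying by host lets a client look up one archive's result without scanning the list or parsing error text.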

502 bad gateway error

Hi,

I am getting the following error from both Python and CLI usage. archive.cli is working fine though.

Error (The Internet Archive): 502 Server Error: Bad Gateway for url: https://web.archive.org/save/www.google.co.in

I used it successfully last month. However, currently I am getting this error. I tried using the web page and it is working fine.

Thanks

Support Python 3

It is now safe and preferred to write code for Python 3 as Python 2 is reaching the end of its extended life. There is no need to support Python 2 anymore in libraries like these.

ModuleNotFoundError: No module named '__init__'

When installed on Heroku with pip here's the error I get.

from archivenow import archivenow

Triggers this:

Traceback (most recent call last):
  File "manage.py", line 22, in <module>
    execute_from_command_line(sys.argv)
  File "/app/.heroku/python/lib/python3.6/site-packages/django/core/management/__init__.py", line 364, in execute_from_command_line
    utility.execute()
  File "/app/.heroku/python/lib/python3.6/site-packages/django/core/management/__init__.py", line 356, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/app/.heroku/python/lib/python3.6/site-packages/django/core/management/base.py", line 283, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/app/.heroku/python/lib/python3.6/site-packages/django/core/management/base.py", line 330, in execute
    output = self.handle(*args, **options)
  File "/app/archive/management/commands/testis.py", line 10, in handle
    tasks.is_memento(clip.id)
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/local.py", line 191, in __call__
    return self._get_current_object()(*a, **kw)
  File "/app/.heroku/python/lib/python3.6/site-packages/celery/app/task.py", line 380, in __call__
    return self.run(*args, **kwargs)
  File "/app/archive/tasks.py", line 13, in is_memento
    from archivenow import archivenow
  File "/app/.heroku/python/lib/python3.6/site-packages/archivenow/archivenow.py", line 10, in <module>
    from __init__ import __version__ as archiveNowVersion
ModuleNotFoundError: No module named '__init__'
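The last frame of the traceback points at an implicit relative import (from __init__ import ...), which Python 3 no longer resolves inside a package. One possible fix is an explicit relative import. The sketch below demonstrates this on a throwaway package created in a temporary directory; demo_pkg and core.py are made-up names, not archivenow files.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def demo_relative_import():
    """Create a tiny package and read its version via an explicit relative import."""
    with tempfile.TemporaryDirectory() as root:
        pkg = Path(root) / "demo_pkg"
        pkg.mkdir()
        (pkg / "__init__.py").write_text("__version__ = '1.0'\n")
        # Explicit relative import: works when the module is imported as part of the package.
        (pkg / "core.py").write_text("from . import __version__ as pkg_version\n")
        sys.path.insert(0, root)
        try:
            importlib.invalidate_caches()
            core = importlib.import_module("demo_pkg.core")
            return core.pkg_version
        finally:
            sys.path.remove(root)
            sys.modules.pop("demo_pkg.core", None)
            sys.modules.pop("demo_pkg", None)


print(demo_relative_import())
```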

Webcitation is currently not accepting archiving requests.

We are currently not accepting archiving requests. The archival state/snapshots of websites that have been archived with WebCite in the past can still be accessed and cited.

From https://www.webcitation.org/

Archive Web Site

Can you add the ability to archive a complete web site:

spidering from a given directory to any depth or a specified depth
up to a certain depth for links outside the site

Some files may be documents (e.g. doc, pdf) that contain links.

bug(windows): Error: No enabled archive handler found

1. Summary

I can't get started with the archivenow CLI on Windows.

2. Environment

Windows 10 Enterprise LTSB 64-bit EN
Python 3.7.2
archivenow 2019.1.5.2.19.34

3. Steps to reproduce

I install archivenow into a virtual environment:

D:\SashaDebugging>mkvirtualenv archivenowenv
Using base prefix 'c:\\python37'
New python executable in C:\Users\SashaChernykh\Envs\archivenowenv\Scripts\python.exe
Installing setuptools, pip, wheel…
done.

(archivenowenv) D:\SashaDebugging>toggleglobalsitepackages

    Disabled global site-packages

(archivenowenv) D:\SashaDebugging>pip install archivenow
Collecting archivenow
  Using cached https://files.pythonhosted.org/packages/32/25/0d3051d362e2a42322e8716dc359557aeb143cc9ca6e5d19efad74a0f6d2/archivenow-2019.1.5.2.19.34-py2.py3-none-any.whl
Collecting flask (from archivenow)
  Using cached https://files.pythonhosted.org/packages/7f/e7/08578774ed4536d3242b14dacb4696386634607af824ea997202cd0edb4b/Flask-1.0.2-py2.py3-none-any.whl
Collecting requests (from archivenow)
  Using cached https://files.pythonhosted.org/packages/7d/e3/20f3d364d6c8e5d2353c72a67778eb189176f08e873c9900e10c0287b84b/requests-2.21.0-py2.py3-none-any.whl
Collecting itsdangerous>=0.24 (from flask->archivenow)
  Using cached https://files.pythonhosted.org/packages/76/ae/44b03b253d6fade317f32c24d100b3b35c2239807046a4c953c7b89fa49e/itsdangerous-1.1.0-py2.py3-none-any.whl
Collecting Werkzeug>=0.14 (from flask->archivenow)
  Using cached https://files.pythonhosted.org/packages/20/c4/12e3e56473e52375aa29c4764e70d1b8f3efa6682bef8d0aae04fe335243/Werkzeug-0.14.1-py2.py3-none-any.whl
Collecting Jinja2>=2.10 (from flask->archivenow)
  Using cached https://files.pythonhosted.org/packages/7f/ff/ae64bacdfc95f27a016a7bed8e8686763ba4d277a78ca76f32659220a731/Jinja2-2.10-py2.py3-none-any.whl
Collecting click>=5.1 (from flask->archivenow)
  Using cached https://files.pythonhosted.org/packages/fa/37/45185cb5abbc30d7257104c434fe0b07e5a195a6847506c074527aa599ec/Click-7.0-py2.py3-none-any.whl
Collecting idna<2.9,>=2.5 (from requests->archivenow)
  Using cached https://files.pythonhosted.org/packages/14/2c/cd551d81dbe15200be1cf41cd03869a46fe7226e7450af7a6545bfc474c9/idna-2.8-py2.py3-none-any.whl
Collecting urllib3<1.25,>=1.21.1 (from requests->archivenow)
  Using cached https://files.pythonhosted.org/packages/62/00/ee1d7de624db8ba7090d1226aebefab96a2c71cd5cfa7629d6ad3f61b79e/urllib3-1.24.1-py2.py3-none-any.whl
Collecting chardet<3.1.0,>=3.0.2 (from requests->archivenow)
  Using cached https://files.pythonhosted.org/packages/bc/a9/01ffebfb562e4274b6487b4bb1ddec7ca55ec7510b22e4c51f14098443b8/chardet-3.0.4-py2.py3-none-any.whl
Collecting certifi>=2017.4.17 (from requests->archivenow)
  Using cached https://files.pythonhosted.org/packages/9f/e0/accfc1b56b57e9750eba272e24c4dddeac86852c2bebd1236674d7887e8a/certifi-2018.11.29-py2.py3-none-any.whl
Collecting MarkupSafe>=0.23 (from Jinja2>=2.10->flask->archivenow)
  Using cached https://files.pythonhosted.org/packages/44/6e/41ac9266e3db762dfd9089f6b0d2298c84160f54ef2a7257c17b0e7ec2ec/MarkupSafe-1.1.0-cp37-cp37m-win_amd64.whl
Installing collected packages: itsdangerous, Werkzeug, MarkupSafe, Jinja2, click, flask, idna, urllib3, chardet, certifi, requests, archivenow
Successfully installed Jinja2-2.10 MarkupSafe-1.1.0 Werkzeug-0.14.1 archivenow-2019.1.5.2.19.34 certifi-2018.11.29 chardet-3.0.4 click-7.0 flask-1.0.2 idna-2.8 itsdangerous-1.1.0 requests-2.21.0 urllib3-1.24.1

I try to run commands as in the examples.

4. Expected behavior

Save web-pages on archiving services.

5. Actual behavior

I get Error: No enabled archive handler found every time.

(archivenowenv) D:\SashaDebugging>archivenow kristinita.ru

 Error: No enabled archive handler found


(archivenowenv) D:\SashaDebugging>archivenow google.com

 Error: No enabled archive handler found


(archivenowenv) D:\SashaDebugging>archivenow -all kristinita.ru

 Error: No enabled archive handler found


(archivenowenv) D:\SashaDebugging>archivenow -ia google.com

 Error: No enabled archive handler found

Thanks.

ImportError: No module named pathlib

In Ubuntu 18.04
pip install archivenow
...
Successfully installed Jinja2-2.11.1 MarkupSafe-1.1.1 Werkzeug-1.0.0 archivenow-2019.7.27.2.35.46 certifi-2019.11.28 chardet-3.0.4 click-7.1.1 flask-1.1.1 idna-2.9 itsdangerous-1.1.0 requests-2.23.0 urllib3-1.25.8
I tried running a test with
archivenow --all https://nypost.com/
The response was:

Traceback (most recent call last):
  File "/home/myusername/.local/bin/archivenow", line 7, in <module>
    from archivenow.archivenow import args_parser
  File "/home/myusername/.local/lib/python2.7/site-packages/archivenow/archivenow.py", line 13, in <module>
    from pathlib import Path
ImportError: No module named pathlib

archive.is responds with 409

Archive.is responds with 409 to requests issued through Heroku (i.e., from the major cloud computing providers).

Self-report module version number

In #7 I had to resort to pip to verify the version of the library I was using. The version is reported on installation, but it is common for a module to be able to self-report its version.

Allow archivenow -v and archivenow --version to print the version of the module to stdout. This should help with debugging.
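A minimal sketch of such a flag using argparse's built-in "version" action, which prints the string and exits. The __version__ value is a placeholder, and build_parser is a hypothetical name rather than the project's own parser function.

```python
import argparse

__version__ = "0.0.0"  # placeholder; the real value would come from the package


def build_parser():
    """Return a parser that understands -v / --version."""
    parser = argparse.ArgumentParser(prog="archivenow")
    parser.add_argument(
        "-v", "--version",
        action="version",
        version="%(prog)s " + __version__,
    )
    return parser
```

Calling build_parser().parse_args(["--version"]) prints the program name and version, then raises SystemExit with status 0.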

Wrong repo, disregard

Dynamically generate the list of archives in the splash page

The splash page in the server mode should not be static. The list of archives in the page should be dynamically generated based on configured archives (the enabled ones).

Better defaults in the UI

For better user experience you might want to:

make the first three checkboxes checked by default
add a link to the page where Perma.cc API key can be generated, and
make the API key persist in user's browser's localstorage (if entered)

How to fetch the original URL from the snapshot?

Hello,

I am trying to get the original URL from the snapshot, but it is not working with the requests library. Can you suggest an alternative?

pypi project "archivenow" links to maturban/archivenow instead of here

The readme file states that pip install archivenow can be used to install archivenow (https://github.com/oduwsdl/archivenow). However, on pypi (https://pypi.org/project/archivenow/), the website of archivenow is https://github.com/maturban/archivenow.

Docker and archive.today

The docker image contains a very outdated version of /app/archivenow/handlers/is_handler.py

There are three versions of this file: The one in Docker (original), the one in pip install (more updates), and the one on this repo (fully redone with Selenium support).

If you have a key to avoid captchas the one you want is in the pip install. If you have no key, you can try Selenium support but some users have reported it unsuccessful.

Archive images in IA

It would be nice if the tool would also archive embedded content for Internet Archive requests. This could be done by downloading the archived page and searching for any /save/_embed/[^"'<>]* URLs in the page source.
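The search step could look roughly like the sketch below, which applies the pattern from the issue to a page source string and returns the unique matches in order. Downloading the archived page and requesting each path are left out; the function name is hypothetical.

```python
import re

# Pattern taken from the issue text: stop at quotes or angle brackets.
EMBED_RE = re.compile(r"""/save/_embed/[^"'<>]*""")


def find_embed_paths(page_source):
    """Return unique /save/_embed/ paths in order of first appearance."""
    seen = []
    for path in EMBED_RE.findall(page_source):
        if path not in seen:
            seen.append(path)
    return seen
```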

(It would also be nice if the tool could download lazy-loaded files and/or any linked media files, although even the Wayback Machine can't really do that in many cases.)

Some Questions to consider

Observing this line, we can see that it "pushes" links, is it possible to queue multiple items into the archive (and get archives for each link)? https://github.com/oduwsdl/archivenow#python-usage
Can this process be done in a more async fashion, such that it would not halt other processes?
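One way to approach both questions from the caller's side, without changing archivenow, is to run the pushes in a thread pool. This is a sketch under that assumption: push_one is any callable that takes a URI and returns a result, and push_many is a hypothetical helper name.

```python
from concurrent.futures import ThreadPoolExecutor


def push_many(uris, push_one, max_workers=4):
    """Run push_one(uri) for every URI concurrently and map URI -> result."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {uri: pool.submit(push_one, uri) for uri in uris}
        # result() blocks until that URI's push has finished.
        return {uri: future.result() for uri, future in futures.items()}
```

For example, push_one could be lambda uri: archivenow.push(uri, "ia"). To avoid blocking the caller entirely, the futures could be returned instead of being awaited.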

Alternate archive.today URLs

For those who block certain URLs

How to set a proxy for it?

Will submit PR: submit ghostarchive.org

Cool site that uses Webrecorder for rendering and also archives videos.

The endpoint for submitting an archive is "/archive", and it is a POST request. Once the request is submitted, the server redirects (302) to the URL where the archive will be stored.

I will submit a PR myself, but are there any objections before I do?

Archiving resources with relative Content-Location

archivenow --ia https://www.w3.org/TR/webarch/
https://web.archive.orgOverview.html

See also from curl where a resource returns Content-Location:

curl -I https://www.w3.org/TR/webarch/
content-location: Overview.html

in comparison to the ones that don't:

curl -I http://csarven.ca/

So, when I do something like:

curl -ki 'https://web.archive.org/save/https://www.w3.org/TR/webarch/' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:69.0) Gecko/20100101 Firefox/69.0' -H 'Accept: */*' -H 'Accept-Language: en-CA,en;q=0.7,en-US;q=0.3' --compressed -H 'Referer: https://localhost:8443/' -H 'Origin: https://localhost:8443' -H 'Connection: keep-alive' -H 'DNT: 1' -H 'Pragma: no-cache' -H 'Cache-Control: no-cache'

I get:

content-location: Overview.html

And that kind of screws things up for me because I can't figure out the actual snapshot location from the headers. That's okay if a JS-enabled agent is making the request, because it eventually redirects, but that's not what I want: I'm making this call from a client-side application and only want to work with headers (or whatever proper structured data is available, as opposed to scraping).

This is in comparison to say:

curl -ki 'https://web.archive.org/save/http://csarven.ca/' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:69.0) Gecko/20100101 Firefox/69.0' -H 'Accept: */*' -H 'Accept-Language: en-CA,en;q=0.7,en-US;q=0.3' --compressed -H 'Referer: https://localhost:8443/' -H 'Origin: https://localhost:8443' -H 'Connection: keep-alive' -H 'DNT: 1' -H 'Pragma: no-cache' -H 'Cache-Control: no-cache'

which gives a nice workable:

content-location: /web/20190708123256/http://csarven.ca/

Ideas?
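A defensive sketch for the client side: accept the header only when it looks like a Wayback snapshot path (/web/ followed by a 14-digit timestamp), and otherwise report that no snapshot location could be derived instead of concatenating blindly. The pattern and function name are assumptions for illustration, not the handler's current logic.

```python
import re

# Snapshot paths in the examples above look like /web/<14-digit timestamp>/<original URI>.
MEMENTO_PATH = re.compile(r"^/web/\d{14}/")


def memento_from_content_location(value, base="https://web.archive.org"):
    """Return the snapshot URL, or None if the header is not a snapshot path."""
    if MEMENTO_PATH.match(value):
        return base.rstrip("/") + value
    # e.g. "Overview.html": a Content-Location passed through from the origin server.
    return None
```

This avoids output such as https://web.archive.orgOverview.html, but it does not recover the snapshot location when the header is unusable.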

archive.today fails, the site presents a captcha challenge

When trying to archive a URL to archive.today through archivenow --is URL, it always returns:

Error (The Archive.is): 429 Client Error: Too Many Requests for url: https://archive.is/submit/

I have Firefox and geckodriver installed and available in my PATH.

When submitting a URL on the site regularly through a browser, the site returns 429 on submit and requires the completion of a reCAPTCHA challenge, and then proceeds to archive the URL.

[Feature request] Add retry logic

Description

The program should have retry logic in case the request to the archive service fails. In my experience, this happens a lot with The Internet Archive. For example:

$ archivenow --ia --is https://twitter.com/Itaoka1/status/494145244540063745
Error (The Internet Archive): HTTPSConnectionPool(host='web.archive.org', port=443): Max retries exceeded with url: /save/https://twitter.com/Itaoka1/status/494145244540063745 (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7f6d61a34d10>, 'Connection to web.archive.org timed out. (connect timeout=120)'))
https://archive.li/wip/v4SzI

I would prefer that the command does not complete before it actually succeeds with the requests to all of the given archive services, or at least before a certain number of maximum retries (per service) is reached. The retry count should be configurable, via a command line option (e.g. --max-retries 20), and it should have a reasonable default (5?) in case the option isn’t given by the user.
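A sketch of the requested behavior for a single service, assuming failures are reported as strings starting with "Error" (as in the output above). push_once is any zero-argument callable that performs one attempt; the helper name, the linear back-off, and the default of 5 are illustrative choices.

```python
import time


def push_with_retries(push_once, max_retries=5, delay_seconds=1.0, sleep=time.sleep):
    """Call push_once() until it stops returning an error string or retries run out."""
    result = None
    for attempt in range(1, max_retries + 1):
        result = push_once()
        if not str(result).startswith("Error"):
            return result
        if attempt < max_retries:
            # Wait a little longer after each failed attempt.
            sleep(delay_seconds * attempt)
    return result  # still an error after max_retries attempts
```

A --max-retries command line option could feed the max_retries parameter directly.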

Currently, the user has to manually issue new archivals for the services for which the request was unsuccessful.

Add arquivo.pt as a source for URI submission

h/t @ibnesayeed
https://sobre.arquivo.pt/en/savepagenow-to-record-webpages-immediately-on-arquivo-pt/

Add a Dockerfile

In order to deploy it on our server, please add a Dockerfile in it and also add a corresponding image in DockerHub.

Problems with pushing mementos into Internet Archive

I noticed this when I was using ArchiveNow this morning.

# archivenow www.foxnews.com
Error (The Internet Archive): 445 Client Error:  for url: https://web.archive.org/save/www.foxnews.com

If I add a user agent to the arguments to the requests.get on line 15 of archivenow/archivenow/handlers/ia_handler.py then it works.

archivenow/archivenow/handlers/ia_handler.py

Line 15 in cafcbdd

r = requests.get(uri, timeout=120, allow_redirects=True)

I'm uncertain as to how you want to handle the user specifying their own user agent. The existing --agent argument appears to be for specifying which tool the user desires to employ for creating WARCs. Also, there doesn't appear to be a way to submit changes to any of the request headers in archivenow/archivenow.py.

As I'm calling ArchiveNow within Python code, I would prefer an available parameter to the push function on line 129 of archivenow/archivenow.py.

archivenow/archivenow/archivenow.py

Lines 129 to 168 in cafcbdd

def push(URI, arc_id, p_args={}):
    global handlers
    global res_uris
    try:
        # push to all possible archives
        res_uris_idx = str(uuid.uuid4())
        res_uris[res_uris_idx] = []
        ### if arc_id == 'all':
            ### for handler in handlers:
                ### if (handlers[handler].api_required):
                    # pass args like key API
                    ### res.append(handlers[handler].push(str(URI), p_args))
                ### else:
                    ### res.append(handlers[handler].push(str(URI)))
        ### else:
        # push to the chosen archives

        threads = []

        for handler in handlers:
            if (arc_id == handler) or (arc_id == 'all'):
                ### if (arc_id == handler): ### and (handlers[handler].api_required):
                #res.append(handlers[handler].push(str(URI), p_args))
                #push_proxy( handlers[handler], str(URI), p_args, res_uris_idx)
                threads.append(Thread(target=push_proxy, args=(handlers[handler],str(URI), p_args, res_uris_idx,)))
                ### elif (arc_id == handler):
                ### res.append(handlers[handler].push(str(URI)))

        for th in threads:
            th.start()
        for th in threads:
            th.join()

        res = res_uris[res_uris_idx]
        del res_uris[res_uris_idx]
        return res
    except:
        del res_uris[res_uris_idx]
        pass
    return ["bad request"]

For example, we could have:

def push(URI, arc_id, p_args={}, headers={}):

where the user can override any of the request headers by assigning them as a dictionary to the headers parameter. This dictionary would have to be re-submitted through the code on line 154 to the function executed via multithreading.

I haven't submitted a pull request yet because all handlers would need to be updated to receive and act on this parameter. I'm not sure of the implications of that.
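To make the proposal concrete, here is a self-contained sketch of a push function that merges caller-supplied headers over defaults and forwards them to each handler thread. EchoHandler is a stand-in that returns what it receives, the default User-Agent is a placeholder, and the handler signature with a headers argument is the hypothetical change under discussion, not the current API.

```python
from threading import Thread

# Placeholder default; a real deployment would choose its own User-Agent string.
DEFAULT_HEADERS = {"User-Agent": "archivenow-sketch/0.0"}


class EchoHandler:
    """Stand-in handler that reports the headers it was given."""

    def push(self, uri, p_args, headers):
        return "{} via {}".format(uri, headers["User-Agent"])


def push(uri, handlers, p_args=None, headers=None):
    """Push `uri` through every handler, forwarding caller-supplied headers."""
    merged = dict(DEFAULT_HEADERS)
    merged.update(headers or {})  # caller values override the defaults
    results = []

    def worker(handler):
        results.append(handler.push(uri, p_args or {}, merged))

    threads = [Thread(target=worker, args=(h,)) for h in handlers]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results
```

Using None instead of {} as the default also avoids sharing one mutable dictionary across calls.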

Handle case where no optional parameters are specified

I attempted to specify no optional parameters but simply the URI positional parameter via:

archivenow http://some-urir

and was supplied the command-line help functionality. It would be better to handle this usage in a smarter manner, i.e., triggering the "--all" or "--ia" flags when no archive is explicitly specified.
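A sketch of one way to apply such a default with argparse: after parsing, if no archive flag was given, fall back to --ia. Only the flags that appear in this tracker are modelled, and parse_args here is a hypothetical helper, not the project's args_parser.

```python
import argparse


def parse_args(argv):
    """Parse the URI and archive flags, defaulting to --ia when none is given."""
    parser = argparse.ArgumentParser(prog="archivenow")
    parser.add_argument("URI")
    parser.add_argument("--ia", action="store_true")
    # "is" is a Python keyword, so store the flag under a different attribute name.
    parser.add_argument("--is", dest="is_", action="store_true")
    parser.add_argument("--all", action="store_true")
    args = parser.parse_args(argv)
    if not (args.ia or args.is_ or args.all):
        args.ia = True  # fallback chosen for this sketch; --all would work the same way
    return args
```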

documentation incorrect on how to pass parameters to a handler

Current readme suggests this code for use in Python:
archivenow.push("www.foxnews.com","cc","cc_api_key=$YOUR-Perma-cc-API-KEY")
But actually, when push() is called directly, it seems to expect additional parameters in object form, e.g.:
archivenow.push("www.foxnews.com","cc",{"cc_api_key":"$YOUR-Perma-cc-API-KEY"})

Add Support for Megalodon.jp

If possible, please add support for Megalodon.jp.

I've found some snippets of code around the Internet but when I've tried doing requests with the information from these projects, I always get megalodon.jp URL as the res.url and nothing useful in the res.headers in the response from the server.

I've tried replicating the cookies back but sometimes I get the error from their server "「Cookieが無効な状態」" which means it is complaining about them.

Anyone have any thoughts on how to submit URLs to Megalodon.jp in Python?

Server version ignores query parameters

Server version seems to be ignoring query parameters when posting to archives. For example, https://www.youtube.com/watch?v=PEQEOP1zh_c becomes https://www.youtube.com/watch.

def push(URI, arc_id, p_args={}):
    global handlers
    global res_uris
    try:
        # push to all possible archives
        res_uris_idx = str(uuid.uuid4())
        res_uris[res_uris_idx] = []
        ### if arc_id == 'all':
            ### for handler in handlers:
                ### if (handlers[handler].api_required):
                    # pass args like key API
                    ### res.append(handlers[handler].push(str(URI), p_args))
                ### else:
                    ### res.append(handlers[handler].push(str(URI)))
        ### else:
        # push to the chosen archives

        threads = []

        for handler in handlers:
            if (arc_id == handler) or (arc_id == 'all'):
                ### if (arc_id == handler): ### and (handlers[handler].api_required):
                #res.append(handlers[handler].push(str(URI), p_args))
                #push_proxy( handlers[handler], str(URI), p_args, res_uris_idx)
                threads.append(Thread(target=push_proxy, args=(handlers[handler],str(URI), p_args, res_uris_idx,)))
                ### elif (arc_id == handler):
                ### res.append(handlers[handler].push(str(URI)))

        for th in threads:
            th.start()
        for th in threads:
            th.join()

        res = res_uris[res_uris_idx]
        del res_uris[res_uris_idx]
        return res
    except:
        del res_uris[res_uris_idx]
        pass
    return ["bad request"]
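One plausible cause (an assumption, not confirmed in the issue) is that the server takes the target URI from the request path, where the web framework has already split off everything after the "?". Under that assumption, the query string would need to be re-attached before calling push. The helper below is a hypothetical sketch; it accepts bytes because Flask exposes request.query_string as bytes.

```python
def rebuild_target_uri(path_uri, query_string):
    """Re-attach a query string that the framework separated from the path."""
    if isinstance(query_string, bytes):
        query_string = query_string.decode("utf-8")
    if query_string:
        return path_uri + "?" + query_string
    return path_uri


print(rebuild_target_uri("https://www.youtube.com/watch", b"v=PEQEOP1zh_c"))
```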
