
richardg867 / waybackproxy


HTTP proxy for tunneling requests through the Internet Archive Wayback Machine

License: GNU General Public License v3.0

Python 92.01% Dockerfile 1.69% Shell 4.00% HTML 2.30%

waybackproxy's People

Contributors

cttynul, jmarler, nfinit, richardg867


waybackproxy's Issues

Config doesn't even work

[Screenshot: Screenshot_May_20_2022_08_51_14_AM]

I'd love to use this for browsing the NHL's website as it looked in the late 1990s and early 2000s... if config.py even worked! As you can see in the screenshot I have put in this issue, config won't even open because of a syntax error in line 1, resulting in me not really being able to change any sort of settings. What can I do in this situation?

Multiple problems

The proxy worked for about 20 seconds and then failed with "ERR_TUNNEL_CONNECTION_FAILED".

config.py doesn't actually do anything. You change the values, rebuild, and run it in Docker, and it still tries to run on the default port.

Support for ProtoWeb

WaybackProxy is an awesome project, I love it!

The people of ProtoWeb maintain a proxy service for manually restored websites from around 1996 to the early 2000s. I tested it and it seems like a nice service, but I don't know much more about the project.

Can you please consider adding an option to use ProtoWeb in WaybackProxy? It would be off by default, but fun to turn on to navigate a selection of more functional and better-looking websites from that time. It could also be useful for getting files from old FTP sites that ProtoWeb has archived and provides access to.

THIS NO LONGER WORKS! PLEASE FIX! 404

Right, now it's broken.
Nothing from the dev, but now it's not working at all.
Been at this crap for 2 straight days.
2022-09-22
Raspberry Pi, 32-bit Bullseye

Every time I load any page I get this:

pi@raspberrypi:~ $ sudo python3 ./WaybackProxy/waybackproxy.py
[-] Now listening on port 8888
[>] http://home.netscape.com/
[>] http://home.netscape.com/images/home_igloo.jpg
[>] http://home.netscape.com/_static/js/bundle-playback.js?v=21L7o4JU
[>] http://home.netscape.com/images/nav_home.gif
[>] http://home.netscape.com/inserts/images/n_sm.gif
[!] 404 NOT FOUND
[>] http://home.netscape.com/
[>] http://home.netscape.com/inserts/images/n_sm.gif
[>] http://home.netscape.com/images/nav_home.gif
[>] http://home.netscape.com/_static/js/bundle-playback.js?v=21L7o4JU
[>] http://home.netscape.com/inserts/images/bullet_sm.gif
[>] http://home.netscape.com/inserts/images/jimb_sm.gif
[>] http://home.netscape.com/inserts/images/tuneup.gif
[!] 404 NOT FOUND

Exception occurred during processing of request from ('192.168.1.131', 57627)
Traceback (most recent call last):
  File "/usr/lib/python3.9/socketserver.py", line 650, in process_request_thread
    self.finish_request(request, client_address)
  File "/usr/lib/python3.9/socketserver.py", line 360, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/usr/lib/python3.9/socketserver.py", line 720, in __init__
    self.handle()
  File "/home/pi/./WaybackProxy/waybackproxy.py", line 399, in handle
    self.send_passthrough(conn, http_version, content_type, request_url)
  File "/home/pi/./WaybackProxy/waybackproxy.py", line 408, in send_passthrough
    self.request.sendall(data)
BrokenPipeError: [Errno 32] Broken pipe

Any help at all?
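The traceback above ends in a BrokenPipeError raised by sendall, which generally means the client closed its side of the connection before the proxy finished writing. A minimal sketch of tolerating that case follows; `send_quietly` is a hypothetical helper and not part of WaybackProxy, and it does not address the separate 404 lines in the log.

```python
import socket


def send_quietly(sock, data):
    """Send data; return False instead of raising if the peer has gone away."""
    try:
        sock.sendall(data)
        return True
    except (BrokenPipeError, ConnectionResetError):
        # The browser hung up mid-response; nothing useful is left to send.
        return False


if __name__ == "__main__":
    left, right = socket.socketpair()
    print(send_quietly(left, b"hello"))
    left.close()
    right.close()
```

Whether silently dropping the error is acceptable depends on how much logging the proxy operator wants for aborted transfers.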

Error

Wops! Error opening config.json
Traceback (most recent call last):
  File "c:\Users\User\Downloads\WaybackProxy-master\WaybackProxy-master\waybackproxy.py", line 26, in <module>
    shared_state = SharedState()
                   ^^^^^^^^^^^^^
  File "c:\Users\User\Downloads\WaybackProxy-master\WaybackProxy-master\waybackproxy.py", line 17, in __init__
    self.availability_cache = lrudict.LRUDict(maxduration=86400, maxsize=1024) if WAYBACK_API else None
                                                                                  ^^^^^^^^^^^
NameError: name 'WAYBACK_API' is not defined
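The output above suggests that config.json could not be opened and that WAYBACK_API was then never defined. One way to avoid a NameError in that situation is to start from defaults and overlay whatever the file provides. This is only a sketch: `load_settings`, the `DEFAULTS` dictionary, and the `LISTEN_PORT` key are assumptions for illustration, not WaybackProxy's actual loader.

```python
import json

# Hypothetical defaults; only WAYBACK_API is a name taken from the traceback.
DEFAULTS = {"WAYBACK_API": True, "LISTEN_PORT": 8888}


def load_settings(path="config.json"):
    """Return the defaults, overridden by the JSON file when it is readable."""
    settings = dict(DEFAULTS)
    try:
        with open(path, encoding="utf-8") as handle:
            settings.update(json.load(handle))
    except (OSError, json.JSONDecodeError) as exc:
        print(f"Could not read {path}: {exc}; using defaults")
    return settings


if __name__ == "__main__":
    print(load_settings())
```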

Fix regex for //web-static.archive.org/_static/js/*

It seems the archive.org headers have changed and a new subdomain is now used. I will submit a fix once I manage to get it working. In the meantime I disabled JavaScript in my Netscape 4.0 settings :)

<html>
<head><script src="//archive.org/includes/analytics.js?v=cf34f82" type="text/javascript"></script>
<script type="text/javascript">window.addEventListener('DOMContentLoaded',function(){var v=archive_analytics.values;v.service='wb';v.server_name='wwwb-app228.us.archive.org';v.server_ms=536;archive_analytics.send_pageview({});});</script>
<script type="text/javascript" src="//web-static.archive.org/_static/js/bundle-playback.js?v=6XRi73ky" charset="utf-8"></script>
<script type="text/javascript" src="//web-static.archive.org/_static/js/wombat.js?v=txqj7nKC" charset="utf-8"></script>
<script>window.RufflePlayer=window.RufflePlayer||{};window.RufflePlayer.config={"autoplay":"on","unmuteOverlay":"hidden"};</script>
<script type="text/javascript" src="//web-static.archive.org/_static/js/ruffle.js"></script>
<script type="text/javascript">
  __wm.init("https://web.archive.org/web");
  __wm.wombat("http://www.arnes.si:80/","19970131060208","https://web.archive.org/","web","//web-static.archive.org/_static/",
	      "854690528");
</script>
<link rel="stylesheet" type="text/css" href="//web-static.archive.org/_static/css/banner-styles.css?v=S1zqJCYt" />
<link rel="stylesheet" type="text/css" href="//web-static.archive.org/_static/css/iconochive.css?v=qtvMKcIJ" />
<!-- End Wayback Rewrite JS Include -->
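A rough sketch of a pattern that matches the external script and stylesheet tags in the markup quoted above, on either archive.org or any subdomain such as web-static.archive.org. The pattern and the `strip_archive_assets` name are assumptions written against this sample only; they are not the expression WaybackProxy uses, and inline script blocks such as the __wm.init call are left untouched.

```python
import re

# Assumed pattern: <script>/<link> tags whose src/href points at
# //[subdomain.]archive.org/_static/... or //archive.org/includes/...
ARCHIVE_ASSET = re.compile(
    r'<(?:script|link)\b[^>]*?\b(?:src|href)="(?:https?:)?//'
    r'(?:[\w-]+\.)*archive\.org/(?:_static|includes)/[^"]*"[^>]*>'
    r'(?:\s*</script>)?',
    re.IGNORECASE,
)


def strip_archive_assets(html):
    """Remove Wayback-injected external script and stylesheet tags."""
    return ARCHIVE_ASSET.sub("", html)


if __name__ == "__main__":
    tag = ('<script type="text/javascript" '
           'src="//web-static.archive.org/_static/js/wombat.js?v=txqj7nKC" '
           'charset="utf-8"></script>')
    print(repr(strip_archive_assets(tag + "<p>page</p>")))
```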

Wayback proxy refuses to work on IE 4.0 on Windows 95

I'm really not sure if it's just me doing something wrong, but every time I attempt to access a webpage using WaybackProxy on Windows 95, it simply throws an error like "The connection with webpage could not initiated".

It works flawlessly on Windows 98, but refuses to work on Windows 95.

Random URL "leakage" from newer dates outside date tolerance

I am using waybackproxy to crawl pages saved on the Wayback Machine (because it was the easiest and fastest thing to set up).
However, I've noticed some random "leakage" from newer dates.
The proxy is set to January 1st, 2003. However, some pages from later years are randomly appearing.
For example:
As we all know, YouTube launched in 2005 and was bought by Google in 2006, but here it is in my data, showing up on a Google support page (which I doubt even existed in 2003!)

...
{
    "url": "https://support.google.com/",
    "title": "Google Help",
    "tags": [
        "center",
        "search",
        "youtube"
    ]
},
...

(The tags are based on the most common words on a page)
This isn't just a one off thing, as a bit further down...

...
{
    "url": "https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=http://support.google.com/&ec=GAZAdQ",
    "title": "Sign in - Google Accounts",
    "tags": [
        "use",
        "account",
        "email"
    ]
},
...

... we see a Google accounts page, which definitely was NOT a thing in 2003.
There are 41 occurrences of this after running the proxy and crawler for just two-ish minutes.

I don't see any other pages experiencing this "leakage", only Google pages. Is there any way to fix this?
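On the crawler side, one possible guard is to discard any page whose snapshot timestamp falls outside the configured window. The sketch below assumes the crawler can learn each page's 14-digit Wayback timestamp (the YYYYMMDDhhmmss form, as in "19970131060208" elsewhere on this page); `within_tolerance` and its parameters are hypothetical names, not part of WaybackProxy.

```python
from datetime import datetime, timedelta


def within_tolerance(snapshot_ts, target_date, tolerance_days):
    """True if a Wayback timestamp is within tolerance_days of target_date.

    snapshot_ts: 14-digit timestamp such as "20030615120000"
    target_date: 8-digit date such as "20030101"
    """
    snapshot = datetime.strptime(snapshot_ts[:8], "%Y%m%d")
    target = datetime.strptime(target_date, "%Y%m%d")
    return abs(snapshot - target) <= timedelta(days=tolerance_days)


if __name__ == "__main__":
    print(within_tolerance("20030615120000", "20030101", 365))
    print(within_tolerance("20190101000000", "20030101", 365))
```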

Whitelist wildcard

Today whitelists only allow specific domains, but not all the subdirectories and subdomains are included, so it would be very nice if one could include a domain wildcard like:

anything.apple.com or apple.com/anything
and not just apple.com

So it would bypass everything for that domain.
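A small sketch of what such wildcard matching could look like, using shell-style patterns from Python's fnmatch module. The `is_whitelisted` function and the pattern syntax ("*.apple.com", "apple.com/*") are assumptions for illustration and do not describe how WaybackProxy's whitelist currently works.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def is_whitelisted(url, patterns):
    """True if the URL's host, or host plus path, matches any pattern."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    target = host + parts.path
    return any(
        fnmatchcase(host, pattern) or fnmatchcase(target, pattern)
        for pattern in patterns
    )


if __name__ == "__main__":
    rules = ["apple.com", "*.apple.com", "apple.com/*"]
    print(is_whitelisted("http://images.apple.com/t/2002/us/en/i/1bg.gif", rules))
    print(is_whitelisted("http://example.com/", rules))
```

Listing "apple.com", "*.apple.com", and "apple.com/*" together covers the bare domain, its subdomains, and its paths.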

Wayback Proxy not connecting to websites, "request_url" error and WinError 10061

So, as of today and possibly tonight, any time I try to connect to a website it immediately throws a bunch of errors reading "urllib.error.URLError: <urlopen error [WinError 10061] No connection could be made because the target machine actively refused it>", loading only 2 or 3 images before everything errors out. It apparently also has errors with request_url.

Update: here's the whole error.

[>] http://twitter.com/
[!] Failed to fetch Wayback availability data
[!] Fetch exception:
Traceback (most recent call last):
  File "C:\Users\theci\AppData\Local\Programs\Python\Python312\Lib\urllib\request.py", line 1344, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "C:\Users\theci\AppData\Local\Programs\Python\Python312\Lib\http\client.py", line 1319, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "C:\Users\theci\AppData\Local\Programs\Python\Python312\Lib\http\client.py", line 1365, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "C:\Users\theci\AppData\Local\Programs\Python\Python312\Lib\http\client.py", line 1314, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "C:\Users\theci\AppData\Local\Programs\Python\Python312\Lib\http\client.py", line 1074, in _send_output
    self.send(msg)
  File "C:\Users\theci\AppData\Local\Programs\Python\Python312\Lib\http\client.py", line 1018, in send
    self.connect()
  File "C:\Users\theci\AppData\Local\Programs\Python\Python312\Lib\http\client.py", line 984, in connect
    self.sock = self._create_connection(
                ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\theci\AppData\Local\Programs\Python\Python312\Lib\socket.py", line 852, in create_connection
    raise exceptions[0]
  File "C:\Users\theci\AppData\Local\Programs\Python\Python312\Lib\socket.py", line 837, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\WaybackProxy-master\waybackproxy.py", line 197, in handle
    conn = urllib.request.urlopen(request_url)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\theci\AppData\Local\Programs\Python\Python312\Lib\urllib\request.py", line 215, in urlopen
    return opener.open(url, data, timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\theci\AppData\Local\Programs\Python\Python312\Lib\urllib\request.py", line 515, in open
    response = self._open(req, data)
               ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\theci\AppData\Local\Programs\Python\Python312\Lib\urllib\request.py", line 532, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\theci\AppData\Local\Programs\Python\Python312\Lib\urllib\request.py", line 492, in _call_chain
    result = func(*args)
             ^^^^^^^^^^^
  File "C:\Users\theci\AppData\Local\Programs\Python\Python312\Lib\urllib\request.py", line 1373, in http_open
    return self.do_open(http.client.HTTPConnection, req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\theci\AppData\Local\Programs\Python\Python312\Lib\urllib\request.py", line 1347, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [WinError 10061] No connection could be made because the target machine actively refused it>

It is missing one space before the "200 OK" and that breaks the HTTP response.

I was trying to use WaybackProxy with IE4, but got an error about unknown http response. So, I tried to use curl and got this:

* Unsupported response code in HTTP response
* Closing connection 0
curl: (1) Unsupported response code in HTTP response

Using netcat, I found that it was answering "HTTP/1.1200 OK" instead of "HTTP/1.1 200 OK" (that is, it was missing a space before the 200).
After adding the space, in waybackproxy.py#L415, it works fine.
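For illustration, an HTTP status line is the version, the status code, and the reason phrase separated by single spaces and terminated by CRLF. The helper below is a hypothetical sketch of building one; `status_line` is not a function from waybackproxy.py.

```python
def status_line(http_version, code, reason):
    """Build an HTTP status line: version SP code SP reason CRLF."""
    return f"{http_version} {code} {reason}\r\n".encode("ascii")


if __name__ == "__main__":
    print(status_line("HTTP/1.1", 200, "OK"))
```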

I created pull request #5

Some JavaScript scripts only work on newer browsers.

Some JavaScript scripts only work on browsers about as new as Internet Explorer 11 on Windows 8.1. They don't work at all in any browser older than IE 11. Is there a fix for this or will this be fixed in a future update?

For example, go to WWE.com in the year 2002 on IE 6 on Windows XP. When you hover over a menu, no options appear. Go to the same website on Pale Moon on Windows 11. The menus bring out options as you hover over them.

Bypass POST requests

I know that for some reason the WaybackProxy deals only with GET requests and that this is not compatible with POST requests.
However, it would be interesting to support POST requests on the Bypassed URLs (I host some old sites on my local cache and in some of them I often restore search mechanisms for example, and they do not work with waybackproxy bypass, so I always need to disable and then re-enable).
Does it sound doable?

Really old browsers aren't happy with the inserted Javascript

Hi, I set up WaybackProxy in a Docker container, and while it's serving up pages from the date specified, it seems to insert some JS at the start which is breaking things on IE 5 and Netscape 3. (The two browsers I tried, on Windows 98 and Mac OS 7.5.3 respectively.)

I looked through the docs and don't immediately see a way to stop this JS insertion. Am I missing a config option?

Thanks!

Requirement Issues

I set up the proxy on Windows and ran py -m pip install --user -r requirements.txt

Then I launch the proxy and get:

"WaybackProxy now requires urllib3 to be installed. Follow setup step 3 on the readme to fix this."

urllib3 IS installed, the proxy just can't use it.
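One thing worth ruling out in a case like this is that pip installed urllib3 for a different Python interpreter than the one launching the proxy. The sketch below prints which interpreter is running and where, if anywhere, it finds a module; `module_location` is a hypothetical helper, and it should be run with the same command used to start the proxy.

```python
import importlib.util
import sys


def module_location(name):
    """Return the file a top-level module would load from, or None if absent."""
    spec = importlib.util.find_spec(name)
    return spec.origin if spec else None


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("urllib3:", module_location("urllib3"))
```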

Images not being loaded

Hi,
First, thanks for the effort, did not know somebody went with making this awesome thing :D
Found it on that yt video, pretty cool :D

Anyway, I tried running it on my Pi Zero and am now running it in a Debian VM, but I still have the issue of images not being loaded.
It seems the URLs for images are not being forwarded to the browser properly.

This is what my output looks like when the script is running and while loading apple.com

[>] http://statse.webtrendslive.com/S139226/button6.asp?tagver=6&si=139226&offset=-800&fw=0&js=No&
[>] http://www.apple.com/main/css/fonts.css
[>] http://statse.webtrendslive.com/S130376/button6.asp?tagver=6&si=130376&offset=-800&fw=0&js=No&
[>] http://images.apple.com/t/2002/us/en/i/1bg.gif
[f] http://www.apple.com/main/css/fonts.css
[f] http://images.apple.com/t/2002/us/en/i/1bg.gif
[f] http://statse.webtrendslive.com/S139226/button6.asp?tagver=6&amp;si=139226&amp;offset=-800&amp;fw=0&amp;js=No&
[f] http://statse.webtrendslive.com/S130376/button6.asp?tagver=6&amp;si=130376&amp;offset=-800&amp;fw=0&amp;js=No&
[>] http://images.apple.com/t/2003/us/en/i/3.gif
[>] http://images.apple.com/t/2002/us/en/i/4.gif
[f] http://images.apple.com/t/2003/us/en/i/3.gif
[f] http://images.apple.com/t/2002/us/en/i/4.gif
[>] http://images.apple.com/t/2002/us/en/i/2.gif
[>] http://images.apple.com/t/2002/us/en/i/5.gif
[>] http://images.apple.com/t/2002/us/en/i/6.gif
[f] http://images.apple.com/t/2002/us/en/i/5.gif

Seems it does not use the original Wayback Machine URL but direct to the original source...same goes for google or yahoo...could you provide a little assistance? :)

I am using Safari on my PB G4 :)

Moron, Can't Figure it Out

I don't understand how to get this running on Windows. Whenever I open the proxy, it just closes instantly. What moronic thing am I doing wrong?

raspberry pi install

Hello,
I would like to know how to install this proxy on a Raspberry Pi 4.
Thanks!

YouTube does not like the proxy in IE11 + Archive Finding The Correct Dates

I have the date set to October 31 2005.
I type in http://www.youtube.com and it immediately doesn't load, citing "HTTP 501 Not Implemented/HTTP 505 Version Not Supported".

When trying to load a page like RuneScape, it will load a page from years earlier.

When loading Google.com it works just fine, but when I type YouTube into the search box and press Search, it says there are no snapshots within the date range in my config. I had it set to 365, and tried updating it to 500 but that didn't fix the problem.

Binary files don't load.

I'm using the proxy on Internet Explorer 6 running under Windows XP Service Pack 2. The proxy is running on Windows 10 21H2 and Python 3.6.8. My config is here: https://gist.github.com/wertercatt/c575c8cbf389eb7a6f8baac859c91457

When attempting to download a binary file through the proxy, such as http://download.microsoft.com:80/download/9/f/f/9ffc346d-55e9-420a-89fd-22d10d8f803f/ZooCardFlip.msi for example, the proxy kills the connection early and prevents the browser from actually completing the download.

I'm willing to provide any more details you need.

Support for WWW.

I don't know if it's only me, but most links on web.archive.org sites use www (for example, YouTube in 2005-2012), and when I open a website with HTTPS or WWW, it doesn't work. Could you please fix this?

OK, that's all.

UPDATE: Never mind, it was an issue with Chrome.

Unsupported HTTP version

Unable to determine where this is originating, but I'm not able to use the current build due to it.

colin@Colins-MacBook-Air~> curl -v http://apple.com -x http://192.168.1.5:8888
*   Trying 192.168.1.5:8888...
* Connected to 192.168.1.5 (192.168.1.5) port 8888 (#0)
> GET http://apple.com/ HTTP/1.1
> Host: apple.com
> User-Agent: curl/7.79.1
> Accept: */*
> Proxy-Connection: Keep-Alive
>
* Unsupported HTTP version in response
* Closing connection 0
curl: (1) Unsupported HTTP version in response
