hartator / wayback-machine-downloader
Download an entire website from the Wayback Machine.
License: Other
Hello.
I used wayback_machine_downloader http://vodka-veda.ru/ --timestamp 20140203115416 > 3.log to download, and got these errors:
http://vodka-veda.ru/javascripts/cufon-yui.js?1293704508 # Invalid argument @ rb_sysopen - websites/vodka-veda.ru/javascripts/cufon-yui.js?1293704508
...
Line 174: http://vodka-veda.ru/images/3d-format.png?1293704508 # Invalid argument @ rb_sysopen - websites/vodka-veda.ru/images/3d-format.png?1293704508
please help me :(
I have noticed that archive.org likes to send you to a custom page saying that the server the file is on is down, and this program faithfully downloads that page and treats it like an image file. So the result looks like a successful download when in fact it is not.
Interestingly enough, I am able to bring the site up on archive.org and those pictures are displayed just fine. I was wondering if iterating backwards until success might work. I guess the program would need to check each downloaded file for its magic bytes to make sure it is a jpg or gif, and if not, treat it as a 404 or failure and check previous snapshots for that file until one succeeds.
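A minimal sketch of that magic-byte idea in plain Ruby (the helper name and signature list are illustrative, not part of the downloader):

```ruby
# Sketch: decide whether a downloaded file really is a JPEG or GIF by
# inspecting its first bytes, instead of trusting the HTTP response.
def image_kind(path)
  head = File.binread(path, 6)
  return :jpg if head && head.start_with?("\xFF\xD8\xFF".b)
  return :gif if head && head.start_with?("GIF87a", "GIF89a")
  nil # not a known image type -> treat as a failed snapshot
end
```

A downloader could treat a nil result as a failure and fall back to an earlier snapshot of the same file, as suggested above.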
You just saved my ass! Great script. Thanks!
I'm trying to download a Gothic 3 walkthru from wayback and with an address of
http://www.worldofgothic.com/gothic3/?go=g3walkthrough
and I get an Invalid argument error when using the Start command prompt with ruby. If I use the interactive ruby, it tells me I have a syntax error, unexpected '?'.
Is there any way to get this to work using this program? Thanks for any help.
P.S. First time using this, so maybe I'm doing something wrong, but I don't think I am. I just type
wayback_machine_download http://www.worldofgothic.com/gothic3/?go=g3walkthrough at the command prompt. I was also trying to use --timestamp 20121202124458, but either way I think the ? mark causes the problem.
May I know the rule for malformed file urls?
I found the following urls are marked as malformed. Is it because of the "?" or the "%"?
"
Malformed file url, ignoring: http://www.example.com:80/search.htm?search_author=%E5%D0%D2%A3%C1%E9%B6%F9
Malformed file url, ignoring: http://example.com:80/search.htm?search_author=%E5%FB
Malformed file url, ignoring: http://www.example.com:80/search.htm?search_author=%E5%FB
"
I apologize if it's a newbie error, but I'm trying to download this cached website and I can't. Any help, please?
Thanks a lot to the devs, really nice tool!
[~]: wayback_machine_downloader http://web.archive.org/web/20130509062359/http://afrocubanlatinjazz4.blogspot.com.es/
Downloading http://web.archive.org/web/20130509062359/http://afrocubanlatinjazz4.blogspot.com.es/ to websites/web.archive.org/ from Wayback Machine...
No files to download. Possible reasons:
* Accept regex didn't let any files through (Accept Regex: "")
* Site is not in wayback machine.
I tried downloading myDomain.com, which used to be on WP. Not a major issue; I got almost everything.
Looks great, but the blog has the main blog page and none of the blog posts were downloaded. I cross-checked with the Archive to ensure that those posts were within the system, and they were.
I have /blog which contains index.html (the main blog page which showcases 4 blog posts) as well as 3 subdirectories /blog/page-2, /page-3, /page-4. All of these directories are empty.
I have not been able to do further testing yet to see how often this happens across multiple sites.
Some index.html files are being created empty, with an "incorrect header check" error reported during the download.
ruby 2.2.0p0
Is it possible for the script to iterate backwards through snapshots if a 404 is returned? A site I am recovering has quite a few blank pages in it, but a non-404 snapshot of the pages in question was available a couple of snapshots back.
Hello,
Thanks for creating this program. Is it possible to add filters, like downloading only pdf files? Or only pdf files that are named like "filename-with-numbers_name.pdf"?
The idea is to avoid downloading unnecessary files (.gif etc.).
Thanks in advance ;)
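Elsewhere in this tracker the --only option is given a Ruby-style regex (e.g. --only '/\.(html|htm)$/i'). A pdf filter of the same shape can be dry-run in plain Ruby first; the url list and the stricter name pattern below are made up for illustration:

```ruby
# Candidate filters for an "--only pdf" style run.
only_pdf   = /\.pdf$/i
only_named = /filename-with-numbers_\d+\.pdf$/i # illustrative name pattern

urls = [
  'http://example.com/docs/report.pdf',
  'http://example.com/img/logo.gif',
  'http://example.com/docs/filename-with-numbers_42.pdf'
]

urls.grep(only_pdf)   # keeps only the two .pdf urls
urls.grep(only_named) # keeps only the stricter match
```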
Hi,
I'm trying to download an entire site, but it appears that the script is downloading the oldest version of each file first from Wayback Machine, and doesn't rewrite over the old version. So basically, the site seems to be intact, but the old css has been pulled, parts of the pages are older versions. The site on 08-05-2015 is intact, so I basically need that date. Otherwise, the script seems to work beautifully, recreating the exact file structure from the original site (which is a mess).
Is there anything I can do on my end? Instead of using a timestamp that downloads from before a certain date, I need it to download after and overwrite existing files with the same name.
Is this possible?
Thanks, and thanks for an amazing script!
Michael
Hey there,
Nice piece of software!! :)
I found a bug though. When downloading item 4053, the file already existed as a single file, thus a folder could not be created.
Here's the error:
http://REDACTED.com/uncategorized/reflective-thoughts-on-marriage/ -> websites/REDACTED.com/uncategorized/reflective-thoughts-on-marriage/index.html (4052/48177)
http://REDACTED.com/uncategorized/doing-the-important-stuff/ -> websites/REDACTED.com/uncategorized/doing-the-important-stuff/index.html (4053/48177)
/usr/lib/ruby/1.9.1/fileutils.rb:1515:in `stat': No such file or directory - File exists - websites/REDACTED.com/www.REDACTED2.com (Errno::ENOENT)
from /usr/lib/ruby/1.9.1/fileutils.rb:1515:in `block in fu_each_src_dest'
from /usr/lib/ruby/1.9.1/fileutils.rb:1531:in `fu_each_src_dest0'
from /usr/lib/ruby/1.9.1/fileutils.rb:1513:in `fu_each_src_dest'
from /usr/lib/ruby/1.9.1/fileutils.rb:508:in `mv'
from /var/lib/gems/1.9.1/gems/wayback_machine_downloader-0.1.15/lib/wayback_machine_downloader.rb:116:in `rescue in structure_dir_path'
from /var/lib/gems/1.9.1/gems/wayback_machine_downloader-0.1.15/lib/wayback_machine_downloader.rb:109:in `structure_dir_path'
from /var/lib/gems/1.9.1/gems/wayback_machine_downloader-0.1.15/lib/wayback_machine_downloader.rb:83:in `block in download_files'
from /var/lib/gems/1.9.1/gems/wayback_machine_downloader-0.1.15/lib/wayback_machine_downloader.rb:66:in `each'
from /var/lib/gems/1.9.1/gems/wayback_machine_downloader-0.1.15/lib/wayback_machine_downloader.rb:66:in `download_files'
from /var/lib/gems/1.9.1/gems/wayback_machine_downloader-0.1.15/bin/wayback_machine_downloader:27:in `<top (required)>'
from /usr/local/bin/wayback_machine_downloader:23:in `load'
from /usr/local/bin/wayback_machine_downloader:23:in `<main>'
Here's an ls of the file
ubuntu@ip-172-30-0-198:~$ ll websites/REDACTED.com/www.REDACTED2.com
-rw-rw-r-- 1 ubuntu ubuntu 27286 Sep 4 07:44 websites/REDACTED.com/www.REDACTED2.com
(Had to start new post since other is closed..)
I tried using the \ to escape as suggested, and using quotes, but neither one worked in ruby. I got parsing errors at the ruby command prompt using escapes, and I get an invalid argument error when I use quotes.
Strider-2015
I'm trying to download a Gothic 3 walkthru from wayback and with an address of
http://www.worldofgothic.com/gothic3/?go=g3walkthrough
and I get an Invalid argument error when using the Start command prompt with ruby. If I use the interactive ruby, it tells me I have a syntax error, unexpected '?'.
Is there any way to get this to work using this program? Thanks for any help.
P.S. First time using this, so maybe I'm doing something wrong, but I don't think I am. I just type
wayback_machine_download http://www.worldofgothic.com/gothic3/?go=g3walkthrough at the command prompt. I was also trying to use --timestamp 20121202124458, but either way I think the ? mark causes the problem.
@MrOutis
MrOutis commented 2 days ago
wayback_machine_downloader http://www.worldofgothic.com/gothic3/\?go\=g3walkthrough
Try this, dude. You need to escape special signs like ? or =; this way bash won't interpret them and will take them as literals.
Another way to stop interpretation is by enclosing the line in single quotes:
wayback_machine_downloader 'http://www.worldofgothic.com/gothic3/?go=g3walkthrough'
@hartator
Owner
hartator commented 10 hours ago
@MrOutis is right, your shell will interpret ? as a wildcard character and will fail. There is not much we can change on our end.
@Strider-2015, any of @MrOutis solutions should work. Try escaping the special characters or adding quotes ('abc' or "abc") around your url.
Feel free to follow up here if it doesn't.
Hi, I get the following error while recovering a site from wayback machine:
/var/lib/gems/1.9.1/gems/wayback_machine_downloader-0.1.9/lib/wayback_machine_downloader.rb:66:in `split': invalid byte sequence in UTF-8 (ArgumentError)
from /var/lib/gems/1.9.1/gems/wayback_machine_downloader-0.1.9/lib/wayback_machine_downloader.rb:66:in `block in download_files'
from /var/lib/gems/1.9.1/gems/wayback_machine_downloader-0.1.9/lib/wayback_machine_downloader.rb:62:in `each'
from /var/lib/gems/1.9.1/gems/wayback_machine_downloader-0.1.9/lib/wayback_machine_downloader.rb:62:in `download_files'
from /var/lib/gems/1.9.1/gems/wayback_machine_downloader-0.1.9/bin/wayback_machine_downloader:23:in `<top (required)>'
from /usr/local/bin/wayback_machine_downloader:23:in `load'
from /usr/local/bin/wayback_machine_downloader:23:in `<main>'
the code at the line where the error is raised:
file_path_elements = file_id.split('/')
possible fix:
path = file_id.encode('utf-8', 'binary', :invalid => :replace, :undef => :replace)
file_path_elements = path.split('/')
Unfortunately, this error occurs 11k pages into the site and I don't know which page/file causes it. So, I don't know of an easy way to test if the fix works.
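The proposed fix can be exercised in isolation with a made-up byte string containing an invalid UTF-8 sequence, without re-running the 11k-page download:

```ruby
# A file id containing a byte (\xE9) that is not valid UTF-8 on its own.
file_id = "caf\xE9/page.html".b

# split raises ArgumentError on a string tagged UTF-8 with invalid bytes;
# re-encoding with :replace substitutes the replacement character and
# makes the string safe to split.
path = file_id.encode('UTF-8', 'binary', invalid: :replace, undef: :replace)
file_path_elements = path.split('/')
```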
Hi, I'm using Linux (maybe that has something to do with the timestamps), and when I use this script, it doesn't download the files with the original timestamps. This can be corrected by adding "id_" at the end of a snapshot's timestamp. For example:
http://web.archive.org/web/19970227062641/http://www5.yahoo.com/ (original)
http://web.archive.org/web/19970227062641id_/http://www5.yahoo.com/ (with id_)
I want the timestamps preserved for historical purposes--so that I capture the original date as well.
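The "id_" flag slots in directly after the timestamp; building such a URL is a one-liner (the helper name is mine, for illustration):

```ruby
# Build a Wayback Machine url that serves the original, unmodified
# capture: the "id_" suffix after the timestamp requests the raw file.
def raw_snapshot_url(timestamp, file_url)
  "http://web.archive.org/web/#{timestamp}id_/#{file_url}"
end

raw_snapshot_url('19970227062641', 'http://www5.yahoo.com/')
# -> "http://web.archive.org/web/19970227062641id_/http://www5.yahoo.com/"
```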
I've been trying this a few times; some sites download fine, while others download but are missing the design completely. The content is there, but the design is completely off. I'm assuming it's just not downloading the CSS files.
Am I missing something?
Doesn't correctly download UTF8 encoded files.
Hey,
Where are the downloaded websites saved? I searched in C:/Windows/System32 but found nothing. Could someone sort this out?
Getting this error when files have a particularly long name, such as when it has been a dynamic site with a long query string:
/usr/local/lib/ruby/2.2.0/open-uri.rb:36:in `initialize': File name too long @ rb_sysopen
The filename and the rest of the error:
/img.aspx?q=L3MkWGAkZmD3BGx3AQV5AwZ0AGD3BGR2BPHlAzpyZ3RkZQRyZwMyWGAkqJqaLlHlAGAhWGV1ZaZyZwHlp256pz50rKVhpTW6WGV1ZaZyZwMhWGAkZPHlAzZyZ3RjWGV2MJLyZ3RjWGV2MvHmpFHlAzIaWGAkZwNkAGN4ZGNjAmV2AGpyZwMwrFHmpGRkZGt5BFHlAaEaWGAkZQysZ2A1WGV2rPHmpFHlAzMapPHmpGNyZwMjqFHmpGD5ZwNmWGV2oabyZ3RjWGV2MaNyZ3R1AGLyZwMbozpyZ3RkWGV2qTLyZ3R2WGV2pUNyZ3SVEvHlAzAbWGAkZPHlAaSyWGAkozLgpJI2pF0lZwtjAGZ2Zmx0AwH1ZQx0-1 (Errno::ENAMETOOLONG)
from /usr/local/lib/ruby/2.2.0/open-uri.rb:36:in `open'
from /usr/local/lib/ruby/2.2.0/open-uri.rb:36:in `open'
from /usr/local/lib/ruby/gems/2.2.0/gems/wayback_machine_downloader-0.1.12/lib/wayback_machine_downloader.rb:83:in `block in download_files'
from /usr/local/lib/ruby/gems/2.2.0/gems/wayback_machine_downloader-0.1.12/lib/wayback_machine_downloader.rb:66:in `each'
from /usr/local/lib/ruby/gems/2.2.0/gems/wayback_machine_downloader-0.1.12/lib/wayback_machine_downloader.rb:66:in `download_files'
from /usr/local/lib/ruby/gems/2.2.0/gems/wayback_machine_downloader-0.1.12/bin/wayback_machine_downloader:23:in `<top (required)>'
from /usr/local/bin/wayback_machine_downloader:23:in `load'
from /usr/local/bin/wayback_machine_downloader:23:in `<main>'
Just noticed this for a site I recently downloaded: there are random numbers and symbols after the file extension, e.g. logo.jpg is downloaded as logo.jpg%3f1303921736.
I don't think it's completely random numbers and symbols; they follow a certain pattern, such as always starting with %3f.
It also seems that the new file name with the symbols is hard-coded into the html, so an image, for example, uses src="logo.jpg%3f1303921736", meaning just changing the name of the image file itself won't fix the issue.
So far I've noticed this with css and image files.
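One possible cleanup, sketched under the assumption (from the observation above) that the junk always begins with %3f, the URL-encoded '?'; the helper name and the websites/ glob are mine:

```ruby
# Remove a "%3f<query>" tail from a downloaded file name,
# e.g. "logo.jpg%3f1303921736" -> "logo.jpg".
def strip_wayback_suffix(name)
  name.sub(/%3f.*\z/i, '')
end

# Rename every affected file under the download directory.
Dir.glob('websites/**/*%3f*') do |path|
  clean = strip_wayback_suffix(path)
  File.rename(path, clean) unless File.exist?(clean)
end
```

As noted above, the html would still need the same substitution applied to its src/href attributes for pages to render.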
Hello,
I'm using your tool to download an old website of mine.
I'm running it without any options in order to download the last version of each file.
But this behaviour holds even when the last version is a 404 error page (as captured by the Wayback Machine).
Is it possible to download the last version of each file while skipping any 404 (or other error code), in order to obtain the last viable version of each page?
I've got this error trying to run it:
/usr/bin/wayback_machine_downloader:23:in `load': /usr/lib64/ruby/gems/1.8/gems/wayback_machine_downloader-0.1.6/bin/wayback_machine_downloader:22: syntax error, unexpected ':', expecting kEND (SyntaxError)
...achineDownloader.new base_url: base_url, timestamp: options[...
^
/usr/lib64/ruby/gems/1.8/gems/wayback_machine_downloader-0.1.6/bin/wayback_machine_downloader:22: syntax error, unexpected ':', expecting '='
...base_url: base_url, timestamp: options[:timestamp]
^
from /usr/bin/wayback_machine_downloader:23
I want to avoid downloading the pages below 23KB since they are 404 pages.
Also I want to limit the download to
http://www.userring.com/cricket-or-football/question-3695225
(instead of)
http://www.userring.com/cricket-or-football/question-3695225/3699293/3463987
So do I have to use:
wayback_machine_downloader http://www.userring.com --exclude "/.+/question-[0-9]/.+" or something else?
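Assuming --exclude takes a Ruby-style regex (as the slash-delimited syntax in the question suggests), the pattern can be dry-run locally. Note that [0-9] matches a single digit, so [0-9]+ is needed for ids like 3695225; one regex that keeps the question page but excludes its deeper sub-paths:

```ruby
keep    = 'http://www.userring.com/cricket-or-football/question-3695225'
discard = 'http://www.userring.com/cricket-or-football/question-3695225/3699293/3463987'

# Exclude urls where anything follows the question id.
exclude = %r{/question-[0-9]+/}

keep    =~ exclude # nil: url is kept
discard =~ exclude # matches: url is excluded
```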
Hey,
It would be super nice to have multiple downloads at the same time .. currently it takes foreeeeever to download a whole site copy because if one document is unresponsive, everything else suffers for it.
Thanks!
/Siggy
This is a very useful piece of software! Please add a license so that people will be legally able to add it to any Linux distribution. :)
The readme.md says that this downloader "will download the last version of every file present on Wayback Machine". But for the following url it is not downloading the latest version (nor the oldest):
$ wayback_machine_downloader http://tohokinemakan.jp/index.html -l
[
{"file_url":"http://tohokinemakan.jp:80/index.html","timestamp":20110918020620,"file_id":"index.html"},
]
By adding --from
option it is possible to find later version of the page:
$ wayback_machine_downloader http://tohokinemakan.jp/index.html -l -f 2012
[
{"file_url":"http://tohokinemakan.jp:80/index.html","timestamp":20120116225740,"file_id":"index.html"},
]
However it is not able to find the latest (20130211093834):
$ wayback_machine_downloader http://tohokinemakan.jp/index.html -l -f 2013
[
]
I'm using version 0.5.4 on Ubuntu 16.04
So this downloads everything ever uploaded to the Wayback Machine for the specific link, and combines them? Or am I misunderstanding? Does it also follow the other links connected on those pages, i.e. when a page takes you to a Facebook page or something else off the website?
Btw, you can close this right after a reply.
Hi
I have Windows 7 with Ruby installed (2.3.1).
I want to try to save the entire website http://forum.thebadasschoobs.org/ at timestamp 20160304221428.
So I try:
wayback_machine_downloader http://forum.thebadasschoobs.org/ -t 20160304221428
and I get an error.
I get the same error using:
wayback_machine_downloader http://forum.thebadasschoobs.org/ -t 20160304221428 -a
I think this pastebin will be helpful!
Where am I wrong?
Maybe I can also save all the pages without the php extension.
As stated in the title, I cannot download webarchive sites with a minus ("-") in the URL. I came across the problem when a friend of mine searched for an old site he liked and found it in the archives. (The site in question would be thieves-guild.net, which apparently went down sometime in 2014.)
With different versions of Ruby (2.2, 2.1, 2.0) on machines running Win 7, we got the following Invalid Argument Exception:
C:/Ruby21/lib/ruby/2.1.0/open-uri.rb:36:in 'initialize': Invalid argument @ rb_sysopen - websites/www.thieves-guild.net/index.php?pid=242&_ga=1.180943079.1977932896.1437794729 (Errno::EINVAL)
from C:/Ruby21/liv/ruby/2.1.0/open-uri.rb:36:in 'open'
from C:/Ruby21/liv/ruby/2.1.0/open-uri.rb:36:in 'open'
from C:/Ruby21/lib/ruby/gems/2.1.0/gems/wayback_machine_downloader-0.1.12/lib/wayback_machine_downloader.rb:83:in 'block in download_files'
from C:/Ruby21/lib/ruby/gems/2.1.0/gems/wayback_machine_downloader-0.1.12/lib/wayback_machine_downloader.rb:66:in 'each'
from C:/Ruby21/lib/ruby/gems/2.1.0/gems/wayback_machine_downloader-0.1.12/lib/wayback_machine_downloader.rb:66:in 'download_files'
from C:/Ruby21/lib/ruby/gems/2.1.0/gems/wayback_machine_downloader-0.1.12/lib/wayback_machine_downloader.rb:23:in '<top (required)>'
from C:/Ruby21/bin/wayback_machine_downloader:23:in 'load'
from C:/Ruby21/bin/wayback_machine_downloader:23:in '<main>'
Since I tried some other sites with a "-" in their URL and got similar Exceptions, I guess that this is the problem.
Greetings!
--only '/\.(html|htm)$/i'
--only is not working; I am getting this error:
'htm)$' is not recognized as an internal or external command, operable program or batch file.
Also, is there any way to exclude files? For example, excluding css and js files.
when I open Ruby and type "gem install wayback_machine downloader" into the prompt and hit enter, nothing happens. Can you please help? Would it have anything to do with where I extracted the wayback downloader file (desktop)?
Hi all,
Firstly thank you to the original coders of this. I've been looking for such a tool for ages!
I'm having an issue though (latest ubuntu x64, latest Ruby)
When trying to download a desired site I get this error:
jay@jay-VirtualBox:~/wayback-machine-downloader$ sudo wayback_machine_downloader http://www.sitename.co.uk
Downloading http://www.sitename.co.uk to websites/www.sitename.co.uk/ from Wayback Machine...
/usr/lib/ruby/1.9.1/net/http.rb:763:in `initialize': getaddrinfo: Name or service not known (SocketError)
from /usr/lib/ruby/1.9.1/net/http.rb:763:in `open'
from /usr/lib/ruby/1.9.1/net/http.rb:763:in `block in connect'
from /usr/lib/ruby/1.9.1/timeout.rb:55:in `timeout'
from /usr/lib/ruby/1.9.1/timeout.rb:100:in `timeout'
from /usr/lib/ruby/1.9.1/net/http.rb:763:in `connect'
from /usr/lib/ruby/1.9.1/net/http.rb:756:in `do_start'
from /usr/lib/ruby/1.9.1/net/http.rb:745:in `start'
from /usr/lib/ruby/1.9.1/open-uri.rb:306:in `open_http'
from /usr/lib/ruby/1.9.1/open-uri.rb:775:in `buffer_open'
from /usr/lib/ruby/1.9.1/open-uri.rb:203:in `block in open_loop'
from /usr/lib/ruby/1.9.1/open-uri.rb:201:in `catch'
from /usr/lib/ruby/1.9.1/open-uri.rb:201:in `open_loop'
from /usr/lib/ruby/1.9.1/open-uri.rb:146:in `open_uri'
from /usr/lib/ruby/1.9.1/open-uri.rb:677:in `open'
from /usr/lib/ruby/1.9.1/open-uri.rb:33:in `open'
from /var/lib/gems/1.9.1/gems/wayback_machine_downloader-0.2.0/lib/wayback_machine_downloader.rb:42:in `get_file_list_curated'
from /var/lib/gems/1.9.1/gems/wayback_machine_downloader-0.2.0/lib/wayback_machine_downloader.rb:72:in `get_file_list_by_timestamp'
from /var/lib/gems/1.9.1/gems/wayback_machine_downloader-0.2.0/lib/wayback_machine_downloader.rb:83:in `download_files'
from /var/lib/gems/1.9.1/gems/wayback_machine_downloader-0.2.0/bin/wayback_machine_downloader:32:in `<top (required)>'
from /usr/local/bin/wayback_machine_downloader:23:in `load'
from /usr/local/bin/wayback_machine_downloader:23:in `<main>'
I'm new to github, so it could very well be user error, but I'm pretty sure I did it correctly... although I did get errors during the install:
jay@jay-VirtualBox:~/wayback-machine-downloader$ sudo gem install wayback_machine_downloader
WARNING: Error fetching data: SocketError: getaddrinfo: Name or service not known (http://rubygems.org/latest_specs.4.8.gz)
Fetching: wayback_machine_downloader-0.2.0.gem (100%)
Successfully installed wayback_machine_downloader-0.2.0
1 gem installed
Installing ri documentation for wayback_machine_downloader-0.2.0...
Installing RDoc documentation for wayback_machine_downloader-0.2.0...
Any help or advice greatly appreciated.
The script works great, thank you for sharing your work! Since Windows can't have the '?' character in folder or file names, I am unable to save any archived pages with query strings (example.com/?this=that).
To work around this, I'm using Linux to run wayback_machine_downloader, then using 'rename' within Linux to change all occurrences in folder and file names of '?' to '@'. (This is how wget works for me on Windows) Then... copying the now compatible files over to windows to work with them and prep for upload to my webserver. (I use some php to check for a query string in the request, then check for files that start with '@' for display.)
Can you help me figure how to make the script convert '?' to '@' upon saving the files / directories so I can run this directly from Windows? Thanks!
If a bug appears at download number 40,000 and I have 45,000 files, I have to start from scratch.
It would be awesome if I could continue at file X (it would also make parallel downloads possible, if I could set a max of 5,000 or so per execution).
Many people use the Wayback Machine Downloader to download their websites, so everyone has tried some method to move the data onto a new server. If you know how, or have successfully imported the data, please tell others how you did it.
I can get the description inside the command prompt.
What I did:
I created an Example.rb file on my PC, entered the code below inside it and saved it.
require 'nokogiri'
require 'open-uri'
url = "C:/Users/Manikandan/websites/index.html"
data = Nokogiri::HTML(open(url))
puts data.at_css("#summaryDescription").text.strip
puts data.css(".postContent").text.strip
data.at_css('h1.title').text.strip
When running the file at the command prompt, I can get the details, but I get all the answers as a single paragraph. I'm not sure what I'm doing wrong; please help me.
It seems that specifying a timestamp causes this script to download the snapshot nearest to the timestamp, rather than the latest one before the timestamp (which is how it is described in the documentation).
Note that this mirrors the behavior of the Wayback Machine itself.
Now, I have literally never done anything with Ruby before, but I think I have some idea of what is going on after looking at the code.
There is a block of code at the beginning of wayback_machine_downloader.rb looking at and sorting timestamps. I don't 100% get what's going on there (like I said, never used Ruby before), but I think I get the idea after playing with the code for a bit.
But then in the end, there is the line:
open("http://web.archive.org/web/#{timestamp}id_/#{file_url}") do |uri|
Down here, "timestamp" refers to the original timestamp the user specified, not anything from the work above. So it just mimics the Wayback Machine's normal rounding behavior and bypasses the code above.
I was able to hackishly make it work on my end by pointing it to part of "file_list_curated", but I really don't trust posting my solution here since I'm a total Ruby novice :)
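For reference, the selection the reporter describes wanting, the newest snapshot at or before the requested timestamp, is small to express on its own (the helper name and snapshot list here are invented, not the gem's actual structures):

```ruby
# Pick the newest capture that is not newer than the requested timestamp,
# instead of the Wayback Machine's nearest-match rounding.
def snapshot_at_or_before(snapshots, wanted)
  snapshots.select { |t| t <= wanted }.max
end

snapshots = [19990101000000, 20141219170711, 20160101000000]
snapshot_at_or_before(snapshots, 20150101000000) # -> 20141219170711
```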
I can't seem to get --timestamp to work. I'm trying to download a copy of a site which was lost as the domain name expired and the latest version has only the hosting site parking page shown in the archive. I can go to the archive to get an older version of the site manually, but the downloader keeps grabbing the latest version.
h33sport.com is the site, and I'm trying to grab 20141219170711.
My gem is version 0.2.4.
Fairly new to Ruby, so excuse me if I'm being an idiot; I keep getting this error:
/usr/local/lib/ruby/gems/2.2.0/gems/wayback_machine_downloader-0.1.10/lib/wayback_machine_downloader.rb:3:in `require_relative': cannot load such file -- /usr/local/lib/ruby/gems/2.2.0/gems/wayback_machine_downloader-0.1.10/lib/tidy_bytes (LoadError)
from /usr/local/lib/ruby/gems/2.2.0/gems/wayback_machine_downloader-0.1.10/lib/wayback_machine_downloader.rb:3:in `<top (required)>'
from /usr/local/lib/ruby/gems/2.2.0/gems/wayback_machine_downloader-0.1.10/bin/wayback_machine_downloader:3:in `require_relative'
from /usr/local/lib/ruby/gems/2.2.0/gems/wayback_machine_downloader-0.1.10/bin/wayback_machine_downloader:3:in `<top (required)>'
from /usr/local/bin/wayback_machine_downloader:23:in `load'
from /usr/local/bin/wayback_machine_downloader:23:in `<main>'
Am I missing something? Any help appreciated. Great project, btw.
It would be helpful to be able to provide --after and --before flags so that only specific timeframes are downloaded.
An example use-case is archiving the state of a site across years, or being able to snapshot the evolution of a site.
Sure, we can set --timestamp to 1-year intervals, but downloading a 2001 timestamp and then downloading a 2002 timestamp results in a LOT of duplicate files.
Hi
For totally personal nostalgia-browsing reasons I ran this to grab a 97/2000 site (with an appropriate timestamp) that had lots and lots of files and many directories (and no dynamic content like a forum). The later incarnation of the site had dynamic content excessively captured (as well as the squatter), so the directories seem to stop after a certain letter, even though I know there are directories after that letter.
Is there a workaround for this?
It is taking so much time (in my case it's taking weeks!).
My website has around 100,000 (1 lakh) pages, and with all the cached versions the counter shows something like 234539, so it starts at (1/234539) for each url.
I just can't reduce the time frame, as this would eliminate many urls.
Is it possible to copy the whole archive including all timestamps available?
Thanks in advance!
Several users encounter issues when their websites used to contain absolute urls (e.g. http://example.com/style.css), making images, styles and even page links appear broken when trying to open the website from the downloaded copy. E.g.: #6
Unfortunately, I don't have time to work on this, so it's up for anyone to work on it!
The goal is to write a script that will rewrite every absolute url in every downloaded file to point to the local copies. For example, http://example.com/static/style.css should become './static/style.css', assuming a webpage at the root.
The challenges are that urls can be more complex than that, and the script needs to know where the local copies are relative to the current file's location. It should also be a CLI option, as not everyone wants urls rewritten.
Ask me any question!
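A rough starting point for anyone taking this on, a naive sketch that only handles the simple case described above (the helper, its regex, and the argument names are mine; real-world urls will need more care):

```ruby
require 'pathname'

# Rewrite absolute urls for one known host into paths relative to the
# directory of the file currently being processed.
#   html        - the file contents
#   host        - the original site's host, e.g. "example.com"
#   current_dir - directory of the file being rewritten
#   site_root   - root directory of the downloaded copy
def rewrite_absolute_urls(html, host, current_dir, site_root)
  html.gsub(%r{https?://#{Regexp.escape(host)}(/[^"'\s>]*)}) do
    absolute = Pathname.new(File.join(site_root, Regexp.last_match(1)))
    absolute.relative_path_from(Pathname.new(current_dir)).to_s
  end
end

rewrite_absolute_urls('<link href="http://example.com/static/style.css">',
                      'example.com', '/site', '/site')
# -> '<link href="static/style.css">'
```

For a file deeper in the tree, the same call produces "../"-prefixed paths, which covers the relative-location challenge mentioned above.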
This is the error I get:
C:/Ruby22/lib/ruby/gems/2.2.0/gems/wayback_machine_downloader-0.2.4/lib/wayback_machine_downloader.rb:25:in `+': no implicit conversion of nil into String (TypeError)
from C:/Ruby22/lib/ruby/gems/2.2.0/gems/wayback_machine_downloader-0.2.4/lib/wayback_machine_downloader.rb:25:in `backup_path'
from C:/Ruby22/lib/ruby/gems/2.2.0/gems/wayback_machine_downloader-0.2.4/lib/wayback_machine_downloader.rb:81:in `download_files'
from C:/Ruby22/lib/ruby/gems/2.2.0/gems/wayback_machine_downloader-0.2.4/bin/wayback_machine_downloader:32:in `<top (required)>'
from C:/Ruby22/bin/wayback_machine_downloader:23:in `load'
from C:/Ruby22/bin/wayback_machine_downloader:23:in `<main>'
Windows 10. Ruby 2.2. Tried with Ruby 1.9.3. Didn't work.
Fix it, please.
Hello,
Thanks a lot for this GREAT tool; it saved the old website that I wanted to restore.
I just have one question, not really related, but maybe you can help.
I would like to import the downloaded archive back into Wordpress. Do you have any idea how I can do this?
Thanks again!
The simple solution would be, while going through the file names, to rename every '?' to %3F (the URL encoding for '?').
I would also recommend changing '=' to %3D, as that may become an issue as well.
(This is on windows, not Linux)
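The renaming rule suggested above is easy to isolate (the helper name is mine):

```ruby
# Replace the characters Windows forbids in file names with their
# percent-encodings, as suggested: '?' -> %3F and '=' -> %3D.
def windows_safe_name(name)
  name.gsub('?', '%3F').gsub('=', '%3D')
end

windows_safe_name('index.php?pid=242') # -> "index.php%3Fpid%3D242"
```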
Be advised: this comic contains violence.
Windows 10, Ruby 2.2
So I tried:
C:\wrk\stark>wayback_machine_downloader http://starkreality.smackjeeves.com/home/
Downloading http://starkreality.smackjeeves.com/home/ to websites/starkreality.smackjeeves.com/ from Wayback Machine
...
http://starkreality.smackjeeves.com:80/home/ -> websites/starkreality.smackjeeves.com/home/index.html (1/1)
Download complete, saved in websites/starkreality.smackjeeves.com/ (1 files)
C:\wrk\stark>wayback_machine_downloader http://starkreality.smackjeeves.com/Archivepage/
Downloading http://starkreality.smackjeeves.com/Archivepage/ to websites/starkreality.smackjeeves.com/ from Wayback
Machine...
http://starkreality.smackjeeves.com:80/archivepage/ -> websites/starkreality.smackjeeves.com/archivepage/index.html
(1/2)
http://starkreality.smackjeeves.com:80/Archivepage/ # websites/starkreality.smackjeeves.com/Archivepage/index.html a
lready exists. (2/2)
Download complete, saved in websites/starkreality.smackjeeves.com/ (2 files)
I have those 2 pages but that's it.
I tried to manually download an archived comic but that gives me a sole 0kb index.html
C:\wrk\stark>wayback_machine_downloader http://starkreality.smackjeeves.com/Sale
Downloading http://starkreality.smackjeeves.com/Sale to websites/starkreality.smackjeeves.com/ from Wayback Machine.
..
http://starkreality.smackjeeves.com/Sale # incorrect header check
http://starkreality.smackjeeves.com/Sale -> websites/starkreality.smackjeeves.com/Sale/index.html (1/2)
http://starkreality.smackjeeves.com:80/sale/ # websites/starkreality.smackjeeves.com/sale/index.html already exists.
(2/2)
Download complete, saved in websites/starkreality.smackjeeves.com/ (2 files)