wayback-machine-downloader's Issues

Failed image file downloads

I have noticed that archive.org sometimes sends you to a custom page saying the server that file is on is down, and this program faithfully downloads that page and treats it like an image file. The result looks like a successful download when in fact it is not.

Interestingly enough, I am able to bring the site up on archive.org and those pictures display just fine. I was wondering if iterating backwards until success might work. I guess the program would need to check each downloaded file's magic bytes to make sure it really is a JPG or GIF and, if not, treat it as a 404/failure and check previous snapshots for that file until one succeeds.
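
(A hedged sketch of the proposed check, not part of the tool: JPEG files begin with the bytes FF D8 and GIFs with the ASCII text "GIF8", so a downloaded "image" matching neither is likely one of those error pages. The retry-earlier-snapshot step would hook in wherever this returns false.)

# Verify a downloaded file really is a JPEG or GIF by its leading bytes.
def looks_like_image?(path)
  head = File.binread(path, 4) || ""
  head.bytes[0, 2] == [0xFF, 0xD8] || head.start_with?("GIF8")
end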

Can't download website that uses ? in the URL

I'm trying to download a Gothic 3 walkthrough from the Wayback Machine with an address of

http://www.worldofgothic.com/gothic3/?go=g3walkthrough

and I get an "Invalid argument" error when using the Start Command Prompt with Ruby. If I use interactive Ruby, it tells me I have a syntax error: unexpected '?'.

Is there any way to get this to work using this program? Thanks for any help.

P.S. This is my first time using this, so maybe I'm doing something wrong, but I don't think I am. I just type

wayback_machine_download http://www.worldofgothic.com/gothic3/?go=g3walkthrough at the command prompt. I also tried --timestamp 20121202124458, but either way I think the ? mark causes the problem.

Getting error: Accept regex didn't let any files through

I apologize if it's a newbie error, but I'm trying to download this cached website and I can't. Any help, please?
Thanks a lot to the devs, really nice tool!

[~]: wayback_machine_downloader http://web.archive.org/web/20130509062359/http://afrocubanlatinjazz4.blogspot.com.es/
Downloading http://web.archive.org/web/20130509062359/http://afrocubanlatinjazz4.blogspot.com.es/ to websites/web.archive.org/ from Wayback Machine...

No files to download. Possible reasons:
* Accept regex didn't let any files through (Accept Regex: "")
* Site is not in wayback machine.

Not able to download all URLs

I tried downloading myDomain.com, which used to run on WordPress. Not a major issue; I got almost everything.

It looks great, but the main blog page was downloaded while none of the individual blog posts were. I cross-checked with the Archive to ensure that those posts were in the system, and they were.

I have /blog, which contains index.html (the main blog page, showcasing 4 blog posts), as well as 3 subdirectories: /blog/page-2, /page-3, /page-4. All of these directories are empty.

I have not been able to do further testing yet to see how often this happens across multiple sites.

new state on 'nil' is invalid

While downloading a site that has 4.5K+ pages on archive.org, I got the line below:

new state on 'nil' is invalid

Here is a screenshot: [screenshot omitted: MINGW64 console, 2016-09-08 16:48]

(Using Docker on Windows 10.)

Handling 404

Is it possible for the script to iterate backwards through snapshots if a 404 is returned? A site I am recovering has quite a few blank pages in it, but non-404 snapshots of the pages in question were available a couple of snapshots back.
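
(A rough sketch of that fallback using the public Wayback CDX API, which can filter captures by HTTP status; the helper below is hypothetical and not part of this tool.)

require 'open-uri'
require 'json'

# List snapshot timestamps of a URL captured with HTTP 200, newest
# first, so a recovery script can walk backwards past 404 snapshots.
def ok_timestamps(url)
  cdx = "http://web.archive.org/cdx/search/cdx" \
        "?url=#{URI.encode_www_form_component(url)}" \
        "&output=json&filter=statuscode:200"
  rows = JSON.parse(open(cdx).read)
  rows.drop(1).map { |row| row[1] }.sort.reverse # drop header row; column 1 is the timestamp
end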

Request: download only ".pdf" files, for instance?

Hello,
Thanks for creating this program. Would it be possible to add filters, such as downloading only PDF files (by following the HTML?)?
Or only PDF files named like "filename-with-numbers_name.pdf"?

The idea is to avoid downloading files (.gif, etc.) unnecessarily.
Thanks in advance ;)
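
(For what it's worth, a later issue in this thread, "Only allow HTML files", shows an --only option taking a regex, so something like the following may already cover the PDF-only case, depending on version and shell quoting:)

wayback_machine_downloader http://example.com --only '/\.pdf$/i'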

Overwrite Older Files

Hi,

I'm trying to download an entire site, but it appears that the script downloads the oldest version of each file from the Wayback Machine first and doesn't overwrite it with newer versions. So the site seems to be intact, but the old CSS has been pulled and parts of the pages are older versions. The site on 08-05-2015 is intact, so I basically need that date. Otherwise, the script works beautifully, recreating the exact file structure of the original site (which is a mess).

Is there anything I can do on my end? Instead of using a timestamp that downloads from before a certain date, I need it to download from after that date and overwrite existing files with the same name.

Is this possible?

Thanks, and thanks for an amazing script!
Michael
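
(A hedged suggestion: another issue in this thread, "not downloading the latest version of the website", uses a --from/-f option to restrict snapshots to a starting date. If your version supports it, starting the window at the date you need may avoid the older file versions entirely, e.g.:)

wayback_machine_downloader http://example.com -f 20150805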

When file exists, the program exits

Hey there,

Nice piece of software!! :)

I found a bug, though. When downloading item 4053, the file already existed as a single file, so a folder could not be created.

Here's the error:
http://REDACTED.com/uncategorized/reflective-thoughts-on-marriage/ -> websites/REDACTED.com/uncategorized/reflective-thoughts-on-marriage/index.html (4052/48177)
http://REDACTED.com/uncategorized/doing-the-important-stuff/ -> websites/REDACTED.com/uncategorized/doing-the-important-stuff/index.html (4053/48177)

File exists - websites/REDACTED.com/www.REDACTED2.com

/usr/lib/ruby/1.9.1/fileutils.rb:1515:in `stat': No such file or directory - File exists - websites/REDACTED.com/www.REDACTED2.com (Errno::ENOENT)
    from /usr/lib/ruby/1.9.1/fileutils.rb:1515:in `block in fu_each_src_dest'
    from /usr/lib/ruby/1.9.1/fileutils.rb:1531:in `fu_each_src_dest0'
    from /usr/lib/ruby/1.9.1/fileutils.rb:1513:in `fu_each_src_dest'
    from /usr/lib/ruby/1.9.1/fileutils.rb:508:in `mv'
    from /var/lib/gems/1.9.1/gems/wayback_machine_downloader-0.1.15/lib/wayback_machine_downloader.rb:116:in `rescue in structure_dir_path'
    from /var/lib/gems/1.9.1/gems/wayback_machine_downloader-0.1.15/lib/wayback_machine_downloader.rb:109:in `structure_dir_path'
    from /var/lib/gems/1.9.1/gems/wayback_machine_downloader-0.1.15/lib/wayback_machine_downloader.rb:83:in `block in download_files'
    from /var/lib/gems/1.9.1/gems/wayback_machine_downloader-0.1.15/lib/wayback_machine_downloader.rb:66:in `each'
    from /var/lib/gems/1.9.1/gems/wayback_machine_downloader-0.1.15/lib/wayback_machine_downloader.rb:66:in `download_files'
    from /var/lib/gems/1.9.1/gems/wayback_machine_downloader-0.1.15/bin/wayback_machine_downloader:27:in `<top (required)>'
    from /usr/local/bin/wayback_machine_downloader:23:in `load'
    from /usr/local/bin/wayback_machine_downloader:23:in `<main>'

Here's an ls of the file:
ubuntu@ip-172-30-0-198:~$ ll websites/REDACTED.com/www.REDACTED2.com
-rw-rw-r-- 1 ubuntu ubuntu 27286 Sep 4 07:44 websites/REDACTED.com/www.REDACTED2.com

Can't download website that uses ? in the URL

(Had to start a new post since the other is closed.)

I tried using the \ to escape, as suggested, and using quotes, but neither worked in Ruby. I got parsing errors at the Ruby command prompt using escapes, and I get an invalid argument error when I use quotes.


Strider-2015 wrote:

I'm trying to download a Gothic 3 walkthrough from the Wayback Machine with an address of

http://www.worldofgothic.com/gothic3/?go=g3walkthrough

and I get an "Invalid argument" error when using the Start Command Prompt with Ruby. If I use interactive Ruby, it tells me I have a syntax error: unexpected '?'.

Is there any way to get this to work using this program? Thanks for any help.

P.S. This is my first time using this, so maybe I'm doing something wrong, but I don't think I am. I just type

wayback_machine_download http://www.worldofgothic.com/gothic3/?go=g3walkthrough at the command prompt. I also tried --timestamp 20121202124458, but either way I think the ? mark causes the problem.
@MrOutis
MrOutis commented 2 days ago

wayback_machine_downloader http://www.worldofgothic.com/gothic3/\?go\=g3walkthrough

Try this, dude. You need to escape special characters like ? or =; this way bash won't interpret them and will take them as literals.

Another way to stop interpretation is to enclose the line in single quotes:
wayback_machine_downloader 'http://www.worldofgothic.com/gothic3/?go=g3walkthrough'
@hartator
Owner
hartator commented 10 hours ago

@MrOutis is right: your shell will interpret ? as a wildcard character and fail. There is not much we can change on our end.

@Strider-2015, either of @MrOutis's solutions should work. Try escaping the special characters or adding quotes ('abc' or "abc") around your URL.

Feel free to follow up here if it doesn't.

Invalid UTF-8 in file path

Hi, I get the following error while recovering a site from the Wayback Machine:

/var/lib/gems/1.9.1/gems/wayback_machine_downloader-0.1.9/lib/wayback_machine_downloader.rb:66:in `split': invalid byte sequence in UTF-8 (ArgumentError)
    from /var/lib/gems/1.9.1/gems/wayback_machine_downloader-0.1.9/lib/wayback_machine_downloader.rb:66:in `block in download_files'
    from /var/lib/gems/1.9.1/gems/wayback_machine_downloader-0.1.9/lib/wayback_machine_downloader.rb:62:in `each'
    from /var/lib/gems/1.9.1/gems/wayback_machine_downloader-0.1.9/lib/wayback_machine_downloader.rb:62:in `download_files'
    from /var/lib/gems/1.9.1/gems/wayback_machine_downloader-0.1.9/bin/wayback_machine_downloader:23:in `<top (required)>'
    from /usr/local/bin/wayback_machine_downloader:23:in `load'
    from /usr/local/bin/wayback_machine_downloader:23:in `<main>'

the code at the line where the error is raised:

file_path_elements = file_id.split('/')

possible fix:

path = file_id.encode('utf-8', 'binary', :invalid => :replace, :undef => :replace)
file_path_elements = path.split('/')

reference: http://stackoverflow.com/questions/10466161/ruby-string-encode-still-gives-invalid-byte-sequence-in-utf-8/10466273#10466273

Unfortunately, this error occurs 11k pages into the site and I don't know which page/file causes it. So, I don't know of an easy way to test if the fix works.

Preserve Timestamps

Hi, I'm using Linux (maybe that has something to do with the timestamps), and when I use this script, it doesn't download the files with the original timestamps. This can be corrected by adding "id_" at the end of a snapshot's timestamp. For example:

http://web.archive.org/web/19970227062641/http://www5.yahoo.com/ (original)
http://web.archive.org/web/19970227062641id_/http://www5.yahoo.com/ (with id_)

I want the timestamps preserved for historical purposes, so that I capture the original date as well.
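
(A small sketch of applying those capture times to the saved files after download; the helper is hypothetical. The 14-digit Wayback timestamp is parsed as local time here; adjust if you need strict UTC.)

require 'time'

# Stamp a saved file with its snapshot's capture time,
# e.g. apply_snapshot_mtime("index.html", "19970227062641").
def apply_snapshot_mtime(path, wayback_timestamp)
  t = Time.strptime(wayback_timestamp.to_s, '%Y%m%d%H%M%S')
  File.utime(t, t, path) # sets atime and mtime
end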

Doesn't Download Style.css

I've been trying this a few times; some sites download fine, while others download but are missing the design completely. The content is there, but the design is completely off. I'm assuming it's just not downloading the CSS files.

Am I missing something?

Encoding

Doesn't correctly download UTF-8 encoded files.

file name too long

Getting this error when files have particularly long names, such as when the site was dynamic and has a long query string:

/usr/local/lib/ruby/2.2.0/open-uri.rb:36:in `initialize': File name too long @ rb_sysopen

The filename and the rest of the error:

/img.aspx?q=L3MkWGAkZmD3BGx3AQV5AwZ0AGD3BGR2BPHlAzpyZ3RkZQRyZwMyWGAkqJqaLlHlAGAhWGV1ZaZyZwHlp256pz50rKVhpTW6WGV1ZaZyZwMhWGAkZPHlAzZyZ3RjWGV2MJLyZ3RjWGV2MvHmpFHlAzIaWGAkZwNkAGN4ZGNjAmV2AGpyZwMwrFHmpGRkZGt5BFHlAaEaWGAkZQysZ2A1WGV2rPHmpFHlAzMapPHmpGNyZwMjqFHmpGD5ZwNmWGV2oabyZ3RjWGV2MaNyZ3R1AGLyZwMbozpyZ3RkWGV2qTLyZ3R2WGV2pUNyZ3SVEvHlAzAbWGAkZPHlAaSyWGAkozLgpJI2pF0lZwtjAGZ2Zmx0AwH1ZQx0-1 (Errno::ENAMETOOLONG)
    from /usr/local/lib/ruby/2.2.0/open-uri.rb:36:in `open'
    from /usr/local/lib/ruby/2.2.0/open-uri.rb:36:in `open'
    from /usr/local/lib/ruby/gems/2.2.0/gems/wayback_machine_downloader-0.1.12/lib/wayback_machine_downloader.rb:83:in `block in download_files'
    from /usr/local/lib/ruby/gems/2.2.0/gems/wayback_machine_downloader-0.1.12/lib/wayback_machine_downloader.rb:66:in `each'
    from /usr/local/lib/ruby/gems/2.2.0/gems/wayback_machine_downloader-0.1.12/lib/wayback_machine_downloader.rb:66:in `download_files'
    from /usr/local/lib/ruby/gems/2.2.0/gems/wayback_machine_downloader-0.1.12/bin/wayback_machine_downloader:23:in `<top (required)>'
    from /usr/local/bin/wayback_machine_downloader:23:in `load'
    from /usr/local/bin/wayback_machine_downloader:23:in `<main>'
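
(One possible workaround, sketched outside the tool: most filesystems limit a single name to 255 bytes, so an over-long query-string component could be truncated and suffixed with a digest of the original to keep it unique.)

require 'digest'

# Shorten a single path component that exceeds the filesystem limit,
# keeping a prefix plus an MD5 digest of the full original name.
def shorten_component(name, limit = 255)
  return name if name.bytesize <= limit
  "#{name.byteslice(0, limit - 33)}-#{Digest::MD5.hexdigest(name)}"
end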

Files download with random numbers and symbols after the file extension

Just noticed this for a site I recently downloaded: there are numbers and symbols appended after the file extension, e.g. logo.jpg is downloaded as logo.jpg%3f1303921736.

I don't think the numbers and symbols are completely random; they follow a certain pattern, always starting with %3f.

It also seems that the new file name with the symbols is hard-coded into the HTML, so an image, for example, uses src="logo.jpg%3f1303921736", meaning that just renaming the image file itself won't fix the issue.

So far I've noticed this with css and image files.

Avoid download and replace by a 404 error

Hello,

I'm using your tool to download an old website of mine.
I'm running it without any options in order to download the last version of each file.
But this happens even when the last version is a 404 error (as captured by the Wayback Machine).

Is it possible to download the last version of each file while skipping any 404 (or other error code), in order to obtain the last viable version of each page?

syntax error, unexpected ':', expecting '='

I get this error trying to run it:
/usr/bin/wayback_machine_downloader:23:in `load': /usr/lib64/ruby/gems/1.8/gems/wayback_machine_downloader-0.1.6/bin/wayback_machine_downloader:22: syntax error, unexpected ':', expecting kEND (SyntaxError)
...achineDownloader.new base_url: base_url, timestamp: options[...
^
/usr/lib64/ruby/gems/1.8/gems/wayback_machine_downloader-0.1.6/bin/wayback_machine_downloader:22: syntax error, unexpected ':', expecting '='
...base_url: base_url, timestamp: options[:timestamp]
^
from /usr/bin/wayback_machine_downloader:23
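
(This looks like a Ruby version mismatch rather than a bug in the gem: the key: value hash shorthand in the failing line is Ruby 1.9+ syntax, and the paths show Ruby 1.8. Upgrading Ruby should fix it; for reference, the 1.8-compatible spelling of the same call would be:)

WaybackMachineDownloader.new :base_url => base_url, :timestamp => options[:timestamp]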

FEATURE REQUEST: Threading for faster download

Hey,

It would be super nice to have multiple downloads at the same time; currently it takes foreeeeever to download a whole site copy, because if one document is unresponsive, everything else suffers for it.

Thanks!
/Siggy
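
(A generic sketch of the requested worker-pool approach; download_file and file_list are placeholders, not the tool's actual API.)

require 'thread'

# Four worker threads pull download jobs from a shared queue.
queue = Queue.new
file_list.each { |f| queue << f } # file_list: placeholder collection

workers = 4.times.map do
  Thread.new do
    loop do
      file = queue.pop(true) rescue break # non-blocking pop; stop when empty
      download_file(file)                 # placeholder per-file download
    end
  end
end
workers.each(&:join)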

Open source license

This is a very useful piece of software! Please add a license so that people will be legally able to add it to any Linux distribution. 😄

not downloading the latest version of the website

The readme.md says this downloader "will download the last version of every file present on Wayback Machine", but for the following URL it is not downloading the latest (nor the oldest):

$ wayback_machine_downloader http://tohokinemakan.jp/index.html -l
[
{"file_url":"http://tohokinemakan.jp:80/index.html","timestamp":20110918020620,"file_id":"index.html"},
]

By adding the --from option, it is possible to find a later version of the page:

$ wayback_machine_downloader http://tohokinemakan.jp/index.html -l -f 2012
[
{"file_url":"http://tohokinemakan.jp:80/index.html","timestamp":20120116225740,"file_id":"index.html"},
]

However, it is not able to find the latest (20130211093834):

$ wayback_machine_downloader http://tohokinemakan.jp/index.html -l -f 2013
[
]

I'm using version 0.5.4 on Ubuntu 16.04

Let me get this straight

So this downloads everything ever uploaded to the Wayback Machine for the specific link, and combines them? Or am I misunderstanding? Does it also follow the other links connected on those pages, e.g. when one takes you to a Facebook page or to something else on the website?

Btw, you can close this right after a reply.

Download entire website from one specific timestamp

Hi
I have Windows 7 with Ruby installed (2.3.1).
I want to try to save the entire website http://forum.thebadasschoobs.org/ at timestamp 20160304221428.
So I try:

wayback_machine_downloader http://forum.thebadasschoobs.org/ -t 20160304221428

and I get an error.
I get the same error using:

wayback_machine_downloader http://forum.thebadasschoobs.org/ -t 20160304221428 -a

I think this pastebin will be helpful!
Where am I wrong?
Maybe I can also save all pages without the .php extension.

Can't download sites with a "-" in their URL

As stated in the title, I cannot download webarchive sites with a minus ("-") in the URL. I came across the problem when a friend of mine searched for an old site he liked and found it in the archives. (The site in question is thieves-guild.net, which apparently went down sometime around 2014.)
With different versions of Ruby (2.2, 2.1, 2.0) on machines running Win 7, we got the following Invalid Argument Exception:

C:/Ruby21/lib/ruby/2.1.0/open-uri.rb:36:in 'initialize': Invalid argument @ rb_sysopen - websites/www.thieves-guild.net/index.php?pid=242&_ga=1.180943079.1977932896.1437794729 (Errno::EINVAL)
    from C:/Ruby21/lib/ruby/2.1.0/open-uri.rb:36:in 'open'
    from C:/Ruby21/lib/ruby/2.1.0/open-uri.rb:36:in 'open'
    from C:/Ruby21/lib/ruby/gems/2.1.0/gems/wayback_machine_downloader-0.1.12/lib/wayback_machine_downloader.rb:83:in 'block in download_files'
    from C:/Ruby21/lib/ruby/gems/2.1.0/gems/wayback_machine_downloader-0.1.12/lib/wayback_machine_downloader.rb:66:in 'each'
    from C:/Ruby21/lib/ruby/gems/2.1.0/gems/wayback_machine_downloader-0.1.12/lib/wayback_machine_downloader.rb:66:in 'download_files'
    from C:/Ruby21/lib/ruby/gems/2.1.0/gems/wayback_machine_downloader-0.1.12/lib/wayback_machine_downloader.rb:23:in '<top (required)>'
    from C:/Ruby21/bin/wayback_machine_downloader:23:in 'load'
    from C:/Ruby21/bin/wayback_machine_downloader:23:in '<main>'

Since I tried some other sites with a "-" in their URL and got similar Exceptions, I guess that this is the problem.

Greetings!

Only allow HTML files

--only '/\.(html|htm)$/i'

--only is not working. I am getting this error:

'htm)$' is not recognized as an internal or external command, operable program or batch file.
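
(That message is a clue: cmd.exe treats the unquoted | as a pipe, and single quotes are not quoting characters on Windows, so the regex gets split. Wrapping it in double quotes should keep it intact:)

wayback_machine_downloader http://example.com --only "/\.(html|htm)$/i"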

Also, is there any way to exclude files? For example, excluding CSS and JS files.

Running Ruby issue

When I open Ruby and type "gem install wayback_machine downloader" into the prompt and hit enter, nothing happens. Can you please help? Would it have anything to do with where I extracted the wayback downloader file (the desktop)?

Errors when trying to download?

Hi all,

Firstly thank you to the original coders of this. I've been looking for such a tool for ages!

I'm having an issue, though (latest Ubuntu x64, latest Ruby).

When trying to download a desired site I get this error:

jay@jay-VirtualBox:~/wayback-machine-downloader$ sudo wayback_machine_downloader http://www.sitename.co.uk
Downloading http://www.sitename.co.uk to websites/www.sitename.co.uk/ from Wayback Machine...

/usr/lib/ruby/1.9.1/net/http.rb:763:in `initialize': getaddrinfo: Name or service not known (SocketError)
    from /usr/lib/ruby/1.9.1/net/http.rb:763:in `open'
    from /usr/lib/ruby/1.9.1/net/http.rb:763:in `block in connect'
    from /usr/lib/ruby/1.9.1/timeout.rb:55:in `timeout'
    from /usr/lib/ruby/1.9.1/timeout.rb:100:in `timeout'
    from /usr/lib/ruby/1.9.1/net/http.rb:763:in `connect'
    from /usr/lib/ruby/1.9.1/net/http.rb:756:in `do_start'
    from /usr/lib/ruby/1.9.1/net/http.rb:745:in `start'
    from /usr/lib/ruby/1.9.1/open-uri.rb:306:in `open_http'
    from /usr/lib/ruby/1.9.1/open-uri.rb:775:in `buffer_open'
    from /usr/lib/ruby/1.9.1/open-uri.rb:203:in `block in open_loop'
    from /usr/lib/ruby/1.9.1/open-uri.rb:201:in `catch'
    from /usr/lib/ruby/1.9.1/open-uri.rb:201:in `open_loop'
    from /usr/lib/ruby/1.9.1/open-uri.rb:146:in `open_uri'
    from /usr/lib/ruby/1.9.1/open-uri.rb:677:in `open'
    from /usr/lib/ruby/1.9.1/open-uri.rb:33:in `open'
    from /var/lib/gems/1.9.1/gems/wayback_machine_downloader-0.2.0/lib/wayback_machine_downloader.rb:42:in `get_file_list_curated'
    from /var/lib/gems/1.9.1/gems/wayback_machine_downloader-0.2.0/lib/wayback_machine_downloader.rb:72:in `get_file_list_by_timestamp'
    from /var/lib/gems/1.9.1/gems/wayback_machine_downloader-0.2.0/lib/wayback_machine_downloader.rb:83:in `download_files'
    from /var/lib/gems/1.9.1/gems/wayback_machine_downloader-0.2.0/bin/wayback_machine_downloader:32:in `<top (required)>'
    from /usr/local/bin/wayback_machine_downloader:23:in `load'
    from /usr/local/bin/wayback_machine_downloader:23:in `<main>'

I'm new to GitHub, so it could very well be user error, but I'm pretty sure I did it correctly... although I did get errors during the install:

jay@jay-VirtualBox:~/wayback-machine-downloader$ sudo gem install wayback_machine_downloader
WARNING: Error fetching data: SocketError: getaddrinfo: Name or service not known (http://rubygems.org/latest_specs.4.8.gz)
Fetching: wayback_machine_downloader-0.2.0.gem (100%)
Successfully installed wayback_machine_downloader-0.2.0
1 gem installed
Installing ri documentation for wayback_machine_downloader-0.2.0...
Installing RDoc documentation for wayback_machine_downloader-0.2.0...

Any help or advice greatly appreciated.

Writing directory and file names with '?' on Windows

The script works great, thank you for sharing your work! Since Windows can't have the '?' character in folder or file names, I am unable to save any archived pages with query strings (example.com/?this=that).

To work around this, I'm using Linux to run wayback_machine_downloader, then using 'rename' within Linux to change all occurrences of '?' in folder and file names to '@' (this is how wget works for me on Windows). Then... copying the now-compatible files over to Windows to work with them and prep for upload to my webserver. (I use some PHP to check for a query string in the request, then check for files that start with '@' for display.)

Can you help me figure out how to make the script convert '?' to '@' when saving the files/directories, so I can run this directly from Windows? Thanks!
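
(A sketch of that rename pass in Ruby, so it could run right after the download on the Linux side; it assumes the tool's default websites/ output directory. Deepest paths are renamed first so parent directories remain valid while their contents are processed.)

require 'find'
require 'fileutils'

# Rename '?' to '@' in every file and directory name under websites/.
paths = []
Find.find('websites') { |p| paths << p if File.basename(p).include?('?') }
paths.sort_by { |p| -p.length }.each do |p|
  FileUtils.mv(p, File.join(File.dirname(p), File.basename(p).gsub('?', '@')))
end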

FEATURE REQUEST: command line flag to continue at item # X

If a bug appears at download number 40,000 and I have 45,000 files, I have to start from scratch.
It would be awesome if I could continue at file X (this would also make parallel downloads possible if I could set a max of 5,000 or so per execution).
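
(For illustration, a sketch of what such a flag might look like inside the tool; every name here, file_list_curated, download_file, and the option key, is assumed from the stack traces in this thread rather than taken from the actual source.)

# Hypothetical: skip the first N entries of the snapshot file list.
start_at = options[:start_at] || 0  # e.g. from a --start-at flag
file_list_curated.drop(start_at).each do |file_remote_info|
  download_file(file_remote_info)   # assumed per-file download step
end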

How to import the downloaded website from Wayback Machine Downloader into MySQL

Many people use the Wayback Machine Downloader to download their websites, and everyone has to find a method to move the data onto the new server. If you know how, or have successfully imported the data, please tell others how you did it.

Here's what I'm stuck at (importing a forum to a forum):

ruby-parse

I can get the description in the command prompt.

What did I do?

I created an Example.rb file on my PC, entered the code below inside it, and saved it.

require 'nokogiri'
require 'open-uri'

# Path to a page saved locally by wayback_machine_downloader
url = "C:/Users/Manikandan/websites/index.html"

data = Nokogiri::HTML(open(url))

puts data.at_css("#summaryDescription").text.strip

puts data.css(".postContent").text.strip

puts data.at_css('h1.title').text.strip
When running the file at the command prompt, I can get the details.

But I get all the answers as a single paragraph. I'm not sure what I'm doing; please help me.
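
(The single paragraph likely comes from calling .text on the whole node set, which concatenates the text of every matched node. A minimal fix is to iterate over the matches instead:)

# Print each matched post separately instead of one concatenated string.
data.css(".postContent").each_with_index do |post, i|
  puts "--- post #{i + 1} ---"
  puts post.text.strip
end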

Timestamp grabs nearest image, not necessarily earlier ones

It seems that specifying a timestamp causes this script to download the snapshot nearest to that timestamp, rather than the latest one before it (which is how the documentation describes it).

Note that this mirrors the behavior of the Wayback Machine itself.

Now, I have literally never done anything with Ruby before, but I think I have some idea of what is going on after looking at the code.

There is a block of code at the beginning of wayback_machine_downloader.rb that looks at and sorts timestamps. I don't 100% get what's going on there (like I said, I've never used Ruby before), but I think I get the idea after playing with the code for a bit.

But then in the end, there is the line:

          open("http://web.archive.org/web/#{timestamp}id_/#{file_url}") do |uri|

Down here, "timestamp" is referring to the original timestamp the user specified, not anything from the above work. So it's just going to mimic the Wayback Machine's normal rounding behavior and bypass the above code.

I was able to hackishly make it work on my end by pointing it at part of "file_list_curated", but I don't really trust posting my solution here since I'm a total Ruby novice :)
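
(For illustration only, a sketch of the kind of change being hinted at, with the field name assumed rather than taken from the actual source: use the timestamp of the snapshot that the earlier filtering selected for this particular file, instead of the raw user-supplied value.)

# Hypothetical: file_remote_info is the curated entry for this file.
file_timestamp = file_remote_info[:timestamp] # assumed field name
open("http://web.archive.org/web/#{file_timestamp}id_/#{file_url}") do |uri|
  # ... save uri.read to disk as before ...
end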

--timestamp not working?

I can't seem to get --timestamp to work. I'm trying to download a copy of a site that was lost when the domain name expired; the latest version in the archive shows only the hosting company's parking page. I can go to the archive and get an older version of the site manually, but the downloader keeps grabbing the latest version.

h33sport.com is the site, and I'm trying to grab 20141219170711.

My gem is version 0.2.4.

tidy_bytes (LoadError)

I'm fairly new to Ruby, so excuse me if I'm being an idiot; I keep getting this error:

/usr/local/lib/ruby/gems/2.2.0/gems/wayback_machine_downloader-0.1.10/lib/wayback_machine_downloader.rb:3:in `require_relative': cannot load such file -- /usr/local/lib/ruby/gems/2.2.0/gems/wayback_machine_downloader-0.1.10/lib/tidy_bytes (LoadError)
    from /usr/local/lib/ruby/gems/2.2.0/gems/wayback_machine_downloader-0.1.10/lib/wayback_machine_downloader.rb:3:in `<top (required)>'
    from /usr/local/lib/ruby/gems/2.2.0/gems/wayback_machine_downloader-0.1.10/bin/wayback_machine_downloader:3:in `require_relative'
    from /usr/local/lib/ruby/gems/2.2.0/gems/wayback_machine_downloader-0.1.10/bin/wayback_machine_downloader:3:in `<top (required)>'
    from /usr/local/bin/wayback_machine_downloader:23:in `load'
    from /usr/local/bin/wayback_machine_downloader:23:in `<main>'

Am I missing something? Any help appreciated; great project, btw.

FEATURE REQUEST: After/before flags

It would be helpful to be able to provide --after and --before flags so that only specific timeframes are downloaded.

An example use-case is archiving the state of a site across years, or being able to snapshot the evolution of a site.

Sure, we can set --timestamp to 1-year intervals, but downloading a 2001 timestamp and then downloading a 2002 timestamp results in a LOT of duplicate files.

Doesn't grab every directory

Hi

For totally personal nostalgia-browsing reasons, I ran this to grab a 97/2000 site (with an appropriate timestamp) that had lots and lots of files and many directories (and no dynamic content like a forum). The site's later incarnation had dynamic content that was excessively captured (as well as a squatter), so the downloaded directories seem to stop after a certain letter, even though I know there are directories after that letter.

Is there a workaround for this?

I just want only the 1st cache of a URL

It's taking so much time (in my case, weeks!).
My website has around 1 lakh (100,000) pages, and with the cached versions of each page the total comes to something like 234,539, so it starts at (1/234539) for each URL.
I just can't reduce the time frame, since that would eliminate many URLs.

[Feature Request] Rewrite every URL to a relative URL after download is complete

Several users encounter issues when their websites used to contain absolute URLs (e.g. http://example.com/style.css), making images, styles, and even page links appear broken when opening the website from the downloaded copy. E.g.: #6

Unfortunately, I don't have time to work on this myself, so it's up for grabs for anyone who wants to work on it!

The goal is to write a script that rewrites every absolute URL in every downloaded file to point to the local copy. For example, http://example.com/static/style.css should become './static/style.css', assuming a webpage at the root.

The challenge is that URLs can be more complex than that, and the script must know where the local copies are relative to the current file's location. It should also be a CLI option, as not everyone wants URLs rewritten.

Ask me any question!
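
(To make the ask concrete, a rough sketch under simplifying assumptions: a single known domain, HTML files only, saved under websites/example.com/ as this tool does, and ignoring edge cases like protocol-relative URLs and encodings.)

require 'pathname'

SITE_ROOT = Pathname.new('websites/example.com')
ABSOLUTE  = %r{https?://(?:www\.)?example\.com}

Dir.glob(SITE_ROOT.join('**', '*.html').to_s).each do |file|
  dir  = Pathname.new(file).dirname
  html = File.read(file)
  # Swap the absolute prefix for the path from this file's directory
  # back up to the site root ("." at the root itself).
  html.gsub!(ABSOLUTE) { SITE_ROOT.relative_path_from(dir).to_s }
  File.write(file, html)
end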

Error: no implicit conversion of nil into String

This is the error I get:

C:/Ruby22/lib/ruby/gems/2.2.0/gems/wayback_machine_downloader-0.2.4/lib/wayback_machine_downloader.rb:25:in `+': no implicit conversion of nil into String (TypeError)
        from C:/Ruby22/lib/ruby/gems/2.2.0/gems/wayback_machine_downloader-0.2.4/lib/wayback_machine_downloader.rb:25:in `backup_path'
        from C:/Ruby22/lib/ruby/gems/2.2.0/gems/wayback_machine_downloader-0.2.4/lib/wayback_machine_downloader.rb:81:in `download_files'
        from C:/Ruby22/lib/ruby/gems/2.2.0/gems/wayback_machine_downloader-0.2.4/bin/wayback_machine_downloader:32:in `<top (required)>'
        from C:/Ruby22/bin/wayback_machine_downloader:23:in `load'
        from C:/Ruby22/bin/wayback_machine_downloader:23:in `<main>'

Windows 10. Ruby 2.2. Tried with Ruby 1.9.3. Didn't work.
Fix it, please.

Import archives to Wordpress

Hello,

Thanks a lot for this GREAT tool; it saved my old website that I wanted to restore.
I just have one question, not really related, but maybe you can help.

I would like to import the downloaded archive back into WordPress. Do you have any idea how I can do this?

Thanks again!

Fails on URLs with '?' in the name

The simple solution would be, while going through the file names, to rename every '?' to %3F (the URL encoding for '?').

I would also recommend changing '=' to %3D, as that may become an issue as well.

(This is on Windows, not Linux.)
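
(A one-line sketch of that substitution, applied to a hypothetical file_path variable before writing to disk:)

safe_path = file_path.gsub('?', '%3F').gsub('=', '%3D')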

Can't download a website: doesn't crawl the site, and some subpages give only a 0 KB index.html when manually entered

Be advised: this comic contains violence.

Windows 10, Ruby 2.2

So I tried:

C:\wrk\stark>wayback_machine_downloader http://starkreality.smackjeeves.com/home/
Downloading http://starkreality.smackjeeves.com/home/ to websites/starkreality.smackjeeves.com/ from Wayback Machine...

http://starkreality.smackjeeves.com:80/home/ -> websites/starkreality.smackjeeves.com/home/index.html (1/1)

Download complete, saved in websites/starkreality.smackjeeves.com/ (1 files)

C:\wrk\stark>wayback_machine_downloader http://starkreality.smackjeeves.com/Archivepage/
Downloading http://starkreality.smackjeeves.com/Archivepage/ to websites/starkreality.smackjeeves.com/ from Wayback Machine...

http://starkreality.smackjeeves.com:80/archivepage/ -> websites/starkreality.smackjeeves.com/archivepage/index.html (1/2)
http://starkreality.smackjeeves.com:80/Archivepage/ # websites/starkreality.smackjeeves.com/Archivepage/index.html already exists. (2/2)

Download complete, saved in websites/starkreality.smackjeeves.com/ (2 files)

I have those 2 pages but that's it.

I tried to manually download an archived comic page, but that gives me only a single 0 KB index.html:

C:\wrk\stark>wayback_machine_downloader http://starkreality.smackjeeves.com/Sale
Downloading http://starkreality.smackjeeves.com/Sale to websites/starkreality.smackjeeves.com/ from Wayback Machine...

http://starkreality.smackjeeves.com/Sale # incorrect header check
http://starkreality.smackjeeves.com/Sale -> websites/starkreality.smackjeeves.com/Sale/index.html (1/2)
http://starkreality.smackjeeves.com:80/sale/ # websites/starkreality.smackjeeves.com/sale/index.html already exists. (2/2)

Download complete, saved in websites/starkreality.smackjeeves.com/ (2 files)
