Comments (7)
As far as I can tell, archive.org is limiting the number of connections you can make in a short period of time.
As mentioned in #264, browsers and wget (which uses a persistent connection) are not affected by this issue.
It should be fixed by using a single persistent connection for all downloads instead of creating a new connection for each download.
diff --git a/lib/wayback_machine_downloader.rb b/lib/wayback_machine_downloader.rb
index 730714a..199b9dd 100644
--- a/lib/wayback_machine_downloader.rb
+++ b/lib/wayback_machine_downloader.rb
@@ -206,11 +206,15 @@ class WaybackMachineDownloader
@processed_file_count = 0
@threads_count = 1 unless @threads_count != 0
@threads_count.times do
+ http = Net::HTTP.new("web.archive.org", 443)
+ http.use_ssl = true
+ http.start()
threads << Thread.new do
until file_queue.empty?
file_remote_info = file_queue.pop(true) rescue nil
- download_file(file_remote_info) if file_remote_info
+ download_file(file_remote_info, http) if file_remote_info
end
+ http.finish()
end
end
@@ -243,7 +247,7 @@ class WaybackMachineDownloader
end
end
- def download_file file_remote_info
+ def download_file (file_remote_info, http)
current_encoding = "".encoding
file_url = file_remote_info[:file_url].encode(current_encoding)
file_id = file_remote_info[:file_id]
@@ -268,8 +272,8 @@ class WaybackMachineDownloader
structure_dir_path dir_path
open(file_path, "wb") do |file|
begin
- URI("https://web.archive.org/web/#{file_timestamp}id_/#{file_url}").open("Accept-Encoding" => "plain") do |uri|
- file.write(uri.read)
+ http.get(URI("https://web.archive.org/web/#{file_timestamp}id_/#{file_url}")) do |body|
+ file.write(body)
end
rescue OpenURI::HTTPError => e
puts "#{file_url} # #{e}"
from wayback-machine-downloader.
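The idea behind the patch, one persistent connection per worker thread, can be sketched outside the diff roughly as follows. This is a minimal illustration, not the project's code: `download_all`, the `connect` factory and `WAYBACK_CONNECT` are hypothetical names, and only the host and port come from the patch. The sketch also checks the status code itself, since `Net::HTTP#get` returns a response object for 4xx/5xx replies instead of raising the way open-uri does.

```ruby
require "net/http"

# Drain a list of request paths with `thread_count` workers.
# `connect` is a hypothetical factory returning a started connection that
# responds to #get(path) and #finish (for example a Net::HTTP instance).
def download_all(paths, thread_count:, connect:)
  queue = Queue.new
  paths.each { |path| queue << path }
  results = Queue.new

  workers = Array.new(thread_count) do
    Thread.new do
      http = connect.call # one connection per thread, reused for every request
      begin
        loop do
          path = begin
            queue.pop(true) # non-blocking pop raises ThreadError when empty
          rescue ThreadError
            break
          end
          response = http.get(path)
          # No exception is raised for 4xx/5xx, so check the status explicitly.
          if (200..299).cover?(response.code.to_i)
            results << [path, response.body]
          else
            warn "#{path} # #{response.code} #{response.message}"
          end
        end
      ensure
        http.finish # close the connection once the queue is drained
      end
    end
  end

  workers.each(&:join)
  Array.new(results.size) { results.pop }
end

# Factory for the real service; host and port are the ones used in the patch.
WAYBACK_CONNECT = lambda do
  http = Net::HTTP.new("web.archive.org", 443)
  http.use_ssl = true
  http.start # returns the started connection when called without a block
end
```

A call might look like `download_all(paths, thread_count: 1, connect: WAYBACK_CONNECT)`, where each path has the `/web/<timestamp>id_/<url>` shape used in the diff. Passing the factory in keeps the worker loop independent of how the connection is created.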
Same here. I'm guessing that Wayback is dropping the connection after a small handful of requests: mine worked for the first 19 pages, then began to fail.
This fix does work. It's a bit slow now of course, but the files get downloaded.
Can we get this fix approved and a new release created?
archive.org has implemented rate limiting, which is why the delay fixes things. It is unfortunate, and probably breaks multithreaded downloading as well, but it is a free resource after all. https://archive.org/details/toomanyrequests_20191110
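If rate limiting is the cause, a client can also wait and retry when it is throttled. The sketch below assumes the service signals throttling with HTTP status 429, which this thread does not confirm; `with_rate_limit_retry`, `max_attempts`, `base_delay` and `sleeper` are hypothetical names and values chosen for illustration.

```ruby
# Run the block until it returns a response whose status is not "429",
# sleeping between attempts with a doubling delay.
# ASSUMPTION: throttling shows up as HTTP 429; the limits here are illustrative.
def with_rate_limit_retry(max_attempts: 5, base_delay: 1.0, sleeper: method(:sleep))
  attempt = 0
  loop do
    attempt += 1
    response = yield
    # Return on any non-429 status, or when the attempts are used up.
    return response unless response.code == "429" && attempt < max_attempts
    sleeper.call(base_delay * (2**(attempt - 1))) # base_delay, then 2x, 4x, ...
  end
end
```

With a started `Net::HTTP` connection in `http`, usage would be along the lines of `response = with_rate_limit_retry { http.get(path) }`. The `sleeper` parameter exists only so the delay can be replaced in tests.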
This is an elegant (and working) solution. Nice one!