Comments (5)
@ckought I do not see this, so I wonder which version of wayback_machine_downloader
you are using and whether it has been modified in any way.
Use the one in #280 or see this comment #265 (comment)
from wayback-machine-downloader.
@gingerbeardman That fixed the issue. I've had it running for about 48 hours and roughly 25K downloads, with no corrupted files.
The version of wayback_machine_downloader.rb I had been using was the version from the automatic install using "gem install wayback_machine_downloader".
My index files are downloaded and look as I expect them to.
Feel free to post a sample URL where you are seeing this and I'll check.
I'm getting this for about a third of the pages that I'm grabbing. No rhyme or reason for which ones are corrupted. It's happening to txt and html files, and probably other file types too (I know some jpg files are getting corrupted, but I'm not sure about css, js, or others).
It looks to me like whatever code is being used to strip off the archive.org code at the top is causing it. It's like the code is downloading the page, stripping off the archive.org code, something goes wrong, and then it writes the garbage file to disk thinking the file is just fine.
Update:
Been doing some digging, and it looks like every corrupted file, whether it's html, txt, jpeg, or css, starts with one of these three hex byte sequences:
1f 8b 08 00 00 00 00 00 00 03
1f 8b 08 00 3f 3f 3f 00
1f 8b 08 00 4f 3f 3f 00
Still not sure what's actually causing the files to corrupt though. It does seem to get worse the larger the website is, and therefore the longer the script is running, so that may have something to do with it.
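For what it's worth, the first three bytes in all three patterns (1f 8b 08) are the gzip magic number followed by the deflate method byte. One plausible reading, not confirmed anywhere in this thread, is that some responses were written to disk still gzip-compressed. Below is a minimal Ruby sketch for checking a downloaded file under that assumption. The helper names `gzip_data?` and `recover_body` are hypothetical and are not part of wayback_machine_downloader.

```ruby
require "zlib"
require "stringio"

# First three bytes of a gzip stream: magic number 1f 8b, then 08 (deflate).
GZIP_MAGIC = "\x1f\x8b\x08".b

# Hypothetical helper: true if the data begins with the gzip header bytes.
def gzip_data?(bytes)
  bytes.b.start_with?(GZIP_MAGIC)
end

# Hypothetical helper: returns the decompressed body if the data is intact
# gzip, the input unchanged if it is not gzip at all, or nil if it looks
# like gzip but cannot be decompressed (for example, mangled bytes).
def recover_body(bytes)
  return bytes unless gzip_data?(bytes)
  Zlib::GzipReader.new(StringIO.new(bytes.b)).read
rescue Zlib::Error, EOFError
  nil
end
```

To try it on a saved file, read it in binary mode first, for example `recover_body(File.binread(path))`, where `path` is whichever downloaded file you suspect. A nil result would suggest the bytes were damaged beyond a simple decompress, which might be the case for the patterns containing 3f (ASCII "?").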