Comments (11)
Wget WARC file output is now supported in e8808b0, via the FETCH_WARC=True flag (on by default).
I'll investigate creating another, more complete WARC that includes dynamic requests via the Chrome proxy, but the simpler wget one is worth testing in the interim.
from archivebox.
For the record, pywb has trouble reading wget WARC files because wget's output is non-standard: webrecorder/pywb#294. You might want to consider another crawler for the task, or wait until wget fixes its output.
The bookmark-archiver is mentioned in a recent LWN article:
https://lwn.net/Articles/766374/
WARC archives seem to be the way to go for archiving webpages, especially for SPAs and other JS-heavy and streaming pages. According to that article, wget is a little broken in terms of WARC output, but there is a list of alternatives that work fine.
It would be very nice if bookmark-archiver got support for WARC archives.
Can this be re-opened? The linked issue is about the webarchive format, which is a totally different thing.
https://en.wikipedia.org/wiki/Webarchive
https://en.wikipedia.org/wiki/Web_ARChive
WARC is the one that archiveteam and libraries have thrown their weight behind: http://fileformats.archiveteam.org/wiki/WARC
Oh, I didn't know they were different, thanks @eqyiel.
I saw the article, and I actually emailed @anarcat about it since it turns out we both live in Montreal.
I think if I add WARC saving I want to do it with https://github.com/internetarchive/brozzler and a headless browser to cover all the edge cases around JS execution.
The easiest way to do this might be to just point the headless browser you already run at a WARC recording proxy. However, brozzler looks like a nicely maintained piece of software (and way better than the old Java stuff the IA was using before), so maybe migrating to that can be a longer-term goal.
EDIT: just noticed there's #113 for brozzler already.
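A rough sketch of the proxy approach, for illustration only. It assumes a WARC recording proxy (for example warcprox, mentioned further down the thread) is already listening on localhost:8000, and that the Chrome binary is called chromium-browser. The proxy address, the binary name, and the helper name build_chrome_cmd are assumptions, not anything taken from the project.

```python
import subprocess


def build_chrome_cmd(url, proxy="localhost:8000", chrome_binary="chromium-browser"):
    """Build a headless Chrome command line that routes all traffic through a proxy.

    The proxy address and binary name are placeholders (assumptions).
    """
    return [
        chrome_binary,
        "--headless",
        # Send every request through the recording proxy so it can write the WARC.
        f"--proxy-server={proxy}",
        # A recording proxy typically re-signs HTTPS traffic with its own CA,
        # so certificate errors are ignored here (assumption: the CA is not installed).
        "--ignore-certificate-errors",
        # Render the page (running its JS) and print the resulting DOM to stdout.
        "--dump-dom",
        url,
    ]


def render_through_proxy(url, **kwargs):
    """Run Chrome; the proxy, not this script, is responsible for writing the WARC."""
    cmd = build_chrome_cmd(url, **kwargs)
    return subprocess.run(cmd, capture_output=True, text=True, timeout=120)
```

With this split the browser needs no WARC awareness at all: anything it requests while rendering, including dynamically loaded resources, passes through the proxy and can be recorded there.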
See: #11
Just requires adding a new config FETCH_WARC option and archive_method.fetch_warc:
https://www.archiveteam.org/index.php/Wget_with_WARC_output
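A minimal sketch of what such a method could look like, assuming wget is on the PATH and was built with WARC support. The function names, the environment lookup for FETCH_WARC, the default value, and the output layout are hypothetical and not taken from the actual codebase; only the wget flags themselves (--warc-file, --page-requisites, --timeout, --directory-prefix) are real.

```python
import os
import subprocess
from pathlib import Path


def warc_enabled(env=None):
    """Hypothetical config lookup: read FETCH_WARC from the environment.

    Defaults to off here; the real default is a project decision.
    """
    env = os.environ if env is None else env
    return env.get("FETCH_WARC", "False").strip().lower() in ("1", "true", "yes")


def build_wget_warc_cmd(url, out_dir, warc_name="archive"):
    """Build the wget argument list for a single-page fetch with WARC output."""
    warc_base = Path(out_dir) / "warc" / warc_name
    return [
        "wget",
        "--page-requisites",              # also fetch images, CSS, and scripts
        "--timeout=60",
        f"--directory-prefix={out_dir}",  # where the plain downloaded files go
        f"--warc-file={warc_base}",       # wget adds the WARC extension itself
        url,
    ]


def fetch_warc(url, out_dir, env=None):
    """Run wget if FETCH_WARC is enabled; return its exit code, or None if skipped."""
    if not warc_enabled(env):
        return None
    (Path(out_dir) / "warc").mkdir(parents=True, exist_ok=True)
    cmd = build_wget_warc_cmd(url, out_dir)
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode
```

Calling fetch_warc("https://example.com", "output/example") with FETCH_WARC=True set would leave the WARC under output/example/warc/. Note that wget only records what it requests itself, so resources loaded by JavaScript are not captured.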
WARCs are nice for bundling all resources needed to render a page in a browser, but using wget generates a WARC which is just a container for the HTML. Maybe open a new issue for the browser integration?
Yeah, actually I think I'll turn it off by default for now because it's too minimal to be a default until we get warcprox working with headless.
Here's the new issue to track the all-in-one WARC file: #130