Archive Method: Add WARC file output (archivebox, CLOSED)

archivebox commented on May 11, 2024 3

Archive Method: Add WARC file output

from archivebox.

Comments (11)

pirate commented on May 11, 2024 3

Wget WARC file output is now supported in e8808b0, via the FETCH_WARC=True flag (on by default).

I'll investigate creating a more complete WARC that includes dynamic requests by using the Chrome proxy, but for now the simpler wget one is worth testing in the interim.

anarcat commented on May 11, 2024 2

For the record, pywb has trouble reading wget WARC files because wget's output is non-standard: webrecorder/pywb#294. You might want to consider another crawler for the task, or wait until wget fixes this first.

f0086 commented on May 11, 2024 2

The bookmark-archiver is mentioned in a recent LWN article:
https://lwn.net/Articles/766374/

WARC archives seem to be the way to go for archiving webpages, especially for SPAs and other JS-heavy and streaming pages. Wget is, according to that article, a little broken in terms of WARC output, but there is a list of alternatives that work fine.

It would be very nice if bookmark-archiver got support for WARC archives.

eqyiel commented on May 11, 2024 1

Can this be re-opened? The linked issue is about the webarchive format, which is a totally different thing.

https://en.wikipedia.org/wiki/Webarchive
https://en.wikipedia.org/wiki/Web_ARChive

WARC is the one that archiveteam and libraries have thrown their weight behind: http://fileformats.archiveteam.org/wiki/WARC

pirate commented on May 11, 2024 1

Oh I didn't know they were different, thanks @eqyiel.

pirate commented on May 11, 2024 1

I saw the article, and I actually emailed @anarcat about it since it turns out we both live in Montreal.

I think if I add WARC saving I want to do it with https://github.com/internetarchive/brozzler and a headless browser to cover all the edge cases around JS execution.

FiloSottile commented on May 11, 2024 1

The easiest way to do this might be to just point the headless browser you already run at a WARC recording proxy. However, brozzler looks like a nicely maintained piece of software (and way better than the old Java stuff the IA was using before), so maybe migrating to that can be a longer-term goal.

EDIT: just noticed there's #113 for brozzler already.
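
A minimal sketch of the recording-proxy idea, assuming a WARC recording proxy (for example warcprox) is already listening on `localhost:8000` and that the Chrome binary is called `chromium-browser`. Both values, and the helper name `build_chrome_cmd`, are placeholders for illustration and are not taken from the archivebox code. The function only builds the command line, so nothing is launched here:

```python
from typing import List


def build_chrome_cmd(
    url: str,
    proxy: str = "localhost:8000",            # assumed address of the recording proxy
    chrome_binary: str = "chromium-browser",  # assumed binary name, varies by system
) -> List[str]:
    """Build a headless Chrome command that sends all traffic through a proxy."""
    return [
        chrome_binary,
        "--headless",
        # Route every request (HTML, JS, XHR, images) through the recorder.
        f"--proxy-server={proxy}",
        # A recording proxy that intercepts HTTPS presents its own certificate,
        # so certificate errors have to be tolerated for the capture run.
        "--ignore-certificate-errors",
        # Print the rendered DOM to stdout once the page has loaded.
        "--dump-dom",
        url,
    ]
```

The resulting list could be passed to `subprocess.run(cmd, capture_output=True, timeout=60)`. The WARC itself would be written by the proxy, not by Chrome, which is why the browser side needs nothing beyond the proxy flag.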

pirate commented on May 11, 2024

See: #11

pirate commented on May 11, 2024

Just requires adding a new FETCH_WARC config option and archive_method.fetch_warc:

https://www.archiveteam.org/index.php/Wget_with_WARC_output
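
As a rough sketch of what such an option could look like (not the actual archivebox implementation), the snippet below reads a `FETCH_WARC` flag from the environment and adds wget's `--warc-file` option when it is enabled. The helper names `env_flag` and `build_wget_cmd`, the chosen timeout, and the output prefix are all hypothetical:

```python
import os
from typing import List


def env_flag(name: str, default: bool) -> bool:
    """Read a boolean option such as FETCH_WARC=True from the environment."""
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in ("1", "true", "yes", "on")


def build_wget_cmd(url: str, warc_prefix: str, fetch_warc: bool) -> List[str]:
    """Build a wget command, optionally recording the session to a WARC file."""
    cmd = ["wget", "--no-verbose", "--page-requisites", "--timeout=60"]
    if fetch_warc:
        # wget treats this as a prefix and appends .warc.gz to it.
        cmd.append(f"--warc-file={warc_prefix}")
    cmd.append(url)
    return cmd
```

For example, `build_wget_cmd("https://example.com", "warc/1526000000", env_flag("FETCH_WARC", True))` yields a list that can be handed to `subprocess.run`.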

FiloSottile commented on May 11, 2024

WARCs are nice for bundling all resources needed to render a page in a browser, but using wget generates a WARC which is just a container for the HTML. Maybe open a new issue for the browser integration?
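
One way to check how much a given WARC actually contains is to list the URLs of its `response` records. The sketch below is a minimal standard-library reader that assumes well-formed records (a `WARC/` version line, header fields, a blank line, then `Content-Length` bytes of payload) and transparently decompresses gzip input. The function names are made up for this example, and a real tool would more likely use a dedicated WARC library:

```python
import gzip
import io
from typing import Dict, Iterator, List, Tuple


def iter_warc_records(data: bytes) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, block) for each record in WARC data (plain or gzip)."""
    if data[:2] == b"\x1f\x8b":  # gzip magic number
        data = gzip.decompress(data)
    stream = io.BytesIO(data)
    while True:
        line = stream.readline()
        if not line:
            return  # end of data
        if not line.strip():
            continue  # blank lines separate consecutive records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header section
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip().lower()] = value.strip()
        block = stream.read(int(headers["content-length"]))
        yield headers, block


def response_urls(data: bytes) -> List[str]:
    """Return the target URI of every 'response' record, in file order."""
    return [
        headers["warc-target-uri"]
        for headers, _ in iter_warc_records(data)
        if headers.get("warc-type") == "response"
    ]
```

Calling `response_urls(open("example.warc.gz", "rb").read())` on a capture (the filename is a placeholder) shows whether it holds only the page HTML or also the scripts, stylesheets, and images needed to render it.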

pirate commented on May 11, 2024

Yeah, actually I think I'll turn it off by default for now because it's too minimal to be a default until we get warcprox working with headless.

Here's the new issue to track the all-in-one WARC file: #130
