Comments (11)

pirate commented on May 11, 2024 3

Wget WARC file output is now supported in e8808b0, via the FETCH_WARC=True flag (on by default).

I'll investigate creating a second, more complete WARC that includes dynamic requests captured through the Chrome proxy, but in the interim the simpler wget one is worth testing.

from archivebox.

anarcat commented on May 11, 2024 2

For the record, pywb has trouble reading wget WARC files because wget's output is non-standard: webrecorder/pywb#294. You might want to consider another crawler for the task, or see that wget fixes its output first.

f0086 commented on May 11, 2024 2

The bookmark-archiver is mentioned in a recent LWN article:
https://lwn.net/Articles/766374/

WARC archives seem to be the way to go for archiving webpages, especially for SPAs and other JS-heavy and streaming pages. Wget is, according to that article, a little broken in terms of WARC output, but there is a list of alternatives that work fine.

It would be very nice if bookmark-archiver got support for WARC archives.

eqyiel commented on May 11, 2024 1

Can this be re-opened? The linked issue is about the webarchive format, which is a totally different thing.

https://en.wikipedia.org/wiki/Webarchive
https://en.wikipedia.org/wiki/Web_ARChive

WARC is the one that archiveteam and libraries have thrown their weight behind: http://fileformats.archiveteam.org/wiki/WARC

pirate commented on May 11, 2024 1

Oh I didn't know they were different, thanks @eqyiel.

pirate commented on May 11, 2024 1

I saw the article, and I actually emailed @anarcat about it since it turns out we both live in Montreal.

I think if I add WARC saving I want to do it with https://github.com/internetarchive/brozzler and a headless browser to cover all the edge cases around JS execution.

FiloSottile commented on May 11, 2024 1

The easiest way to do this might be to just point the headless browser you already run at a WARC recording proxy. However, brozzler looks like a nicely maintained piece of software (and way better than the old Java stuff the IA was using before), so maybe migrating to that can be a longer-term goal.

EDIT: just noticed there's #113 for brozzler already.
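
As a rough illustration of the proxy idea, here is a minimal sketch that only builds the headless browser command line. It assumes a WARC recording proxy is already listening on `127.0.0.1:8000`; that address, the `chromium-browser` binary name, and the helper name `build_headless_chrome_command` are all hypothetical and not taken from the project.

```python
from typing import List


def build_headless_chrome_command(
    url: str,
    proxy_address: str = "127.0.0.1:8000",    # assumed address of the recording proxy
    chrome_binary: str = "chromium-browser",  # assumed binary name; varies by system
) -> List[str]:
    """Return an argv list that loads `url` in headless Chrome through a proxy."""
    return [
        chrome_binary,
        "--headless",
        # Route all browser traffic through the recording proxy.
        f"--proxy-server={proxy_address}",
        # A recording proxy usually re-signs HTTPS traffic with its own CA,
        # so the browser has to be told to accept those certificates.
        "--ignore-certificate-errors",
        # Print the rendered DOM so the page's scripts actually run.
        "--dump-dom",
        url,
    ]
```

The returned list could be passed to `subprocess.run`. In this setup the browser itself writes nothing to a WARC; the proxy would record whatever requests the page makes while it renders.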

pirate commented on May 11, 2024

See: #11

pirate commented on May 11, 2024

This just requires adding a new FETCH_WARC config option and archive_method.fetch_warc:

https://www.archiveteam.org/index.php/Wget_with_WARC_output
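
A minimal sketch of what such a method might look like, assuming the `FETCH_WARC` option is read from an environment variable and that `wget` is on the PATH. The function bodies, the `archive` file prefix, and the `build_wget_warc_command` helper are illustrative guesses, not the project's actual code; only wget's `--warc-file` option comes from the page linked above.

```python
import os
import subprocess
from typing import List, Optional


def build_wget_warc_command(url: str, warc_prefix: str) -> List[str]:
    """Build a wget invocation that also records the transfer into a WARC."""
    return [
        "wget",
        "--no-verbose",
        "--page-requisites",            # also fetch images, CSS and scripts
        f"--warc-file={warc_prefix}",   # wget appends .warc.gz by default
        url,
    ]


def fetch_warc(url: str, out_dir: str, enabled: Optional[bool] = None,
               timeout: int = 60) -> Optional[str]:
    """Save `url` as a WARC under `out_dir`; return the expected WARC path."""
    if enabled is None:
        # Hypothetical config lookup: treat FETCH_WARC=True as "on".
        enabled = os.environ.get("FETCH_WARC", "True").lower() == "true"
    if not enabled:
        return None

    os.makedirs(out_dir, exist_ok=True)
    # Absolute prefix so it stays valid when wget runs with cwd=out_dir.
    warc_prefix = os.path.abspath(os.path.join(out_dir, "archive"))
    cmd = build_wget_warc_command(url, warc_prefix)
    # check=False: wget exits non-zero on partial failures such as a 404 asset.
    subprocess.run(cmd, cwd=out_dir, timeout=timeout, check=False)
    return warc_prefix + ".warc.gz"
```

Calling `fetch_warc("https://example.com", "output/example")` would then leave `archive.warc.gz` next to the downloaded files, provided wget succeeds.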

FiloSottile commented on May 11, 2024

WARCs are nice for bundling all resources needed to render a page in a browser, but using wget generates a WARC which is just a container for the HTML. Maybe open a new issue for the browser integration?

pirate commented on May 11, 2024

Yeah, actually I think I'll turn it off by default for now because it's too minimal to be a default until we get warcprox working with headless.

Here's the new issue to track the all-in-one WARC file: #130
