bibanon / basc-warc

Library for creating and managing WARC files. Currently in planning / pre-alpha stage.

License: Creative Commons Zero v1.0 Universal

Python 100.00%

basc-warc's Introduction

The Bibliotheca Anonoma

The Bibliotheca Anonoma is a wiki designed to collect, document, and safeguard the products and history of internet culture, which constitutes the shared experience of humanity on a network that defines our lives.

The Wiki

This is the source code viewer for the Bibliotheca Anonoma Wiki.
To view and edit the Wiki, follow one of the links below:

basc-warc's People

Contributors

danieloaks

Forkers

python3pkg

basc-warc's Issues

Allow simple loading of WARC files

Write a function that loads in a WARC file and presents a WarcFile object. Does not have to be streaming yet – we can work on that part of it later.
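A non-streaming loader along these lines could be sketched as follows. This is only an illustration under stated assumptions: the `WarcRecord` and `WarcFile` classes and the `load_warc` function here are hypothetical stand-ins, not the library's actual API, and the parser assumes an uncompressed file, ignores folded (multi-line) header fields, and treats header names as case-sensitive.

```python
import io
from dataclasses import dataclass, field


@dataclass
class WarcRecord:
    """One WARC record: version line, named header fields, raw block bytes."""
    version: str
    headers: dict
    block: bytes


@dataclass
class WarcFile:
    """Hypothetical in-memory container; not the library's real class."""
    records: list = field(default_factory=list)


def load_warc(stream):
    """Read every record from an uncompressed WARC byte stream into memory."""
    warc = WarcFile()
    while True:
        line = stream.readline()
        if not line:
            break  # end of file
        if line.strip() == b"":
            continue  # skip the blank lines that separate records
        version = line.strip().decode("utf-8")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header section
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The block is read by its declared length, never by scanning for a delimiter.
        block = stream.read(int(headers["Content-Length"]))
        warc.records.append(WarcRecord(version, headers, block))
    return warc


sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: resource\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
)
loaded = load_warc(io.BytesIO(sample))
```

Reading the whole file into a list keeps the first version simple; a streaming variant could later turn `load_warc` into a generator that yields one record at a time.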

Possible uses in Website Reconstruction

One important alternate application of this library would be to export data from the WARC files, to output HTML and other metadata.

For example, the Internet Archive has the only snapshots of 4chanarchive/Chanarchive around. We could download the warc.gz files and try to programmatically reconstruct the site as best we can using the BASC-WARC library, with further HTML extraction tools (such as BeautifulSoup) used to export the threads into a database.

On a related note, there's the good ol' warc Python library from the Internet Archive to get us started: http://warc.readthedocs.org/en/latest/
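As a rough sketch of the export step, the block of a WARC `response` record holds a raw HTTP response, so the HTML can be split off and mined for metadata. The example below uses the standard library's `html.parser` as a stand-in for BeautifulSoup so that it has no dependencies; `TitleAndLinks` and `extract_page` are hypothetical names, and the code assumes an uncompressed, non-chunked, UTF-8 body.

```python
from html.parser import HTMLParser


class TitleAndLinks(HTMLParser):
    """Collect the <title> text and every href found in <a> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_page(block):
    """Split an HTTP response block and pull basic metadata from its HTML."""
    # HTTP headers and body are separated by the first empty line.
    head, _, body = block.partition(b"\r\n\r\n")
    status_line = head.split(b"\r\n", 1)[0].decode("iso-8859-1")
    parser = TitleAndLinks()
    parser.feed(body.decode("utf-8", errors="replace"))
    parser.close()
    return {
        "status": status_line,
        "title": parser.title.strip(),
        "links": parser.links,
    }
```

The returned dictionary is the kind of row that could be written to a database, one per archived page; real thread extraction would need site-specific selectors on top of this.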

Write Gzipped Records

Finish the gzipped-records option for WarcFile.bytes(), then confirm it works properly by parsing the output with a few different WARC tools.
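One common way to write gzipped WARC data is to compress each record as its own gzip member and concatenate the members, so that a reader can decompress records individually. The sketch below assumes that layout; `gzip_records` and `iter_members` are hypothetical helpers and say nothing about how `WarcFile.bytes()` is actually implemented. The round-trip check is only a local sanity test, not a substitute for trying the output in other WARC tools.

```python
import gzip
import zlib


def gzip_records(records):
    """Compress each serialized record as its own gzip member and concatenate."""
    return b"".join(gzip.compress(record) for record in records)


def iter_members(data):
    """Yield the decompressed content of each gzip member in turn."""
    while data:
        # wbits = MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
        decompressor = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
        yield decompressor.decompress(data)
        # Whatever follows the finished member belongs to the next record.
        data = decompressor.unused_data


records = [b"record one\r\n\r\n", b"record two\r\n\r\n"]
compressed = gzip_records(records)
round_trip = list(iter_members(compressed))
```

Because `gzip.decompress` also accepts multi-member data, `gzip.decompress(compressed)` should return the records joined back together, which gives a second quick check.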
