yertleturtlegit / gutenberg

This project is forked from pgcorpus/gutenberg.


Pipeline to generate the Standardized Project Gutenberg Corpus

Home Page: https://zenodo.org/record/2422561

License: GNU General Public License v3.0

Language: Python


Standardized Project Gutenberg Corpus

Easily generate a local, up-to-date copy of the Standardized Project Gutenberg Corpus (SPGC).

The Standardized Project Gutenberg Corpus was presented in

A standardized Project Gutenberg corpus for statistical analysis of natural language and quantitative linguistics
M. Gerlach, F. Font-Clos, arXiv:1812.08092, Dec 2018

accompanied by a 'frozen' version of the corpus (SPGC-2018-07-18) as a Zenodo dataset:


SPGC-2018-07-18 contains the tokens/ and counts/ files of all books that were part of Project Gutenberg (PG) as of Jul 18, 2018, matching exactly those used in the paper. Since then, a few thousand more books have been added to PG, so if you want to exactly reproduce the results of the paper, you should use SPGC-2018-07-18.

For most other use cases, however, you probably want the most recent version of the corpus, in which case you should use this repository to generate the corpus locally on your computer. In particular, you will need to generate the corpus locally if you want to work with the original full text files in raw/ and text/, since these are not included in the SPGC-2018-07-18 Zenodo dataset.

Installation

โš ๏ธ Python 2.x is not supported Please make sure your system runs Python 3.x. (https://pythonclock.org/).

Clone this repository:

git clone https://github.com/pgcorpus/gutenberg.git

Enter the newly created gutenberg directory:

cd gutenberg

To install any missing dependencies, just run:

pip install -r requirements.txt

Getting the data

To get a local copy of the PG data, just run

python get_data.py

This will download a copy of all UTF-8 books in PG and will create a CSV file with metadata (e.g. author, title, year, ...).

Notice that if you already have some of the data, the program will only download the files you are missing (we use rsync for this). It is therefore easy to keep the dataset up-to-date by periodically re-running get_data.py.
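Once the download finishes, the metadata CSV can be explored with the standard library alone. A minimal sketch, using a hypothetical in-memory sample: the column names (id, title, author, language) are assumptions here, and the actual columns in the generated file may differ.

```python
import csv
import io

# Hypothetical sample standing in for the metadata CSV that get_data.py
# writes; the real file's column names and values may differ.
SAMPLE = """id,title,author,language
PG2701,"Moby Dick; Or, The Whale","Melville, Herman",['en']
PG2600,"War and Peace","Tolstoy, Leo, graf",['en']
"""

def books_by_author(csv_text, name_fragment):
    """Return the titles whose author field contains name_fragment."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["title"]
            for row in reader
            if name_fragment.lower() in row["author"].lower()]

print(books_by_author(SAMPLE, "melville"))
# ['Moby Dick; Or, The Whale']
```

The same filtering pattern applies to the real file by replacing the in-memory sample with an open file handle.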

Processing the data

To process all the data in the raw/ directory, run:

python process_data.py

This will fill in the text/, tokens/ and counts/ folders.
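To illustrate how these outputs relate, here is a toy sketch of how a tokens-style stream (one token per line, as assumed here) aggregates into the word-frequency pairs a counts/ file holds. The real process_data.py pipeline also handles header/footer removal and tokenization of the raw text.

```python
from collections import Counter

def tokens_to_counts(tokens_text):
    """Aggregate a one-token-per-line string into (word, count) pairs,
    most frequent first: the shape a counts/ file is assumed to hold."""
    return Counter(tokens_text.split()).most_common()

# Toy stand-in for the contents of a tokens/ file.
print(tokens_to_counts("the\nwhale\nthe\nsea\nthe\nwhale\n"))
# [('the', 3), ('whale', 2), ('sea', 1)]
```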

Contributors

fontclos, martingerlach
