freiguy1 / gutenberg

This project forked from c-w/gutenberg

A simple interface to the Project Gutenberg corpus.

Home Page: https://gutenberg.justamouse.com

License: Apache License 2.0

Gutenberg

Overview

This package contains a variety of scripts to make working with the Project Gutenberg body of public domain texts easier.

The functionality provided by this package includes:

  • Downloading texts from Project Gutenberg.
  • Cleaning the texts: removing the Project Gutenberg headers and footers, leaving just the text of the work behind.
  • Making meta-data about the texts easily accessible.

The package has been tested with Python 2.7 and 3.5+.

An HTTP interface to this package exists too. Try it out!

Installation

This project is on PyPI, so I'd recommend that you just install everything from there using your favourite Python package manager.

If you want to install from source or modify the package, you'll need to clone this repository:

Now, you should probably install the dependencies for the package and verify your checkout by running the tests.

Alternatively, you can also run the project via Docker:

Python 3

This package depends on BSD-DB. The bsddb module was deprecated in Python 2.6 and removed from the standard library in Python 3. This means that if you wish to use gutenberg on Python 3, you will need to manually install BSD-DB.
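To check whether a BSD-DB binding is already available before going further, a small sketch like the one below may help. It assumes the Python 3 binding is the third-party `bsddb3` module (the name used by the wheel mentioned in the Windows section); `has_module` and `bsddb_status` are illustrative names, not part of this package.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module called `name` can be found."""
    return importlib.util.find_spec(name) is not None


def bsddb_status():
    """Describe whether the BSD-DB binding appears to be installed."""
    # Assumption: on Python 3 the binding is importable as `bsddb3`.
    if has_module("bsddb3"):
        return "bsddb3 is importable"
    return "bsddb3 is missing; install BSD-DB first"


print(bsddb_status())
```

This only tells you that the module can be located, not that the underlying BSD-DB library version is the one you want.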

Linux

On Linux, you can usually install BSD-DB using your distribution's package manager. For example, on Ubuntu, you can use apt-get:

macOS

On macOS, you can install BSD-DB using Homebrew:

Windows

On Windows, it's easiest to download a pre-compiled version of BSD-DB from pythonlibs.

For example, if you have Python 3.5 on a 64-bit version of Windows, you should download bsddb3-6.2.1-cp35-cp35m-win_amd64.whl.

After you download the wheel, install it and you're good to go:

License conflicts

Starting with its v6.x releases, BSD-DB switched to the AGPL3 license, which is stricter than this project's Apache v2 license. This means that unless you're happy to comply with the terms of the AGPL3 license, you'll have to install an earlier version of BSD-DB (anything between 4.8.30 and 5.x should be fine). If you are happy to use this project under AGPL3 (or if you have a commercial license for BSD-DB), you need to set the environment variable that the BSD-DB installer expects before attempting to install BSD-DB.
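The version window mentioned in this section (4.8.30 up to, but not including, 6.x) can be checked with plain tuple comparison. This is only a sketch: `parse_version` and `avoids_agpl` are hypothetical helper names, and it assumes purely numeric dotted version strings.

```python
def parse_version(version):
    """Turn a dotted version string such as '5.3.28' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def avoids_agpl(version):
    """True if the BSD-DB version falls in the 4.8.30 to 5.x window."""
    parsed = parse_version(version)
    # Tuples compare element by element, so (5, 3, 28) < (6,) holds
    # while (6, 2, 1) < (6,) does not.
    return (4, 8, 30) <= parsed < (6,)


for candidate in ("4.8.30", "5.3.28", "6.2.1"):
    print(candidate, avoids_agpl(candidate))
```

This is not legal advice, just a quick way to spot a v6.x install before you depend on it.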

Apache Jena Fuseki

As an alternative to the BSD-DB backend, this package can also use Apache Jena Fuseki for the metadata store. The Apache Jena Fuseki backend is activated by setting the GUTENBERG_FUSEKI_URL environment variable to the HTTP endpoint at which Fuseki is listening. If the Fuseki server has HTTP basic authentication enabled, the username and password can be provided via the GUTENBERG_FUSEKI_USER and GUTENBERG_FUSEKI_PASSWORD environment variables.
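As a sanity check before starting Python, you can inspect the three variables yourself. The sketch below is not how the package reads its configuration; it just mirrors the rules described above. `fuseki_settings` is a hypothetical helper and the example URL is an assumed local endpoint.

```python
import os


def fuseki_settings(environ=None):
    """Collect the Fuseki settings, or return None if the URL is unset."""
    environ = os.environ if environ is None else environ
    url = environ.get("GUTENBERG_FUSEKI_URL")
    if not url:
        return None  # the Fuseki backend would not be activated
    return {
        "url": url,
        # Only needed when HTTP basic authentication is enabled.
        "user": environ.get("GUTENBERG_FUSEKI_USER"),
        "password": environ.get("GUTENBERG_FUSEKI_PASSWORD"),
    }


# Hypothetical endpoint for a local Fuseki server; adjust to your setup.
example = {"GUTENBERG_FUSEKI_URL": "http://localhost:3030/ds"}
print(fuseki_settings(example))
print(fuseki_settings({}))
```

Calling `fuseki_settings()` with no argument inspects the real process environment.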

For local development, the Fuseki server can be run via Docker:

Usage

Downloading a text
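A minimal sketch of downloading and cleaning one text, assuming the package exposes `load_etext` in `gutenberg.acquire` and `strip_headers` in `gutenberg.cleanup` (that is my understanding of its public API). The wrapper name `fetch_clean_text` is made up for this example.

```python
def fetch_clean_text(etext_id):
    """Download one text and strip the Project Gutenberg header and footer."""
    # Imported inside the function so the sketch can be defined even when
    # the package (and its BSD-DB dependency) is not installed yet.
    from gutenberg.acquire import load_etext
    from gutenberg.cleanup import strip_headers

    return strip_headers(load_etext(etext_id)).strip()
```

You would call it with a Project Gutenberg text number, for example `fetch_clean_text(2701)`, which I believe is Moby Dick.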

Looking up meta-data

A bunch of meta-data about ebooks can be queried:
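A sketch of both query directions, assuming `get_metadata(feature, etext_id)` and `get_etexts(feature, value)` live in `gutenberg.query` and return set-like collections. `describe_text` and `texts_by_author` are illustrative wrappers, and "title" and "author" are assumed to be among the supported features.

```python
def describe_text(etext_id):
    """Return the title and author values recorded for one text."""
    from gutenberg.query import get_metadata

    return {
        "title": sorted(get_metadata("title", etext_id)),
        "author": sorted(get_metadata("author", etext_id)),
    }


def texts_by_author(author):
    """Return the sorted text numbers attributed to `author`."""
    from gutenberg.query import get_etexts

    return sorted(get_etexts("author", author))
```

Results come back as collections rather than single values because a text can have several values for one feature, and one value can match several texts.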

You can get a full list of the meta-data that can be queried by calling:
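As far as I know, the function in question is `list_supported_metadatas` in `gutenberg.query`; the wrapper below (a made-up name) just sorts its result for display.

```python
def supported_features():
    """Return the queryable meta-data names in alphabetical order."""
    from gutenberg.query import list_supported_metadatas

    return sorted(list_supported_metadatas())
```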

Before you use one of the gutenberg.query functions you must populate the local metadata cache. This one-off process will take quite a while to complete (18 hours on my machine) but once it is done, any subsequent calls to get_etexts or get_metadata will be very fast. If you fail to populate the cache, the calls will raise an exception.

To populate the cache:
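A minimal sketch, assuming `get_metadata_cache` is importable from `gutenberg.acquire` and that the returned cache object has a `populate()` method. `populate_default_cache` is a name invented for this example.

```python
def populate_default_cache():
    """Build the default metadata cache; expect this to run for hours."""
    from gutenberg.acquire import get_metadata_cache

    cache = get_metadata_cache()
    cache.populate()
    return cache
```

Because this is a one-off step, it is best run once from a script or an interactive session rather than at import time in your application.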

If you need more fine-grained control over the cache (e.g. where it's stored or which backend is used), you can use the set_metadata_cache function to switch out the backend of the cache before you populate it. For example, to use the SQLite cache backend instead of the default Sleepycat backend and store the cache at a custom location, you'd do the following:
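A sketch of that switch, assuming the SQLite backend is the class `SqliteMetadataCache` in `gutenberg.acquire.metadata` and that it takes the cache location as its only argument. `use_sqlite_cache` is an illustrative wrapper.

```python
def use_sqlite_cache(path):
    """Switch to a SQLite-backed metadata cache at `path`, then populate it."""
    from gutenberg.acquire import set_metadata_cache
    from gutenberg.acquire.metadata import SqliteMetadataCache

    cache = SqliteMetadataCache(path)
    # Register the backend first, then fill it, as described above.
    set_metadata_cache(cache)
    cache.populate()
    return cache
```

A call might look like `use_sqlite_cache("/my/custom/location/cache.sqlite")`, where the path is a placeholder for a location of your choosing.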

Limitations

This project deliberately does not include any natural language processing functionality. Consuming and processing the text is the responsibility of the client; this library focuses solely on offering a simple, easy-to-use interface to the works in the Project Gutenberg corpus. Any linguistic processing can easily be done client-side, e.g. using the TextBlob library.
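To illustrate what "client-side" means here, the sketch below counts word frequencies in a string using only the standard library. It is a stand-in for a real NLP library such as TextBlob, not a replacement for one. `top_words` is a hypothetical helper and the sample sentence is made up.

```python
import re
from collections import Counter


def top_words(text, n=3):
    """Return the n most frequent lowercase words in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)


# Made-up sample; in practice this would be a cleaned Gutenberg text.
sample = "Call me Ishmael. Call me what you will."
print(top_words(sample, 2))
```

Any string returned by this package can be passed to a function like this, since the library hands back plain text.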
