jonathanreeve / corpus-db

A textual corpus database for the digital humanities.

Home Page: http://corpus-db.org

License: GNU General Public License v3.0

Topics: corpus-linguistics, digital-humanities, literature, literary-studies, literary-criticism, literary-analysis, text-analysis, natural-language-processing

corpus-db's Introduction

Corpus-DB

Corpus-DB is a textual corpus database for the digital humanities. This project aggregates public domain texts, enhances their metadata from sources like Wikipedia, and makes those texts available according to that metadata. This will make it easy to download subcorpora like:

  • Bildungsromans
  • Dickens novels
  • Poetry published in the 1880s
  • Novels set in London

Corpus-DB has several components:

  • Scripts for aggregating metadata, written in Python
  • The database, currently a few SQLite databases
  • A REST API for querying the database, written in Haskell (currently in progress)
  • Analytic experiments, mostly in Python

Read more about the database at this introductory blog post. Scripts used to generate the database are in the gitenberg-experiments repo.

Contributing

I could use some help with this, especially if you know Python or Haskell, have library or bibliography experience, or simply like books. Get in touch in the chat room, or contact me via email.

Join the chat at https://gitter.im/corpus-db/Lobby

Hacking

If you want to build the website and API, you'll need the Haskell Tool Stack.

stack build                 # build the project and its dependencies
cd src
export ENV=dev              # point the app at the small development database
stack runhaskell Main.hs    # run the API server

If you use ENV=dev, the database path is set to /data/dev.db, a 30-row subset of the main database (the full database, 16 GB at the moment, is too big to put on GitHub). You can use this dev database for local hacking. If you need the full database for some reason, let me know.
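
Once the server builds, you can also poke at the dev database directly with Python's sqlite3 module. This is just a sketch; adjust the path to wherever the dev database sits in your checkout, and note that it queries sqlite_master so it works without knowing the schema in advance:

    import sqlite3

    # Open the small development database (adjust the path to match your checkout).
    conn = sqlite3.connect("data/dev.db")

    # List the tables it contains; sqlite_master works for any SQLite schema.
    tables = [row[0] for row in
              conn.execute("SELECT name FROM sqlite_master WHERE type='table'")]
    print(tables)

    # Peek at a few rows of the first table.
    if tables:
        for row in conn.execute(f"SELECT * FROM {tables[0]} LIMIT 3"):
            print(row)

    conn.close()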

Upcoming Changes

I'm rewriting corpus-db from scratch (see issues labeled 2.0). The goal is to make the whole Corpus-DB toolchain repeatable (in case of data loss) and future-proof (so that it can ingest new texts from Project Gutenberg and other sources as they arrive). Feel free to help out with this!

  1. Parse Project Gutenberg RDF/XML metadata, and put it into a database (a rough sketch follows this list).
  2. Mirror PG, using an rsync script.
  3. Clean PG texts, and add them to that database. Also add HTML files.
  4. Write an ORM-level database layer, using Persistent, for more native DB interactions and typesafe queries.
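
For step 1, the parsing might look something like this in Python with lxml. The namespaces and element paths here reflect Project Gutenberg's RDF/XML as I understand it, so treat them as assumptions to verify against the actual files:

    from lxml import etree

    # Namespaces used in Project Gutenberg's RDF/XML metadata (verify against the files).
    NS = {
        "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
        "dcterms": "http://purl.org/dc/terms/",
        "pgterms": "http://www.gutenberg.org/2009/pgterms/",
    }

    def parse_pg_rdf(path):
        """Extract a few metadata fields from one Project Gutenberg RDF file."""
        tree = etree.parse(path)
        title = tree.findtext(".//dcterms:title", namespaces=NS)
        authors = tree.xpath(".//dcterms:creator//pgterms:name/text()", namespaces=NS)
        subjects = tree.xpath(".//dcterms:subject//rdf:value/text()", namespaces=NS)
        return {"title": title, "authors": authors, "subjects": subjects}

    # Example (path layout is a guess at the PG rdf-files mirror):
    # print(parse_pg_rdf("cache/epub/84/pg84.rdf"))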


corpus-db's Issues

Need a title search

It would be great to have a way to search by title, if that's all the info we have (i.e., we don't know the Gutenberg ID, author, etc.). Thanks!
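
For illustration, a title search could be as simple as a LIKE query against the metadata table. The table and column names here (meta, title) are guesses, since the actual schema isn't documented:

    import sqlite3

    def search_by_title(db_path, fragment):
        """Return rows whose title contains the given fragment."""
        conn = sqlite3.connect(db_path)
        # 'meta' and 'title' are placeholder names for the real table/column.
        # SQLite's LIKE is case-insensitive for ASCII by default.
        rows = conn.execute(
            "SELECT * FROM meta WHERE title LIKE ?",
            ("%" + fragment + "%",),
        ).fetchall()
        conn.close()
        return rows

    # search_by_title("data/dev.db", "Moby")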

Speed up full-text search response

Full-text searches take a very long time, even though results come back iteratively through the SQLite interface. They could probably be sped up by treating the results of a database query more like a stream. Haskell might already be doing some kind of streaming; it would be good to investigate this a little.
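
The server side is Haskell, but for concreteness, the idea (consume rows as they arrive rather than materializing the whole result set) looks like this with Python's sqlite3; the table name comes from the FTS5 issue below and may not match the real schema:

    import sqlite3

    conn = sqlite3.connect("data/dev.db")

    # fetchall() would build the whole result list in memory before returning
    # anything; iterating over the cursor yields one row at a time instead,
    # so results can be streamed to the client as they arrive.
    for row in conn.execute("SELECT * FROM text"):   # 'text' is the full-text table
        print(row[0])  # stand-in for writing the row out to the HTTP response

    conn.close()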

Consider other database formats

It would be worth considering other database formats for this project, since SQLite only supports certain data types. It's definitely nice to have everything in one file, though.

IDs formatted as floats

When getting subject metadata, the IDs all have .0 at the end. They should be integers, not floats.
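
Until a server-side fix lands, a client can normalize the IDs itself. A minimal sketch, assuming the API returns IDs as numbers or strings ending in .0:

    def normalize_id(raw):
        """Turn an ID like 807.0 or "807.0" into the integer 807."""
        return int(float(raw))

    assert normalize_id("807.0") == 807
    assert normalize_id(807.0) == 807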

Make a robots.txt

The server logs show a lot of requests for robots.txt. We should add one that tells bots where the content pages are (and to stay away from the API).
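
A minimal sketch of what such a file might contain (the exact paths to allow or disallow are up for discussion):

    # robots.txt for corpus-db.org (sketch)
    User-agent: *
    Disallow: /api/
    Allow: /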

Rewrite database layer using Persistent

The nice thing about using Haskell is type safety, and HDBC isn't as type-safe as an ORM-like database layer such as Persistent. Persistent would also let us migrate the database effortlessly, if need be.

Make local test DB and add to GitHub repo

This will facilitate collaboration and building the system for those who don't need or want the full 16 GB database.

I'm guessing the process will be (a rough Python sketch follows the list):

  • make a new DB, attach it using the ATTACH command
  • copy over the first 20 or so texts to the new DB
  • add the DB to GitHub
  • add a new environment (default) that will use the relative path to this test DB
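
Something like this might work for the copy step. The path to the full database and the meta table name are guesses; the text table is the one mentioned in the full-text-search issue:

    import sqlite3

    # Open the full database and attach a fresh dev database to it.
    conn = sqlite3.connect("data/pg-text.db")     # placeholder path to the full DB
    conn.execute("ATTACH DATABASE 'data/dev.db' AS dev")

    # Copy the first ~20 rows of each table into the dev database.
    # (CREATE TABLE ... AS SELECT copies columns but not constraints.)
    # 'text' is the full-text table; 'meta' is a guess at the metadata table's name.
    conn.execute("CREATE TABLE dev.text AS SELECT * FROM text LIMIT 20")
    conn.execute("CREATE TABLE dev.meta AS SELECT * FROM meta LIMIT 20")

    conn.commit()
    conn.close()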

Write out API spec

This would, in the beginning, just be a list of URL patterns and descriptions, like:

  • api.corpus-db.org/author/Dickens -- Should give you the metadata of Dickens novels
  • api.corpus-db.org/fulltext/author/Dickens -- Should give the full text of all Dickens novels

This could eventually become more formal API documentation.
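
Even before formal docs exist, each pattern could come with a short usage sketch. For example, with the proposed (not necessarily live) endpoints above:

    import requests

    BASE = "http://api.corpus-db.org"   # proposed base URL from the list above

    # Metadata for all Dickens novels (proposed pattern).
    meta = requests.get(BASE + "/author/Dickens").json()

    # Full text of all Dickens novels (proposed pattern).
    texts = requests.get(BASE + "/fulltext/author/Dickens").json()

    print(len(meta), "records")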

Create full text search (FTS5) table

https://www.sqlite.org/fts5.html

Steps are:

  1. Create a new FTS5 virtual table modeled on the text table
  2. Copy all data from text to the new table
  3. ???
  4. Profit!

I have no idea how big this will make the database, though, since I had to kill the process on my laptop after the database more than doubled in size (>18G). I think I'll need to temporarily buy a new DigitalOcean volume and hook it up to the server in order to test this.
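
Steps 1 and 2 might look like this in Python; the column names of the text table (gutenberg_id, text) are guesses at the real schema:

    import sqlite3

    conn = sqlite3.connect("data/pg-text.db")   # placeholder path to the full DB

    # 1. Create the FTS5 virtual table. UNINDEXED keeps the ID out of the
    #    full-text index; only the text column gets tokenized.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS text_fts "
        "USING fts5(gutenberg_id UNINDEXED, text)")

    # 2. Copy everything over. A plain FTS5 table stores its own copy of the
    #    data, which is why the database roughly doubles in size.
    conn.execute("INSERT INTO text_fts SELECT gutenberg_id, text FROM text")
    conn.commit()

    # Querying it:
    for row in conn.execute(
            "SELECT gutenberg_id FROM text_fts WHERE text_fts MATCH ? LIMIT 5",
            ("whale",)):
        print(row)

    conn.close()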

Find a way to automatically create API docs from code comments

The usual way of writing Haskell documentation seems to be Haddock, which generates documentation automatically from code comments.

This isn't necessary, but it would be nice, since that way we'd have everything in one place (without having to update the docs page every time there's an API change).

Migrate to Docker

It'd be nice to have the whole setup procedure (stack build, etc.) containerized with Docker. That would save time when migrating to a new server. I'm not quite sure how data volumes would work, though. Low priority for the moment, since this is a DevOps issue.

Full text wrapped in square brackets

When getting a full text, like in this example:

http://corpus-db.org/api/id/108/fulltext

the result is wrapped in square brackets. That means that when it's loaded as JSON, it's a one-item list containing the dictionary rather than just the dictionary. It would be better without the brackets, so the extra layer wouldn't need to be unwrapped.
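
In the meantime, clients just have to unwrap the list themselves; a sketch with requests, using the endpoint from the example above:

    import requests

    resp = requests.get("http://corpus-db.org/api/id/108/fulltext")
    data = resp.json()

    # The response is currently a one-element list wrapping the dictionary,
    # so take the first element to get at the actual record.
    record = data[0]
    print(record.keys())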

Compile statistics about data field completeness

Wikipedia data exists for only about 1-2K of the ~45K books in PG, if I recall correctly. To figure out where there's room for improvement, it would first help to know the completeness of every data field. Then we can identify patterns in the books that have very little metadata.
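
A first pass at those statistics could be a per-column share of non-empty values; a sketch with pandas, with the metadata table name (meta) as a placeholder:

    import sqlite3
    import pandas as pd

    conn = sqlite3.connect("data/pg-text.db")            # placeholder path to the full DB
    df = pd.read_sql_query("SELECT * FROM meta", conn)   # 'meta' is a guessed table name
    conn.close()

    # Share of rows with a non-null, non-empty value, per column.
    completeness = df.replace("", pd.NA).notna().mean().sort_values()
    print(completeness)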

Semantic endpoints (singular/plural)

According to what I've seen/read, typically the endpoints stay plural, like

baseurl + "/api/subjects/detectivefiction"

Semantically, the category above, subjects, covers all the subjects; the next segment, detectivefiction, is the singular item within it.

If the API was like:

baseurl + '/api?subject="detectivefiction"&output=json'

then that makes sense because subject isn't the category above, it's the key to which detectivefiction is the value.

It's also confusing if "subjects" lists all the subjects but that naming isn't carried through when requesting a specific subject.

More semantics

Thinking about semantics again. In this example:

http://corpus-db.org/api/id/807/fulltext

I think this has some issues. As a user, if I see the endpoint for getting some specific data attached to a resource, I should a) also know how to get the resource metadata and b) see how to get the metadata for all resources of that type.

If the endpoint was something like:

http://corpus-db.org/api/v1/books/id/10?fulltext=true

or

http://corpus-db.org/api/v1/books?fulltext=true&id=101

you could infer that

    http://corpus-db.org/api/v1/books

gives you all the metadata for the books resource and

http://corpus-db.org/api/v1/books?id=101

gives you metadata, probably with an excerpt, without the full text. Also, if you wind up adding more resources, you've painted yourself into a corner with the approach that only has the id. Imagine if you want to do this:

http://corpus-db.org/api/v1/author?name="Margaret+Atwood"
http://corpus-db.org/api/v1/author?id=391

that doesn't really jibe with the format you've established for books, making books a special case rather than a template that lets you infer how the rest of the system works. I think consistency here makes the whole API a lot more usable and intuitive.

Make project website

It'd be great to do this in Scotty, so that it'll be the same codebase and language as the API.

Add more example analyses

More comparisons between single-author corpora would be interesting, as well as more metadata analyses. Example text analyses in languages other than Python might be interesting, too. How about some R analyses? Haskell, even. Jupyter supports all of these languages now, so they could all be in Jupyter notebooks, and thus displayable on GitHub.

New API endpoint for Wikipedia categories

This would allow downloading books with the Wikipedia category "Novels set in London," for instance:

corpus-db.org/api/category/Novels set in London

And its full-text counterpart:

corpus-db.org/api/category/Novels set in London/fulltext
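
Since category names contain spaces, clients would need to URL-encode the path segment. A sketch of how the proposed endpoint might be called (the endpoint itself doesn't exist yet):

    from urllib.parse import quote
    import requests

    category = "Novels set in London"
    url = "http://corpus-db.org/api/category/" + quote(category)   # proposed endpoint

    # Metadata for every book in the Wikipedia category, once the endpoint exists.
    books = requests.get(url).json()

    # And the full-text counterpart.
    texts = requests.get(url + "/fulltext").json()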

Clean database

A lot of fields in the database are just string representations of Python objects (lists, dictionaries, etc.). It would help to put this into a more structured format, if possible.
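
One low-tech way to start is to parse those stringified Python objects back into real objects and re-serialize them as JSON; a sketch (which fields this applies to is left open):

    import ast
    import json

    def restructure(value):
        """Convert a stringified Python object (e.g. "['a', 'b']") into JSON text."""
        try:
            return json.dumps(ast.literal_eval(value))
        except (ValueError, SyntaxError):
            return value  # leave anything that doesn't parse untouched

    print(restructure("['London (England)', 'Fiction']"))
    # -> ["London (England)", "Fiction"]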

Max results query parameter

It would be nice to have a query parameter that caps the number of results returned, to speed up examples in the notebook.
