jonathanreeve / corpus-db

A textual corpus database for the digital humanities.

Home Page: http://corpus-db.org

License: GNU General Public License v3.0

Topics: corpus-linguistics, digital-humanities, literature, literary-studies, literary-criticism, literary-analysis, text-analysis, natural-language-processing

corpus-db's Introduction

Corpus-DB

Corpus-DB is a textual corpus database for the digital humanities. This project aggregates public domain texts, enhances their metadata from sources like Wikipedia, and makes those texts available according to that metadata. This will make it easy to download subcorpora like:

  • Bildungsromans
  • Dickens novels
  • Poetry published in the 1880s
  • Novels set in London

Corpus-DB has several components:

  • Scripts for aggregating metadata, written in Python
  • The database, currently a few SQLite databases
  • A REST API for querying the database, written in Haskell (currently in progress)
  • Analytic experiments, mostly in Python

Read more about the database at this introductory blog post. Scripts used to generate the database are in the gitenberg-experiments repo.

Contributing

I could use some help with this, especially if you know Python or Haskell, have library or bibliography experience, or simply like books. Get in touch in the chat room, or contact me via email.

Join the chat at https://gitter.im/corpus-db/Lobby

Hacking

If you want to build the website and API, you'll need the Haskell Tool Stack.

stack build                 # build the project and its dependencies
cd src
export ENV=dev              # point the app at the small development database
stack runhaskell Main.hs    # run the API server

If you use ENV=dev, the database path is set to /data/dev.db, a 30-row subset of the main database (the full database, 16 GB at the moment, is too big to put on GitHub). You can use this dev database for local hacking. If you need the full database for some reason, let me know.
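
Once the server builds, you can also poke at the dev database directly with Python's sqlite3 module. This is just a sketch; adjust the path to wherever the dev database sits in your checkout, and note that it queries sqlite_master so it works without knowing the schema in advance:

    import sqlite3

    # Open the small development database (adjust the path to match your checkout).
    conn = sqlite3.connect("data/dev.db")

    # List the tables it contains; sqlite_master works for any SQLite schema.
    tables = [row[0] for row in
              conn.execute("SELECT name FROM sqlite_master WHERE type='table'")]
    print(tables)

    # Peek at a few rows of the first table.
    if tables:
        for row in conn.execute(f"SELECT * FROM {tables[0]} LIMIT 3"):
            print(row)

    conn.close()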

Upcoming Changes

I'm rewriting corpus-db from scratch (see issues labeled 2.0). The goal is to make the whole Corpus-DB toolchain repeatable (in case of data loss) and future-proof (so that it can ingest new texts from Project Gutenberg and other sources as they arrive). Feel free to help out with this!

  1. Parse Project Gutenberg RDF/XML metadata, and put it into a database (a rough sketch follows this list).
  2. Mirror PG, using an rsync script.
  3. Clean PG texts, and add them to that database. Also add HTML files.
  4. Write an ORM-level database layer, using Persistent, for more native DB interactions and typesafe queries.
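
For step 1, the parsing might look something like this in Python with lxml. The namespaces and element paths here reflect Project Gutenberg's RDF/XML as I understand it, so treat them as assumptions to verify against the actual files:

    from lxml import etree

    # Namespaces used in Project Gutenberg's RDF/XML metadata (verify against the files).
    NS = {
        "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
        "dcterms": "http://purl.org/dc/terms/",
        "pgterms": "http://www.gutenberg.org/2009/pgterms/",
    }

    def parse_pg_rdf(path):
        """Extract a few metadata fields from one Project Gutenberg RDF file."""
        tree = etree.parse(path)
        title = tree.findtext(".//dcterms:title", namespaces=NS)
        authors = tree.xpath(".//dcterms:creator//pgterms:name/text()", namespaces=NS)
        subjects = tree.xpath(".//dcterms:subject//rdf:value/text()", namespaces=NS)
        return {"title": title, "authors": authors, "subjects": subjects}

    # Example (path layout is a guess at the PG rdf-files mirror):
    # print(parse_pg_rdf("cache/epub/84/pg84.rdf"))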


corpus-db's Issues

Need a title search

It would be great to have a way to search by title, if that's all the info we have (i.e., we don't know the Gutenberg ID, author, etc.). Thanks!
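
For illustration, a title search could be as simple as a LIKE query against the metadata table. The table and column names here (meta, title) are guesses, since the actual schema isn't documented:

    import sqlite3

    def search_by_title(db_path, fragment):
        """Return rows whose title contains the given fragment."""
        conn = sqlite3.connect(db_path)
        # 'meta' and 'title' are placeholder names for the real table/column.
        # SQLite's LIKE is case-insensitive for ASCII by default.
        rows = conn.execute(
            "SELECT * FROM meta WHERE title LIKE ?",
            ("%" + fragment + "%",),
        ).fetchall()
        conn.close()
        return rows

    # search_by_title("data/dev.db", "Moby")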

Speed up full-text search response

Full-text searches take a very long time, even though results come back iteratively through the SQLite interface. They could probably be sped up by treating the results of a database query more like a stream. Haskell might already be doing some kind of streaming; it would be good to investigate this a little.
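
The server side is Haskell, but for concreteness, the idea (consume rows as they arrive rather than materializing the whole result set) looks like this with Python's sqlite3; the table name comes from the FTS5 issue below and may not match the real schema:

    import sqlite3

    conn = sqlite3.connect("data/dev.db")

    # fetchall() would build the whole result list in memory before returning
    # anything; iterating over the cursor yields one row at a time instead,
    # so results can be streamed to the client as they arrive.
    for row in conn.execute("SELECT * FROM text"):   # 'text' is the full-text table
        print(row[0])  # stand-in for writing the row out to the HTTP response

    conn.close()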

Consider other database formats

It would be worth considering other database formats for this project, since SQLite only supports certain data types. It's definitely nice to have everything in one file, though.

IDs formatted as floats

When getting subject metadata, the IDs all have .0 at the end. They should be integers, not floats.
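
Until a server-side fix lands, a client can normalize the IDs itself. A minimal sketch, assuming the API returns IDs as numbers or strings ending in .0:

    def normalize_id(raw):
        """Turn an ID like 807.0 or "807.0" into the integer 807."""
        return int(float(raw))

    assert normalize_id("807.0") == 807
    assert normalize_id(807.0) == 807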

Make a robots.txt

The server logs show a lot of requests for robots.txt. We should add one that tells bots where the content pages are (and to stay away from the API).
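
A minimal sketch of what such a file might contain (the exact paths to allow or disallow are up for discussion):

    # robots.txt for corpus-db.org (sketch)
    User-agent: *
    Disallow: /api/
    Allow: /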

Rewrite database layer using Persistent

The nice thing about using Haskell is type safety, and HDBC isn't as type-safe as an ORM-like database layer such as Persistent. Persistent would also let us migrate the database effortlessly, if need be.

Make local test DB and add to GitHub repo

This will facilitate collaboration and building the system for those who don't need or want the full 16 GB database.

I'm guessing the process will be (a rough Python sketch follows the list):

  • make a new DB, attach it using the ATTACH command
  • copy over the first 20 or so texts to the new DB
  • add the DB to GitHub
  • add a new environment (default) that will use the relative path to this test DB
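
Something like this might work for the copy step. The path to the full database and the meta table name are guesses; the text table is the one mentioned in the full-text-search issue:

    import sqlite3

    # Open the full database and attach a fresh dev database to it.
    conn = sqlite3.connect("data/pg-text.db")     # placeholder path to the full DB
    conn.execute("ATTACH DATABASE 'data/dev.db' AS dev")

    # Copy the first ~20 rows of each table into the dev database.
    # (CREATE TABLE ... AS SELECT copies columns but not constraints.)
    # 'text' is the full-text table; 'meta' is a guess at the metadata table's name.
    conn.execute("CREATE TABLE dev.text AS SELECT * FROM text LIMIT 20")
    conn.execute("CREATE TABLE dev.meta AS SELECT * FROM meta LIMIT 20")

    conn.commit()
    conn.close()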

Write out API spec

This would, in the beginning, just be a list of URL patterns and descriptions, like:

  • api.corpus-db.org/author/Dickens -- Should give you the metadata of Dickens novels
  • api.corpus-db.org/fulltext/author/Dickens -- Should give the full text of all Dickens novels

This could eventually become more formal API documentation.
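
Even before formal docs exist, each pattern could come with a short usage sketch. For example, with the proposed (not necessarily live) endpoints above:

    import requests

    BASE = "http://api.corpus-db.org"   # proposed base URL from the list above

    # Metadata for all Dickens novels (proposed pattern).
    meta = requests.get(BASE + "/author/Dickens").json()

    # Full text of all Dickens novels (proposed pattern).
    texts = requests.get(BASE + "/fulltext/author/Dickens").json()

    print(len(meta), "records")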

Create full text search (FTS5) table

https://www.sqlite.org/fts5.html

Steps are:

  1. Create a new FTS5 virtual table modeled on the text table
  2. Copy all data from text to the new table
  3. ???
  4. Profit!

I have no idea how big this will make the database, though, since I had to kill the process on my laptop after the database more than doubled in size (>18G). I think I'll need to temporarily buy a new DigitalOcean volume and hook it up to the server in order to test this.
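
Steps 1 and 2 might look like this in Python; the column names of the text table (gutenberg_id, text) are guesses at the real schema:

    import sqlite3

    conn = sqlite3.connect("data/pg-text.db")   # placeholder path to the full DB

    # 1. Create the FTS5 virtual table. UNINDEXED keeps the ID out of the
    #    full-text index; only the text column gets tokenized.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS text_fts "
        "USING fts5(gutenberg_id UNINDEXED, text)")

    # 2. Copy everything over. A plain FTS5 table stores its own copy of the
    #    data, which is why the database roughly doubles in size.
    conn.execute("INSERT INTO text_fts SELECT gutenberg_id, text FROM text")
    conn.commit()

    # Querying it:
    for row in conn.execute(
            "SELECT gutenberg_id FROM text_fts WHERE text_fts MATCH ? LIMIT 5",
            ("whale",)):
        print(row)

    conn.close()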

Find a way to automatically create API docs from code comments

The usual way of writing Haskell documentation seems to be Haddock, which generates documentation automatically from code comments.

This isn't necessary, but it would be nice, since that way we'd have everything in one place (without having to update the docs page every time there's an API change).

Migrate to Docker

It'd be nice to have the whole setup procedure (stack build, etc.) containerized with Docker. That would save time when migrating to a new server. I'm not quite sure how data volumes would work, though. Low priority for the moment, since this is a DevOps issue.

Full text wrapped in square brackets

When getting a full text, like in this example:

http://corpus-db.org/api/id/108/fulltext

the result is wrapped in square brackets. That means that when it's loaded as JSON, it's a one-item list containing the dictionary rather than just the dictionary. It would be better without the brackets, so the extra layer wouldn't need to be unwrapped.
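
In the meantime, clients just have to unwrap the list themselves; a sketch with requests, using the endpoint from the example above:

    import requests

    resp = requests.get("http://corpus-db.org/api/id/108/fulltext")
    data = resp.json()

    # The response is currently a one-element list wrapping the dictionary,
    # so take the first element to get at the actual record.
    record = data[0]
    print(record.keys())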

Compile statistics about data field completeness

Wikipedia data exists for only about 1-2K of the ~45K books in PG, if I recall correctly. To figure out where there's room for improvement, it would first help to know the completeness of every data field. Then we can identify patterns in the books that have very little metadata.
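
A first pass at those statistics could be a per-column share of non-empty values; a sketch with pandas, with the metadata table name (meta) as a placeholder:

    import sqlite3
    import pandas as pd

    conn = sqlite3.connect("data/pg-text.db")            # placeholder path to the full DB
    df = pd.read_sql_query("SELECT * FROM meta", conn)   # 'meta' is a guessed table name
    conn.close()

    # Share of rows with a non-null, non-empty value, per column.
    completeness = df.replace("", pd.NA).notna().mean().sort_values()
    print(completeness)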

Semantic endpoints (singular/plural)

According to what I've seen/read, typically the endpoints stay plural, like

baseurl + "/api/subjects/detectivefiction"

Semantically, the category above, subjects, covers all the subjects; the next segment, detectivefiction, is the singular item within it.

If the API was like:

baseurl + '/api?subject="detectivefiction"&output=json'

then that makes sense because subject isn't the category above, it's the key to which detectivefiction is the value.

It's also confusing if "subjects" lists all the subjects but that naming isn't carried through when requesting a specific subject.

More semantics

Thinking about semantics again. In this example:

http://corpus-db.org/api/id/807/fulltext

I think this has some issues. As a user, if I see the endpoint for getting some specific data attached to a resource, I should a) also know how to get the resource metadata and b) see how to get the metadata for all resources of that type.

If the endpoint was something like:

http://corpus-db.org/api/v1/books/id/10?fulltext=true

or

http://corpus-db.org/api/v1/books?fulltext=true&id=101

you could infer that

    http://corpus-db.org/api/v1/books

gives you all the metadata for the books resource and

http://corpus-db.org/api/v1/books?id=101

gives you metadata, probably with an excerpt, without the full text. Also, if you wind up adding more resources, you've painted yourself into a corner with the approach that only has the id. Imagine if you want to do this:

http://corpus-db.org/api/v1/author?name="Margaret+Atwood"
http://corpus-db.org/api/v1/author?id=391

that doesn't really jibe with the format you've established for books, making books a special case rather than a template that lets you infer how the rest of the system works. I think consistency here makes the whole API a lot more usable and intuitive.

Make project website

It'd be great to do this in Scotty, so that it'll be the same codebase and language as the API.

Add more example analyses

More comparisons between single-author corpora would be interesting, as well as more metadata analyses. Example text analyses in languages other than Python might be interesting, too. How about some R analyses? Haskell, even. Jupyter supports all of these languages now, so they could all be in Jupyter notebooks, and thus displayable on GitHub.

New API endpoint for Wikipedia categories

This would allow downloading books with the Wikipedia category "Novels set in London," for instance:

corpus-db.org/api/category/Novels set in London

And its full-text counterpart:

corpus-db.org/api/category/Novels set in London/fulltext
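
Since category names contain spaces, clients would need to URL-encode the path segment. A sketch of how the proposed endpoint might be called (the endpoint itself doesn't exist yet):

    from urllib.parse import quote
    import requests

    category = "Novels set in London"
    url = "http://corpus-db.org/api/category/" + quote(category)   # proposed endpoint

    # Metadata for every book in the Wikipedia category, once the endpoint exists.
    books = requests.get(url).json()

    # And the full-text counterpart.
    texts = requests.get(url + "/fulltext").json()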

Clean database

A lot of fields in the database are just string representations of Python objects (lists, dictionaries, etc.). It would help to put this into a more structured format, if possible.
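
One low-tech way to start is to parse those stringified Python objects back into real objects and re-serialize them as JSON; a sketch (which fields this applies to is left open):

    import ast
    import json

    def restructure(value):
        """Convert a stringified Python object (e.g. "['a', 'b']") into JSON text."""
        try:
            return json.dumps(ast.literal_eval(value))
        except (ValueError, SyntaxError):
            return value  # leave anything that doesn't parse untouched

    print(restructure("['London (England)', 'Fiction']"))
    # -> ["London (England)", "Fiction"]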

Max results query parameter

It would be nice to have a query parameter that caps the number of results returned, to speed up examples in the notebook.
