jonathanreeve / corpus-db
A textual corpus database for the digital humanities.
Home Page: http://corpus-db.org
License: GNU General Public License v3.0
There are just too many requests for wp-config.php: tons of hacking attempts. They need to be banned.
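One possible fix, assuming the site sits behind nginx (the server setup isn't stated in these issues), is to deny the WordPress probe paths outright:

```nginx
# Hypothetical nginx sketch: reject requests for WordPress paths
# that this (non-WordPress) site will never serve.
location ~* /wp-(config|login|admin) {
    deny all;
}
```

A fail2ban jail watching the access log for these paths would go further and ban the offending IPs entirely.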
The IDs when getting subject metadata all have .0 at the end. Seems like they should be integers, not floats.
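The real fix is to store the IDs as integers, but until then a client can normalize them. A minimal sketch (the function name is just illustrative):

```python
# Normalize an ID like "108.0" (a float, or a float-ish string)
# back to a plain int. Client-side workaround only.
def normalize_id(raw):
    return int(float(raw))

print(normalize_id("108.0"))  # 108
print(normalize_id(807.0))    # 807
```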
This is documentation for PR #30, which creates new API endpoints for compatibility with external tools.
A lot of fields in the database are just string representations of Python objects (lists, dictionaries, etc.). It would help to put this into a more structured format, if possible.
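Until the schema is restructured, consumers can decode those stringified fields safely with `ast.literal_eval`. The field contents below are made up for illustration, not the actual schema:

```python
import ast

# Fields that arrive as stringified Python objects, e.g. "['PR', 'PZ']",
# can be parsed with literal_eval, which (unlike eval) only accepts
# Python literals and so can't execute arbitrary code.
raw = "{'subjects': ['Detective and mystery stories'], 'ids': [108, 807]}"
parsed = ast.literal_eval(raw)
print(parsed["ids"])  # [108, 807]
```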
The stock Scotty favicon is 404ing with every request. There's probably an easy fix for this.
It would be worth considering other database formats for this project, since SQLite only supports certain data types. It's definitely nice to have everything in one file, though.
When getting a full text, like in this example:
http://corpus-db.org/api/id/108/fulltext
the result is wrapped in square brackets. That means that when loaded into JSON it's a list with 1 item in it (the dictionary) rather than just a dictionary. Think it would be better without the brackets so it wouldn't need to be parsed out.
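Until the brackets are removed, a client has to unwrap the single-element array itself. A minimal Python sketch (the keys here are placeholders, not the actual response schema):

```python
import json

# The /fulltext endpoint currently returns a one-element JSON array,
# so json.loads yields a list containing one dict, not the dict itself.
payload = '[{"id": "108.0", "text": "..."}]'
result = json.loads(payload)
record = result[0] if isinstance(result, list) else result
print(record["id"])  # 108.0
```

The `isinstance` guard means the same client code keeps working if the wrapping array is later dropped.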
The nice thing about using Haskell is type safety, and HDBC isn't as typesafe as a more ORM-like database layer like Persistent. Persistent would also allow us to effortlessly migrate our database, if need be.
This could probably be accomplished with a drop-in Bootstrap theme of some sort, like bootstrap-material-design. As @smythp pointed out, the current site looks very vanilla bootstrap.
When requesting the full text, in some cases only a prefix of the JSON is returned. For example:
http://corpus-db.org/api/author/Dickens,%20Charles/fulltext
Doing so does not return the full JSON file; instead, each time I visit the page I get a different prefix of it.
Wikipedia data only exists for about 1-2K of the ~45K books in PG, if I recall correctly. To figure out where there is room for improvement, it would first help to know the completeness of all the data fields. Then we can identify patterns in the books that have very little metadata.
It would be great to have a way to search by title, if that is all the info we have (i.e. don't know Gutenberg ID, author, etc.). Thanks!
https://www.sqlite.org/fts5.html
Steps are:
1. Create an FTS5 virtual table for the text column.
2. Copy the text column into the new table.
I have no idea how big this will make the database, though, since I had to kill the process on my laptop after the database more than doubled in size (>18G). I think I'll need to temporarily buy a new DigitalOcean volume and hook it up to the server in order to test this.
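The steps above can be sketched with Python's built-in sqlite3 module; the table and column names here are guesses at the real schema. Using an external-content FTS5 table (the `content=` option) indexes the text without storing a second copy of it, which may help with the size blow-up:

```python
import sqlite3

# Toy stand-in for the real corpus database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, text TEXT)")
con.execute("INSERT INTO books VALUES (1, 'It was a dark and stormy night.')")
con.execute("INSERT INTO books VALUES (2, 'Call me Ishmael.')")

# 1. Create the FTS5 virtual table (external content: no duplicate text).
con.execute("CREATE VIRTUAL TABLE books_fts USING fts5("
            "text, content='books', content_rowid='id')")
# 2. Copy the text column into the new table to build the index.
con.execute("INSERT INTO books_fts(rowid, text) SELECT id, text FROM books")

rows = con.execute(
    "SELECT rowid FROM books_fts WHERE books_fts MATCH 'ishmael'").fetchall()
print(rows)  # [(2,)]
```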
Think it will prevent some trouble if you add a version number to your URIs:
/api/v0.1/subjects/detectivefiction
Interesting discussion here, though some of it is probably overkill for this:
https://stackoverflow.com/questions/389169/best-practices-for-api-versioning
More comparisons between single-author corpora would be interesting, as well as more metadata analyses. Example text analyses in languages other than Python might be interesting, too. How about some R analyses? Haskell, even. Jupyter now supports all of these languages, so they could all be in Jupyter notebooks, and thus displayable on GitHub.
This would, in the beginning, just be a list of URL patterns and descriptions, like:
This could eventually become more formal API documentation.
Would be nice to have a query parameter that limits the number of results returned, to speed up examples in the notebook.
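A hypothetical `?limit=n` parameter (not currently in the API) could work like this sketch, which just parses the query string and truncates the result list; the endpoint path below is illustrative:

```python
from urllib.parse import urlparse, parse_qs

# Apply an optional ?limit=n parameter to a list of results.
def apply_limit(url, results):
    qs = parse_qs(urlparse(url).query)
    limit = int(qs["limit"][0]) if "limit" in qs else len(results)
    return results[:limit]

books = [{"id": i} for i in range(100)]
trimmed = apply_limit("/api/subject/Detective fiction?limit=10", books)
print(len(trimmed))  # 10
```

On the server side the same idea would translate into a LIMIT clause on the SQL query, so the database never materializes the extra rows at all.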
It'd be great to do this in Scotty, so that it'll be the same codebase and language as the API.
This would allow downloading books with the Wikipedia category "Novels set in London," for instance:
corpus-db.org/api/category/Novels set in London
And its full-text counterpart:
corpus-db.org/api/category/Novels set in London/fulltext
According to what I've seen/read, typically the endpoints stay plural, like
baseurl + "/api/subjects/detectivefiction"
In terms of meaning, the outer segment, subjects, is the collection that includes all the subjects; the next segment, detectivefiction, is the singular member.
If the API was like:
baseurl + '/api?subject="detectivefiction"&output=json'
then that makes sense because subject isn't the category above, it's the key to which detectivefiction is the value.
Also it's confusing if "subjects" lists all the subjects and then that's not continued when getting specific subjects.
Having the ability to generate single-author subcorpora would be a fun first feature to have.
I've started looking into this. Still haven't decided on an XML parsing solution, so I posted on StackOverflow about it.
Full-text searches take a really long time, but they return results iteratively through the sqlite interface. Speeding them up could probably be achieved by treating the results of a database query more like a stream. Haskell might already be doing some kind of streaming. It might be good to investigate this a little.
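The Haskell side would presumably use a streaming library like conduit for this, but the idea can be illustrated with Python's sqlite3, whose cursors already yield rows lazily:

```python
import sqlite3

# Streaming vs. batch: iterating a cursor yields rows as the scan runs,
# so the first match is available long before a fetchall() would finish.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE texts (id INTEGER, body TEXT)")
con.executemany("INSERT INTO texts VALUES (?, ?)",
                [(i, "word %d" % i) for i in range(1000)])

first_hit = None
for row in con.execute("SELECT id FROM texts WHERE body LIKE '%word 5%'"):
    first_hit = row  # each row could be emitted to the client immediately
    break            # stop early instead of materializing every match
print(first_hit)  # (5,)
```

The API could forward rows to the HTTP response as they arrive in the same way, instead of buffering the whole result set first.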
I've already started doing this on the develop branch.
This'll just make it easier to switch between port 80 and 8000 (cloud/local), and between database paths.
Thinking about semantics again. In this example:
http://corpus-db.org/api/id/807/fulltext
I think this has some issues. As a user, if I see the endpoint for getting some specific data attached to a resource, I should a) also know how to get the resource metadata and b) see how to get the metadata for all resources of that type.
If the endpoint was something like:
http://corpus-db.org/api/v1/books/id/10?fulltext=true
or
http://corpus-db.org/api/v1/books?fulltext=true&id=101
you could infer that
http://corpus-db.org/api/v1/books
gives you all the metadata for the books resource and
http://corpus-db.org/api/v1/books?id=101
gives you metadata, probably with an excerpt, without the full text. Also, if you wind up adding more resources, you've painted yourself into a corner with the approach that only has the id. Imagine if you want to do this:
http://corpus-db.org/api/v1/author?name="Margaret+Atwood"
http://corpus-db.org/api/v1/author?id=391
that doesn't really jibe with the format you've established for books, making books a special case rather than a template that lets you infer how the rest of the system works. I think consistency here makes the whole API a lot more usable and intuitive.
Hey, I got item #807 back as a detective novel when querying for metadata, but it returns an empty list when I request the full text.
The server has been saying there have been a lot of requests for robots.txt. Better make something that tells bots where the content pages are (and to stay away from the API).
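A minimal robots.txt along those lines might look like this; the paths are assumptions based on the endpoints mentioned elsewhere in these issues:

```
# Sketch: let crawlers index the content pages, keep them off the API.
User-agent: *
Disallow: /api/
Allow: /
```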
It'd be nice to have the whole setup procedure (stack build, etc.) containerized with Docker. This would save time when migrating to a new server. I'm not quite sure how data volumes would work, though. Low-priority for the moment, since this is a DevOps issue.
It seems like the usual way of writing Haskell documentation is Haddock, which generates documentation automatically from code comments.
This isn't necessary, but it would be nice, since that way we'd have everything in one place (without having to update the docs page every time there's an API change).
There's already a Fuseki docker container on DockerHub. I'd like to get this to import Project Gutenberg metadata automatically.
I imagine this would be:
This will facilitate collaboration and building of the system for those that don't need or want the full 16GB database.
I'm guessing the process will be: