cltk / cltk_api

RESTful API for the CLTK

License: MIT License

Python 100.00%

cltk_api's Introduction

Notice

The Classics Archive application is currently under active development and is not ready for production.

About

A simple Flask app for accessing texts from the CLTK corpora. Currently under development.

To run with gunicorn:

$ gunicorn -w 4 -b 0.0.0.0:5000 api_json:app

Development

To get started developing, you'll need Python 3.5 and MongoDB installed.

Create a virtual environment and activate it:

$ pyvenv venv
$ source venv/bin/activate

Install dependencies:

$ pip install -r requirements.txt

Finally, start the app with the following command:

$ python api_json.py

cltk_api's Issues

Get Perseus definitions

Cross-referencing Issue cltk/cltk_frontend#60 by @lukehollis.

Strategies for parsing / ingesting corpora - coptic

Added very basic text parser for the TEI from the coptic_text corpus. Available here: https://github.com/cltk/cltk_api/blob/ingest/ingest/learn/coptic_text.py

Add ability to identify and respond with a list of entities given an input string of a classical text

For a given input string, we need to identify and respond with a list of entities (perhaps also their positions in the string).

A working function:

Accepts an input string
Identifies named entities in the input string
Associates named entities to some external data sources (maybe wikipedia or VIAF?)
Returns a list of entities serialized as JSON

We have a little bit of work done here that I used on an earlier project (https://github.com/cltk/cltk_api/tree/master/metadata/entities). We should keep only what is useful and delete everything else. I will go through the existing files, remove what is not applicable, and document what is.
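
The four steps listed in this issue can be sketched as follows. This is only an illustration, not the code in metadata/entities: the KNOWN_ENTITIES table and the find_entities name are hypothetical, and the lookup is an exact string match, so inflected forms such as "Caesaris" are not recognized.

```python
import json
import re

# Hypothetical lookup table (an assumption for this sketch): entity name -> external reference.
# A real version might query Wikipedia or VIAF instead.
KNOWN_ENTITIES = {
    "Caesar": "https://en.wikipedia.org/wiki/Julius_Caesar",
    "Gallia": "https://en.wikipedia.org/wiki/Gaul",
}


def find_entities(text, known=KNOWN_ENTITIES):
    """Return a JSON string listing the known entities found in `text`.

    Each entry carries the matched word, its character offsets, and the
    external reference taken from the lookup table.
    """
    entities = []
    for match in re.finditer(r"\w+", text):
        word = match.group()
        if word in known:
            entities.append({
                "entity": word,
                "start": match.start(),
                "end": match.end(),
                "reference": known[word],
            })
    return json.dumps({"entities": entities}, ensure_ascii=False)


if __name__ == "__main__":
    print(find_entities("Gallia est omnis divisa in partes tres"))
```

Returning character offsets alongside each entity lets a frontend highlight the match without tokenizing the text again.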

Add SSL to backend API

I've been seeing a lot about Let's Encrypt from the EFF. I think this would be a great way to go.

An easy tutorial I'll probably follow: https://www.digitalocean.com/community/tutorials/how-to-secure-nginx-with-let-s-encrypt-on-ubuntu-14-04

Re-add Perseus lemma and dictionary files

At some point in the past, I removed the lemmata and analyses files from cltk/latin_lexica_perseus. They need to be re-added. See greek_lexica_perseus for examples.

If any of the files exceed GitHub's recommended maximum of 50 MB, you'll need to split them and name the parts _1, _2, etc.
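
A minimal sketch of such a split, assuming the files are line-oriented text and that the parts should be named by appending _1, _2, ... to the file stem (the exact naming and the split_file helper are assumptions, not an existing CLTK utility):

```python
from pathlib import Path

LIMIT_BYTES = 50 * 1024 * 1024  # the 50 MB threshold mentioned in the issue


def split_file(path, limit=LIMIT_BYTES):
    """Split `path` on line boundaries into <stem>_1<suffix>, <stem>_2<suffix>, ...

    Each part holds at most `limit` bytes, except when a single line is
    longer than `limit`, in which case that line gets a part of its own.
    Returns the list of part paths in order.
    """
    path = Path(path)
    parts = []
    out, written = None, 0
    with path.open("rb") as source:
        for line in source:
            # Start a new part when there is none yet or the current one is full.
            if out is None or written + len(line) > limit:
                if out is not None:
                    out.close()
                part = path.with_name(f"{path.stem}_{len(parts) + 1}{path.suffix}")
                parts.append(part)
                out, written = part.open("wb"), 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return parts
```

Splitting on line boundaries keeps every lemma or analysis entry intact, so the parts can be read one after another without reassembling them first.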

Strategies for parsing / ingesting corpora - latin_text_latin_library

The beginnings of a solution are in https://github.com/cltk/cltk_api/blob/ingest/ingest/learn/latin_library.py, but we've discussed the difficulty of incorporating TLL files here, and at this stage the added benefit is very likely outweighed by the effort of programmatically parsing or inferring useful metadata.

Write tests for cltk_api

With all the new developers, having unit tests is increasingly important.

@modassir, would you be interested in taking this task? Once it is finished, I would like to set up a build server on Travis CI (as we do for the core software).

Write Vagrant bootstrap.sh for CLTK and CLTK API

Building upon Manvendra's script to automate Nginx: https://gist.github.com/manu-chroma/4a6f3b6b27aa49683c67b9fb0b23d493

Let's add the basic setup, too:

Make user cltk with sudo privileges
Create a Python 3.5 virtualenv venv in cltk's home dir
Install cltk and use it to import the Perseus corpora (https://github.com/cltk/latin_text_perseus ; https://github.com/cltk/greek_text_perseus )
Clone cltk_api and install its requirements
Launch cltk_api as a service

Relevant Vagrant example here: https://www.vagrantup.com/docs/getting-started/provisioning.html

Please reach out about anything you get stuck on; we're here to help.

Stabilize and catalog all document formats (chapter, book-chapter, book-chapter-section, etc.)

In order to sync data from the text server to the Meteor application's database, we need to better define the different document formats. What makes a document "chapter" v. "book-chapter", etc.?

It only really matters for the API to understand how many levels of nested content there are. We can build it to be flexible if each level of nested content above the actual string of text (book, chapter, poem, etc.) has a similar structure.
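
As a rough illustration of the "how many levels" idea, the sketch below assumes each document is a nested dict whose leaves are the text strings. That layout and the nesting_depth helper are assumptions for the example, not the settled CLTK JSON format.

```python
def nesting_depth(node):
    """Count the container levels that sit above the text strings.

    A bare string has depth 0; a dict of strings (e.g. chapters) has
    depth 1; a dict of dicts of strings (e.g. book-chapter) has depth 2.
    """
    if isinstance(node, str):
        return 0
    # default=0 keeps an empty container from raising ValueError.
    return 1 + max((nesting_depth(child) for child in node.values()), default=0)


if __name__ == "__main__":
    chapter_doc = {"1": "Text of chapter one", "2": "Text of chapter two"}
    book_chapter_doc = {"1": {"1": "Book one, chapter one"}}
    print(nesting_depth(chapter_doc))       # one level above the text
    print(nesting_depth(book_chapter_doc))  # two levels above the text
```

If every level has the same shape, the API only needs this depth (plus the level names) to walk any document, whatever its format is called.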

Add converter.py and converted cltk_json to all CLTK corpora _text_ repos

All CLTK corpora text repos need a converter.py and the subsequent converted cltk_json dir with json files that are produced by converter.py.

One example of a converter.py is here: https://github.com/cltk/chinese_text_cbeta_01/blob/master/converter.py

A simple search on the cltk github page will yield all CLTK text corpora: https://github.com/cltk?utf8=%E2%9C%93&q=text&type=&language=

Here is a checklist of all CLTK text repos with their conversion status:

Metadata snippets for Perseus text files

@christophermorse has converted the XML to JSON. He will get us metadata snippets, from this JSON, for each author's text.

Converting TLG texts with TLGU issue

I have an issue trying to convert the TLG texts with TLGU. I use the code exactly as it is given in the instruction:

In [1]: from cltk.corpus.greek.tlgu import TLGU

In [2]: t = TLGU()

In [3]: t.convert_corpus(corpus='tlg')

I use Python 3.4.3 on Cygwin. When I enter [3], it does not do anything except start a newline waiting for me to enter another command. I checked and there is no file output.

By the way, I cannot find PHI7 in my CLTK folders either.

Strategies for parsing / ingesting corpora - latin_text_lacus_curtius

Parse greek_text_perseus XML files to CLTK JSON data format

Following from the Latin job at #11.

Continue parsing latin_text_perseus XML files to CLTK JSON data format

We have begun converting the XML files of the latin_text_perseus corpus to the JSON data format that the API will offer to the reading environment. The files that are already converted are available here: https://github.com/cltk/latin_text_perseus/tree/master/json

As we batch process the conversion, the resulting JSON files should be added to the latin_text_perseus repo's /json directory.

Determine final angular/blaze breakdown

Expose Prosody core modules via the API

Return prosody data from the scanners in the CLTK core: https://github.com/cltk/cltk/tree/master/cltk/prosody

Dynamically open Perseus text files

Open each file according to its particular citation metadata, e.g., Aeneid: book/line; Catullus: poem/line; Tacitus: book/section/subsection.

I'll do this crudely at first, then use what @christophermorse produces out of #1.
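
One crude way to do this, sketched under the assumption that each text is stored as nested dicts keyed by the numbers at each level. CITATION_SCHEMES and get_passage are hypothetical names; only the three schemes themselves come from this issue.

```python
# Citation levels per work, taken from the examples in this issue.
CITATION_SCHEMES = {
    "Aeneid": ("book", "line"),
    "Catullus": ("poem", "line"),
    "Tacitus": ("book", "section", "subsection"),
}


def get_passage(work, text, **reference):
    """Walk `text` level by level using the citation scheme of `work`.

    Levels missing from `reference` stop the walk early, so asking for
    book=1 alone returns the whole of book 1.
    """
    unknown = set(reference) - set(CITATION_SCHEMES[work])
    if unknown:
        raise KeyError(f"{work} has no levels named {sorted(unknown)}")
    node = text
    for level in CITATION_SCHEMES[work]:
        if level not in reference:
            break
        node = node[str(reference[level])]
    return node


if __name__ == "__main__":
    aeneid = {"1": {"1": "Arma virumque cano, Troiae qui primus ab oris"}}
    print(get_passage("Aeneid", aeneid, book=1, line=1))
```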

Add cltk_api to Travis CI build server

To be done once #15 is at least minimally accomplished.

Add route for accessing CLTK POS tagger

Add route for accessing CLTK stemmer

To get started with adding routes for accessing the CLTK core modules, an easy first step seems to be to add a route which we can send a GET request to with an input string of Latin words and receive a stemmed string of Latin words in response. I'll take a look at adding this unless anyone else is interested in working on it.
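
A minimal sketch of such a route, written as a plain WSGI callable so it runs without extra packages; in the real app the same logic would sit inside a Flask view. The query parameter name "string", the response keys, and toy_stem (a crude stand-in for the CLTK Latin stemmer) are all assumptions made for this example.

```python
import json
from urllib.parse import parse_qs

# Endings removed by the stand-in stemmer, checked in this order.
SUFFIXES = ("que", "ibus", "orum", "arum", "us", "um", "am", "ae", "is")


def toy_stem(word):
    """Hypothetical stand-in for the CLTK stemmer: strip at most one ending."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def stem_app(environ, start_response):
    """Handle GET ?string=<latin words> and answer with the stemmed string as JSON."""
    params = parse_qs(environ.get("QUERY_STRING", ""))
    text = params.get("string", [""])[0]
    stemmed = " ".join(toy_stem(word) for word in text.lower().split())
    body = json.dumps({"input": text, "stemmed": stemmed}).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

Because stem_app is an ordinary WSGI application, it could be served the same way as the Flask app, for example with gunicorn pointed at the module that defines it.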

Add "source" field to JSON

To our ingested JSON documents, add "source" and "license" to the text output.

Port old Segetes materials as are useful to the project to the api module for mining metadata

Add route for accessing CLTK tokenizer

Reference #20 for the original conversation about this.

Version API resources

Seems like this might be handled best through Accept Headers, though many solutions will work here.
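
For illustration, version negotiation through the Accept header could look like the sketch below. The vendor media type application/vnd.cltk.v<N>+json is an assumption invented for this example; the project has not defined one.

```python
import re

# Hypothetical vendor media type: application/vnd.cltk.v2+json selects version 2.
VERSION_PATTERN = re.compile(r"application/vnd\.cltk\.v(\d+)\+json")


def requested_version(accept_header, default=1):
    """Return the API version named in the Accept header, or `default`."""
    match = VERSION_PATTERN.search(accept_header or "")
    return int(match.group(1)) if match else default


if __name__ == "__main__":
    print(requested_version("application/vnd.cltk.v2+json"))
    print(requested_version("application/json"))
```

Clients that send a plain application/json header keep working and receive the default version, which is the main attraction of this approach over putting the version in the URL.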

route for accessing NER

Renaming API name gives error with gunicorn command

After setting up cltk_api, I was able to run the API with python api.py, but not with gunicorn -w 4 -b 0.0.0.0:5000 api:app (error in the screenshot).
I changed the file back to its old name, api_json.py, ran gunicorn -w 4 -b 0.0.0.0:5000 api_json:app, and it worked.
I think the existing api/ folder in the same directory conflicts with a module named api.py when the gunicorn command is run.
Can someone else please verify the same error on their system?
P.S.: The API file was renamed two commits back, in 2bcc0da.
Terminal Output:

OS: Ubuntu 15.10 (64-bit)
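
The suspected conflict can be reproduced in isolation. The sketch below assumes the api/ folder is a regular package (it contains an __init__.py); in that case Python's import system prefers the package over api.py in the same directory, so api:app would point at a module with no app attribute. The file contents are placeholders, not the project's code.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def resolve_api_module():
    """Create api.py next to an api/ package and report which one `import api` finds.

    Returns (file name of the imported module, whether it has an `app` attribute).
    """
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "api.py").write_text("app = 'placeholder for the Flask app'\n")
        (root / "api").mkdir()
        (root / "api" / "__init__.py").write_text("")
        sys.path.insert(0, tmp)
        try:
            sys.modules.pop("api", None)  # forget any previously imported `api`
            importlib.invalidate_caches()
            module = importlib.import_module("api")
            return Path(module.__file__).name, hasattr(module, "app")
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("api", None)


if __name__ == "__main__":
    print(resolve_api_module())
```

Running python api.py is unaffected because it executes the file directly rather than importing it by name, which would explain why only the gunicorn command failed.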

REST API Design

This issue is to brainstorm the design of the API endpoints and responses. I'll start with a couple of points on shorter URLs, HATEOAS and the folder structure.

Shorter URLs

I propose maintaining numeric IDs for each author, corpus, text, etc. and using those to construct the REST endpoints.

So, for example, endpoint GET /lang/latin/corpus/perseus/author/tacitus/text/germania becomes GET /lang/latin/corpus/1/author/6/text/8.

This keeps the URLs short while allowing the actual names that the IDs map to to be as long as needed.

A problem with this (assuming an external API consumer) is figuring out the ID of a specific author/corpus/text.
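
A small sketch of the name-to-ID mapping, using the IDs from the example above. The in-memory dicts and the short_url helper are assumptions for illustration; in practice the IDs would presumably live in the database.

```python
# Hypothetical ID tables, filled with the IDs used in the example endpoint.
CORPUS_IDS = {"perseus": 1}
AUTHOR_IDS = {"tacitus": 6}
TEXT_IDS = {"germania": 8}


def short_url(lang, corpus, author, text):
    """Build the numeric endpoint for a corpus/author/text given by name."""
    return (
        f"/lang/{lang}"
        f"/corpus/{CORPUS_IDS[corpus]}"
        f"/author/{AUTHOR_IDS[author]}"
        f"/text/{TEXT_IDS[text]}"
    )


if __name__ == "__main__":
    print(short_url("latin", "perseus", "tacitus", "germania"))
```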

API Discoverability

The formal term for this is HATEOAS (Hypermedia as the Engine of Application State): a user should be able to browse and discover all the endpoints of the REST API using the REST API itself.

Towards this, we should define endpoints like GET /lang/latin/corpus/ that returns a response:

{"corpora": [ {"name": "perseus", "id": "1"}, ... ]}

This way, the user will be able to query for all the available corpora and figure out the ID.

Another example of this is from my POS tagger implementation. It is possible to view the list of languages and POS tagging methods they support via GET /core/pos, and perform the actual POS tagging for a string via POST /core/pos.

In general, adding a GET request handler to endpoints like /lang, /lang/<int:lang_id>/corpus, etc. should make the API discoverable.
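
A sketch of what such a listing handler might return. The CORPORA table and the href field are assumptions added for this example; the "corpora", "name", and "id" keys follow the response shown above.

```python
import json

# Hypothetical registry of corpora per language.
CORPORA = {
    "latin": [{"name": "perseus", "id": "1"}],
}


def list_corpora(lang):
    """Build the JSON body for GET /lang/<lang>/corpus/.

    Each entry repeats its name and ID and adds an `href` pointing at the
    next endpoint, so a client can follow links instead of guessing IDs.
    """
    items = [
        dict(corpus, href=f"/lang/{lang}/corpus/{corpus['id']}")
        for corpus in CORPORA.get(lang, [])
    ]
    return json.dumps({"corpora": items})


if __name__ == "__main__":
    print(list_corpora("latin"))
```

Including the link in each entry addresses the ID lookup problem raised under "Shorter URLs": a consumer starts at the listing and never has to know the numeric IDs in advance.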

Folder Structure

Right now all the resources are defined in a single file (api_json.py), and so are tests (tests.py). There is also no distinction between files containing utility functions and actual REST resources.

I briefly mentioned this in my #20 (comment).

An example of my proposed organisation is in #27. Inside the folder for a specific function (/pos), the resources will be in views.py, the database stuff (if any) in models.py, utility functions in utils.py and parameters in constants.py.

(It may be better to keep constants.py at the root of the API folder structure, so that it is easy to find and change.)