ericleasemorgan / reader

Distant Reader, a tool for using & understanding a corpus

License: GNU General Public License v2.0

Shell 24.42% Perl 18.53% Python 14.75% HTML 18.22% CSS 1.80% JavaScript 20.92% Raku 1.35%

distant-reading text-mining natural-language-processing hpc-systems

reader's Introduction

Distant Reader CORD

The Distant Reader CORD is a high performance computing (HPC) system which: 1) takes an almost arbitrary amount of unstructured data (text) as input and outputs a set of structured data for analysis, and 2) does this work against a specific data set called CORD-19. (Reader CORD is based on a different software suite called Distant Reader Classic which is designed for more generic sets of input.)

To do this work, the Distant Reader CORD first caches the data set. It then transforms the content into a set of plain text files. Third, the Reader does text mining and natural language processing against the text files for the purpose of feature extraction: n-grams, parts-of-speech, named-entities, etc. The result of this process is a set of tab-delimited text files. The whole of the tab-delimited text files is then distilled into a relational database. A set of tabular and narrative reports is then generated against the database. The cache, transformed plain text files, tab-delimited files, relational database, and reports are then compressed into a single (zip) file, and returned to the... reader. [1]
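The flow described above can be sketched in miniature. This is only an illustration and not the Reader's actual code: it assumes word bigrams as the sole extracted feature, an in-memory SQLite database, and hypothetical names throughout (extract_bigrams, bigrams_to_tsv, distill, build_carrel, and the txt/ and ngrams/ paths inside the zip file).

```python
import csv
import io
import re
import sqlite3
import zipfile
from collections import Counter


def extract_bigrams(text):
    """Return a Counter of lowercase word bigrams found in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(zip(words, words[1:]))


def bigrams_to_tsv(doc_id, bigrams):
    """Serialize bigram counts as tab-delimited rows: id, bigram, count."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for (first, second), count in sorted(bigrams.items()):
        writer.writerow([doc_id, f"{first} {second}", count])
    return buffer.getvalue()


def distill(tsv_texts):
    """Load tab-delimited rows into an in-memory SQLite database."""
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE bigrams (id TEXT, bigram TEXT, frequency INTEGER)")
    for tsv in tsv_texts:
        rows = csv.reader(io.StringIO(tsv), delimiter="\t")
        connection.executemany("INSERT INTO bigrams VALUES (?, ?, ?)", rows)
    connection.commit()
    return connection


def build_carrel(documents):
    """Run the toy pipeline and return the bytes of a zip file (the 'carrel')."""
    # feature extraction, saved as tab-delimited text (one file per document)
    tsvs = {doc_id: bigrams_to_tsv(doc_id, extract_bigrams(text))
            for doc_id, text in documents.items()}
    # distill the tab-delimited files into a database, then report against it
    connection = distill(tsvs.values())
    report = connection.execute(
        "SELECT bigram, SUM(frequency) AS total FROM bigrams "
        "GROUP BY bigram ORDER BY total DESC, bigram LIMIT 5").fetchall()
    connection.close()
    # compress the plain text, the tab-delimited files, and the report
    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as carrel:
        for doc_id, text in documents.items():
            carrel.writestr(f"txt/{doc_id}.txt", text)
            carrel.writestr(f"ngrams/{doc_id}.tsv", tsvs[doc_id])
        carrel.writestr("report.txt",
                        "\n".join(f"{b}\t{t}" for b, t in report))
    return archive.getvalue()
```

Calling build_carrel({"a": "the cat sat on the mat", "b": "the cat ran"}) returns the bytes of a small zip file holding the plain text, one tab-delimited file per document, and a short report. The real system extracts many more features and also keeps the cache and the database itself in the carrel.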

The returned file is affectionately called a "study carrel". The student, researcher, or scholar is intended to peruse the study carrel for the purpose of supplementing the more traditional reading process. For more detail, links of possible interest include:

home page - https://cord.distantreader.org
fledgling study carrels - https://cord.distantreader.org/carrels/
Guide to the code - GUIDE.md
blog postings - http://sites.nd.edu/emorgan/category/distant-reader/
Slack channel - http://bit.ly/distantreader-on-slack
Twitter feed - http://twitter.com/readerdistant

As an HPC system, the Distant Reader CORD is not a single computer program but instead a suite of software comprised of many individual scripts and applications. Personally, I see the scripts and applications as akin to a collection of poems used to make the output of human expression more cogent. Really. Seriously.

As a collection of scripts and applications, the Distant Reader has only been built by "standing on the shoulders of giants". Cited here in no particular order nor necessarily complete, they include these below and more:

the Perl-based LWP modules - this software is a significant part of the harvesting process
Wget - an absolutely wonderful Internet spidering application
Tika - a Java-based library which transforms just about any file into plain text
spaCy - a Python module which simplifies natural language processing operations
Gensim - another Python module for natural language processing
Textacy - a Python module building on the good work of spaCy
SQLite - a cross-platform, SQL-compliant relational database library/application
OpenStack - a tool for building virtual machines
Slurm - a tool for instantiating a cluster of computer nodes and what runs on them
Airavata - a Web-based suite of software used to monitor computing jobs on a cluster
Other Python Libraries - sqlalchemy, pandas, itertools, wordcloud, scipy, sklearn, networkx, textatistic, nltk
Other Perl Modules - DBI, JSON, Archive::Zip, WebService::Solr, XML::XPath, CGI, File::Basename, File::Copy, HTML::Entities, HTML::Escape
JavaScript Libraries - bootstrap, jquery
Other Programs - csvstack

If you have any questions, then please don't hesitate to ask.

"Happy reading!"

[1] Just like GNU, the Distant Reader's definition is rather recursive.

Eric Lease Morgan <[email protected]>
Navari Family Center for Digital Scholarship
Hesburgh Libraries
University of Notre Dame
574/631-8604

Created: June 28, 2018
Updated: May 31, 2020

cord-19

This suite of software will prepare a data set called "CORD-19" for processing with the Distant Reader.

CORD-19 is a set of more than 50,000 full text scholarly journal articles surrounding the topic of COVID-19. Each "article" is really a JSON file containing (very) rudimentary bibliographic information, a set of paragraphs, and bibliographic citations. As a pre-processing step for the Distant Reader, the suite processes the CORD-19 metadata and its associated JSON files.

To get this software to work for you, pip install -r requirements.txt, configure ./bin/cache.sh, and then run ./bin/build.sh. The system will then:

download a zip file and its associated metadata file
uncompress the zip file
move all the JSON files to a single directory
initialize a database
pour the metadata into the database
output a simple narrative report summarizing the content of the metadata file

Depending on the network connection, the build process takes less than 7 minutes.
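For readers who want a feel for what the build does, here is a minimal sketch of the local steps (everything after the download). It is not the content of ./bin/build.sh. The function name build, the documents table, and the cord.db file name are hypothetical, and the sketch assumes the metadata file is a CSV with cord_uid and title columns.

```python
import csv
import shutil
import sqlite3
import zipfile
from pathlib import Path


def build(zip_path, metadata_path, work_dir):
    """Unzip, flatten the JSON files, load the metadata, and return a report."""
    work_dir = Path(work_dir)
    extracted = work_dir / "extracted"
    json_dir = work_dir / "json"
    json_dir.mkdir(parents=True, exist_ok=True)

    # uncompress the zip file
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(extracted)

    # move all the JSON files to a single directory
    # (files sharing a name would overwrite one another in this sketch)
    for source in sorted(extracted.rglob("*.json")):
        shutil.move(str(source), str(json_dir / source.name))

    # initialize a database; the table layout is an assumption
    connection = sqlite3.connect(str(work_dir / "cord.db"))
    connection.execute(
        "CREATE TABLE IF NOT EXISTS documents "
        "(cord_uid TEXT PRIMARY KEY, title TEXT)")

    # pour the metadata into the database
    with open(metadata_path, newline="", encoding="utf-8") as handle:
        rows = [(r["cord_uid"], r["title"]) for r in csv.DictReader(handle)]
    connection.executemany(
        "INSERT OR REPLACE INTO documents VALUES (?, ?)", rows)
    connection.commit()

    # output a simple narrative report
    count = connection.execute(
        "SELECT COUNT(*) FROM documents").fetchone()[0]
    connection.close()
    files = len(list(json_dir.glob("*.json")))
    return f"{count} metadata records; {files} JSON files"
```

Calling build("data.zip", "metadata.csv", "work") with local copies of the two files returns a one-line summary. The actual report produced by the suite is richer than this.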

The next steps are the creation of two scripts:

Given an SQL SELECT statement, return a list of keys, and use them to initialize a Distant Reader study carrel
Given a JSON file, output a more human-readable version of the same
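Neither script exists yet, so the following is only a sketch of what they might look like. The function names select_keys and json_to_text are hypothetical. The second function assumes each record is a dict with a metadata.title string and a body_text list whose items carry section and text fields.

```python
import json
import sqlite3


def select_keys(connection, sql):
    """Run a SELECT statement and return the first column of each row."""
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT statements are accepted")
    return [row[0] for row in connection.execute(sql)]


def json_to_text(record):
    """Render a CORD-19-style record (a dict) as human-readable text."""
    lines = [record.get("metadata", {}).get("title", "(untitled)"), ""]
    section = None
    for paragraph in record.get("body_text", []):
        # print a heading whenever the section name changes
        if paragraph.get("section") != section:
            section = paragraph.get("section")
            if section:
                lines += [section.upper(), ""]
        lines += [paragraph.get("text", ""), ""]
    return "\n".join(lines).rstrip() + "\n"


def json_file_to_text(path):
    """Read a JSON file from disk and return its human-readable version."""
    with open(path, encoding="utf-8") as handle:
        return json_to_text(json.load(handle))
```

The keys returned by select_keys could then be used to pick the JSON files that seed a study carrel. How a carrel is actually initialized from them is left open here.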

Wish us luck.

Eric Lease Morgan <[email protected]>
May 14, 2020

reader's People

Forkers

aculich jimfhahn mcarro10 dbrower archaeocharlie chengjiali

reader's Issues

On carrel pages ADD carrel credit/creator info & preferred citation

On carrel pages ADD carrel credit/creator info & citation and preferred citation, potentially as footer: e.g.
Meyers, N. (2020, March 26) “Distant Reading the US Academic Library Response to COVID19” URL=https://carrels.distantreader.org/library/wolff/index.htm

related to #32

loop through the database to create a full text index of all of CORD-19

relies on #15

update "Cite Us" page

Eric,

we need to create a new citation for distant reader. The easiest way to do this is for you to create a https://zenodo.org account and link the ericleasemorgan/reader repo to that. You can log into zenodo with your github credentials.

-Dave

create a database node for enhanced Distant Reader backend, and initialize the database

download CORD-19 datasets as well as the corresponding metadata file; select subset of CORD-19, save the metadata in the database, transform the JSON into plain text, and save the plain text to the file system

Add Creator ORCID to Carrel metadata & display

FAIRify carrels by adding a way to prompt for, keep & display the Carrel creator’s ORCID.
Related to #32

document (and publish) our experience

Please add a License

Eric,
From Natalie:
9. Make sure distant reader has a license on github; if it doesn’t have one, assign one. Why? This is an essential step to making Distant reader FAIR - making sure it has a machine findable accessible license attached to the code. We use Apache V2 for Presqt; use a copy of that file if you can’t decide what to use. https://github.com/ndlib/presqt/blob/master/LICENSE

Dave

get access to our cluster

Jiali, please send your SSH public key to Eric C., and he will grant you access to our cluster. Eric C., please do the good voodoo you do.

Update the “about” link at the bottom of carrels pages https://carrels.distantreader.org/ to point to the citeus page or the github readme

double check iconography at bottom of distantreader.org

Eric,

Could you double check which funding organizations should be on the bottom of distantreader.org? Anyone who has made a financial or time commitment should be added. For instance, jetstream should be added.

Dave

add CDS branding to distantreader.org

Eric, in the top left of distantreader.org you have the ND logo. Would it be possible to put the Navari Center Digital Scholarship logo next to it?

I should note that I don't actually think that the CDS has a logo, but the cds.library.nd.edu header "NAVARI FAMILY CENTER for DIGITAL SCHOLARSHIP" can be recreated exactly with the following css:

font-family: GPCBook,"HelveticaNeue",Helvetica,Arial,Verdana,sans-serif;
text-shadow: 1px 1px 1px rgba(0,0,0,.5);
text-rendering: optimizeLegibility;
font-style: normal;
direction: ltr;
font-weight: normal;

That first font family is GalaxiePolarisCondensed-Book and would have to be imported.

Create tickets for all this stuff

Create tickets for all this stuff below ( https://docs.google.com/document/d/1WZR1x1vUCYffXNlcs7Oe7q2UTfmFVM2zw3bnXD60MVc/edit ) in https://github.com/ericleasemorgan/reader/issues

Add provenance/bibliography for carrel contents

Add a way (maybe from reading the content of the input file or job submission info?) to 1) give credit and 2) show citation(s) and/or links for carrel source(s), including their URL targets if they are URLs. What would a “bibliography” of a carrel’s corpora type page look like? Can we build one from job input data alone?

In other words, we need a way to cite/show what each distantreader job indexed so people can know and go visit whatever content was the input for a carrel. There should possibly be a carrel metadata page (about this carrel?). Possibly we would want to point to the metadata/about page from a Cite-As feature on the carrel pages menu? #32

Auto Release Live Branch to Server.

Eric,

The live branch of the reader repo should automatically be released to the server. I can show you how to do this with github "Actions"; it should save a bit of time and should be relatively painless. Similarly, we should add some testing.

Dave

introduce yourself to the reader

Please use the attached file as a sort of tutorial, and through the process I think (hope) you will be introduced to the Distant Reader process. Please tell me whether or not the process was successful.

tutorial-01.txt

store extracted features into the database system

relies on #12

store extracted features into the database system

General UX feature upgrades

See https://carrels.distantreader.org/library/alcott-louisa/ for an example of a carrel page. Then, on all carrel pages, change so the carrel title appears on all pages, e.g. “Carrel Title: Basic Reports” not just “Basic Reports”, so users always know what carrel they’re in. #31 #48
On carrel pages ADD display of the Carrel’s title #48
On carrel pages ADD carrel credit/creator info & citation and preferred citation, potentially as footer: e.g. Meyers, N. (2020, March 26) “Distant Reading the US Academic Library Response to COVID19” URL=https://carrels.distantreader.org/library/wolff/index.htm #49
Add a cite-as menu option to carrel pages that you can use to view and edit the carrel citation or a carrel metadata page (about this carrel?). Point to it from Cite-As on the carrel pages menu? #49
Add a way (maybe from reading the content of the input file or job submission info?) to give credit and show citation(s) and/or links for carrel source(s), including their URL targets if they are URLs. What would a “bibliography” of my carrel’s corpora type page look like? Can we build one from job input data alone? #50
Implement any remaining suggestions in ux-report.pdf #53
Implement any remaining suggestions in heuristic_Evaluation_DR(1).pdf #54
FAIRify carrels by adding a way to prompt for, keep & display the Carrel creator’s ORCID. #49
FAIRify carrels by adding a way to prompt for, keep & display a Carrel DOI #24

Update Readme line 27

@ericleasemorgan the link to the Slack channel on line 27 of the readme ( http://bit.ly/distantreader-slack ) is no longer valid and needs to be fixed.

https://cord.distantreader.org/carrels should list titles of carrels not just filenames

https://carrels.distantreader.org/ should list titles of carrels not just filenames

review administrative rights for "reader" repository

Eric,

both Natalie ( nkmeyers ) and I should probably have administrative rights to do the following on the prime 'reader' repo:

add status and change status in https://github.com/ndlib/presqt/projects/1
assign people to 'reader' tickets/issues

could you change the settings accordingly?

-Dave

create a Web interface for searching the results

relies on #15

improve crawl success when retrieving content against javascript sites

If the Distant Reader is going to take single URLs or lists of URLs as input then it is important that the crawl succeeds in retrieving content from the sites. So, we need to choose a headless browser and integrate it into the harvester.

For example, the current wget crawl of the Hesburgh Library URL https://library.nd.edu/covid-19-response retrieves only: "Hesburgh Library Hesburgh Library Please enable javascript. VERSION: SHA ENV: ENVIRONMENT library-nd-edu-60906.0040517715371715 " because the website requires javascript to be enabled in order for the website to be rendered.

Web crawlers can implement the equivalent of a headless browser engine to render this type of content successfully. Since so many sites on the web currently utilize javascript for rendering (like React sites), if the Distant Reader uses 'wget' alone then it won't be able to render the javascript dependent sites. There are several ways to automate the crawling of web content that requires javascript. Here is a short list of some application libraries that could be used: Selenium, Electron, Protractor, Puppeteer. The Distant Reader web crawling functionality needs to be enhanced by using a tool like this if it is going to successfully take single URLs or lists of URLs as input, since web content is often rendered in this fashion, or through pre-rendering tools such as Gatsby.js.
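Before wiring in a headless browser, the harvester could at least notice when a plain fetch has returned only a javascript shell, and flag those URLs for re-crawling with one of the tools listed above. A minimal sketch of such a check follows. It is a heuristic and not an existing part of the Reader: the function name needs_javascript and the 50-word threshold are arbitrary assumptions.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def needs_javascript(html, min_words=50):
    """Guess whether a fetched page is only a javascript shell."""
    parser = _TextExtractor()
    parser.feed(html)
    # normalize whitespace in whatever visible text was found
    text = " ".join(" ".join(parser.chunks).split())
    # an explicit "enable javascript" notice is the strongest signal
    if re.search(r"enable\s+javascript", text, re.IGNORECASE):
        return True
    # otherwise, treat a nearly empty page as suspicious
    return len(text.split()) < min_words
```

A page like the Hesburgh Library example, whose only visible text is a "Please enable javascript" notice, would be flagged. Short but legitimate pages would be flagged as well, so the threshold would need tuning against real crawls.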

Integrate https://spacy.io/universe/project/displacy-ent

Integrate https://spacy.io/universe/project/displacy-ent for database download

loop through the database to create a full text index of the selected subset

relates to #14

get access to the cluster

Charlie, send your public SSH key to Eric C., and he will grant you SSH access to our cluster. Eric, please enable Charlie to SSH to our machines.

test

Need a Distant Reader logo.

We require a distant reader logo so that problems like #39 do not arise. This logo will go on the landing page, and will be used on the github as well.

On carrel pages ADD display of the Carrel’s title

See https://carrels.distantreader.org/library/alcott-louisa/ for an example of a carrel page, Then, on all carrel pages change so the carrel title appears on all pages e.g.
Displays as: "Carrel Title: Basic Reports” (not just “Basic Reports”)
so users always know what carrel they’re in.
Related to #32

Banner Image and Photo Credit on distantreader.org

On https://distantreader.org/ add a photo credit to the photo that’s there, which is the Round Reading Room at the British Museum: https://images.app.goo.gl/rq93BxDE5P8pF2nk8

create one study carrel

Using the existing infrastructure, create one study carrel whose content is in any way COVID-related.

introduce yourself to the Reader

Please use the attached file as a sort of tutorial, and through the process I think (hope) you will become familiar with the Reader process. Please tell me whether or not the process worked.

tutorial-01.txt

distantreader.org something in the upper left hand corner.

Looks Like we need to change the distantreader.org landing page a little:


okay we'll change it to say Distant Reader then.

Dave
On 5/13/20 1:56 PM, Julie Vecchio wrote:
> Nothing, please. Thanks! ~j
>
> On Wed, May 13, 2020 at 1:54 PM David Molik <[email protected]> wrote:
>
>     Julie,
>
>     did she say if it should be just the ND or just the NFCDS logo in the upper right hand corner?
>
>     Dave
>     On 5/13/20 12:02 PM, Julie Vecchio wrote:
>>     Hola,
>>
>>     I received a message from our Director of Communications letting me know that we need the logo that is in the upper left corner to be removed from https://distantreader.org/. 
>>
>>     I learned that we do not use institutional branding on individual research websites such as this one. She confirmed that leaving the NFCDS logo at the bottom by way of featuring the project partners is the correct approach (the same approach as the website for the Digital Humanities Research Institute that Dan ran last May: https://dhsouthbend.org/dhri/).
>>
>>     Thank you for removing that upper left logo when you have time in the near future!
>>
>>     ~j
>>     --

New ssh creds for development work in new deploy

-get list of users
-ask for public keys
-new ssh creds for development work in new deploy

create an API for interacting with the Distant Reader's structured data

write a few applications demonstrating the use of the API; for example, 1) write a program to answer Kaggle or TREC-style challenges, or 2) take a list of identifiers as input and output reports based on the associated structured data
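As a starting point for discussion, the second example might look something like the sketch below. The API itself has not been designed. The function name top_entities and the entities table (with id and entity columns) are hypothetical stand-ins for whatever the structured data ends up looking like.

```python
import sqlite3


def top_entities(connection, identifiers, limit=3):
    """Return the most frequent named entities across the given documents."""
    if not identifiers:
        return []
    # one placeholder per identifier keeps the query parameterized
    placeholders = ", ".join("?" for _ in identifiers)
    sql = (
        "SELECT entity, COUNT(*) AS frequency FROM entities "
        f"WHERE id IN ({placeholders}) "
        "GROUP BY entity ORDER BY frequency DESC, entity LIMIT ?"
    )
    return connection.execute(sql, [*identifiers, limit]).fetchall()
```

Given a connection to a database with such a table, top_entities(connection, ["a", "b"]) returns a list of (entity, frequency) tuples, which a report script could format as tabular or narrative output.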

introduce yourself to the Distant Reader process

Please SSH to the head node and practice with the attached tutorial. When you are finished I think (hope) you will have a better idea of how the Reader functions.

tutorial-01.txt

enhance the Distant Reader with more interactivity

1) integrate topic modeling into the interface, or 2) hyperlink features of interest and search the index

write distantreader article (for software, not Covid-19)

precursor to #10 ( #10 will cite this article )

write journal article documenting distantreader and its uses.

three viable options:

https://www.mitpressjournals.org/loi/coli
https://journals.plos.org
https://www.mdpi.com/journal/publications

test distant reader COVID-19

run the Distant Reader against a subset of CORD-19, and save the output in the form of narrative reports, tabular data sets, and in the database; this functionality already exists since this is the core function of the Distant Reader