kevinbretonnelcohen / cord-19

Natural language processing code for coronavirus corpus CORD-19

HTML 99.72% Perl 0.27% Shell 0.01%

cord-19's Introduction

CORD-19

This repository holds computational linguistics, natural language processing, and text mining tools for working with CORD-19, the COVID-19 Open Research Dataset, a corpus of coronavirus-related papers. See here for information on the dataset itself:

https://pages.semanticscholar.org/coronavirus-research

CORD-19-JSON-Parser.Rmd/.html: Parses the CORD-19 JSON files and pulls out the important text.

CORD-19 Word Cloud.Rmd/.html: A COVID-19-specific word cloud, analysis of word associations, and analysis of lexical frequencies. NOTE: Any other versions of WordCloud code in here are deprecated.

CORD-19 Text clustering experiments.Rmd/.html: Simple text clustering experiments using K-means and hierarchical clustering, run with several different parameter settings.

lexicalFrequency.pl: Raw frequencies of words, without preprocessing.

overrepresentedWords.pl: Run this on the output of lexicalFrequency.pl to get over-represented words in the corpus.

test.overrepresented.01/.02.txt: Test data for overrepresentedWords.pl
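
To illustrate the kind of extraction the JSON parser listed above performs, here is a minimal sketch in Python (not the repository's R code). It assumes the CORD-19 JSON layout in which the title sits under `metadata.title` and `abstract` and `body_text` are lists of paragraph objects with a `text` field; the function name `extract_text` and the sample record are hypothetical.

```python
import json


def extract_text(doc):
    """Return title, abstract, and body text from one parsed CORD-19-style record."""
    title = doc.get("metadata", {}).get("title", "")
    # Assumed layout: abstract and body_text are lists of {"text": ...} paragraphs.
    abstract = " ".join(p.get("text", "") for p in doc.get("abstract", []))
    body = " ".join(p.get("text", "") for p in doc.get("body_text", []))
    return {"title": title, "abstract": abstract, "body": body}


# Illustrative record, not real corpus data.
sample = json.loads("""
{"metadata": {"title": "A toy title"},
 "abstract": [{"text": "First abstract sentence."}],
 "body_text": [{"text": "Paragraph one."}, {"text": "Paragraph two."}]}
""")
sections = extract_text(sample)
```

Using `.get` with defaults keeps the sketch from failing on records that lack an abstract or a title.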

cord-19's Issues

JSON parser needs section-specific markers in the debugging output

The debugging output currently includes text that marks where the abstract is. It needs similar markers for the title and for the article body.

JSON Parser: count of files processed is not being incremented

The last action of the script should be to output a count of the files that have been processed. It does output a count, but that count is clearly not being incremented: it is reported as 0, even though a large amount of output is being produced.
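
The issue does not say why the count stays at 0. One common cause is that the variable being incremented is not the one being reported, for example because the increment happens in a different scope. A minimal Python sketch of the intended behaviour, with a hypothetical `process_files` helper and made-up file names:

```python
def process_files(paths, handle):
    """Apply handle to each path and return how many files were processed."""
    processed = 0
    for path in paths:
        handle(path)
        processed += 1  # incremented inside the loop, on the same variable that is returned
    return processed


seen = []
total = process_files(["a.json", "b.json", "c.json"], seen.append)
print(f"Files processed: {total}")
```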

JSON Parser needs TITLE+ABSTRACT option

The parser currently pulls text from titles, abstracts, the article body, or the full paper, but not from title plus abstract, which seems like a reasonable unit of analysis.
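
One way such an option could be structured is a lookup from option name to the sections it includes. The sketch below is in Python rather than the parser's R, and the option names, the `select_text` function, and the sample paper are all hypothetical.

```python
# Hypothetical option names mapped to the sections each one includes.
SECTION_CHOICES = {
    "title": ("title",),
    "abstract": ("abstract",),
    "body": ("body",),
    "title+abstract": ("title", "abstract"),
    "full": ("title", "abstract", "body"),
}


def select_text(sections, choice):
    """Join the non-empty sections named by the chosen option, one per line."""
    parts = [sections[name] for name in SECTION_CHOICES[choice] if sections.get(name)]
    return "\n".join(parts)


# Illustrative data only.
paper = {"title": "A toy title", "abstract": "A toy abstract.", "body": "A toy body."}
unit = select_text(paper, "title+abstract")
```

With a table like this, adding another unit of analysis means adding one entry rather than another branch in the extraction code.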

overrepresentedWords.pl: logic error involving computing relative frequencies

I have a logic error: I need to go through all the words in one corpus (and then, optionally, the other), but I'm currently going through the union of the words in the corpus of interest and the reference corpus.
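
The distinction can be sketched as follows, in Python rather than the script's Perl. The sketch assumes that over-representation is measured as the ratio of a word's relative frequency in the corpus of interest to its relative frequency in the reference corpus, which the issue titles suggest but do not spell out. Returning infinity for words absent from the reference corpus is also an assumption, and the function names and counts are illustrative.

```python
from collections import Counter


def relative_frequencies(counts):
    """Convert raw counts into proportions of the corpus total."""
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def overrepresentation_ratios(interest, reference):
    """Ratio of relative frequencies, for words in the corpus of interest only."""
    rel_i = relative_frequencies(interest)
    rel_r = relative_frequencies(reference)
    # Iterate over the corpus-of-interest vocabulary, not the union of both vocabularies.
    # Assumption: a word missing from the reference corpus gets an infinite ratio.
    return {w: (rel_i[w] / rel_r[w] if w in rel_r else float("inf")) for w in rel_i}


# Toy counts, not real corpus data.
interest = Counter({"virus": 6, "the": 4})
reference = Counter({"the": 8, "cat": 2})
ratios = overrepresentation_ratios(interest, reference)
```

Here `cat` never appears in the result, because it does not occur in the corpus of interest.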

overrepresentedWords.pl: ratios don't look right for the corpus of interest

All ratios are coming out at 1, even though the script passes both the structured and the metamorphic tests.

JSON Parser: Needs test cases

I have some files for validation, but this still needs actual test cases.

JSON Parser: Add line breaks after article body paragraphs

When I read in the article body, I'm not preserving the divisions between paragraphs. It would be nice to keep those, although I don't need them right at this moment, since I'm doing purely lexical stuff.
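
A minimal sketch of one way to keep the paragraph divisions, in Python rather than the parser's R. It assumes the article body arrives as a list of paragraph objects with a `text` field, as in the CORD-19 JSON files; the `join_body` name and the sample data are hypothetical.

```python
def join_body(paragraphs):
    """Join paragraph texts with a blank line between them, skipping empty paragraphs."""
    texts = (p.get("text", "").strip() for p in paragraphs)
    return "\n\n".join(t for t in texts if t)


# Illustrative data only.
body_text = [{"text": "Paragraph one."}, {"text": ""}, {"text": "Paragraph two."}]
body = join_body(body_text)
```

Purely lexical processing is unaffected by the extra line breaks, since whitespace tokenization treats them like any other separator.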

Feature request: Over-represented words script: get over-represented words in the reference corpus

Currently I get the words that are over-represented in the corpus of interest as compared to the reference corpus, but not the ones that are over-represented in the reference corpus as compared to the corpus of interest. That's fine from the point of view of the research questions about the corpus of interest, but knowing which words are "over-represented" in the reference corpus as compared to the corpus of interest is a good sanity check on the functioning of the script, so let's add that in, too. It's a simple return to the previous approach of calculating the ratios for the union of the words in the vocabularies, rather than the ratios for only those words that are in the corpus of interest.
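
The union-based approach described here can be sketched as follows, in Python rather than the script's Perl. The threshold of 1 for deciding the direction of over-representation, the infinite ratio for words absent from the reference corpus, and all names and counts are assumptions made for illustration.

```python
def ratios_over_union(interest, reference):
    """Ratio of relative frequencies for every word in either vocabulary."""
    total_i = sum(interest.values())
    total_r = sum(reference.values())
    ratios = {}
    for word in set(interest) | set(reference):
        rel_i = interest.get(word, 0) / total_i
        rel_r = reference.get(word, 0) / total_r
        # Assumption: a word absent from the reference corpus gets an infinite ratio.
        ratios[word] = rel_i / rel_r if rel_r > 0 else float("inf")
    return ratios


def split_by_direction(ratios):
    """Separate words over-represented in each corpus (assumed threshold: ratio of 1)."""
    over_interest = {w for w, r in ratios.items() if r > 1}
    over_reference = {w for w, r in ratios.items() if r < 1}
    return over_interest, over_reference


# Toy counts, not real corpus data.
interest = {"virus": 6, "the": 4}
reference = {"the": 8, "cat": 2}
over_interest, over_reference = split_by_direction(ratios_over_union(interest, reference))
```

Words that occur only in the reference corpus get a ratio of 0 and so land in the second set, which is what makes that set usable as a sanity check.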

Basic analysis code needs to be refactored into functions

The code for frequency analysis needs to be refactored. The "pipeline" should be broken down into functions so that multiple datasets can be compared.
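
A sketch of what such a decomposition might look like, in Python rather than the repository's R. The function names, the whitespace tokenization, and the sample datasets are assumptions, not the existing code.

```python
from collections import Counter


def tokenize(text):
    """Split on whitespace only (assumed; no other preprocessing)."""
    return text.split()


def frequency_table(text):
    """Count token occurrences in one dataset."""
    return Counter(tokenize(text))


def analyze(datasets):
    """Run the same frequency pipeline over every named dataset."""
    return {name: frequency_table(text) for name, text in datasets.items()}


# Illustrative inputs only.
tables = analyze({"interest": "virus virus the", "reference": "the cat the"})
```

Because each step is its own function, comparing another dataset only requires adding an entry to the dictionary passed to `analyze`.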

All R scripts: Make input and output directory paths easily customizable

The input and output directory paths should be put into easy-to-find variables so that others can adapt the scripts more easily.
