CitationProfileR

About

CitationProfileR is an R package and Shiny web app that allows users to upload a PDF or citation file and get statistics on the gender and geographic distribution of the works they cite. The results are summarized and visualized in a publication-ready form and can be downloaded. The package uses data from several web services, including the Crossref API, the GROBID API, Gender-API, and OpenStreetMap, as well as the data extracted from the uploaded files.

Contributors

Name - Contributions
Adriana Beltran Andrade - 🔢 💻 🤔
Lika Mikhelashvili - 🔢 💻 🤔
Mackie Zhou - 🔢 💻 🤔
Rithika Devarakonda - 🔢 💻 🤔
Lukas Wallrich - 🔢 🧑‍🏫

Definitions

Citation - A reference to a source of information in an academic paper. Citations typically include information such as author names, article title, DOI, and date of publication.

Diversity Statement - A diversity statement of an academic journal is a statement that acknowledges the gender and/or racial imbalance within a scientific field. The diversity statement motivates researchers to pay particular attention to the gender and racial breakdown of the authors cited in their work. It recognizes existing biases and aims for greater inclusivity in the field.

How to Access

The CitationProfileR Shiny dashboard can be accessed by downloading the package and running the app locally; external hosting on a public website, discoverable through search engines, is planned.

When the dashboard is launched locally, it opens at a local address such as http://127.0.0.1:4955.

A user can launch the Shiny dashboard by opening the app.R script, located at citationProfileR/inst/CitationProfileR/app.R, and clicking the Run App button at the top of the file in RStudio.
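
For instance, a minimal way to launch it from the R console, assuming the app directory is installed with the package so that system.file() can locate it:

library(shiny)

# Locate the installed app directory and run it
app_dir <- system.file("CitationProfileR", package = "CitationProfileR")
runApp(app_dir)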

Dependencies

There are no special dependencies. All one needs is the latest version of RStudio downloaded and installed.

How to Install

You can install the development version of CitationProfileR from GitHub with:

# install.packages("devtools")
devtools::install_github("LukasWallrich/citationProfileR")

Functions/Datasets Included

Our package includes the following functions, which allow the user to extract information about all authors cited in the paper uploaded to our app and to obtain a gender prediction for each name. Users can also retrieve a diversity statement and view a bar plot of the count per gender in the web app.

  • first_name_check takes a data frame of extracted citations returned from the GROBID API and returns the first name of every author

  • get_author_info takes a data frame containing every cited author's name, paper title, and publication date and returns the first and last names of all cited authors from the Crossref API

  • guess_gender takes a cited author's name, a geographic location given as a country code, and a flag indicating whether to use the cache feature, which reuses predictions made for the same name in earlier calls. It returns a data frame containing the author's name, location, and the associated gender prediction and accuracy measure from Gender-API

  • parse_pdf_refs takes a PDF uploaded by a user that contains a works-cited page and returns the isolated references, i.e. every cited author and their respective work, as extracted by GROBID

  • get_location takes a data frame of all cited authors' affiliations and uses the Crossref API to return a data frame with the associated country and country code, in the standardized ISO 3166 format, for every given author

Examples

These are some basic examples for every function in our package.

First, load CitationProfileR R package:

library(CitationProfileR)

To use the first_name_check function, a user needs a CSV file of citations saved locally. Once the CSV file has been read into R, they can call the function. Some example CSV files that a user can access are included in the test-data subfolder within the package's inst folder.

file_path <- system.file("test-data", "test_citations_table2.csv", package = "CitationProfileR")
sample_data_frame <- read.csv(file_path)
first_name_check(sample_data_frame)

Likewise, we follow the same procedure for the get_author_info function as for first_name_check. The example CSV files within our package will also work with this function.

file_path <- system.file("test-data", "test_citations_table2.csv", package = "CitationProfileR")
sample_data_frame <- read.csv(file_path)
get_author_info(sample_data_frame)

For the guess_gender function, a user replaces the name parameter with a name of their own, in quotation marks, along with a country code of their choice, also in quotation marks.

# General form
# guess_gender(name, countrycode)

# Example call using a name and country code of the user's choice.
# Here the name is Rithika and the country is the United States ("US").
guess_gender("Rithika", "US")

The parse_pdf_refs function takes a PDF loaded into RStudio; an example PDF is also available in the package for users who want to run the function.

file_path <- system.file("test-data", "Wallrich_et_al_2020.pdf", package = "CitationProfileR")
parse_pdf_refs(file_path)

The get_location() function takes in a data frame with affiliations and outputs the country names and country codes of where the affiliations are located. The affiliations column name defaults to "affiliation.name", but the user can set a different column name. The sample_data_frame data frame built below from the package's example data is one the user can try the function on.

file_path <- system.file("test-data", "test_citations_table2.csv", package = "CitationProfileR")
sample_data_frame <- read.csv(file_path)
get_location(sample_data_frame)

Data Sources

CitationProfileR's source of data is any academic article uploaded as a PDF to the Shiny UI by users of the package. After the PDF is uploaded, the parse_pdf_refs() function parses the contents of the file and outputs a data frame with all the cited authors, along with their affiliations and DOIs where applicable. The guess_gender() function then takes this data frame and outputs a new one that includes the predicted gender and the probability of accuracy for every given name, using Gender-API.
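
A rough sketch of that pipeline outside the Shiny UI, assuming the single-name guess_gender() interface shown in the examples above; the PDF path and the column names first_name and country_code are placeholders for whatever the package actually returns:

library(CitationProfileR)

# Parse the uploaded article into a data frame of cited authors
refs <- parse_pdf_refs("uploaded_article.pdf")

# Resolve the authors' affiliations to countries and country codes
refs <- get_location(refs)

# Predict a gender for each first name / country code pair
predictions <- Map(guess_gender, refs$first_name, refs$country_code)
predictions <- do.call(rbind, predictions)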

Data Collection and Update Process

The data does not need to be updated, either manually or automatically, because users supply the academic articles themselves.

Repo Architecture

This repository follows the standard R package structure. The R folder contains the code for the functions available in CitationProfileR, separated into different R scripts. The code for the Shiny UI dashboard is in the inst folder of the repository. A user can access the dashboard by launching the app as described above or by working from a cloned copy of the repository on their local device.

License

MIT License. Copyright (c) 2023 CitationProfileR authors.

How to Provide Feedback

Questions, bug reports, and feature requests can be submitted to this repo's issue queue.

Have Questions?

Contact us at [email protected] or [email protected].

citationprofiler's Issues

Set up R package

This should probably be packaged into an R package so that it can easily be installed, and also be used locally by people who want to do batch processing.

This also provides a good incentive to engage in proper functional programming - i.e. to break code into clean functions and to document them. See Hadley Wickham's R package development book for reference

ToDos for now:

  • Fill in DESCRIPTION
  • Keep adding dependencies into DESCRIPTION (see the sketch after this list)
  • Try to keep package in shape that passes tests (in Github Action)
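
A minimal sketch of how the DESCRIPTION-related ToDos could be handled with usethis (the package names below are illustrative, not the confirmed dependency list):

# Record run-time dependencies in the Imports field of DESCRIPTION
usethis::use_package("dplyr")
usethis::use_package("shiny")

# Record development-only dependencies in Suggests
usethis::use_package("testthat", type = "Suggests")

# Run the same checks as the GitHub Action locally before pushing
devtools::check()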

Non-ASCII characters in `sample_data_frame` for `get_location()`

When doing the R CMD check I keep getting a warning about non-ASCII characters present in sample_data_frame (a screenshot is attached to the original issue). I reviewed section 8.1.2, "Documenting datasets", in the R Packages book, and it seems to say that this warning is not a big deal and that it is suppressed by R CMD check --as-cran. I could not find a way to resolve this, so I think this would be an enhancement for the future.
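
If we do want to address the warning at some point, one way to locate or convert the offending characters (the file path and column name below are illustrative):

# List lines containing non-ASCII characters in the data-preparation script
tools::showNonASCIIfile("data-raw/sample_data_frame.R")

# Or transliterate an affected column to ASCII
sample_data_frame$affiliation.name <- iconv(
  sample_data_frame$affiliation.name,
  from = "UTF-8", to = "ASCII//TRANSLIT"
)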

Example for both replace_key and guess_gender does not pass CMD check

Currently, the examples in the documentation for the replace_key and guess_gender functions cannot pass the CMD check: the examples are unable to access the api_keys file they need in order to run, despite the file being present in all relevant folders. For now, I added a \dontrun{} statement so that the CMD check passes, but in the future we should remove this statement and try to fix the underlying error.
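
For reference, this is roughly what the current workaround looks like in the roxygen block (the example call is illustrative):

#' @examples
#' \dontrun{
#' # Requires a valid Gender-API key in the api_keys file
#' guess_gender("Rithika", "US")
#' }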

Consistency in the country names returned by `get_location()`

Country names are returned in the national language rather than English in some entries. The countrycode package should work for that, unless OSM can return English as well ... as it stands, we get a mix of English names (extracted directly) and national-language names (extracted from OSM) ...
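
A minimal sketch of how countrycode could standardize the names, assuming get_location() already fills the country_code column with ISO 3166 alpha-2 codes (locations is a placeholder for the data frame it returns):

library(countrycode)

# Overwrite the mixed-language names with English names derived from the codes
locations$country_name <- countrycode(
  locations$country_code,
  origin = "iso2c",
  destination = "country.name"
)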

Create logo

We should create a logo for the package & app - probably a hexsticker.

The polaroid shiny app should work well.
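
If we end up generating the sticker in code rather than in the polaroid app, a minimal sketch with the hexSticker package (the icon path and output filename are placeholders):

library(hexSticker)

# Build a hex sticker from a draft icon image
sticker("draft_icon.png",
        package = "CitationProfileR",
        filename = "man/figures/logo.png")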

Code for geo-location

I just wanted to play around a bit with the "new" Bing search - i.e. the chat-bot search (which I found really disappointing after all the hype). Anyway, I used it to figure out how to geo-locate an affiliation:

Option 1, using OSM

This works well, but relies on the rather heavy tmaptools package - an approach with fewer dependencies would be great (maybe just directly building on the code from this package, or on the lighter osmdata package) ... but if you are pressed for time, you can also test and use this approach, and document this as an opportunity for refactoring ...

library(tmaptools)

# Geocode the affiliation, then reverse-geocode the coordinates to get the country code
loc <- geocode_OSM("Universität des Saarlandes")
x <- rev_geocode_OSM(loc$coords["x"], loc$coords["y"])
x[[1]]$country_code

Option 2, using wikidata

This would be great as it is linked to other university metadata - but searching wikidata from strings seems very difficult ... unless one of you wants to figure that out, let's skip for now :)

(Auto-)deploy Shiny app

Once we have an MVP, we need to deploy it to shinyapps.io (or decide that it doesn't work well there for some reason, and choose a different service) - and then we should ensure that it gets auto-deployed with updates (through a GitHub action).
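
A minimal sketch of the deployment step such a GitHub Action could run, assuming shinyapps.io credentials are stored as repository secrets (the account name, secret names, and app paths are placeholders):

library(rsconnect)

# Authenticate with shinyapps.io using credentials injected by CI
setAccountInfo(name   = "citationprofiler",
               token  = Sys.getenv("SHINYAPPS_TOKEN"),
               secret = Sys.getenv("SHINYAPPS_SECRET"))

# Deploy the Shiny app bundled in inst/
deployApp(appDir = "inst/CitationProfileR", appName = "CitationProfileR")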

`get_location()` function efficiency

get_location() currently checks whether the country_name and country_code columns have any pre-existing values other than NA after the function has already fetched the locations from OSM and the countrycode package. This might not be the most efficient approach. In the future, it would be nice to check for pre-existing values before the function queries OSM and countrycode.
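
A sketch of the intended order of operations; the column names follow the ones get_location() already uses, while the helper lookup_location_osm() is hypothetical and stands in for the existing OSM/countrycode lookup:

# Only query OSM/countrycode for rows still missing a country
missing <- is.na(affiliations$country_name) | is.na(affiliations$country_code)

if (any(missing)) {
  affiliations[missing, c("country_name", "country_code")] <-
    lookup_location_osm(affiliations$affiliation.name[missing])
}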

Working directory for parse_pdf_refs

The working directory set in the parse_pdf_refs function is a local file path. We have to make the path machine-independent; the hard-coded path will break the code for anyone else who tries to run it. The variable that needs to be changed is f_bib.
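
A minimal sketch of the kind of change intended, assuming f_bib currently holds a hard-coded path for the intermediate reference file:

# Write the intermediate reference file to a session temp directory
# instead of a machine-specific path
f_bib <- tempfile(fileext = ".bib")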

Release CitationProfileR 0.0.1

This is far in the future - but worth keeping some of the points in mind from the start (e.g., licensing of included files)

First release:

Prepare for release:

  • git pull
  • urlchecker::url_check()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • git push

Submit to CRAN:

  • usethis::use_version('patch')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • git push
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • usethis::use_news_md()
  • git push

Link to bibliometrix package for more advanced analyses

The bibliometrix package offers comprehensive bibliometric analyses, e.g. regarding author centrality. These would be enhanced by the gender dimension, so in a future version this package could be linked to bibliometrix. It is not yet clear whether that would just be a vignette showing a workflow or a set of wrappers around key bibliometrix functions here.
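
For orientation, a minimal sketch of the bibliometrix side of such a workflow, assuming the references are available as a BibTeX export from Web of Science (the file name is a placeholder):

library(bibliometrix)

# Read the exported references into a bibliometrix data frame
M <- convert2df("references.bib", dbsource = "wos", format = "bibtex")

# Run the standard descriptive analysis (author productivity, citations, ...)
results <- biblioAnalysis(M)
summary(results, k = 10)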

Review resources

As discussed, I added some resources into the team-resources folder, specifically:

  • crossref_and_related.R - contains extract_crossref(), a function to extract Crossref data if you do not have a DOI (if you do have a DOI, it is much easier, as shown in the get_citation() function in there). The matching is currently based on exact text matching - probably something fuzzier would be better.
  • pdf_extraction.R implements one way to extract references from PDFs - note that this includes a Python function that you need to put into a different file and call through reticulate - below - or translate into R
  • wikidata_example.Rmd - just a simple Wikidata example, in case you want to use Wikidata to identify university locations - you could consider alternatives (e.g., OpenStreetMap), but they would seem to create problems with satellite campuses abroad, which give universities addresses in multiple countries (I would presently ignore satellite campuses, unless you find a reliable way to differentiate them from the main university)
  • screen_matches.R - a basic Shiny app that you could build on - but there are probably better templates. Also, maybe this screen is close to something we need to include eventually to check dubious crossref matches (but that could also be a table, maybe borrowed from ASySD)

Packages to consider (see the usage sketch after this list):

  • rcrossref for crossref API access
  • reticulate to use Python code within R (e.g., if there are self-contained functions within cleanBib that you do not want to translate unnecessarily; the most important function is source_python, which simply makes Python functions available to R)
  • bib2df to parse .bib reference files - that users can upload/provide directly
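
A minimal usage sketch for these packages (the query string, file names, and Python script name are placeholders):

library(rcrossref)
library(bib2df)
library(reticulate)

# Query Crossref for candidate matches to a reference string
hits <- cr_works(query = "example article title", limit = 5)

# Parse a user-supplied .bib file into a data frame
refs <- bib2df("references.bib")

# Make self-contained Python helpers (e.g., from cleanBib) available in R
source_python("pdf_extraction_helpers.py")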

Diversity statement texts - create template to guide analysis

We need to agree on the specific template text we want to populate so that we can make sure to get the right data out of the analysis.

For now, here is the most common statement, as per Zurn et al. (2020):

Recent work in several fields of science has identified a bias in citation practices such that papers from women and other minorities are under-cited relative to the number of such papers in the field [2., 3., 4., 5., 6.]. Here we sought to proactively consider choosing references that reflect the diversity of the field in thought, form of contribution, gender, and other factors. We obtained predicted gender of the first and last author of each reference by using databases that store the probability of a name being carried by a man or a woman [4]. By this measure (and excluding self-citations to the first and last authors of our current paper), our references contain 42.9% woman(first)/woman(last), 28.6% man/woman, 7.1% woman/man, and 21.4% man/man. This method is limited in that: (i) names, pronouns, and social media profiles used to construct the databases may not, in every case, be indicative of gender identity, and (ii) it cannot account for intersex, non-binary, or transgender people. We look forward to future work that could help us to better understand how to support equitable practices in science.

This only considers first and last authors and requires knowing their gender. In addition, I, at least, would be keen to report the gender of all authors, especially because the last author only matters in some scientific fields.

Crossref - add year to search

I checked how to use query.bibliographic and just wanted to highlight that you can use the GitHub code search for that - if you restrict it to R code, you can see how other people have done this:

https://github.com/search?l=R&p=2&q=query.bibliographic&type=Code

Everyone using title and year seems to just paste them together - the only attempt to use multiple elements is commented out, so it might not work. An example for combining year and title:

https://github.com/markrobinsonuzh/os_monitor/blob/a6e7e80ac3c14235265ab1daffb72f51c137da49/code/rcrossref_rest.R
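
A minimal sketch of that paste-them-together approach with rcrossref (the title and year are placeholders):

library(rcrossref)

title <- "Example article title"
year  <- 2020

# Combine title and year into a single bibliographic field query
hits <- cr_works(flq = c(query.bibliographic = paste(title, year)),
                 limit = 5)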
