CitationProfileR

About

CitationProfileR is an R package and Shiny web app that allows users to upload a PDF or citation file and get statistics on the gender and geographic distribution of the works they cite. The results are summarized and visualized in a publication-ready form and can be downloaded. The package uses data from several web services, including the Crossref API, the GROBID API, Gender-API, and OpenStreetMap, as well as the data extracted from the uploaded files.

Contributors

Name - Contributions
Adriana Beltran Andrade - 🔢 💻 🤔
Lika Mikhelashvili - 🔢 💻 🤔
Mackie Zhou - 🔢 💻 🤔
Rithika Devarakonda - 🔢 💻 🤔
Lukas Wallrich - 🔢 🧑‍🏫

Definitions

Citation - A reference to a source of information in an academic paper. Citations typically include information such as author names, article title, DOI, and date of publication.

Diversity Statement - A diversity statement of an academic journal is a statement that acknowledges the gender and/or racial imbalance within a scientific field. The diversity statement motivates researchers to pay particular attention to the gender and racial breakdown of the authors cited in their work. It recognizes existing biases and aims for greater inclusivity in the field.

How to Access

The CitationProfileR Shiny dashboard can be accessed by downloading the package and running the app locally; external hosting on a public website, discoverable through search engines, is planned.

When the dashboard is launched locally, it opens at a local address such as http://127.0.0.1:4955.

A user can launch the Shiny dashboard by opening the app.R script, located at citationProfileR/inst/CitationProfileR/app.R, and clicking the Run App button at the top of the file in RStudio.
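
For instance, a minimal way to launch it from the R console, assuming the app directory is installed with the package so that system.file() can locate it:

library(shiny)

# Locate the installed app directory and run it
app_dir <- system.file("CitationProfileR", package = "CitationProfileR")
runApp(app_dir)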

Dependencies

There are no special dependencies. All one needs is the latest version of RStudio downloaded and installed.

How to Install

You can install the development version of CitationProfileR from GitHub with:

# install.packages("devtools")
devtools::install_github("LukasWallrich/citationProfileR")

Functions/Datasets Included

Our package includes the following functions, which allow the user to extract information about all authors cited in the paper uploaded to our app and to obtain a gender prediction for each name. Users can also retrieve a diversity statement and view a bar plot of the count per gender in the web app.

  • first_name_check takes a data frame of extracted citations returned from the GROBID API and returns the first name of every author

  • get_author_info takes a data frame containing every cited author's name, paper title, and publication date and returns the first and last names of all cited authors from the Crossref API

  • guess_gender takes a cited author's name, a geographic location given as a country code, and a flag indicating whether to use the cache feature, which reuses predictions made for the same name in earlier calls. It returns a data frame containing the author's name, location, and the associated gender prediction and accuracy measure from Gender-API

  • parse_pdf_refs takes a PDF uploaded by a user that contains a works-cited page and returns the isolated references, i.e. every cited author and their respective work, as extracted by GROBID

  • get_location takes a data frame of all cited authors' affiliations and uses the Crossref API to return a data frame with the associated country and country code, in the standardized ISO 3166 format, for every given author

Examples

These are some basic examples for every function in our package.

First, load CitationProfileR R package:

library(CitationProfileR)

To use the first_name_check function, a user needs a CSV file of citations saved locally. Once the CSV file has been read into R, they can call the function. Some example CSV files that a user can access are included in the test-data subfolder within the package's inst folder.

file_path <- system.file("test-data", "test_citations_table2.csv", package = "CitationProfileR")
sample_data_frame <- read.csv(file_path)
first_name_check(sample_data_frame)

Likewise, we follow the same procedure for the get_author_info function as for first_name_check. The example CSV files within our package will also work with this function.

file_path <- system.file("test-data", "test_citations_table2.csv", package = "CitationProfileR")
sample_data_frame <- read.csv(file_path)
get_author_info(sample_data_frame)

For the guess_gender function, a user replaces the name parameter with a name of their own, in quotation marks, along with a country code of their choice, also in quotation marks.

# General form
# guess_gender(name, countrycode)

# Example call using a name and country code of the user's choice.
# Here the name is Rithika and the country is the United States ("US").
guess_gender("Rithika", "US")

The parse_pdf_refs function takes a PDF loaded into RStudio; an example PDF is also available in the package for users who want to run the function.

file_path <- system.file("test-data", "Wallrich_et_al_2020.pdf", package = "CitationProfileR")
parse_pdf_refs(file_path)

The get_location() function takes in a data frame with affiliations and outputs the country names and country codes of where the affiliations are located. The affiliations column name defaults to "affiliation.name", but the user can set a different column name. The sample_data_frame data frame built below from the package's example data is one the user can try the function on.

file_path <- system.file("test-data", "test_citations_table2.csv", package = "CitationProfileR")
sample_data_frame <- read.csv(file_path)
get_location(sample_data_frame)

Data Sources

CitationProfileR's source of data is any academic article uploaded as a PDF to the Shiny UI by users of the package. After the PDF is uploaded, the parse_pdf_refs() function parses the contents of the file and outputs a data frame with all the cited authors, along with their affiliations and DOIs where applicable. The guess_gender() function then takes this data frame and outputs a new one that includes the predicted gender and the probability of accuracy for every given name, using Gender-API.
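
A rough sketch of that pipeline outside the Shiny UI, assuming the single-name guess_gender() interface shown in the examples above; the PDF path and the column names first_name and country_code are placeholders for whatever the package actually returns:

library(CitationProfileR)

# Parse the uploaded article into a data frame of cited authors
refs <- parse_pdf_refs("uploaded_article.pdf")

# Resolve the authors' affiliations to countries and country codes
refs <- get_location(refs)

# Predict a gender for each first name / country code pair
predictions <- Map(guess_gender, refs$first_name, refs$country_code)
predictions <- do.call(rbind, predictions)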

Data Collection and Update Process

The data does not need to be updated, either manually or automatically, because users supply the academic articles themselves.

Repo Architecture

This repository follows the standard R package structure. The R folder contains the code for the functions available in CitationProfileR, separated into different R scripts. The code for the Shiny UI dashboard is in the inst folder of the repository. A user can access the dashboard by launching the app as described above or by working from a cloned copy of the repository on their local device.

License

MIT License. Copyright (c) 2023 CitationProfileR authors.

How to Provide Feedback

Questions, bug reports, and feature requests can be submitted to this repo's issue queue.

Have Questions?

Contact us at [email protected] or [email protected].

citationprofiler's Issues

Set up R package

This should probably be packaged into an R package so that it can easily be installed, and also be used locally by people who want to do batch processing.

This also provides a good incentive to engage in proper functional programming - i.e. to break code into clean functions and to document them. See Hadley Wickham's R package development book for reference

ToDos for now:

  • Fill in DESCRIPTION
  • Keep adding dependencies into DESCRIPTION (see the sketch after this list)
  • Try to keep package in shape that passes tests (in Github Action)
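
A minimal sketch of how the DESCRIPTION-related ToDos could be handled with usethis (the package names below are illustrative, not the confirmed dependency list):

# Record run-time dependencies in the Imports field of DESCRIPTION
usethis::use_package("dplyr")
usethis::use_package("shiny")

# Record development-only dependencies in Suggests
usethis::use_package("testthat", type = "Suggests")

# Run the same checks as the GitHub Action locally before pushing
devtools::check()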

Non-ASCII characters in `sample_data_frame` for `get_location()`

When doing the R CMD check I keep getting a warning about non-ASCII characters present in sample_data_frame (a screenshot is attached to the original issue). I reviewed section 8.1.2, "Documenting datasets", in the R Packages book, and it seems to say that this warning is not a big deal and that it is suppressed by R CMD check --as-cran. I could not find a way to resolve this, so I think this would be an enhancement for the future.
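
If we do want to address the warning at some point, one way to locate or convert the offending characters (the file path and column name below are illustrative):

# List lines containing non-ASCII characters in the data-preparation script
tools::showNonASCIIfile("data-raw/sample_data_frame.R")

# Or transliterate an affected column to ASCII
sample_data_frame$affiliation.name <- iconv(
  sample_data_frame$affiliation.name,
  from = "UTF-8", to = "ASCII//TRANSLIT"
)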

Example for both replace_key and guess_gender does not pass CMD check

Currently, the examples in the documentation for the replace_key and guess_gender functions cannot pass the CMD check: the examples are unable to access the api_keys file they need in order to run, despite the file being present in all relevant folders. For now, I added a \dontrun{} statement so that the CMD check passes, but in the future we should remove this statement and try to fix the underlying error.
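
For reference, this is roughly what the current workaround looks like in the roxygen block (the example call is illustrative):

#' @examples
#' \dontrun{
#' # Requires a valid Gender-API key in the api_keys file
#' guess_gender("Rithika", "US")
#' }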

Consistency in the country names returned by `get_location()`

Country names are returned in the national language rather than English in some entries. The countrycode package should work for that, unless OSM can return English as well ... as it stands, we get a mix of English names (extracted directly) and national-language names (extracted from OSM) ...
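
A minimal sketch of how countrycode could standardize the names, assuming get_location() already fills the country_code column with ISO 3166 alpha-2 codes (locations is a placeholder for the data frame it returns):

library(countrycode)

# Overwrite the mixed-language names with English names derived from the codes
locations$country_name <- countrycode(
  locations$country_code,
  origin = "iso2c",
  destination = "country.name"
)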

Create logo

We should create a logo for the package & app - probably a hexsticker.

The polaroid shiny app should work well.
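
If we end up generating the sticker in code rather than in the polaroid app, a minimal sketch with the hexSticker package (the icon path and output filename are placeholders):

library(hexSticker)

# Build a hex sticker from a draft icon image
sticker("draft_icon.png",
        package = "CitationProfileR",
        filename = "man/figures/logo.png")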

Code for geo-location

I just wanted to play around a bit with the "new" Bing search - i.e. the chat-bot search (which I found really disappointing after all the hype). Anyway, I used it to figure out how to geo-locate an affiliation:

Option 1, using OSM

This works well, but relies on the rather heavy tmaptools package - an approach with fewer dependencies would be great (maybe just directly building on the code from this package, or on the lighter osmdata package) ... but if you are pressed for time, you can also test and use this approach, and document this as an opportunity for refactoring ...

library(tmaptools)

# Geocode the affiliation, then reverse-geocode the coordinates to get the country code
loc <- geocode_OSM("Universität des Saarlandes")
x <- rev_geocode_OSM(loc$coords["x"], loc$coords["y"])
x[[1]]$country_code

Option 2, using wikidata

This would be great as it is linked to other university metadata - but searching wikidata from strings seems very difficult ... unless one of you wants to figure that out, let's skip for now :)

(Auto-)deploy Shiny app

Once we have an MVP, we need to deploy it to shinyapps.io (or decide that it doesn't work well there for some reason, and choose a different service) - and then we should ensure that it gets auto-deployed with updates (through a GitHub action).
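
A minimal sketch of the deployment step such a GitHub Action could run, assuming shinyapps.io credentials are stored as repository secrets (the account name, secret names, and app paths are placeholders):

library(rsconnect)

# Authenticate with shinyapps.io using credentials injected by CI
setAccountInfo(name   = "citationprofiler",
               token  = Sys.getenv("SHINYAPPS_TOKEN"),
               secret = Sys.getenv("SHINYAPPS_SECRET"))

# Deploy the Shiny app bundled in inst/
deployApp(appDir = "inst/CitationProfileR", appName = "CitationProfileR")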

`get_location()` function efficiency

get_location() currently checks whether the country_name and country_code columns have any pre-existing values other than NA after the function has already fetched the locations from OSM and the countrycode package. This might not be the most efficient approach. In the future, it would be nice to check for pre-existing values before the function queries OSM and countrycode.
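
A sketch of the intended order of operations; the column names follow the ones get_location() already uses, while the helper lookup_location_osm() is hypothetical and stands in for the existing OSM/countrycode lookup:

# Only query OSM/countrycode for rows still missing a country
missing <- is.na(affiliations$country_name) | is.na(affiliations$country_code)

if (any(missing)) {
  affiliations[missing, c("country_name", "country_code")] <-
    lookup_location_osm(affiliations$affiliation.name[missing])
}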

Working directory for parse_pdf_refs

The working directory set in the parse_pdf_refs function is a local file path. We have to make the path machine-independent; the hard-coded path will break the code for anyone else who tries to run it. The variable that needs to be changed is f_bib.
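
A minimal sketch of the kind of change intended, assuming f_bib currently holds a hard-coded path for the intermediate reference file:

# Write the intermediate reference file to a session temp directory
# instead of a machine-specific path
f_bib <- tempfile(fileext = ".bib")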

Release CitationProfileR 0.0.1

This is far in the future - but worth keeping some of the points in mind from the start (e.g., licensing of included files)

First release:

Prepare for release:

  • git pull
  • urlchecker::url_check()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • git push

Submit to CRAN:

  • usethis::use_version('patch')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • git push
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • usethis::use_news_md()
  • git push

Link to bibliometrix package for more advanced analyses

The bibliometrix package offers comprehensive bibliometric analyses, e.g. regarding author centrality. These would be enhanced by the gender dimension, so in a future version this package could be linked to bibliometrix. It is not yet clear whether that would just be a vignette showing a workflow or a set of wrappers around key bibliometrix functions here.
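
For orientation, a minimal sketch of the bibliometrix side of such a workflow, assuming the references are available as a BibTeX export from Web of Science (the file name is a placeholder):

library(bibliometrix)

# Read the exported references into a bibliometrix data frame
M <- convert2df("references.bib", dbsource = "wos", format = "bibtex")

# Run the standard descriptive analysis (author productivity, citations, ...)
results <- biblioAnalysis(M)
summary(results, k = 10)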

Review resources

As discussed, I added some resources into the team-resources folder, specifically:

  • crossref_and_related.R - contains extract_crossref(), a function to extract Crossref data if you do not have a DOI (if you do have a DOI, it is much easier, as shown in the get_citation() function in there). The matching is currently based on exact text matching - probably something fuzzier would be better.
  • pdf_extraction.R implements one way to extract references from PDFs - note that this includes a Python function that you need to put into a different file and call through reticulate - below - or translate into R
  • wikidata_example.Rmd - just a simple Wikidata example, in case you want to use Wikidata to identify university locations - you could consider alternatives (e.g., OpenStreetMap), but they would seem to create problems with satellite campuses abroad, which give universities addresses in multiple countries (I would presently ignore satellite campuses, unless you find a reliable way to differentiate them from the main university)
  • screen_matches.R - a basic Shiny app that you could build on - but there are probably better templates. Also, maybe this screen is close to something we need to include eventually to check dubious crossref matches (but that could also be a table, maybe borrowed from ASySD)

Packages to consider (see the usage sketch after this list):

  • rcrossref for crossref API access
  • reticulate to use Python code within R (e.g., if there are self-contained functions within cleanBib that you do not want to translate unnecessarily; the most important function is source_python, which simply makes Python functions available to R)
  • bib2df to parse .bib reference files - that users can upload/provide directly
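
A minimal usage sketch for these packages (the query string, file names, and Python script name are placeholders):

library(rcrossref)
library(bib2df)
library(reticulate)

# Query Crossref for candidate matches to a reference string
hits <- cr_works(query = "example article title", limit = 5)

# Parse a user-supplied .bib file into a data frame
refs <- bib2df("references.bib")

# Make self-contained Python helpers (e.g., from cleanBib) available in R
source_python("pdf_extraction_helpers.py")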

Diversity statement texts - create template to guide analysis

We need to agree on the specific template text we want to populate so that we can make sure to get the right data out of the analysis.

For now, here is the most common statement, as per Zurn et al. (2020):

Recent work in several fields of science has identified a bias in citation practices such that papers from women and other minorities are under-cited relative to the number of such papers in the field [2., 3., 4., 5., 6.]. Here we sought to proactively consider choosing references that reflect the diversity of the field in thought, form of contribution, gender, and other factors. We obtained predicted gender of the first and last author of each reference by using databases that store the probability of a name being carried by a man or a woman [4]. By this measure (and excluding self-citations to the first and last authors of our current paper), our references contain 42.9% woman(first)/woman(last), 28.6% man/woman, 7.1% woman/man, and 21.4% man/man. This method is limited in that: (i) names, pronouns, and social media profiles used to construct the databases may not, in every case, be indicative of gender identity, and (ii) it cannot account for intersex, non-binary, or transgender people. We look forward to future work that could help us to better understand how to support equitable practices in science.

This only considers first and last authors and requires knowing their gender. In addition, I, at least, would be keen to report the gender of all authors, especially because the last author only matters in some scientific fields.

Crossref - add year to search

I checked how to use query.bibliographic and just wanted to highlight that you can use the GitHub code search for that - if you restrict it to R code, you can see how other people have done this:

https://github.com/search?l=R&p=2&q=query.bibliographic&type=Code

Everyone using title and year seems to just paste them together - the only attempt to use multiple elements is commented out, so it might not work. An example for combining year and title:

https://github.com/markrobinsonuzh/os_monitor/blob/a6e7e80ac3c14235265ab1daffb72f51c137da49/code/rcrossref_rest.R
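
A minimal sketch of that paste-them-together approach with rcrossref (the title and year are placeholders):

library(rcrossref)

title <- "Example article title"
year  <- 2020

# Combine title and year into a single bibliographic field query
hits <- cr_works(flq = c(query.bibliographic = paste(title, year)),
                 limit = 5)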
