science-for-nature-and-people / bibscan

R package to batch download PDFs from a Web of Science search

R 1.99% TeX 98.01%

literature-review literature-mining webofscience web-of-science bibliographic-database


BibScan

An R package to batch download PDFs from a .bib file

Installing

To install this package from GitHub you first need to install and run the devtools package.

install.packages("devtools")

To install the package, you can then execute

devtools::install_github("Science-for-Nature-and-People/BibScan")

Getting your token with Publishers

Before you can use BibScan, you need an authentication token from CrossRef to access content from Wiley and Elsevier. The crminer package documents how to get one: https://github.com/ropensci/crminer#authentication

We recommend adding your token to your R environment: edit your .Renviron file and add a line with the variable name CROSSREF_TDM= followed by your token.
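As a quick check, the sketch below confirms from within R that the token is visible to the session. It assumes the variable name CROSSREF_TDM given above; the helper name has_crossref_token is made up for illustration and is not part of BibScan.

```r
# Hypothetical helper (not part of BibScan): TRUE when CROSSREF_TDM is set
# to a non-empty value in the current R session.
has_crossref_token <- function() {
  nzchar(Sys.getenv("CROSSREF_TDM", unset = ""))
}

# A line such as the following in ~/.Renviron sets the variable at startup
# (restart R after editing the file):
# CROSSREF_TDM=your-token-here

if (!has_crossref_token()) {
  message("CROSSREF_TDM is not set; check your .Renviron file.")
}
```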

Downloading files

To download files you need to be on a server that has a license to download from journal websites. The success rate of this package depends on the institutional access of the institution whose server you are on.

First, download a .bib file from a Web of Science search. Make sure that each record in your .bib file includes a DOI.
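To see how many records carry a DOI before starting a download, a rough line-based scan can help. This is a minimal sketch in base R, not BibScan's own parser; the function name count_bib_dois is hypothetical, and it assumes each field sits on its own line, as in typical BibTeX exports.

```r
# Hypothetical helper (not part of BibScan): count entries and DOI fields
# in a BibTeX file with a simple line-based scan.
count_bib_dois <- function(bibfile) {
  lines <- readLines(bibfile, warn = FALSE)
  # An entry starts with "@type{"; skip @comment, @string and @preamble.
  is_entry <- grepl("^[[:space:]]*@[A-Za-z]+[[:space:]]*\\{", lines) &
    !grepl("^[[:space:]]*@(comment|string|preamble)", lines, ignore.case = TRUE)
  # A DOI field looks like "doi = {...}" in any letter case.
  is_doi <- grepl("^[[:space:]]*doi[[:space:]]*=", lines, ignore.case = TRUE)
  c(entries = sum(is_entry), with_doi = sum(is_doi))
}
```

If with_doi is lower than entries, some records probably cannot be resolved to a PDF.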

You can then run

article_pdf_download(indir, outdir)

This will download PDFs for the references in the .bib files in the directory indir and save those PDFs in the directory outdir.

Downloading files from Colandr

This tool has been designed to download files from literature reviews in the Colandr tool. The tool reads in a directory of .bib files that were imported into Colandr. The match argument can be used to specify the location of a .csv exported from Colandr. When this is specified, only the titles included in the .csv--which are a subset of the .bib file--will be downloaded.

article_pdf_download(indir = '~/Documents/bibdir', outdir = '~/Documents/outdir', match = '~/Documents/sorted-papers.csv')
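The idea of restricting the download to titles listed in the Colandr export can be sketched as a title match. This is an illustration in base R, not BibScan's implementation: the helper names and the bib$title column are assumptions, while citation_title is the title column of the Colandr export.

```r
# Hypothetical helpers (not BibScan's implementation).

# Lower-case a title and reduce punctuation and spacing to single spaces,
# so that small formatting differences do not prevent a match.
# Non-ASCII letters are dropped by this simple rule.
normalize_title <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", " ", x)
  trimws(x)
}

# Keep only the rows of `bib` whose title appears in the Colandr export.
# `bib$title` is an assumed column name.
filter_by_colandr <- function(bib, colandr) {
  keep <- normalize_title(bib$title) %in% normalize_title(colandr$citation_title)
  bib[keep, , drop = FALSE]
}
```

Matching on normalized titles is fragile when titles were edited in Colandr; matching on DOI would be more robust where both files carry one.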

License

This package is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License, version 3, as published by the Free Software Foundation.

This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See the GNU General Public License for more details.

A copy of the GNU General Public License, version 3, is available at https://www.r-project.org/Licenses/GPL-3

Authors

Stephen Wood - Creator, Author - swood-ecology
Julien Brun - Author - brunj7
Timothy Nguyen - Author - timothydnguyen


bibscan's Issues

Doesn't filter out selected papers from Colandr

The package downloads all papers in the exported Colandr sheet, but doesn't restrict the download to the ones that were selected through the Colandr process.

Other package dependencies

It looks like there are three more package dependencies when loading 'BibScan' - rvest, jsonlite, and xml2. I was getting error messages of 'function not found' for a few functions that I think are from those packages - "html_nodes" and "fromJSON" most notably.
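A short check like the one below reports which of the three packages named in this issue are not installed. It is a sketch using base R only; html_nodes comes from rvest and fromJSON from jsonlite.

```r
# Packages named in this issue report.
needed <- c("rvest", "jsonlite", "xml2")

# requireNamespace() returns FALSE (instead of erroring) when a package
# is not installed.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)
}
```

Note that requireNamespace() loads a package without attaching it, so unqualified calls such as html_nodes() still need library(rvest) or the qualified form rvest::html_nodes().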

Error in "select" function

I also had issues with the dependency packages not loading properly.

After manually loading the dependency packages, the following error appeared:

Error in select(., citation_title, citation_authors, citation_journal_name) :
could not find function "select"

Output prior to error:
Converting your isi collection into a bibliographic dataframe

Articles extracted 47
Done!

Genereting affiliation field tag AU_UN from C1: Done!

Parsed with column specification:
cols(
  study_id = col_integer(),
  deduplication_status = col_character(),
  citation_screening_status = col_character(),
  fulltext_screening_status = col_character(),
  data_extraction_screening_status = col_character(),
  data_source_type = col_character(),
  data_source_name = col_character(),
  data_source_url = col_character(),
  citation_title = col_character(),
  citation_abstract = col_character(),
  citation_authors = col_character(),
  citation_journal_name = col_character(),
  citation_journal_volume = col_integer(),
  citation_pub_year = col_integer(),
  citation_keywords = col_character(),
  fulltext_filename = col_character(),
  fulltext_exclude_reasons = col_character()
)

Further investigate the discrepancies between downloads

Steve and I did not get the same number of downloads on Leslie's data.

Check what the discrepancies are and whether they are due to university subscriptions or something else.

Replaces .bib file in output directory

If the output directory is the same as the input directory, and the .bib file is in that directory, running the package removes the .bib file.
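One possible guard is to compare the two directories after path normalization and stop before any file is touched. The sketch below uses base R only; same_directory is a hypothetical name, not an existing BibScan function.

```r
# Hypothetical guard (not part of BibScan): TRUE when both arguments
# resolve to the same directory.
same_directory <- function(indir, outdir) {
  # normalizePath() expands "~" and resolves "." and symlinks;
  # mustWork = FALSE avoids an error when outdir does not exist yet.
  normalizePath(indir, mustWork = FALSE) ==
    normalizePath(outdir, mustWork = FALSE)
}

# Example use at the top of a download routine:
# if (same_directory(indir, outdir)) {
#   stop("outdir must differ from indir, otherwise the .bib file may be removed")
# }
```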

issue with Dillon bib file

Dillon was trying to use BibScan to access a bunch of papers and couldn't get a few that seemed like they should work. Attached is a .bib of the files that didn't download. I tried them on my machine and BibScan says they don't have DOIs, but when you look at the .bib file, some of them clearly do. Does this work for you?

add travis

Add travis to the package

dirname error

From @kanedan29: I get an error message that says `Error in dirname(outfilepath) : object 'outfilepath' not found` even though my output folder definitely exists.

Low Retrieval Rate

A BibScan user asked for tips on improving the retrieval rate, so my task today was to figure out why it was so low. First, I ran the code given to me and got the same number of successful PDF retrievals. Based on the error messages, it appears that the links don't work (I don't know whether this is obvious, given my limited knowledge of the package). To look into it further, I checked the first ten documents.

The documents from Elsevier and Wiley were not working. While trying to figure out why, I landed on this page: CrossRef/rest-api-doc#96. The crminer documentation also says: "At least Elsevier and I think Wiley also check your IP address in addition to requiring the authentication token". So maybe that's why these sites aren't working. SpringerLink returns "Page Not Found". The Cambridge website shows the warning "Unfortunately you do not have access to this content, please use the Get access link below for information on how to access this content."

Those are the sites that appeared in the first ten rows. Beyond these error messages, I'm not sure what else to look at, since I'm pretty new to how this code (especially crminer) works.
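To see whether failures cluster by publisher, the download outcomes could be tallied by DOI prefix. The following is a minimal sketch, not part of BibScan: doi_publisher is a hypothetical helper, the prefix table covers only the four publishers mentioned above, and the DOIs and outcomes are made up for illustration.

```r
# Hypothetical helper: map a DOI to a publisher via its registrant prefix.
doi_publisher <- function(doi) {
  prefix <- sub("/.*$", "", doi)
  known <- c(
    "10.1016" = "Elsevier",
    "10.1002" = "Wiley",
    "10.1111" = "Wiley",
    "10.1007" = "Springer",
    "10.1017" = "Cambridge University Press"
  )
  out <- unname(known[prefix])
  out[is.na(out)] <- "other"
  out
}

# Made-up DOIs and outcomes, for illustration only.
dois <- c("10.1016/j.example.2018.01.001", "10.1002/example.123", "10.9999/example")
downloaded <- c(FALSE, FALSE, TRUE)
print(table(publisher = doi_publisher(dois), downloaded = downloaded))
```

A table like this makes it easier to tell a subscription or IP problem with one publisher from a general failure.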

harmonize styling

There are currently many different coding styles in the package. We should make the style more homogeneous to make contributing easier.

The installations are really long.

Not sure what all of the other dependencies are. But any idea why it takes so long to install?

modularize the article_pdf_download function

This function does a lot of things that should be delegated to subfunctions.

Improve PDF filenames from the publisher's default

See if we can rename the file with more explanatory names
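One possible naming scheme combines first author, year, and the first few words of the title. The sketch below is only an illustration in base R; make_pdf_name, its arguments, and the example values are hypothetical.

```r
# Hypothetical helper: build a descriptive, filesystem-safe PDF name.
make_pdf_name <- function(first_author, year, title, max_words = 5) {
  # Replace every run of non-alphanumeric characters with one underscore.
  clean <- function(x) gsub("[^A-Za-z0-9]+", "_", trimws(x))
  words <- strsplit(trimws(title), "[[:space:]]+")[[1]]
  short_title <- paste(words[seq_len(min(length(words), max_words))], collapse = " ")
  name <- paste(clean(first_author), year, clean(short_title), sep = "_")
  # Collapse repeated underscores and trim them from both ends.
  name <- gsub("^_+|_+$", "", gsub("_+", "_", name))
  paste0(name, ".pdf")
}

# Made-up reference, for illustration only.
make_pdf_name("Wood", 2018, "Soil carbon: a review of methods and results")
```

Truncating the title keeps names short, at the cost of possible collisions between papers by the same author in the same year; appending part of the DOI would avoid that.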

`//` in the path of downloaded files

This seems to be due to how I set up the crminer cache. Find a better way to do this, or add a post-processing step as plan B. It does not seem to disturb the download or file manipulation (at least on OSX/Unix).
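Both options can be sketched in base R: either build the path so that no doubled separator appears, or collapse repeated slashes afterwards. The helper names are hypothetical, and the second function is meant for local file paths only, since it would also alter the "//" in a URL.

```r
# Hypothetical helpers (not part of BibScan).

# Option 1: strip trailing slashes from the directory before joining.
join_path <- function(dir, file) {
  file.path(sub("/+$", "", dir), file)
}

# Option 2 (post-processing): collapse runs of "/" into one.
# Do not apply this to URLs.
collapse_slashes <- function(path) {
  gsub("/{2,}", "/", path)
}
```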

parsing failure

I got the below error when trying to run the main function. Attached are the files I used.

article_pdf_download(infilepath='~/Documents/Temporary/Lesley', colandr=screened_abstracts)

Converting your isi collection into a bibliographic dataframe

Articles extracted   100 
Articles extracted   200 
Articles extracted   300 
Articles extracted   326 
Done!


Genereting affiliation field tag AU_UN from C1:  Done!


Converting your isi collection into a bibliographic dataframe

Articles extracted   43 
Done!


Genereting affiliation field tag AU_UN from C1:  Done!

Warning: 1 parsing failure.
row # A tibble: 1 x 5 col     row col   expected   actual    file         expected   <int> <chr> <chr>      <chr>     <chr>        actual 1     5 NA    73 columns 9 columns literal data file # A tibble: 1 x 5

Error in filter_impl(.data, quo) : 
  Evaluation error: object 'citation_screening_status' not found.
In addition: Warning messages:
1: In if (grepl("\n", x)) { :
  the condition has length > 1 and only the first element will be used
2: In if (grepl("\n", path)) return(path) :
  the condition has length > 1 and only the first element will be used
3: In if (grepl("\n", file)) { :
  the condition has length > 1 and only the first element will be used
4: Missing column names filled in: 'X73' [73] 
5: In if (grepl("\n", file)) { :
  the condition has length > 1 and only the first element will be used
6: In if (grepl("\n", file)) { :
  the condition has length > 1 and only the first element will be used
7: In rbind(names(probs), probs_f) :
  number of columns of result is not a multiple of vector length (arg 2)

files.zip

Add tests on the parameters passed by users, such as type, ....
Add basic unit testing to the package
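As a starting point, argument checks could live in a small validation function that is easy to unit test. The sketch below uses base R only and assumes the indir, outdir, and match arguments described in the README; check_download_args is a hypothetical name, and the type parameter is left out because its allowed values are not documented here.

```r
# Hypothetical validation helper (not part of BibScan).
check_download_args <- function(indir, outdir, match = NULL) {
  stopifnot(
    is.character(indir), length(indir) == 1, dir.exists(indir),
    is.character(outdir), length(outdir) == 1
  )
  if (!is.null(match)) {
    stopifnot(is.character(match), length(match) == 1, file.exists(match))
  }
  invisible(TRUE)
}

# Minimal checks in base R; a test framework could wrap the same calls.
stopifnot(check_download_args(tempdir(), file.path(tempdir(), "out")))
stopifnot(inherits(try(check_download_args(123, "out"), silent = TRUE), "try-error"))
```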