science-for-nature-and-people / bibscan

R package to batch download PDFs from a Web of Science search

bibscan's Introduction

BibScan

An R package to batch download PDFs from a .bib file

Installing

To install this package from GitHub you first need to install and run the devtools package.

install.packages("devtools")

To install the package, you can then execute

devtools::install_github("Science-for-Nature-and-People/BibScan")

Getting your token with Publishers

Before you can use BibScan, you need to get an authentication token from CrossRef to access content from Wiley and Elsevier. The crminer package provides instructions on how to do this: https://github.com/ropensci/crminer#authentication

We recommend adding your token to your R environment by editing your .Renviron file and adding the variable name CROSSREF_TDM= followed by your token.
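
A minimal sketch, assuming the token is stored under the CROSSREF_TDM variable as described above, for checking from an R session that the token is actually visible. `token_is_set` is a hypothetical helper name used for illustration; it is not part of BibScan or crminer.

```r
# Hypothetical helper (not part of BibScan): TRUE if the environment
# variable is set to a non-empty value in the current R session.
token_is_set <- function(var = "CROSSREF_TDM") {
  nzchar(Sys.getenv(var, unset = ""))
}

if (token_is_set()) {
  message("CROSSREF_TDM is set.")
} else {
  message("CROSSREF_TDM is empty; add it to your .Renviron file and restart R.")
}
```

R reads .Renviron at startup, so a change to that file only takes effect in a new session.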

Downloading files

To download files you need to be on a server that has a license to download from journal websites. The success rate of this package depends on the institutional access of the institution whose server you are on.

First, download a .bib file from a Web of Science search. Make sure that your .bib file includes a DOI.
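
One way to check a .bib file before downloading is to count how many entries carry a DOI field. The sketch below uses base R only and assumes one BibTeX field per line, which may not hold for every export. `count_bib_dois` is a hypothetical helper, not a BibScan function, and the DOI in the usage example is made up.

```r
# Hypothetical helper (not part of BibScan): count the entries and the
# DOI fields in a .bib file, assuming one field per line.
count_bib_dois <- function(path) {
  lines <- readLines(path, warn = FALSE)
  # Entries start with "@type{"; @comment, @string and @preamble are not records
  is_entry <- grepl("^\\s*@[A-Za-z]+\\s*\\{", lines) &
    !grepl("^\\s*@(comment|string|preamble)", lines, ignore.case = TRUE)
  # DOI fields look like "DOI = {10....}"; match case-insensitively
  is_doi <- grepl("^\\s*doi\\s*=", lines, ignore.case = TRUE)
  c(entries = sum(is_entry), with_doi = sum(is_doi))
}

# Usage with a small made-up file: two entries, one of them with a DOI
bib <- tempfile(fileext = ".bib")
writeLines(c(
  "@article{a1,",
  "  Title = {One},",
  "  DOI = {10.1000/example.1},",
  "}",
  "@article{a2,",
  "  Title = {Two},",
  "}"
), bib)
print(count_bib_dois(bib))
```

If `with_doi` is lower than `entries`, some records will probably not be retrievable.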

You can then run

article_pdf_download(indir, outdir)

This will download PDFs for the references in the .bib files in the directory indir and save those PDFs in the directory outdir.

Downloading files from Colandr

This tool has been designed to download files from literature reviews in the Colandr tool. The tool reads in a directory of .bib files that were imported into Colandr. The match argument can be used to specify the location of a .csv exported from Colandr. When this is specified, only the titles included in the .csv--which are a subset of the .bib file--will be downloaded.

article_pdf_download(indir='~/Documents/bibdir', outdir='~/Documents/outdir', match='~/Documents/sorted-papers.csv')

License

This package is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License, version 3, as published by the Free Software Foundation.

This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See the GNU General Public License for more details.

A copy of the GNU General Public License, version 3, is available at https://www.r-project.org/Licenses/GPL-3

Authors

bibscan's People

Contributors

brunj7, grantnolasco, swood-ecology

bibscan's Issues

Other package dependencies

It looks like there are three more package dependencies when loading 'BibScan' - rvest, jsonlite, and xml2. I was getting error messages of 'function not found' for a few functions that I think are from those packages - "html_nodes" and "fromJSON" most notably.
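
As a possible workaround until the dependencies are declared properly, the three packages named in this issue can be checked and attached by hand before loading BibScan. This is a sketch, not a fix for the package itself; the package list comes only from this report.

```r
# Packages named in this issue; attaching them first is a workaround only.
deps <- c("rvest", "jsonlite", "xml2")

# requireNamespace() reports whether a package is installed without attaching it
installed <- vapply(deps, requireNamespace, logical(1), quietly = TRUE)

if (any(!installed)) {
  message("Install first: ", paste(deps[!installed], collapse = ", "))
} else {
  # Attach the dependencies so html_nodes(), fromJSON(), etc. can be found
  for (pkg in deps) library(pkg, character.only = TRUE)
  # library(BibScan)  # then load BibScan itself
}
```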

Error in "select" function

  • I also had issues with the dependency packages not loading properly.

After manually loading the dependency packages, the following error appeared:

Error in select(., citation_title, citation_authors, citation_journal_name) :
could not find function "select"


Output prior to error:
Converting your isi collection into a bibliographic dataframe

Articles extracted 47
Done!

Genereting affiliation field tag AU_UN from C1: Done!

Parsed with column specification:
cols(
study_id = col_integer(),
deduplication_status = col_character(),
citation_screening_status = col_character(),
fulltext_screening_status = col_character(),
data_extraction_screening_status = col_character(),
data_source_type = col_character(),
data_source_name = col_character(),
data_source_url = col_character(),
citation_title = col_character(),
citation_abstract = col_character(),
citation_authors = col_character(),
citation_journal_name = col_character(),
citation_journal_volume = col_integer(),
citation_pub_year = col_integer(),
citation_keywords = col_character(),
fulltext_filename = col_character(),
fulltext_exclude_reasons = col_character()
)

issue with Dillon bib file

Dillon was trying to use BibScan to access a bunch of papers and couldn't get a few that seemed like they should work. Attached is a .bib of the files that didn't download. I tried them on my machine and BibScan says they don't have DOIs, but when you look at the .bib file some of them clearly do. Does this work for you?

dirname error

From @kanedan29: I get an error message that says `Error in dirname(outfilepath) : object 'outfilepath' not found` even though my output folder definitely exists.

Low Retrieval Rate

One user of the BibScan library asked for tips on improving the retrieval rate, so my task for today was to figure out why the rate was so low. First, I ran the code given to me and got the same number of successful PDF retrievals. Based on the error messages, it appears that the links don't work (I don't know whether this is obvious, given my limited knowledge of this package).

To look into it further, I examined the first ten documents. One problem I noticed was that the documents from Elsevier and Wiley were not working. While trying to figure out why, I landed on this page: CrossRef/rest-api-doc#96. The crminer package also notes that "At least Elsevier and I think Wiley also check your IP address in addition to requiring the authentication token", so that may be why these sites aren't working. For SpringerLink, the response is "Page Not Found". For the Cambridge website, I get the warning pop-up "Unfortunately you do not have access to this content, please use the Get access link below for information on how to access this content."

These are the websites and links from the first ten rows. Beyond these error messages, I'm not sure what else to look at, since I'm fairly new to how this code (especially crminer) works.
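
To see which publishers account for most of the failures, the failed DOIs could be grouped by their registrant prefix. The sketch below is an illustration only: the prefix table is an assumption of this example (a short, incomplete list of well-known prefixes), `publisher_from_doi` is a hypothetical helper, and the DOIs in the usage lines are made up.

```r
# Illustrative, incomplete prefix table (an assumption of this sketch,
# not something BibScan or crminer provides)
publisher_prefixes <- c(
  "10.1016" = "Elsevier",
  "10.1002" = "Wiley",
  "10.1111" = "Wiley",
  "10.1007" = "Springer",
  "10.1017" = "Cambridge University Press"
)

# Hypothetical helper: map each DOI to a publisher via the part before "/"
publisher_from_doi <- function(dois) {
  prefix <- sub("/.*$", "", trimws(dois))
  out <- unname(publisher_prefixes[prefix])
  out[is.na(out)] <- "other"
  out
}

# Usage with made-up DOIs: tabulate failures per publisher
failed <- c("10.1016/made.up.1", "10.1002/made.up.2", "10.9999/made.up.3")
print(table(publisher_from_doi(failed)))
```

A table dominated by one or two publishers would point to access restrictions on their side, such as the IP check quoted above, rather than to a problem in the .bib file.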

harmonize styling

There are currently many different coding styles in the package. We should try to make it more homogeneous to make contributing easier.

`//` in the path of downloaded files

This seems to be due to how I set up the crminer cache. Find a better way to do this, or add post-processing as plan B. It does not seem to disturb the download or file manipulation (at least on OSX/Unix).
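
The post-processing fallback mentioned here could be as small as collapsing repeated slashes in the returned file paths. A minimal sketch in base R follows; `collapse_slashes` is a hypothetical helper name, and it is meant for local file paths only.

```r
# Hypothetical helper: collapse runs of "/" in local file paths.
# Do not use it on URLs, where the "//" after the scheme is significant.
collapse_slashes <- function(paths) {
  gsub("/{2,}", "/", paths)
}

print(collapse_slashes("~/Documents/outdir//paper.pdf"))
```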

parsing failure

I got the below error when trying to run the main function. Attached are the files I used.

article_pdf_download(infilepath='~/Documents/Temporary/Lesley', colandr=screened_abstracts)

Converting your isi collection into a bibliographic dataframe

Articles extracted   100 
Articles extracted   200 
Articles extracted   300 
Articles extracted   326 
Done!


Genereting affiliation field tag AU_UN from C1:  Done!


Converting your isi collection into a bibliographic dataframe

Articles extracted   43 
Done!


Genereting affiliation field tag AU_UN from C1:  Done!

Warning: 1 parsing failure.
row # A tibble: 1 x 5 col     row col   expected   actual    file         expected   <int> <chr> <chr>      <chr>     <chr>        actual 1     5 NA    73 columns 9 columns literal data file # A tibble: 1 x 5

Error in filter_impl(.data, quo) : 
  Evaluation error: object 'citation_screening_status' not found.
In addition: Warning messages:
1: In if (grepl("\n", x)) { :
  the condition has length > 1 and only the first element will be used
2: In if (grepl("\n", path)) return(path) :
  the condition has length > 1 and only the first element will be used
3: In if (grepl("\n", file)) { :
  the condition has length > 1 and only the first element will be used
4: Missing column names filled in: 'X73' [73] 
5: In if (grepl("\n", file)) { :
  the condition has length > 1 and only the first element will be used
6: In if (grepl("\n", file)) { :
  the condition has length > 1 and only the first element will be used
7: In rbind(names(probs), probs_f) :
  number of columns of result is not a multiple of vector length (arg 2)

files.zip

Dependencies not loading

From @kanedan29

I’m running into two problems with bibscan. 1.) when I load the library it’s not loading the dependencies, so I get error messages about specific functions.

PLOS One articles are returned as HTML

With the crminer version of the code, it seems that articles from PLOS One are not retrieved as PDFs but as HTML documents. We need to investigate why.
