
R package to crawl and scrape OJS (Open Journal Systems)


OJS Scraper for R


The aim of this package is to aid you in crawling OJS archives, issues, articles, galleys, and search results, and retrieving/scraping metadata from articles. ojsr functions rely on OJS routing conventions to compose the URL for different scraping scenarios.
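
As a rough illustration of what "routing conventions" means here, the sketch below composes two common OJS page URLs from a journal's base URL. `compose_ojs_url()` is a hypothetical helper written for this example and is not part of ojsr; it assumes the usual OJS layout in which the issue archive lives under `/issue/archive` and the OAI endpoint under `/oai`.

```r
# Hypothetical helper (not part of ojsr): builds common OJS page URLs
# from a journal base URL, assuming the <base>/<page>/<operation> layout.
compose_ojs_url <- function(base_url, page = c("archive", "oai")) {
  page <- match.arg(page)
  base_url <- sub("/+$", "", base_url)  # drop any trailing slashes
  suffix <- switch(page,
    archive = "/issue/archive",  # issue archive page
    oai     = "/oai"             # OAI-PMH endpoint
  )
  paste0(base_url, suffix)
}

base <- "https://revistas.unc.edu.ar/index.php/revaluar"
archive_url <- compose_ojs_url(base, "archive")
oai_url <- compose_ojs_url(base, "oai")
```

Journals that customize their routes or hide `index.php` may not follow this layout, so treat the composed URLs as candidates to check rather than guaranteed endpoints.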

Installation

From CRAN:

install.packages('ojsr') 

From Github:

install.packages('devtools') 
devtools::install_github("gastonbecerra/ojsr")

ojsr functions

  • get_issues_from_archive(): scrapes issue URLs from the OJS issue archive
  • get_articles_from_issue(): scrapes article URLs from the table of contents (ToC) of OJS issues
  • get_articles_from_search(): scrapes OJS search results for given search criteria to retrieve article URLs
  • get_galleys_from_article(): scrapes galley URLs from OJS articles
  • get_html_meta_from_article(): scrapes metadata from the HTML of OJS articles
  • get_oai_meta_from_article(): retrieves OAI records for OJS articles
  • parse_base_url(): parses URLs against OJS routing conventions to retrieve the base URL
  • parse_oai_url(): parses URLs against OJS routing conventions to retrieve the OAI protocol URL
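
To give a feel for what parsing a URL against OJS routing conventions involves, here is a minimal base-R approximation. `sketch_base_url()` is hypothetical and is not how ojsr implements `parse_base_url()`; it assumes URLs of the form `<host>/index.php/<journal>/<page>/<operation>/<args>`, and the article id in the example is made up.

```r
# Hypothetical approximation (NOT ojsr's implementation): keep everything up to
# the journal path and drop the page, operation and arguments that follow.
sketch_base_url <- function(url) {
  # Assumed layout: <host>/index.php/<journal>/<page>/<operation>/<args...>
  pattern <- "^(.*?/index\\.php/[^/?#]+).*$"
  matched <- grepl(pattern, url, perl = TRUE)
  # URLs that do not fit the assumed layout yield NA instead of a guess
  ifelse(matched, sub(pattern, "\\1", url, perl = TRUE), NA_character_)
}

article_urls <- c(
  "https://revistas.unc.edu.ar/index.php/revaluar/article/view/123",  # made-up article id
  "https://example.org/not-an-ojs-path"
)
base_urls <- sketch_base_url(article_urls)
```

In real use, prefer `parse_base_url()` and `parse_oai_url()`, which are the package's own functions for this job.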

Example

Let's say we want to collect metadata from some journals to compare their top keywords. We have the journals' names and URLs, and can use ojsr to scrape their issues, articles, and metadata.


library(dplyr) 
library(ojsr)

# journal names and URLs, kept as character columns
journals <- data.frame(
  name = c("Revista Evaluar", "PSocial"),
  url = c("https://revistas.unc.edu.ar/index.php/revaluar", "https://publicaciones.sociales.uba.ar/index.php/psicologiasocial"),
  stringsAsFactors = FALSE
)

# we are using the journal URL as input to retrieve the issues
issues <- ojsr::get_issues_from_archive(input_url = journals$url) 

# we are using the issues URL we just scraped as an input to retrieve the articles
articles <- ojsr::get_articles_from_issue(input_url = issues$output_url)

# we are using the articles URL we just scraped as an input to retrieve the metadata
metadata <- ojsr::get_html_meta_from_article(input_url = articles$output_url)

# let's parse the base URLs from journals and metadata, so we can bind by journal
journals$base_url <- ojsr::parse_base_url(journals$url)
metadata$base_url <- ojsr::parse_base_url(metadata$input_url)

metadata %>% filter(meta_data_name == "citation_keywords") %>% # filtering only keywords
  left_join(journals, by = "base_url") %>% # include journal names, joining on the parsed base URL
  group_by(base_url, keyword = meta_data_content) %>% tally(sort = TRUE)
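
Before tallying, it can help to normalize the keyword strings. The sketch below uses base R only, so it runs without network access or extra packages. It assumes that some journals deliver several keywords in a single semicolon-separated `citation_keywords` field and that case differences should be ignored; neither assumption comes from ojsr itself, and the mock rows are invented.

```r
# Mock rows shaped like the metadata table above (all values are invented).
metadata_mock <- data.frame(
  base_url = c("https://example.org/index.php/a",
               "https://example.org/index.php/a",
               "https://example.org/index.php/b"),
  meta_data_name = "citation_keywords",
  meta_data_content = c("Validity; Reliability", "validity", "Social psychology"),
  stringsAsFactors = FALSE
)

# Split each field on semicolons, then trim whitespace and lowercase every keyword.
parts <- strsplit(metadata_mock$meta_data_content, ";", fixed = TRUE)
keywords <- data.frame(
  base_url = rep(metadata_mock$base_url, lengths(parts)),
  keyword = tolower(trimws(unlist(parts))),
  stringsAsFactors = FALSE
)

# Count keywords per journal and sort by descending frequency.
keyword_counts <- aggregate(n ~ base_url + keyword,
                            data = transform(keywords, n = 1L), FUN = sum)
keyword_counts <- keyword_counts[order(-keyword_counts$n), ]
```

Whether splitting is needed depends on how each journal exposes its keywords, so inspect a few `meta_data_content` values first.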


ojsr's Issues

Harvest geospatial metadata when ojsGeo plugin is used

I'm working on an OJS Plugin that collects and exposes geospatial metadata for articles, see https://github.com/TIBHannover/ojsGeo/

I would like to document the idea (e.g., in an issue) of extending ojsr with a feature to harvest that metadata as well and ideally provide it as ready-to-use sf objects, so you can show OJS articles on a map with just a few lines of R code. The dependency on sf should be flexible, i.e., its availability is checked only when harvest_geo = TRUE.
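
One possible shape for such a flexible dependency is sketched below. `get_meta_sketch()`, its arguments, and the `lon`/`lat` columns are all hypothetical and exist neither in ojsr nor in the plugin; only `requireNamespace()` and `sf::st_as_sf()` are real functions.

```r
# Hypothetical function sketching the optional-dependency idea; neither the
# function nor its arguments exist in ojsr.
get_meta_sketch <- function(records, harvest_geo = FALSE) {
  if (!harvest_geo) {
    return(records)  # sf is never touched on this path
  }
  # Check for sf only when geospatial output was requested
  if (!requireNamespace("sf", quietly = TRUE)) {
    stop("harvest_geo = TRUE requires the 'sf' package; install it with install.packages('sf')")
  }
  # Assumes `records` has numeric `lon` and `lat` columns (invented for this sketch)
  sf::st_as_sf(records, coords = c("lon", "lat"), crs = 4326)
}

# Invented example records with point coordinates in WGS 84
records <- data.frame(
  title = c("Article A", "Article B"),
  lon = c(9.73, -64.18),
  lat = c(52.37, -31.42)
)
plain <- get_meta_sketch(records)  # works without sf installed
```

With this layout sf could be listed under Suggests rather than Imports, so users who never request geospatial output do not need to install it.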

Full disclosure: no production instance of OJS is using the plugin yet, but I'll add updates here once interesting data becomes available.
