ojsr's Introduction

OJS Scraper for R

The aim of this package is to aid you in crawling OJS archives, issues, articles, galleys, and search results, and retrieving/scraping metadata from articles. ojsr functions rely on OJS routing conventions to compose the URL for different scraping scenarios.
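To make the idea of "routing conventions" concrete, here is a minimal sketch in base R of how a base URL can be extracted from any OJS page URL and then extended into other endpoints. It is not ojsr's implementation: the helper names `ojs_base_url()`, `ojs_archive_url()` and `ojs_oai_url()` are hypothetical, and the sketch assumes the common `.../index.php/<journal>/<page>/<operation>` layout, which some installations change.

```r
# Hypothetical helpers for illustration only; ojsr's own parsers may differ.
# Assumption: URLs follow the ".../index.php/<journal>/..." layout.
ojs_base_url <- function(url) {
  pattern <- "^(https?://[^?#]*?/index\\.php/[^/?#]+).*$"
  # Keep everything up to and including the journal slug; NA when no match.
  ifelse(grepl(pattern, url, perl = TRUE),
         sub(pattern, "\\1", url, perl = TRUE),
         NA_character_)
}

# Compose other endpoints by appending a page/operation to the base URL.
ojs_archive_url <- function(url) paste0(ojs_base_url(url), "/issue/archive")
ojs_oai_url     <- function(url) paste0(ojs_base_url(url), "/oai")

article <- "https://revistas.unc.edu.ar/index.php/revaluar/article/view/123"
ojs_base_url(article)
ojs_archive_url(article)
ojs_oai_url(article)
```

Because every scraping scenario starts from the same base URL, any page of a journal (an article, an issue, the home page) can serve as input.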

Installation

From CRAN:

install.packages('ojsr')

From GitHub:

install.packages('devtools') 
devtools::install_github("gastonbecerra/ojsr")

ojsr functions

get_issues_from_archive(): scrapes issue URLs from the OJS issue archive
get_articles_from_issue(): scrapes article URLs from the table of contents (ToC) of OJS issues
get_articles_from_search(): scrapes OJS search results for given search criteria to retrieve article URLs
get_galleys_from_article(): scrapes galley URLs from OJS articles
get_html_meta_from_article(): scrapes metadata from the HTML of OJS article pages
get_oai_meta_from_article(): retrieves OAI records for OJS articles
parse_base_url(): parses URLs against OJS routing conventions to retrieve the base URL
parse_oai_url(): parses URLs against OJS routing conventions to retrieve the OAI protocol URL

Example

Let's say we want to collect metadata from some journals to compare their top keywords. We have the journals' names and URLs, and can use ojsr to scrape their issues, articles and metadata.


library(dplyr) 
library(ojsr)

journals <- data.frame(
  name = c("Revista Evaluar", "PSocial"),
  url = c("https://revistas.unc.edu.ar/index.php/revaluar", "https://publicaciones.sociales.uba.ar/index.php/psicologiasocial"),
  stringsAsFactors = FALSE
)

# we are using the journal URL as input to retrieve the issues
issues <- ojsr::get_issues_from_archive(input_url = journals$url) 

# we are using the issues URL we just scraped as an input to retrieve the articles
articles <- ojsr::get_articles_from_issue(input_url = issues$output_url)

# we are using the articles URL we just scraped as an input to retrieve the metadata
metadata <- ojsr::get_html_meta_from_article(input_url = articles$output_url)

# let's parse the base URLs from journals and metadata, so we can bind by journal
journals$base_url <- ojsr::parse_base_url(journals$url)
metadata$base_url <- ojsr::parse_base_url(metadata$input_url)

metadata %>%
  filter(meta_data_name == "citation_keywords") %>% # keep only the keyword rows
  left_join(journals, by = "base_url") %>% # add the journal names
  group_by(base_url, keyword = meta_data_content) %>%
  tally(sort = TRUE) # count keywords per journal, most frequent first
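The scraping steps need network access, so the final counting step can be hard to try in isolation. The following sketch runs it offline on a small mock table. The journal names, URLs and keywords are invented for illustration; only the column names (`input_url`, `meta_data_name`, `meta_data_content`, `base_url`) are taken from the example above, and real output from ojsr may carry additional columns.

```r
library(dplyr)

# Mock data (invented): two journals and a handful of scraped meta tags.
journals <- data.frame(
  name = c("Journal A", "Journal B"),
  base_url = c("https://a.example/index.php/ja", "https://b.example/index.php/jb"),
  stringsAsFactors = FALSE
)

metadata <- data.frame(
  input_url = c("a/1", "a/1", "a/2", "b/1", "b/1"),
  meta_data_name = c("citation_keywords", "citation_title", "citation_keywords",
                     "citation_keywords", "citation_keywords"),
  meta_data_content = c("validity", "Some title", "validity", "identity", "discourse"),
  base_url = journals$base_url[c(1, 1, 1, 2, 2)],
  stringsAsFactors = FALSE
)

top_keywords <- metadata %>%
  filter(meta_data_name == "citation_keywords") %>% # drop titles and other tags
  left_join(journals, by = "base_url") %>%          # attach the journal name
  count(name, keyword = meta_data_content, sort = TRUE)

top_keywords
```

Here `count()` is used as shorthand for `group_by()` followed by `tally()`, and the grouping is done on the journal name rather than the base URL so the result is easier to read.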

ojsr's Issues

The scrapers' timeout has to be a parameter

Expose it as a parameter with a low default value
Make sure the RCurl timeout applies to the navigation (the request) and not to the cache
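One possible shape for this, sketched in base R rather than RCurl: the timeout becomes a function argument with a low default and is applied only for the duration of the request. The names `fetch_page()`, `fetch_with_base_r()`, `page_url`, `timeout` and `fetch` are hypothetical and are not part of ojsr; the 5-second default is an arbitrary choice.

```r
# Hypothetical sketch; ojsr's scrapers may be organized differently.
# Default fetcher: sets R's global `timeout` option (in seconds) only
# while the request runs, then restores the previous value.
fetch_with_base_r <- function(page_url, timeout) {
  old <- options(timeout = timeout)
  on.exit(options(old), add = TRUE)
  con <- url(page_url)
  on.exit(close(con), add = TRUE)
  paste(readLines(con, warn = FALSE), collapse = "\n")
}

# Scraper entry point: the timeout is a parameter with a low default.
# `fetch` is injectable so the function can be exercised without a network.
fetch_page <- function(page_url, timeout = 5, fetch = fetch_with_base_r) {
  stopifnot(is.numeric(timeout), length(timeout) == 1, timeout > 0)
  fetch(page_url, timeout)
}

# Offline usage with a stand-in fetcher that just echoes its arguments.
fake_fetch <- function(page_url, timeout) paste(page_url, timeout)
fetch_page("https://example.org/index.php/journal", fetch = fake_fetch)
```

Whether the timeout should bound the connection phase, the whole transfer, or both is left open here, as in the issue.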

Harvest geospatial metadata when ojsGeo plugin is used

I'm working on an OJS Plugin that collects and exposes geospatial metadata for articles, see https://github.com/TIBHannover/ojsGeo/

I would like to document the idea (e.g., in an issue) of extending ojsr with a feature to harvest that metadata as well and, ideally, provide it as ready-to-use sf objects, so you can show OJS articles on a map with just a few lines of R code. The dependency on sf should be flexible, i.e., its availability is checked only when harvest_geo = TRUE.

Full disclosure: no production instance of OJS is using the plugin yet, but I'll add updates here once interesting data becomes available.
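A minimal sketch of the flexible dependency described above: sf is looked up only when the caller asks for geospatial output. The function name `as_geo_if_requested()` and the `lon`/`lat` columns are hypothetical; how ojsGeo actually exposes its geometries is not covered here.

```r
# Hypothetical sketch; not part of ojsr or ojsGeo.
# Assumption: `records` is a data frame with numeric `lon` and `lat` columns.
as_geo_if_requested <- function(records, harvest_geo = FALSE) {
  if (!harvest_geo) {
    return(records) # sf is never touched on this path
  }
  # Check for sf only now, so it can stay an optional dependency.
  if (!requireNamespace("sf", quietly = TRUE)) {
    stop("Package 'sf' is required when harvest_geo = TRUE.", call. = FALSE)
  }
  sf::st_as_sf(records, coords = c("lon", "lat"), crs = 4326)
}

records <- data.frame(
  title = c("Article one", "Article two"),
  lon = c(9.73, -64.18),
  lat = c(52.37, -31.42)
)

plain <- as_geo_if_requested(records) # works without sf installed
```

In a package, this pattern usually goes together with listing sf under Suggests instead of Imports.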
