doidata

Project Status: Concept – Minimal or no implementation has been done yet, or the repository is only intended to be a limited example, demo, or proof-of-concept.

Simple repository data access

This is mostly an idea I am hoping others will join in on. Feel free to throw ideas in the issues or amend this README! - Noam

Introduction

At rOpenSci and in associated open science groups, we often encourage scientists to deposit and use data in public repositories that have stable, long-term archival infrastructure and robust metadata. Such repositories include Zenodo, Figshare, Dryad, and a variety of more specific ones. A frequent mode of use is to download files from these repositories, break the link with the original version or metadata, and include some portion or derived form of these data in a new project folder. This leads to fragmentation of data.

One of the reasons for this use mode is that API navigation of these repositories can be daunting or overly complex. On the other hand, R data packages are a popular way to distribute data to make it very easy to use, but this is R-specific and breaks connections to archival repositories.

The aim of this package is to simplify the workflow of using archived data to a single line, like so:

my_data <- doidata::get_data("10.1234/somerecord987/FILENAME.csv")

This would parse the DOI, navigate the repository API (Figshare, Zenodo, Dryad, Open Science Framework, etc.) to find the associated file, and download it. If the repository has metadata describing how the data should be parsed, it will be used. Otherwise, the function could guess the format using rio, or take an argument to return the raw content or write the file to disk.
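As a rough illustration of the parsing step, here is a minimal base-R sketch. Nothing in it exists yet: `parse_data_id()` and `repo_prefixes` are hypothetical names, the rule that the file name is whatever follows the last slash is an assumption (DOI suffixes may themselves contain slashes), and the prefix table is illustrative and would need checking against each repository.

```r
# Hypothetical helper, not part of any existing package.
# Illustrative DOI prefixes for a few repositories; verify before relying on them.
repo_prefixes <- c(
  "10.5281"  = "zenodo",
  "10.6084"  = "figshare",
  "10.5061"  = "dryad",
  "10.17605" = "osf"
)

parse_data_id <- function(id) {
  # Drop an optional "doi:" or "https://doi.org/" prefix.
  id <- sub("^(doi:|https?://(dx\\.)?doi\\.org/)", "", id, ignore.case = TRUE)

  # Assumption: the file name is whatever follows the last "/".
  last_slash <- as.integer(regexpr("/[^/]*$", id))
  if (last_slash < 0) stop("Expected '<doi>/<filename>', got: ", id)
  doi  <- substr(id, 1, last_slash - 1)
  file <- substr(id, last_slash + 1, nchar(id))

  # The remainder must still look like "10.<registrant>/<suffix>".
  if (!grepl("^10\\.[0-9]{4,9}/.+", doi) || !nzchar(file)) {
    stop("Could not split '", id, "' into a DOI and a file name")
  }

  prefix <- sub("/.*$", "", doi)
  repo <- unname(repo_prefixes[prefix])  # NA when the prefix is unknown
  list(doi = doi, file = file, repository = repo)
}

parsed <- parse_data_id("10.1234/somerecord987/FILENAME.csv")
parsed$doi   # "10.1234/somerecord987"
parsed$file  # "FILENAME.csv"
```

An unknown prefix returns `NA` for the repository rather than failing, on the assumption that a real implementation would fall back to resolving the DOI and inspecting the landing page or registry metadata.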

Some notes:

  • Sources that are not archival or DOI-granting (GitHub, data.world) could be supported, but these would be secondary, as the goal is to encourage the use of archival repositories
  • Versioning would be handled on the repository side, though get_data() could take a version= argument for those repos that have versions but not versioned DOIs (e.g., Figshare)
  • This differs from datastorr in that it's not a framework for versioning data, and it seeks to avoid creating new packages for data. It could borrow some of datastorr's caching components, though.
  • Download caching would be optional
  • GitHub-linked Zenodo repositories unfortunately don't store files individually, but as a single ZIP of the GitHub release. The package should detect this, download (and optionally cache) the archive, unzip it, and retrieve individual files automatically.
  • There might be an option or function to return a citation, too, though mostly the idea here is that by keeping the DOI in the code you maintain a conceptual link to the original record.
  • This probably requires up-to-date packages on the key repositories (Figshare, Zenodo, OSF, Dryad, DataONE), though quick-and-dirty methods might be doable for some repositories rather than waiting for full-blown API clients.
  • Default behavior should probably include a message with the citation and version of the data. The idea is to have a "pit of success" that makes it easiest to include a citation to data and keep the link to the original data.
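To make a couple of these notes concrete, here is a minimal base-R sketch of optional caching combined with a default message. It is only one way this could look: `get_cached()`, its arguments, and the `fetch` callback are all hypothetical, and `fetch(doi, file, version)` stands in for the repository-specific download, which is assumed to return the path of a local copy of the file.

```r
# Hypothetical sketch; `fetch` and all argument names are assumptions.
get_cached <- function(doi, file, fetch, version = NULL, cache = TRUE,
                       cache_dir = file.path(tempdir(), "doidata-cache")) {
  version_label <- if (is.null(version)) "latest" else as.character(version)

  # One directory per DOI and version, so versions never overwrite each other.
  key  <- gsub("[^A-Za-z0-9._-]", "_", paste(doi, version_label, sep = "_"))
  dest <- file.path(cache_dir, key, file)

  # Download when caching is off or the file is not cached yet.
  if (!cache || !file.exists(dest)) {
    dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
    src <- fetch(doi, file, version)
    if (!file.copy(src, dest, overwrite = TRUE)) stop("Could not store ", file)
  }

  # Default message keeps the link to the original record visible.
  message("Data from https://doi.org/", doi, " (version: ", version_label, ")")
  dest
}

# Usage with a stand-in downloader that just writes a small CSV.
fake_fetch <- function(doi, file, version) {
  path <- tempfile(fileext = ".csv")
  writeLines(c("x,y", "1,2"), path)
  path
}
path <- get_cached("10.1234/somerecord987", "FILENAME.csv", fake_fetch)
read.csv(path)
```

Passing the downloader in as an argument keeps the caching logic independent of any one repository client. Note that in this sketch an unversioned request is cached under "latest" and never refreshed, which a real implementation would probably want to handle differently.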
