ropensci / targets

Function-oriented Make-like declarative workflows for R

Home Page: https://docs.ropensci.org/targets/

License: Other

Topics: reproducibility, high-performance-computing, r, data-science, pipeline, rstats, r-package, workflow, targets, reproducible-research

Introduction

targets


Pipeline tools coordinate the pieces of computationally demanding analysis projects. The targets package is a Make-like pipeline tool for statistics and data science in R. The package skips costly runtime for tasks that are already up to date, orchestrates the necessary computation with implicit parallel computing, and abstracts files as R objects. If all the current output matches the current upstream code and data, then the whole pipeline is up to date, and the results are more trustworthy than otherwise.

Philosophy

A pipeline is a computational workflow that does statistics, analytics, or data science. Examples include forecasting customer behavior, simulating a clinical trial, and detecting differential expression from genomics data. A pipeline contains tasks to prepare datasets, run models, and summarize results for a business deliverable or research paper. The methods behind these tasks are user-defined R functions that live in R scripts, ideally in a folder called "R/" in the project. The tasks themselves are called “targets”, and they run the functions and return R objects. The targets package orchestrates the targets and stores the output objects to make your pipeline efficient, painless, and reproducible.

Prerequisites

  1. Familiarity with the R programming language, covered in R for Data Science.
  2. Data science workflow management techniques.
  3. How to write functions to prepare data, analyze data, and summarize results in a data analysis project.

Installation

If you are using targets with crew for distributed computing, it is recommended to use crew version 0.4.0 or higher.

install.packages("crew")

There are multiple ways to install the targets package itself, and both the latest release and the development version are available.

| Type        | Source   | Command                                                           |
|-------------|----------|-------------------------------------------------------------------|
| Release     | CRAN     | install.packages("targets")                                       |
| Development | GitHub   | remotes::install_github("ropensci/targets")                       |
| Development | rOpenSci | install.packages("targets", repos = "https://dev.ropensci.org")   |

Get started in 4 minutes

The 4-minute video at https://vimeo.com/700982360 demonstrates the example pipeline used in the walkthrough and functions chapters of the user manual. Visit https://github.com/wlandau/targets-four-minutes for the code and https://rstudio.cloud/project/3946303 to try out the code in a browser (no download or installation required).

Usage

To create a pipeline of your own:

  1. Write R functions for a pipeline and save them to R scripts (ideally in the "R/" folder of your project).
  2. Call use_targets() to write key files, including the vital _targets.R file which configures and defines the pipeline.
  3. Follow the comments in _targets.R to fill in the details of your specific pipeline.
  4. Check the pipeline with tar_visnetwork(), run it with tar_make(), and read output with tar_read(). More functions are available.
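As a sketch of steps 1 through 4, a minimal _targets.R might look like the following. The function names get_data(), fit_model(), and summarize_model() are hypothetical placeholders for functions you would define yourself in the "R/" folder:

```r
# _targets.R (sketch; get_data(), fit_model(), and summarize_model()
# are hypothetical user-defined functions saved in scripts in R/)
library(targets)
tar_source()  # source the function scripts in R/
list(
  tar_target(data, get_data()),
  tar_target(model, fit_model(data)),
  tar_target(summary, summarize_model(model, data))
)
```

Then tar_make() runs only the outdated targets, and tar_read(model) retrieves a stored result.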

Documentation

Help

Please read the help guide to learn how best to ask for help using targets.

Courses

Selected talks

English

Spanish

Japanese

Example projects

Apps

Deployment

Extending and customizing targets

Code of conduct

Please note that this package is released with a Contributor Code of Conduct.

Citation

citation("targets")
To cite targets in publications use:

  Landau, W. M., (2021). The targets R package: a dynamic Make-like
  function-oriented pipeline toolkit for reproducibility and
  high-performance computing. Journal of Open Source Software, 6(57),
  2959, https://doi.org/10.21105/joss.02959

A BibTeX entry for LaTeX users is

  @Article{,
    title = {The targets R package: a dynamic Make-like function-oriented pipeline toolkit for reproducibility and high-performance computing},
    author = {William Michael Landau},
    journal = {Journal of Open Source Software},
    year = {2021},
    volume = {6},
    number = {57},
    pages = {2959},
    url = {https://doi.org/10.21105/joss.02959},
  }

People

Contributors

billdenney, boshek, hadley, joelnitta, kendonb, komatsuna4747, krlmlr, liutiming, malcolmbarrett, markedmondson1234, mattwarkentin, pat-s, robinlovelace, robitalec, russhyde, stuvet, svraka, wlandau, wlandau-lilly


Issues

Full runthrough of an example workflow

Implement a run_targets() function to run the correct targets in the correct order and store the results. Use the "speed" memory strategy from drake. Then move on to #21 and #22.

Memory management at the target level

  • New "targets_memory" class, a high-performance abstraction of an environment.
  • New "targets_cache" class with multiple layers of "targets_memory" objects.
  • Give each target its own "targets_cache" object so it exhibits crate-like behavior on HPC.

Update design vignette

  • No bundles.
  • We draw downstream edges from both buds and branches to everything downstream.

Cancel targets mid-build

Use custom error condition like in drake. This should avoid the need for fancy condition/change triggers.
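The custom-condition idea can be sketched as follows. This is an assumption-laden illustration, not the actual internals: cancellation signals a condition with a distinctive class, and the runner catches that class separately from ordinary errors. The names cancel_target() and run_command() are hypothetical.

```r
# Sketch: cancellation as a custom condition class (names hypothetical).
cancel_target <- function(msg = "target canceled") {
  cond <- structure(
    class = c("condition_targets_cancel", "error", "condition"),
    list(message = msg, call = sys.call(-1))
  )
  stop(cond)  # signal the custom condition
}

run_command <- function(expr) {
  tryCatch(
    eval(expr),
    # Cancellation is caught before ordinary errors by its class.
    condition_targets_cancel = function(e) "canceled",
    error = function(e) "errored"
  )
}
```

Because the cancellation class is checked first, the runner can skip the target and keep its old value without triggering the usual failure mode.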

target class

  • cmd field.
  • All dependency hashes:
    • cmd$hash().
    • Hashes of the dependencies of cmd from hash_envir().
    • Hashes of the targets directly upstream.
  • Names of the targets directly upstream.
  • Delegate pkgs() and eval() to cmd.
  • Field for the return value.
  • Other metadata such as runtime and seed. Need to think about organization.

Encapsulate return values

  • Is a value NULL because it was not set yet or NULL because the user returned NULL?
  • Should be responsible for formatting itself.
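One way to answer the NULL-versus-unset question is to wrap the return value in a container with an explicit flag. A minimal sketch, with hypothetical names:

```r
# Sketch: distinguish "never assigned" from a legitimate user NULL.
value_unset <- function() {
  structure(list(object = NULL, set = FALSE), class = "target_value")
}
value_set <- function(object) {
  structure(list(object = object, set = TRUE), class = "target_value")
}

v <- value_unset()
w <- value_set(NULL)
v$set  # FALSE: the target never ran
w$set  # TRUE: the user's function really returned NULL
```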

Static analysis of functions and other globals

  1. Create a dependency graph of functions and other globals in the environment using static code analysis.
  2. Hash all the objects in topological order. The hash of a function is the hash of the deparsed body together with the hashes of all the immediate upstream neighbors.
  3. Keep track of the result in a named character vector.
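The hashing step can be sketched like this, assuming the digest package for hashing and codetools for static analysis (the real internals differ, and the function name hash_global() is hypothetical). Callers are assumed to visit objects in topological order so that upstream hashes already exist:

```r
library(digest)     # digest() for general-purpose hashing (assumed dependency)
library(codetools)  # findGlobals() for static code analysis

# Hash of a function = hash of its deparsed body together with the
# hashes of its immediate upstream neighbors (sketch, not the real code).
hash_global <- function(name, envir, hashes) {
  fun <- get(name, envir = envir)
  # Dependencies: globals of fun that are themselves tracked objects.
  deps <- sort(intersect(findGlobals(fun), names(hashes)))
  digest(c(deparse(fun), unname(hashes[deps])))
}
```

Accumulating the results of hash_global() in a named character vector gives the lookup table described in step 3.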

deplist class

An internal data frame with columns "name", "direction", "branching" (logical), and "class", plus methods to interrogate the dependency structure for the sake of build decisions.

Specialized database structure

For #5 and #30.

  • Should append to the rows of a file like a txtq. But since the index of the head row is session-specific, we probably only need the database file itself (no lockfile either).
  • We only append rows to make target-specific updates, so we know later rows are chronologically later.
  • When updating targets, only transact every second or so (or at a customizable interval).
  • Clean up at the beginning and end of make() sessions, which just means reading in the database and removing early rows to deduplicate targets.
  • Cleaning and pruning will involve this data structure.

Buds

  • Implement a new "targets_bud" class to make slices of stems into targets. This will help us map() and cross() over stems.
  • Implement stem$produce_children() to create a named list of buds.

API

Options

  • tar_option()
  • tar_options()

Dependencies

  • tar_knitr()
  • tar_package()

Pipeline

  • tar_cue()
  • tar_pipeline()
  • tar_target()
  • tar_target_external()
  • tar_validate()

Visuals

  • tar_glimpse()
  • tar_graph()

Run

  • tar_make()
  • tar_make_clustermq()
  • tar_make_future()

Runtime

  • tar_cancel()
  • tar_name()
  • tar_path()

Data

  • tar_load()
  • tar_meta()
  • tar_progress()
  • tar_read()

Clean

  • tar_clean()
  • tar_deduplicate()
  • tar_destroy()
  • tar_invalidate()
  • tar_prune()

Memory strategies at the whole workflow level

Related: #19. We should think about when objects should be kept in memory and when they should be released. The advantage of keeping them longer is that we avoid accessing storage, but it could cause memory usage to blow up.

Branching

In targets, all branching will be dynamic branching.

Let's tackle the infrastructure before we get too far ahead. One idea: we could use target iterators/generators. Maybe the abstract factory pattern?

Bundles

When we come to a pattern in a pipeline, we should do the following:

  1. Add the branches to the graph and priority queue.
  2. Replace the pattern with a bundle of the same name.

A bundle is a target whose rule is just to aggregate the children. We may want different kinds of bundles based on the iteration type: list() vs vec_c().
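The aggregation rule might be sketched as follows, assuming vctrs::vec_c() for vector iteration (the function name aggregate_bundle() is hypothetical):

```r
# Sketch: a bundle's only job is to aggregate its children, with the
# aggregation determined by the pattern's iteration type.
aggregate_bundle <- function(children, iteration = c("vector", "list")) {
  iteration <- match.arg(iteration)
  switch(
    iteration,
    vector = do.call(vctrs::vec_c, children),  # type-stable concatenation
    list = children                            # keep children as a list
  )
}
```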

HPC

clustermq only. I would consider future, but right now it still relies entirely on batchtools for distributed computing.

Scheduler

Requires a graph, but we should not need to hold on to the graph to operate the queue. Holding on to the graph is a weakness of drake's internals.

Pipeline-level settings

Totally global:

  • names
  • reporter
  • algorithm
  • workers
  • garbage collection
  • template

Target-level defaults:

  • packages
  • library
  • memory
  • deployment
  • storage
  • retrieval
  • format
  • failure mode
  • trigger
