
furrr's People

Contributors

aaronpeikert, davisvaughan, lionel-, mikldk


furrr's Issues

add programmatic creation/connection to EC2 in vignette

For the last week or so, I've been working from this gist to sort out ways to programmatically create EC2 instances using reticulate and Python's boto3. I managed to sort out most of the details and thought I would contribute it back to the vignette so others could implement it.

My edits are available via jmlondon/furrr@0b802fb

This is just a first stab at incorporating into the vignette. I'm happy to keep improving or file a PR if you think it would be of use.

Solving the error: connect_to localhost: unknown host (nodename nor servname provided, or not known)

https://gist.github.com/CerebralMastication/478b91e4c4f06abc3011cc42fe1e14c9

This is not a furrr issue, but I would like to document this in case it ever comes up again (heaven forbid).

Terminal error:

connect_to localhost: unknown host (nodename nor servname provided, or not known)

If this error happens on a Mac (or Linux?) while trying to connect to any remote cluster over reverse SSH using the option -R <port>:localhost:<port> (the default with future::makeClusterPSOCK()), check whether you actually have localhost defined in your /etc/hosts file. It should exist and read:

##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting.  Do not change this entry.
##
127.0.0.1       localhost
255.255.255.255 broadcasthost
::1             localhost

If for some reason that file is deleted, or is empty, then this error will occur.
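For reference, a quick hedged sanity check from R (assuming the standard file location):

any(grepl("localhost", readLines("/etc/hosts")))  # should be TRUE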

cc @CerebralMastication because of all the pain we went through for this one

Nested future_map / map generates warning on ..1 argument

Consider these examples:

  1. Works as expected
library(purrr)
library(furrr)
#> Loading required package: future
map(1:2, ~map_int(6:7, ~.x))
#> [[1]]
#> [1] 6 7
#> 
#> [[2]]
#> [1] 6 7

Created on 2018-11-19 by the reprex package (v0.2.1.9000)
2. Works as expected

library(purrr)
library(furrr)
#> Loading required package: future
future_map(1:2, ~map_int(6:7, ~.x))
#> [[1]]
#> [1] 6 7
#> 
#> [[2]]
#> [1] 6 7

Created on 2018-11-19 by the reprex package (v0.2.1.9000)
3. But this one throws a warning

library(purrr)
library(furrr)
#> Loading required package: future
future_map(1:2, ~map_int(6:7, ~..1))
#> Warning: <anonymous>: ..1 may be used in an incorrect context
#> [[1]]
#> [1] 6 7
#> 
#> [[2]]
#> [1] 6 7

Created on 2018-11-19 by the reprex package (v0.2.1.9000)
4. And standard purrr does not

library(purrr)
library(furrr)
#> Loading required package: future
map(1:2, ~map_int(6:7, ~..1))
#> [[1]]
#> [1] 6 7
#> 
#> [[2]]
#> [1] 6 7

Created on 2018-11-19 by the reprex package (v0.2.1.9000)

I believe it is worth making future_map() consistent with map(), provided the user understands what exactly ..1 evaluates to in a nested map scenario.
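In the meantime, a hedged workaround sketch: the warning presumably comes from the code inspection future performs while searching for globals, so spelling the lambdas out as explicit functions (or sticking with .x) sidesteps it while keeping the same semantics:

future_map(1:2, function(outer) map_int(6:7, function(inner) inner))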

Double nesting and using `~` inside mutate() loses the attached packages

When double nesting inside a mutate() call, tibble() is not found. The second map() is trying to use the ~ that we recreate on the worker side using rlang::new_data_mask(), and that doesn't seem to work correctly. But base::`~` does work, so I'm going to change over to that.

library(furrr)
library(dplyr)
library(purrr)
library(repurrrsive)

plan(multicore)

ex <- repurrrsive::gh_repos

ex_tbl <- tibble(ex)

res <- ex_tbl %>%
  mutate(
    ex2 = future_map(ex, ~{
      map(.x, ~ {
        #sessionInfo()
        tibble()
        #.x$id
      })
    })
  )
#> Error in mutate_impl(.data, dots): Evaluation error: could not find function "tibble".

res <- ex_tbl %>%
  mutate(
    ex2 = future_map(ex, ~{
      map(.x, function(.x) {
        #sessionInfo()
        tibble()
        #.x$id
      })
    })
  )

Created on 2018-04-18 by the reprex package (v0.2.0).

Version of functions that leave data where it is?

I'm not familiar enough with future to know whether this is possible or not, but it would be nice to have a version of the future_* functions that leaves the data where it is (i.e. spread over multiple processes).

To get the data back you'd either call something like collect() or use future_reduce().
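A minimal sketch of the underlying idea using future directly, where expensive() is a hypothetical worker function; futures hold their values in the worker processes until value() is called, which would play the role of collect():

library(future)
plan(multisession)

fs <- lapply(1:4, function(i) future(expensive(i)))  # results stay on the workers
res <- lapply(fs, value)                             # "collect" back to the main process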

Use option object

Rather than including future.globals = TRUE, future.packages = NULL, future.seed = FALSE, future.lazy = FALSE, future.scheduling = 1.0 in the arguments for every function, I think it would be cleaner to create an "option" object:

future_opts <- function(future.globals = TRUE, future.packages = NULL, future.seed = FALSE, future.lazy = FALSE, future.scheduling = 1.0) {
  # a sketch: collect the settings into a classed list
  structure(
    list(globals = future.globals, packages = future.packages, seed = future.seed,
         lazy = future.lazy, scheduling = future.scheduling),
    class = "future_opts"
  )
}

Then each function would only need a single argument, and it would be much easier to add options in the future.
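Usage would then look something like this (a sketch, assuming the constructor above):

future_map(x, f, .options = future_opts(future.seed = TRUE))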

furrr not working with fst package

I have ~ 3000 fst files. These are organized as fst objects in the list fst_objs. I want to subset all of these objects using the following function:

filter_select <- function(fst_obj, filter, selection) {
  filter_eval <- eval(parse(text = filter)) 
  fst_obj[filter_eval, selection]
}

Using map_dfr(fst_objs, filter_select, filter, selection) where filter = 'fst_obj$INSTRUMENT == "DE0009652669"' and selection = 1:20 works fine and returns a data.frame with ~ 800 rows.

Replacing map_dfr() by future_map_dfr() returns a 0 x 0 tibble.

I suspect this is related to the future package, since a similar problem occurs when replacing lapply() with future_lapply().
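A hedged guess at the cause, with a workaround sketch: fst objects wrap external pointers to open files, and external pointers generally do not survive serialization to worker processes, so the workers likely see empty handles. Mapping over the file paths and re-opening each file on the worker sidesteps this (the path below is hypothetical):

library(furrr)
library(fst)
plan(multisession)

paths <- list.files("path/to/fst/files", full.names = TRUE)
res <- future_map_dfr(paths, function(p) {
  fst_obj <- fst(p)  # open the file on the worker itself
  fst_obj[fst_obj$INSTRUMENT == "DE0009652669", 1:20]
})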

Viewing/signaling conditions signaled in futures

One of the issues I've thought about in furrr (and in future in general, actually) is that when the plan being used isn't sequential/transparent, the only condition that gets signaled to functions outside of the future call is the error condition. To the best of my knowledge, warnings and messages that arise within future_map/future are basically gone once the results are returned.
While I understand that the whole concept of futures precludes higher-up processes from dictating how the lower processes within the futures deal with these conditions, it seems inadvisable to me that potentially important warnings/messages are automatically thrown out. This design choice seems to hamper "good code".

I don't know what your thoughts are about this, but if you're interested in implementing something like this, I've extended the future_map functions in my personal custom codebase so that I can collect all the messages, warnings, and errors signaled in these functions, and I can then signal/view everything that was collected, in case the functions calling future_map need to do anything about them.

I uploaded the relevant code as a gist that can be accessed here. Basically, I wrote a function that "pries open" the future_map functions, wraps the .f function in a function that collects conditions, and saves them as new functions. I went with this strategy for my code because I didn't see the point in hard-copying your code into new functions just to change something so small, but it has the added benefit of making things very flexible. I don't know if there are any theoretical problems with the condition collection in collect_all, but it's worked in all the scenarios I've been using it.
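A minimal sketch of the condition-collecting idea (the gist linked above is more complete): withCallingHandlers() records warnings and messages without aborting evaluation, so the wrapped .f can report them alongside its result:

collect_all <- function(f) {
  function(...) {
    warnings <- list()
    messages <- list()
    res <- withCallingHandlers(
      f(...),
      warning = function(w) {
        warnings[[length(warnings) + 1]] <<- w
        invokeRestart("muffleWarning")
      },
      message = function(m) {
        messages[[length(messages) + 1]] <<- m
        invokeRestart("muffleMessage")
      }
    )
    list(result = res, warnings = warnings, messages = messages)
  }
}

# future_map(xs, collect_all(f)) would then return conditions alongside results.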

Just food for thought, basically! If you think such things are in the purview of this package, I could spruce it up and make a pull request, but I'd also be fine with you making your own code conceptually based off mine, if you just acknowledge me. At some level, I think stuff like this should probably be addressed in the future package itself, so I'll probably be asking Henrik about it regardless.

scope issue with `future_map()`?

I'm working with a large amount of json on disk and trying to extract just the elements I need.
I'm open to the idea that this is a bad json parsing approach and I should revise it, but I have no idea why this works with map() but not with future_map()

  
library(furrr)
#> Loading required package: future
library(purrr)
library(magrittr)
#> 
#> Attaching package: 'magrittr'
#> The following object is masked from 'package:purrr':
#> 
#>     set_names
library(jsonlite)
#> 
#> Attaching package: 'jsonlite'
#> The following object is masked from 'package:purrr':
#> 
#>     flatten
plan(multiprocess)

repos <- c(
  "https://api.github.com/users/hadley/repos",
  "https://api.github.com/users/davisvaughan/repos"
  )

accessor <- function(x) x$owner

get_owners <- compose(
  function(x) pluck(x, accessor),
  function(x) fromJSON(x)
)

works <- 
  repos %>% 
  map(safely(get_owners))

works[[1]]$error
#> NULL

doesnt <- 
  repos %>%
  future_map(safely(get_owners))

doesnt[[1]]$error
#> <simpleError in dots_splice(...): object 'accessor' not found>

Created on 2018-10-03 by the reprex package (v0.2.0).
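A hedged workaround sketch: future's globals detection inspects the body of .f, but compose() buries the reference to accessor inside anonymous closures it apparently cannot see through. Rewriting get_owners as an ordinary function whose body mentions accessor directly may let it be found and exported:

get_owners2 <- function(x) pluck(fromJSON(x), accessor)

also_works <-
  repos %>%
  future_map(safely(get_owners2))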

Error in curl::curl_fetch_memory?

I sometimes get this error when I run furrr::future_map(). The weird thing is that I don't always get the error, even if I run exactly the same script.

Error in curl::curl_fetch_memory(url, handle = handle) : 
  schannel: next InitializeSecurityContext failed: SEC_E_BUFFER_TOO_SMALL (0x80090321) - The buffers supplied to a function was too small.

Any ideas on how to fix this? In case it is relevant, I'm using multiprocess on Windows.

furrr is very magical

e.g. https://github.com/DavisVaughan/furrr/blob/master/R/future_map2_template.R

I think you should consider an interface where the user has more control over exactly what is sent to the various backends, and then build a more magical interface on top of that.

You could imagine maybe something like this:

backend <- furrr_backend()
backend$load_packages("purrr")

x <- runif(1e6)
backend$copy_data(x)
backend$run_code(y <- runif(1))

future_map(1:5, ~ .x + y, backend = backend)

Error in plan(cluster, workers = cl) when using more than 13 workers with EC2 instance

Hi

I am using an EC2 instance to do some basic calculations, but I am finding that as soon as I try to start more than 12 workers, the plan(cluster, workers = cl) call crashes with:

> plan(cluster, workers = cl)
plan(): plan_init() of ‘tweaked’, ‘cluster’, ‘multiprocess’, ‘future’, ‘function’ ...
Error in sprintf(...) : 'fmt' length exceeds maximal format length 8192

To reproduce the error, set workers = rep(public_ip, times = 15L) in makeClusterPSOCK:


options(future.debug = TRUE)

# A r3.4xlarge AWS instance
# Created from http://www.louisaslett.com/RStudio_AMI/
public_ip <- "18.206.46.236"
# This is where my pem file lives (password file to connect).
ssh_private_key_file <- "~/Desktop/programming/AWS/key-pair/dvaughan.pem"

cl <- makeClusterPSOCK(
  
  # Public IP number of EC2 instance
  workers = rep(public_ip, times = 15L),
  
  # User name (always 'ubuntu')
  user = "ubuntu",
  
  # Use private SSH key registered with AWS
  rshopts = c(
    "-o", "StrictHostKeyChecking=no",
    "-o", "IdentitiesOnly=yes",
    "-i", ssh_private_key_file
  ),
  
  # Set up .libPaths() for the 'ubuntu' user and
  # install furrr
  rscript_args = c(
    "-e", shQuote("local({p <- Sys.getenv('R_LIBS_USER'); dir.create(p, recursive = TRUE, showWarnings = FALSE); .libPaths(p)})"),
    "-e", shQuote("if (!require('furrr')) install.packages('furrr')")
  ),
  
  # Switch this to TRUE to see the code that is run on the workers without
  # making the connection
  dryrun = FALSE
)

cl

library(furrr)
plan(cluster, workers = cl)  ## error here

f <- function(x){
  x <- rnorm(x)
  return("success")
}

cars2_mod_future <- as.list(rep(1e7, 50)) %>%
  ## error here
 future_map(~f(.x))

parallel::stopCluster(cl)

Second error when running future_map:

future_map_*() ...
Finding globals ...
getGlobalsAndPackages() ...
Searching for globals...
- globals found: [7] ‘.f’, ‘f’, ‘{’, ‘<-’, ‘x’, ‘rnorm’, ‘return’
Searching for globals ... DONE
Resolving globals: FALSE
The total size of the 3 globals is 32.58 KiB (33360 bytes)
- globals: [3] ‘.f’, ‘f’, ‘x’
- packages: [1] ‘stats’
getGlobalsAndPackages() ... DONE
 - globals found: [3] ‘.f’, ‘f’, ‘x’
 - needed namespaces: [1] ‘stats’
Finding globals ... DONE
Getting '...' globals ...
resolve() on list ...
 recursive: 0
 length: 1
 elements: ‘...’
 length: 0 (resolved future 1)
resolve() on list ... DONE
Getting '...' globals ... DONE
Globals to be used in all futures:
Error in sprintf(...) : 'fmt' length exceeds maximal format length 8192

Session:

> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.4 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.0
LAPACK: /usr/lib/lapack/liblapack.so.3.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] h2o_3.20.0.8          usethis_1.4.0         bindrcpp_0.2.2        doParallel_1.0.14     iterators_1.0.10      foreach_1.4.4         furrr_0.1.0           future_1.10.0        
 [9] glue_1.3.0            ggplot2_3.1.0         RMySQL_0.10.15        DBI_1.0.0             tidyr_0.8.2           lubridate_1.7.4       readr_1.2.1           dplyr_0.7.8          
[17] purrr_0.2.5 
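A hedged observation: both tracebacks die inside sprintf(), and the session has options(future.debug = TRUE) set, so the debug logging itself may be the trigger (the globals listing for 15 workers plausibly exceeds sprintf()'s 8192-character format limit). Worth trying before anything else:

options(future.debug = FALSE)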

Don't attach the future package

I think it's generally bad practice to attach other packages on load. Could you instead just re-export the handful of functions that you need from future?
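For reference, a minimal sketch of the roxygen2 re-export idiom, assuming plan() is the main function users need:

#' @importFrom future plan
#' @export
future::plan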

Nested futures

Hey Davis,
this is a question really.
In your aws.ec2 script (https://gist.github.com/DavisVaughan/5aac4a2757c0947a499d25d28a8ca89b), on line 113, we are sending tasks to each node of the cluster.
Is it possible to use a plan(multiprocess) inside each node so that the task received is run in parallel (multicore)?
In my case (hyperparameter tuning), what I would like to do is send the folds to different machines, but then on each machine do the parameter search in parallel.
I think I saw a mention of something like this in the future vignette, but I would like to get your input on it (and how it would be done in furrr).
Thanks!
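For reference, a hedged sketch of the nested-topology idea from the future documentation: plan() accepts a list of strategies, one per nesting level, so the outer future_map() distributes across machines and the inner one forks within each machine. Here cl is the cluster from the gist, and folds, param_grid, and fit_one are hypothetical:

plan(list(
  tweak(cluster, workers = cl),
  multicore
))

results <- future_map(folds, function(fold) {
  future_map(param_grid, ~ fit_one(fold, .x))
})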

is dynamic task allocation across workers possible?

(I tried to frame these questions generally, but if it shades too far toward advice on an idiosyncratic application, feel free to close!)

I'm running a fairly large bootstrap routine through furrr::future_pmap and plan(multiprocess) right now. It's running on an m4.16xlarge EC2 instance (64 core/256 GB RAM) and going slower than I anticipated. I think I see what's happening, but wanted to ask about the intended behavior first.

I can watch htop along the way. I see all available cores busy when I start the job, as expected, but typically by ~75% of expected elapsed runtime I see the number of cores in active use decline steadily. The last few minutes of a run are typically occupied by just a few working cores. A back-of-the-envelope estimate puts that decline at perhaps 20-25% of total runtime.

I assume what's happening is that work is divided across cores at the outset, and there's enough variance in the runtime of my function calls that some cores can finish their work faster than others.

  1. Is that assumption true?
  2. Is there an option for dynamic task allocation, within furrr itself or future?
  3. If there's no possibility of dynamic allocation, are there any general good practices to claw back some of that runtime?

So far, I've just tried minimizing the memory footprint and runtime variance for these function calls. It's timing in the range of 400-600ms per call, median at 465ms. The memory management overhead isn't too bad -- no large globals being copied, everything inside the call is vectorized carefully, and the returned result is a numeric matrix of just 40 elements. The machine never exceeds 60GB of RAM (25% of total) even at the largest jobs.
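Toward question 2, a hedged sketch: furrr splits .x into chunks up front (static allocation), but the scheduling option controls how fine those chunks are. scheduling = Inf makes one future per element, which approximates dynamic allocation at the cost of more per-future overhead (args and boot_fun stand in for the real call):

res <- future_pmap(args, boot_fun,
                   .options = future_options(scheduling = Inf))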

I've also hit this in very different contexts, e.g. webscraping through a proxy that enables parallel hits, where the function call runtime variance is much higher. In that case, I found it more efficient to write out logs, stop the job, extract from logs whatever didn't finish, and reallocate the work in a new future_map call. (That doesn't seem like it's a good practice, but in a pinch....)

(A more general observation. I'm running a 1M-replicate bootstrap over 1M individual raster cells and repeating that for ~3,000 rasters, all in R, all in about 200 lines of code. I am so happy with this package. furrr has cut the time between testing a function and scaling it to absurd levels down to basically 0. Thank you!)

Documentation updates:

  • Grouped data frames don't allow furrr to do what it is designed for. See #28
  • Don't use multicore for generating images with ggsave() (or any graphics device) with X11. See #27

furrr stumbles over grouped dataframes

I've run into some weird behavior when using dplyr and mutate when the dataframe is grouped. Calculations with future_map take forever compared to purrr::map. This becomes cumbersome for my workflow, which resembles:
df %>% group_by(some_var) %>% nest() %>% mutate(results = future_map(data, some_expensive_calculation))

I've attached a (hopefully helpful) reprex.
The example from GitHub works as expected:

library(rsample)
#> Loading required package: broom
#> Loading required package: tidyr
#> 
#> Attaching package: 'rsample'
#> The following object is masked from 'package:tidyr':
#> 
#>     fill
data("attrition")
names(attrition)
#>  [1] "Age"                      "Attrition"               
#>  [3] "BusinessTravel"           "DailyRate"               
#>  [5] "Department"               "DistanceFromHome"        
#>  [7] "Education"                "EducationField"          
#>  [9] "EnvironmentSatisfaction"  "Gender"                  
#> [11] "HourlyRate"               "JobInvolvement"          
#> [13] "JobLevel"                 "JobRole"                 
#> [15] "JobSatisfaction"          "MaritalStatus"           
#> [17] "MonthlyIncome"            "MonthlyRate"             
#> [19] "NumCompaniesWorked"       "OverTime"                
#> [21] "PercentSalaryHike"        "PerformanceRating"       
#> [23] "RelationshipSatisfaction" "StockOptionLevel"        
#> [25] "TotalWorkingYears"        "TrainingTimesLastYear"   
#> [27] "WorkLifeBalance"          "YearsAtCompany"          
#> [29] "YearsInCurrentRole"       "YearsSinceLastPromotion" 
#> [31] "YearsWithCurrManager"

set.seed(4622)
rs_obj <- vfold_cv(attrition, v = 20, repeats = 10)
rs_obj
#> #  20-fold cross-validation repeated 10 times 
#> # A tibble: 200 x 3
#>    splits       id       id2   
#>    <list>       <chr>    <chr> 
#>  1 <S3: rsplit> Repeat01 Fold01
#>  2 <S3: rsplit> Repeat01 Fold02
#>  3 <S3: rsplit> Repeat01 Fold03
#>  4 <S3: rsplit> Repeat01 Fold04
#>  5 <S3: rsplit> Repeat01 Fold05
#>  6 <S3: rsplit> Repeat01 Fold06
#>  7 <S3: rsplit> Repeat01 Fold07
#>  8 <S3: rsplit> Repeat01 Fold08
#>  9 <S3: rsplit> Repeat01 Fold09
#> 10 <S3: rsplit> Repeat01 Fold10
#> # ... with 190 more rows

mod_form <- as.formula(Attrition ~ JobSatisfaction + Gender + MonthlyIncome)

library(broom)
## splits will be the `rsplit` object with the 90/10 partition
holdout_results <- function(splits, ...) {
    # Fit the model to the 90%
    mod <- glm(..., data = analysis(splits), family = binomial)
    # Save the 10%
    holdout <- assessment(splits)
    # `augment` will save the predictions with the holdout data set
    res <- broom::augment(mod, newdata = holdout)
    # Class predictions on the assessment set from class probs
    lvls <- levels(holdout$Attrition)
    predictions <- factor(ifelse(res$.fitted > 0, lvls[2], lvls[1]),
                          levels = lvls)
    # Calculate whether the prediction was correct
    res$correct <- predictions == holdout$Attrition
    # Return the assessment data set with the additional columns
    res
}


# old example ---------------------------------------------------------------------------------
library(purrr)
library(tictoc)
set.seed(4622)
rs_obj <- vfold_cv(attrition, v = 20, repeats = 10)
tic()
rs_obj$results <- map(rs_obj$splits, holdout_results, mod_form)
toc()
#> 2.87 sec elapsed

library(furrr)
#> Loading required package: future
plan(multiprocess, workers = 4)
set.seed(4622)
rs_obj <- vfold_cv(attrition, v = 20, repeats = 10)
tic()
rs_obj$results <- future_map(rs_obj$splits, holdout_results, mod_form)
toc()
#> 1.78 sec elapsed

plan(multiprocess, workers = 8)
set.seed(4622)
rs_obj <- vfold_cv(attrition, v = 20, repeats = 10)
tic()
rs_obj$results <- future_map(rs_obj$splits, holdout_results, mod_form)
toc()
#> 1.251 sec elapsed

Using dplyr's mutate() to add the new columns:

# using dplyr ---------------------------------------------------------------------------------
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
set.seed(4622)
rs_obj <- vfold_cv(attrition, v = 20, repeats = 10)
tic()
rs_obj <- rs_obj %>% mutate(results = map(splits, holdout_results, mod_form))
toc()
#> 3.073 sec elapsed

set.seed(4622)
rs_obj <- vfold_cv(attrition, v = 20, repeats = 10)
tic()
rs_obj <- rs_obj %>% mutate(results = future_map(splits, holdout_results, mod_form))
toc()
#> 1.088 sec elapsed

set.seed(4622)
rs_obj <- vfold_cv(attrition, v = 20, repeats = 10)
tic()
rs_obj <- rs_obj %>% mutate(results = future_map(splits, function(x) holdout_results(x, mod_form)))
toc()
#> 0.793 sec elapsed

Now for the grouped dataframes:

# grouped data.frame --------------------------------------------------------------------------
set.seed(4622)
rs_obj <- vfold_cv(attrition, v = 20, repeats = 10) %>% mutate(g_id = row_number()) %>% group_by(g_id)
tic()
rs_obj <- rs_obj %>% mutate(results = map(splits, holdout_results, mod_form))
toc()
#> 2.883 sec elapsed

set.seed(4622)
rs_obj <- vfold_cv(attrition, v = 20, repeats = 10) %>% mutate(g_id = row_number()) %>% group_by(g_id)
tic()
rs_obj <- rs_obj %>% mutate(results = future_map(splits, holdout_results, mod_form))
toc()
#> 12.228 sec elapsed

set.seed(4622)
rs_obj <- vfold_cv(attrition, v = 20, repeats = 10) %>% mutate(g_id = row_number()) %>% group_by(g_id)
tic()
rs_obj <- rs_obj %>% mutate(results = future_map(splits, function(x) holdout_results(x, mod_form)))
toc()
#> 11.633 sec elapsed

The calculation with furrr on the grouped dataframe takes considerably longer than with purrr, which I would not expect.

Created on 2018-08-02 by the reprex package (v0.2.0).
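A hedged explanation with a workaround sketch: mutate() on a grouped data frame evaluates its expression once per group, so with 200 single-row groups the future_map() call, along with its per-call setup and export costs, runs 200 times over one element each. Ungrouping around the parallel step restores a single call:

rs_obj <- rs_obj %>%
  ungroup() %>%
  mutate(results = future_map(splits, holdout_results, mod_form)) %>%
  group_by(g_id)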

IDEA - steps towards unified future_map_template()

Probably don't want to use quo(). We want ..future_x_ii to change on each iteration of the for loop, but each iteration uses the same future expression. quo() would capture the environment from the first time around, and this might cause problems.

# The hardest part of using a common template would be building up the
# future call. One nice way to do this would be with rlang expr()

library(future)
library(rlang)

# A global var to find
n <- 2

####################################################

# rng?
# this is our dynamic expression to test with
rng <- TRUE
rng_expr <- if(rng) {
  expr({set.seed(1)})
} else {
  expr({})
}

f_expr <- expr({
  !! rng_expr
  ret <- rnorm(n)
  ret
})
f_expr
#> {
#>     {
#>         set.seed(1)
#>     }
#>     ret <- rnorm(n)
#>     ret
#> }

future_call <- expr(future(!!f_expr, evaluator = multiprocess))
future_call
#> future({
#>     {
#>         set.seed(1)
#>     }
#>     ret <- rnorm(n)
#>     ret
#> }, evaluator = multiprocess)

val <- eval(future_call)
value(val)
#> [1] -0.6264538  0.1836433

# it worked!
val2 <- eval(future_call)
value(val2)
#> [1] -0.6264538  0.1836433

####################################################

# rng?
# this is our dynamic expression to test with
rng <- FALSE
rng_expr <- if(rng) {
  expr({set.seed(1)})
} else {
  expr({})
}

f_expr <- expr({
  !! rng_expr
  ret <- rnorm(n)
  ret
})
f_expr
#> {
#>     {
#>     }
#>     ret <- rnorm(n)
#>     ret
#> }

future_call <- expr(future(!!f_expr, evaluator = multiprocess))
future_call
#> future({
#>     {
#>     }
#>     ret <- rnorm(n)
#>     ret
#> }, evaluator = multiprocess)

val <- eval(future_call)
value(val)
#> [1] -0.5413330 -0.1550716

Created on 2018-08-25 by the reprex package (v0.2.0).

rlang tilde pointer lost when exporting to workers through multisession

Also see the Sale_Price ~ . example in issue #3 where this first came up.

This is a small reproducible example. This also happens when just using pure future() calls, so it is not specific to future_map().

library(tidyverse)
library(furrr)
library(gapminder)

by_country <- gapminder %>% 
  group_by(country, continent) %>% 
  nest()

# This works fine. Shared memory
plan(multicore)
by_country_with_mod <- by_country %>% 
  mutate(model = future_map(data, ~lm(lifeExp ~ year, data = .x)))

# This doesn't work. The ~ in lifeExp ~ year is seen as a 0x0 pointer on each worker
plan(multisession)
by_country_with_mod <- by_country %>% 
  mutate(model = future_map(data, ~lm(lifeExp ~ year, data = .x)))

# Error in mutate_impl(.data, dots) : 
#  Evaluation error: NULL value passed as symbol address.

@lionel- do you have any advice? I don't know who else to ask. The rlang ~ from mutate()'s data mask environment is seen as a global variable to be exported to each multisession worker by future, which is good, but when it is exported, I think it has to serialize() the object, so the pointer address is lost and can't be used on the other side.

One solution I had was to redefine a new data mask on each worker, and pull the ~ function off of it. Something like the following would run on each worker, with ...future.f being the mapper version of ~lm(lifeExp ~ year, data = .x)

...future.f.env <- environment(...future.f)
mask <- rlang::as_data_mask(list(a=1))
...future.f.env$`~` <- mask$`~`

Do I lose anything here? Does the rlang ~ contain information specific to the environment it's created in that I would be losing? It seems to work for this use case. Thanks in advance for any help.

console being held on call to sge

Is there a way to free the console when sending a job to a grid (like SGE)? This would seem like a natural thing to do, since the master isn't actually doing any work.

thanks
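A hedged sketch: a bare future() returns control to the console immediately, whereas future_map() blocks because it resolves everything before returning. For a fire-and-forget job (my_fun here stands in for the real work):

library(future)

f <- future(my_fun())
resolved(f)  # poll without blocking
value(f)     # blocks only when the result is finally needed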

<U+2500> appears instead of ─ with .progress = TRUE

On my work system, the code <U+2500> appears instead of ─ when using .progress = TRUE. This might be an issue with my shell or terminal. Maybe it would be possible to detect if this is going to happen, and use a simpler dash instead?

> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 18

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] furrr_0.1.0    future_1.7.0   pacman_0.4.6   colorout_1.1-2

loaded via a namespace (and not attached):
[1] compiler_3.4.4   magrittr_1.5     parallel_3.4.4   listenv_0.7.0    codetools_0.2-15 digest_0.6.15   
[7] globals_0.11.0   rlang_0.2.1.9000 purrr_0.2.5
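A hedged observation: the sessionInfo() above shows a C (non-UTF-8) locale, which is precisely the situation where R prints Unicode characters as <U+2500> escapes. Switching to a UTF-8 locale may restore the bar:

Sys.setlocale("LC_CTYPE", "en_US.UTF-8")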

Custom methods are invisible for workers in PSOCK cluster

I have an issue when creating a PSOCK cluster on a Windows machine:

library(furrr)
cl <- parallel::makePSOCKcluster(3)
plan(cluster, workers = cl)
future_map_dfr(...)

It works great until I pass to future_map_dfr() a function which uses custom methods for predict() (predict.label_enc and others).
The error is:

Error in UseMethod("predict") : 
  no applicable method for 'predict' applied to an object of class "c('label_enc', 'data.table', 'data.frame')"

These methods are defined in a sourced .R file and are visible in the global environment. When I call clusterEvalQ() with all the required source() calls inside, and/or clusterExport() with all the function names, it has no effect.
future::makeClusterPSOCK() and plan(multiprocess) give me the same error.

Function example:

label_encoder <- function(data, cols) {
    checkmate::assert_data_table(data)
    checkmate::assert_character(cols)
    checkmate::assert_names(names(data), must.include = cols)

    res <- melt(data[, .SD, .SDcols = cols],
                measure.vars = cols, variable.factor = FALSE)
    res <- unique(res)
    res[, number := rowid(variable)]

    class(res) <- c("label_enc", class(res))
    return(res)
}

predict.label_enc <- function(object, newdata, suffix = "_lab", drop = FALSE, ...) {
    checkmate::assert_data_table(newdata)

    cols <- intersect(object[, unique(variable)], names(newdata))
    if (length(cols) == 0L) {
        return(newdata[])
    }
    cols_new <- paste0(cols, suffix)
    for (i in seq_along(cols)) {
        newdata[object[variable == cols[i]],
                (cols_new[i]) := number,
                on = paste0(cols[i], "==value")]
    }
    if (drop && nzchar(suffix)) {
        newdata[, (cols) := NULL]
    }

    # return(newdata[])
}

So label_encoder() is called first (without an error), but the next call, predict() on the "label_enc" object, fails.
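A hedged workaround sketch: S3 dispatch needs predict.label_enc to be visible where the worker evaluates the call, and future builds its own export list rather than relying on clusterExport(). Naming the method explicitly as a global may help (my_fun and inputs stand in for the real call):

res <- future_map_dfr(
  inputs, my_fun,
  .options = future_options(
    globals = c("my_fun", "label_encoder", "predict.label_enc")
  )
)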

furrr functions within functions

Hi Davis,

Apologies for the question, because I am clearly missing something here, and I would appreciate your help in understanding the usage of furrr functions within functions.

The function below fails, and I do understand that this is the expected behavior:

  x <- c(1,2)
  y <- 2
  future_map(.x = x, .f = ~ .x + y, .options = future_options(globals = "x"))

However, in the example below it seems to work fine, and I don't really understand why all the objects defined within a function are essentially deemed to be "globals":

test_fn <- function() {
  x <- c(1,2)
  y <- 2
  future_map(.x = x, .f = ~ .x + y, .options = future_options(globals = "x"))
}

test_fn()

The problem I am facing with this is that I may have sizable objects in my function, but I don't want them to be exported to every worker, as that materially degrades performance.
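A hedged sketch of one workaround: the ~ lambda's environment is the calling function's frame, and that whole frame travels to the workers, so dropping (or never creating) large objects before the future_map() call keeps them off the wire (big below is illustrative):

test_fn <- function() {
  x <- c(1, 2)
  y <- 2
  big <- runif(1e7)  # sizable object not needed on the workers
  rm(big)            # remove it before the frame is captured
  future_map(x, ~ .x + y)
}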

furrr should never evaluate arguments to .f early (i.e. use in NSE)

Hi,
I seem to be having an issue with dplyr's quasiquotation in furrr that doesn't occur in purrr.

This is a pretty niche example and I'm sure there's a better way of doing this without the dummy_argument setup, but presumably the functionality of the two packages should be similar regardless.

library(dplyr)
library(purrr)
library(furrr)
plan(multiprocess)

df_1 <- data.frame(neighbour_pet = c('horse', 'rattlesnake', 'cat'),
                   my_pet = c('dog', 'cat', 'mouse'))

filter_function <- function(data, filter_var, dummy_argument){
  filter_var <- enquo(filter_var)
  
  ## Stuff happens here ##
  
  
  data <- data %>% 
    filter(!!filter_var == 'dog')
  return(data)
}

df_filtered_purrr <- 1:100 %>% 
  map(filter_function, filter_var = my_pet, data = df_1)


df_filtered_furrr <- 1:100 %>% 
  future_map(filter_function, filter_var = my_pet, data = df_1)

df_filtered_purrr works fine whilst df_filtered_furrr leads to this error:

Error in do.call(call, args = c(list(".f"), list(...))) :
object 'my_pet' not found

In reality I'm running simulations with slightly different data each time, and I need to use filter() within filter_function after I've perturbed the data. This isn't a big issue as I can just use map(), but I thought it best to share. I tried using future_options but it wasn't immediately obvious what to do - apologies if this is just a result of my incompetence.
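In the meantime, a hedged workaround sketch: passing the column as a string and using the .data pronoun avoids any NSE in the arguments furrr has to forward:

filter_function2 <- function(data, filter_var, dummy_argument) {
  data %>%
    filter(.data[[filter_var]] == 'dog')
}

df_filtered_furrr <- 1:100 %>%
  future_map(filter_function2, filter_var = "my_pet", data = df_1)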

Thanks for all your work on the package!

Automatically set plan() based on OS

Awesome package, thanks for this! I would like to implement it in an internal package at work. The problem is that some users are on Mac, while others are on Windows.

From the README:

# You set a "plan" for how the code should run. The easiest is multiprocess
# On Mac this picks plan(multicore) and on Windows this picks plan(multisession)
plan(multiprocess)

Is there a reason you do not auto-detect the OS and set the plan() accordingly?
And if this auto-detection is something you do not want by default, could we perhaps expand furrr::future_options() so you could do something like:

future_map(c(2, 2, 2), ~Sys.sleep(.x), .options = future_options(plan = "auto-detect"))

If you agree this is something that could be improved, I'd be happy to give it a try and make a pull request :)

Cannot allocate memory on an interactive Slurm node

Trying to run a benchmark on an interactive Slurm node with 40 cores. The code simply reads 48 CSV files, rbinds them, and spits out a new tibble of around 2.5 GB. The aim is to compare serial vs parallel speeds.

Serial version, using purrr is working fine.

library(tidyverse)
library(furrr)
library(microbenchmark)

files <- list.files("/scratch/FILELOCATION/csvs",full.names = T)
ptm <- proc.time()
microbenchmark::microbenchmark(joined_dataset <- purrr::map_df(files, read_csv), times = 1000, unit = "s")
proc.time() - ptm

The furrr version, on the other hand, works fine if I run the code once, i.e. without microbenchmark().

library(tidyverse)
library(furrr)
library(microbenchmark)

plan(multicore)
files <- list.files("/scratch/c.c1541911/csvs",full.names = T)
ptm <- proc.time()
microbenchmark(joined_dataset <- furrr::future_map_dfr(files, read_csv), times = 1000, unit = "s")
proc.time() - ptm

But whenever I wrap microbenchmark() around the furrr version to compare with the serial one, after several runs the job stops and throws the following error:

Error in mcfork(detached) : 
  unable to fork, possible reason: Cannot allocate memory
Calls: microbenchmark ... run.MulticoreFuture -> do.call -> <Anonymous> -> mcfork
Execution halted
Error while shutting down parallel: unable to terminate some child processes

Not sure why I'm getting a memory error, because I think furrr handles gc automatically. Even if that were not the case, joined_dataset should be overwritten, so I don't think that's the issue.

sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server 7.4 (Maipo)

Matrix products: default
BLAS: /apps/languages/R/3.5.1/el7/AVX512/intel-2018/lib64/R/lib/libRblas.so
LAPACK: /apps/languages/R/3.5.1/el7/AVX512/intel-2018/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.5.1

Any ideas what's going wrong?

options change ignored in future_map

I have already asked this question on Stack Overflow but didn't get any response, so apologies for the cross-posting. The question is updated with a reproducible example.

I am trying to use future_map from the furrr package within a function I wrote. The function depends on a particular option set in options(). When I change this option to something else, the change is completely ignored when the function runs under plan(multisession). Is there a way to "inform" every worker of the option change? If I change back to plan(sequential), the function works as expected.

Here is a reproducible example:

# Two functions 
abc <- function(n) {
   n * getOption("tst_value")
}
      
future_abc_mult <- function(n) {
   future_map(n, ~abc(.))
}

# Load packages:
library(furrr)

# Set options:
options(tst_value = 1)

# Try using default plan; sequential  
n = c(1,2,3,4)
future_abc_mult(n)

# Try using multiprocess plan  
plan(multiprocess)
future_abc_mult(n)
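A hedged workaround sketch: multisession workers are fresh R processes that do not inherit the main session's options(), so re-establish the option inside the mapped function (capturing the value in the main session first):

future_abc_mult <- function(n) {
  tst_value <- getOption("tst_value")  # captured in the main session
  future_map(n, function(x) {
    options(tst_value = tst_value)     # re-set on each worker
    abc(x)
  })
}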

Here is my sessionInfo

sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] furrr_0.1.0 future_1.9.0 dplyr_0.7.4 purrr_0.2.5 readr_1.1.1
[6] tidyr_0.7.2 tibble_1.3.4 ggplot2_2.2.1 tidyverse_1.1.1

loaded via a namespace (and not attached):
[1] Rcpp_0.12.13 cellranger_1.1.0 compiler_3.4.2 plyr_1.8.4 bindr_0.1
[6] forcats_0.2.0 tools_3.4.2 digest_0.6.15 lubridate_1.6.0 jsonlite_1.5
[11] nlme_3.1-131 gtable_0.2.0 lattice_0.20-35 pkgconfig_2.0.1 rlang_0.2.1
[16] psych_1.7.8 yaml_2.1.14 parallel_3.4.2 haven_1.1.0 bindrcpp_0.2
[21] xml2_1.1.1 stringr_1.2.0 httr_1.3.1 globals_0.12.1 hms_0.3
[26] grid_3.4.2 glue_1.1.1 listenv_0.7.0 R6_2.2.2 readxl_1.0.0
[31] foreign_0.8-69 modelr_0.1.1 reshape2_1.4.3 magrittr_1.5 codetools_0.2-15
[36] scales_0.5.0 rvest_0.3.2 assertthat_0.2.0 mnormt_1.5-5 colorspace_1.3-2
[41] stringi_1.1.5 lazyeval_0.2.0 munsell_0.4.3 broom_0.4.2

Error: a forked child should not open a graphics device

I am trying to run a multiprocess plan on a Linux machine, in which each worker should produce a plot using ggplot2, and I am getting the error in the title.
The traceback is not very helpful:

> traceback()
14: stop(FutureEvaluationError(future))
13: value.Future(tmp, ...)
12: NextMethod("value")
11: value.MulticoreFuture(tmp, ...)
10: value(tmp, ...)
9: values.list(fs)
8: values(fs)
7: multi_resolve(fs, names(.x))
6: future_map_template(purrr::map, "list", .x, .f, ..., .progress = .progress, 
       .options = .options)
5: future_map(apply(crossing(years, months), 1, paste, collapse = ""), 
       read_and_plot) at plot_month_ts.R#93
4: eval(ei, envir)
3: eval(ei, envir)
2: withVisible(eval(ei, envir))
1: source("plot_month_ts.R")

Here is an MRE that gives the error (it works if executed sequentially):

library(ggplot2)
library(furrr)
f = function(n) {
    p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
    ggsave(p, filename = paste0("test", n, ".png"))
}
plan(multiprocess)
future_map(1:20, f)

The same code sometimes will give this error: Error: X11 fatal IO error: please save work and shut down R
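A hedged workaround: on this platform plan(multiprocess) picks multicore, i.e. forked children, which cannot safely touch graphics devices. multisession workers are separate R processes and can, at the cost of extra startup overhead:

plan(multisession)
future_map(1:20, f)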

> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 18

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] purrr_0.2.5        furrr_0.1.0        future_1.7.0       ggplot2_2.2.1.9000 pacman_0.4.6      
[6] colorout_1.1-2    

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.17        bindr_0.1.1.9000    magrittr_1.5        tidyselect_0.2.4    munsell_0.5.0      
 [6] colorspace_1.3-2    R6_2.2.2            rlang_0.2.1.9000    plyr_1.8.4          dplyr_0.7.6        
[11] tools_3.4.4         globals_0.11.0      parallel_3.4.4      grid_3.4.4          gtable_0.2.0       
[16] withr_2.1.2         lazyeval_0.2.1      assertthat_0.2.0    digest_0.6.15       tibble_1.4.2       
[21] bindrcpp_0.2.2.9000 codetools_0.2-15    glue_1.3.0          labeling_0.3        compiler_3.4.4     
[26] pillar_1.2.3        scales_0.5.0.9000   listenv_0.7.0       pkgconfig_2.0.1

cpu utilization

I'm seeing some weird cpu utilization when future_map is called more than once.

First, the speed-up on the first parallel execution is about 1.77-fold (163.6s sequential vs 92.2s parallel with 20 cores via plan(multiprocess)). Looking at the cpu utilization, the work seems largely over in about 27s with some workers bouncing around a bit at the end.

The second parallel execution takes 135.6s for a 1.20-fold speed-up. That cpu usage is very strange and exhibits a similar "bouncy" phase at the end.

[CPU utilization plot]

All the code and data are attached.

furrr_benchmark.zip

Let me know if you need more information.

> availableCores()
system 
    20 
> session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value                       
 version  R version 3.4.3 (2017-11-30)
 os       macOS High Sierra 10.13.3   
 system   x86_64, darwin15.6.0        
 ui       X11                         
 language (EN)                        
 collate  en_US.UTF-8                 
 tz       America/New_York            
 date     2018-04-15

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version    date       source                             
 abind         1.4-5      2016-07-21 CRAN (R 3.4.0)                     
 AmesHousing * 0.0.3      2017-12-28 Github (topepo/AmesHousing@97bb8cc)
 assertthat    0.2.0      2017-04-11 CRAN (R 3.4.0)                     
 bindr         0.1        2016-11-13 CRAN (R 3.4.0)                     
 bindrcpp      0.2        2017-06-17 CRAN (R 3.4.0)                     
 broom       * 0.4.4      2018-03-29 CRAN (R 3.4.3)                     
 class         7.3-14     2015-08-30 CRAN (R 3.4.3)                     
 clisymbols    1.2.0      2017-05-21 CRAN (R 3.4.0)                     
 codetools     0.2-15     2016-10-05 CRAN (R 3.4.3)                     
 colorspace    1.3-2      2016-12-14 CRAN (R 3.4.0)                     
 CVST          0.2-1      2013-12-10 CRAN (R 3.4.0)                     
 ddalpha       1.3.2      2018-04-08 cran (@1.3.2)                      
 DEoptimR      1.0-8      2016-11-19 CRAN (R 3.4.0)                     
 digest        0.6.15     2018-01-28 cran (@0.6.15)                     
 dimRed        0.1.0.9001 2018-03-20 local (topepo/dimRed@NA)           
 dplyr       * 0.7.4      2017-09-28 CRAN (R 3.4.2)                     
 DRR           0.0.3      2018-01-06 cran (@0.0.3)                      
 foreign       0.8-69     2017-06-22 CRAN (R 3.4.3)                     
 furrr       * 0.1.0      2018-04-14 Github (DavisVaughan/furrr@ff33338)
 future      * 1.8.0      2018-04-08 cran (@1.8.0)                      
 geometry      0.3-6      2015-09-09 cran (@0.3-6)                      
 ggplot2       2.2.1      2016-12-30 CRAN (R 3.4.0)                     
 globals       0.11.0     2018-01-10 cran (@0.11.0)                     
 glue          1.2.0.9000 2018-04-11 Github (tidyverse/glue@99e0171)    
 gower         0.1.2      2017-02-23 CRAN (R 3.4.0)                     
 gtable        0.2.0      2016-02-26 CRAN (R 3.4.0)                     
 ipred       * 0.9-6      2017-03-01 CRAN (R 3.4.0)                     
 kernlab       0.9-25     2016-10-03 CRAN (R 3.4.0)                     
 knitr         1.20       2018-02-20 CRAN (R 3.4.3)                     
 lattice       0.20-35    2017-03-25 CRAN (R 3.4.3)                     
 lava          1.5.1      2017-09-27 CRAN (R 3.4.2)                     
 lazyeval      0.2.1      2017-10-29 CRAN (R 3.4.2)                     
 listenv       0.7.0      2018-01-21 cran (@0.7.0)                      
 lubridate   * 1.7.3      2018-02-27 cran (@1.7.3)                      
 magic         1.5-8      2018-01-26 cran (@1.5-8)                      
 magrittr      1.5        2014-11-22 CRAN (R 3.4.0)                     
 MASS          7.3-47     2017-02-26 CRAN (R 3.4.3)                     
 Matrix        1.2-12     2017-11-20 CRAN (R 3.4.3)                     
 MLmetrics     1.1.1      2016-05-13 CRAN (R 3.4.0)                     
 mnormt        1.5-5      2016-10-15 CRAN (R 3.4.0)                     
 munsell       0.4.3      2016-02-13 CRAN (R 3.4.0)                     
 nlme          3.1-131    2017-02-06 CRAN (R 3.4.3)                     
 nnet          7.3-12     2016-02-02 CRAN (R 3.4.3)                     
 pillar        1.2.1      2018-02-27 CRAN (R 3.4.3)                     
 pkgconfig     2.0.1      2017-03-21 CRAN (R 3.4.0)                     
 plyr          1.8.4      2016-06-08 CRAN (R 3.4.0)                     
 pROC          1.10.0     2017-06-10 CRAN (R 3.4.0)                     
 prodlim       1.6.1      2017-03-06 CRAN (R 3.4.0)                     
 psych         1.8.3.3    2018-03-30 cran (@1.8.3.3)                    
 purrr       * 0.2.4      2017-10-18 CRAN (R 3.4.2)                     
 R6            2.2.2      2017-06-17 CRAN (R 3.4.0)                     
 Rcpp          0.12.16    2018-03-13 cran (@0.12.16)                    
 RcppRoll      0.2.2      2015-04-05 CRAN (R 3.4.0)                     
 recipes       0.1.2      2018-01-11 CRAN (R 3.4.3)                     
 reshape2      1.4.3      2017-12-11 CRAN (R 3.4.3)                     
 rlang         0.2.0.9001 2018-04-14 Github (tidyverse/rlang@82b2727)   
 robustbase    0.92-8     2017-11-01 CRAN (R 3.4.2)                     
 rpart         4.1-11     2017-03-13 CRAN (R 3.4.3)                     
 rsample     * 0.0.2      2017-11-12 CRAN (R 3.4.2)                     
 scales        0.5.0      2017-08-24 CRAN (R 3.4.1)                     
 sessioninfo * 1.0.0      2017-06-21 CRAN (R 3.4.1)                     
 sfsmisc       1.1-2      2018-03-05 cran (@1.1-2)                      
 stringi       1.1.6      2017-11-17 CRAN (R 3.4.2)                     
 stringr       1.3.0      2018-02-19 CRAN (R 3.4.3)                     
 survival      2.41-3     2017-04-04 CRAN (R 3.4.3)                     
 tibble        1.4.2      2018-01-22 cran (@1.4.2)                      
 tictoc      * 1.0        2014-06-17 CRAN (R 3.4.0)                     
 tidyr       * 0.8.0      2018-01-29 cran (@0.8.0)                      
 tidyselect    0.2.4      2018-02-26 cran (@0.2.4)                      
 timeDate      3043.102   2018-02-21 cran (@3043.10)                    
 withr         2.1.2      2018-04-11 Github (r-lib/withr@79d7b0d)       
 yardstick   * 0.0.1      2017-11-12 CRAN (R 3.4.2)   

Feature request: Implement `future_rerun()`

It's a bit beyond my understanding of how all this is implemented to know whether this is a big or small ask, but it would be nice to be able to take simulation-style tasks like this (slightly edited from the rerun() docs):

samples <- 10 %>%  
  rerun(x = rnorm(5), y = rnorm(5)) 
samples %>% map_dbl(~ cor(.x$x, .x$y))

and translate them directly to future_ versions:

samples <- 10 %>%  
  future_rerun(x = rnorm(5), y = rnorm(5)) 
samples %>% future_map_dbl(~ cor(.x$x, .x$y))

It is possible to do this without rerun(), but it feels a little awkward:

samples <- 10 %>% 
  seq_len() %>% 
  future_map(~ list(x = rnorm(5), y = rnorm(5)))
samples %>% future_map_dbl(~ cor(.x$x, .x$y))

Naming df when using future_map2

Hi Davis,

  • I use future_map2 to read a bunch of files (time reduced from 31 mins to 5 mins using an 8-core machine - amazing!)
  • My issue is how to name the df, so that I can search for it by name within the returned list
  • Example:

test <- function(x, name){
require(tidyverse)
z <- data.frame(x+1) %>% stats::setNames(., "a")
return(z)
}

furrr::future_map2(1:3, c("a", "b", "c"), ~test(.x, .y))

  • I need a piece of code within the function that will give the df a name
  • In the original function, some files are LARGE and others small, so reading times differ between files
  • Hence, I am not sure whether the dfs are jumbled within the list, making it unsafe to use names(list) <- c("a", "b", "c")

Any ideas?

Please HELP!

I have posted on SO:
https://stackoverflow.com/questions/50187445/how-to-name-a-name-a-dataframe-so-that-i-can-look-for-it-within-a-list#50187522
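A hedged note: future_map2() returns results in the same order as its inputs regardless of how long each element takes, and, like purrr, it should carry input names through to the output, so naming up front ought to be safe (test as defined above):

inputs <- purrr::set_names(1:3, c("a", "b", "c"))
out <- furrr::future_map2(inputs, names(inputs), ~ test(.x, .y))
names(out)  # expected: "a" "b" "c"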

Does furrr have to pass the FULL list as a global?

So I've noticed that it seems like furrr makes the list that it's iterating over a global variable to pass on to ALL the futures. The list that I'm iterating over is much larger than the maximum size in future.globals.maxSize, but the individual elements of the list are much much smaller.

At least conceptually, at a high level, it seems that the subprocesses being spawned with future shouldn't need the huge list passed to them. Is there any way to implement this?

Here's an example of the problem I'm facing:

library(furrr)
big_thing <- rep(list(1:1e7),100)
future:::objectSize(big_thing) # 4000004000 bytes = ~ 3.7 GiB
future:::objectSize(big_thing[[1]]) # 40000040 bytes = ~ 38 MiB

test <- future_map(big_thing, ~.)

This gives me the error:

Error in getGlobalsAndPackages(expr, envir = envir, tweak = tweakExpression,  : 
  The total size of the 7 globals that need to be exported for the future expression (‘{; ...future.f.env <- environment(...future.f); if (!is.null(...future.f.env$`~`)) {; if (is_bad_rlang_tilde(...future.f.env$`~`)) {; ...future.f.env$`~` <- base::`~`; }; ...; .out; }); }’) is 3.73 GiB. This exceeds the maximum allowed size of 500.00 MiB (option 'future.globals.maxSize'). The three largest globals are ‘...future.x_ii’ (3.73 GiB of class ‘list’), ‘is_bad_rlang_tilde’ (3.85 KiB of class ‘function’) and ‘...future.map’ (2.20 KiB of class ‘function’).

Is there any way to make it so that furrr only has to pass the individual items in as global variables, one at a time, so we don't have this issue?

(I'm using furrr_0.1.0, future_1.9.0, and R version 3.3.3)
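A hedged workaround: the chunks of .x are exported to the workers as the ...future.x_ii global (visible in the error above), so their size counts against future.globals.maxSize. If the machine has the memory, raising the limit unblocks the map:

options(future.globals.maxSize = 4 * 1024^3)  # ~4 GiB, in bytes
test <- future_map(big_thing, ~ .)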

Feature request: "time" progress bars

I recently discovered furrr, and love how easy it is to reconfigure the parallel processing scheme as I move between my laptop and an HPC cluster. I also really like the fact that I can use both parallel processing AND a progress bar. I've been using plyr, which plays nicely with either parallel processing or a progress bar but not both.

I would like to request more variety in the progress bars. In particular, my preferred progress bar in plyr was the 'time' bar, which included an estimate of the time until completion. This was implemented using plyr::progress_time() (link). The successor is dplyr::progress_estimated() (link).

future_invoke_map vs invoke_map

Hi Davis,

Maybe I am missing something obvious here, but I stumbled upon this today while trying to parallelize my existing code:

library('furrr')
library('purrr')
library('dplyr')

df <- tibble(
  f = c("runif"),
  params = list(
    list(list(10), list(10))
))

# correct return
invoke_map(df$f, flatten(df$params))

# only returns result from first list element
future_invoke_map(df$f, flatten(df$params))

purrr::invoke_map() returns the result I would expect, i.e. 2x10 random numbers, whereas future_invoke_map() only returns the 10 numbers for the first list element in the nested tibble.
A workaround is to unnest the tibble first, but as furrr usually works as a drop-in replacement, this could maybe be fixed?

Cheers,
Marco

future_map is surprisingly slow

library(furrr)
#> Warning: package 'furrr' was built under R version 3.4.4
#> Loading required package: future
#> Warning: package 'future' was built under R version 3.4.4
library(purrr)
plan(multiprocess)

boot_df <- function(x) x[sample(nrow(x), replace = T), ]
rsquared <- function(mod) summary(mod)$r.squared
boot_lm <- function(i) {
  rsquared(lm(mpg ~ wt + disp, data = boot_df(mtcars)))
}

system.time(map(1:500, boot_lm))
#>    user  system elapsed 
#>   0.470   0.006   0.477
system.time(future_map(1:500, boot_lm))
#>    user  system elapsed 
#>   0.716   0.197   0.914
system.time(parallel::mclapply(1:500, boot_lm, mc.cores = 4))
#>    user  system elapsed 
#>   0.893   0.612   0.214

What am I missing?
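A hedged note on what may be missing: each future pays fixed costs (globals discovery, export, dispatch, result collection) that dominate when one element takes well under a millisecond, as these lm() fits do. With more work per element the parallel version should pull ahead; a sketch:

boot_lm_batch <- function(i) purrr::map_dbl(1:100, boot_lm)  # 100x work per element
system.time(map(1:50, boot_lm_batch))
system.time(future_map(1:50, boot_lm_batch))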

show progress bar for sequential execution

Especially if the user is unsure about how long a query will take, it would be great if a progress bar could be displayed even for sequential execution.

I guess there is a reason why it is not implemented for sequential backends, but it would still be great to have it! 😄

Error when using batchtools futures and progress bar

Can't give an MWE, as I was testing this on a Slurm cluster.

my_list %>%
  future_map(~{my_fun}, .progress = TRUE)
Error in mutate_impl(.data, dots) :
  Evaluation error: BatchtoolsError in BatchtoolsFuture ('<none>'):Error in file(temp_file, "a") : cannot open the connection.

Maybe @mllg can immediately see what's going on or if the furrr progress bar can't work with batchtools.

Error in getGlobalsAndPackages(expr, envir = envir, globals = TRUE)

I stumbled upon an error using future_map that I could not resolve.

# sample data frame 1
df1 <- data.frame(
    x1 = c(1,2),
    x2 = c(3,4),
    y1 = c(1,2),
    y2 = c(3,4)
)

# sample dataframe2 
df2 <- data.frame(
    x1 = c(1,2),
    x2 = c(3,4),
    y1 = c(1,2),
    y2 = c(3,4)
)

# put both dataframes in a list to be able to apply the future_map() function
df <- list(df1, df2)

# make a function
f1 <- function(arg) {
    arg1 <- pmap(
        arg[c(1,2,3,4)], 
        ~ c(...) %>% 
            matrix(., ncol = 2, byrow = TRUE) %>% 
            st_linestring
    ) 
    
    arg1 <- do.call(st_sfc, arg1)
    arg <- mutate(arg, geometry = arg1)
}

plan(multisession)

# apply the function on the list
future_map(df, f1)

The error is the following:

Error in getGlobalsAndPackages(expr, envir = envir, globals = TRUE) : 
  Did you mean to create the future within a function?  Invalid future expression tries to use global '...' variables that do not exist: .f()

I read in another issue for the future package:
HenrikBengtsson/future#176
that this might be due to a built-in check. So I tried applying that fix, wrapping the function in local():

# make a function
f1 <- local(function(arg) {
    arg1 <- pmap(
        arg[c(1,2,3,4)], 
        ~ c(...) %>% 
            matrix(., ncol = 2, byrow = TRUE) %>% 
            st_linestring
    ) 
    
    arg1 <- do.call(st_sfc, arg1)
    arg <- mutate(arg, geometry = arg1)
})

future_map(df, f1)

Unfortunately, this fix didn't work and the same error appeared.
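A hedged alternative sketch: the ~ c(...) lambda inside pmap() contains a literal `...` that future's globals check apparently cannot resolve, which matches the error text. An explicit function(...) keeps `...` local to the function and may satisfy the check:

f1 <- function(arg) {
  arg1 <- pmap(arg[c(1, 2, 3, 4)], function(...) {
    st_linestring(matrix(c(...), ncol = 2, byrow = TRUE))
  })
  mutate(arg, geometry = do.call(st_sfc, arg1))
}

future_map(df, f1)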

how many cores/processes are being spawned?

Hello @DavisVaughan ,

Thanks for this wonderful package. I wonder if there is any way for me to set (or monitor) how many cores or processes are used/spawned by furrr? I am using Windows.

My usual use case is to nest() a tibble and apply some function to the grouped dataframes. Hence, I mostly have embarrassingly parallel problems here.

Thanks!
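For reference, a hedged sketch: the worker count comes from the future plan, and future ships helpers to inspect and set it:

library(future)

availableCores()                 # cores future detects on this machine
plan(multisession, workers = 4)  # explicitly cap the spawned R sessions
nbrOfWorkers()                   # workers used by the current plan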

future_map() failing vs. map()

Hi Davis,

Trying to convert my function from purrr::map to furrr::future_map to increase speed.
I have found an instance, though, where it keeps failing with future_map, as it cannot find the variable "date22" within the nested data; map, however, does recognize it is there and has no problem with it.

I could not replicate this using a reprex because it was very specific to the data set that I had.
(My first issue raised, so apologies if it does not follow the standard.)
[screenshot of the error]
