jacobkap / fastdummies

The goal of fastDummies is to quickly create dummy variables (columns) and dummy rows.

Home Page: https://jacobkap.github.io/fastDummies/

License: Other


Overview

The goal of fastDummies is to quickly create dummy variables (columns) and dummy rows. Creating dummy variables is possible through base R or other packages, but this package is much faster than those methods.

Installation

To install this package, use the code:

install.packages("fastDummies")

# The development version is available on GitHub.
# install.packages("devtools")
devtools::install_github("jacobkap/fastDummies")

Usage

library(fastDummies)

There are two functions in this package:

  • dummy_cols(), which lets you make dummy variables (dummy_columns() is a clone of dummy_cols())
  • dummy_rows(), which lets you make dummy rows.

Dummy Columns

Dummy variables (or binary variables) are commonly used in statistical analyses and in simpler descriptive statistics. A dummy column is one which has a value of one when a categorical event occurs and a zero when it doesn’t occur. In most cases this is a feature of the event/person/object being described. For example, if the dummy variable was for occupation being an R programmer, you can ask, “is this person an R programmer?” When the answer is yes, they get a value of 1; when it is no, they get a value of 0.

We’ll start with a simple example and then go into using the function dummy_cols(). You can also use the function dummy_columns() which is identical to dummy_cols().

Imagine you have a data set about animals in a local shelter. One of the columns in your data is what animal it is: dog or cat.

animals
dog
dog
cat

To make dummy columns from this data, you would need to produce two new columns. One would indicate if the animal is a dog, and the other would indicate if the animal is a cat. Each row would get a value of 1 in the column indicating which animal they are, and 0 in the other column.

animals  dog  cat
dog      1    0
dog      1    0
cat      0    1

In the function dummy_cols, the names of these new columns are concatenated to the original column and separated by an underscore.

animals  animals_dog  animals_cat
dog      1            0
dog      1            0
cat      0            1

With an example like this, it is fairly easy to make the dummy columns yourself. dummy_cols() automates the process, and is useful when you have many columns to generate dummy variables from or many categories within a column.
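To make the "do it yourself" route concrete, here is a minimal base R sketch for the animals example. It is only an illustration of the idea, not how dummy_cols() is implemented; the object name animals_df is made up for this example, and the columns are created in alphabetical order of the category values.

```r
# Hand-rolled dummy columns in base R (illustrative only).
animals_df <- data.frame(animals = c("dog", "dog", "cat"),
                         stringsAsFactors = FALSE)

for (value in sort(unique(animals_df$animals))) {
  # Name the new column <original column>_<value>, as described above.
  new_name <- paste0("animals_", value)
  # 1 when the row matches this value, 0 otherwise.
  animals_df[[new_name]] <- as.integer(animals_df$animals == value)
}

print(animals_df)
```

This is manageable for one column with two categories, but the loop has to be repeated for every column you want to convert.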

fastDummies_example <- data.frame(numbers = 1:3,
                    gender  = c("male", "male", "female"),
                    animals = c("dog", "dog", "cat"),
                    dates   = as.Date(c("2012-01-01", "2011-12-31",
                                          "2012-01-01")),
                    stringsAsFactors = FALSE)
knitr::kable(fastDummies_example)
numbers  gender  animals  dates
1        male    dog      2012-01-01
2        male    dog      2011-12-31
3        female  cat      2012-01-01

The object fastDummies_example has two character type columns, one integer column, and a Date column. By default, dummy_cols() will make dummy variables from factor or character columns only. This is because in most cases those are the only types of data you want dummy variables from. If those are the only columns you want, then the function takes your data set as the first parameter and returns a data.frame with the newly created variables appended to the end of the original data.

results <- fastDummies::dummy_cols(fastDummies_example)
knitr::kable(results)
numbers  gender  animals  dates       gender_female  gender_male  animals_cat  animals_dog
1        male    dog      2012-01-01  0              1            0            1
2        male    dog      2011-12-31  0              1            0            1
3        female  cat      2012-01-01  1              0            1            0

Dummy Rows

When dealing with data, there are often missing rows. While truly handling missing data is far beyond the scope of this package, the function dummy_rows() lets you add those missing rows back into the data.

The function takes all character, factor, and Date columns, finds all possible combinations of their values, and adds the rows that are not in the original data set. Any columns not used in creating the combinations (e.g. numeric) are given a value of NA (unless otherwise specified with dummy_value).

Let's start with a simple example.

fastDummies_example <- data.frame(numbers = 1:3,
                    gender  = c("male", "male", "female"),
                    animals = c("dog", "dog", "cat"),
                    dates   = as.Date(c("2012-01-01", "2011-12-31",
                                          "2012-01-01")),
                    stringsAsFactors = FALSE)
knitr::kable(fastDummies_example)
numbers  gender  animals  dates
1        male    dog      2012-01-01
2        male    dog      2011-12-31
3        female  cat      2012-01-01

This data set has four columns: two character, one Date, and one numeric. By default, the function uses the character and Date columns to create the combinations. First, a small amount of math to explain the combinations. Each column has two distinct values: gender (male and female), animals (dog and cat), and dates (2011-12-31 and 2012-01-01). To find the number of possible combinations, multiply the numbers of unique values in each column together: 2 * 2 * 2 = 8. Three of these combinations already appear in the data, so dummy_rows() adds the remaining five.

results <- fastDummies::dummy_rows(fastDummies_example)
knitr::kable(results)
numbers  gender  animals  dates
1        male    dog      2012-01-01
2        male    dog      2011-12-31
3        female  cat      2012-01-01
NA       female  cat      2011-12-31
NA       male    cat      2011-12-31
NA       female  dog      2011-12-31
NA       male    cat      2012-01-01
NA       female  dog      2012-01-01
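The combination count can be checked with base R. The sketch below rebuilds the example data, enumerates every combination of the character and Date columns with expand.grid(), and counts the combinations absent from the data. It illustrates the arithmetic only and is not the package's implementation; the names example_df, key_cols, and combo_key are invented for this example.

```r
# Count possible and missing combinations in base R (illustrative only).
example_df <- data.frame(numbers = 1:3,
                         gender  = c("male", "male", "female"),
                         animals = c("dog", "dog", "cat"),
                         dates   = as.Date(c("2012-01-01", "2011-12-31",
                                             "2012-01-01")),
                         stringsAsFactors = FALSE)

key_cols <- c("gender", "animals", "dates")

# Every combination of the unique values in the key columns.
all_combos <- expand.grid(lapply(example_df[key_cols], unique),
                          stringsAsFactors = FALSE)

# Collapse each row of the key columns into one string for comparison.
combo_key <- function(df) do.call(paste, c(df[key_cols], sep = "|"))

missing_combos <- all_combos[!combo_key(all_combos) %in% combo_key(example_df), ]

nrow(all_combos)      # 8 possible combinations
nrow(missing_combos)  # 5 combinations not present in the data
```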

Contributors

jacobkap, pbaylis, teofiln
fastdummies's Issues

change function argument data to .data

With the package still early in its development, it may be of interest to change the data argument to .data, as Hadley does in many of his dplyr functions, to reduce the likelihood that the argument interferes with a user data object.

Frequency-based Variable Dropping

Hi,

Typically, we would want to exclude the one dummy that stands for the most frequently observed category.

E.g., if we have 'small', 'medium', and 'large', and medium is the shirt size worn by 80 percent of the population, then one typically drops the 'medium' dummy in a regression so that the baseline reflects the typical situation and not an outlier.

Would be handy to have a feature in place that allows dropping not just the first but the most frequent category. Should be fairly simple to achieve. But would be neat if integrated directly in the package.

Thanks for considering!
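As a rough illustration of the request, the base R sketch below finds the most frequent category with table() and builds dummies for the remaining ones. It is a hypothetical workaround, not the package's code; the vector sizes and the size_ prefix are invented for this example.

```r
# Drop the most frequent category instead of the first one (illustrative only).
sizes <- c("small", "medium", "medium", "large", "medium")

counts <- table(sizes)
most_frequent <- names(counts)[which.max(counts)]

# Build dummy columns for every category except the most frequent one.
keep <- setdiff(names(counts), most_frequent)
dummies <- sapply(keep, function(level) as.integer(sizes == level))
colnames(dummies) <- paste0("size_", keep)

most_frequent
dummies
```

Here 'medium' becomes the reference category, so rows with all zeros correspond to the typical case.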

Feature Request: Code NA as NA

Setting the ignore_na argument to "TRUE" appears to set the value for all dummy codes to 0. It'd be helpful to have an additional option to code NA values appearing in the source column as NA in all corresponding dummy columns. (This would be consistent with psych::dummy.code().) In many cases, if I don't know what the original value was, I don't want to make assumptions about what that value is in the corresponding dummy columns.
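For comparison, a plain equality test in base R already propagates NA, which is the behaviour requested here. The sketch below is a hand-written illustration under that assumption, not a description of any fastDummies option; the names colour and coded are invented for this example.

```r
# Dummy coding that keeps NA as NA (illustrative only).
colour <- c("red", NA, "blue")

coded <- data.frame(colour = colour, stringsAsFactors = FALSE)
for (level in sort(unique(colour[!is.na(colour)]))) {
  # NA == "red" evaluates to NA, so missing inputs stay missing.
  coded[[paste0("colour_", level)]] <- as.integer(colour == level)
}

print(coded)
```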

change group separator?

hi, I am currently using your package (good work :) ) a lot and was wondering if it is possible to change group separator from "_" to "."?

Get generated fastdummies variables

Hi all, it would be very useful to be able to get the names of the variables generated by fastDummies. In some analyses we need to know the new variables. It is possible to get them with some regex plus the old variable names, but I think it would be better to have a function here that handles that.

Thx!
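Until such a helper exists, one regex-free workaround is to compare the column names before and after the dummies are created. The sketch below builds the dummies by hand so that it runs without the package; with dummy_cols() the same setdiff() comparison should apply to its input and output. The names before and after are invented for this example.

```r
# Recover the generated column names by comparing names (illustrative only).
before <- data.frame(animals = c("dog", "dog", "cat"),
                     stringsAsFactors = FALSE)

after <- before
for (value in sort(unique(before$animals))) {
  after[[paste0("animals_", value)]] <- as.integer(before$animals == value)
}

# Columns present in the result but not in the input are the generated ones.
generated <- setdiff(names(after), names(before))
generated
```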

Error in setalloccol(ans) : can't set ALTREP truelength

Hello,

I have been trying to re-compile some old code that I created on R 3.6.2, but have since started using R 4.0.2. Much of the code uses dummy_cols(), which has worked well in the past, but now I get the error Error in setalloccol(ans) : can't set ALTREP truelength when performing a standard dummy_cols() operation. I've spoken to other people at my organization who are experiencing the same issue. I've tried reverting to R 3.6, un-installing and re-installing fastDummies, to no avail.

An update to fix this would be very much appreciated. Thanks!

Error with split argument

Hi Jacob,

Sorry for the delay, but I finally got to testing this and have come across an error using split. Here is a reproducible example using a portion of a dataset I'm working with:

test <- structure(list(
  Theory = structure(c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 6L, 6L, 6L, 9L, 9L, NA, 1L, 1L, NA, NA, NA, NA, NA, 1L, 1L, 6L, NA, 1L, 1L, 1L, NA, NA, 1L, 1L, NA, 2L, NA, 1L, 1L, 4L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, NA, NA, 1L, 1L, 1L, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, NA, NA, 1L, NA, NA, 2L, NA, NA, NA, NA, 9L, 9L, 1L, 1L, 1L, 6L, 6L, 1L, 1L, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, NA, NA, NA, NA, NA, 1L, 1L, 1L, 8L, 1L, NA, 6L, 1L, 1L, 1L, NA, NA, NA, NA, NA, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, NA, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 1L, 1L, 1L, 1L, NA, 1L, 8L, NA, 8L, 8L, NA, NA, NA, NA, 2L, 1L, 2L, 10L, 1L, 1L, 1L, 1L, 1L, NA, NA, NA, 6L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 6L, NA, NA, NA, NA, NA, NA, 1L, NA, 9L, NA, NA, NA, 1L, 1L, 1L, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, NA, 1L, NA, 1L, NA, 1L, 1L, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1L, 1L, NA, 9L, 9L, 9L, 9L, 9L, 9L, 1L, 1L, 1L, 1L, 2L, NA, NA, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 6L, 6L, 6L, 6L, 6L, 7L, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, NA, 1L, 1L, NA, 1L, 1L, 1L, 1L, 1L), .Label = c("Behaviourism", "Behaviourism, Cognitive", "Behaviourism, Gestalt", "Behaviourism, Psychodynamic", "Behaviourism, Psychodynamic, Cognitive", "Cognitive", "Functionalism", "Gestalt", "Psychodynamic", "Structuralism"), class = "factor"), 
  Format = structure(c(1L, 1L, 24L, 1L, 1L, 1L, 1L, 2L, 1L, 10L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 12L, 1L, 1L, 2L, 1L, 1L, 19L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 6L, 1L, 1L, 1L, 1L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 12L, 1L, 1L, 1L, 5L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 11L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 15L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 13L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 7L, 12L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 8L, 1L, 1L, 1L, 1L, 1L, 5L, 1L, 1L, 1L, 1L, 7L, 1L, 1L, 15L, 1L, 5L, 25L, 5L, 24L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 15L, 1L, 1L, 1L, 1L, 20L, 1L, 18L, 12L, 1L, 1L, NA, 20L, 20L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 25L, 15L, 16L, 15L, 15L, 1L, 1L, 1L, 1L, 19L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 12L, 12L, 5L, 5L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 14L, 1L, 1L, 1L, 1L, 14L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 15L, 12L, NA, 15L, 1L, NA, NA, 1L, 1L, 6L, 1L, 1L, 1L, 1L, 14L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 12L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 22L, 1L, 21L, 23L, 5L, 1L, 1L, 1L, 1L, 10L, 1L, 1L, 1L, 1L, 5L, 17L, 1L, 17L, 6L, 1L, 1L, 9L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 12L, 1L, 18L, 1L, 21L, 18L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 5L, 12L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 24L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 6L, 1L, 1L, 1L, 1L, 1L, 4L, 1L, 1L, 1L), .Label = c("16mm", "16mm, 35mm", "16mm, 35mm, VHS", "16mm, AVI", "16mm, Digital", "16mm, DVD", "16mm, DVD, Betacam SP", "16mm, DVD, Digital, Betacam SP", "16mm, DVD, Mini-DV", "16mm, MP4", "16mm, MPG", "16mm, VHS", "16mm, 
VHS, AVI", "16mm, VHS, Digital", "16mm, VHS, DVD", "16mm, VHS, DVD, Digital, AVI", "35mm", "8mm", "8mm, 16mm", "DVD", "DVD, AVI", "Mini-DV", "MPG", "VHS", "VHS, DVD, Digital"), class = "factor")), 
.Names = c("Theory", "Format"), row.names = c(NA, -427L), 
class = c("tbl_df", "tbl", "data.frame"))

### 

library(fastDummies)
levels(test$Theory) # Shows 10 combinations used of 6 different theories
dummy_cols(test, select_columns = "Theory", split = ", ") # Error in strsplit(.data[[col_name]], split) : non-character argument
dummy_cols(test, select_columns = "Theory", split = ",")  # Same error

If I leave out the split argument, I get all of the new columns as I would expect:

new <- dummy_cols(test, select_columns = "Theory")
names(new)

Any ideas what is causing the strsplit() error? I tested this with the version of fastDummies on CRAN (v1.4.0).
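One likely explanation, offered as an assumption rather than a confirmed diagnosis of the package: Theory is a factor, and base R's strsplit() only accepts character vectors. The sketch below reproduces the error on a small factor and shows that converting with as.character() first avoids it. The object names are invented for this example.

```r
# strsplit() rejects factors but accepts character vectors (illustrative only).
theory <- factor(c("Behaviourism, Cognitive", "Gestalt", NA))

# Calling strsplit() directly on the factor raises an error; capture its message.
factor_attempt <- tryCatch(strsplit(theory, ", "),
                           error = function(e) conditionMessage(e))

# Converting to character first gives one vector of pieces per row.
pieces <- strsplit(as.character(theory), ", ")

factor_attempt
pieces
```

If this is the cause, converting the column with as.character() before calling dummy_cols() may be a workaround.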

But

x <- c("a", "b")
fastDummies::dummy_cols(x, remove_first_dummy = TRUE, remove_selected_columns = TRUE)

data frame with 0 columns and 0 rows

sort_columns = TRUE does not work

I just pulled both the current production version and dev version of fastDummies and neither one currently has sort_columns = TRUE working correctly. For example, if I use this option in your example, I get the following:

> dummy_cols(fastDummies_example, sort_columns = TRUE)
  numbers gender animals      dates gender_ animals_
1       1   male     dog 2012-01-01       0        0
2       2   male     dog 2011-12-31       0        0
3       3 female     cat 2012-01-01       0        0

Can this be fixed?

Thanks. I love this package!

Sort dummy columns following factor order.

It would be useful for me to have the dummy columns this package creates (optionally) follow the ordering of the original factor variables. Maybe also true for others. For example, this issue could be resolved as well. I wrote some code to do this, happy to submit a PR if that's something you're willing to add.

dummy_cols creates dummies for missing values

Hi,

I was using dummy_cols and the function is creating values for the missing values. Then I have more data for the new columns than the original data.

Is there any way to avoid creating values for the NAs?

Thanks

sort new dummy columns by numeric value

Hi,
I'm using dummy_cols on a column of week (values 1-52) in order to make a table of 1's and 0's for an occupancy analysis. Is there an easy way for the new columns to be sorted on the numeric value (or an ordered factor level) of the column I'm using? Looking at the code, it looks like it converts the values of the specified column to a character string. But if that is the case, wouldn't the ordering only be alphabetical?
Thanks,
David

Can we create dummies from multiple character values

Suppose we have a data frame:

fastDummies_example <- data.frame(owner = 1:4,
                                  Pets  = c("dog", "dog, cat, hamster", "cat",
                                            "hamster"),
                                  stringsAsFactors = FALSE)

Owner 2 has multiple pets, so could we assign dummies for each of those strings? We should then have:

owner  Pets               Pets_dog  Pets_cat  Pets_hamster
1      dog                1         0         0
2      dog, cat, hamster  1         1         1
3      cat                0         1         0
4      hamster            0         0         1
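A base R sketch of the requested behaviour: split each string on the delimiter, then create one dummy per distinct piece. This is a hand-written illustration of the idea, not the package's split implementation; pets_df and pet_lists are names invented for this example, and the delimiter is assumed to be a comma followed by a space.

```r
# Dummies from multi-valued strings (illustrative only).
pets_df <- data.frame(owner = 1:4,
                      Pets  = c("dog", "dog, cat, hamster", "cat", "hamster"),
                      stringsAsFactors = FALSE)

# One character vector of pets per owner.
pet_lists <- strsplit(pets_df$Pets, ", ", fixed = TRUE)

for (pet in unique(unlist(pet_lists))) {
  # 1 if this owner's list contains the pet, 0 otherwise.
  has_pet <- vapply(pet_lists, function(p) pet %in% p, logical(1))
  pets_df[[paste0("Pets_", pet)]] <- as.integer(has_pet)
}

print(pets_df)
```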

dummy_cols alters the input dataset

When doing the following

dummy_df <- dummy_cols(df, remove_first_dummy = T, remove_selected_columns = T)

df gets altered too, containing all the dummy variables. It should not be modified, saving the results only in the return value (dummy_df).

dummy_cols and ties for most frequent levels

First of all, thanks a lot to Jacob for making available this very useful package. Having worked with it for a while I seem to have come across a bug in the dummy_cols function, pertaining to the function's remove_most_frequent_dummy option. Consider the following example:

dummy_cols(.data = data.frame(X = as.factor(c("a", "a", "b", "b", "c"))),
           remove_most_frequent_dummy = TRUE)

Both levels "a" and "b" in the X factor are tied for most frequent. As pointed out on the dummy_cols function's help page, the alphabetically first among the levels tied for most frequent (here: "a") will be removed in this case, so the result of the above R expression is a data frame with the original X column and two dummy columns X_b and X_c. (No issues up to here.) Now I modify the example in such a way as to have "b" and "c" tied for most frequent:

dummy_cols(.data = data.frame(X = as.factor(c("a", "b", "b", "c", "c"))),
           remove_most_frequent_dummy = TRUE)

I would expect a data frame with the original X column plus two dummy columns X_a and X_c, that is, the level to remove should be "b". Instead I obtain two dummy columns X_b and X_c, that is, it seems that the "a" level is removed although it is certainly not among the most frequent levels. As is evident from the two above examples, the occurrence of this problem seems to depend on where the tie for most frequent levels is found.

Is there some trivial thing that I fail to see here? -- Any reaction would be much appreciated.
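The tie-breaking rule described above (remove the alphabetically first level among those tied for most frequent) can be written in a few lines of base R. This sketch shows the expected result for the second example; it is not the package's code, and the variable names are invented for this example.

```r
# Expected tie-breaking among the most frequent levels (illustrative only).
x <- c("a", "b", "b", "c", "c")

counts <- table(x)

# All levels whose count equals the maximum count.
tied <- names(counts)[as.vector(counts) == max(counts)]

# Alphabetically first among the tied levels.
to_remove <- sort(tied)[1]

tied
to_remove
```

With this rule the level to remove is "b", which leaves dummies for "a" and "c".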

ignore_na not working as intended

I have NAs in my dataset. When I use dummy_cols() with the default ignore_na = FALSE, it creates a new column with the suffix _NA but leaves all NAs in the original column. According to the help page, however, it should replace the NA values.

Use [.data.table (currently `[` is dispatched to the data.frame() method)

See https://cran.r-project.org/web/packages/data.table/vignettes/datatable-importing.html, relevant section "data.table in Imports but nothing imported"

Basically, because fastDummies Imports data.table but does not importFrom(data.table, ...) in its NAMESPACE, to be conservative data.table assumes that fastDummies is "unaware" of data.table [ semantics and dispatches to [.data.frame.

That's evident e.g. here:

.data <- .data[-which(names(.data) %in% char_cols)]

Where that code won't work under [.data.table. The data.table equivalent is:

.data[, !..char_cols]
# or
.data[, !char_cols, with = FALSE]
# or
.data[, .SD, .SDcols = !char_cols]

The solution is

  1. Set .datatable.aware = TRUE anywhere in the package namespace
  2. Edit [ usages to use [.data.table semantics where appropriate.

Happy to file a PR fixing this.

If [.data.frame usage is intentional, I'd recommend (1) using as.data.frame() or data.table::setDF() to make the use of [.data.frame more intentional/explicit or (2) defining .datatable.aware=FALSE in your namespace to make it clearer that this is intentional.

Sorting of new columns not working correctly with numeric values

Using dummy_cols, the created dummy columns are correct, but the order they are in is seemingly random. From what I gather from #12, the columns are sorted alphabetically. However, the data I'm using to create dummy variables consists of numeric values. It looks like this (a subset of my data):

photo  photo_10  photo_13  photo_14  photo_15  photo_16  photo_17  photo_19  photo_2
1      0         0         0         0         0         0         0         0
1      0         0         0         0         0         0         0         0
2      0         0         0         0         0         0         0         1
3      0         0         0         0         0         0         0         0
4      0         0         0         0         0         0         0         0
4      0         0         0         0         0         0         0         0
5      0         0         0         0         0         0         0         0
6      0         0         0         0         0         0         0         0
6      0         0         0         0         0         0         0         0
7      0         0         0         0         0         0         0         0
... with 50 more rows, and 25 more variables: photo_20, photo_21,
photo_22, photo_23, photo_24, photo_25,
photo_26, photo_27, photo_29, photo_3,
photo_30, photo_31, photo_32, photo_33,
photo_34, photo_36, photo_37, photo_38,
photo_39, photo_4, photo_5, photo_6, photo_7,
photo_8, photo_9

What should happen is that the new columns are sorted from photo_1 to photo_40

Edit
What I also do not understand is that the values under 'photo' are already in the right order. So why is that order not copied?
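The ordering shown above is what alphabetical sorting of character strings produces ("photo_10" sorts before "photo_2"). As a possible workaround after the dummies are created, the column names can be reordered by their numeric suffix. The sketch below works on a small vector of names invented for this example and assumes every name has the form photo_<number>.

```r
# Reorder dummy column names by numeric suffix (illustrative only).
dummy_names <- c("photo_10", "photo_13", "photo_2", "photo_1",
                 "photo_20", "photo_3")

# Strip the prefix and convert what remains to integers.
numeric_part <- as.integer(sub("^photo_", "", dummy_names))

ordered_names <- dummy_names[order(numeric_part)]
ordered_names
```

The resulting vector can then be used to reorder the data frame columns, for example by indexing the data frame with the original column name followed by ordered_names.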

Feature request: support for making interaction variables

Hi! First of all, thanks for making this. I just did some quick benchmarking and {fastDummies} really is fast! I have a use case where this will actually make a big difference: simulating several million combinations of models using simulated data where each iteration requires creating new dummy variables. So this will help a lot!

In my case though I also need to create interaction variables, and it would be great if there was a way to build on this package to make them. Here is an example to show what I mean.

df <- data.frame( 
    price = runif(100, 5, 10),
    brand = sample(c("Nike", "Adidas"), 100, replace = TRUE)
)

df_without_ints <- fastDummies::dummy_cols(df, "brand")

df_with_ints <- as.data.frame(
    model.matrix(
        data = df, 
        object = ~price + brand + price*brand - 1)
    )

The df_without_ints data frame uses fastDummies::dummy_cols() to generate dummies, but it doesn't include interactions between price and the dummied brand coefficients. In contrast, I can use model.matrix() to generate both (see the df_with_ints object). model.matrix() isn't as fast, but it works well if you need both dummies and interactions with other columns. Does this make sense, and do you think it might be something others might be interested in?
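One way to picture the request: once the dummy columns exist, each interaction column is the numeric column multiplied by a dummy. The base R sketch below builds both by hand on a small data set so that it runs without the package. It is an illustration of the idea only; the price_x_ prefix and the object small_df are invented names, and the columns differ from what model.matrix() returns.

```r
# Dummies plus price-by-brand interaction columns (illustrative only).
set.seed(1)
small_df <- data.frame(price = runif(6, 5, 10),
                       brand = rep(c("Nike", "Adidas"), 3),
                       stringsAsFactors = FALSE)

for (b in sort(unique(small_df$brand))) {
  dummy <- as.integer(small_df$brand == b)
  small_df[[paste0("brand_", b)]] <- dummy
  # The interaction is the price where the dummy is 1, and 0 elsewhere.
  small_df[[paste0("price_x_brand_", b)]] <- small_df$price * dummy
}

print(small_df)
```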

Use dummy_cols to add columns by reference to data.table

Sorry for submitting multiple issues, but I hope that this makes it easier to address and track them.

Looking at the code for dummy_cols(), it seems that the function uses data.table::set() to add the dummy columns. data.table users with larger datasets may be interested in adding these columns by reference instead of copying the dataset. It looks like the function could accommodate this rather easily, but it may confuse new users if this is implemented in the original dummy_cols function. I'm not sure the best way to address this, but maybe a second function would be the right approach.

Split argument doesn't find all possibilities

Hi, I discovered a potential problem which I laid out on StackOverflow.

The split argument is not pulling out all categories. Here are two examples with results. Note that in the second example, I thought that adding a record that only had "mouse" would make the results better, but instead it made it worse. Thanks!

library(fastDummies)

#Dummy Split Test
ID <- seq(1:4)
pets <- c("dog", "cat;dog;mouse", "dog;mouse", "cat")
df <- data.frame("ID" = ID, "pets" = pets, stringsAsFactors = FALSE)

dummyTest <- dummy_cols(df, select_columns = c("pets"), remove_first_dummy = FALSE,
                        remove_most_frequent_dummy = FALSE, sort_columns = FALSE,
                        ignore_na = FALSE, split = ";")

print(dummyTest)

# ID          pets pets_dog pets_cat;dog;mouse pets_dog;mouse pets_cat
# 1  1           dog        1                  0              0        0
# 2  2 cat;dog;mouse        1                  1              0        1
# 3  3     dog;mouse        1                  0              1        0
# 4  4           cat        0                  0              0        1


ID <- seq(1:5)
pets <- c("dog", "cat;dog;mouse", "dog;mouse", "cat", "mouse")
df <- data.frame("ID" = ID, "pets" = pets, stringsAsFactors = FALSE)

dummyTest <- dummy_cols(df, select_columns = c("pets"), remove_first_dummy = FALSE,
                        remove_most_frequent_dummy = FALSE, sort_columns = FALSE,
                        ignore_na = FALSE, split = c(","))

print(dummyTest)

# ID          pets pets_dog pets_cat;dog;mouse pets_dog;mouse pets_cat pets_mouse
# 1  1           dog        1                  0              0        0          0
# 2  2 cat;dog;mouse        0                  1              0        0          0
# 3  3     dog;mouse        0                  0              1        0          0
# 4  4           cat        0                  0              0        1          0
# 5  5         mouse        0                  0              0        0          1

Return data.table if input data is a data.table

Hi Jacob,

Thanks for fastDummies. It is a great little package that I think many people will find helpful.

I noticed in both dummy_cols() and dummy_rows() that the return value is a data.frame regardless of whether the input data is a data.table. Perhaps it would be appropriate to return a data.table if the input is a data.table. This could be done easily by creating a flag at the beginning of the function for the input type and returning the same type at the end of the function.

Dummy indicator with more than one endorsement

I just came across your package and it looks very close to what I need. If you are still maintaining this package, an additional feature that would be extremely handy would be to allow for multiple categories to be selected (using a delimiter).

For example, I am working with a variable "Type" and some rows might be "Research", "Teaching", "Unknown", but others might be a combination: "Research, Teaching". When "dummied", I am looking for something that might look like:

Research | Teaching | Unknown
0 | 0 | 1
1 | 1 | 0

Being able to specify the delimiter (in this case, the comma followed by a space) would be a huge help.

Any ideas if this might be possible?

n-1 Dummies

Thanks a lot for the great package, it's very useful.

Is there a possibility you could add an option that removes the first dummy of every variable, so that only n-1 dummies remain? It would be very helpful for models which need numeric data but only n-1 dummies because of perfect multicollinearity.

The code for it:

  • add variable "remove.first.dummy = FALSE" as parameter to the function (or however you want to call it)

  • add the code below:

    for (col_name in char_cols) {
      unique_vals <- unique(dataset[, get(col_name)])
      #----- next lines new ---
      if (remove.first.dummy) {
        unique_vals <- unique_vals[-1]
      }
      #-------------------------
      dataset[, (paste0(col_name, "_", unique_vals)) := 0]
      for (unique_values in unique_vals) {
        dataset[get(col_name) == unique_values,
                (paste0(col_name, "_", unique_values)) := 1]
      }
    }
