osofr / condensier
Non-parametric conditional density estimation with binned conditional histograms
License: MIT License
Regular multinomial regressions / classifiers will typically predict the probability of each class (for each observation). condensier needs a predict_class or predict.condensier function that loops over all available categorical classes and separately predicts the probability of each class.
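A minimal sketch of what such a loop might look like. Note that predict_class() is not an existing condensier function; the only package call assumed here is predict_probability(fit, newdata), which the examples below use, and the predict function is injectable so the sketch is testable in isolation:

```r
# Illustrative sketch only: predict_class() is hypothetical, not a condensier API.
# For each candidate class k, set the outcome column to k and ask the fitted
# conditional density for P(Y = k | X); then normalize across classes.
predict_class <- function(dens_fit, newdata, outcome, classes,
                          predict_fun = condensier::predict_probability) {
  probs <- vapply(classes, function(k) {
    nd <- newdata
    nd[[outcome]] <- k          # evaluate the density at outcome value k
    predict_fun(dens_fit, nd)   # P(Y = k | X) for every row
  }, numeric(nrow(newdata)))
  colnames(probs) <- as.character(classes)
  probs / rowSums(probs)        # rows sum to one across classes
}
```

The injectable predict_fun also makes it easy to reuse the same loop with a pooled or unpooled fit.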
The DESCRIPTION file says the license is GPL-2; the README says it is MIT. It probably does not matter much either way, since both are open-source licenses, but GPL-2 is more "copyleft" than MIT, so the two files should be made consistent.
sample_value() will not work when using pool = TRUE. See condensier/R/SummariesModelClass.R, lines 205 to 215 in 9e177e5 (BinOutModel$sampleA).
One potential approach is to re-create a loop over nbins; inside the loop, mutate newdata with a new bin indicator, then repeatedly call predict against the same fit.

For a particular application I've run into, it would be very useful to be able to incorporate a weights argument into fit_density, similar in style to what's currently present in standard methods like glm. This could likely be accomplished by incorporating the weights into the step where the likelihood is fit within the fit_density function (I'm not familiar with the code base for this package; otherwise, I'd offer a solution via a PR rather than simply opening an issue).

Would it be possible to incorporate this weights argument if it's trivial, @osofr? I'd be happy to help if I can.
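To illustrate the intended semantics, here is how observation weights enter an ordinary glm fit. A weights argument in fit_density could, in principle, be forwarded to each per-bin binomial regression in the same way; this is a sketch of the idea using base R only, not the package's actual interface:

```r
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(0.5 * x))
w <- sample(1:3, 200, replace = TRUE)  # frequency weights per observation

# Unweighted vs. weighted logistic fits: the weighted fit maximizes
# sum_i w_i * logLik_i, which is what condensier's per-bin regressions
# would maximize if a weights argument were passed through.
fit_unw <- glm(y ~ x, family = binomial())
fit_w   <- glm(y ~ x, family = binomial(), weights = w)
coef(fit_unw)
coef(fit_w)  # coefficients shift once the weights vary across observations
```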
Hi @osofr,
I couldn't sleep so I created a minimal working example for the problem I told you about. I wrote it without my glasses on, so sorry for any typos :-)
The code:
library('condensier')
library('data.table')
library('speedglm')
library('magrittr')
nbins <- 10
w_prob <- 0.75
nobs <- 13000
X <- rbinom(n = nobs, size = 1, prob = w_prob)
mu <- 10 + 15 * X  # avoid shadowing base::mean
Y <- rnorm(nobs, mu, 5)
dat <- data.table(X = X, Y = Y)
bin_estimator <- condensier::glmR6$new()
dens_fit <- condensier::fit_density(X = 'X',
Y = 'Y',
input_data = dat,
nbins = nbins,
bin_estimator = bin_estimator)
res <- condensier::predict_probability(dens_fit, dat)
# Expect TRUE twice
print(mean(res[dat$X == 1 & dat$Y >= 17]) > mean(res[dat$X == 0 & dat$Y < 17]))
print(mean(res[dat$X == 0 & dat$Y >= 17]) < mean(res[dat$X == 1 & dat$Y < 17]))
print(mean(res[dat$X == 1 & dat$Y >= 17])) # Expect high
print(mean(res[dat$X == 1 & dat$Y < 17])) # Expect low
print(mean(res[dat$X == 0 & dat$Y >= 17])) # Expect low
print(mean(res[dat$X == 0 & dat$Y < 17])) # Expect high
res <- condensier::sample_value(dens_fit, dat)
dat$Y[dat$X == 0] %>% mean %>% print # Expect ~ 10
dat$Y[dat$X == 1] %>% mean %>% print # Expect ~ 25
res[dat$X == 0] %>% mean %>% print   # Expect ~ 10
res[dat$X == 1] %>% mean %>% print   # Expect ~ 25
And the output:
TRUE
FALSE
0.05643434
0.01318017
0.04243487
0.003734035
10.06833
25.00594
25.05065
25.00862
As you can see, the outputs are not entirely what I expected them to be. I'm running this code using the condensier version on my branch, but I could try in the morning to run it on master (or an earlier version) to see whether the problem persists.
I hope this clarifies the problem a bit.
Best,
Frank
The following code attempts to fit a marginal density using both pooled and unpooled condensier estimates, by way of sl3. The true density is standard normal. The unpooled estimates (red) approximate the true density (blue), but the pooled estimates (green) do not.
options(sl3.verbose = FALSE)
library("condensier")
library("sl3")
library("simcausal")
library("ggplot2")
library("data.table")  # needed for melt() below
D <- DAG.empty()
D <- D + node("I", distr = "rbern", prob = 1) +
node("B", distr = "rnorm", mean = 0, sd = 1)
D <- set.DAG(D, n.test = 10)
datO <- sim(D, n = 10000, rndseed = 12345)
# ================================================================================
task <- sl3_Task$new(datO, covariates=c("I"),outcome="B")
lrn_unpooled <- Lrnr_condensier$new(nbins = 25, bin_method = "equal.len", pool = FALSE,
bin_estimator = Lrnr_glm_fast$new(family = binomial()))
lrn_unpooled_fit <- lrn_unpooled$train(task)
lrn_pooled <- Lrnr_condensier$new(nbins = 25, bin_method = "equal.len", pool = TRUE,
bin_estimator = Lrnr_glm_fast$new(family = binomial()))
lrn_pooled_fit <- lrn_pooled$train(task)
# ================================================================================
# evaluate fit on training data
datO$pred_fB_unpooled <- lrn_unpooled_fit$predict(task)
datO$pred_fB_pooled <- lrn_pooled_fit$predict(task)
datO$fB <- dnorm(datO$B)
long <- melt(datO, id=c("B"), measure=c("pred_fB_unpooled","pred_fB_pooled","fB"))
ggplot(long, aes(x=B, y=value, color=variable)) + geom_point() + theme_bw()
Looks like simcausal isn't on CRAN anymore. Can you add its github repo to the list of remotes in DESCRIPTION?
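Assuming the package still lives at github.com/osofr/simcausal (an assumption; adjust if the repo has moved), the DESCRIPTION entry would look something like:

```
Remotes:
    osofr/simcausal
```

With that in place, remotes::install_deps() and devtools can resolve simcausal from GitHub instead of CRAN.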