better / convoys
Implementation of statistical models to analyze time lagged conversions
Home Page: https://better.engineering/convoys/
License: MIT License
Hey!
Awesome library - saved my day!
I've run into an issue when trying to fit a model: single.KaplanMeier.fit(B, T) throws TypeError: fit() missing 1 required positional argument: 'T'. I can't figure out what the issue is here.
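That error message usually means fit() is being called on the KaplanMeier class itself rather than on an instance. A minimal stand-in sketch (not the convoys code) reproduces the same TypeError:

```python
# Stand-in class, not the convoys implementation: calling fit() on the class
# instead of an instance shifts every argument left by one, which is exactly
# what produces "missing 1 required positional argument: 'T'".
class KaplanMeier:
    def fit(self, B, T):
        self.B, self.T = B, T
        return self

B = [1, 0, 1]
T = [3.0, 5.0, 2.0]

try:
    KaplanMeier.fit(B, T)  # B binds to self, T binds to B -> 'T' is missing
except TypeError as e:
    msg = str(e)

model = KaplanMeier().fit(B, T)  # instantiate first, then fit
```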
Steps to reproduce:
python -m venv venv && source venv/bin/activate
pip install convoys==0.2.1
python examples/dob_violations.py
Stacktrace:
File "examples/dob_violations.py", line 50, in
run()
File "examples/dob_violations.py", line 25, in run
convoys.plotting.plot_cohorts(G, B, T, model=model, ci=0.95,
File "/Users/jacopotagliabue/Documents/repos/convoys/venv/lib/python3.8/site-packages/convoys/plotting.py", line 62, in plot_cohorts
m.fit(G, B, T)
File "/Users/jacopotagliabue/Documents/repos/convoys/venv/lib/python3.8/site-packages/convoys/multi.py", line 31, in fit
self.base_model.fit(X, B, T)
File "/Users/jacopotagliabue/Documents/repos/convoys/venv/lib/python3.8/site-packages/convoys/regression.py", line 269, in fit
for i, _ in enumerate(sampler.sample(p0, iterations=n_iterations)):
File "/Users/jacopotagliabue/Documents/repos/convoys/venv/lib/python3.8/site-packages/emcee/ensemble.py", line 379, in sample
self.backend.grow(iterations, state.blobs)
File "/Users/jacopotagliabue/Documents/repos/convoys/venv/lib/python3.8/site-packages/emcee/backends/backend.py", line 175, in grow
a = np.empty((i, self.nwalkers, self.ndim), dtype=self.dtype)
TypeError: 'numpy.float64' object cannot be interpreted as an integer
numpy.int was deprecated in NumPy 1.20.0 [ref]. This type is being used in the Gamma model. Happy to raise a PR to change it to numpy.int_.
I'm running into a numpy type error (see below) whenever I set the parameter mcmc=True or ci=0.95 in any convoys model. This is consistent across data sources and even occurs when using the example data sets. If I remove these parameters, the code runs as expected with no errors. Is this something anyone else has come across? Any help is much appreciated!
TypeError: 'numpy.float64' object cannot be interpreted as an integer
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-160-acdf74229cef> in <module>
----> 1 model_test.fit(X,B,T)
~/opt/anaconda3/lib/python3.8/site-packages/convoys/multi.py in fit(self, G, B, T)
29 for i, group in enumerate(G):
30 X[i,group] = 1
---> 31 self.base_model.fit(X, B, T)
32
33 def _get_x(self, group):
~/opt/anaconda3/lib/python3.8/site-packages/convoys/regression.py in fit(self, X, B, T, W)
267 ' %d walkers [' % n_walkers,
268 progressbar.AdaptiveETA(), ']'])
--> 269 for i, _ in enumerate(sampler.sample(p0, iterations=n_iterations)):
270 bar.update(i+1)
271 result['samples'] = sampler.chain[:, n_burnin:, :] \
~/opt/anaconda3/lib/python3.8/site-packages/emcee/ensemble.py in sample(self, initial_state, log_prob0, rstate0, blobs0, iterations, tune, skip_initial_state_check, thin_by, thin, store, progress, progress_kwargs)
377 checkpoint_step = thin_by
378 if store:
--> 379 self.backend.grow(iterations, state.blobs)
380
381 # Set up a wrapper around the relevant model functions
~/opt/anaconda3/lib/python3.8/site-packages/emcee/backends/backend.py in grow(self, ngrow, blobs)
173 self._check_blobs(blobs)
174 i = ngrow - (len(self.chain) - self.iteration)
--> 175 a = np.empty((i, self.nwalkers, self.ndim), dtype=self.dtype)
176 self.chain = np.concatenate((self.chain, a), axis=0)
177 a = np.empty((i, self.nwalkers), dtype=self.dtype)
TypeError: 'numpy.float64' object cannot be interpreted as an integer
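The failure mode in the traceback is that the iteration count handed to emcee ends up as a numpy.float64, and shape arguments must be real integers. A stdlib-only analog of the same error class, and the int() cast that fixes it (range() rejects floats the same way np.empty rejects float dimensions):

```python
# n_iterations ends up a float after arithmetic (e.g. scaling by a ratio)
n_iterations = 2000 * 1.0

try:
    list(range(n_iterations))  # same error class as np.empty((i, ...)) with a float
except TypeError as e:
    msg = str(e)

n_iterations = int(n_iterations)  # the fix: cast to a real integer
samples = range(n_iterations)     # now accepted
```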
TypeError: __init__() got an unexpected keyword argument 'dim' is thrown in convoys when attempting to run with emcee 3.0.0. The code in question is in regression.py (~line 248):
sampler = emcee.EnsembleSampler(
    nwalkers=n_walkers,
    dim=dim,
    lnpostfn=generalized_gamma_loss,
    args=args,
)
I was able to fix it by changing dim to ndim and lnpostfn to log_prob_fn - no other changes were required.
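A hypothetical compatibility shim (not part of convoys) could pick the keyword names by inspecting the installed EnsembleSampler, so the same call works on emcee 2.x and 3.x. The stand-in class below mimics the 3.x signature so the sketch is self-contained:

```python
import inspect

def sampler_kwargs(sampler_cls, n_walkers, dim, log_prob, args):
    """Return EnsembleSampler kwargs matching whichever emcee API is installed."""
    params = inspect.signature(sampler_cls.__init__).parameters
    if "ndim" in params:  # emcee >= 3.0: dim -> ndim, lnpostfn -> log_prob_fn
        return dict(nwalkers=n_walkers, ndim=dim, log_prob_fn=log_prob, args=args)
    return dict(nwalkers=n_walkers, dim=dim, lnpostfn=log_prob, args=args)

class FakeSampler3:  # stand-in mimicking the emcee 3.x __init__ signature
    def __init__(self, nwalkers, ndim, log_prob_fn, args=()):
        self.nwalkers, self.ndim = nwalkers, ndim

kwargs = sampler_kwargs(FakeSampler3, n_walkers=100, dim=5,
                        log_prob=lambda p: 0.0, args=())
sampler = FakeSampler3(**kwargs)
```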
I'm trying to do some modeling where there is a large time lag before conversion, and I'm interested in getting an updated likelihood-of-conversion prediction for a single observation over its lifetime (at no specified interval, just whenever someone wants to look). Intuitively, I'd expect the likelihood of conversion to be highest for the first couple of days or weeks; past a certain point an observation essentially isn't going to convert, it's just too old.
I was looking at Cox Proportional Hazards models when I came across Convoys and it seemed to address my problem more directly, though many of the examples involve groups and aggregate conversion rates. I know there are regression classes and I was playing with those:
from convoys import regression, utils

unit, groups, (G, B, T) = utils.get_arrays(
    survival_df,
    created='date_input',
    converted='conversion_date',
    unit='days',
    features=[i for i in features if i not in ['date_input', 'conversion_date']],
)
gamma_model = regression.GeneralizedGamma(flavor='linear', ci=True)
gamma_model.fit(G, B, T)
gamma_model.predict([1, 2, 3], 30, ci=True)
but I was curious whether I'm interpreting the output correctly for real-time scoring (i.e., an observation is scored at time t, and the result is the likelihood of conversion at that point, assuming the observation has not yet converted). Similarly, if my features are time-dependent (e.g., they may be null at creation, but I learn more about them over time), can that be factored in? After a more thorough reading of the docs, I've seen this mentioned as a future direction using RNNs; do you have any papers you can point me at?
Thank you in advance.
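The "real-time scoring" quantity described above can be written in terms of any fitted model's conversion CDF: P(eventually converts | unconverted at t) = (c - F(t)) / (1 - F(t)), where c = F(infinity) is the eventual conversion ceiling. A toy stdlib sketch with an exponential conversion model (the ceiling c and rate lam are made-up numbers, not convoys output):

```python
import math

def conversion_cdf(t, c=0.3, lam=0.1):
    """Toy stand-in for a fitted model's cdf(t): P(converted by time t)."""
    return c * (1.0 - math.exp(-lam * t))

def p_still_converts(t, c=0.3, lam=0.1):
    """P(eventually converts | still unconverted at time t)."""
    ft = conversion_cdf(t, c, lam)
    return (c - ft) / (1.0 - ft)

# Matches the intuition in the question: the conditional probability starts
# at the overall ceiling and decays toward zero as the observation ages.
early, late = p_still_converts(1), p_still_converts(120)
```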
For multi models, the user can pass ci=True, and cdf() returns a tuple with the mean, lower bound, and upper bound of the confidence interval. In use cases where one wants to compare across groups, returning the posterior distribution itself is more desirable than summary statistics. I'd like to add a parameter letting the user specify that they want the posterior distributions for each group. I'll link to the PR below.
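A hedged sketch of the idea (names are illustrative, not the convoys API): with raw posterior samples the caller can both recover the (mean, lower, upper) summary and compute cross-group quantities the summary cannot express, such as the probability that one group beats another:

```python
import random
import statistics

random.seed(0)
# Toy posterior samples of the conversion rate for two groups
samples_a = [random.gauss(0.32, 0.02) for _ in range(2000)]
samples_b = [random.gauss(0.30, 0.02) for _ in range(2000)]

def summarize(samples, ci=0.95):
    """Collapse samples to the (mean, lower, upper) tuple returned today."""
    s = sorted(samples)
    lo_i = int((1 - ci) / 2 * len(s))
    return statistics.mean(s), s[lo_i], s[-lo_i - 1]

mean, lower, upper = summarize(samples_a)

# Only possible with the full posterior, not with the summary:
p_a_beats_b = sum(a > b for a, b in zip(samples_a, samples_b)) / len(samples_a)
```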
http://data.princeton.edu/pop509/ParametricSurvival.pdf
I don't see an obvious PDF (probability density function) in those notes, though. Perhaps I need to re-read them.
Hey there! We, the Buffer data team, recently discovered this awesome package, and we're starting to use it in different analyses.
We're used to doing most of our plotting with R. I've started working on getting the data back from the Matplotlib figure, but that seems like a hack, and I was wondering if you've thought about decoupling the plotting from the modelling.
Prophet, from Facebook, does a great job at that: it returns a DataFrame with the data required to plot, and the same prophet library also provides a default .plot function that uses Matplotlib. That helps users use other plotting frameworks.
I'm happy to help with the coding if I can figure out how best to do the decoupling. Let me know if you have any questions too.
Thanks for open sourcing such a helpful library!
PS: We've also found that using a large group size results in a confusing legend in the final plot. This can probably be fixed with the proper Matplotlib arguments, though. This example shows weeks in one of our plots:
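One possible shape for the decoupling, sketched with made-up names (cohort_rows and the toy cdf are assumptions, not the current convoys interface): the model emits plain records that R, ggplot, or anything else can consume, and Matplotlib plotting becomes an optional layer on top, as in Prophet:

```python
def cohort_rows(cdf_with_ci, groups, ts):
    """Return plot-ready records instead of drawing a figure."""
    return [
        {"group": g, "t": t, "mean": m, "lower": lo, "upper": hi}
        for g in groups
        for t in ts
        for m, lo, hi in [cdf_with_ci(g, t)]
    ]

# Toy stand-in for a fitted model's per-group cdf with a confidence interval
toy_cdf = lambda g, t: (0.01 * t, 0.008 * t, 0.012 * t)

rows = cohort_rows(toy_cdf, ["signup-2019", "signup-2020"], [7, 30, 90])
```

A default .plot helper could then consume these same rows, keeping the data path and the drawing path separate.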