better / convoys
Implementation of statistical models to analyze time lagged conversions
Home Page: https://better.engineering/convoys/
License: MIT License
Hey!
Awesome library - saved my day!
I've run into an issue when trying to fit a model: single.KaplanMeier.fit(B, T) throws TypeError: fit() missing 1 required positional argument: 'T'. I can't figure out what the issue is here.
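That error message usually means fit() is being called on the KaplanMeier class itself rather than on an instance. A minimal stand-in sketch (not the convoys code) reproduces the same TypeError:

```python
# Stand-in class, not the convoys implementation: calling fit() on the class
# instead of an instance shifts every argument left by one, which is exactly
# what produces "missing 1 required positional argument: 'T'".
class KaplanMeier:
    def fit(self, B, T):
        self.B, self.T = B, T
        return self

B = [1, 0, 1]
T = [3.0, 5.0, 2.0]

try:
    KaplanMeier.fit(B, T)  # B binds to self, T binds to B -> 'T' is missing
except TypeError as e:
    msg = str(e)

model = KaplanMeier().fit(B, T)  # instantiate first, then fit
```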
Steps to reproduce:
python -m venv venv && source venv/bin/activate
pip install convoys==0.2.1
python examples/dob_violations.py
Stacktrace:
File "examples/dob_violations.py", line 50, in
run()
File "examples/dob_violations.py", line 25, in run
convoys.plotting.plot_cohorts(G, B, T, model=model, ci=0.95,
File "/Users/jacopotagliabue/Documents/repos/convoys/venv/lib/python3.8/site-packages/convoys/plotting.py", line 62, in plot_cohorts
m.fit(G, B, T)
File "/Users/jacopotagliabue/Documents/repos/convoys/venv/lib/python3.8/site-packages/convoys/multi.py", line 31, in fit
self.base_model.fit(X, B, T)
File "/Users/jacopotagliabue/Documents/repos/convoys/venv/lib/python3.8/site-packages/convoys/regression.py", line 269, in fit
for i, _ in enumerate(sampler.sample(p0, iterations=n_iterations)):
File "/Users/jacopotagliabue/Documents/repos/convoys/venv/lib/python3.8/site-packages/emcee/ensemble.py", line 379, in sample
self.backend.grow(iterations, state.blobs)
File "/Users/jacopotagliabue/Documents/repos/convoys/venv/lib/python3.8/site-packages/emcee/backends/backend.py", line 175, in grow
a = np.empty((i, self.nwalkers, self.ndim), dtype=self.dtype)
TypeError: 'numpy.float64' object cannot be interpreted as an integer
numpy.int was deprecated in NumPy 1.20.0 [ref]. This type is being used in the Gamma model. Happy to raise a PR to change it to numpy.int_.
I'm running into a numpy type error (see below) whenever I set the parameter mcmc=True or ci=0.95 in any convoys model. This is consistent across data sources and even occurs when using the example data sets. If I remove these parameters, the code runs as expected with no errors. Is this something anyone else has come across? Any help is much appreciated!
TypeError: 'numpy.float64' object cannot be interpreted as an integer
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-160-acdf74229cef> in <module>
----> 1 model_test.fit(X,B,T)
~/opt/anaconda3/lib/python3.8/site-packages/convoys/multi.py in fit(self, G, B, T)
29 for i, group in enumerate(G):
30 X[i,group] = 1
---> 31 self.base_model.fit(X, B, T)
32
33 def _get_x(self, group):
~/opt/anaconda3/lib/python3.8/site-packages/convoys/regression.py in fit(self, X, B, T, W)
267 ' %d walkers [' % n_walkers,
268 progressbar.AdaptiveETA(), ']'])
--> 269 for i, _ in enumerate(sampler.sample(p0, iterations=n_iterations)):
270 bar.update(i+1)
271 result['samples'] = sampler.chain[:, n_burnin:, :] \
~/opt/anaconda3/lib/python3.8/site-packages/emcee/ensemble.py in sample(self, initial_state, log_prob0, rstate0, blobs0, iterations, tune, skip_initial_state_check, thin_by, thin, store, progress, progress_kwargs)
377 checkpoint_step = thin_by
378 if store:
--> 379 self.backend.grow(iterations, state.blobs)
380
381 # Set up a wrapper around the relevant model functions
~/opt/anaconda3/lib/python3.8/site-packages/emcee/backends/backend.py in grow(self, ngrow, blobs)
173 self._check_blobs(blobs)
174 i = ngrow - (len(self.chain) - self.iteration)
--> 175 a = np.empty((i, self.nwalkers, self.ndim), dtype=self.dtype)
176 self.chain = np.concatenate((self.chain, a), axis=0)
177 a = np.empty((i, self.nwalkers), dtype=self.dtype)
TypeError: 'numpy.float64' object cannot be interpreted as an integer
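The failure mode in the traceback is that the iteration count handed to emcee ends up as a numpy.float64, and shape arguments must be real integers. A stdlib-only analog of the same error class, and the int() cast that fixes it (range() rejects floats the same way np.empty rejects float dimensions):

```python
# n_iterations ends up a float after arithmetic (e.g. scaling by a ratio)
n_iterations = 2000 * 1.0

try:
    list(range(n_iterations))  # same error class as np.empty((i, ...)) with a float
except TypeError as e:
    msg = str(e)

n_iterations = int(n_iterations)  # the fix: cast to a real integer
samples = range(n_iterations)     # now accepted
```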
TypeError: __init__() got an unexpected keyword argument 'dim' is thrown in convoys when attempting to run with emcee 3.0.0. The code in question is in regression.py (~line 248):
sampler = emcee.EnsembleSampler(
    nwalkers=n_walkers,
    dim=dim,
    lnpostfn=generalized_gamma_loss,
    args=args,
)
I was able to fix it by changing dim to ndim and lnpostfn to log_prob_fn - no other changes were required.
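A hypothetical compatibility shim (not part of convoys) could pick the keyword names by inspecting the installed EnsembleSampler, so the same call works on emcee 2.x and 3.x. The stand-in class below mimics the 3.x signature so the sketch is self-contained:

```python
import inspect

def sampler_kwargs(sampler_cls, n_walkers, dim, log_prob, args):
    """Return EnsembleSampler kwargs matching whichever emcee API is installed."""
    params = inspect.signature(sampler_cls.__init__).parameters
    if "ndim" in params:  # emcee >= 3.0: dim -> ndim, lnpostfn -> log_prob_fn
        return dict(nwalkers=n_walkers, ndim=dim, log_prob_fn=log_prob, args=args)
    return dict(nwalkers=n_walkers, dim=dim, lnpostfn=log_prob, args=args)

class FakeSampler3:  # stand-in mimicking the emcee 3.x __init__ signature
    def __init__(self, nwalkers, ndim, log_prob_fn, args=()):
        self.nwalkers, self.ndim = nwalkers, ndim

kwargs = sampler_kwargs(FakeSampler3, n_walkers=100, dim=5,
                        log_prob=lambda p: 0.0, args=())
sampler = FakeSampler3(**kwargs)
```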
I'm trying to do some modeling where there is a large time lag before conversion, and I'm interested in getting an updated likelihood-of-conversion prediction for a single observation over its lifetime (at no specified interval, just whenever someone wants to look). Intuitively, I'd expect the likelihood of conversion to be highest for the first couple of days or weeks; past a certain point an observation essentially isn't going to convert, it's just too old.
I was looking at Cox Proportional Hazards models when I came across Convoys and it seemed to address my problem more directly, though many of the examples involve groups and aggregate conversion rates. I know there are regression classes and I was playing with those:
from convoys import regression, utils

unit, groups, (G, B, T) = utils.get_arrays(
    survival_df,
    created='date_input',
    converted='conversion_date',
    unit='days',
    features=[i for i in features if i not in ['date_input', 'conversion_date']],
)
gamma_model = regression.GeneralizedGamma(flavor='linear', ci=True)
gamma_model.fit(G, B, T)
gamma_model.predict([1, 2, 3], 30, ci=True)
but I was curious whether I'm interpreting the output correctly for real-time scoring (i.e., an observation is scored at time t, and the result is the likelihood of conversion at that point, assuming the observation has not yet converted). Similarly, if my features are time-dependent (e.g., they may be null at creation, but I learn more about them over time), can that be factored in? After a more thorough reading of the docs, I've seen this mentioned as a future direction using RNNs; do you have any papers you can point me at?
Thank you in advance.
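The "real-time scoring" quantity described above can be written in terms of any fitted model's conversion CDF: P(eventually converts | unconverted at t) = (c - F(t)) / (1 - F(t)), where c = F(infinity) is the eventual conversion ceiling. A toy stdlib sketch with an exponential conversion model (the ceiling c and rate lam are made-up numbers, not convoys output):

```python
import math

def conversion_cdf(t, c=0.3, lam=0.1):
    """Toy stand-in for a fitted model's cdf(t): P(converted by time t)."""
    return c * (1.0 - math.exp(-lam * t))

def p_still_converts(t, c=0.3, lam=0.1):
    """P(eventually converts | still unconverted at time t)."""
    ft = conversion_cdf(t, c, lam)
    return (c - ft) / (1.0 - ft)

# Matches the intuition in the question: the conditional probability starts
# at the overall ceiling and decays toward zero as the observation ages.
early, late = p_still_converts(1), p_still_converts(120)
```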
For multi models, the user can pass ci=True, and cdf() returns a tuple with the mean, lower bound, and upper bound of the confidence interval. In use cases where one wants to compare across groups, returning the posterior distribution itself is more desirable than summary statistics. I'd like to add a parameter letting the user specify that they want the posterior distributions for each group. I'll link to the PR below.
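A hedged sketch of the idea (names are illustrative, not the convoys API): with raw posterior samples the caller can both recover the (mean, lower, upper) summary and compute cross-group quantities the summary cannot express, such as the probability that one group beats another:

```python
import random
import statistics

random.seed(0)
# Toy posterior samples of the conversion rate for two groups
samples_a = [random.gauss(0.32, 0.02) for _ in range(2000)]
samples_b = [random.gauss(0.30, 0.02) for _ in range(2000)]

def summarize(samples, ci=0.95):
    """Collapse samples to the (mean, lower, upper) tuple returned today."""
    s = sorted(samples)
    lo_i = int((1 - ci) / 2 * len(s))
    return statistics.mean(s), s[lo_i], s[-lo_i - 1]

mean, lower, upper = summarize(samples_a)

# Only possible with the full posterior, not with the summary:
p_a_beats_b = sum(a > b for a, b in zip(samples_a, samples_b)) / len(samples_a)
```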
http://data.princeton.edu/pop509/ParametricSurvival.pdf
I don't see an obvious PDF (probability density function) in those notes, though. Perhaps I need to re-read them.
Hey there! We, the Buffer data team, recently discovered this awesome package, and we're starting to use it in different analyses.
We're used to doing most of our plotting with R. I've started working on getting the data back from the Matplotlib figure, but that seems like a hack, and I was wondering if you've thought about decoupling the plotting from the modelling.
Prophet, from Facebook, does a great job at that: it returns a DataFrame with the data required to plot, and the same prophet library also provides a default .plot function that uses Matplotlib. That helps users use other plotting frameworks.
I'm happy to help with the coding if I can figure out how best to do the decoupling. Let me know if you have any questions too.
Thanks for open sourcing such a helpful library!
PS: We've also found that using a large group size results in a confusing legend in the final plot. This can probably be fixed with the proper Matplotlib arguments, though. This example shows weeks in one of our plots:
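One possible shape for the decoupling, sketched with made-up names (cohort_rows and the toy cdf are assumptions, not the current convoys interface): the model emits plain records that R, ggplot, or anything else can consume, and Matplotlib plotting becomes an optional layer on top, as in Prophet:

```python
def cohort_rows(cdf_with_ci, groups, ts):
    """Return plot-ready records instead of drawing a figure."""
    return [
        {"group": g, "t": t, "mean": m, "lower": lo, "upper": hi}
        for g in groups
        for t in ts
        for m, lo, hi in [cdf_with_ci(g, t)]
    ]

# Toy stand-in for a fitted model's per-group cdf with a confidence interval
toy_cdf = lambda g, t: (0.01 * t, 0.008 * t, 0.012 * t)

rows = cohort_rows(toy_cdf, ["signup-2019", "signup-2020"], [7, 30, 90])
```

A default .plot helper could then consume these same rows, keeping the data path and the drawing path separate.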