cokelaer / fitter Goto Github PK

Fit data to many distributions

Home Page: https://fitter.readthedocs.io/

License: GNU General Public License v3.0

Python 100.00%

distribution fit python statistics

fitter's Introduction

Hi there

🔭 I’m currently working on Sequana and also maintaining BioServices, Damona, Fitter, colormap, spectrum, Bioconvert and easydev.
I'm currently leading the bioinformatics and data management activities of the Biomics NGS platform (biomics.pasteur.fr), building pipelines and tools for production.
👯 I’m looking to collaborate on BioConvert, BioServices, Damona and, of course, Sequana. If you would be interested in taking the lead on Spectrum, fitter or colormap, please let me know.

fitter's Issues

Parallel execution

Fitter would benefit a lot from running in subprocesses. What do you think about implementing something like joblib under the hood?

If you'd like, I can draft a PR on this.

NOT clearing the output of summary()

It would be nice if there were an option (a parameter to summary()?) to not clear the output of the Jupyter cell, i.e. to skip the call to pylab.clf().
The reason is that I'm fitting several variables in a loop and want to examine all the output in one cell, instead of either repeating the process manually for each variable or saving the Fitter instances to, say, a list, and then still making n cells and manually calling list[f].summary() for each f.

Empty progress bar is shown when progress=False

Hi.

The progress bar object is created in the class __init__ method, but it is only used in the fit() method. As a result, an empty progress bar is shown even when it is not used, e.g. with f.fit(progress=False). It is shown in the image below.

The fix is initializing the progress bar in the fit() method. I will submit the PR soon.

AIC and BIC support

Hi,

Would you consider a PR that implements AIC and BIC as comparison metrics in addition to SSE?

Cheers,

Caio
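
For reference, both criteria are computed from the maximized log-likelihood. The sketch below only states the standard definitions; the function names are hypothetical and this is not fitter's implementation.

```python
import math


def aic(log_likelihood: float, k: int) -> float:
    """Akaike information criterion: 2k - 2 ln(L), with k fitted parameters."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood: float, k: int, n: int) -> float:
    """Bayesian information criterion: k ln(n) - 2 ln(L), with n data points."""
    return k * math.log(n) - 2 * log_likelihood


# Illustrative values only (not taken from a real fit).
print(aic(log_likelihood=-100.0, k=2))
print(bic(log_likelihood=-100.0, k=2, n=100))
```

For both criteria, lower is better, and BIC penalizes extra parameters more strongly as n grows.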

KL divergence arguments order

https://github.com/cokelaer/fitter/blob/master/src/fitter/fitter.py#L281
I am not sure, but I think that the arguments to kl_div (i.e. scipy.stats.entropy) should be swapped.
See https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.entropy.html for reference.
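
To see why the argument order matters, here is a minimal pure-Python sketch of the quantity that scipy.stats.entropy(pk, qk) computes, sum(pk * log(pk / qk)). The function name kl_divergence and the example vectors are illustrative assumptions, not code from fitter.

```python
import math


def kl_divergence(p, q):
    """D(p || q) = sum_i p_i * log(p_i / q_i) for normalized p and q."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # a term with p_i == 0 contributes nothing
        if qi == 0:
            return math.inf  # p has mass where q has none
        total += pi * math.log(pi / qi)
    return total


p = [0.5, 0.5, 0.0]
q = [0.25, 0.25, 0.5]
print(kl_divergence(p, q))  # finite
print(kl_divergence(q, p))  # infinite: q has mass in a bin where p is zero
```

The divergence is not symmetric, so swapping the two arguments can turn a finite value into an infinite one whenever one of the inputs has empty bins.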

negative binomial distribution

Why is there no negative binomial distribution among the eighty distributions?

Implement function to retrieve parameters of the best fit automatically

from scipy import stats
data = stats.gamma.rvs(2, loc=1.5, scale=2, size=100000)
from fitter import Fitter
f = Fitter(data, distributions=['gamma', 'rayleigh', 'uniform'])
f.fit()

Then, use

f.df_errors

to get the best fit (taking the minimum error), and get the parameters from f.fitted_param.

fix pandas warning

Progress Bar in Jupyter lab

I have a problem with Jupyter lab progress bar not displaying correctly. I've searched, but I can't find a way to fix it. Any solution? Does anyone else have that problem?

'module' object has no attribute 'sum'

----------------------Code -------------
from scipy import stats
data = stats.gamma.rvs(2, loc=1.5, scale=2, size=100000)
from fitter import Fitter
f = Fitter(data, distributions=['gamma', 'rayleigh', 'uniform'])
f.fit()

When I execute it, I get this message:

'module' object has no attribute 'sum'
SKIPPED gamma distribution (taking more than 10 seconds)
'module' object has no attribute 'sum'
SKIPPED rayleigh distribution (taking more than 10 seconds)
'module' object has no attribute 'sum'
SKIPPED uniform distribution (taking more than 10 seconds)

Env. anaconda

Is it possible to install the package with Anaconda? (Python 3.7)

Example from readthedocs not working even if I downsample the array

Hello, I am trying to run the example from readthedocs which is:

from scipy import stats
data = stats.gamma.rvs(2, loc=1.5, scale=2, size=100000)

from fitter import Fitter
f = Fitter(data, timeout=3000, distributions=['gamma', 'rayleigh', 'uniform']) #Increasing time out time
f.fit()
f.summary()

and I am getting:

WARNING:root:SKIPPED uniform distribution (taking more than 3000 seconds)

WARNING:root:SKIPPED rayleigh distribution (taking more than 3000 seconds)

/home/miguel/.local/lib/python3.9/site-packages/scipy/stats/_continuous_distns.py:4530: IntegrationWarning: The integral is probably divergent, or slowly convergent.
  intg = integrate.quad(f, -xi, np.pi/2, **intg_kwargs)[0]
WARNING:root:SKIPPED gamma distribution (taking more than 3000 seconds)

WARNING:fitter.fitter:uniform was not fitted. no parameters available
WARNING:fitter.fitter:rayleigh was not fitted. no parameters available
WARNING:fitter.fitter:gamma was not fitted. no parameters available
WARNING:matplotlib.legend:No handles with labels found to put in legend.

My version of fitter from pip is 1.3.0...

Am I doing something wrong? The example from readthedocs should work in principle, right?

Also, I was thinking that there might be a bug with the timeout parameter: no matter how much time I give it, it always returns SKIPPED x distribution (taking more than y seconds), where y can be a very large number.

Fitter worked fine and then stopped working

Hello,

I ran fitter successfully for a dataset I am working with, and then it stopped working.

1st run

2nd run

What could possibly be the issue?

Timeout error for all distributions.

Hello guys. Thank you for this very useful tool.

I am using Fitter with python 3.8 and in my last use I noticed a timeout error for all distributions with a dataset of only 8 thousand values. I've been using Fitter for almost a year and this error has never happened before.

Digging into fitter.py, I found that on line 426, in the _timed_run() method, the it object calls .isAlive(). However, the InterruptableThread class, which inherits from threading.Thread, only has the .is_alive() method. Changing .isAlive() to .is_alive() fixes this for me.

Without this change, when the code reaches that line, the program raises an error that is caught by the except Exception: clause on line 292 in the _fit_single_distribution() method and is reported as a timeout error.

Thanks again and sorry for the bad English.
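
For context, Thread.isAlive() was an alias of Thread.is_alive() that was removed in Python 3.9. The following is a minimal sketch of a timed run built on is_alive(); the helper name timed_run is hypothetical and this is not fitter's actual _timed_run() code.

```python
import threading
import time


def timed_run(func, timeout):
    """Run func in a daemon thread and return (finished, result).

    If func raises, the thread ends early and result is None.
    """
    result = {}

    def target():
        result["value"] = func()

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout)        # wait at most `timeout` seconds
    if worker.is_alive():       # still running: report a timeout
        return False, None
    return True, result.get("value")


finished, value = timed_run(lambda: sum(range(1000)), timeout=5)
timed_out = timed_run(lambda: time.sleep(1), timeout=0.05)
print(finished, value, timed_out)
```

Catching a broad Exception around such a call and reporting it as a timeout would hide an AttributeError like the one described above, which matches the reported symptom.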

Choose fit method (MLE or MM)

Fitter uses the fit method of SciPy, which has a method parameter that can take the values "MLE" (default) and "MM".

Would it be possible for the fit method of Fitter to expose this parameter?

Trivial typo in fitter.py

"it to compare" should be "is to compare"

James

Improve logging facilities

Hi,
Why is the logging module used directly to display info/warning messages instead of the logger variable created with logging.getLogger()?
What is the solution for disabling fitter logging when fitter is used in another module?
I changed the fitter code from logging.info() to logger.info() (lines 300 and 310 of fitter.py) and now I can force the logging level from my other module.
But I'm not sure that's the best way to control the logging level of another module.
Thanks for such a useful module :-)
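
The pattern described in this issue, a module-level logger that callers can tune by name, can be sketched as follows. The logger name "fitter_demo" and the function fit_one are hypothetical; a library would normally pass __name__ to logging.getLogger().

```python
import logging

# Library side: one named logger per module instead of the root logger.
logger = logging.getLogger("fitter_demo")  # hypothetical name


def fit_one(name):
    """Stand-in for a fitting routine that reports a skipped distribution."""
    logger.warning("SKIPPED %s distribution", name)


# Application side: raise the threshold for that logger only.
logging.getLogger("fitter_demo").setLevel(logging.ERROR)

fit_one("gamma")  # the warning is now filtered out
```

Messages sent through logging.warning() go to the root logger instead, so they can only be silenced by changing the root logger's level, which affects every other module as well.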

Adding an option for the density (normalisation)

Is it possible to add an option to have a fit (including the fit parameters) for "density=False"? And in the end, have a choice when calling an object from Fitter?

quadbacy.py warning and skipping

With the below code:

f=Fitter(spread)
f.fit()
f.summary()

I am getting these warnings. Is the fitter skipping because it deems it unnecessary? And the second warning appeared recently. Is it only calculating for 50 distributions now?

Fitting 109 distributions: 0%| | 0/109 [00:00<?, ?it/s]WARNING:root:SKIPPED _fit distribution (taking more than 30 seconds)
Fitting 109 distributions: 14%|███████▍ | 15/109 [00:03<00:37, 2.54it/s]WARNING:root:SKIPPED loguniform distribution (taking more than 30 seconds)
WARNING:root:SKIPPED rv_histogram distribution (taking more than 30 seconds)
Fitting 109 distributions: 16%|████████▍ | 17/109 [00:06<01:20, 1.14it/s]WARNING:root:SKIPPED rv_continuous distribution (taking more than 30 seconds)
WARNING:root:SKIPPED reciprocal distribution (taking more than 30 seconds)
Fitting 109 distributions: 97%|███████████████████████████████████████████████████▌ | 106/109 [00:26<00:00, 3.77it/s]WARNING:root:SKIPPED kappa4 distribution (taking more than 30 seconds)
WARNING:root:SKIPPED levy_stable distribution (taking more than 30 seconds)
Fitting 109 distributions: 99%|████████████████████████████████████████████████████▌| 108/109 [00:33<00:00, 1.91it/s]WARNING:root:SKIPPED studentized_range distribution (taking more than 30 seconds)
Fitting 109 distributions: 100%|█████████████████████████████████████████████████████| 109/109 [00:35<00:00, 3.06it/s]

and

C:\Python310\lib\site-packages\scipy\integrate\_quadpack_py.py:1151: IntegrationWarning:

The maximum number of subdivisions (50) has been achieved.
If increasing the limit yields no improvement it is advised to analyze
the integrand in order to determine the difficulties. If the position of a
local difficulty can be determined (singularity, discontinuity) one will
probably gain from splitting up the interval and calling the integrator
on the subranges. Perhaps a special-purpose integrator should be used.

C:\Python310\lib\site-packages\scipy\integrate\_quadpack_py.py:1151: IntegrationWarning:

The integral is probably divergent, or slowly convergent.

Will fitter support CUDA acceleration?

Hello, fitting probability distributions with fitter on high-dimensional data is time-consuming. I hope that Fitter will be able to run on a GPU in the future.

Dependency "easydev" unmarked on conda

I encountered this error while trying to run this code:

from fitter import Fitter
f = Fitter(data) # data already assigned
f.fit()

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
/tmp/ipykernel_2933790/2066230846.py in <module>
      4 f = Fitter(data)
----> 5 f.fit()

~/anaconda3/envs/tfp/lib/python3.7/site-packages/fitter/fitter.py in fit(self, amp, progress)
    259         warnings.filterwarnings("ignore", category=RuntimeWarning)
    260 
--> 261         from easydev import Progress
    262         N = len(self.distributions)
    263         pb = Progress(N)

ModuleNotFoundError: No module named 'easydev'

I had installed Fitter with
conda install -c bioconda fitter.

The issue is resolved by installing easydev.
conda install -c conda-forge easydev.

If the conda package can't be made to install all its dependencies, it'd be good to have a mention of this in the installation instructions.

Lack of Clarity on the Parameters of the Distribution

If I use the get_best method as follows:

f.get_best(method='sumsquare_error')

It returns the best fitted distribution and its parameters; i.e., a dictionary with one key (the distribution name) and its parameters.

For instance:

{'beta': (1.0900359801761663, 0.8058383063379988, -9.543996466545888, 107.5439964665459)}

Could you please clarify which value is the mean, the standard deviation, etc.? The package documentation does not make this clear.
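
As a hedged illustration: SciPy's fit() returns the shape parameters first, followed by loc and scale, so for beta the tuple above reads as (a, b, loc, scale) and none of the entries is the mean or the standard deviation directly. The helper beta_mean_std below is a hypothetical name that applies the standard beta formulas on the interval [loc, loc + scale].

```python
import math

# Parameters from the example above, read as (a, b, loc, scale).
a, b, loc, scale = (1.0900359801761663, 0.8058383063379988,
                    -9.543996466545888, 107.5439964665459)


def beta_mean_std(a, b, loc=0.0, scale=1.0):
    """Mean and standard deviation of a beta distribution shifted by loc and stretched by scale."""
    mean = loc + scale * a / (a + b)
    variance = scale ** 2 * a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, math.sqrt(variance)


mean, std = beta_mean_std(a, b, loc, scale)
print(mean, std)
```

With SciPy installed, scipy.stats.beta(a, b, loc=loc, scale=scale).mean() and .std() should give the same values.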

put fitter on RTD

need a name on RTD

Second trivial typo in fitter.py

" to not converge" should be " do not converge"

Suppress warnings does not work

Please tell me how to suppress warnings in my Jupyter notebook.
I tried:

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
import tensorflow as tf
import warnings
import pandas as pd
from fitter import Fitter
warnings.filterwarnings("ignore")

feature_names = ['BoxRatio', 'Thrust', 'Acceleration', 'Velocity', 'OnBalRun', 'vwapGain', 'Expect', 'Trin']
response_name = ['Altitude']

df_train = pd.read_csv("/HedgeTools/Datasets/rocket-train-classify.csv")

for name in feature_names:
    print('feature name: ', name)
    X_train = df_train[name].values
    f = Fitter(X_train)
    f.fit(progress=True)
    f.summary(plot=True)
    print('best transform: ', f.get_best())

and I'm getting

WARNING:root:SKIPPED alpha distribution (taking more than 30 seconds)
WARNING:root:SKIPPED beta distribution (taking more than 30 seconds)
WARNING:root:SKIPPED arcsine distribution (taking more than 30 seconds)
WARNING:root:SKIPPED anglit distribution (taking more than 30 seconds)
WARNING:root:SKIPPED argus distribution (taking more than 30 seconds)
WARNING:root:SKIPPED betaprime distribution (taking more than 30 seconds)
WARNING:root:SKIPPED bradford distribution (taking more than 30 seconds)
WARNING:root:SKIPPED burr distribution (taking more than 30 seconds)
WARNING:root:SKIPPED burr12 distribution (taking more than 30 seconds)
WARNING:root:SKIPPED chi2 distribution (taking more than 30 seconds)

Charles
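
One possible explanation, judging from the "WARNING:root:" prefix, is that these messages are emitted through the logging module's root logger, which warnings.filterwarnings() does not affect. Under that assumption, a minimal sketch for silencing them temporarily looks like this; the helper name root_log_level is hypothetical.

```python
import logging
from contextlib import contextmanager


@contextmanager
def root_log_level(level):
    """Temporarily change the root logger's level, then restore it."""
    root = logging.getLogger()
    previous = root.level
    root.setLevel(level)
    try:
        yield
    finally:
        root.setLevel(previous)


with root_log_level(logging.ERROR):
    # f.fit() would go here; root-logger warnings are filtered out meanwhile.
    logging.warning("SKIPPED alpha distribution (taking more than 30 seconds)")
```

Note that hiding the messages does not address why every distribution is being skipped.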

Add method to get best fitted model

It's awkward to actually get an instance of the best fitted model. Right now I've got to do:

    from scipy import stats
    ((name, params),) = fitter.get_best().items() 
    model = getattr(stats, name)(**params)

I think it would make more sense if the get_best actually returned the fitted model and some other method gave you the parameters. Look at the sklearn GridSearch where after calling fit, the fitter behaves as the best fitted model, and one can extract the best model and best params easily. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

what is the best range for `sumsquare_error`?

In my case, sumsquare_error is about 0.010512. I notice that the sumsquare_error in the documentation example is close to 0. I wonder whether I can use this result.

Another question: how can I convert sumsquare_error to R_squared, or can Fitter output R_squared directly?

Adding build tags to git repo

I have noticed that the project does not include any git tags for the builds found on PyPI. It would be really helpful if you could tag builds so that they correspond to the PyPI release versions.

If you need any help in doing this I would be more than willing to help you out. There does not seem to be too many versions so it should be a fairly quick process.

Thanks
Matt

@_mseymour

.summary() function returns a NoneType object?

I was trying to access the results of the fit object, but summary() gives me a NoneType object rather than a table.

fit() does not work with > 2000 samples.

I have used it previously to fit 8000+ data samples. This was a year or two back. I found out (iteratively, long day!) that the new version does not work with more than 2000 samples. Is it a technical limitation?

Fitter can't fit data on custom histograms?

Link to the Stack Overflow question about the project

I have just 140 data points. The histogram has about 100 bins, which makes it practically useless and flat-looking. With 10 bins the data looks really smooth, but the hist function doesn't have any attributes. There is also the issue that when I show the best fitted plots, the best one gets excluded and only the other 3 are plotted. What can I do?

Unable to fit data

Hello all,
I have just started using Fitter module to plot distribution of my datasets. The code is very basic:

import numpy as np
import pandas as pd
import seaborn as sns
from fitter import Fitter, get_common_distributions, get_distributions

dataset = pd.read_csv("/Users/xxx/Desktop/xxx/data.csv")
#dataset.head()
#dataset.info()

sns.set_style('white')
sns.set_context("paper", font_scale = 2)
sns.displot(data=dataset, x="12 pm - 3 pm", kind="hist", bins = 10, aspect = 1.5)

w912 = dataset["12 pm - 3 pm"].values
w912f= [x for x in w912 if np.isnan(x)==False]
#print(w912f)

f = Fitter(w912,distributions=['gamma',
'lognorm',
'burr',
'norm',
'genextreme'])
f.fit()
f.summary()

but when I run this, the output points to the line logger.warning("%s was not fitted. no parameters available" % name)
and reports "NameError: name 'logger' is not defined".
I did not find a suitable answer by searching. Kindly enlighten me. Also, for each distribution it says, for example, "WARNING:root:SKIPPED logistic distribution (taking more than 30 seconds)".
Thanks in advance.
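
One thing that stands out in the snippet is that it builds a NaN-free list (w912f) but passes the unfiltered array (w912) to Fitter. Whether missing values are what makes every distribution fail here is an assumption, but filtering them out before fitting can be sketched as follows; drop_nan is a hypothetical helper name.

```python
import math


def drop_nan(values):
    """Return the values with NaN entries removed."""
    return [x for x in values if not math.isnan(x)]


raw = [1.2, float("nan"), 3.4, 2.2, float("nan")]
clean = drop_nan(raw)
print(clean)

# The cleaned data is what would then be passed on, e.g.:
# f = Fitter(clean, distributions=['gamma', 'lognorm', 'burr', 'norm', 'genextreme'])
```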

Enhancement: Add goodness of fit statistic values as additional evaluation tool

I am thinking of the Anderson-Darling and/or Kolmogorov-Smirnov statistics.

Thoughts?

The ability to fit a mixture of distributions with careful identifiability analysis is needed

More functions are needed for working with mixtures of distributions.

High cpu usage after `fit()` with Jupyter notebook

Hi, I am using fitter on a 2021 MacBook Pro with Jupyter notebook. Recently, I found that after running Fitter.fit(), the CPU usage monitor shows a Python thread at 100% usage. When I ran Fitter.fit() in the terminal instead, it was totally fine.

I have checked the log of Jupyter notebook, everything is fine.

Installing fitter-1.3.0 with pip gives an error message

(base) C:\Users\233Desktop>pip install C:\Users\233Desktop\Downloads\fitter-1.3.0.tar.gz
Processing c:\users\233desktop\downloads\fitter-1.3.0.tar.gz
ERROR: Command errored out with exit status 1:
command: 'C:\Users\233Desktop\anaconda3\python.exe' -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\Users\233Desktop\AppData\Local\Temp\pip-req-build-olrsvmrt\setup.py'"'"'; __file__='"'"'C:\Users\233Desktop\AppData\Local\Temp\pip-req-build-olrsvmrt\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base 'C:\Users\233Desktop\AppData\Local\Temp\pip-pip-egg-info-1e1s8o9g'
cwd: C:\Users\233Desktop\AppData\Local\Temp\pip-req-build-olrsvmrt
Complete output (5 lines):
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Users\233Desktop\AppData\Local\Temp\pip-req-build-olrsvmrt\setup.py", line 41, in <module>
long_description = open("README.rst").read(),
UnicodeDecodeError: 'gbk' codec can't decode byte 0xa6 in position 3273: illegal multibyte sequence
----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
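
The traceback shows open("README.rst").read() failing with the 'gbk' codec: without an encoding argument, open() uses the platform's locale encoding. Assuming the README is stored as UTF-8, a minimal sketch of reading it independently of the locale is shown below; read_text_utf8 is a hypothetical helper, and the temporary file only stands in for the real README.

```python
import os
import tempfile


def read_text_utf8(path):
    """Read a text file as UTF-8 regardless of the system locale."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "README.rst")
    # Write a small file containing non-ASCII characters.
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("fitter \u2026 caf\u00e9\n")
    text = read_text_utf8(path)

print(repr(text))
```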

ImportError: cannot import name 'Fitter' from 'fitter'

But I had already run pip install fitter.

Add `LICENSE`

Hey, this looks like a handy project!

Would you consider adding an explicit license? The lack of one implies (whether intentionally or not) that it can't be used without permission.

Thanks!

AttributeError: 'Fitter' object has no attribute 'df_errors'

I just tried running Fitter as follows

import numpy as np
from fitter import Fitter
x = np.random.rand(100)
f = Fitter(x)
f.summary()

and got this error:

File ".../python3.9/site-packages/fitter/fitter.py", line 380, in plot_pdf
  names = self.df_errors.sort_values(by=method).index[0:Nbest]
AttributeError: 'Fitter' object has no attribute 'df_errors'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".../python3.9/site-packages/fitter/fitter.py", line 419, in summary
  self.plot_pdf(Nbest=Nbest, lw=lw, method=method)
File ".../python3.9/site-packages/fitter/fitter.py", line 382, in plot_pdf
  names = self.df_errors.sort(method).index[0:Nbest]
AttributeError: 'Fitter' object has no attribute 'df_errors'

Feature request - Supporting custom (non-scipy) distributions

Hi.

Fitter only supports distributions available in scipy. I need to include a distribution called 'PERT', which is available as an external module. It cannot be used with the way Fitter looks up the distribution functions: dist = eval("scipy.stats." + distribution).

I recommend adding a new argument, called something like isCustomDist, or checking whether the distribution name is not a string. When that is the case, the eval() step is bypassed.

I can assist to implement it.
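
The second suggestion (checking whether the entry is a string) could look roughly like the sketch below. Everything here is hypothetical: resolve_distribution is not part of fitter, and SimpleNamespace objects stand in for scipy.stats and for an external PERT implementation so that the example runs on its own.

```python
from types import SimpleNamespace


def resolve_distribution(dist, namespace):
    """Return a distribution object from either a name or an object.

    Strings are looked up on `namespace` (scipy.stats in real use);
    anything else is accepted as-is if it provides a fit() method.
    """
    if isinstance(dist, str):
        return getattr(namespace, dist)  # raises AttributeError if unknown
    if not hasattr(dist, "fit"):
        raise TypeError("custom distribution must provide a fit() method")
    return dist


# Stand-ins for scipy.stats and for an external PERT distribution.
fake_stats = SimpleNamespace(gamma=SimpleNamespace(fit=lambda data: (2.0, 0.0, 1.0)))
pert = SimpleNamespace(fit=lambda data: (0.0, 1.0, 2.0))

by_name = resolve_distribution("gamma", fake_stats)
custom = resolve_distribution(pert, fake_stats)
print(by_name.fit([1, 2, 3]), custom.fit([1, 2, 3]))
```

Using getattr() for the lookup also avoids calling eval() on a user-supplied string.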

How to save the histogram with fitted distributions?

Hi,

I was wondering how I can save the histogram output to my environment without going into the source code and adding a save-figure line.

Best,
Medina

AttributeError: 'module' object has no attribute 'clf' when running 'summary()'

I am getting the following error when trying to run the summary() method.
Any idea?

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-60-68eccc08c0a2> in <module>()
      3 # may take some time since by default, all distributions are tried
      4 # but you call manually provide a smaller set of distributions
----> 5 f.summary()

c:\python27\lib\site-packages\fitter\fitter.py in summary(self, Nbest, lw)
    282 
    283         """
--> 284         pylab.clf()
    285         self.hist()
    286         self.plot_pdf(Nbest=Nbest, lw=lw)

AttributeError: 'module' object has no attribute 'clf'

How to order by AIC (Akaike Information Criterion) or BIC (Bayesian Information Criterion)?

The tables are ordered by sum of squares, but ordering by the Akaike Information Criterion or the Bayesian Information Criterion would give a different result. The KL divergence appears to be infinite (8.6 Tanker is not infinite. Do you know why?), which is worrying, as KL should be low for a good fit. How can I order by the Akaike Information Criterion or the Bayesian Information Criterion?
Most of the distributions in the section appear bi-modal or multi-modal, but the standard distributions out of SciPy appear to be single modal. So we're not seeing a good fit. Fitting a multi-modal distribution will increase the model complexity (number of model parameters) and so measures such as aic and bic will become important as a progressively better fit will be possible by increasing the dimension of a multi-modal distribution.

"pip install fitter" Results in Error: "No such file or directory: 'requirements.txt'"

$ pip install fitter
Collecting fitter
  Downloading https://files.pythonhosted.org/packages/ad/4d/2a899c9d617b3c1c05246d43581d82dfe3f108f74af55ea0bb741e720b45/fitter-1.1.10.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-ma9vgdkb/fitter/setup.py", line 54, in <module>
        install_requires = open("requirements.txt").read(),
    FileNotFoundError: [Errno 2] No such file or directory: 'requirements.txt'
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-ma9vgdkb/fitter/

Plot graphics is not working

I am getting errors related to plotting data. Fitter is expected to plot some graphics, but it is not working. These are the error messages I am getting:

>>> f.fit()
Fitted norm distribution with error=0.26388354842966155)
Fitted expon distribution with error=0.21704685620029987)
>>> f.summary()
/usr/lib/python3.7/site-packages/matplotlib/axes/_axes.py:6521: MatplotlibDeprecationWarning: 
The 'normed' kwarg was deprecated in Matplotlib 2.1 and will be removed in 3.1. Use 'density' instead.
  alternative="'density'", removal="3.1")
       sumsquare_error
expon         0.217047
norm          0.263884
>>> f.hist()
/usr/lib/python3.7/site-packages/matplotlib/axes/_axes.py:6521: MatplotlibDeprecationWarning: 
The 'normed' kwarg was deprecated in Matplotlib 2.1 and will be removed in 3.1. Use 'density' instead.
  alternative="'density'", removal="3.1")

It looks like these are just warnings, not actual errors, but the plotting is not working anyway. Any idea?

Conda package not updated

The bioconda channel only has up to version 1.2.3 available. Note that the current latest version includes several bug fixes, and you will need to either install via pip, or edit the source directly to make some things work.

Reproducibility across fittings

Hello!

Thank you so much for making this tool, it is very useful!

I noticed that across multiple fittings to the same set of data, different "best" distributions are shown. Is this intended behaviour? Is there a way to ensure reproducibility across runs, like setting a random seed?

Cheers,
Nancy

Takes too long to finish a fitting

It seems that Fitter tries many SciPy models to fit my data, and it takes too long. Is there a better way to use Fitter to finish the job?

AttributeError: Unknown property density when running summary()

from fitter import Fitter
y2 = [0, 0, 1, 1, 2, 3, 4, 5, 6, 6, 6, 7, 8, 9, 9, 9, 9, 9, 10, 11, 13, 14, 16, 18]
f = Fitter(y2)
f.fit()
# may take some time since by default, all distributions are tried
# but you can manually provide a smaller set of distributions
f.summary()
print("test over!", flush=True)
exit(1)

Feature request: raise Exception when an item in distributions is not a valid distribution

When an item in the distributions list is not a valid distribution, the fit method yields warnings, e.g.:
WARNING:root:SKIPPED normal distribution (taking more than 30 seconds)
WARNING:fitter.fitter:normal was not fitted. no parameters available
These are IMHO ambiguous and should be replaced by raising an exception

The above behaviour can be generated by the following commands
f = Fitter(data, distributions=['beta', 'gamma', 'normal'])
f.fit()
which yield the following result https://gist.github.com/amirpaster/c25bac5e4cacae04fe538742f8b6e1cf
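
The requested behaviour could be sketched as an up-front check like the one below. The function check_distributions is hypothetical, and the `available` list is a stand-in: in real use it might come from fitter's get_distributions() or from the names defined in scipy.stats.

```python
import difflib


def check_distributions(names, available):
    """Raise ValueError if any requested name is not an available distribution."""
    unknown = [name for name in names if name not in available]
    if unknown:
        hints = {
            name: difflib.get_close_matches(name, available, n=1)
            for name in unknown
        }
        raise ValueError(f"unknown distribution(s) {unknown}; close matches: {hints}")
    return list(names)


available = ["beta", "gamma", "norm"]  # stand-in list for illustration
try:
    check_distributions(["beta", "gamma", "normal"], available)
except ValueError as err:
    message = str(err)
    print(message)
```

Failing early in this way separates a misspelled name from a genuine timeout, which the current warnings do not distinguish.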
