
powerlaw's Introduction

powerlaw: A Python Package for Analysis of Heavy-Tailed Distributions

powerlaw is a toolbox using the statistical methods developed in Clauset et al. 2007 and Klaus et al. 2011 to determine if a probability distribution fits a power law. Academics, please cite as:

Jeff Alstott, Ed Bullmore, Dietmar Plenz. (2014). powerlaw: a Python package for analysis of heavy-tailed distributions. PLoS ONE 9(1): e85777

Also available at arXiv:1305.0215 [physics.data-an]

Basic Usage

For the simplest, typical use cases, this tells you everything you need to know:

import powerlaw
import numpy as np
data = np.array([1.7, 3.2, ...]) # data can be a list or numpy array
results = powerlaw.Fit(data)
print(results.power_law.alpha)
print(results.power_law.xmin)
R, p = results.distribution_compare('power_law', 'lognormal')

For more explanation, understanding, and figures, see the paper, which illustrates all of powerlaw's features. For details of the math, see Clauset et al. 2007, which developed these methods.
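For a taste of those figures, the fits can be plotted directly from the Fit object. A minimal sketch, assuming matplotlib is installed and that results is the Fit object from the snippet above:

import matplotlib.pyplot as plt

ax = results.plot_ccdf(linewidth=3, label='Empirical data')
results.power_law.plot_ccdf(ax=ax, color='r', linestyle='--', label='Power law fit')
results.lognormal.plot_ccdf(ax=ax, color='g', linestyle='--', label='Lognormal fit')
ax.legend()
plt.show()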

Quick Links

Paper illustrating all of powerlaw's features, with figures

Code examples from the manuscript, as an IPython Notebook. Note: Some results involving lognormals will now differ from the manuscript, as the lognormal fitting has been improved to allow for greater numerical precision.

Documentation

This code was developed and tested for Python 2.x with the Enthought Python Distribution, and later amended to be compatible with 3.x. The full version of Enthought is available for free for academic use.

Installation

powerlaw is hosted on PyPI, so installation is straightforward. The easiest way to install is to type this at the command line (Linux, Mac, or Windows):

easy_install powerlaw

or, better yet:

pip install powerlaw

easy_install or pip just need to be on your PATH, which for Linux or Mac is probably the case.

pip should install all dependencies automagically. The dependencies are numpy, scipy, and matplotlib. These are all present in Enthought, Anaconda, and most other scientific Python stacks. To fit truncated power laws or gamma distributions, mpmath is also required, which is less common and is installable with:

pip install mpmath

The requirement of mpmath will be dropped if/when the scipy functions gamma, gammainc and gammaincc are updated to have sufficient numerical accuracy for negative numbers.

You can also build from source from the code here on Github, though it may be a development version slightly ahead of the PyPI version.

Update Notifications and Mailing List

Get notified of updates by joining the Google Group here.

Questions/discussions/help go on the Google Group here. Also receives update info.

Further Development

The original author of powerlaw, Jeff Alstott, is now only writing minor tweaks, but powerlaw remains open for further development by the community. If there's a feature you'd like to see in powerlaw you can submit an issue, but pull requests are even better. Offers for expansion or inclusion in other projects are welcomed and encouraged.

Acknowledgements

Many thanks to Andreas Klaus, Mika Rubinov and Shan Yu for helpful discussions. Thanks also to Andreas Klaus, Aaron Clauset, Cosma Shalizi, and Adam Ginsburg for making their code available. Their implementations were a critical starting point for making powerlaw.

Power Laws vs. Lognormals and powerlaw's 'lognormal_positive' option

When fitting a power law to a data set, one should compare the goodness of fit to that of a lognormal distribution. This is done because lognormal distributions are another heavy-tailed distribution, but they can be generated by a very simple process: multiplying random positive variables together. The lognormal is thus much like the normal distribution, which can be created by adding random variables together; in fact, the log of a lognormal distribution is a normal distribution (hence the name), and the exponential of a normal distribution is the lognormal (which maybe would be better called an expnormal). In contrast, creating a power law generally requires fancy or exotic generative mechanisms (this is probably why you're looking for a power law to begin with; they're sexy). So, even though the power law has only one parameter (alpha: the slope) and the lognormal has two (mu: the mean of the random variables in the underlying normal and sigma: the standard deviation of the underlying normal distribution), we typically consider the lognormal to be a simpler explanation for observed data, as long as the distribution fits the data just as well. For most data sets, a power law is actually a worse fit than a lognormal distribution, or perhaps equally good, but rarely better. This fact was one of the central empirical results of the paper Clauset et al. 2007, which developed the statistical methods that powerlaw implements.

However, for many data sets, the superior lognormal fit is only possible if one allows the fitted parameter mu to go negative. Whether or not this is sensible depends on your theory of what's generating the data. If the data is thought to be generated by multiplying random positive variables, mu is just the log of the distribution's median; a negative mu just indicates those variables' products are typically below 1. However, if the data is thought to be generated by exponentiating a normal distribution, then mu is interpreted as the median of the underlying normal data. In that case, the normal data is likely generated by summing random variables (positive and negative), and mu is those sums' median (and mean). A negative mu, then, indicates that the random variables are typically negative. For some physical systems, this is perfectly possible. For the data you're studying, though, it may be a weird assumption. For starters, all of the data points you're fitting to are positive by definition, since power laws must have positive values (indeed, powerlaw throws out 0s or negative values). Why would those data be generated by a process that sums and exponentiates negative variables?

If you think that your physical system could be modeled by summing and exponentiating random variables, but you think that those random variables should be positive, one possible hack is powerlaw's lognormal_positive. This is just a regular lognormal distribution, except mu must be positive. Note that this does not force the underlying normal distribution to be the sum of only positive variables; it only forces the sums' average to be positive, but it's a start. You can compare a power law to this distribution in the normal way shown above:

R, p = results.distribution_compare('power_law', 'lognormal_positive')

You may find that a lognormal where mu must be positive gives a much worse fit to your data, and that leaves the power law looking like the best explanation of the data. Before concluding that the data is in fact power law distributed, consider carefully whether a more likely explanation is that the data was generated by multiplying positive random variables, or even by summing and exponentiating random variables; either one would allow for a lognormal with an intelligible negative value of mu.

powerlaw's People

Contributors

andreasbilke, cdown, cils, drbild, esafak, henrymartin1, jeffalstott, kataev, keflavich, lneisenman, mdf-github, mountaindust, nsfzyzz, pfischbeck, riptidebo, sx349, wuhaochen


powerlaw's Issues

Pass different minimize options to Scipy's minimize

The numerical fitting methods currently use Scipy's fmin:
from scipy.optimize import fmin

This is just one possible minimizing function. There's a keyword option in Fit waiting to let the user set a different minimizing function, fit_optimizer=None. It would be good to give the user the option for other numerical minimizing functions, such as Scipy's other functions in this space.
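For reference, a sketch of the kind of swap such a fit_optimizer hook could enable, using a toy continuous power-law likelihood; this is illustrative only, not powerlaw's internal code:

import numpy as np
from scipy.optimize import fmin, minimize

def nll(params, data, xmin):
    # Negative log-likelihood of a continuous power law (Clauset et al.);
    # assumes alpha stays above 1 during the search.
    alpha = params[0]
    n = len(data)
    return -(n * np.log((alpha - 1) / xmin) - alpha * np.sum(np.log(data / xmin)))

data = np.array([1.3, 2.1, 3.7, 5.2, 8.9, 14.0])
xmin = 1.0

# Current approach: Nelder-Mead simplex via fmin
alpha_old = fmin(nll, [2.0], args=(data, xmin), disp=False)[0]

# Possible alternative: the general scipy.optimize.minimize interface,
# which exposes many other methods ('Powell', 'L-BFGS-B', ...)
res = minimize(nll, x0=[2.0], args=(data, xmin), method='Nelder-Mead')
alpha_new = res.x[0]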

FloatingPointError when Fit() is invoked

powerlaw.Fit(discrete=True, data=[1]*1000 + [10]*100 + [100]*10 + [1000])
Output:
Calculating best minimal value for power law fit
---------------------------------------------------------------------------
FloatingPointError                        Traceback (most recent call last)
<ipython-input-320-4afe2d64269c> in <module>()
      1 # TODO, floating point error/bug in powerlaw lib
----> 2 powerlaw.Fit(discrete=True, data=[1,10,100,1000,100000])

/usr/local/lib/python2.7/site-packages/powerlaw.pyc in __init__(self, data, discrete, xmin, xmax, fit_method, estimate_discrete, discrete_approximation, sigma_threshold, parameter_range, fit_optimizer, xmin_distance, **kwargs)
    127             self.fixed_xmin=False
    128             print("Calculating best minimal value for power law fit", file=sys.stderr)
--> 129             self.find_xmin()
    130 
    131         self.data = self.data[self.data>=self.xmin]

/usr/local/lib/python2.7/site-packages/powerlaw.pyc in find_xmin(self, xmin_distance)
    226             return getattr(pl, xmin_distance), pl.alpha, pl.sigma, pl.in_range()
    227 
--> 228         fits = asarray(list(map(fit_function, xmins)))
    229         # logging.warning(fits.shape)
    230         setattr(self, xmin_distance+'s', fits[:,0])

/usr/local/lib/python2.7/site-packages/powerlaw.pyc in fit_function(xmin)
    223                            data=self.data,
    224                            parameter_range=self.parameter_range,
--> 225                            parent_Fit=self)
    226             return getattr(pl, xmin_distance), pl.alpha, pl.sigma, pl.in_range()
    227 

/usr/local/lib/python2.7/site-packages/powerlaw.pyc in __init__(self, estimate_discrete, **kwargs)
   1103     def __init__(self, estimate_discrete=True, **kwargs):
   1104         self.estimate_discrete = estimate_discrete
-> 1105         Distribution.__init__(self, **kwargs)
   1106 
   1107     def parameters(self, params):

/usr/local/lib/python2.7/site-packages/powerlaw.pyc in __init__(self, xmin, xmax, discrete, fit_method, data, parameters, parameter_range, initial_parameters, discrete_approximation, parent_Fit, **kwargs)
    600 
    601         if (data is not None) and not (parameter_range and self.parent_Fit):
--> 602             self.fit(data)
    603 
    604 

/usr/local/lib/python2.7/site-packages/powerlaw.pyc in fit(self, data)
   1139             if not self.in_range():
   1140                 Distribution.fit(self, data, suppress_output=True)
-> 1141             self.KS(data)
   1142         else:
   1143             Distribution.fit(self, data, suppress_output=True)

/usr/local/lib/python2.7/site-packages/powerlaw.pyc in KS(self, data)
    690         self.Asquare = sum((
    691                             (CDF_diff**2) /
--> 692                             (Theoretical_CDF * (1 - Theoretical_CDF))
    693                             )[1:]
    694                            )

FloatingPointError: invalid value encountered in divide

Standard error on best fit parameters of stretched exponential

Hi there,

Thanks for such an amazing package. It's made my life a hell of a lot easier!

One question: I can't figure out how to get the standard errors on the best-fit parameters for the non-powerlaw distributions (i.e. lognormal, stretched exponential, etc.). Am I being stupid? Or is this not provided? If the latter, it would be great if it could be implemented!

I would happily provide help implementing if you need / want it.

[powerlaw.pdf] a little problem

First of all, powerlaw is really cool, especially data's pdf (also as a histogram); I really appreciate the logarithmic way of deciding the bins. But I have encountered two little problems when using:

powerlaw.pdf(data, xmin=xmin, linear_bins=True)

  1. when data is not an array type and xmin < 1, the call histogram(data/xmin, bins, density=True) raises a TypeError. Maybe we need one more line, something like data = np.asarray(data)

  2. when I pass more parameters, like bins = 512, those kwargs are not used. So why not pass them on to np.histogram()? (See the sketch below.)
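A sketch of the suggested change; this is a hypothetical patch, not the shipped code:

from numpy import asarray, histogram

def pdf_sketch(data, xmin=None, xmax=None, **kwargs):
    data = asarray(data, dtype=float)  # problem 1: accept plain lists
    bins = kwargs.pop('bins', 10)      # problem 2: honor a user-supplied bins
    hist, edges = histogram(data, bins=bins, density=True, **kwargs)
    return edges, hist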

installation fails?

Thanks for the great package.

After the recent 0.8.1 update I encountered the following installation issue (same on Mac/Linux) with pip and easy_install

pip install -U powerlaw


Downloading/unpacking powerlaw
  Downloading powerlaw-0.8.1.tar.gz
  Running setup.py egg_info for package powerlaw
    Traceback (most recent call last):
      File "<string>", line 14, in <module>
      File "/home/mekman/build/powerlaw/setup.py", line 2, in <module>
        with open('README.rst') as file:
    IOError: [Errno 2] No such file or directory: 'README.rst'
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):

  File "<string>", line 14, in <module>

  File "/home/mekman/build/powerlaw/setup.py", line 2, in <module>

    with open('README.rst') as file:

IOError: [Errno 2] No such file or directory: 'README.rst'

----------------------------------------
Command python setup.py egg_info failed with error code 1
Storing complete log in /home/mekman/.pip/pip.log

meaning of p-value

Hi Jeff

This is not really an issue with the code, but rather some of the comments in the code.
I just noticed that in the function descriptions (e.g., likelihood_ratio, compare_distributions etc), the meaning of the p-value is not clear:
The significance of the sign of R. If below a critical values
(typically .05) the sign of R is taken to be significant. If below the
critical value the sign of R is taken to be due to statistical
fluctuations.

Basically it says "below" for both cases...

returns datatype error

Here is my code:

newDS=removeTotal[['Firms', 'IndustrySize']][:8].astype(float)

Firms	IndustrySize
1	3598185.0	1.0
2	998953.0	2.0
3	608502.0	3.0
4	5205640.0	4.0
5	513179.0	5.0
6	87563.0	6.0
7	5806382.0	7.0
8	19076.0	8.0
import matplotlib.pyplot as plt
plt.plot(newDS['Firms'],newDS['IndustrySize'] )
plt.show()

plot is generated okay.

Now if I run

from powerlaw import plot_pdf, Fit, pdf
x, y = pdf(newDS)

it generates following error, traceback provided below:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-21-79cc0ba3a245> in <module>()
      1 from powerlaw import plot_pdf, Fit, pdf
----> 2 x, y = pdf(newDS)

/usr/local/Cellar/python3/3.6.4_2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/powerlaw.py in pdf(data, xmin, xmax, linear_bins, **kwargs)
   1949 
   1950 
-> 1951     if xmin<1:  #To compute the pdf also from the data below x=1, the data, xmax and xmin are rescaled dividing them by xmin.
   1952         xmax2=xmax/xmin
   1953         xmin2=1

TypeError: '<' not supported between instances of 'str' and 'int'
I also have asked it here
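A plausible workaround, on the assumption that the intent is to fit the 'Firms' column: powerlaw.pdf expects a one-dimensional numeric array, but iterating a two-column DataFrame yields its column labels, so min(data) returns a string and the internal xmin < 1 comparison mixes str and int:

import numpy as np
from powerlaw import pdf

firms = np.asarray(newDS['Firms'], dtype=float)  # single numeric column
x, y = pdf(firms)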

description of the module

Hi,

in the description of the powerlaw module for Python, it is written that allowing mu to be negative means taking the product of negative random variables. As far as I know, this is not true: it means summing negative random variables, which are the logs of the variables being multiplied. So it means you are multiplying variables that are typically between 0 and 1.

thank you very much

Question about p-value

I am confused about the meaning of the p-value between the software and the paper.

From the software documentation, a result is significant when the p-value < 0.05 (the normal usage in statistics).

However, the footnote on page 17 of the paper "Power Law Distributions in Empirical Data" states that they use the p-value as a measure of the hypothesis they are trying to verify.
Hence, high values, not low, are "good".

So, if I use distribution_compare(A, B) and get R > 0 with p-value > 0.1, is A the better fit to the data?

Thank you.

Fixed parameter range option currently not working for lognormal distribution

Hi Jeff,

The other day I asked you for the example2 data from this link
http://nbviewer.ipython.org/gist/jeffalstott/3b69b400bbd8461c02c4
because I couldn’t get the forced positive mu
to work with my data set and I wanted first to see if I could duplicate the results from
your notebook. I can’t. I suspect it’s a bug.

I attach two files test1.py and test2.py that include the same instructions as the “actual data”
examples (without and with forced positive mu, respectively) in your notebook. The outputs
are the same (both produce the same negative mu), unlike the output in the notebook.

Thanks for your thoughts,

Michael

*********************Output from test1.py

In [46]: run test1.py
Values less than or equal to 0 in data. Throwing out 0 or negative values
Calculating best minimal value for power law fit
/Users/bostock/anaconda/lib/python3.4/site-packages/powerlaw.py:693: RuntimeWarning: invalid value encountered in true_divide
  (Theoretical_CDF * (1 - Theoretical_CDF))
Power law's alpha: 3.531867
Exponential's lambda: 0.119016
R: 61.774285, p: 0.000891
/Users/bostock/anaconda/lib/python3.4/site-packages/powerlaw.py:693: RuntimeWarning: divide by zero encountered in true_divide
  (Theoretical_CDF * (1 - Theoretical_CDF))
Lognormal's sigma: 15.197246, mu: -579.325239
R: -0.955970, p: 0.151001


**********************Output from test2.py
In [47]: run test2.py
Values less than or equal to 0 in data. Throwing out 0 or negative values
Calculating best minimal value for power law fit
/Users/bostock/anaconda/lib/python3.4/site-packages/powerlaw.py:693: RuntimeWarning: invalid value encountered in true_divide
  (Theoretical_CDF * (1 - Theoretical_CDF))
/Users/bostock/anaconda/lib/python3.4/site-packages/powerlaw.py:693: RuntimeWarning: divide by zero encountered in true_divide
  (Theoretical_CDF * (1 - Theoretical_CDF))
Lognormal's sigma: 15.197246, mu: -579.325239
R: -0.955970, p: 0.151001

*******************************************************
test1.py 
import powerlaw
import numpy as np

data2 = np.genfromtxt('Example2.csv', delimiter=' ')

d = data2
d = d[~np.isnan(d)]

fit = powerlaw.Fit(d)
fit.plot_ccdf(linewidth=4)
fit.power_law.plot_ccdf()
fit.exponential.plot_ccdf()
#fit.lognormal.plot_ccdf()

print("Power law's alpha: %f"%fit.power_law.alpha)
print("Exponential's lambda: %f"%fit.exponential.Lambda)
print("R: %f, p: %f"%fit.distribution_compare('power_law', 'exponential'))

print("Lognormal's sigma: %f, mu: %f"%(fit.lognormal.sigma, fit.lognormal.mu))
print("R: %f, p: %f"%fit.distribution_compare('power_law', 'lognormal'))

*******************************************************
test2.py 

import numpy as np
import powerlaw

data2 = np.genfromtxt('Example2.csv', delimiter=' ')

d = data2
d = d[~np.isnan(d)]

fit_positive = powerlaw.Fit(d)

range_dict = {'mu': [0.0, None]}
fit_positive.lognormal.parameter_range(range_dict)

print("Lognormal's sigma: %f, mu: %f"%(fit_positive.lognormal.sigma, fit_positive.lognormal.mu))
print("R: %f, p: %f"%fit_positive.distribution_compare('power_law', 'lognormal'))

fit_positive.plot_pdf(linewidth=4)
fit_positive.power_law.plot_pdf()
fit_positive.lognormal.plot_pdf()

Off-by-one error when binning in powerlaw.pdf

Great job on this library! Just writing in for what should be a minor bug-fix:

Line 1964 of powerlaw.py appears to have an off-by-one error: xmax represents the last data point (if taken from data), but it will be excluded from the returned histogram when linear bins is used because Python's range function includes the start point and excludes the stopping point. E.g., instead of
elif linear_bins: bins = range(int(xmin2), int(xmax2))
it should be
bins = range(int(xmin2), int(xmax2)+1)

An equivalent problem exists just below this part in the code when logarithmic bins are used, though the issue is more subtle. Unlike range, numpy.logspace does include both endpoints. However, taking the log of the endpoints via (line 1966)
log_min_size = log10(xmin2)
log_max_size = log10(xmax2)
produces roundoff error, so the result of numpy.logspace has endpoints that are very close to xmin2 and xmax2, but not exact. In the case that brought this to my attention, the right endpoint is some minuscule roundoff error below the original value of xmax2, so when the floor is taken it once again leaves out the right-most data point from the result binning. To fix this, xmin2 and xmax2 should be enlarged just a bit before the log is taken:
log_min_size = log10(xmin2+0.00001)
log_max_size = log10(xmax2+0.00001)
or similar. Since the roundoff error can be expected to be smaller than the manually added error, this assures that when the floor is taken, the right and left endpoints of the log bins return to xmax and xmin respectively.

Maybe this will solve other issues, because I ran across it when I noticed that pdf plots don't match up when they should. I have a dataset where there is an outlier to the far right. Any plot that uses this routine to bin leaves it out and therefore the pdf is normalized over a smaller interval. However, if I plot a distribution directly, the data is not binned but just used as x values for the theoretical PDF - so the rightmost point is not left out, and the normalization interval is different. This results in a PDF plot that is shifted away from the data if both data and theoretical PDF are plotted together (for example: if one wants to plot the full data and a pdf extended to the left of the fitted xmin side-by-side in the same plot, they don't match).

Setting option "verbose=False" in Fit() leads to error

Hi, I've been trying to use "verbose = False" to silence the messages but kept getting "divide by zero" errors. I checked the .py file and found that, starting at line 96:

if 0 in self.data and verbose:
    print("Values less than or equal to 0 in data. Throwing out 0 or negative values", file=sys.stderr)
    self.data = self.data[self.data>0]

This is obviously wrong; a correct version would be:

if 0 in self.data:
    if verbose: 
        print("Values less than or equal to 0 in data. Throwing out 0 or negative values", file=sys.stderr)
    self.data = self.data[self.data>0]

powerlaw: RuntimeWarning: invalid value encountered in divide (Theoretical_CDF * (1 - Theoretical_CDF))

I have this data
user1 user2 weight_link
"110", "1704", "1.008",
"110", "2139", "1.013",
"110", "4648", "1.02",
"110", "9490", "1.007",
"110", "12643", "1.013",
"110", "18224", "1.024",
"110", "21212", "1.011",
"110", "25759", "1.026",
"110", "27618", "1.022",
"110", "31667", "1.014",

I used weight as my data [1.008,1.013,1.008, 1.013, 1.02, 1.007,1.013,1.024, 1.011,1.026,1.022,1.014]

Then I run these lines:

data = getlink("test3.txt")#read the weight as float
results = powerlaw.Fit(data)
print results.power_law.alpha
print results.power_law.xmin
R, p = results.distribution_compare('power_law', 'lognormal')

I get these results:

Calculating best minimal value for power law fit
116.192331641
1.007
/usr/local/lib/python2.7/dist-packages/powerlaw.py:692: RuntimeWarning: invalid value encountered in divide (Theoretical_CDF * (1 - Theoretical_CDF))

Script terminated

Option to pass parameter_range to multiple different distributions within a Fit object.

Currently, users can use the parameter_range option to dictate that a Distribution object fits a theoretical distribution to empirical data under some constraints. This can be done when making a Distribution object manually, including when they are part of a Fit object, like this:

fit = powerlaw.Fit(data)
fit.distribution_compare('power_law', 'lognormal') #Uses a regular lognormal fit
range_dict = {'mu': [0.0, None]}
fit.lognormal.parameter_range(range_dict) #Implements the restriction on lognormal that mu must be positive
fit.distribution_compare('power_law', 'lognormal') #Uses a lognormal fit which has a positive mu.

This is good for doing parameter manipulations on the fly. However, it would be good to be able to assert some parameter ranges when the Fit is first created. We can use parameter_range when making a Fit object, like so:

parameter_range = {'alpha': [2.3, None], 'sigma': [None, .2]}
fit = powerlaw.Fit(data, parameter_range=parameter_range)

The parameter_range is only used for the power law fit; it is not used for any of the other distributions. This makes sense, as alpha and sigma may not be parameters in the other distributions. However, it would be good to have the ability to pass a parameter_range with options for many different distributions. It's unclear what the One Best Way to do that would be.
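One possible shape for such an interface would be a nested dictionary keyed by distribution name. This is purely a hypothetical sketch; powerlaw does not currently accept such a keyword:

parameter_ranges = {
    'power_law': {'alpha': [2.3, None], 'sigma': [None, .2]},
    'lognormal': {'mu': [0.0, None]},
}
fit = powerlaw.Fit(data, parameter_ranges=parameter_ranges)  # hypothetical keyword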

pdf-distribution for values <1

I noticed that values below 1 are not included in the pdf of the data (powerlaw.pdf(), fit.plot_pdf()). For some reason unknown to me, the histogram's logarithmically spaced bin boundaries are transformed to integers.
line 1952 in powerlaw.py: bins=unique(floor(logspace(log_min_size, log_max_size, num=number_of_bins)))
In this way a potentially infinite number of bins is eliminated.

I modified the function "pdf" to solve the problem. The data are rescaled by dividing them by xmin before the histogram is computed. At the end the bin boundaries "edges" are transformed back to the original scale and returned.

Below the code:

def pdf(data, xmin=None, xmax=None, linear_bins=False, **kwargs):
    """
    Returns the probability density function (normalized histogram) of the
    data.

    Parameters
    ----------
    data : list or array
    xmin : float, optional
        Minimum value of the PDF. If None, uses the smallest value in the data.
    xmax : float, optional
        Maximum value of the PDF. If None, uses the largest value in the data.
    linear_bins : float, optional
        Whether to use linearly spaced bins, as opposed to logarithmically
        spaced bins (recommended for log-log plots).

    Returns
    -------
    bin_edges : array
        The edges of the bins of the probability density function.
    probabilities : array
        The portion of the data that is within the bin. Length 1 less than
        bin_edges, as it corresponds to the spaces between them.
    """
    from numpy import logspace, histogram, floor, unique
    from math import ceil, log10
    if not xmax:
        xmax = max(data)
    if not xmin:
        xmin = min(data)

    # rescale the data by xmin, to allow a pdf also for data below x=1
    data2 = data/xmin
    xmax = xmax/xmin
    xmin_old = xmin
    xmin = 1

    if linear_bins:
        bins = range(int(xmin), int(xmax))
    else:
        log_min_size = log10(xmin)
        log_max_size = log10(xmax)
        number_of_bins = ceil((log_max_size-log_min_size)*10)
        bins = unique(
                floor(
                    logspace(
                        log_min_size, log_max_size, num=number_of_bins)))
    hist, edges = histogram(data2, bins, density=True)

    # transform the bin edges back to the original scale
    edges = edges*xmin_old
    return edges, hist

Sigma in Truncated Power Law

Hi Jeff,
Is there any way to get the standard error 'sigma' for truncated power-law fitting? I only see one for the power-law function.

IndexError from power_law_ks_distance()

It's me again. I also kept getting an IndexError from find_xmin. The problem seems to start at line 2556:

    n = float(len(data))
    if n < 2:
        if kuiper:
            return 1, 1, 2
        return 1

    if not all(data[i] <= data[i + 1] for i in arange(n - 1)):
        data = sort(data)

    if not discrete:
        Actual_CDF = arange(n) / n
        Theoretical_CDF = 1 - (data / xmin) ** (-alpha + 1)

The IndexError arises when i from arange(n-1) is used as an index: since n is a float, i would also be a float.

Since we are only making n a float to avoid possible wrong results from "arange(n)/n", it seems a better choice to replace it with:

    n = len(data)
# (code code code code)
    if not discrete:
        Actual_CDF = arange(n) / float(n)
        Theoretical_CDF = 1 - (data / xmin) ** (-alpha + 1)

Negative mu in lognormal fit -- source?

In the description you state:

However, for many data sets, the superior lognormal fit is only possible if one allows the fitted parameter mu to go negative.

Do you have any published source I can quote for this statement? I have indeed found a highly negative mu for a lognormal fit (and no good fit whatsoever for lognormal-positive) so I'm inclined to favour a power-law distribution, but I need to quote something proper for my thesis.

Not working with current numpy version

This line throws an error with the updated numpy package, namely since numpy 1.13.0, see changes here.

masked_Ds = masked_array(getattr(self, xmin_distance+'s'), mask=-good_values)

Negating/inverting a boolean array should now be done with ~, so it should be ~good_values instead of -good_values.
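A quick demonstration of the difference on numpy >= 1.13:

import numpy as np

good_values = np.array([True, False, True])
print(~good_values)  # array([False,  True, False]) -- supported
print(-good_values)  # TypeError on numpy >= 1.13: use ~ or logical_not instead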

Exceptions when sampling discrete distribution

I am having a problem obtaining random samples from a discrete power law distribution. Sometimes, I get an exception instead of the desired value.

Code to reproduce the issue:

import powerlaw
xmax=100
exponent=1.1
dist=powerlaw.Power_Law(xmin=1,xmax=xmax,discrete=True,parameters=[exponent],discrete_approximation="xmax")
print(dist.generate_random(n=1,estimate_discrete=False))

The exception:

Traceback (most recent call last):
  File "sample_ips.py", line 7, in <module>
    print(dist.generate_random(n=1,estimate_discrete=False))
  File "/usr/lib/python3.7/site-packages/powerlaw.py", line 1094, in generate_random
    x = array([self._double_search_discrete(R) for R in r],
  File "/usr/lib/python3.7/site-packages/powerlaw.py", line 1094, in <listcomp>
    x = array([self._double_search_discrete(R) for R in r],
  File "/usr/lib/python3.7/site-packages/powerlaw.py", line 1101, in _double_search_discrete
    while self.ccdf(data=[x2]) >= (1 - r):
  File "/usr/lib/python3.7/site-packages/powerlaw.py", line 727, in ccdf
    return self.cdf(data=data, survival=survival)
  File "/usr/lib/python3.7/site-packages/powerlaw.py", line 782, in cdf
    if isnan(min(CDF)):
  File "/usr/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 2618, in amin
    initial=initial)
  File "/usr/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 86, in _wrapreduction
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: zero-size array to reduction operation minimum which has no identity

Is it safe to ignore the exception and regenerate the value, or the generator is malfunctioning and the data cannot be trusted?
Thank you for your help.
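One defensive wrapper, sketched as an assumption rather than an official fix: retry failed draws. Whether discarding the failed draws biases the resulting sample is exactly the open question above.

import powerlaw

def sample_with_retry(dist, n, max_attempts=1000):
    samples = []
    attempts = 0
    while len(samples) < n and attempts < max_attempts:
        attempts += 1
        try:
            samples.extend(dist.generate_random(n=1, estimate_discrete=False))
        except ValueError:
            continue  # zero-size-array failure; draw again
    return samples

dist = powerlaw.Power_Law(xmin=1, xmax=100, discrete=True,
                          parameters=[1.1], discrete_approximation="xmax")
print(sample_with_retry(dist, n=10))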

Fitting issues with standard configuration

Hello,

first of all thank you for providing us with such a comprehensive package. I'm a first time user and cannot figure out how to produce meaningful results despite reading your paper.

I have the following distribution: Counter({1: 233, 2: 33, 3: 13, 4: 10, 5: 2, 6: 3, 7: 2, 8: 2, 9: 1, 11: 1, 15: 1, 18: 1})

When trying to feed the powerlaw.Fit method with my dataset I get the warning "RuntimeWarning: invalid value encountered in true_divide (Theoretical_CDF * (1 - Theoretical_CDF))", which can be ignored, as I discovered in older threads.

The issue is poor scaling parameters as an output: alpha = 3.81, xmin = 6. When plotting the hypothesized power-law distribution, you can clearly see the poor fit.
unknown-1

On the other hand when using the code provided by Clauset (http://www.santafe.edu/~aaronc/powerlaws/), a better fit is produced with alpha = 2.59, xmin = 1.
unknown

I tried a couple of different configurations, but the output stays the same. I'm wondering what's going wrong?

Best regards

Shut up the RuntimeWarnings during fitting

During fitting we frequently explore fits that are outside our numerical precision. This can lead to weird events like dividing by zero when we really mean to divide by a very small number. These explored fits are never taken to be good fits, and so they aren't used for the final results returned to the user. However, during the fitting we still get RuntimeWarnings when these errors occur. Like this:

import powerlaw
from numpy import genfromtxt
data = genfromtxt('blackouts.txt')
fit = powerlaw.Fit(data)
Calculating best minimal value for power law fit
/Users/jeff/powerlaw/powerlaw.py:693: RuntimeWarning: invalid value encountered in true_divide
  (Theoretical_CDF * (1 - Theoretical_CDF))

Is there a good way to stop these warnings from appearing without just making a blanket setting that we don't want to see any warnings?
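One user-side option is numpy's errstate context manager; a sketch that scopes the silencing to the fit itself (though inside the block it silences all invalid/divide floating-point warnings, not just powerlaw's, so it is still fairly blunt):

import numpy as np
import powerlaw

data = np.genfromtxt('blackouts.txt')
with np.errstate(invalid='ignore', divide='ignore'):
    fit = powerlaw.Fit(data)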

stretched_exponential_likelihoods continuous definition

Maybe I am just missing something, but shouldn't the likelihood definition for stretched_exponential_likelihoods on line 2779:
likelihoods = data ** (beta - 1) * beta * Lambda * exp(Lambda * (xmin ** beta - data ** beta))
be
likelihoods = (data*Lambda) ** (beta - 1) * beta * Lambda * exp(Lambda ** beta * (xmin ** beta - data ** beta))?
This second expression is equivalent to the way the likelihood and pdf is defined in the Stretched_Exponential class, and it's what I would expect it to be from the typical definition of a Weibull distribution... Am I just missing something?

How can I fit a truncated powerlaw?

It is not clear to me whether this library supports truncated power law distributions. If it is supported, how can I use it to fit a truncated power law distribution?
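It is supported: powerlaw.Fit fits the truncated power law alongside the other distributions. A sketch, using the attribute names that appear elsewhere in this document (alpha and Lambda):

import powerlaw

fit = powerlaw.Fit(data)  # `data` is your sample
print(fit.truncated_power_law.alpha)   # scaling exponent
print(fit.truncated_power_law.Lambda)  # exponential cutoff rate

# Compare against the pure power law; powerlaw detects that the two
# models are nested ("Assuming nested distributions").
R, p = fit.distribution_compare('power_law', 'truncated_power_law',
                                normalized_ratio=True)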

Allow for weighted data

As I understand it, this package attempts to fit power law histograms to a set of provided values, though I have to admit that the form the provided data is supposed to have is not clear to me from the documentation.

I have data where there is a known bias depending on the measured values. To correct this bias, I would like to provide weights for each value. As far as I can see, that's not currently possible, so I'd like to request adding it as a feature.

I can get something close to what I want by producing a histogram with pyplot, then using it to generate artificial data (e.g., for each bar, generating its mean value a number of times equal to its height) and using that data in powerlaw.Fit. This will create small inaccuracies, however, due to rounding the height and shifting data to the mean value. (See the sketch below.)
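A histogram-free sketch of that workaround, assuming the bias correction can be expressed as integer per-value weights (values here are illustrative):

import numpy as np
import powerlaw

values = np.array([1.0, 2.0, 3.0, 5.0])
weights = np.array([40, 12, 5, 1])     # how many times each value should count

expanded = np.repeat(values, weights)  # each value repeated `weight` times
fit = powerlaw.Fit(expanded)

The same rounding caveat applies whenever the true weights are not integers.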

Pickling Fit objects.

Hi Jeff,

I'm trying to pickle some large Fit objects, but when I try to load them I get a "maximum recursion depth exceeded while calling a Python object" error.
I've tried to increase the maximum recursion depth but I still get the same problem. If I increase it beyond a certain value, the Python kernel just dies when I try to load it.
I've tried to use the dill library, but no success. It seems to be related to recursive attributes, but I'm not sure.

I'm trying to pickle large Fit objects (coming from >160M data points) because they took hours to run the truncated_power_law fit. Any ideas on how to pickle Fit objects or make the truncated_power_law fit faster?
Thank you!
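One workaround sketch: persist the fitted parameters instead of the Fit object, since the expensive part is the fit itself. The attribute names follow those used elsewhere in this document; the file name is arbitrary:

import json
import powerlaw

fit = powerlaw.Fit(data)  # the hours-long computation
params = {
    'xmin': float(fit.xmin),
    'alpha': float(fit.power_law.alpha),
    'tpl_alpha': float(fit.truncated_power_law.alpha),
    'tpl_Lambda': float(fit.truncated_power_law.Lambda),
}
with open('fit_params.json', 'w') as f:
    json.dump(params, f)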

Confusion in use of `searchsorted`

https://github.com/jeffalstott/powerlaw/blob/master/powerlaw.py#L1890
I think that if you kept the original sorted data locations, the use of searchsorted would be useful.
But the code below uses unique_indices to take a subset of the CDF too.
I find this confusing, since it's equivalent to calling arange(n)/n directly,
and obviously the arange way is faster than the searchsorted way, as proved by the test below:

import numpy as np
from numpy import searchsorted, unique

def f1(data, n):
    CDF = searchsorted(data, data, side='left')/n
    unique_data, unique_indices = unique(data, return_index=True)
    data = unique_data
    CDF = CDF[unique_indices]
    return CDF


def f2(data, n):
    unique_data, unique_indices = unique(data, return_index=True)
    return (np.arange(n)/n)[unique_indices]


data = [0,1,1,2,2,2,3,3,3,3]

n = len(data)

%timeit f1(data,n)
The slowest run took 4.35 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 85.3 µs per loop

%timeit f2(data,n)
10000 loops, best of 3: 54 µs per loop

f1(data,n)
Out[47]: array([ 0. ,  0.1,  0.3,  0.6])

f2(data,n)
Out[48]: array([ 0. ,  0.1,  0.3,  0.6])

So I can't understand how the 'clever' claim in the comment holds. Perhaps I'm missing some corner cases?

original data option for the power law plot

Hi,

when I activate the original data option, the plot of my sample changes but not the power law fit. I guess that is an issue, no?

Here is a plot of my sample without the original data option:
testsingle_1
With the original data option:
testsingle_originaldata_1

Here is my code:

ax = fits[1].plot_ccdf()
fits[1].power_law.plot_ccdf(ax=ax, color='r', linestyle='--', linewidth=1, label='Power law fit')  # add power law line
plt.savefig("testsingle_{}.png".format(1))
plt.close()

ax = fits[1].plot_ccdf(original_data=True)
fits[1].power_law.plot_ccdf(ax=ax, color='r', linestyle='--', linewidth=1, label='Power law fit')  # add power law line
plt.savefig("testsingle_originaldata_{}.png".format(1))
plt.close()

lognormal normalization constant C

Hi,

I'm afraid the normalization constant C formula for the lognormal (which is defined in the Lognormal class _pdf_continuous_normalizer function) is not as it is in the Clauset et al. paper, Table 2.1.

The current used formula in the codes is as follows:
C = (erfc((log(self.xmin) - self.mu) / (sqrt(2) * self.sigma)) / sqrt(2/(pi*self.sigma**2)))

based on Table 2, this should be as follows (as there is a power to -1):
C = (sqrt(2/(pi*self.sigma**2)) /erfc((log(self.xmin) - self.mu) / (sqrt(2) * self.sigma)))

Any idea?

Many thanks!

Calculating lognormal fit

After using distribution_compare, I found my upper-tail data better fit a lognormal. Now I want to give a best estimate of the lognormal parameters, and I found there are two different ways of calculating the lognormal fit.
One is using lognormal.mu after using powerlaw.Fit:

fit=powerlaw.Fit(X)
print "power law:alpha=%s, D=%s, sigma=%s, xmin=%s" %(fit.alpha,fit.D,fit.sigma,fit.xmin)
print "lognormal: mu=%s, sigma=%s" %(fit.lognormal.mu,fit.lognormal.sigma)

The returned results are as follows:

power law:alpha=1.58419441281, D=0.0361993084144, sigma=0.0175821856886, xmin=150000.0
lognormal: mu=5.12826972544, sigma=4.10761057227

The other is using powerlaw.distribution_fit directly. I set xmin according to the best xmin calculated by powerlaw.Fit (which is 150000):

cut=X[X>=150000]
fit2=powerlaw.distribution_fit(cut,distribution='lognormal',xmin=150000)
fit3=powerlaw.distribution_fit(X,distribution='lognormal',xmin=150000)
print "lognormal estimation"
print fit2
print "to show that xmin works"
print fit3

The returned result is:

lognormal estimation
(array([ 11.91839062, 2.29111789]), -16764.238127379787)
to show that xmin works
(array([ 11.91839062, 2.29111789]), -16764.238127379787)

I have no idea which one is right.
I suspect it may be due to my misinterpretation of the returned estimates. I tried to look through the code but cannot find a clue. It may be a silly question, but I post it here in case someone else is curious about it. :D

Don't throw out the zeros?

Hi Jeff et al.,

Great package. I'm using it extensively in a power law class I'm taking.

I'm looking at the Freeman centrality distribution of a sparse network where there are a lot of isolates, so the centralities of the isolates are 0. When I try to fit the centralities to a power law distribution using your package, the Fit call throws out all the zeros, but those are legitimate values.

The result is a warning:

Values less than or equal to 0 in data. Throwing out 0 or negative values
Calculating best minimal value for power law fit

...and of course, since the xmin value is calculated based only on non-zero values, I get a much higher alpha, which is not representative of the distribution.

I'm not sure if this is a feature request to include the zeros, or if I have a fundamental misunderstanding of the package and of Clauset's paper.

p.s. Also not sure if this is the appropriate place to bring this up, so sorry in advance.

Joe

p-value for an individual fit implementation

Hi Jeff!
In the Powerlaw: A python package for analysis of heavy-tailed distributions paper you've written:

The goodness of fit for each distribution can be considered individually or by comparison to the fit of other distributions (respectively, using bootstrapping and the Kolmogorov- Smirnov test to generate a p-value for an individual fit vs. using loglikelihood ratios to identify which of two fits is better) [5]. There are several reasons, both practical and philosophical, to focus on the latter, comparative tests.

I understand that currently the computation of p-value for an individual fit is not implemented. Would you have anything against it being added as a feature? Would you anticipate it to be a significant effort?

Best,
Adam

CDF value for the smallest data point equals zero

When the CDF is computed by your library, the value for the smallest data point is always zero, i.e. when the data is [1, 1, 2, 2, 3, 3], the cdf of 1 is 0. From the definition of the CDF, it should be 0.33. Also, all the datasets I've tried lead to

RuntimeWarning: invalid value encountered in true_divide
  (Theoretical_CDF * (1 - Theoretical_CDF))

Am I missing something? I tested the numpy.searchsorted code that generates the CDF separately, and it does not seem to work correctly.

Fitting user-specified PDF, e.g. power spectral density

I really like your software; it makes it easier to judge the hype around power laws in datasets.
However, right now it focuses on fitting full datasets, creating their PDF and CDF on the fly. I'd like to use it in situations where I already have a PDF (defined at several points) - or generally a distribution function of some sort - and fit its shape in some range.
An example is the power spectral density of fluctuations in turbulent plasmas, where there is an ongoing discussion whether they are power laws or exponentials.

I'd be willing to contribute modifications to powerlaw which would make this optional use case possible. But I would greatly appreciate it if you could point out how best to approach this issue.

Regenerate API docs?

The html docs for loglikelihood_ratio (and friends) still read

p : float
The significance of the sign of R. If below a critical values (typically .05) the sign of R is taken to be significant. If below the critical value the sign of R is taken to be due to statistical fluctuations.

which apparently was fixed in the docstrings a while ago.

Change distribution implementations to use Scipy's distribution objects.

Currently all distributions are implemented within powerlaw. It would be great if we could use scipy's distribution objects, which are presumably optimized in some ways. The problem is that those implementations would require significant tweaking to handle x_min, x_max, discrete vs. continuous, etc. Any ideas in this space would be good.

Is it possible to fit a power law function to data that has x and y?

Hello, I am wondering whether this package can be used to fit to 2D data (x and y) or not. Moreover, I think my data seems to follow the below formula:
y = a+b*x^(-c) (a,b,c>0)
Can I use this package to fit the formula? By the way, I was using a log–log graph to determine the coefficients a, b and c. Is it correct mathematically?
Thanks for your help in advance!
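powerlaw fits distributions of one-dimensional samples; it does not fit y = f(x) curves. For a functional form like y = a + b*x**(-c), a generic least-squares fit is more appropriate. A sketch with scipy, using illustrative values:

import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b, c):
    return a + b * x ** (-c)

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
y = np.array([3.1, 1.6, 0.9, 0.6, 0.5])
(a, b, c), cov = curve_fit(model, x, y, p0=[0.5, 2.0, 1.0])

On the log-log question: with a nonzero offset a, the curve is not a straight line on a log-log plot, so reading coefficients off a log-log graph only approximates c in the regime where a is negligible.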

Truncated power law - divide by zero

Hi Jeff,

Just stumbled upon a runtime error by fitting the truncated power law function. Here is the corresponding message:

Assuming nested distributions
/usr/local/lib/python2.7/dist-packages/powerlaw.py:1351: RuntimeWarning: divide by zero encountered in double_scalars
alpha = 1 + len(data)/sum( log( data / (self.xmin) ))
Traceback (most recent call last):
[...]
R, p = fit.distribution_compare('power_law', 'truncated_power_law', normalized_ratio=True)
File "/usr/local/lib/python2.7/dist-packages/powerlaw.py", line 315, in distribution_compare
[...]

This seems to happen when xmin is very high and the corresponding data contains just a few points.

Thanks,
Philipp

setting xmin in Fit doesn't work

I think this should be setting xmin=1 for the fitting procedure. It looks like it's not using the xmin I'm giving it. I've also compared it with my own code for calculation of alpha to see if it's using the given value of xmin or not.

pl=powerlaw.Power_Law()
pl.alpha=2

X=pl.generate_random(1000)

fit=powerlaw.Fit(X, xmn=1.)
print(fit.xmin)

Values below xmin are silently trimmed when computing cdf or pdf

import numpy as np
import powerlaw
data = np.array([1.7, 3.2, 5.4, 7.9, 10., 12.])
results = powerlaw.Fit(data)
print(results.power_law.alpha)
print(results.power_law.xmin)
print(results.power_law.cdf(np.arange(10)))
print(results.power_law.cdf(np.arange(3, 10)))

xmin is ~2.28, and the last two lines give identical results: the entries 0, 1 and 2 of np.arange(10) have been silently trimmed out, when I'd expect them to return 1 (and 0 if computing the pdf). Otherwise, it becomes a bit awkward to e.g. do plt.plot(xs, results.power_law.cdf(xs)) if I am not directly computing xs using xmin.

Problem with powerlaw lognormal estimates

Jeff Alstott encouraged me to post an issue concerning inconsistent results that I have noticed when trying to model lognormal data using the powerlaw python toolbox. These inconsistencies seem to have arisen after recent improvements made by lneisenman. I am hoping he/she may be able to look in and shed some light as to what the issue might be. A simple example is the code used to generate Figure 4 of Jeff's PLOS One article. If one runs this code under powerlaw 1.3.4, one recovers very different values for parameters (R,p) than those reported in the article, namely (0.0087852467208828777, 0.94922437131915649), instead of (0.928, 0.426). In addition, the lognormal CCDF is no longer plotted in the figure limits, basically because it has been set equal to an array of ones. The values for the lognormal distribution parameters mu and sigma do not appear to be correct either.

I have also seen this behaviour (e.g. lognormal CCDF set equal to an array of ones) when modelling some of my own data sets but have yet to discover what factors lead to its expression.

Thanks in advance for any insights anyone might have - MBostock

Implement Gamma distribution

The Gamma distribution had issues with being supported, described here. The gamma distribution requires calculating the incomplete gamma function, and calculating this can be slow. Prohibitively slow. The problem is not in calculating the incomplete gamma function once, but during the fitting process, which uses numerical optimization. Here we calculate candidate PDFs and CDFs many, many times, and the slowness of the incomplete gamma function starts to bog us down. This was particularly disastrous for discrete distributions, which by default use lots of CDF calculations.

This was also the case for the truncated power law, which also relies on incomplete gamma functions. David Bild figured out a simplification, however, that got around this and massively sped up the truncated power law fitting. Perhaps a similar solution could be found for the gamma distribution.

TypeError with fits using mpmath

I'm getting an error in mpmath when I call fit.distribution_compare for gamma and truncated power law distributions.

The error is "TypeError: cannot create mpf from 0.20379211", on line 560 of mpmath.ctx_mp

The error results from this call in powerlaw to mpmath's gammainc:

/Library/Frameworks/Python.framework/Versions/7.2/lib/python2.7/site-packages/powerlaw.pyc in cdf_base_function(self=<powerlaw.Truncated_Power_Law object>, x=array([ 6.19425631, 6.22601461, 6.35583973,...2085, 42.87356567, 49.3860321 ], dtype=float32))
   1295 gammainc = vectorize(gammainc)
   1296
-> 1297 CDF = ((gammainc(1-self.alpha, self.Lambda*x)).astype('float') /
   1298        self.Lambda**(1-self.alpha))
   1299

x ranges from 6 to 50, while self.alpha = 2.45 and self.Lambda = 0.033

This is on Python 2.7 and numpy 1.6.1, Mac OS X 10.6.8.

I can provide the full traceback if needed.

Do you know if this is an issue in powerlaw or something upstream in mpmath?

Thanks,
Eric
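A possible workaround, inferred from the float32 dtype visible in the traceback: mpmath cannot build an mpf from a numpy.float32 scalar, while numpy.float64 subclasses Python's float and converts cleanly. Casting the data before fitting may avoid the error:

import numpy as np
import powerlaw

data64 = np.asarray(data, dtype=np.float64)  # `data` is the original sample
fit = powerlaw.Fit(data64)
R, p = fit.distribution_compare('truncated_power_law', 'gamma')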

Binomial distribution for distribution_compare

My goal is to find the point where scale-free networks become indistinguishable from random (non-scale-free) networks.

I would expect something like the binomial distribution to be implemented for comparison using distribution_compare().
Is there a specific reason it wasn't implemented?

For example I tried the following code to distinguish between an obviously scale-free network and an obviously non-scale-free network (both with similar numbers of nodes/edges):

import networkx as nx
import powerlaw

non_sf_graph = nx.gnp_random_graph(10000, 0.002)
sf_graph = nx.barabasi_albert_graph(10000, 10)
fitpl = powerlaw.Fit(list(sf_graph.degree().values()))
fitnpl = powerlaw.Fit(list(non_sf_graph.degree().values()))

for dist in fitpl.supported_distributions.keys():
    print(dist)
    fitpl.distribution_compare('power_law', dist)
    fitnpl.distribution_compare('power_law', dist)

The output suggested that none of the implemented distributions provided a tool to discern between a preferential attachment model and a gnp random graph:

lognormal
(-0.23698971255249646, 0.089194415705275421)
(-20.320811335334504, 3.9097599268295484e-92)
exponential
(511.41420648854108, 7.3934851812182895e-23)
(24.215231521373582, 3.7251410948652104e-08)
truncated_power_law
(3.3213949937049847e-06, 0.99794356568650555)
(3.1510369047360598e-07, 0.99936659460444144)
stretched_exponential
(16.756797270053454, 1.6505119872120265e-05)
(8.7110005915424153, 8.7224098659112012e-05)
lognormal_positive
(30.428201968820289, 1.7275238929002278e-07)
(6.7992592335974233, 5.4945477823229749e-06)

I am asking as I am no statistics expert and I might not see the significance of all the available distributions. But they seem to fail this basic example. I would be happy to help implement a distribution that successfully fits a random gnp network. Or are there limitations which make this hard/impossible?
