powerlaw's Issues
Document the contract for the various _cdf and _pdf methods
Could you document the intended behavior for _cdf_base_function, _pdf_base_function, _pdf_continuous_normalizer, and _pdf_discrete_normalizer?
It's not clear which of those should respect xmin and xmax and which should not.
IndexError from power_law_ks_distance()
It's me again. I also kept getting an IndexError from find_xmin. The problem seems to come from the block starting at line 2556:
n = float(len(data))
if n < 2:
    if kuiper:
        return 1, 1, 2
    return 1
if not all(data[i] <= data[i + 1] for i in arange(n - 1)):
    data = sort(data)
if not discrete:
    Actual_CDF = arange(n) / n
    Theoretical_CDF = 1 - (data / xmin) ** (-alpha + 1)
The IndexError arises when i from arange(n - 1) is used as an index: since n is a float, each i is also a float.
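A minimal reproduction of the mechanism (a sketch; assumes a recent numpy, where float indices are rejected outright):

from numpy import arange, array

data = array([1, 2, 3])
n = float(len(data))
for i in arange(n - 1):            # arange of a float yields floats: 0.0, 1.0
    print(data[i] <= data[i + 1])  # IndexError: only integers ... are valid indices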
Since n is only made a float to avoid possible wrong results from "arange(n)/n", it seems a better choice to replace this with:
n = len(data)
# (code code code code)
if not discrete:
    Actual_CDF = arange(n) / float(n)
    Theoretical_CDF = 1 - (data / xmin) ** (-alpha + 1)
stretched_exponential_likelihoods continuous definition
Maybe I am just missing something, but shouldn't the likelihood definition for stretched_exponential_likelihoods on line 2779:
likelihoods = data ** (beta - 1) * beta * Lambda * exp(Lambda * (xmin ** beta - data ** beta))
be
likelihoods = (data*Lambda) ** (beta - 1) * beta * Lambda * exp(Lambda ** beta * (xmin ** beta - data ** beta))
?
This second expression is equivalent to the way the likelihood and pdf are defined in the Stretched_Exponential class, and it's what I would expect from the typical definition of a Weibull distribution... Am I just missing something?
description of the module
Hi,
in the description of the powerlaw module for Python it is written that allowing mu to be negative amounts to taking the product of negative random variables, but as far as I know this is not true: it amounts to summing negative random variables, which are the logs of the variables being multiplied. So it means you are multiplying variables that typically lie between 0 and 1.
thank you very much
lognormal normalization constant C
Hi,
I'm afraid the formula for the lognormal normalization constant C (defined in the Lognormal class's _pdf_continuous_normalizer function) is not as it is in the Clauset et al. paper, Table 2.1.
The formula currently used in the code is as follows:
C = (erfc((log(self.xmin) - self.mu) / (sqrt(2) * self.sigma)) / sqrt(2/(pi*self.sigma**2)))
Based on Table 2.1, this should be as follows (as there is a power of -1):
C = (sqrt(2/(pi*self.sigma**2)) /erfc((log(self.xmin) - self.mu) / (sqrt(2) * self.sigma)))
Any idea?
Many thanks!
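A quick numerical check (a sketch; the parameter values are arbitrary) supports the corrected constant, since the truncated lognormal pdf should integrate to 1 over [xmin, inf):

import numpy as np
from scipy.special import erfc
from scipy.integrate import quad

mu, sigma, xmin = 1.0, 0.5, 2.0
# corrected constant, as in Table 2.1
C = np.sqrt(2 / (np.pi * sigma**2)) / erfc((np.log(xmin) - mu) / (np.sqrt(2) * sigma))
pdf = lambda x: C * np.exp(-(np.log(x) - mu)**2 / (2 * sigma**2)) / x
print(quad(pdf, xmin, np.inf)[0])  # ~1.0 with the corrected C, not with the original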
CDF value for the smallest data point equals to zero
When the CDF is computed by your library, the value for the smallest datapoint is always zero, i.e. when the data is [1, 1, 2, 2, 3, 3], the cdf of 1 is 0. From the definition of the CDF, it should be 0.33. Also, all the datasets I've tried lead to
RuntimeWarning: invalid value encountered in true_divide
(Theoretical_CDF * (1 - Theoretical_CDF))
Am I missing something? I tested the numpy.searchsorted code that generates the CDF separately, and it does not seem to work correctly.
Shut up the RuntimeWarnings during fitting
During fitting we frequently explore fits that are outside our numerical precision. This can lead to weird events like dividing by zero when we really mean to divide by a very small number. These explored fits are never taken to be good fits, and so they aren't used for the final results returned to the user. However, during the fitting we still get RuntimeWarnings when these errors occur. Like this:
import powerlaw
from numpy import genfromtxt

data = genfromtxt('blackouts.txt')
fit = powerlaw.Fit(data)
Calculating best minimal value for power law fit
/Users/jeff/powerlaw/powerlaw.py:693: RuntimeWarning: invalid value encountered in true_divide
(Theoretical_CDF * (1 - Theoretical_CDF))
Is there a good way to stop these warnings from appearing without just making a blanket setting that we don't want to see any warnings?
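One targeted option would be numpy's errstate context manager, which silences floating-point warnings only inside a block. A minimal sketch (the function name is hypothetical, not powerlaw's actual code):

import numpy as np

def ks_asquare_term(CDF_diff, Theoretical_CDF):
    # suppress divide/invalid warnings only for this computation,
    # then drop the non-finite entries that triggered them
    with np.errstate(divide='ignore', invalid='ignore'):
        terms = CDF_diff**2 / (Theoretical_CDF * (1 - Theoretical_CDF))
    return np.sum(terms[np.isfinite(terms)])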
Not working with current numpy version
Truncated power law - divide by zero
Hi Jeff,
Just stumbled upon a runtime error by fitting the truncated power law function. Here is the corresponding message:
Assuming nested distributions
/usr/local/lib/python2.7/dist-packages/powerlaw.py:1351: RuntimeWarning: divide by zero encountered in double_scalars
alpha = 1 + len(data)/sum( log( data / (self.xmin) ))
Traceback (most recent call last):
[...]
R, p = fit.distribution_compare('power_law', 'truncated_power_law', normalized_ratio=True)
File "/usr/local/lib/python2.7/dist-packages/powerlaw.py", line 315, in distribution_compare
[...]
This seems to happen when xmin is very high and the corresponding data contains only a few data points.
Thanks,
Philipp
FloatingPointError when Fit() is invoked
powerlaw.Fit(discrete=True, data=[1]*1000 + [10]*100 + [100]*10 + [1000])
Output:
Calculating best minimal value for power law fit
---------------------------------------------------------------------------
FloatingPointError Traceback (most recent call last)
<ipython-input-320-4afe2d64269c> in <module>()
1 # TODO, floating point error/bug in powerlaw lib
----> 2 powerlaw.Fit(discrete=True, data=[1,10,100,1000,100000])
/usr/local/lib/python2.7/site-packages/powerlaw.pyc in __init__(self, data, discrete, xmin, xmax, fit_method, estimate_discrete, discrete_approximation, sigma_threshold, parameter_range, fit_optimizer, xmin_distance, **kwargs)
127 self.fixed_xmin=False
128 print("Calculating best minimal value for power law fit", file=sys.stderr)
--> 129 self.find_xmin()
130
131 self.data = self.data[self.data>=self.xmin]
/usr/local/lib/python2.7/site-packages/powerlaw.pyc in find_xmin(self, xmin_distance)
226 return getattr(pl, xmin_distance), pl.alpha, pl.sigma, pl.in_range()
227
--> 228 fits = asarray(list(map(fit_function, xmins)))
229 # logging.warning(fits.shape)
230 setattr(self, xmin_distance+'s', fits[:,0])
/usr/local/lib/python2.7/site-packages/powerlaw.pyc in fit_function(xmin)
223 data=self.data,
224 parameter_range=self.parameter_range,
--> 225 parent_Fit=self)
226 return getattr(pl, xmin_distance), pl.alpha, pl.sigma, pl.in_range()
227
/usr/local/lib/python2.7/site-packages/powerlaw.pyc in __init__(self, estimate_discrete, **kwargs)
1103 def __init__(self, estimate_discrete=True, **kwargs):
1104 self.estimate_discrete = estimate_discrete
-> 1105 Distribution.__init__(self, **kwargs)
1106
1107 def parameters(self, params):
/usr/local/lib/python2.7/site-packages/powerlaw.pyc in __init__(self, xmin, xmax, discrete, fit_method, data, parameters, parameter_range, initial_parameters, discrete_approximation, parent_Fit, **kwargs)
600
601 if (data is not None) and not (parameter_range and self.parent_Fit):
--> 602 self.fit(data)
603
604
/usr/local/lib/python2.7/site-packages/powerlaw.pyc in fit(self, data)
1139 if not self.in_range():
1140 Distribution.fit(self, data, suppress_output=True)
-> 1141 self.KS(data)
1142 else:
1143 Distribution.fit(self, data, suppress_output=True)
/usr/local/lib/python2.7/site-packages/powerlaw.pyc in KS(self, data)
690 self.Asquare = sum((
691 (CDF_diff**2) /
--> 692 (Theoretical_CDF * (1 - Theoretical_CDF))
693 )[1:]
694 )
FloatingPointError: invalid value encountered in divide
Off-by-one error when binning in powerlaw.pdf
Great job on this library! Just writing in for what should be a minor bug-fix:
Line 1964 of powerlaw.py appears to have an off-by-one error: xmax represents the last data point (if taken from data), but it will be excluded from the returned histogram when linear_bins is used, because Python's range function includes the start point and excludes the stopping point. E.g., instead of
elif linear_bins: bins = range(int(xmin2), int(xmax2))
it should be
bins = range(int(xmin2), int(xmax2)+1)
An equivalent problem exists just below this part in the code when logarithmic bins are used, though the issue is more subtle. Unlike range, numpy.logspace does include both endpoints. However, taking the log of the endpoints via (line 1966)
log_min_size = log10(xmin2)
log_max_size = log10(xmax2)
produces roundoff error, so the result of numpy.logspace has endpoints that are very close to xmin2 and xmax2, but not exact. In the case that brought this to my attention, the right endpoint is some minuscule roundoff error below the original value of xmax2, so when the floor is taken it once again leaves out the right-most data point from the result binning. To fix this, xmin2 and xmax2 should be enlarged just a bit before the log is taken:
log_min_size = log10(xmin2+0.00001)
log_max_size = log10(xmax2+0.00001)
or similar. Since the roundoff error can be expected to be smaller than the manually added error, this assures that when the floor is taken, the right and left endpoints of the log bins return to xmax and xmin respectively.
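A minimal sketch of the proposed fix in action (the values here are made up): nudging the endpoints before taking the log keeps floor() from rounding the right edge below xmax2.

from numpy import floor, log10, logspace, unique

xmin2, xmax2 = 1.0, 500.0
log_min_size = log10(xmin2 + 0.00001)
log_max_size = log10(xmax2 + 0.00001)
bins = unique(floor(logspace(log_min_size, log_max_size, num=30)))
assert bins[-1] >= xmax2  # the right-most data point now lands inside the last bin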
Maybe this will solve other issues, because I ran across it when I noticed that pdf plots don't match up when they should. I have a dataset where there is an outlier to the far right. Any plot that uses this routine to bin leaves it out and therefore the pdf is normalized over a smaller interval. However, if I plot a distribution directly, the data is not binned but just used as x values for the theoretical PDF - so the rightmost point is not left out, and the normalization interval is different. This results in a PDF plot that is shifted away from the data if both data and theoretical PDF are plotted together (for example: if one wants to plot the full data and a pdf extended to the left of the fitted xmin side-by-side in the same plot, they don't match).
Change distribution implementations to use Scipy's distribution objects.
Currently all distributions are implemented within powerlaw itself. It would be great if we could use scipy's distribution objects, which are presumably optimized in some ways. The problem is that those implementations would require significant tweaking to handle xmin, xmax, discrete vs. continuous, etc. Any ideas in this space would be good.
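One possible direction (a sketch only; the class and its methods are hypothetical, not powerlaw's API): wrap a scipy.stats frozen distribution and renormalize its pdf/cdf to the [xmin, xmax] support powerlaw needs.

import scipy.stats as st

class TruncatedContinuous:
    # Renormalize a scipy.stats frozen distribution to [xmin, xmax];
    # assumes queries are only made for x within that support.
    def __init__(self, frozen, xmin, xmax=None):
        self.frozen, self.xmin = frozen, xmin
        upper = frozen.cdf(xmax) if xmax is not None else 1.0
        self.mass = upper - frozen.cdf(xmin)  # probability mass retained

    def pdf(self, x):
        return self.frozen.pdf(x) / self.mass

    def cdf(self, x):
        return (self.frozen.cdf(x) - self.frozen.cdf(self.xmin)) / self.mass

trunc = TruncatedContinuous(st.lognorm(s=1.0), xmin=2.0)  # e.g. a lognormal restricted to x >= 2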
setting xmin in Fit doesn't work
I think this should be setting xmin=1 for the fitting procedure. It looks like it's not using the xmin I'm giving it. I've also compared it with my own code for calculation of alpha to see if it's using the given value of xmin or not.
import powerlaw

pl = powerlaw.Power_Law()
pl.alpha = 2
X = pl.generate_random(1000)
fit = powerlaw.Fit(X, xmn=1.)
print(fit.xmin)
Exceptions when sampling discrete distribution
I am having a problem obtaining random samples from a discrete power law distribution. Sometimes, I get an exception instead of the desired value.
Code to reproduce the issue:
import powerlaw
xmax=100
exponent=1.1
dist=powerlaw.Power_Law(xmin=1,xmax=xmax,discrete=True,parameters=[exponent],discrete_approximation="xmax")
print(dist.generate_random(n=1,estimate_discrete=False))
The exception:
Traceback (most recent call last):
File "sample_ips.py", line 7, in <module>
print(dist.generate_random(n=1,estimate_discrete=False))
File "/usr/lib/python3.7/site-packages/powerlaw.py", line 1094, in generate_random
x = array([self._double_search_discrete(R) for R in r],
File "/usr/lib/python3.7/site-packages/powerlaw.py", line 1094, in <listcomp>
x = array([self._double_search_discrete(R) for R in r],
File "/usr/lib/python3.7/site-packages/powerlaw.py", line 1101, in _double_search_discrete
while self.ccdf(data=[x2]) >= (1 - r):
File "/usr/lib/python3.7/site-packages/powerlaw.py", line 727, in ccdf
return self.cdf(data=data, survival=survival)
File "/usr/lib/python3.7/site-packages/powerlaw.py", line 782, in cdf
if isnan(min(CDF)):
File "/usr/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 2618, in amin
initial=initial)
File "/usr/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 86, in _wrapreduction
return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: zero-size array to reduction operation minimum which has no identity
Is it safe to ignore the exception and regenerate the value, or is the generator malfunctioning, meaning the data cannot be trusted?
Thank you for your help.
[powerlaw.pdf] a little problem
First of all, powerlaw is really cool. Regarding the data's pdf (i.e. its histogram), I especially appreciate the logarithmic way of deciding the bins. But I have encountered two little problems when using:
powerlaw.pdf(data, xmin=xmin, linear_bins=True)
- When data is not an array type and xmin < 1, the call histogram(data/xmin, bins, density=True) raises a TypeError. Maybe we need one more line, something like data = np.asarray(data).
- When I pass more parameters, like bins=512, those kwargs are not used. So why not pass them on to np.histogram()?
Axis is not correctly passed from Distribution.plot_ccdf() to Distribution.plot_cdf()
At line 969 of powerlaw.py:
ax=None
-> ax=ax
Moreover, a few lines below (L999-1000), the new axis, in case it was not passed as a parameter, is not correctly created. I suggest substituting the following for L999-1002:
fig, ax = plt.subplots()
ax.plot(bins, CDF, **kwargs)
returns datatype error
Here is my code:
newDS=removeTotal[['Firms', 'IndustrySize']][:8].astype(float)
Firms IndustrySize
1 3598185.0 1.0
2 998953.0 2.0
3 608502.0 3.0
4 5205640.0 4.0
5 513179.0 5.0
6 87563.0 6.0
7 5806382.0 7.0
8 19076.0 8.0
import matplotlib.pyplot as plt
plt.plot(newDS['Firms'],newDS['IndustrySize'] )
plt.show()
plot is generated okay.
Now if I run
from powerlaw import plot_pdf, Fit, pdf
x, y = pdf(newDS)
it generates the following error; traceback provided below:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-21-79cc0ba3a245> in <module>()
1 from powerlaw import plot_pdf, Fit, pdf
----> 2 x, y = pdf(newDS)
/usr/local/Cellar/python3/3.6.4_2/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/powerlaw.py in pdf(data, xmin, xmax, linear_bins, **kwargs)
1949
1950
-> 1951 if xmin<1: #To compute the pdf also from the data below x=1, the data, xmax and xmin are rescaled dividing them by xmin.
1952 xmax2=xmax/xmin
1953 xmin2=1
TypeError: '<' not supported between instances of 'str' and 'int'
I also have asked it here
RuntimeWarning: Invalid value encountered in divide (Theoretical_CDF * (1 - Theoretical_CDF))
Hi,
when I run your demo code (https://raw.githubusercontent.com/jeffalstott/powerlaw/master/manuscript/Manuscript_Code.py) I receive an error: "/Library/Python/2.7/site-packages/powerlaw.py:692: RuntimeWarning: Invalid value encountered in divide (Theoretical_CDF * (1 - Theoretical_CDF))".
My Powerlaw version is 1.4.1 and I run Python 2.7.10 on a Mac OSX 10.12.4.
Thanks, Benedikt
Problem with powerlaw lognormal estimates
Jeff Alstott encouraged me to post an issue concerning inconsistent results that I have noticed when trying to model lognormal data using the powerlaw python toolbox. These inconsistencies seem to have arisen after recent improvements made by lneisenman. I am hoping he/she may be able to look in and shed some light as to what the issue might be. A simple example is the code used to generate Figure 4 of Jeff's PLOS One article. If one runs this code under powerlaw 1.3.4, one recovers very different values for parameters (R,p) than those reported in the article, namely (0.0087852467208828777, 0.94922437131915649), instead of (0.928, 0.426). In addition, the lognormal CCDF is no longer plotted in the figure limits, basically because it has been set equal to an array of ones. The values for the lognormal distribution parameters mu and sigma do not appear to be correct either.
I have also seen this behaviour (e.g. lognormal CCDF set equal to an array of ones) when modelling some of my own data sets but have yet to discover what factors lead to its expression.
Thanks in advance for any insights anyone might have - MBostock
meaning of p-value
Hi Jeff
This is not really an issue with the code, but rather some of the comments in the code.
I just noticed that in the function descriptions (e.g., likelihood_ratio, compare_distributions, etc.), the meaning of the p-value is not clear:
The significance of the sign of R. If below a critical values
(typically .05) the sign of R is taken to be significant. If below the
critical value the sign of R is taken to be due to statistical
fluctuations.
Basically it says "below" for both cases...
Negative mu in lognormal fit -- source?
In the description you state:
However, for many data sets, the superior lognormal fit is only possible if one allows the fitted parameter mu to go negative.
Do you have any published source I can quote for this statement? I have indeed found a highly negative mu for a lognormal fit (and no good fit whatsoever for lognormal-positive) so I'm inclined to favour a power-law distribution, but I need to quote something proper for my thesis.
powerlaw: RuntimeWarning:invalid value encountered in divide(theoretical_CDF*(1-Theoretical_CDF))
I have this data
user1 user2 weight_link
"110", "1704", "1.008",
"110", "2139", "1.013",
"110", "4648", "1.02",
"110", "9490", "1.007",
"110", "12643", "1.013",
"110", "18224", "1.024",
"110", "21212", "1.011",
"110", "25759", "1.026",
"110", "27618", "1.022",
"110", "31667", "1.014",
I used the weights as my data: [1.008, 1.013, 1.008, 1.013, 1.02, 1.007, 1.013, 1.024, 1.011, 1.026, 1.022, 1.014]
Then I ran these lines:
data = getlink("test3.txt")#read the weight as float
results = powerlaw.Fit(data)
print results.power_law.alpha
print results.power_law.xmin
R, p = results.distribution_compare('power_law', 'lognormal')
I get these results:
Calculating best minimal value for power law fit
116.192331641
1.007
/usr/local/lib/python2.7/dist-packages/powelaw.py:692:RuntimeWarning:invalid value encountered in divide(theoretical_CDF*(1-Theoretical_CDF))
Script terminated
pdf-distribution for values <1
I noticed that values below 1 are not included in the pdf of the data (powerlaw.pdf(), fit.plot_pdf()). For some reason unknown to me, the histogram's logarithmically spaced bin boundaries are transformed to integers.
Line 1952 in powerlaw.py: bins=unique(floor(logspace(log_min_size, log_max_size, num=number_of_bins)))
In this way a potentially infinite number of bins is eliminated.
I modified the function "pdf" to solve the problem. The data are rescaled by dividing them by xmin before the histogram is computed. At the end, the bin boundaries "edges" are transformed back to the original scale and returned.
Below is the code:
def pdf(data, xmin=None, xmax=None, linear_bins=False, **kwargs):
    """
    Returns the probability density function (normalized histogram) of the
    data.

    Parameters
    ----------
    data : list or array
    xmin : float, optional
        Minimum value of the PDF. If None, uses the smallest value in the data.
    xmax : float, optional
        Maximum value of the PDF. If None, uses the largest value in the data.
    linear_bins : bool, optional
        Whether to use linearly spaced bins, as opposed to logarithmically
        spaced bins (recommended for log-log plots).

    Returns
    -------
    bin_edges : array
        The edges of the bins of the probability density function.
    probabilities : array
        The portion of the data that is within the bin. Length 1 less than
        bin_edges, as it corresponds to the spaces between them.
    """
    from numpy import logspace, histogram, floor, unique
    from math import ceil, log10
    if not xmax:
        xmax = max(data)
    if not xmin:
        xmin = min(data)
    # rescale the data by xmin, to allow a pdf also for data below x=1
    data2 = data/xmin
    xmax = xmax/xmin
    xmin_old = xmin
    xmin = 1
    if linear_bins:
        bins = range(int(xmin), int(xmax))
    else:
        log_min_size = log10(xmin)
        log_max_size = log10(xmax)
        number_of_bins = ceil((log_max_size - log_min_size)*10)
        bins = unique(
            floor(
                logspace(
                    log_min_size, log_max_size, num=number_of_bins)))
    hist, edges = histogram(data2, bins, density=True)
    # transform the bin edges back to the original scale
    xmax = xmax*xmin_old
    xmin = xmin_old
    edges = edges*xmin
    return edges, hist
Pass different minimize options to Scipy's minimize
The numerical fitting methods currently use Scipy's fmin:
from scipy.optimize import fmin
This is just one possible minimizing function. There's a keyword option in Fit waiting to let the user set a different minimizing function, fit_optimizer=None. It would be good to give the user the option of other numerical minimizing functions, such as Scipy's other functions in this space.
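A sketch of what plugging in an alternative optimizer could look like, assuming fit_optimizer were wired up to accept any callable with fmin's calling convention (that wiring is hypothetical, not current behavior):

from scipy.optimize import minimize

def nelder_mead(objective, initial_parameters):
    # same contract as scipy.optimize.fmin: take a function and a starting
    # point, return the best-found parameters
    return minimize(objective, initial_parameters, method='Nelder-Mead').x

# fit = powerlaw.Fit(data, fit_optimizer=nelder_mead)  # hypothetical usage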
Fixed parameter range option currently not working for lognormal distribution
Hi Jeff,
The other day I asked you for the example2 data from this link
http://nbviewer.ipython.org/gist/jeffalstott/3b69b400bbd8461c02c4
because I couldn't get the forced positive mu to work with my data set, and I wanted first to see if I could duplicate the results from your notebook. I can't. I suspect it's a bug.
I attach two files, test1.py and test2.py, that include the same instructions as the "actual data" examples (without and with forced positive mu, respectively) in your notebook. The outputs are the same (both produce the same negative mu), unlike the output in the notebook.
Thanks for your thoughts,
Michael
*********************Output from test1.py
In [46]: run test1.py
Values less than or equal to 0 in data. Throwing out 0 or negative values
Calculating best minimal value for power law fit
/Users/bostock/anaconda/lib/python3.4/site-packages/powerlaw.py:693: RuntimeWarning: invalid value encountered in true_divide
(Theoretical_CDF * (1 - Theoretical_CDF))
Power law's alpha: 3.531867
Exponential's lambda: 0.119016
R: 61.774285, p: 0.000891
/Users/bostock/anaconda/lib/python3.4/site-packages/powerlaw.py:693: RuntimeWarning: divide by zero encountered in true_divide
(Theoretical_CDF * (1 - Theoretical_CDF))
Lognormal's sigma: 15.197246, mu: -579.325239
R: -0.955970, p: 0.151001
**********************Output from test2.py
In [47]: run test2.py
Values less than or equal to 0 in data. Throwing out 0 or negative values
Calculating best minimal value for power law fit
/Users/bostock/anaconda/lib/python3.4/site-packages/powerlaw.py:693: RuntimeWarning: invalid value encountered in true_divide
(Theoretical_CDF * (1 - Theoretical_CDF))
/Users/bostock/anaconda/lib/python3.4/site-packages/powerlaw.py:693: RuntimeWarning: divide by zero encountered in true_divide
(Theoretical_CDF * (1 - Theoretical_CDF))
Lognormal's sigma: 15.197246, mu: -579.325239
R: -0.955970, p: 0.151001
*******************************************************
test1.py
import powerlaw
import numpy as np
data2 = np.genfromtxt('Example2.csv', delimiter=' ')
d = data2
d = d[~np.isnan(d)]
fit = powerlaw.Fit(d)
fit.plot_ccdf(linewidth=4)
fit.power_law.plot_ccdf()
fit.exponential.plot_ccdf()
#fit.lognormal.plot_ccdf()
print("Power law's alpha: %f"%fit.power_law.alpha)
print("Exponential's lambda: %f"%fit.exponential.Lambda)
print("R: %f, p: %f"%fit.distribution_compare('power_law', 'exponential'))
print("Lognormal's sigma: %f, mu: %f"%(fit.lognormal.sigma, fit.lognormal.mu))
print("R: %f, p: %f"%fit.distribution_compare('power_law', 'lognormal'))
*******************************************************
test2.py
import numpy as np
import powerlaw
data2 = np.genfromtxt('Example2.csv', delimiter=' ')
d = data2
d = d[~np.isnan(d)]
fit_positive = powerlaw.Fit(d)
range_dict = {'mu': [0.0, None]}
fit_positive.lognormal.parameter_range(range_dict)
print("Lognormal's sigma: %f, mu: %f"%(fit_positive.lognormal.sigma, fit_positive.lognormal.mu))
print("R: %f, p: %f"%fit_positive.distribution_compare('power_law', 'lognormal'))
fit_positive.plot_pdf(linewidth=4)
fit_positive.power_law.plot_pdf()
fit_positive.lognormal.plot_pdf()
Question about p-value
I am confused about the meaning of the p-value, which seems to differ between the software and the paper.
According to the software documentation, the p-value indicates significance when p-value < 0.05 (the normal usage in statistics).
However, the footnote on page 17 of the paper "Power Law Distributions in Empirical Data" states that they use the p-value as a measure of the hypothesis they are trying to verify.
Hence, high values, not low, are "good".
So, if I use distribution_compare(A, B) and get R > 0 with p-value > 0.1,
is A the better fit to the data?
Thank you.
Setting option "verbose=False" in Fit() leads to error
Hi, I've been trying to use verbose=False to suppress the messages but kept getting "divide by zero" errors. I checked the .py file and found the following, starting at line 96:
if 0 in self.data and verbose:
    print("Values less than or equal to 0 in data. Throwing out 0 or negative values", file=sys.stderr)
    self.data = self.data[self.data>0]
This is obviously wrong; a correct version would be:
if 0 in self.data:
    if verbose:
        print("Values less than or equal to 0 in data. Throwing out 0 or negative values", file=sys.stderr)
    self.data = self.data[self.data>0]
Regenerate API docs?
The html docs for loglikelihood_ratio (and friends) still read
p : float
The significance of the sign of R. If below a critical values (typically .05) the sign of R is taken to be significant. If below the critical value the sign of R is taken to be due to statistical fluctuations.
which apparently was fixed in the docstrings a while ago.
Values below xmin are silently trimmed when computing cdf or pdf
import numpy as np
import powerlaw
data = np.array([1.7, 3.2, 5.4, 7.9, 10., 12.])
results = powerlaw.Fit(data)
print(results.power_law.alpha)
print(results.power_law.xmin)
print(results.power_law.cdf(np.arange(10)))
print(results.power_law.cdf(np.arange(3, 10)))
xmin is ~2.28, and the last two lines give identical results: the entries 0, 1 and 2 of np.arange(10) have been silently trimmed out, when I'd expect them to return 1 (and 0 if computing the pdf). Otherwise, it becomes a bit awkward to e.g. do plt.plot(xs, results.power_law.cdf(xs)) if I am not directly computing xs using xmin.
Standard error on best fit parameters of stretched exponential
Hi there,
Thanks for such an amazing package. It's made my life a hell of a lot easier!
One question: I can't figure out how to get the standard errors on the best-fit parameters for the non-powerlaw distributions (i.e. lognormal, stretched exponential, etc.). Am I being stupid? Or is this not provided? If the latter, it would be great if it could be implemented!
I would happily provide help implementing if you need / want it.
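In the meantime, a bootstrap over the tail is one way to approximate these standard errors. A sketch (assuming data is a numpy array; lognormal shown, but the same idea works for the other distributions):

import numpy as np
import powerlaw

def bootstrap_se(data, n_boot=100):
    fit = powerlaw.Fit(data)
    tail = data[data >= fit.xmin]
    mus, sigmas = [], []
    for _ in range(n_boot):
        resample = np.random.choice(tail, size=len(tail), replace=True)
        refit = powerlaw.Fit(resample, xmin=fit.xmin)
        mus.append(refit.lognormal.mu)
        sigmas.append(refit.lognormal.sigma)
    return np.std(mus), np.std(sigmas)  # bootstrap standard errors of mu, sigma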
Binomial distribution for distribution_compare
My goal is to find the point where scale-free networks become indistinguishable from random (non-scale-free) networks.
I would expect something like the binomial distribution to be implemented for comparison using distribution_compare().
Is there a specific reason it wasn't implemented?
For example I tried the following code to distinguish between an obviously scale-free network and an obviously non-scale-free network (both with similar numbers of nodes/edges):
import networkx as nx
import powerlaw

non_sf_graph = nx.gnp_random_graph(10000, 0.002)
sf_graph = nx.barabasi_albert_graph(10000, 10)
fitpl = powerlaw.Fit(list(sf_graph.degree().values()))
fitnpl = powerlaw.Fit(list(non_sf_graph.degree().values()))
for dist in fitpl.supported_distributions.keys():
    print(dist)
    fitpl.distribution_compare('power_law', dist)
    fitnpl.distribution_compare('power_law', dist)
The output suggested that none of the implemented distributions provides a tool to discern between a preferential attachment model and a gnp random graph:
lognormal
(-0.23698971255249646, 0.089194415705275421)
(-20.320811335334504, 3.9097599268295484e-92)
exponential
(511.41420648854108, 7.3934851812182895e-23)
(24.215231521373582, 3.7251410948652104e-08)
truncated_power_law
(3.3213949937049847e-06, 0.99794356568650555)
(3.1510369047360598e-07, 0.99936659460444144)
stretched_exponential
(16.756797270053454, 1.6505119872120265e-05)
(8.7110005915424153, 8.7224098659112012e-05)
lognormal_positive
(30.428201968820289, 1.7275238929002278e-07)
(6.7992592335974233, 5.4945477823229749e-06)
I am asking as I am no statistics expert and I might not see the significance of all the available distributions. But they seem to fail this basic example. I would be happy to help implement a distribution that successfully fits a random gnp network. Or are there some limitations which make this hard/impossible?
Implement Gamma distribution
The Gamma distribution had issues with being supported, described here. The gamma distribution requires calculating the incomplete gamma function, and calculating this can be slow. Prohibitively slow. The problem is not in calculating the incomplete gamma function once, but during the fitting process, which uses numerical optimization. Here we calculate candidate PDFs and CDFs many, many times, and the slowness of the incomplete gamma function starts to bog us down. This was particularly disastrous for discrete distributions, which by default use lots of CDF calculations.
This was also the case for the truncated power law, which also relies on incomplete gamma functions. David Bild figured out a simplification, however, that got around this and massively sped up the truncated power law fitting. Perhaps a similar solution could be found for the gamma distribution.
installation fails?
Thanks for the great package.
After the recent 0.8.1 update I encountered the following installation issue (same on Mac/Linux) with pip and easy_install
pip install -U powerlaw
Downloading/unpacking powerlaw
Downloading powerlaw-.8.1.tar.gz
Running setup.py egg_info for package powerlaw
Traceback (most recent call last):
File "<string>", line 14, in <module>
File "/home/mekman/build/powerlaw/setup.py", line 2, in <module>
with open('README.rst') as file:
IOError: [Errno 2] No such file or directory: 'README.rst'
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 14, in <module>
File "/home/mekman/build/powerlaw/setup.py", line 2, in <module>
with open('README.rst') as file:
IOError: [Errno 2] No such file or directory: 'README.rst'
----------------------------------------
Command python setup.py egg_info failed with error code 1
Storing complete log in /home/mekman/.pip/pip.log
Is it possible to fit a power law function to data that has x and y?
Hello, I am wondering whether this package can be used to fit 2D data (x and y) or not. Moreover, I think my data seems to follow the formula below:
y = a+b*x^(-c) (a,b,c>0)
Can I use this package to fit this formula? By the way, I was using a log–log graph to determine the coefficients a, b and c. Is that correct mathematically?
Thanks for your help in advance!
Fitting issues with standard configuration
Hello,
first of all thank you for providing us with such a comprehensive package. I'm a first time user and cannot figure out how to produce meaningful results despite reading your paper.
I have the following distribution: Counter({1: 233, 2: 33, 3: 13, 4: 10, 5: 2, 6: 3, 7: 2, 8: 2, 9: 1, 11: 1, 15: 1, 18: 1})
When trying to feed the powerlaw.Fit method with my dataset I get warning "RuntimeWarning: invalid value encountered in true_divide (Theoretical_CDF * (1 - Theoretical_CDF))", which can be ignored as I discovered in older threads.
The issue is poor scaling parameters as an output: alpha = 3.81, xmin = 6. When plotting the hypothesized power-law distribution, you can clearly see the poor fit.
On the other hand when using the code provided by Clauset (http://www.santafe.edu/~aaronc/powerlaws/), a better fit is produced with alpha = 2.59, xmin = 1.
I tried a couple of different configurations, but the output stays the same. I'm wondering what's going wrong?
Best regards
Pickling Fit objects.
Hi Jeff,
I'm trying to pickle some large Fit objects, but when I try to load them I get a "maximum recursion depth exceeded while calling a Python object" error.
I've tried to increase the maximum recursion depth but I still get the same problem. If I increase it beyond a certain value, the python kernel just dies when I try to load it.
I've tried to use the dill library but with no success. It seems to be related to recursive attributes, but I'm not sure.
I'm trying to pickle large Fit objects (coming from >160M data points) because they take hours to run the truncated_power_law fit. Any ideas on how to pickle Fit objects or make the truncated_power_law fit faster?
Thank you!
p-value for an individual fit implementation
Hi Jeff!
In the Powerlaw: A python package for analysis of heavy-tailed distributions paper you've written:
The goodness of fit for each distribution can be considered individually or by comparison to the fit of other distributions (respectively, using bootstrapping and the Kolmogorov-Smirnov test to generate a p-value for an individual fit vs. using loglikelihood ratios to identify which of two fits is better) [5]. There are several reasons, both practical and philosophical, to focus on the latter, comparative tests.
I understand that currently the computation of p-value for an individual fit is not implemented. Would you have anything against it being added as a feature? Would you anticipate it to be a significant effort?
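For reference, the bootstrap procedure could look roughly like the sketch below (the function is hypothetical, not part of powerlaw's API; data is assumed to be a numpy array):

import numpy as np
import powerlaw

def bootstrap_p_value(data, n_sims=100):
    fit = powerlaw.Fit(data)
    n_tail = int((data >= fit.xmin).sum())
    body = data[data < fit.xmin]
    exceed = 0
    for _ in range(n_sims):
        # semi-parametric resample: empirical body below xmin,
        # draws from the fitted power law above it
        n_body = len(data) - n_tail
        synthetic = np.concatenate([
            np.random.choice(body, size=n_body) if n_body else np.empty(0),
            fit.power_law.generate_random(n_tail)])
        sim_fit = powerlaw.Fit(synthetic)
        exceed += sim_fit.power_law.D >= fit.power_law.D
    return exceed / n_sims  # fraction of synthetic sets fit worse than the data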
Best,
Adam
Option to pass parameter_range to multiple different distributions within a Fit object.
Currently, users can use the parameter_range option to dictate that a Distribution object fits a theoretical distribution to empirical data under some constraints. This can be done when making a Distribution object manually, including when they are part of a Fit object, like this:
fit = powerlaw.Fit(data)
fit.distribution_compare('power_law', 'lognormal') #Uses a regular lognormal fit
range_dict = {'mu': [0.0, None]}
fit.lognormal.parameter_range(range_dict) #Implements the restriction on lognormal that mu must be positive
fit.distribution_compare('power_law', 'lognormal') #Uses a lognormal fit which has a positive mu.
This is good for doing parameter manipulations on the fly. However, it would be good to be able to assert some parameter ranges when the Fit is first created. We can use parameter_range when making a Fit object, like so:
parameter_range = {'alpha': [2.3, None], 'sigma': [None, .2]}
fit = powerlaw.Fit(data, parameter_range=parameter_range)
The parameter_range is only used for the power law fit; it is not used for any of the other distributions. This makes sense, as alpha and sigma may not be parameters in the other distributions. However, it would be good to have the ability to pass a parameter_range with options for many different distributions. It's unclear what the One Best Way to do that would be.
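One candidate interface, purely hypothetical, just to make the design question concrete: nest the ranges under distribution names, so each Distribution consumes only its own entry.

# hypothetical nested form; powerlaw does not currently accept this
parameter_ranges = {
    'power_law': {'alpha': [2.3, None], 'sigma': [None, .2]},
    'lognormal': {'mu': [0.0, None]},
}
# fit = powerlaw.Fit(data, parameter_ranges=parameter_ranges)  # hypothetical keyword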
TypeError with fits using mpmath
I'm getting an error in mpmath when I call fit.distribution_compare for gamma and truncated power law distributions.
The error is "TypeError: cannot create mpf from 0.20379211", on line 560 of mpmath.ctx_mp
The error results from this call in powerlaw to mpmath's gammainc:
/Library/Frameworks/Python.framework/Versions/7.2/lib/python2.7/site-packages/powerlaw.pyc in cdf_base_function(self=<powerlaw.Truncated_Power_Law object>, x=array([ 6.19425631, 6.22601461, 6.35583973,...2085, 42.87356567, 49.3860321 ], dtype=float32))
1295 gammainc = vectorize(gammainc)
1296
-> 1297 CDF = ( (gammainc(1-self.alpha,self.Lambda_x)).astype('float') /
1298 self.Lambda*(1-self.alpha)
1299 )
x ranges from 6 to 50, while self.alpha = 2.45 and self.Lambda = 0.033
This is on Python 2.7 and numpy 1.6.1, Mac OS X 10.6.8.
I can provide the full traceback if needed.
Do you know if this is an issue in powerlaw or something upstream in mpmath?
Thanks,
Eric
Calculating lognormal fit
After using powerlaw_compare, I found my upper-tail data is better fit by a lognormal. I then wanted to give a best estimate of the lognormal parameters, and found there are two different ways of calculating the lognormal fit.
One is using lognormal.mu after using powerlaw.Fit:
fit=powerlaw.Fit(X)
print "power law:alpha=%s, D=%s, sigma=%s, xmin=%s" %(fit.alpha,fit.D,fit.sigma,fit.xmin)
print "lognormal: mu=%s, sigma=%s" %(fit.lognormal.mu,fit.lognormal.sigma)
The returning results is as follows:
power law:alpha=1.58419441281, D=0.0361993084144, sigma=0.0175821856886, xmin=150000.0
lognormal: mu=5.12826972544, sigma=4.10761057227
The other is using powerlaw.distribution_fit directly. I set the xmin according to the calculation of the best xmin in powerlaw.Fit (which is 150000):
cut=X[X>=150000]
fit2=powerlaw.distribution_fit(cut,distribution='lognormal',xmin=150000)
fit3=powerlaw.distribution_fit(X,distribution='lognormal',xmin=150000)
print "lognormal estimation"
print fit2
print "to show that xmin works"
print fit3
The returning result is:
lognormal estimation
(array([ 11.91839062, 2.29111789]), -16764.238127379787)
to show that xmin works
(array([ 11.91839062, 2.29111789]), -16764.238127379787)
I have no idea which one is right.
I wonder if it is due to my misinterpretation of the estimated values returned. I tried to look through the code but cannot find a clue. It may be a silly question, but I post it here in case someone else is curious about it :D
Sigma in Truncated Power Law
Hi Jeff,
Is there any way to get the standard error 'sigma' for truncated power-law fitting? I only see one for the power-law function.
Fitting user-specified PDF, e.g. power spectral density
I really like your software, it makes it easier to judge the hype of powerlaws in datasets.
However, right now it focuses on fitting full datasets, creating their PDF and CDF on the fly. I'd like to use it in situations where I already have a PDF (defined at several points) - or generally a distribution function of some sort - and fit its shape in some range.
An example is the power spectral density of fluctuations in turbulent plasmas, where there is an ongoing discussion whether they are powerlaws or exponentials.
I'd be willing to contribute modifications to powerlaw that would make this optional use-case possible. But I would greatly appreciate it if you could point out how best to approach this issue.
Allow for weighted data
As I understand it, this package attempts to fit power law histograms to a set of provided values, though I have to admit that the form the provided data is supposed to have is not clear to me from the documentation.
I have data where there is a known bias depending on the measured values. To correct this bias, I would like to provide weights for each value. As far as I can see, that's not currently possible, so I'd like to request adding it as a feature.
I can get something close to what I want by producing a histogram with pyplot, then using it to generate artificial data (e.g., for each bar, generating its mean value a number of times equal to its height) and using that data in powerlaw.Fit. This will create small inaccuracies due to rounding the height and shifting data to the mean value, however. A sketch of this workaround follows.
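The workaround can be scripted directly from the weights rather than via a pyplot histogram (values and weights below are made up), with the same rounding caveat:

import numpy as np
import powerlaw

values = np.array([1.2, 3.4, 7.8, 15.0])
weights = np.array([10.2, 4.9, 2.1, 1.0])  # hypothetical bias-correction weights
expanded = np.repeat(values, np.round(weights).astype(int))  # rounding loses precision
fit = powerlaw.Fit(expanded)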
original data option for the power law plot
Hi,
when I activate the original data option, the plot of my sample changes but not the power law fit. I guess that is an issue, isn't it?
Here is a plot of my sample without the original data option:
And here with the original data option:
Here is my code:
ax = fits[1].plot_ccdf()
fits[1].power_law.plot_ccdf(ax=ax, color='r', linestyle='--', linewidth=1, label='Power law fit')#add powerlaw line
plt.savefig("testsingle_{}.png".format(1))
plt.close()
ax=fits[1].plot_ccdf(original_data = True)
fits[1].power_law.plot_ccdf(ax=ax, color='r', linestyle='--', linewidth=1, label='Power law fit')#add powerlaw line
plt.savefig("testsingle_originaldata_{}.png".format(1))
plt.close()
Seed Power_Law.generate_random()
Is there a way to seed this function? It'd be nice if this functionality could be added.
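As an interim workaround, generate_random appears to draw from numpy's global random state, so seeding that state should make the draws reproducible. A sketch (whether every code path uses the global state is an assumption):

import numpy as np
import powerlaw

pl = powerlaw.Power_Law(xmin=1., parameters=[2.5])
np.random.seed(42)
a = pl.generate_random(10)
np.random.seed(42)
b = pl.generate_random(10)
assert (a == b).all()  # reproducible under the assumption above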
Confusion in the use of `searchsorted`
https://github.com/jeffalstott/powerlaw/blob/master/powerlaw.py#L1890
I think that if you keep the original sorted data locations, the use of searchsorted is useful. But the code below uses unique_indices to take a subset of the CDF too. I find this confusing, since it's equivalent to calling arange(n)/n directly, and the arange way is obviously faster than the searchsorted way, as proved by the test below:
from numpy import arange, searchsorted, unique

def f1(data, n):
    CDF = searchsorted(data, data, side='left')/n
    unique_data, unique_indices = unique(data, return_index=True)
    data = unique_data
    CDF = CDF[unique_indices]
    return CDF

def f2(data, n):
    unique_data, unique_indices = unique(data, return_index=True)
    return (arange(n)/n)[unique_indices]
data = [0,1,1,2,2,2,3,3,3,3]
n = len(data)
%timeit f1(data,n)
The slowest run took 4.35 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 85.3 µs per loop
%timeit f2(data,n)
10000 loops, best of 3: 54 µs per loop
f1(data,n)
Out[47]: array([ 0. , 0.1, 0.3, 0.6])
f2(data,n)
Out[48]: array([ 0. , 0.1, 0.3, 0.6])
So I can't understand how the 'clever' claim in the comment holds. Perhaps I'm missing some corner cases?...
Don't throw out the zeros?
Hi Jeff et al.,
Great package. I'm using it extensively in a power law class I'm taking.
I'm looking at the Freeman centrality distribution of a sparse network where there are a lot of isolates, hence their centralities (for the isolates) are 0. When I try to fit the centralities to a power law distribution using your package, the Fit call throws out all the zeros, but those are legitimate values.
The result is a warning:
Values less than or equal to 0 in data. Throwing out 0 or negative values
Calculating best minimal value for power law fit
...and of course, since the xmin value is calculated based on only non-zero values, I get a much higher alpha, which is not representative of the distribution.
I'm not sure if this is a feature request to include the zeros, or if I have a fundamental misunderstanding of the package and of Clauset's paper.
p.s. Also not sure if this is the appropriate place to bring this up, so sorry in advance.
Joe
How can I fit a truncated powerlaw?
It is not clear to me if this library supports truncated powerlaw distributions.
If it is supported, how can I use it to fit a truncated powerlaw distribution?
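It is supported: a Fit object exposes a truncated_power_law attribute alongside power_law, as used elsewhere in these issues. A minimal sketch with synthetic data:

import numpy as np
import powerlaw

data = np.random.pareto(1.5, 1000) + 1  # synthetic heavy-tailed sample
fit = powerlaw.Fit(data)
print(fit.truncated_power_law.alpha, fit.truncated_power_law.Lambda)
R, p = fit.distribution_compare('power_law', 'truncated_power_law')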
Add "Debug" flag to automatically hide messages such as "Calculating best minimal value for power law fit"
pip doesn't correctly install dependencies
This is because setup.py uses distutils instead of setuptools. See this link.
If you are happy to switch to setuptools, I can submit a pull request.
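For concreteness, a sketch of the proposed setup.py change (the metadata and dependency list are illustrative, not the project's actual setup):

from setuptools import setup  # instead of: from distutils.core import setup

setup(
    name='powerlaw',
    py_modules=['powerlaw'],
    install_requires=['numpy', 'scipy', 'matplotlib', 'mpmath'],  # illustrative
)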