Comments (8)
Paging @lneisenman.
@bostockm, if you can put in an example with the exact line of code run and the outputs shown, that would be great. Thanks!
from powerlaw.
I attach below 1) the screen output with the resulting (R, p) data for the power law vs. lognormal comparison and the lognormal CCDF fit values (all ones), 2) the code used to generate Figure 4 of the PLOS article, and 3) an eps version of the figure that also shows the returned lognormal CCDF.
#############################################################
1.
In [312]: run test.py
Calculating best minimal value for power law fit
/Users/bostock/anaconda/lib/python3.4/site-packages/powerlaw.py:693: RuntimeWarning: invalid value encountered in true_divide
(Theoretical_CDF * (1 - Theoretical_CDF))
/Users/bostock/anaconda/lib/python3.4/site-packages/powerlaw.py:693: RuntimeWarning: divide by zero encountered in true_divide
(Theoretical_CDF * (1 - Theoretical_CDF))
Figure 4 (R,p) 0.00878524672088 0.949224371319
Figure 4 Lognormal CCDF fit values
[ 1. 1. 1. ..., 1. 1. 1.]
In [313]:
##############################################################
2.
In [313]: cat test.py
import numpy as np
import matplotlib.pyplot as pl
import powerlaw
words = np.genfromtxt('words.txt')
data = words
fit = powerlaw.Fit(data, discrete=True)
R,p=fit.distribution_compare('power_law', 'lognormal')
print('Figure 4 (R,p)',R,p)
y=fit.lognormal.ccdf()
print('Figure 4 Lognormal CCDF fit values')
print(y)
fig = fit.plot_ccdf(linewidth=3, label='Empirical Data')
fit.power_law.plot_ccdf(ax=fig, color='r', linestyle='--', label='Power law fit')
fit.lognormal.plot_ccdf(ax=fig, color='g', linestyle='--', label='Lognormal fit')
fig.set_ylabel(u"p(X≥x)")
fig.set_xlabel("Word Frequency")
handles, labels = fig.get_legend_handles_labels()
fig.legend(handles, labels, loc=3)
pl.ylim(1e-4,2)
pl.show()
figname = 'FigLognormal'
pl.savefig(figname+'.eps', bbox_inches='tight')
###################################################
3.
With respect to the R and p values, the fit command should be:
fit = powerlaw.Fit(data, discrete=True, estimate_discrete=False)
and the compare statement should be:
R, p = fit.loglikelihood_ratio('power_law', 'lognormal', normalized_ratio=True)
This will get you the "correct" values for R and p as per Clauset 2009 Table 5. The issue with the ccdf graph persists. I haven't looked at that code, so I'm not sure what the issue is. FWIW, replacing the words.txt data with the terrorism.txt data produces an appropriate-looking graph, suggesting the issue is hiding in the ccdf calculation.
Larry
Thanks @lneisenman. To confirm our understanding: Clauset 2009 Table 5 uses the discrete estimation, and so to match their results we must also use the discrete estimation. One may also, of course, use the precise, unestimated fitting, and then will get different results (which are presumably better).
Actually, it looks like @bostockm is not looking at the reported results from Clauset 2009, but from Alstott 2014.
"If one runs this code under powerlaw 1.3.4, one recovers very different values for parameters (R,p) than those reported in the article, namely (0.0087852467208828777, 0.94922437131915649), instead of (0.928, 0.426). "
But @lneisenman, are you able to replicate the values reported in Alstott 2014? Using powerlaw 1.3.4?
When I revert to prior to any of my changes:
git checkout 6d9db54124fad3b8e503a100b75304a30bbfc8a8
I can replicate the figure and get the same values for R and p as shown in the 2014 paper:
Figure 4 (R, p) 0.928017881163 0.425845694403
However, I am not able to replicate the Clauset result:
Figure 4 normalized (R, p) 0.798926601828 0.424332972912
This still produces a viable figure.
When I switch back to the current version:
git checkout 60cd80d3879677e4b2dfbb116533d6c706252691
I can no longer replicate the paper:
Figure 4 (R, p) 0.00878524672081 0.94922437132
But I can now match Clauset:
Figure 4 normalized (R, p) 0.440451073211 0.659610441062
The plot is wrong in both cases.
Here is a link to the discussion about the changes for lognorm: #16
In short, the older version of the code appeared to have rounding issues when analyzing the words.txt data set. Fixing it to match the Clauset results meant that the code would no longer reproduce the result in the 2014 paper. There may be a similar issue in the ccdf plotting code but I won't be able to look for a couple of days. Hope this helps.
Larry
"In short, the older version of the code appeared to have rounding issues when analyzing the words.txt data set. Fixing it to match the Clauset results meant that the code would no longer reproduce the result in the 2014 paper."
This is fine and good; we want the right results, not whatever has been published wherever :-) As long as you're still confident that the differences are due to a rounding error, and the rounding error has been resolved, then we're good.
Where we are not good is the ccdf calculation. I will look into this now.
Isolated the issue. It's in the way the lognormal's CDF is defined.
1558    def _cdf_base_function(self, x):
1559        from numpy import sqrt, log
1560        from scipy.special import erf
1561        return 0.5 + ( 0.5 *
1562                erf((log(x)-self.mu) / (sqrt(2)*self.sigma)))
Even if everything else is working fine, line 1562 can give an array of 1's. In the case of the words data, this is because (log(x)-self.mu) / (sqrt(2)*self.sigma) returns an array like this:
> (log(fit.data)-fit.lognormal.mu) / (sqrt(2)*fit.lognormal.sigma)
array([ 18.55538593, 18.55538593, 18.55538593, ..., 18.7291237 ,
18.72974499, 18.74985663])
Which, when fed into erf, produces:
> erf((log(fit.data)-fit.lognormal.mu) / (sqrt(2)*fit.lognormal.sigma))
array([ 1., 1., 1., ..., 1., 1., 1.])
And that's where the problem lies. So it looks like we're in the same situation as #16, in which Larry identified some extreme numbers leading to rounding errors. It's not immediately clear to me how to fix this, however, without using mpmath. Even the higher numerical precision of mpmath may not be sufficient, however:
> from mpmath import erf
> erf(18.55538593)
mpf('1.0')
That result still shouldn't be 1, but instead a number very close to 1.
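A minimal sketch of the saturation, using only Python's standard `math` module instead of the scipy and mpmath calls quoted above (that swap is an assumption, made to keep the example self-contained). It also shows that the complementary error function `erfc` keeps the tail that `1 - erf(z)` loses. Whether powerlaw should adopt this is not settled in this thread.

```python
import math

z = 18.55538593  # first value of the array shown above

# erf(z) differs from 1 by far less than the double-precision spacing
# near 1 (about 1.1e-16), so it rounds to exactly 1.0.
saturated = math.erf(z)
naive_tail = 1.0 - saturated  # all tail information is lost here

# erfc(z) evaluates 1 - erf(z) directly, without the subtraction,
# so the tiny tail probability stays representable as a float.
direct_tail = math.erfc(z)

print(saturated, naive_tail, direct_tail)
```

The point of the comparison is that the information is lost in the `0.5 + 0.5 * erf(...)` form itself, before any xmin cut-off or normalization is applied.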
These problems would not arise if we only calculated the CDF between xmin and (if present) xmax. However, the way powerlaw currently calculates CDFs is that it:
- calculates the base CDF function
- cuts off all the probability before xmin
- normalizes the rest of the probability accordingly
Step 1 is done for each kind of distribution individually, as in lines 1558-1562 shown above. Steps 2 and 3 are done more generally in the fit.cdf() function (line 722), and the relevant snippet starts at line 762:
762        CDF = self._cdf_base_function(data) - self._cdf_xmin
763        #if self.xmax:
764        #    CDF = CDF - (1 - self._cdf_base_function(self.xmax))
765
766        norm = 1 - self._cdf_xmin
767        if self.xmax:
768            norm = norm - (1 - self._cdf_base_function(self.xmax))
769
770        CDF = CDF/norm
This generalized way of calculating the CDF is great, except when we're in a crazy fit like this where numerical precision is an issue. In this case, we could try to get around the precision issues by just calculating the appropriate CDF from the beginning, incorporating an xmin directly into the calculation. Numbers closer to the xmin should be more sane numbers well within numerical precision; this would only work if the direct calculation used simplified versions of the equations that prevented these rounding errors.
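One possible shape for such a direct calculation, offered as a rough sketch and not as powerlaw's code: write the truncated CCDF entirely in terms of the survival function S(x) = 0.5 * erfc(z), so that P(X > x | xmin <= X <= xmax) = (S(x) - S(xmax)) / (S(xmin) - S(xmax)). The function names and the parameter values below are hypothetical, the continuous case is assumed, and only the standard `math` module is used.

```python
import math

def lognormal_sf(x, mu, sigma):
    """Survival function P(X > x) of a lognormal, via erfc (no 1 - CDF step)."""
    z = (math.log(x) - mu) / (math.sqrt(2) * sigma)
    return 0.5 * math.erfc(z)

def truncated_ccdf(x, mu, sigma, xmin, xmax=None):
    """P(X > x | xmin <= X <= xmax), written only in survival functions."""
    s_x = lognormal_sf(x, mu, sigma)
    s_min = lognormal_sf(xmin, mu, sigma)
    s_max = lognormal_sf(xmax, mu, sigma) if xmax is not None else 0.0
    return (s_x - s_max) / (s_min - s_max)

# Hypothetical parameters chosen so that z is about 19 at xmin,
# similar in size to the values in the array shown earlier.
mu, sigma = 0.0, 0.1
xmin = math.exp(2.7)
xs = [xmin, 1.01 * xmin, 1.02 * xmin]
ccdf = [truncated_ccdf(x, mu, sigma, xmin) for x in xs]

# The base CDF in its 0.5 + 0.5 * erf(z) form saturates at xmin,
# which would make the generic normalization 1 - CDF(xmin) equal to zero.
z_min = (math.log(xmin) - mu) / (math.sqrt(2) * sigma)
naive_cdf_at_xmin = 0.5 + 0.5 * math.erf(z_min)

print(ccdf, naive_cdf_at_xmin)
```

With these parameters the ratio of survival functions stays well inside double precision even though each survival value on its own is tiny, which is the behaviour the paragraph above is hoping for.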
@lneisenman, you previously tackled and resolved a similar issue, which is indeed how we're able to be in this pickle now. Do you have any ideas?