Jeff Alstott encouraged me to post an issue concerning inconsistent results that I hav

Problem with powerlaw lognormal estimates about powerlaw HOT 8 CLOSED

jeffalstott commented on July 22, 2024

Problem with powerlaw lognormal estimates

from powerlaw.

Comments (8)

jeffalstott commented on July 22, 2024

Paging @lneisenman.

@bostockm, if you can put in an example with the exact line of code run and the outputs shown, that would be great. Thanks!

bostockm commented on July 22, 2024

I attach below 1) the screen output, with the resulting (R, p) values for the power law vs. lognormal comparison and the lognormal CCDF fit values (all ones), 2) the code used to generate Figure 4 of the PLOS article, and 3) an EPS version of the figure, which also shows the returned lognormal CCDF.
#############################################################
1.
In [312]: run test.py
Calculating best minimal value for power law fit
/Users/bostock/anaconda/lib/python3.4/site-packages/powerlaw.py:693: RuntimeWarning: invalid value encountered in true_divide
(Theoretical_CDF * (1 - Theoretical_CDF))
/Users/bostock/anaconda/lib/python3.4/site-packages/powerlaw.py:693: RuntimeWarning: divide by zero encountered in true_divide
(Theoretical_CDF * (1 - Theoretical_CDF))
Figure 4 (R,p) 0.00878524672088 0.949224371319
Figure 4 Lognormal CCDF fit values
[ 1. 1. 1. ..., 1. 1. 1.]

In [313]:
##############################################################
2.

In [313]: cat test.py
import numpy as np
import matplotlib.pyplot as pl
import powerlaw

words = np.genfromtxt('words.txt')
data = words
fit = powerlaw.Fit(data, discrete=True)

R,p=fit.distribution_compare('power_law', 'lognormal')
print('Figure 4 (R,p)',R,p)

y=fit.lognormal.ccdf()
print('Figure 4 Lognormal CCDF fit values')
print(y)

fig = fit.plot_ccdf(linewidth=3, label='Empirical Data')
fit.power_law.plot_ccdf(ax=fig, color='r', linestyle='--', label='Power law fit')
fit.lognormal.plot_ccdf(ax=fig, color='g', linestyle='--', label='Lognormal fit')

fig.set_ylabel(u"p(X≥x)")
fig.set_xlabel("Word Frequency")
handles, labels = fig.get_legend_handles_labels()
fig.legend(handles, labels, loc=3)
pl.ylim(1e-4,2)
pl.show()

figname = 'FigLognormal'
pl.savefig(figname+'.eps', bbox_inches='tight')

###################################################
3.

lneisenman commented on July 22, 2024

@bostockm
@jeffalstott

With respect to the R and p values, the fit command should be:
fit = powerlaw.Fit(data, discrete=True, estimate_discrete=False)

and the compare statement should be:
R, p = fit.loglikelihood_ratio('power_law', 'lognormal', normalized_ratio=True)

This will get you the "correct" values for R and p as per Clauset 2009 Table 5. The issue with the CCDF graph persists, however. I haven't looked at that code, so I'm not sure what the issue is. FWIW, replacing the words.txt data with the terrorism.txt data produces an appropriate-looking graph, which suggests the issue is hiding in the CCDF calculation.
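
For reference, the normalized ratio and its p-value can be sketched from the likelihood ratio test described in Clauset 2009. This is only an illustration under assumptions: the function name `normalized_llr` and the synthetic inputs are hypothetical, and the code is not copied from powerlaw. It takes the pointwise log-likelihoods of the data under each candidate distribution; a positive ratio favors the first one.

```python
import numpy as np
from scipy.special import erfc


def normalized_llr(loglik1, loglik2):
    """Return (normalized ratio, p) from two arrays of pointwise log-likelihoods."""
    loglik1 = np.asarray(loglik1, dtype=float)
    loglik2 = np.asarray(loglik2, dtype=float)
    n = loglik1.size
    diff = loglik1 - loglik2
    R = diff.sum()        # raw log-likelihood ratio
    sigma = diff.std()    # standard deviation of the pointwise differences
    # Two-sided significance of R under the null hypothesis that R is zero.
    p = erfc(abs(R) / (np.sqrt(2 * n) * sigma))
    return R / (np.sqrt(n) * sigma), p


# Synthetic pointwise log-likelihoods, for illustration only (not words.txt).
rng = np.random.default_rng(0)
ll_a = rng.normal(-1.0, 0.5, size=1000)
ll_b = ll_a - rng.normal(0.1, 0.5, size=1000)

R_norm, p = normalized_llr(ll_a, ll_b)
print(R_norm, p)
```

Swapping the two arguments flips the sign of the ratio and leaves p unchanged, which is a quick sanity check when comparing against published tables.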

Larry

jeffalstott commented on July 22, 2024

Thanks @lneisenman. To confirm our understanding: Clauset 2009 Table 5 uses the discrete estimation, so to match their results we must also use the discrete estimation. One may also, of course, use the precise, unestimated fitting, and will then get different results (which are presumably better).

jeffalstott commented on July 22, 2024

Actually, it looks like @bostockm is not looking at the reported results from Clauset 2009, but from Alstott 2014.

"If one runs this code under powerlaw 1.3.4, one recovers very different values for parameters (R,p) than those reported in the article, namely (0.0087852467208828777, 0.94922437131915649), instead of (0.928, 0.426). "

But @lneisenman, are you able to replicate the values reported in Alstott 2014? Using powerlaw 1.3.4?

lneisenman commented on July 22, 2024

@bostockm
@jeffalstott

When I revert to prior to any of my changes:
git checkout 6d9db54124fad3b8e503a100b75304a30bbfc8a8

I can replicate the figure and get the same values for R and p as shown in the 2014 paper:
Figure 4 (R, p) 0.928017881163 0.425845694403

However, I am not able to replicate the Clauset result:
Figure 4 normalized (R, p) 0.798926601828 0.424332972912

This still produces a viable figure.

When I switch back to the current version:
git checkout 60cd80d3879677e4b2dfbb116533d6c706252691

I can no longer replicate the paper:
Figure 4 (R, p) 0.00878524672081 0.94922437132

But I can now match Clauset:
Figure 4 normalized (R, p) 0.440451073211 0.659610441062

The plot is wrong in both cases.

Here is a link to the discussion about the changes for lognorm: #16
In short, the older version of the code appeared to have rounding issues when analyzing the words.txt data set. Fixing it to match the Clauset results meant that the code would no longer reproduce the result in the 2014 paper. There may be a similar issue in the ccdf plotting code but I won't be able to look for a couple of days. Hope this helps.

Larry

jeffalstott commented on July 22, 2024

"In short, the older version of the code appeared to have rounding issues when analyzing the words.txt data set. Fixing it to match the Clauset results meant that the code would no longer reproduce the result in the 2014 paper."
This is fine and good; we want the right results, not whatever has been published wherever :-) As long as you're still confident that the differences are due to a rounding error, and the rounding error has been resolved, then we're good.

Where we are not good is the ccdf calculation. I will look into this now.

jeffalstott commented on July 22, 2024

Isolated the issue. It's in the way the lognormal's CDF is defined.

1558     def _cdf_base_function(self, x):
1559         from numpy import sqrt, log
1560         from scipy.special import erf
1561         return  0.5 + ( 0.5 *
1562                 erf((log(x)-self.mu) / (sqrt(2)*self.sigma)))

Even if everything else is working fine, line 1562 can give an array of 1's. In the case of the words data, this is because (log(x)-self.mu) / (sqrt(2)*self.sigma) returns an array like this:

> (log(fit.data)-fit.lognormal.mu) / (sqrt(2)*fit.lognormal.sigma)
array([ 18.55538593,  18.55538593,  18.55538593, ...,  18.7291237 ,
        18.72974499,  18.74985663])

Which, when fed into erf, produces:

> erf((log(fit.data)-fit.lognormal.mu) / (sqrt(2)*fit.lognormal.sigma))
array([ 1.,  1.,  1., ...,  1.,  1.,  1.])

And that's where the problem lies. So it looks like we're in the same situation as #16, in which Larry identified some extreme numbers leading to rounding errors. It's not immediately clear to me how to fix this without using mpmath, and even the higher numerical precision of mpmath may not be sufficient:

> from mpmath import erf
> erf(18.55538593)
mpf('1.0')

That result still shouldn't be 1, but instead a number very close to 1.
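
The saturation looks like a working-precision effect rather than a property of erf itself: mpmath computes with 53 bits by default, the same as a double, so the gap between erf(z) and 1 is rounded away for z near 18.56. A minimal sketch, assuming SciPy and mpmath are installed, shows two ways to keep that gap: the complementary error function `erfc`, or a raised mpmath working precision. The value of `z` is copied from the array above; nothing here is powerlaw code.

```python
import mpmath
from scipy.special import erf, erfc

z = 18.55538593  # first entry of the array printed above

# In double precision erf(z) rounds to exactly 1.0 ...
saturated = erf(z)

# ... but the complement 1 - erf(z) is representable, and erfc computes it directly.
tail = erfc(z)

# Raise mpmath's working precision so that erf(z) is distinguishable from 1.
with mpmath.workdps(200):
    erf_hp = mpmath.erf(z)
    tail_hp = 1 - erf_hp

print(saturated)
print(tail)
print(mpmath.nstr(tail_hp, 10))
```

Working with the complement (the survival function) keeps the information that the CDF form throws away, which is why `erfc` is usually preferred over `1 - erf` in the far tail.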

These problems would not arise if we only calculated the CDF between xmin and (if present) xmax. However, the way powerlaw currently calculates CDFs is that it:

1. calculates the base CDF function
2. cuts off all the probability before xmin
3. normalizes the rest of the probability accordingly

Step 1 is done for each kind of distribution individually, as in lines 1558-1562 shown above. Steps 2 and 3 are done more generally in the fit.cdf() function (line 722), and the relevant snippet starts at line 762:

 762         CDF = self._cdf_base_function(data) - self._cdf_xmin
 763         #if self.xmax:
 764         #    CDF = CDF - (1 - self._cdf_base_function(self.xmax))
 765
 766         norm = 1 - self._cdf_xmin
 767         if self.xmax:
 768             norm = norm - (1 - self._cdf_base_function(self.xmax))
 769
 770         CDF = CDF/norm

This generalized way of calculating the CDF is great, except in an extreme fit like this one, where numerical precision becomes an issue. In this case, we could try to get around the precision issues by calculating the appropriate CDF from the beginning, incorporating xmin directly into the calculation. Values near xmin should then be well within numerical precision. This would only work, however, if the direct calculation used simplified versions of the equations that avoid these rounding errors.
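
One way to sketch such a direct calculation is to work with the survival function S(x) = 1 - CDF(x) and form the ratio S(x)/S(xmin) in log space. This is an illustration under assumptions, not how powerlaw is implemented: the function names `truncated_lognormal_ccdf` and `naive_ccdf` and the parameter values are hypothetical, xmax is ignored, and the distribution is treated as continuous. It relies on `scipy.special.log_ndtr`, the logarithm of the standard normal CDF.

```python
import numpy as np
from scipy.special import erf, log_ndtr


def truncated_lognormal_ccdf(x, mu, sigma, xmin):
    """P(X >= x | X >= xmin) for a lognormal, evaluated in log space."""
    x = np.asarray(x, dtype=float)
    # Standardized log-values; the lognormal survival function is Phi(-z).
    z = (np.log(x) - mu) / sigma
    z_min = (np.log(xmin) - mu) / sigma
    # log S(x) - log S(xmin) stays finite even when S itself is astronomically small.
    return np.exp(log_ndtr(-z) - log_ndtr(-z_min))


def naive_ccdf(x, mu, sigma, xmin):
    """The generic route: base CDF, subtract the mass below xmin, renormalize."""
    def cdf(v):
        return 0.5 + 0.5 * erf((np.log(v) - mu) / (np.sqrt(2) * sigma))
    with np.errstate(divide="ignore", invalid="ignore"):
        return 1 - (cdf(x) - cdf(xmin)) / (1 - cdf(xmin))


# Hypothetical parameters: xmin sits 26 standard deviations above mu,
# similar in spirit to the extreme fit discussed above.
mu, sigma, xmin = -26.0, 1.0, 1.0
x = np.array([1.0, 2.0, 5.0, 10.0])

stable = truncated_lognormal_ccdf(x, mu, sigma, xmin)
naive = naive_ccdf(x, mu, sigma, xmin)
print(stable)
print(naive)
```

With these parameters the generic route divides zero by zero, because both the base CDF and the mass below xmin round to 1, while the log-space ratio still yields a decreasing curve that starts at 1.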

@lneisenman, you previously tackled and resolved a similar issue, which is indeed how we're able to be in this pickle now. Do you have any ideas?
