Comments (8)

jeffalstott commented on July 22, 2024

Paging @lneisenman.

@bostockm, if you can put in an example with the exact line of code run and the outputs shown, that would be great. Thanks!

from powerlaw.

bostockm commented on July 22, 2024

I attach below 1) the screen output with the resulting (R,p) data for the powerlaw - lognormal comparison and the lognormal CCDF fit values (all ones), 2) the code used to generate Figure 4 of the PLOS article, and 3) an EPS version of the figure, which also shows the returned lognormal CCDF.
#############################################################
1.
In [312]: run test.py
Calculating best minimal value for power law fit
/Users/bostock/anaconda/lib/python3.4/site-packages/powerlaw.py:693: RuntimeWarning: invalid value encountered in true_divide
(Theoretical_CDF * (1 - Theoretical_CDF))
/Users/bostock/anaconda/lib/python3.4/site-packages/powerlaw.py:693: RuntimeWarning: divide by zero encountered in true_divide
(Theoretical_CDF * (1 - Theoretical_CDF))
Figure 4 (R,p) 0.00878524672088 0.949224371319
Figure 4 Lognormal CCDF fit values
[ 1. 1. 1. ..., 1. 1. 1.]

In [313]:
##############################################################
2.

In [313]: cat test.py
import numpy as np
import matplotlib.pyplot as pl
import powerlaw

words = np.genfromtxt('words.txt')
data = words
fit = powerlaw.Fit(data, discrete=True)

R,p=fit.distribution_compare('power_law', 'lognormal')
print('Figure 4 (R,p)',R,p)

y=fit.lognormal.ccdf()
print('Figure 4 Lognormal CCDF fit values')
print(y)

fig = fit.plot_ccdf(linewidth=3, label='Empirical Data')
fit.power_law.plot_ccdf(ax=fig, color='r', linestyle='--', label='Power law fit')
fit.lognormal.plot_ccdf(ax=fig, color='g', linestyle='--', label='Lognormal fit')

fig.set_ylabel(u"p(X≥x)")
fig.set_xlabel("Word Frequency")
handles, labels = fig.get_legend_handles_labels()
fig.legend(handles, labels, loc=3)
pl.ylim(1e-4,2)
pl.show()

figname = 'FigLognormal'
pl.savefig(figname+'.eps', bbox_inches='tight')

###################################################
3.
[Attached figure: figlognormal]

lneisenman commented on July 22, 2024

@bostockm
@jeffalstott

With respect to the R and p values, the fit command should be:
fit = powerlaw.Fit(data, discrete=True, estimate_discrete=False)

and the compare statement should be:
R, p = fit.loglikelihood_ratio('power_law', 'lognormal', normalized_ratio=True)

This will get you the "correct" values for R and p as per Clauset 2009 Table 5. The issue with the ccdf graph persists. I haven't looked at that code, so I'm not sure what the issue is. FWIW, replacing the words.txt data with the terrorism.txt data produces an appropriate-looking graph, suggesting the issue is hiding in the ccdf calculation.

Larry

jeffalstott commented on July 22, 2024

Thanks @lneisenman. To confirm our understanding: Clauset 2009 Table 5 uses the discrete estimation, and so to match their results we must also use the discrete estimation. One may also, of course, use the precise, unestimated fitting, and then will get different results (which are presumably better).

jeffalstott commented on July 22, 2024

Actually, it looks like @bostockm is not looking at the reported results from Clauset 2009, but from Alstott 2014.

"If one runs this code under powerlaw 1.3.4, one recovers very different values for parameters (R,p) than those reported in the article, namely (0.0087852467208828777, 0.94922437131915649), instead of (0.928, 0.426). "

But @lneisenman, are you able to replicate the values reported in Alstott 2014? Using powerlaw 1.3.4?

lneisenman commented on July 22, 2024

@bostockm
@jeffalstott

When I revert to prior to any of my changes:
git checkout 6d9db54124fad3b8e503a100b75304a30bbfc8a8

I can replicate the figure and get the same values for R and p as shown in the 2014 paper:
Figure 4 (R, p) 0.928017881163 0.425845694403

However, I am not able to replicate the Clauset result:
Figure 4 normalized (R, p) 0.798926601828 0.424332972912

This still produces a viable figure.

When I switch back to the current version:
git checkout 60cd80d3879677e4b2dfbb116533d6c706252691

I can no longer replicate the paper:
Figure 4 (R, p) 0.00878524672081 0.94922437132

But I can now match Clauset:
Figure 4 normalized (R, p) 0.440451073211 0.659610441062

The plot is wrong in both cases.

Here is a link to the discussion about the changes for lognorm: #16
In short, the older version of the code appeared to have rounding issues when analyzing the words.txt data set. Fixing it to match the Clauset results meant that the code would no longer reproduce the result in the 2014 paper. There may be a similar issue in the ccdf plotting code but I won't be able to look for a couple of days. Hope this helps.

Larry

jeffalstott commented on July 22, 2024

"In short, the older version of the code appeared to have rounding issues when analyzing the words.txt data set. Fixing it to match the Clauset results meant that the code would no longer reproduce the result in the 2014 paper."
This is fine and good; we want the right results, not whatever has been published wherever :-) As long as you're still confident that the differences are due to a rounding error, and the rounding error has been resolved, then we're good.

Where we are not good is the ccdf calculation. I will look into this now.

jeffalstott commented on July 22, 2024

Isolated the issue. It's in the way the lognormal's CDF is defined.

1558     def _cdf_base_function(self, x):
1559         from numpy import sqrt, log
1560         from scipy.special import erf
1561         return  0.5 + ( 0.5 *
1562                 erf((log(x)-self.mu) / (sqrt(2)*self.sigma)))

Even if everything else is working fine, line 1562 can give an array of 1's. In the case of the words data, this is because (log(x)-self.mu) / (sqrt(2)*self.sigma) returns an array like this:

> (log(fit.data)-fit.lognormal.mu) / (sqrt(2)*fit.lognormal.sigma)
array([ 18.55538593,  18.55538593,  18.55538593, ...,  18.7291237 ,
        18.72974499,  18.74985663])

which, when fed into erf, produces:

> erf((log(fit.data)-fit.lognormal.mu) / (sqrt(2)*fit.lognormal.sigma))
array([ 1.,  1.,  1., ...,  1.,  1.,  1.])

And that's where the problem lies. So it looks like we're in the same situation as #16, in which Larry identified some extreme numbers leading to rounding errors. It's not immediately clear to me how to fix this without using mpmath, and even the higher numerical precision of mpmath may not be sufficient:

> from mpmath import erf
> erf(18.55538593)
mpf('1.0')

That result still shouldn't be 1, but instead a number very close to 1.
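
A minimal sketch of why the value collapses to 1, assuming only NumPy and SciPy (none of this is powerlaw code). The true value of erf(18.555...) differs from 1 by far less than double-precision machine epsilon, so the nearest representable double is exactly 1.0. mpmath also works at 15 significant digits by default (`mp.dps = 15`), which is roughly double precision, so it rounds the same way unless its precision is raised. The complement, however, can be computed directly with `scipy.special.erfc`, which avoids the subtraction from 1. The input `z` is simply the first value of the array shown above.

```python
import numpy as np
from scipy.special import erf, erfc

z = 18.55538593  # first value of the array shown above

# erf(z) is closer to 1 than machine epsilon (~2.2e-16) can resolve,
# so the result is exactly 1.0 and the tail is lost on subtraction.
print(erf(z) == 1.0)
print(1.0 - erf(z))

# erfc(z) evaluates 1 - erf(z) directly, so the tiny tail survives.
tail = erfc(z)
print(tail)

# For the lognormal formula above, 1 - CDF = 0.5 * erfc(z).
lognormal_sf = 0.5 * tail
print(lognormal_sf)
```

Whether swapping `erf` for `erfc` is enough inside powerlaw depends on how the result is later combined with the xmin cutoff; this only shows that the tail probability itself is representable.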

These problems would not arise if we only calculated the CDF between xmin and (if present) xmax. However, the way powerlaw currently calculates CDFs is that it:

  1. calculates the base CDF function
  2. cuts off all the probability before xmin
  3. normalizes the rest of the probability accordingly

Step 1 is done for each kind of distribution individually, as in lines 1558-1562 shown above. Steps 2 and 3 are done more generally in the fit.cdf() function (line 722), and the relevant snippet starts at line 762:

 762         CDF = self._cdf_base_function(data) - self._cdf_xmin
 763         #if self.xmax:
 764         #    CDF = CDF - (1 - self._cdf_base_function(self.xmax))
 765
 766         norm = 1 - self._cdf_xmin
 767         if self.xmax:
 768             norm = norm - (1 - self._cdf_base_function(self.xmax))
 769
 770         CDF = CDF/norm
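
The three steps above can be sketched outside of powerlaw as follows. This is an illustration under stated assumptions, not the library's code: the function names and the parameter values (`mu`, `sigma`, `xmin`) are hypothetical, xmax is ignored, and the numerator and normalization are returned separately so the failure is visible without dividing by zero.

```python
import numpy as np
from scipy.special import erf

def lognormal_cdf_base(x, mu, sigma):
    """Untruncated lognormal CDF, same formula as _cdf_base_function."""
    return 0.5 + 0.5 * erf((np.log(x) - mu) / (np.sqrt(2) * sigma))

def truncated_cdf_parts(x, xmin, mu, sigma):
    """Return (numerator, norm) of the CDF restricted to x >= xmin."""
    base = lognormal_cdf_base(x, mu, sigma)       # step 1: base CDF
    cdf_xmin = lognormal_cdf_base(xmin, mu, sigma)
    numerator = base - cdf_xmin                   # step 2: drop mass below xmin
    norm = 1.0 - cdf_xmin                         # step 3: remaining mass
    return numerator, norm

x = np.array([1.0, 10.0, 100.0])

# Hypothetical well-behaved parameters: the ratio rises from 0 toward 1.
ok_num, ok_norm = truncated_cdf_parts(x, xmin=1.0, mu=0.0, sigma=1.0)
print(ok_num / ok_norm)

# Hypothetical extreme parameters: erf saturates at 1.0 everywhere, so the
# numerator and the normalization both round to exactly 0.
num, norm = truncated_cdf_parts(x, xmin=1.0, mu=-50.0, sigma=2.0)
print(num, norm)
```

With the extreme parameters, the final division would be 0/0, so the normalized CDF carries no information about the shape of the tail.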

This generalized way of calculating the CDF is great, except when we're in a crazy fit like this where numerical precision is an issue. In this case, we could try to get around the precision issues by just calculating the appropriate CDF from the beginning, incorporating an xmin directly into the calculation. Numbers closer to the xmin should be more sane numbers well within numerical precision; this would only work if the direct calculation used simplified versions of the equations that prevented these rounding errors.
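
One possible shape for such a direct calculation, offered as a sketch rather than a proposed patch: compute the CCDF conditional on x >= xmin as a ratio of survival functions, and take that ratio in log space. It relies on `scipy.special.log_ndtr` (the log of the standard normal CDF). The function name and the parameter values are hypothetical, and only the continuous case is covered; the discrete fit used in this thread would need its own treatment.

```python
import numpy as np
from scipy.special import log_ndtr

def lognormal_ccdf_above_xmin(x, xmin, mu, sigma):
    """P(X >= x | X >= xmin) for a continuous lognormal, via log space."""
    # Lognormal survival function: SF(x) = Phi(-(ln x - mu) / sigma),
    # where Phi is the standard normal CDF. log_ndtr gives log(Phi(.))
    # without first rounding Phi to 0 or 1.
    log_sf_x = log_ndtr(-(np.log(x) - mu) / sigma)
    log_sf_xmin = log_ndtr(-(np.log(xmin) - mu) / sigma)
    # The ratio SF(x) / SF(xmin) becomes a difference of logs.
    return np.exp(log_sf_x - log_sf_xmin)

x = np.array([1.0, 10.0, 100.0])

# Hypothetical extreme parameters, far out in the tail of the lognormal.
ccdf = lognormal_ccdf_above_xmin(x, xmin=1.0, mu=-50.0, sigma=2.0)
print(ccdf)

# Hypothetical mild parameters for comparison.
mild = lognormal_ccdf_above_xmin(x, xmin=1.0, mu=0.0, sigma=1.0)
print(mild)
```

The design choice here is to never form the untruncated CDF at all: only differences of log survival values appear, so the result stays meaningful as long as the ratio itself is representable.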

@lneisenman, you previously tackled and resolved a similar issue, which is indeed how we're able to be in this pickle now. Do you have any ideas?
