Comments (12)

jeffalstott commented on July 22, 2024

Good idea! It is indeed a bit noodley right now, and could stand to be rethought out. Perhaps you might want to help in determining what the best structure should be?

Given a CDF/CCDF function, the normalizing factor for the PDF can be calculated readily. Just take the CCDF from xmin to infinity and subtract the CCDF from xmax to infinity (if present). Thus, with _cdf_base_function we can calculate all other quantities.
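
A minimal sketch of that calculation, not taken from powerlaw itself. The names `ccdf_exponential` and `pdf_normalizer` are hypothetical, and an exponential distribution with rate `lam` over its natural support [0, infinity) is assumed purely for illustration:

```python
import math


def ccdf_exponential(x, lam):
    """CCDF, Pr[X > x], of an exponential over its natural support [0, inf)."""
    return math.exp(-lam * x)


def pdf_normalizer(ccdf, xmin, xmax=None):
    """Factor that rescales the natural-support PDF to integrate to 1 on [xmin, xmax]."""
    # Mass from xmin to infinity ...
    mass = ccdf(xmin)
    # ... minus the mass from xmax to infinity, if an upper bound is present.
    if xmax is not None:
        mass -= ccdf(xmax)
    return 1.0 / mass


lam = 0.5
C = pdf_normalizer(lambda x: ccdf_exponential(x, lam), xmin=1.0, xmax=10.0)
```

Multiplying the natural-support PDF, `lam * exp(-lam * x)`, by `C` should then give a density that integrates to 1 over [1, 10].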

The Central Question:
Is there good reason not to just define the equation for the _cdf_base_function and calculate all other quantities from it directly?

Historical explanation for how it got to be this way:
When I built these internal functions, I did not always know what the equations were for certain discrete cases, namely CDFs. This is why CDFs for the various distributions don't currently support discrete forms, except the power law (which is the only one necessary for the xmin selection process). So I put in _pdf_*_normalizers to handle those cases where I didn't know the discrete CDF equation. This is mucky and no longer necessary.

from powerlaw.

drbild commented on July 22, 2024

I see. The CDF is only useful for re-normalizing the PDF to account for an xmin or xmax differing from the natural support, right? The normalizing constant for the PDF such that it sums/integrates to 1 over the natural support, e.g., (0, Infinity) or [xmin, Infinity) for the power laws, must still be specified.

I'm working on a refactor that yields four abstract methods on Distribution: _cdf_(continuous,discrete)_base_function(data) and _pdf_(continuous,discrete)_normalizer(data).

The per-distribution implementations may ignore the xmin and/or xmax parameter, returning the CDF or normalizer over the natural support instead. The cdf()/pdf() methods provide the necessary xmin/xmax normalization, per your comment. The only contract is that both the _cdf_*_base_function() and _pdf_*_normalizer() methods must use the /same/ support (e.g., both must use the natural support, or both must consider xmin, or both must consider xmin and xmax), so that the CDF can be used to normalize the PDF to the [xmin, xmax] bound.

(I guess the two pdf normalizers and the _pdf_base_function could be replaced by two _pdf_*_base_functions, but I think the shared _pdf_base_function emphasizes the shared PDF shape.)
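
A rough sketch of how such a layout might look. This is an illustration under assumptions, not the refactor branch or powerlaw's actual API: the class bodies, the helper names `_base_cdf`, `_mass_below_xmin`, and `_bounded_mass`, and the `Exponential` example are all hypothetical, and the hooks here take no `data` argument for brevity.

```python
import math


class Distribution:
    """Hypothetical base class (names are illustrative, not powerlaw's API)."""

    def __init__(self, xmin, xmax=None, discrete=False):
        self.xmin = xmin
        self.xmax = xmax
        self.discrete = discrete

    # Per-distribution hooks. The CDF and the normalizer of a given kind
    # (continuous or discrete) must be defined over the same support.
    def _pdf_base_function(self, x):
        raise NotImplementedError

    def _cdf_continuous_base_function(self, x):
        raise NotImplementedError

    def _cdf_discrete_base_function(self, x):
        raise NotImplementedError

    def _pdf_continuous_normalizer(self):
        raise NotImplementedError

    def _pdf_discrete_normalizer(self):
        raise NotImplementedError

    def _base_cdf(self, x):
        if self.discrete:
            return self._cdf_discrete_base_function(x)
        return self._cdf_continuous_base_function(x)

    def _mass_below_xmin(self):
        # Discrete CDFs are Pr[X <= x], so the mass strictly below xmin is
        # the CDF at xmin - 1 (assumed here to be a valid argument).
        lower = self.xmin - 1 if self.discrete else self.xmin
        return self._base_cdf(lower)

    def _bounded_mass(self):
        # Probability of [xmin, xmax] under the hooks' own support.
        upper = 1.0 if self.xmax is None else self._base_cdf(self.xmax)
        return upper - self._mass_below_xmin()

    def cdf(self, x):
        return (self._base_cdf(x) - self._mass_below_xmin()) / self._bounded_mass()

    def pdf(self, x):
        if self.discrete:
            normalizer = self._pdf_discrete_normalizer()
        else:
            normalizer = self._pdf_continuous_normalizer()
        return normalizer * self._pdf_base_function(x) / self._bounded_mass()


class Exponential(Distribution):
    """Continuous-only example whose hooks use the natural support [0, inf)."""

    def __init__(self, lam, **kwargs):
        super().__init__(**kwargs)
        self.lam = lam

    def _pdf_base_function(self, x):
        return math.exp(-self.lam * x)

    def _cdf_continuous_base_function(self, x):
        return 1.0 - math.exp(-self.lam * x)

    def _pdf_continuous_normalizer(self):
        return self.lam
```

In this sketch the subclass ignores xmin and xmax entirely, and the base class's cdf()/pdf() rescale to the [xmin, xmax] bound using the CDF hook.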

For the central question, one could imagine that in some cases direct evaluation of the analytical expression is more numerically stable or faster. But if that is ever an actual concern, it's easy for a particular Distribution to override cdf() and pdf() to use direct evaluation.

And thanks for the library. It's been very helpful to have a clean, Pythonic library for this analysis.

jeffalstott commented on July 22, 2024

Glad you like the library!

I wrote a long response to this, and then I realized that you probably understood everything I was going to say, just using slightly different language. So I think my only question is this:

Is your refactor going to define _pdf_*_normalizer() using the _cdf_*_base_function()?

If so, then each Distribution just needs to define _pdf_base_function() and _cdf_base_function(), and everything else can come from there. Example from the current code:
    def _pdf_discrete_normalizer(self):
        C = 1.0 - self._cdf_xmin
        if self.xmax:
            C -= 1 - self._cdf_base_function(self.xmax+1)
        C = 1.0/C
        return C

You could tuck the discrete vs. continuous check within _pdf_base_function() and _cdf_base_function(), or maybe there's a good reason to have a discrete and continuous version of each.

So, is this what you were intending, or have I misunderstood?

drbild commented on July 22, 2024

Basically, yes.

Assuming that CDF(x) is defined as Pr[X <= x] and not Pr[X < x], then it should be
    C = 1.0 - self._cdf_base_function(self.xmin - 1)
    C -= 1 - self._cdf_base_function(self.xmax)
    C = 1.0/C
    return C

xmin - 1 may be outside the support (e.g., if xmin == 1), so a slightly different form might be needed.
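
One possible "slightly different form", sketched under assumptions rather than taken from powerlaw: the helper `discrete_normalizer` and its `support_min` parameter are hypothetical, and a geometric distribution on {1, 2, ...} stands in as the example CDF.

```python
def geometric_cdf(k, p):
    """Pr[X <= k] for a geometric distribution on {1, 2, ...}."""
    return 1.0 - (1.0 - p) ** k


def discrete_normalizer(cdf, xmin, xmax=None, support_min=1):
    # Mass strictly below xmin; zero when xmin - 1 falls outside the support,
    # so the CDF is never evaluated there.
    below = cdf(xmin - 1) if xmin - 1 >= support_min else 0.0
    # Mass strictly above xmax; zero when there is no upper bound.
    above = 1.0 - cdf(xmax) if xmax is not None else 0.0
    return 1.0 / (1.0 - below - above)


p = 0.3
C = discrete_normalizer(lambda k: geometric_cdf(k, p), xmin=1, xmax=5)
```

With `xmin == 1` the guard skips the call at 0, and `C` times the geometric PMF should sum to 1 over 1..5.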

(The Clauset paper defines the CCDF as Pr[X >= x], rather than the traditional Pr[X > x], which is annoying because then, in the discrete case, the CCDF is not 1 - CDF. That alternate definition of CCDF has a cleaner analytical form for the power law, but it's still weird.)
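
To make the difference between the two CCDF conventions concrete, here is a small check on a made-up three-point distribution. The PMF values and function names are hypothetical and chosen only for illustration:

```python
from fractions import Fraction

# Toy PMF on {1, 2, 3}; exact fractions avoid floating-point noise.
pmf = {1: Fraction(1, 2), 2: Fraction(1, 3), 3: Fraction(1, 6)}


def cdf(x):
    """Pr[X <= x]."""
    return sum(p for k, p in pmf.items() if k <= x)


def ccdf_exclusive(x):
    """Traditional CCDF, Pr[X > x]."""
    return sum(p for k, p in pmf.items() if k > x)


def ccdf_inclusive(x):
    """Inclusive CCDF, Pr[X >= x]."""
    return sum(p for k, p in pmf.items() if k >= x)


# The traditional form is the complement of the CDF at the same point ...
assert ccdf_exclusive(2) == 1 - cdf(2)
# ... while the inclusive form is the complement of the CDF one step lower.
assert ccdf_inclusive(2) == 1 - cdf(1)
assert ccdf_inclusive(2) != 1 - cdf(2)
```

The two conventions differ by exactly Pr[X = x], which is zero for continuous distributions and nonzero in the discrete case.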

I think maintaining separate discrete and continuous functions is useful so that (a) it's clear that both discrete and continuous forms should be implemented and (b) the code crashes when discrete data is passed to a Distribution not supporting the discrete form, rather than just silently using the continuous form.

jeffalstott commented on July 22, 2024

Excellent. I look forward to your refactor!

I agree with (a). Regarding your (b), the choice of discrete vs. continuous is determined by the discrete keyword in the Fit and Distribution classes, so crashing should occur when the user sets that keyword. I deliberately didn't include a check that assumes all-integer data should be handled as discrete. Were you implying there should be one?

It seemed odd to me that if a user gave a dataset of all integers, powerlaw would fit one way, and if the user then gave the same dataset with 1.5 appended, powerlaw would fit in a radically different way. Is there a Pythonic rationale for doing the automatic detection?

drbild commented on July 22, 2024

I should have been more clear. For (b), I didn't mean automatic detection of discrete data; specifying via the "discrete" parameter seems right. But if the same cdf_base/pdf_normalizer method is used for both discrete and continuous cases, it's up to the implementer of a new Distribution to check self.discrete and do the right thing. If he doesn't and just assumes continuous, users can call the method with self.discrete == true and get bad results. By using separate methods, the program will crash when the missing discrete_base_function or discrete_normalizer method is called.
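
A small sketch of that failure mode, with hypothetical class names that are not part of powerlaw: the base class dispatches on `self.discrete`, and a subclass that only implements the continuous form fails loudly when used as discrete.

```python
import math


class Base:
    """Hypothetical base class that dispatches on the discrete flag."""

    def __init__(self, discrete=False):
        self.discrete = discrete

    def _cdf_continuous_base_function(self, x):
        raise NotImplementedError("continuous CDF not implemented")

    def _cdf_discrete_base_function(self, x):
        raise NotImplementedError("discrete CDF not implemented")

    def cdf(self, x):
        if self.discrete:
            return self._cdf_discrete_base_function(x)
        return self._cdf_continuous_base_function(x)


class ContinuousOnlyExponential(Base):
    """Implements only the continuous form (unit rate, for brevity)."""

    def _cdf_continuous_base_function(self, x):
        return 1.0 - math.exp(-x)


continuous_value = ContinuousOnlyExponential(discrete=False).cdf(2.0)

try:
    ContinuousOnlyExponential(discrete=True).cdf(2)
    crashed = False
except NotImplementedError:
    # The missing discrete method surfaces as an error instead of a
    # silently wrong continuous result.
    crashed = True
```

With a single shared method, the same subclass would have returned the continuous value for discrete data without any warning.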

jeffalstott commented on July 22, 2024

Cool. Looking forward to what you come up with!

jeffalstott commented on July 22, 2024

Hi David,

I just happened to see that you did some more work on this on your fork. Did it end up at a conclusion?

drbild commented on July 22, 2024

Hey Jeff,
I never got around to cleaning my stuff up enough to make a pull request. Let me see if it still looks worth doing and get back to you.

jeffalstott commented on July 22, 2024

Okie doke. A fair amount has changed in powerlaw's plumbing since then, but
I don't think anything directly relevant to what you were doing.

On Mon, Apr 28, 2014 at 8:23 PM, David R. Bild wrote:

> Hey Jeff,
> I never got around to cleaning my stuff up enough to make a pull request.
> Let me see if it still looks worth doing and get back to you.

drbild commented on July 22, 2024

I still haven't had a chance to really dive back into my branch from last year and probably won't anytime soon. I'm not doing data analysis currently (the current version got me through my dissertation!), so I don't have the cycles to spare for this.

My "refactor-normalizers" branch seems to only have updates for some of the distributions, so additional work would be needed before a pull request.

jeffalstott commented on July 22, 2024

Thanks for the heads up. Glad it was useful for you!

On Sat, May 17, 2014 at 8:34 AM, David R. Bild wrote:

> I still haven't had a chance to really dive back into my branch from last
> year and probably won't anytime soon. I'm not doing data analysis currently
> (the current version got me through my dissertation!) so don't have the
> cycles to spare for this.
>
> My "refactor-normalizers" seems to only have updates for some of the
> distributions, so additional work would be needed before a pull request.
