Pickling Fit objects. (powerlaw issue, CLOSED)

jeffalstott commented on July 22, 2024
Pickling Fit objects.

from powerlaw.

Comments (7)

jeffalstott commented on July 22, 2024

I do not know what's causing the maximum recursion error. The Fit object contains all your data*, so it's huge, and that may be the origin of the problem. But you probably don't want to actually store all your data; you already have that stored somewhere! What you probably want are the following values:

fit.xmin
fit.power_law.alpha
fit.power_law.sigma
fit.distribution_compare('power_law', 'exponential')

Those are some of the essentials, and they're only a few numbers. You may also want things like:

fit.distribution_compare('power_law', 'truncated_power_law')
fit.distribution_compare('power_law', 'lognormal')
fit.exponential.Lambda
fit.xmax (if you have one)

and so on. Together these values are very small, so storing them in a pickle or whatnot should be easy.

There are other values that you might also want to store, which are nearly the same size as your data:

fit.Ds
fit.alphas
fit.sigmas

These are the values of D, alpha, and sigma for the power law fit at every possible value of xmin. So they're as long as the number of unique values in your dataset, which is probably huge! You probably don't want to store these, but you might if you have a tricky situation like the one described in the "Multiple Possible Fits" section of the PLoS ONE paper describing powerlaw. If you're not in that situation, don't save these.

So, what describes your use case? What are the features you're actually needing to save? The simplest route is to identify the features you need, put them in a dictionary and pickle that. But maybe you need a less simple route.
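The dictionary route could look roughly like the sketch below. It is not part of powerlaw: `summarize_fit`, `save_summary` and `load_summary` are hypothetical helper names, and the code assumes only that the object passed in exposes `xmin`, `xmax`, `power_law.alpha`, `power_law.sigma` and `distribution_compare` (returning a pair `(R, p)`), as in the values listed above.

```python
import pickle


def summarize_fit(fit):
    """Collect a handful of small values from a powerlaw Fit-like object.

    Hypothetical helper: only the attributes named in this thread are read.
    """
    R, p = fit.distribution_compare('power_law', 'exponential')
    return {
        'xmin': fit.xmin,
        'xmax': fit.xmax,  # may be None if no xmax was set
        'alpha': fit.power_law.alpha,
        'sigma': fit.power_law.sigma,
        'power_law_vs_exponential': (R, p),
    }


def save_summary(fit, path):
    """Pickle the small summary dictionary instead of the whole Fit."""
    with open(path, 'wb') as handle:
        pickle.dump(summarize_fit(fit), handle)


def load_summary(path):
    """Load a summary dictionary written by save_summary."""
    with open(path, 'rb') as handle:
        return pickle.load(handle)
```

With a real fit this would be called as `save_summary(powerlaw.Fit(data), 'fit_summary.pck')`. Because the dictionary holds only a few numbers, the file stays tiny no matter how large the dataset is.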

*It actually contains two copies of it: All of the original data, and then a copy of the data with just the points within xmin and xmax. To better handle very large datasets such as yours, perhaps we should come up with a system that doesn't store the data twice. Just have one copy of the data, and two views on it. Actually, this might be how numpy is handling the data, anyway. I'd have to check.

hsbarbosa commented on July 22, 2024

Thanks, Jeff!

I've tested with a small object as well and got the same issue.

data_fit = pl.Fit(data['min_rank'][:1000])
pickle.dump(data_fit,open('test.pck','wb'))
pickle.load(open('test.pck','rb'))
RuntimeError                              Traceback (most recent call last)
<ipython-input-140-a0aef4fe1530> in <module>()
----> 1 pickle.load(open('test.pck','rb'))
...
   135 
    136     def __getattr__(self, name):
--> 137         if name in self.supported_distributions.keys():
    138             #from string import capwords
    139             #dist = capwords(name, '_')
RuntimeError: maximum recursion depth exceeded while calling a Python object 

It might be related to some circular reference, like the one in fit.power_law.parent_Fit; I don't know.
This pickle issue seems similar to what happens with BeautifulSoup objects.

Anyway, I was trying to store it because it took several hours to run, I was afraid of a Python kernel crash, and I might want to reproduce the results later (or on another machine). I'm gonna store the necessary information to recreate the Fit and Distribution objects!!
Thank you, again!

jeffalstott commented on July 22, 2024

Huh! Thanks for pointing this out; I've replicated the issue. For now I am not going to alter anything, since nobody else has brought up the idea of pickling Fit objects. However, I am going to leave this issue open in case someone wants to try to fix it.

It might be nice to have some standard procedure for grabbing all the "important" properties of a Fit and storing them in some way. That is more of a UX question than anything.

"I'm gonna store the necessary information to recreate the Fit and Distribution objects!!"
On that note: The data points I suggested are not going to allow you to easily recreate a Fit or Distribution object. But if you know even just the optimal xmin, then you can create a new Fit with a manually defined xmin (fit = powerlaw.Fit(data, xmin=my_optimal_xmin_that_I_stored)). That Fit will be very quick to calculate, and from there you can calculate loglikelihood ratios fairly easily. What that Fit will NOT have is things like Ds or alphas, which are desired in some edge cases but almost certainly not useful to you.

rnpgeo commented on July 22, 2024

Hi guys,
Thanks for the useful thread. I've been searching around most of the afternoon for a solution to this. It's good to know it's not just a problem with my code, anyway. I'd be interested to hear of any developments with this. Thanks for the great module!
Cheers

jeffalstott commented on July 22, 2024

Hi @rnpgeo,

No advances so far. If you want to take a stab at it, you're very welcome to try. If you get a solution we can merge your update into powerlaw.

hsbarbosa commented on July 22, 2024

Guys,
I think I've got a solution for this issue and it's much simpler than I thought it would be.
I haven't tested it thoroughly for all PL classes but at least for the Fit objects, it works.

The solution is simply to call the base implementation of the __getattr__ method for magic names in the Fit class, changing it from

line 144

    def __getattr__(self, name):
        if name in self.supported_distributions.keys():
        ...

to

    def __getattr__(self, name):
        if name.startswith('__') and name.endswith('__'):
            return super(Fit, self).__getattr__(name)
        if name in self.supported_distributions.keys():
        ...
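Why a guard on magic names helps can be seen on a toy class. The sketch below is not powerlaw code: `Unguarded`, `Guarded`, `supported` and `survives_roundtrip` are hypothetical names, and it assumes the lookup table is an instance attribute set in `__init__`. When pickle rebuilds an object it creates the instance without calling `__init__` and then looks up `__setstate__` on it; with an empty `__dict__`, a `__getattr__` that reads `self.supported` ends up calling itself. The toy guard raises `AttributeError` directly, since plain `object` defines no `__getattr__` to delegate to.

```python
import pickle


class Unguarded:
    """Toy stand-in for a class whose __getattr__ reads an instance attribute."""

    def __init__(self):
        # Hypothetical lookup table, set only in __init__.
        self.supported = {'power_law': 'a fitted distribution'}

    def __getattr__(self, name):
        # Called only when normal lookup fails. On a freshly unpickled
        # instance __dict__ is still empty, so self.supported is itself
        # missing and this line re-enters __getattr__ without end.
        if name in self.supported:
            return self.supported[name]
        raise AttributeError(name)


class Guarded(Unguarded):
    def __getattr__(self, name):
        # Refuse special names before touching any instance state.
        if name.startswith('__') and name.endswith('__'):
            raise AttributeError(name)
        return super().__getattr__(name)


def survives_roundtrip(obj):
    """Return True if obj can be pickled and loaded again."""
    try:
        pickle.loads(pickle.dumps(obj))
    except RecursionError:
        return False
    return True
```

Here `survives_roundtrip(Unguarded())` hits the recursion limit while `survives_roundtrip(Guarded())` succeeds, and the restored `Guarded` object still resolves `power_law` through `__getattr__`.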

jeffalstott commented on July 22, 2024

Excellent! I can see how what's currently written would create undesired behaviors.

Question: Currently the __getattr__ is in a kind of stupid if/else statement, where the else raises an AttributeError (on line 166). Would your correction be better placed by integrating it with that if/else? For example:

    def __getattr__(self, name):
        if name in self.supported_distributions.keys():
        ...
        else:
            return super(Fit, self).__getattr__(name)

Would that accomplish the goal of fixing the pickling problem while also being more robust and elegant? If you test this out and determine that everything works (including the Distribution objects), then I can either make the edit to the code or you can submit a pull request and I'll merge it in.

Thanks!
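One way to explore the placement question is again a toy class, under the same assumption that the lookup table is an instance attribute set in `__init__` (all names below are hypothetical, not powerlaw's). In that setting the order of the checks matters: if the table lookup comes first, it still runs on the empty, freshly unpickled instance, so handling unknown names only in the `else` branch does not stop the recursion. Reading the table through `self.__dict__` is one alternative that avoids special-casing magic names.

```python
import pickle


class ElseBranch:
    """Unknown names are handled only in the else branch."""

    def __init__(self):
        self.supported = {'power_law': 'a fitted distribution'}

    def __getattr__(self, name):
        # This test is evaluated before the else branch is ever reached,
        # and self.supported is missing while pickle rebuilds the object.
        if name in self.supported:
            return self.supported[name]
        else:
            raise AttributeError(name)


class DictLookup:
    """Reads the table from __dict__, which never goes through __getattr__."""

    def __init__(self):
        self.supported = {'power_law': 'a fitted distribution'}

    def __getattr__(self, name):
        # __dict__ is found by normal lookup, so a missing table simply
        # yields an empty dict instead of another __getattr__ call.
        supported = self.__dict__.get('supported', {})
        if name in supported:
            return supported[name]
        raise AttributeError(name)


def roundtrip(obj):
    """Pickle obj to bytes and load it back."""
    return pickle.loads(pickle.dumps(obj))
```

In this sketch `roundtrip(ElseBranch())` raises `RecursionError`, while `roundtrip(DictLookup())` returns a working copy. Whether the same holds for Fit depends on how `supported_distributions` is defined there, so it would need to be tested against the real class.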
