Hi Jeff, I'm trying to pickle some large Fit objects but when I try


Pickling Fit objects. about powerlaw HOT 7 CLOSED

jeffalstott commented on July 22, 2024

Pickling Fit objects.

from powerlaw.

Comments (7)

jeffalstott commented on July 22, 2024

I do not know what's causing the maximum recursion error. However, the Fit object contains all your data*, so it's huge. Maybe that's the origin of the problem. However, you probably don't want to actually store all your data; you already have that stored somewhere! What you probably want are the following values:

fit.xmin
fit.power_law.alpha
fit.power_law.sigma
fit.distribution_compare('power_law', 'exponential')

Those are some of the essentials, and they're only a few numbers. You may also want things like:

fit.distribution_compare('power_law', 'truncated_power_law')
fit.distribution_compare('power_law', 'lognormal')
fit.exponential.Lambda
fit.xmax (if you have one)

and so on. All of these values will be very small together, so storing them in a pickle or whatnot should be easy.

There are other values that you also might want to store, which are nearly the same size as your data:

fit.Ds
fit.alphas
fit.sigmas

These are the values of D, alpha, and sigma for the power law fit at every possible value of xmin. So they're as long as the number of unique values in your dataset, which is probably huge! You probably don't want to store these, but you might if you have a tricky situation like the one described in the "Multiple Possible Fits" section of the PLoS ONE paper describing powerlaw. If you're not in that situation, don't save these.

So, what describes your use case? What are the features you're actually needing to save? The simplest route is to identify the features you need, put them in a dictionary and pickle that. But maybe you need a less simple route.
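The "dictionary of essentials" route could look roughly like the sketch below. The helper names (summarize_fit, save_summary, load_summary) are made up for illustration and are not part of powerlaw; the only library calls assumed are the attributes listed above and distribution_compare, which returns a loglikelihood ratio and a p-value.

```python
import pickle


def summarize_fit(fit, alternatives=("exponential", "lognormal", "truncated_power_law")):
    """Collect the small, scalar results of a fit into a plain dict.

    `fit` is expected to look like a powerlaw.Fit: it needs xmin,
    power_law.alpha, power_law.sigma and distribution_compare().
    """
    summary = {
        "xmin": fit.xmin,
        # Assumption: xmax is simply absent or None when no upper bound was given.
        "xmax": getattr(fit, "xmax", None),
        "alpha": fit.power_law.alpha,
        "sigma": fit.power_law.sigma,
        "comparisons": {},
    }
    for name in alternatives:
        # distribution_compare returns (loglikelihood ratio R, p-value).
        R, p = fit.distribution_compare("power_law", name)
        summary["comparisons"][name] = {"R": R, "p": p}
    return summary


def save_summary(summary, path):
    """Pickle the summary dict, closing the file handle afterwards."""
    with open(path, "wb") as handle:
        pickle.dump(summary, handle)


def load_summary(path):
    """Load a summary dict written by save_summary."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Because the dict holds only a handful of numbers, it pickles without ever touching the Fit object's __getattr__, so it sidesteps the recursion problem entirely.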

*It actually contains two copies of it: All of the original data, and then a copy of the data with just the points within xmin and xmax. To better handle very large datasets such as yours, perhaps we should come up with a system that doesn't store the data twice. Just have one copy of the data, and two views on it. Actually, this might be how numpy is handling the data, anyway. I'd have to check.

from powerlaw.

hsbarbosa commented on July 22, 2024

Thanks, Jeff!

I've tested with a small object as well and got the same issue.

import pickle
import powerlaw as pl

data_fit = pl.Fit(data['min_rank'][:1000])
pickle.dump(data_fit, open('test.pck', 'wb'))
pickle.load(open('test.pck', 'rb'))

RuntimeError                              Traceback (most recent call last)
<ipython-input-140-a0aef4fe1530> in <module>()
----> 1 pickle.load(open('test.pck','rb'))
...
   135 
    136     def __getattr__(self, name):
--> 137         if name in self.supported_distributions.keys():
    138             #from string import capwords
    139             #dist = capwords(name, '_')
RuntimeError: maximum recursion depth exceeded while calling a Python object

It might be related to some circular reference, like the one in fit.power_law.parent_Fit, I don't know.
This pickle issue seems to be similar to what happens with BeautifulSoup objects.

Anyway, I was trying to store it because it took several hours to run, I was afraid of a Python kernel crash, and I might want to reproduce the results later (or on another machine). I'm gonna store the necessary information to recreate the Fit and Distribution objects!!
Thank you, again!

from powerlaw.

jeffalstott commented on July 22, 2024

Huh! Thanks for pointing this out; I've replicated the issue. For now I am not going to alter anything, since nobody else has brought up the idea of pickling Fit objects. However, I am going to leave this issue open in case someone wants to try to fix it.

It might be nice to have some standard procedure for grabbing all the "important" properties of a Fit and storing them in some way. That is more of a UX question than anything.

"I'm gonna store the necessary information to recreate the Fit and Distribution objects!!"
On that note: The values I suggested are not going to allow you to easily recreate a Fit or Distribution object. But if you know even just the optimal xmin, then you can create a new Fit with a manually defined xmin (fit = powerlaw.Fit(data, xmin=my_optimal_xmin_that_I_stored)). That Fit will be very quick to calculate, and from there you can calculate loglikelihood ratios fairly easily. What that Fit will NOT have is things like Ds or alphas, which are desired in some edge cases but almost certainly not useful to you.
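As a rough sketch of that refit step: the helper name refit_with_stored_xmin and its fit_class parameter are hypothetical, added only so the function can be tried with a stand-in class. The one library call it relies on is powerlaw.Fit(data, xmin=...), as described above.

```python
def refit_with_stored_xmin(data, stored_xmin, fit_class=None):
    """Rebuild a Fit from the original data and a previously saved xmin.

    Passing xmin explicitly means the expensive search over every
    candidate xmin is not repeated.
    """
    if fit_class is None:
        # Imported lazily so this helper can be exercised without powerlaw installed.
        import powerlaw
        fit_class = powerlaw.Fit
    return fit_class(data, xmin=stored_xmin)
```

With the real library this would be called as refit_with_stored_xmin(data, saved["xmin"]), where saved is whatever small record of results was stored earlier.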

from powerlaw.

rnpgeo commented on July 22, 2024

Hi guys,
Thanks for the useful thread. I've been searching around most of the afternoon for a solution to this. It's good to know it's not just a problem with my code anyway. I'd be interested to hear of any developments with this. Thanks for the great module!
Cheers

from powerlaw.

jeffalstott commented on July 22, 2024

Hi @rnpgeo,

No advances so far. If you want to take a stab at it, you're very welcome to try. If you get a solution we can merge your update into powerlaw.

from powerlaw.

hsbarbosa commented on July 22, 2024

Guys,
I think I've got a solution for this issue and it's much simpler than I thought it would be.
I haven't tested it thoroughly for all PL classes but at least for the Fit objects, it works.

The solution is simply to call the base implementation of the __getattr__ method for magic (double-underscore) names in the Fit class, changing it (at line 144) from:

    def __getattr__(self, name):
        if name in self.supported_distributions.keys():
        ...

to:

    def __getattr__(self, name):
        if name.startswith('__') and name.endswith('__'):
            return super(Fit, self).__getattr__(name)
        if name in self.supported_distributions.keys():
        ...
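A toy reproduction may help show why a guard on magic names makes a difference. The classes below are stand-ins, not powerlaw code, and the explanation is my reading of the traceback rather than something confirmed in this thread: when unpickling, the object is created without running __init__, and pickle then probes the empty instance for __setstate__. That probe lands in __getattr__, which reads self.supported_distributions, which is also missing at that point, so __getattr__ calls itself until the recursion limit is hit. This sketch raises AttributeError directly for magic names instead of delegating to super().

```python
import pickle


class BrokenFit:
    """Toy stand-in for Fit: __getattr__ reads an instance attribute unconditionally."""

    def __init__(self):
        self.supported_distributions = {"power_law": "PowerLaw"}

    def __getattr__(self, name):
        # During unpickling __dict__ is still empty, so this lookup of
        # self.supported_distributions re-enters __getattr__ again and again.
        if name in self.supported_distributions:
            return self.supported_distributions[name]
        raise AttributeError(name)


class GuardedFit(BrokenFit):
    """Same toy class, but magic names are rejected before any instance access."""

    def __getattr__(self, name):
        if name.startswith("__") and name.endswith("__"):
            raise AttributeError(name)
        return super().__getattr__(name)


def roundtrip(obj):
    """Pickle and immediately unpickle an object."""
    return pickle.loads(pickle.dumps(obj))


if __name__ == "__main__":
    try:
        roundtrip(BrokenFit())
    except RecursionError as err:
        print("BrokenFit failed to unpickle:", type(err).__name__)

    restored = roundtrip(GuardedFit())
    print("GuardedFit restored:", restored.supported_distributions)
```

One consequence of this reading is that the guard has to run before anything touches self.supported_distributions; a check placed after that lookup would never be reached during unpickling.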

from powerlaw.

jeffalstott commented on July 22, 2024

Excellent! I can see how what's currently written would create undesired behaviors.

Question: Currently the __getattr__ is in a kind of stupid if/else statement, where the else raises an AttributeError (on line 166). Would your correction be better placed by integrating it with that if/else? For example:

    def __getattr__(self, name):
        if name in self.supported_distributions.keys():
        ...
        else:
            return super(Fit, self).__getattr__(name)

Would that accomplish the goal of fixing the pickling problem while also being more robust and elegant? If you test this out and determine that everything works (including the Distribution objects), then I can either make the edit to the code or you can submit a pull request and I'll merge it in.

Thanks!

from powerlaw.
