
Allow for weighted data · powerlaw · closed · 5 comments

jeffalstott commented on August 25, 2024
Allow for weighted data


Comments (5)

jeffalstott commented on August 25, 2024

powerlaw does not use histogram data. It uses raw data: the individual observations that would go into making a histogram. These methods are described in the Alstott et al. 2014 paper and the Clauset et al. 2007 paper, both of which are linked on the project's front page.

For methods which actually DO use histograms, see Virkar and Clauset 2014: http://arxiv.org/abs/1208.3524
Those methods are not implemented in powerlaw.

However, the process that you are using is a decent first pass solution. Virkar and Clauset 2014 will probably tell you what the problems are with it, such as the rounding errors you've already identified.

powerlaw is always open for contributions! If you or someone else would like to implement the histogram-fitting techniques, that would be great. Matlab code for Virkar and Clauset's methods is here:
http://tuvalu.santafe.edu/~aaronc/powerlaws/bins/


drevicko commented on August 25, 2024

I was actually wondering if weighted data could be directly incorporated into your algorithms in some way, and I'm afraid I haven't the time to look deeply at it right now (3 months to go on a PhD thesis! Why am I writing this!? ;).
Hmm.. just for clarification, what I meant by "attempting to fit a power law histogram" was something like this: if you plot the data as a histogram, then plot the fitted power-law pdf, they would line up pretty well (assuming the fit to the data is good). Or am I confusing things?


jeffalstott commented on August 25, 2024

That sounds like the basics of fitting a probability distribution. I suggest reading Alstott et al. 2014. The first figure shows what you're describing (the visualized data points are a histogram).

As for having a bias for certain bins/observation sizes: if the bias is known and just a fixed percentage, then lower the number of observations in the dataset by that percentage. So, if observations of size X are 10% more likely to be sampled than they should be, and you have 110 of them, remove 10 to get 100. Then feed the whole set of observations into powerlaw as one normally would.
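The subsampling correction described above can be sketched with NumPy (the 110-observation count, the size X = 5.0, and the 10% bias are hypothetical values for illustration, not from the thread):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: 110 observations of size 5.0 (known to be
# oversampled by 10%) mixed with unbiased power-law-ish samples.
data = np.concatenate([np.full(110, 5.0), rng.pareto(2.5, 1000) + 1])

# If size-5.0 observations are sampled 1.1x as often as they should be,
# keep only 1/1.1 of them, dropped uniformly at random.
idx = np.flatnonzero(data == 5.0)
n_keep = int(round(len(idx) / 1.1))          # 110 / 1.1 = 100
drop = rng.choice(idx, size=len(idx) - n_keep, replace=False)
corrected = np.delete(data, drop)

# 'corrected' can then be passed to powerlaw.Fit as one normally would.
```

Dropping at random (rather than, say, the first 10) avoids introducing any ordering bias into the corrected sample.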


drevicko commented on August 25, 2024

Yeah, I guess so. Lately I've been working with higher-dimensional problems (topic modelling mostly) where you have to be quite explicit about exactly what you're modelling - kind of forgot that it can be obvious!

The data I have are estimates of link lifetimes in graph data: I have creation and dissolution events and the time between them. The problem is that, for a given observational window (of time), the effective observational window is smaller for longer lifetimes (since you need to see both ends). I'm correcting for that by scaling by the effective (reduced) observational window, essentially promoting the longer lifetimes. The scaling ranges from nearly one (short lifetimes) to quite large (lifetimes close to the observational window).

I could quantise the scaling, rounding to the nearest integer, but that would introduce errors that are large for the short lifetimes. I could ameliorate that by multiplying the scaling by, say, 100, but then an already quite large dataset would become huge. The histogram approach is probably better. Hmm.. another bias introduced by both these approaches is to give unfair statistical weight to long lifetimes: one observation is counted as many data points. I think dealing with these issues warrants a deeper analysis than I am in a position to make right now, so I'll just run with the histogram/synthetic-data approach and describe its shortfalls ;)
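The scale-and-replicate scheme described above can be sketched as follows (the window length, lifetimes, and multiplier k = 100 are hypothetical; the weight formula T / (T - t) is one plausible reading of "effective observational window"):

```python
import numpy as np

T = 100.0                                    # observational window length
lifetimes = np.array([1.0, 10.0, 50.0, 90.0])

# A lifetime t can only be fully observed if it starts in the first
# T - t of the window, so weight each observation by the inverse.
weights = T / (T - lifetimes)                # ~1 for short, large near T

# Quantise: replicate each lifetime round(k * w) times. Larger k reduces
# rounding error but inflates the dataset k-fold - the trade-off noted
# above - and every replica counts as an independent data point,
# overstating the statistical weight of long lifetimes.
k = 100
counts = np.rint(k * weights).astype(int)
synthetic = np.repeat(lifetimes, counts)
```

The resulting `synthetic` array could then be fed to powerlaw.Fit, with the caveat that the inflated sample size distorts any goodness-of-fit statistics.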

Thanks for your attention to this - it's been useful to think deeper on it!


jeffalstott commented on August 25, 2024

Have you just tried using an xmax in the distribution, which powerlaw supports?

