
Allow for weighted data · powerlaw · closed · 5 comments

jeffalstott commented on August 25, 2024
Allow for weighted data


Comments (5)

jeffalstott commented on August 25, 2024

powerlaw does not use histogram data. It uses raw data: the individual observations that would go into making a histogram. These methods are described in the Alstott et al. 2014 paper and the Clauset et al. 2007 paper, both of which are linked on the project's front page.

For methods which actually DO use histograms, see Virkar and Clauset 2014: http://arxiv.org/abs/1208.3524
Those methods are not implemented in powerlaw.

However, the process that you are using is a decent first pass solution. Virkar and Clauset 2014 will probably tell you what the problems are with it, such as the rounding errors you've already identified.

powerlaw is always open for contributions! If you or someone else would like to implement the histogram-fitting techniques, that would be great. Matlab code for Virkar and Clauset's methods is here:
http://tuvalu.santafe.edu/~aaronc/powerlaws/bins/


drevicko commented on August 25, 2024

I was actually wondering if weighted data could be directly incorporated into your algorithms in some way, and I'm afraid I haven't the time to look deeply at it right now (3 months to go on a PhD thesis! Why am I writing this!? ;).
Hmm.. just for clarification, what I meant by "attempting to fit a power law histogram" was something like this: if you plot the data as a histogram, then plot the fitted power-law pdf, they would line up pretty well (assuming the fit to the data is good). Or am I confusing things?


jeffalstott commented on August 25, 2024

That sounds like the basics of fitting a probability distribution. I suggest reading Alstott et al. 2014. The first figure shows what you're describing (the visualized data points are a histogram).

As for having a bias for certain bins/observation sizes: if the bias is known and just a fixed percentage, then lower the number of observations in the dataset by that percentage. So, if observations of size X are 10% more likely to be sampled than they should be, and you have 110 of them, remove 10 to get 100. Then feed the whole set of observations into powerlaw as one normally would.
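The subsampling correction described above can be sketched with NumPy (the 110-observation count, the size X = 5.0, and the 10% bias are hypothetical values for illustration, not from the thread):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: 110 observations of size 5.0 (known to be
# oversampled by 10%) mixed with unbiased power-law-ish samples.
data = np.concatenate([np.full(110, 5.0), rng.pareto(2.5, 1000) + 1])

# If size-5.0 observations are sampled 1.1x as often as they should be,
# keep only 1/1.1 of them, dropped uniformly at random.
idx = np.flatnonzero(data == 5.0)
n_keep = int(round(len(idx) / 1.1))          # 110 / 1.1 = 100
drop = rng.choice(idx, size=len(idx) - n_keep, replace=False)
corrected = np.delete(data, drop)

# 'corrected' can then be passed to powerlaw.Fit as one normally would.
```

Dropping at random (rather than, say, the first 10) avoids introducing any ordering bias into the corrected sample.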


drevicko commented on August 25, 2024

Yeah, I guess so. Lately I've been working with higher-dimensional problems (topic modelling mostly) where you have to be quite explicit about exactly what you're modelling - kind of forgot that it can be obvious!

The data I have are estimates of link lifetimes in graph data: I have creation and dissolution events and the time between them. The problem is that, for a given observational window (of time), the effective observational window is smaller for longer lifetimes (since you need to see both ends). I'm correcting for that by scaling by the effective (reduced) observational window, essentially promoting the longer lifetimes. The scaling ranges from nearly one (short lifetimes) to quite large (lifetimes close to the observational window).

I could quantise the scaling, rounding to the nearest integer, but that would introduce errors that are large for the short lifetimes. I could ameliorate that by multiplying the scaling by, say, 100, but then an already quite large dataset would become huge. The histogram approach is probably better. Hmm.. another bias introduced by both these approaches is to give unfair statistical weight to long lifetimes: one observation is counted as many data points. I think dealing with these issues warrants a deeper analysis than I am in a position to make right now, so I'll just run with the histogram/synthetic-data approach and describe its shortfalls ;)
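The scale-and-replicate scheme described above can be sketched as follows (the window length, lifetimes, and multiplier k = 100 are hypothetical; the weight formula T / (T - t) is one plausible reading of "effective observational window"):

```python
import numpy as np

T = 100.0                                    # observational window length
lifetimes = np.array([1.0, 10.0, 50.0, 90.0])

# A lifetime t can only be fully observed if it starts in the first
# T - t of the window, so weight each observation by the inverse.
weights = T / (T - lifetimes)                # ~1 for short, large near T

# Quantise: replicate each lifetime round(k * w) times. Larger k reduces
# rounding error but inflates the dataset k-fold - the trade-off noted
# above - and every replica counts as an independent data point,
# overstating the statistical weight of long lifetimes.
k = 100
counts = np.rint(k * weights).astype(int)
synthetic = np.repeat(lifetimes, counts)
```

The resulting `synthetic` array could then be fed to powerlaw.Fit, with the caveat that the inflated sample size distorts any goodness-of-fit statistics.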

Thanks for your attention to this - it's been useful to think deeper on it!


jeffalstott commented on August 25, 2024

Have you just tried using an xmax in the distribution, which powerlaw supports?

