
Comments (1)

bogdansurdu commented on June 7, 2024

I think this is a useful addition to the library, and it is quite straightforward to do for target distributions that are numerical or DateTime, since the first two moments, mean and variance, are computed through clearly defined operations.

However, the problem becomes interesting when considering categorical variables and distributions (binary variables are automatically included, since they are a subset of these), because there doesn't seem to be a consensus in existing implementations, or even in the literature. So far, three methods for calculating a categorical variance and two methods for a categorical mean have been considered (they behave quite differently), and I believe there are even more possibilities.

Variance

1. Multinomial Variance

Every categorical distribution is a multinomial one in a sense; however, since we are usually interested in a single trial when calculating the variance and mean, we set n = 1. We can then calculate the variance by taking the counts of each category in the target, normalizing them so that they can be regarded as probabilities pk, and computing Var(Xk) = pk * (1 - pk).
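For illustration, a minimal sketch of this computation, assuming a pandas Series as input (the helper name is hypothetical, not part of the fairlens API):

```python
import pandas as pd

def multinomial_variance(series: pd.Series) -> pd.Series:
    # Normalized value counts serve as the estimated probabilities p_k.
    p = series.value_counts(normalize=True)
    # Per-category variance of a single multinomial trial: Var(X_k) = p_k * (1 - p_k).
    return p * (1 - p)

s = pd.Series(["a", "a", "b", "c", "c", "c"])
print(multinomial_variance(s))
```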

2. Entropy

Entropy also seems to be a valid choice, as it can be regarded as the amount of randomness a categorical variable presents for a specific group of interest, which should ideally be small. It can be a bit harder to interpret, as the value ranges over [0, ∞) in general (for a fixed number of categories k it is bounded by log k), but since we are interested in comparing across groups, the task might be easier.
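A rough sketch of this option (assuming scipy is available; scipy.stats.entropy accepts unnormalized counts):

```python
import pandas as pd
from scipy.stats import entropy

def categorical_entropy(series: pd.Series) -> float:
    # value_counts() gives raw counts; entropy() normalizes them internally
    # and uses the natural logarithm by default.
    return float(entropy(series.value_counts()))
```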

3. Sum of fraction squares

This method has been used to measure the concentration of categorical distributions: the more diverse a distribution is, the smaller the value it produces, as the fractions associated with each category decrease in magnitude. Therefore, it might make sense to instead return 1 - sum_of_squares, as that directly measures the spread of the variable. This measure is also quite nice because it is bounded in [0, 1].
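A minimal sketch of the 1 - sum_of_squares variant (again a hypothetical helper, not existing fairlens code):

```python
import pandas as pd

def categorical_spread(series: pd.Series) -> float:
    # Normalized counts p_k; concentration is sum(p_k ** 2),
    # so 1 - sum(p_k ** 2) measures spread and lies in [0, 1).
    p = series.value_counts(normalize=True)
    return 1.0 - float((p ** 2).sum())
```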

Mean

1. Multinomial Mean

By the same reasoning as for the multinomial variance, we can use the multinomial expected-value formula (considering only one trial, so n = 1). Thus, we have E(Xk) = pk, where pk is again simply the normalized count of the kth category. This would be used when the mode controlling how categorical series are treated in the moment-generating function is set to multinomial.
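In code this is just the vector of normalized counts (sketch, hypothetical helper name):

```python
import pandas as pd

def multinomial_mean(series: pd.Series) -> pd.Series:
    # With n = 1, E(X_k) = p_k, i.e. the normalized count of each category.
    return series.value_counts(normalize=True)
```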

2. Mode

It is worth noting that the multinomial functions return a series of values, as opposed to a single float. Thus, it makes sense to have a way of returning a kind of "mean" that is a single value; the simplest way to do so is to return the category that occurs most often (when categorical_mode is set to entropy or squares). However, this becomes tricky when two categories appear a close or equal number of times in the distribution, so it might be worth looking into how to handle this (maybe with some kind of numerical encoding?).
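A sketch of one possible tie-handling policy (the helper and the tie-breaking rule are only illustrative assumptions):

```python
import pandas as pd

def categorical_mode(series: pd.Series):
    # Series.mode() returns *all* categories tied for the highest count.
    modes = series.mode()
    if len(modes) > 1:
        # Tie: several categories occur equally often. Picking the first in
        # sorted order is one arbitrary policy; other rules are possible.
        return sorted(modes)[0]
    return modes.iloc[0]
```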

Other Measures

Do you know of any other measures that might be worth looking into for handling this?

