
Comments (1)

bogdansurdu commented on June 7, 2024

I think this is a useful addition to the library, and it is quite straightforward to do for target distributions that are numerical or DateTime, since the first two moments, mean and variance, are computed through clearly defined operations.

However, the problem becomes interesting when considering categorical variables and distributions (binary variables are automatically included, since they are a subset of these), because there doesn't seem to be a consensus in existing implementations, or even in the literature. So far, three methods for calculating a categorical variance and two methods for a categorical mean have been considered (they behave quite differently), and I believe there are even more possibilities.

Variance

1. Multinomial Variance

Every categorical distribution is a multinomial one in a sense; however, since we are usually interested in a single trial when calculating the variance and mean, we set n = 1. We can then calculate the variance by taking the counts of each category in the target, normalizing them so that they can be regarded as probabilities pk, and computing Var(Xk) = pk * (1 - pk).
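For illustration, a minimal sketch of this computation, assuming a pandas Series as input (the helper name is hypothetical, not part of the fairlens API):

```python
import pandas as pd

def multinomial_variance(series: pd.Series) -> pd.Series:
    # Normalized value counts serve as the estimated probabilities p_k.
    p = series.value_counts(normalize=True)
    # Per-category variance of a single multinomial trial: Var(X_k) = p_k * (1 - p_k).
    return p * (1 - p)

s = pd.Series(["a", "a", "b", "c", "c", "c"])
print(multinomial_variance(s))
```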

2. Entropy

Entropy also seems to be a valid choice, as it can be regarded as the amount of randomness a categorical variable presents for a specific group of interest, which should ideally be small. It can be a bit harder to interpret, as the value ranges over [0, ∞) in general (for a fixed number of categories k it is bounded by log k), but since we are interested in comparing across groups, the task might be easier.
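A rough sketch of this option (assuming scipy is available; scipy.stats.entropy accepts unnormalized counts):

```python
import pandas as pd
from scipy.stats import entropy

def categorical_entropy(series: pd.Series) -> float:
    # value_counts() gives raw counts; entropy() normalizes them internally
    # and uses the natural logarithm by default.
    return float(entropy(series.value_counts()))
```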

3. Sum of fraction squares

This method has been used to measure the concentration of categorical distributions: the more diverse a distribution is, the smaller the value it produces, as the fractions associated with each category decrease in magnitude. Therefore, it might make sense to instead return 1 - sum_of_squares, as that directly measures the spread of the variable. This measure is also quite nice because it is bounded in [0, 1].
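A minimal sketch of the 1 - sum_of_squares variant (again a hypothetical helper, not existing fairlens code):

```python
import pandas as pd

def categorical_spread(series: pd.Series) -> float:
    # Normalized counts p_k; concentration is sum(p_k ** 2),
    # so 1 - sum(p_k ** 2) measures spread and lies in [0, 1).
    p = series.value_counts(normalize=True)
    return 1.0 - float((p ** 2).sum())
```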

Mean

1. Multinomial Mean

By the same reasoning as for the multinomial variance, we can use the multinomial expected-value formula (considering only one trial, so n = 1). Thus, we have E(Xk) = pk, where pk is again simply the normalized count of the kth category. This would be used when the mode controlling how categorical series are treated in the moment-generating function is set to multinomial.
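In code this is just the vector of normalized counts (sketch, hypothetical helper name):

```python
import pandas as pd

def multinomial_mean(series: pd.Series) -> pd.Series:
    # With n = 1, E(X_k) = p_k, i.e. the normalized count of each category.
    return series.value_counts(normalize=True)
```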

2. Mode

It is worth noting that the multinomial functions return a series of values, as opposed to a single float. Thus, it makes sense to have a way of returning a kind of "mean" that is a single value; the simplest way to do so is to return the category that occurs most often (when categorical_mode is set to entropy or squares). However, this becomes tricky when two categories appear a close or equal number of times in the distribution, so it might be worth looking into how to handle this (maybe with some kind of numerical encoding?).
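A sketch of one possible tie-handling policy (the helper and the tie-breaking rule are only illustrative assumptions):

```python
import pandas as pd

def categorical_mode(series: pd.Series):
    # Series.mode() returns *all* categories tied for the highest count.
    modes = series.mode()
    if len(modes) > 1:
        # Tie: several categories occur equally often. Picking the first in
        # sorted order is one arbitrary policy; other rules are possible.
        return sorted(modes)[0]
    return modes.iloc[0]
```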

Other Measures

Do you know of any other measures that might be worth looking into for handling this?

