I think this is a useful addition to the library, and it is quite straightforward for target distributions that are numerical or DateTime, since the mean and variance are calculated through clearly defined operations.
However, the problem becomes interesting when considering categorical variables and distributions (binary variables are automatically included, since they are a subset of these), because there doesn't seem to be a consensus in existing implementations or even in the literature. So far, three methods for calculating a categorical variance and two methods for a categorical mean have been considered (and they behave quite differently); I believe there are even more possibilities.
Variance
1. Multinomial Variance
Every categorical distribution is a special case of a multinomial distribution, and since we are usually interested in a single trial when calculating the variance and mean, we set n = 1. Thus, we can calculate the variance by getting the counts of each category from the target and normalizing them so that they can be regarded as probabilities. Then we simply have: Var(X_k) = p_k * (1 - p_k).
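As a minimal sketch (the function name `multinomial_variance` is hypothetical, not an existing fairlens function), this could look like:

```python
import pandas as pd

def multinomial_variance(series: pd.Series) -> pd.Series:
    """Per-category variance of a categorical series, treated as a
    multinomial distribution with n = 1 trial: Var(X_k) = p_k * (1 - p_k)."""
    # Normalized counts act as the category probabilities p_k.
    p = series.value_counts(normalize=True)
    return p * (1 - p)

colors = pd.Series(["red", "red", "blue", "green"])
print(multinomial_variance(colors))
# red has p = 0.5, so Var = 0.25; blue and green have p = 0.25, so Var = 0.1875
```

Note that this returns one value per category rather than a single float, which is relevant for the discussion of the mean below.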
2. Entropy
Entropy also seems to be a valid choice, as it measures the amount of randomness that a categorical variable presents for a specific group of interest, which should ideally be small. That said, it can be a bit harder to interpret: for k categories the result ranges over [0, log k], so it grows without bound as the number of categories increases. But since we are interested in comparing across groups, the task is easier.
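A sketch of this option (again, the function name is hypothetical):

```python
import numpy as np
import pandas as pd

def categorical_entropy(series: pd.Series, base: float = np.e) -> float:
    """Shannon entropy of the empirical category distribution.
    0 means fully concentrated on one category; the maximum for
    k categories is log(k) in the chosen base."""
    p = series.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log(p)).sum() / np.log(base))

uniform = pd.Series(["a", "b", "c", "d"])
print(categorical_entropy(uniform, base=2))  # 2.0 bits: maximal for 4 categories
```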
3. Sum of fraction squares
This is a method that has been used to measure the concentration of categorical distributions. The more diverse a distribution is, the smaller the values this measure produces, as the fraction associated with each category decreases in magnitude. Therefore, it might make sense to instead return 1 - sum_of_squares, which directly measures the spread of the variable. This measure is also quite nice because it is bounded in [0, 1].
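A sketch of the complemented version, 1 minus the sum of squared fractions (the name `gini_simpson` reflects the common name for this quantity, not an existing fairlens function):

```python
import pandas as pd

def gini_simpson(series: pd.Series) -> float:
    """1 minus the sum of squared category fractions: 0 when a single
    category holds all the mass, approaching 1 as diversity increases."""
    p = series.value_counts(normalize=True)
    return float(1 - (p ** 2).sum())

print(gini_simpson(pd.Series(["x", "x", "x"])))       # 0.0 (fully concentrated)
print(gini_simpson(pd.Series(["x", "y", "x", "y"])))  # 0.5 (two even categories)
```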
Mean
1. Multinomial Mean
By the same reasoning as for the multinomial variance, we can use the multinomial expected value formula (considering only one trial, so n = 1). Thus, we have E(X_k) = p_k, where p_k is again simply the normalized count of the kth category. This is to be used when the mode that controls how categorical series are treated in the function that generates moments for a random series is set to multinomial.
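The corresponding sketch is a one-liner over the normalized counts (the function name is hypothetical):

```python
import pandas as pd

def multinomial_mean(series: pd.Series) -> pd.Series:
    """Per-category expected value for a single multinomial trial:
    E(X_k) = p_k, the normalized count of category k."""
    return series.value_counts(normalize=True)

print(multinomial_mean(pd.Series(["a", "a", "b", "c"])))
# a: 0.5, b: 0.25, c: 0.25 — the values always sum to 1
```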
2. Mode
It is worth noting that the multinomial functions return a series of values, as opposed to a single float. Thus, it makes sense to also have a way of returning a type of "mean" that is a single value, and the simplest way to do so is to return the category that occurs most often (when categorical_mode is set to entropy or squares). However, this becomes tricky when two categories appear an equal or nearly equal number of times in the distribution, so it might be worth looking into how to handle ties (maybe with some type of numerical encoding?).
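One possible way to at least make ties deterministic, sketched below (the tie-breaking rule is my assumption, not a settled design; a real implementation might instead warn or fall back to another measure):

```python
import pandas as pd

def categorical_mode(series: pd.Series) -> object:
    """Return the most frequent category as a single scalar 'mean'.
    Ties are broken by sorting the tied labels, which is arbitrary
    but at least reproducible across runs."""
    counts = series.value_counts()
    top = counts[counts == counts.max()]
    # With several equally frequent categories, pick the smallest label.
    return sorted(top.index)[0]

print(categorical_mode(pd.Series(["a", "b", "b", "c"])))  # b (unique mode)
print(categorical_mode(pd.Series(["b", "a", "a", "b"])))  # a (tie: 'a' < 'b')
```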
Other Measures
Do you know of any other measures that might be worth looking into for handling this?
from fairlens.