oliviaguest / gini
Calculate the Gini coefficient of a numpy array.
Home Page: http://neuroplausible.com/gini
License: Creative Commons Zero v1.0 Universal
Currently, the range is shifted to make values be non-negative:
if np.amin(array) < 0:
    # Values cannot be negative:
    array -= np.amin(array)
This is dangerous. The shift should be controlled by a user-specified function argument with a default value of False, e.g. shift_negative=False.
I know it is a documented assumption that the inputs are positive. At a minimum, it's safer to raise an exception when an assumption is violated than to silently force the data to fit.
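A minimal sketch of what this request could look like, assuming the shift_negative name suggested above. It is not the repository's implementation: it reuses the formula from the existing function, leaves out the small positive offset to keep the focus on the negative-value handling, and raises ValueError by default.

```python
import numpy as np


def gini(array, shift_negative=False):
    """Gini coefficient of a numpy array (sketch with an opt-in shift).

    shift_negative is the hypothetical argument proposed in this issue.
    """
    # flatten() returns a copy, so the caller's array is never modified.
    array = np.asarray(array, dtype=np.float64).flatten()
    minimum = np.amin(array)
    if minimum < 0:
        if not shift_negative:
            raise ValueError(
                "input contains negative values; "
                "pass shift_negative=True to shift the range"
            )
        # Opt-in behaviour: shift so that the smallest value becomes 0.
        array = array - minimum
    array = np.sort(array)  # values must be sorted
    n = array.shape[0]  # number of array elements
    index = np.arange(1, n + 1)  # 1-based index per array element
    return np.sum((2 * index - n - 1) * array) / (n * np.sum(array))
```

With this default, gini(np.array([-1.0, 2.0])) raises instead of quietly changing the data, while gini(np.array([-1.0, 0.0, 1.0]), shift_negative=True) reproduces the current shifting behaviour.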
Hi :)
Do you think it could be valuable to add a Numba speed-up to the function?
Since it is clean numpy code, it should be as easy as adding one decorator.
Some code for reproducibility:
from time import time

import numpy as np
from numba import jit
import matplotlib.pyplot as plt


def gini_normal(array):
    """Calculate the Gini coefficient of a numpy array."""
    # based on bottom eq: http://www.statsdirect.com/help/content/image/stat0206_wmf.gif
    # from: http://www.statsdirect.com/help/default.htm#nonparametric_methods/gini.htm
    array = array.flatten()  # all values are treated equally, arrays must be 1d
    if np.amin(array) < 0:
        array -= np.amin(array)  # values cannot be negative
    array += 0.0000001  # values cannot be 0
    array = np.sort(array)  # values must be sorted
    index = np.arange(1, array.shape[0] + 1)  # index per array element
    n = array.shape[0]  # number of array elements
    return ((np.sum((2 * index - n - 1) * array)) / (n * np.sum(array)))


@jit(nopython=True)
def gini_numba(array):
    """Calculate the Gini coefficient of a numpy array."""
    # based on bottom eq: http://www.statsdirect.com/help/content/image/stat0206_wmf.gif
    # from: http://www.statsdirect.com/help/default.htm#nonparametric_methods/gini.htm
    array = array.flatten()  # all values are treated equally, arrays must be 1d
    if np.amin(array) < 0:
        array -= np.amin(array)  # values cannot be negative
    array += 0.0000001  # values cannot be 0
    array = np.sort(array)  # values must be sorted
    index = np.arange(1, array.shape[0] + 1)  # index per array element
    n = array.shape[0]  # number of array elements
    return ((np.sum((2 * index - n - 1) * array)) / (n * np.sum(array)))


def profiler(func):
    """Quick and dirty utility func for timing performance"""
    timing = []
    for max_iter in (1e1, 1e2, 1e3, 1e4, 1e5, 1e6):
        start = time()
        for iteration in range(int(max_iter)):
            func(np.random.random(size=(10)))
        timing.append(time() - start)
    return timing


###################################################
time_normal = profiler(gini_normal)
time_numba = profiler(gini_numba)

plt.figure(figsize=(5, 5))
plt.plot(
    [1e1, 1e2, 1e3, 1e4, 1e5, 1e6],
    time_normal,
    label='Raw Numpy'
)
plt.plot(
    [1e1, 1e2, 1e3, 1e4, 1e5, 1e6],
    time_numba,
    label='Numba + Numpy'
)
plt.ylabel('Seconds')
plt.xlabel('Number of Iterations')
plt.legend()
plt.show()
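One caveat with a benchmark like this: the first call to a Numba-jitted function includes compilation, and the timed loop also measures the generation of the random input. A hedged sketch of a timing helper that separates those costs is shown here; time_calls and its warmup parameter are hypothetical names, not part of the repository or of Numba.

```python
from time import perf_counter

import numpy as np


def time_calls(func, n_calls, size=10, warmup=1):
    """Time n_calls invocations of func on a fixed random input.

    The warm-up calls run before the clock starts, so one-off costs
    such as JIT compilation are excluded from the measurement.
    """
    data = np.random.random(size=size)  # generated once, outside the timing
    for _ in range(warmup):
        func(data)
    start = perf_counter()
    for _ in range(n_calls):
        func(data)
    return perf_counter() - start
```

It could be used as time_calls(gini_normal, 10000) and time_calls(gini_numba, 10000), assuming the two functions defined above.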
If I wanted to calculate the Gini index, I would have to refactor the code in gini.py, correct?
Hi Olivia,
Regarding this line: array += 0.0000001
Is there a particular reason for using 0.0000001 rather than some smaller positive number, such as np.nextafter(np.float64(0), np.float64(1))?
Also, why not first check whether a zero exists in the array before adding 0.0000001?
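The two suggestions in this question could be combined roughly as follows. This is only a sketch: offset_zeros and its eps parameter are hypothetical names, and whether any offset is needed for the formula at all is not settled here.

```python
import numpy as np


def offset_zeros(array, eps=None):
    """Return a flat float64 copy, adding eps only if a zero is present.

    eps defaults to the smallest positive float64, as proposed in
    this issue, instead of the hard-coded 0.0000001.
    """
    array = np.asarray(array, dtype=np.float64).flatten()
    if eps is None:
        eps = np.nextafter(np.float64(0), np.float64(1))
    if np.any(array == 0):
        # Not in place, so the result never aliases the caller's data.
        array = array + eps
    return array
```

For values of ordinary magnitude, adding the smallest positive float64 leaves every non-zero element unchanged, so in practice only the exact zeros move.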
Hi, thank you for putting this together! In the readme, you have an example that creates random integers and describes the resulting Gini of 0.33 as "low". As someone who has built many ML models with a Gini under 0.33, this seemed odd to me. The other Gini function that you reference as being similar (https://github.com/pysal/pysal/pull/862/files) instead returns around -0.33 rather than +0.33 when calculating the Gini of random numbers. Thus, I believe there is an issue with negative Gini values in your function.
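One way to examine the sign question is to compare the sorted-index formula quoted in the Numba issue with the pairwise definition of the Gini coefficient (mean absolute difference divided by twice the mean). The sketch below does that; gini_sorted and gini_pairwise are hypothetical names, and no comparison against the linked pysal code is made.

```python
import numpy as np


def gini_sorted(values):
    """Sorted-index formula, as in the function quoted in this thread."""
    x = np.sort(np.asarray(values, dtype=np.float64).flatten())
    n = x.shape[0]
    index = np.arange(1, n + 1)
    return np.sum((2 * index - n - 1) * x) / (n * np.sum(x))


def gini_pairwise(values):
    """Sum of all absolute pairwise differences over 2 * n^2 * mean."""
    x = np.asarray(values, dtype=np.float64).flatten()
    n = x.shape[0]
    diffs = np.abs(x[:, None] - x[None, :])  # n-by-n matrix of |x_i - x_j|
    return diffs.sum() / (2 * n * n * x.mean())


rng = np.random.default_rng(0)
sample = rng.random(1000)  # uniform values in [0, 1)
print(gini_sorted(sample), gini_pairwise(sample))
```

For non-negative input with a positive mean, the pairwise form is a sum of absolute values divided by a positive number, so it cannot be negative, and the two functions should agree up to floating-point error. A negative value from another implementation would then suggest a different ordering or sign convention, which this sketch does not check.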
I would like to calculate the Gini index with categorical variables.
I have data with zones visited by people for example:
and I want to know how dispersed (or not) each person is, based on the zones they visit, as a parameter from 0 to 1. So I want to obtain that person 3 is less dispersed than person 1. I think a value of 0 for this parameter should represent the least dispersion. I believe the Gini index captures that, but my variables (zones) are categorical.
Do you know how I can resolve this?
Would it be possible to adapt this to support a second weights array?
This page includes an R function that adds weights support: http://ellisp.github.io/blog/2017/08/05/weighted-gini
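A rough Python sketch of what weights support might look like, assuming the common Lorenz-curve form of the weighted Gini coefficient. weighted_gini is a hypothetical name; this is not a verified port of the R function on the linked page and is not part of the repository.

```python
import numpy as np


def weighted_gini(values, weights):
    """Weighted Gini coefficient via cumulative (Lorenz-curve) shares."""
    x = np.asarray(values, dtype=np.float64).flatten()
    w = np.asarray(weights, dtype=np.float64).flatten()
    if x.shape != w.shape:
        raise ValueError("values and weights must have the same length")
    if np.any(x < 0) or np.any(w < 0):
        raise ValueError("values and weights must be non-negative")
    order = np.argsort(x)  # sort values, carrying the weights along
    x = x[order]
    w = w[order] / np.sum(w)  # normalise weights to sum to 1
    p = np.cumsum(w)  # cumulative population share
    nu = np.cumsum(w * x)  # cumulative weighted total
    nu = nu / nu[-1]  # cumulative share of the total
    return np.sum(nu[1:] * p[:-1]) - np.sum(nu[:-1] * p[1:])
```

With equal weights this reduces to the unweighted formula used in the function above, and integer weights behave like repeating each value that many times: weighted_gini([1, 3], [2, 1]) matches the unweighted result for [1, 1, 3].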