GithubHelp home page GithubHelp logo

jenks-caspall.py's Introduction

Python Jenks-Caspall Natural breaks

This is an attempt to implement Jenks-Caspall reiterative forcing algorithm.

This script correctly works as it but it's still in developement. Future improvement will concerns classes design and clusters initialization method.


The well know Natural Break classification can be computed through 2 algorithms:

  • The Jenks-Caspall algorithm developed in 1971 is an empirical approach based on minimizing the classification errors by moving observations between adjacent classes.
  • The Fisher-Jenks algorithm, introduced to cartographers by Jenks in 1977, uses in contrast a mathematical foundation, developing by Fisher in 1958, that guarantees an optimal solution. It usually refers as "Jenks’s optimal method".

There are strong Python implementations of Jenks-Fisher available, but they can't process lot of values in a reasonable time. The algorithm makes this somewhat inevitable : the number of possible combination increases exponentially with the number of values, and find the optimal partition will be increasingly slow. Usually softwares which uses Jenks-Fisher sample the data befohand to maintain acceptable running time. Using an optimal method with only a sample of data is a nonsense unless the sampling was taken very carefully.

Jenks-Caspall algorithm was presented in the paper : "Error on choroplethic maps: definition, measurement, reduction" (Jenks GF, Caspall FC, 1971)

This algorithm is subdivised in two steps:

  • Step 1 is similar to the well know K-means algorithm : starting with a giving partition, each value is affected to its closest class mean. Then, means are updated and the process reiterates until there is no more possible move.
  • Step 2 try to optimize the partition by forcing values to their adjacent classes regardless their distances to the classes means.

To find the best partition, Jenks introduced three errors criteria :

  • Overview Accuracy Index (OAI) based on area of entities
  • Tabular Accuracy Index (TAI) based on sum of absolute deviations about class means
  • Boundary Accuracy Index (BAI) based on spatial neighborhood relationships of entities

In it's original paper, Jenks himself have focused on tabular error, primarily because of its simplicity.

Nowadays, another index named Goodness of Variance Fit (GVF) is more widelly used.

  • GVF is based on sum of square deviation between values and mean.
  • TAI is based on sum of absolute deviation between values and mean. The best GVF will not necessarily be the best TAI and vice versa.

Notes about K-means

K-means is generally used to classify multi dimensional data like 2D points. Traditionals K-means implementations can be widely optimized to process 1 dimensional values which is completely ordered.

There have been many variations and studies on K-means algorithm over the past 50 years. One of the most extensively discussed problem concerns the initialization method because K-means performance is extremely sensitive to cluster center initialization. Bad initialization can lead to poor convergence speed and bad overall clustering.

One the most recent proposal to this problem is the K-means++ algorithm (Arthur and Vassilvitskii 2007). It help to initialize partition with centroids that are furthest away as possible from each other. The first cluster center is chosen uniformly at random and each subsequent center is chosen with a probability proportional to its squared distance from the closest existing center.

Note that the random based initializations will give different results for multiple runs. It is a common practise to run k-means multiple times, and choose the better resulting partition.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.