
thompsonml / pycaim


This project forked from morgan243/pycaim


Python implementation of the CAIM (class-attribute interdependence maximization) algorithm. Requires Pandas and Numpy.


pycaim's Introduction

CAIM is a supervised discretization method [1], and Python-CAIM is a Python implementation of it. This is a work in progress, so results should be closely inspected. The goal is to provide both a CLI to discretize data for later use and a class for programmatic usage. Pull requests are welcome.
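For readers unfamiliar with [1], the quantity that CAIM maximizes for a single feature can be sketched as follows. This is a minimal pure-Python illustration of the criterion as defined in the paper, not the code in this repository; the function name `caim_criterion` and its arguments are hypothetical. For a set of interval boundaries, each interval contributes (largest class count)^2 / (total count), and the sum is divided by the number of intervals.

```python
from collections import Counter


def caim_criterion(values, labels, boundaries):
    """CAIM value of one feature for a sorted list of interval boundaries.

    Hypothetical helper for illustration only. Intervals follow the
    convention [b0, b1], (b1, b2], ..., (b_{n-1}, b_n].
    """
    n_intervals = len(boundaries) - 1
    # One class-count table per interval.
    counts = [Counter() for _ in range(n_intervals)]
    for value, label in zip(values, labels):
        for r in range(n_intervals):
            left, right = boundaries[r], boundaries[r + 1]
            # The first interval is closed on the left; the rest are open.
            if left < value <= right or (r == 0 and value == left):
                counts[r][label] += 1
                break
    total = 0.0
    for table in counts:
        in_interval = sum(table.values())
        if in_interval:
            # Dominant class count squared over all samples in the interval.
            total += max(table.values()) ** 2 / in_interval
    return total / n_intervals


values = [1, 2, 3, 4]
labels = ["a", "a", "b", "b"]
one_interval = caim_criterion(values, labels, [1, 4])
two_intervals = caim_criterion(values, labels, [1, 2, 4])
print(one_interval, two_intervals)
```

In this toy case the split at 2 separates the two classes perfectly, so the two-interval scheme scores higher than the single interval. The full algorithm in [1] greedily adds the boundary that increases this value the most.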

There is a MATLAB implementation by Guangdi Li and a Java implementation (Research->Data Mining Tool) by the author. The latter implements the currently unpublished CAIM+ version of the algorithm.

Python-CAIM currently works on UCI's Musk1 dataset as well as on other toy datasets. Results are validated against the Java implementation (see above).

On performance, the Java implementation runs notably faster. This may be due to Java being fundamentally faster than Python, to design tricks or shortcuts, or to a combination of both. The source of the difference is currently difficult to determine, since the source code does not appear to be included in the CAIM JAR file. The MATLAB version is comparable to Python-CAIM and often faster for very small datasets. However, Python-CAIM can parallelize discretization and can thus scale better for datasets with many features.
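The reason per-feature parallelism is possible is that each feature is discretized independently of the others. The sketch below illustrates that idea only; it does not show how Python-CAIM actually distributes the work. `discretize_feature` is a hypothetical stand-in (a median split) for the real CAIM search, and a thread pool is used purely to keep the example self-contained. CPU-bound work would more likely use processes.

```python
from concurrent.futures import ThreadPoolExecutor


def discretize_feature(values):
    """Stand-in for a per-feature discretizer (NOT the CAIM search).

    Returns boundaries [min, median-ish, max] for one column.
    """
    ordered = sorted(values)
    return [ordered[0], ordered[len(ordered) // 2], ordered[-1]]


def discretize_all(columns, max_workers=4):
    """Discretize every column independently and collect boundaries by name.

    `columns` maps a feature name to its list of values (hypothetical layout).
    """
    names = list(columns)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each feature is an independent task, so the order of execution
        # does not matter; map() still returns results in input order.
        results = pool.map(discretize_feature, (columns[n] for n in names))
        return dict(zip(names, results))


boundaries = discretize_all({"a": [3, 1, 2], "b": [10, 30, 20, 40]})
print(boundaries)
```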

CLI Options

usage: caim.py [-h] [-t TARGET_FIELD] [-o OUTPUT_PATH] [-H] [-q] input_file

CAIM Algorithm Command Line Tool and Library

positional arguments:
  input_file            CSV input data file

optional arguments:
  -h, --help            show this help message and exit
  -t TARGET_FIELD, --target-field TARGET_FIELD
                        Target field as an integer (0-indexed) or string
                        corresponding to column name. Negative indices (e.g.
                        -1) are allowed.
  -o OUTPUT_PATH, --output-path OUTPUT_PATH
                        File path to write discretized form of data in CSV
                        format
  -H, --header          Use first row as column/field names
  -q, --quiet           Minimal information is printed to STDOUT
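As an illustration of how the options above fit together, here is a minimal `argparse` sketch that reproduces the same flags. It is reconstructed from the help text, not taken from caim.py; the defaults and the helper `resolve_target_field` are assumptions added to show how a target given as an integer (possibly negative) or as a column name might be resolved.

```python
import argparse


def build_parser():
    """Rebuild the documented CLI; defaults are assumptions, not the real ones."""
    parser = argparse.ArgumentParser(
        prog="caim.py",
        description="CAIM Algorithm Command Line Tool and Library",
    )
    parser.add_argument("input_file", help="CSV input data file")
    parser.add_argument("-t", "--target-field", default=None,
                        help="Target field as an integer (0-indexed) or "
                             "string corresponding to column name")
    parser.add_argument("-o", "--output-path", default=None,
                        help="File path to write discretized data as CSV")
    parser.add_argument("-H", "--header", action="store_true",
                        help="Use first row as column/field names")
    parser.add_argument("-q", "--quiet", action="store_true",
                        help="Minimal information is printed to STDOUT")
    return parser


def resolve_target_field(raw, columns):
    """Hypothetical helper: turn '-1', '2' or a column name into an index."""
    try:
        index = int(raw)
    except ValueError:
        # Not an integer, so treat it as a column name.
        return columns.index(raw)
    # Indexing a range normalizes negative indices and rejects bad ones.
    return range(len(columns))[index]


args = build_parser().parse_args(["datasets/iris.data", "-t", "-1", "-H"])
print(args.input_file, args.target_field, args.header, args.quiet)
print(resolve_target_field(args.target_field, ["a", "b", "c"]))
```

With the arguments from the first usage example, `-t -1` resolves to the last column.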

Example Usages

Discretize the Iris data

python3 ./caim.py datasets/iris.data -t -1 -H

Discretize the Iris data and save the discrete results to iris_caim.csv

python3 ./caim.py datasets/iris.data -t -1 -H -o iris_caim.csv

Discretize musk1

python3 ./caim.py datasets/musk_clean1.csv -t -1

Interval Output

Intervals are printed in the form:

[ 0.13  0.34  0.39  0.66]

This should be interpreted as:

[0.13, 0.34], (0.34, 0.39], (0.39, 0.66]

The output dataset uses the right endpoint of each interval as the discretized value.
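The mapping from a printed boundary list to intervals and discretized values can be sketched as below. This is an illustration of the convention described above, not code from this repository; `format_intervals` and `discretize_value` are hypothetical names, and raising an error for out-of-range values is an assumption.

```python
from bisect import bisect_left


def format_intervals(bounds):
    """Render boundaries as [b0, b1], (b1, b2], ... (first interval closed)."""
    parts = []
    for i in range(len(bounds) - 1):
        opener = "[" if i == 0 else "("
        parts.append(f"{opener}{bounds[i]}, {bounds[i + 1]}]")
    return ", ".join(parts)


def discretize_value(value, bounds):
    """Return the right endpoint of the interval that contains `value`."""
    if not bounds[0] <= value <= bounds[-1]:
        # Assumption: values outside the fitted range are rejected.
        raise ValueError(f"{value} is outside [{bounds[0]}, {bounds[-1]}]")
    # bisect_left finds the first boundary >= value, which is the right
    # endpoint; the minimum itself belongs to the first interval.
    index = max(bisect_left(bounds, value), 1)
    return bounds[index]


bounds = [0.13, 0.34, 0.39, 0.66]
print(format_intervals(bounds))
print([discretize_value(v, bounds) for v in (0.13, 0.34, 0.35, 0.5)])
```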

TODO

  • Fix unit tests
  • Continue to re-implement in Pandas/NumPy for speed (avoid loops)
  • Add more test data and corresponding unit tests
  • Clean up the API and document it

[1] Kurgan, L. and Cios, K.J., 2004. CAIM Discretization Algorithm. IEEE Transactions on Knowledge and Data Engineering, 16(2):145-153
